One big challenge while learning, well… anything really, lies in demystifying jargon. And in a field like Machine Learning, which has roots in math, statistical theory, probability theory (Bayesian and frequentist), and computer science, this becomes even harder. Each sub-field comes with its own jargon, even when talking about the exact same thing!
Here’s a starting point to unify some notation and definitions when dealing with Probabilistic Classifiers.
Notation
- \(\mathbf{x} \in \mathbb{R}^d\) : feature inputs or covariates (usually high-dimensional), described by the random variable \(X\)
- \(y \in \{1,2,...,C\}\) : classes (labels), described by the random variable \(Y\)
- Each instance of data is drawn from a joint probability distribution \(p^*(X,Y)\)
- \(p^*(\mathbf{x},y)\) : true distribution of \(\mathbf{x}\) and \(y\). Also denoted by \(p^*(X,Y)\) or \(p^*(X= \mathbf{x}, Y = y)\)
- \(\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n = 1}^{N}\) : dataset \(\mathcal{D}\) with \(N\) i.i.d. (independent and identically distributed) samples from \(p^*\) (a toy sampling sketch follows Figure 1 below)
Figure 1: Common probabilistic classification notation
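As a quick illustration of the notation above, here is a minimal NumPy sketch that draws a small dataset \(\mathcal{D}\) of i.i.d. samples from a made-up \(p^*(X, Y)\). The 2-class Gaussian-mixture form, its prior, and its class means are assumptions chosen purely for illustration, not anything implied by the notation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" joint distribution p*(x, y): a 2-class Gaussian mixture
# whose prior p*(y) and class-conditional means are invented for illustration.
C, d, N = 2, 3, 5                      # number of classes, feature dim, sample count
prior = np.array([0.4, 0.6])           # p*(y)
means = np.array([[0.0, 0.0, 0.0],     # mean of p*(x | y = 0)
                  [2.0, 2.0, 2.0]])    # mean of p*(x | y = 1)

# Draw N i.i.d. samples (x_n, y_n) ~ p*(X, Y): first y ~ p*(y), then x ~ p*(x | y)
y = rng.choice(C, size=N, p=prior)
X = means[y] + rng.normal(size=(N, d))

D = list(zip(X, y))                    # the dataset D = {(x_n, y_n)}_{n=1}^{N}
print(D[0])                            # one (feature vector, label) pair
```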
Setting
- Classification: the task of assigning a class to a given instance of data defined by a set of features
- Probabilistic Classification: the stricter task of assigning probabilities that the given instance of data (features) belongs to each possible class. The probabilities indicate the “confidence” in that class being correct for the given instance
  - Note: we refer to probabilistic classification as classification below unless explicitly specified
- \(C\)-class Classification problem setting:
  - the true distribution is assumed to be a discrete distribution over \(C\) classes
  - the observed \(y\) is a sample from the conditional distribution \(p^*(y \vert \mathbf{x})\), also written \(p^*(Y \vert X=\mathbf{x})\)
  - neural networks (discriminative classifiers) try to estimate \(p_\theta(y \vert \mathbf{x})\) by fitting \(\theta\) using the training dataset \(\mathcal{D}\) (a minimal fitting sketch follows Figure 2 below)
  - during deployment, the NN is evaluated on a dataset \(\mathcal{T}\), sampled from a distribution \(q(\mathbf{x},y)\) or \(q(X,Y)\)
Figure 2: A neural network with a softmax classifier
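To make “fitting \(\theta\) using \(\mathcal{D}\)” concrete, here is a minimal sketch of a discriminative classifier: a single linear layer followed by softmax, trained with gradient descent on the negative log-likelihood. The toy data, learning rate, and iteration count are assumptions for illustration; a real NN stacks more layers, but the quantity being estimated is still \(p_\theta(y \vert \mathbf{x})\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set D (2 classes, 3 features), standing in for real data
N, d, C = 200, 3, 2
y = rng.choice(C, size=N, p=[0.4, 0.6])
X = np.where(y[:, None] == 1, 2.0, 0.0) + rng.normal(size=(N, d))

# theta = (W, b): a single linear layer whose outputs are the logits z(x)
W, b = np.zeros((d, C)), np.zeros(C)

def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

# Fit theta by gradient descent on the negative log-likelihood (cross-entropy)
lr = 0.1
Y_onehot = np.eye(C)[y]
for _ in range(500):
    P = softmax(X @ W + b)                  # p_theta(y | x_n) for all n, shape (N, C)
    grad_logits = (P - Y_onehot) / N        # gradient of the mean NLL w.r.t. the logits
    W -= lr * X.T @ grad_logits
    b -= lr * grad_logits.sum(axis=0)

print(softmax(X[:3] @ W + b))               # estimated p_theta(y | x) for three inputs
```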
Definitions
- Logits: the activations from the last layer of the NN are termed “logits”, \(z(\mathbf{x}) \in \mathbb{R}^C\)
  - They are fed into a function that converts the \(C\) logits into \(C\) probabilities, \(p_i\)
  - Let’s assume softmax (Fig 2) here.
  - For other such functions, check out my post on sigmoid vs softmax
- Confidence: the maximum \(p_i\) is the “confidence” associated with the prediction
- Prediction: the class corresponding to the maximum \(p_i\) is the prediction (a minimal sketch of computing both follows this list)
- Components of the joint probability distribution \(p(X, Y)\):
  - Evidence: \(p(X)\)
  - Likelihood: \(p(X \vert Y)\)
  - Prior: \(p(Y)\)
  - Posterior: \(p(Y \vert X)\)
- Bayes’ optimal classifier: a classifier that predicts the true posterior probability distribution, \(p(Y \vert X=\mathbf{x})\), for every input instance \(\mathbf{x}\) is a Bayes’ optimal classifier (a toy numeric example of these components follows below)
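Here is a minimal sketch of the logits-to-confidence pipeline described above, assuming softmax as the converting function; the logit values are made up for a hypothetical 3-class problem.

```python
import numpy as np

def softmax(z):
    """Convert the C logits z(x) into C probabilities p_i."""
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])    # hypothetical z(x) for C = 3 classes
probs = softmax(logits)                # the p_i, which sum to 1

prediction = int(np.argmax(probs))     # class with the largest p_i
confidence = float(probs[prediction])  # the largest p_i itself

print(probs, prediction, confidence)
```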
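And a toy numeric example of the components of \(p(X, Y)\) and the Bayes’ optimal classifier, assuming a fully specified discrete joint distribution over two feature values and two classes (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical discrete joint distribution p(x, y): rows index x, columns index y
joint = np.array([[0.30, 0.10],        # p(x=0, y=0), p(x=0, y=1)
                  [0.15, 0.45]])       # p(x=1, y=0), p(x=1, y=1)

evidence = joint.sum(axis=1)           # p(x):      marginalize out y
prior = joint.sum(axis=0)              # p(y):      marginalize out x
likelihood = joint / prior             # p(x | y):  each column sums to 1
posterior = joint / evidence[:, None]  # p(y | x):  each row sums to 1

# The Bayes' optimal classifier outputs the true posterior p(y | x) for each x;
# its hard prediction is the argmax of that row, and its confidence is the max.
bayes_prediction = posterior.argmax(axis=1)
print(posterior)
print(bayes_prediction)
```

Since each posterior row sums to 1, the Bayes’ optimal classifier’s “confidence” for a given \(\mathbf{x}\) is simply the largest entry of the corresponding row.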