One big challenge while learning, well… anything really, lies in demystifying jargon. And in a field like Machine Learning, which has roots in math, statistical theory, probability theory (Bayesian and frequentist), and computer science, this becomes even harder. Each sub-field comes with its own jargon, even when talking about the exact same thing!
Here’s a starting point to unify some notation and definitions when dealing with Probabilistic Classifiers.
Notation
- \(\mathbf{x} \in \mathbb{R}^d\) : feature inputs or covariates (usually high-dimensional), described by the random variable \(X\)
- \(y \in \{1,2,...,C\}\) : classes (labels), described by the random variable \(Y\)
- Each instance of data is drawn from a joint probability distribution \(p^*(X,Y)\)
- \(p^*(\mathbf{x},y)\) : true distribution of \(\mathbf{x}\) and \(y\). Also denoted by \(p^*(X,Y)\) or \(p^*(X= \mathbf{x}, Y = y)\)
- \(\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n = 1}^{N}\) : dataset \(\mathcal{D}\) with \(N\) i.i.d. (independent and identically distributed) samples from \(p^*\) (a toy sampling sketch follows Figure 1 below)
Figure 1: Common probabilistic classification notation
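As a quick illustration of the notation above, here is a minimal NumPy sketch that draws a small dataset \(\mathcal{D}\) of i.i.d. samples from a made-up \(p^*(X, Y)\). The 2-class Gaussian-mixture form, its prior, and its class means are assumptions chosen purely for illustration, not anything implied by the notation itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" joint distribution p*(x, y): a 2-class Gaussian mixture
# whose prior p*(y) and class-conditional means are invented for illustration.
C, d, N = 2, 3, 5                      # number of classes, feature dim, sample count
prior = np.array([0.4, 0.6])           # p*(y)
means = np.array([[0.0, 0.0, 0.0],     # mean of p*(x | y = 0)
                  [2.0, 2.0, 2.0]])    # mean of p*(x | y = 1)

# Draw N i.i.d. samples (x_n, y_n) ~ p*(X, Y): first y ~ p*(y), then x ~ p*(x | y)
y = rng.choice(C, size=N, p=prior)
X = means[y] + rng.normal(size=(N, d))

D = list(zip(X, y))                    # the dataset D = {(x_n, y_n)}_{n=1}^{N}
print(D[0])                            # one (feature vector, label) pair
```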
Setting
- Classification: the task of assigning a class to a given instance of data defined by a set of features
- Probabilistic Classification: the stricter task of assigning probabilities that the given instance of data (features) belongs to each possible class. The probabilities indicate the “confidence” in that class being correct for the given instance
  - Note: we refer to probabilistic classification as classification below unless explicitly specified
- \(C\)-class Classification problem setting:
  - the true distribution is assumed to be a discrete distribution over \(C\) classes
  - the observed \(y\) is a sample from the conditional distribution \(p^*(y \vert \mathbf{x})\), also written \(p^*(Y \vert X=\mathbf{x})\)
  - neural networks (discriminative classifiers) try to estimate \(p_\theta(y \vert \mathbf{x})\) by fitting \(\theta\) using the training dataset \(\mathcal{D}\) (a minimal fitting sketch follows Figure 2 below)
  - during deployment, the NN is evaluated on a dataset \(\mathcal{T}\), sampled from a distribution \(q(\mathbf{x},y)\) or \(q(X,Y)\)
Figure 2: A neural network with a softmax classifier
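To make “fitting \(\theta\) using \(\mathcal{D}\)” concrete, here is a minimal sketch of a discriminative classifier: a single linear layer followed by softmax, trained with gradient descent on the negative log-likelihood. The toy data, learning rate, and iteration count are assumptions for illustration; a real NN stacks more layers, but the quantity being estimated is still \(p_\theta(y \vert \mathbf{x})\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set D (2 classes, 3 features), standing in for real data
N, d, C = 200, 3, 2
y = rng.choice(C, size=N, p=[0.4, 0.6])
X = np.where(y[:, None] == 1, 2.0, 0.0) + rng.normal(size=(N, d))

# theta = (W, b): a single linear layer whose outputs are the logits z(x)
W, b = np.zeros((d, C)), np.zeros(C)

def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

# Fit theta by gradient descent on the negative log-likelihood (cross-entropy)
lr = 0.1
Y_onehot = np.eye(C)[y]
for _ in range(500):
    P = softmax(X @ W + b)                  # p_theta(y | x_n) for all n, shape (N, C)
    grad_logits = (P - Y_onehot) / N        # gradient of the mean NLL w.r.t. the logits
    W -= lr * X.T @ grad_logits
    b -= lr * grad_logits.sum(axis=0)

print(softmax(X[:3] @ W + b))               # estimated p_theta(y | x) for three inputs
```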
Definitions
- Logits: the activations from the last layer of the NN are termed “logits”, \(z(\mathbf{x}) \in \mathbb{R}^C\)
  - They are fed into a function that converts the \(C\) logits into \(C\) probabilities, \(p_i\)
  - Let’s assume softmax (Fig 2) here.
  - For other such functions, check out my post on sigmoid vs softmax
- Confidence: the maximum \(p_i\) is the “confidence” associated with the prediction
- Prediction: the class corresponding to the maximum \(p_i\) is the prediction (a minimal sketch of computing both follows this list)
- Components of the joint probability distribution \(p(X, Y)\):
  - Evidence: \(p(X)\)
  - Likelihood: \(p(X \vert Y)\)
  - Prior: \(p(Y)\)
  - Posterior: \(p(Y \vert X)\)
- Bayes’ optimal classifier: a classifier that predicts the true posterior probability distribution, \(p(Y \vert X=\mathbf{x})\), for every input instance \(\mathbf{x}\) is a Bayes’ optimal classifier (a toy numeric example of these components follows below)
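Here is a minimal sketch of the logits-to-confidence pipeline described above, assuming softmax as the converting function; the logit values are made up for a hypothetical 3-class problem.

```python
import numpy as np

def softmax(z):
    """Convert the C logits z(x) into C probabilities p_i."""
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])    # hypothetical z(x) for C = 3 classes
probs = softmax(logits)                # the p_i, which sum to 1

prediction = int(np.argmax(probs))     # class with the largest p_i
confidence = float(probs[prediction])  # the largest p_i itself

print(probs, prediction, confidence)
```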
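And a toy numeric example of the components of \(p(X, Y)\) and the Bayes’ optimal classifier, assuming a fully specified discrete joint distribution over two feature values and two classes (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical discrete joint distribution p(x, y): rows index x, columns index y
joint = np.array([[0.30, 0.10],        # p(x=0, y=0), p(x=0, y=1)
                  [0.15, 0.45]])       # p(x=1, y=0), p(x=1, y=1)

evidence = joint.sum(axis=1)           # p(x):      marginalize out y
prior = joint.sum(axis=0)              # p(y):      marginalize out x
likelihood = joint / prior             # p(x | y):  each column sums to 1
posterior = joint / evidence[:, None]  # p(y | x):  each row sums to 1

# The Bayes' optimal classifier outputs the true posterior p(y | x) for each x;
# its hard prediction is the argmax of that row, and its confidence is the max.
bayes_prediction = posterior.argmax(axis=1)
print(posterior)
print(bayes_prediction)
```

Since each posterior row sums to 1, the Bayes’ optimal classifier’s “confidence” for a given \(\mathbf{x}\) is simply the largest entry of the corresponding row.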