# Machine Learning Intro

Machine learning = machines that *learn* to perform a task from *experience*

Three forms of learning, based on label availability:

- Yes → Supervised learning
- Some → Semi-supervised learning
- No → Unsupervised learning

# Supervised Learning

Training data has labels: $\mathcal{D} = \{(x_1, t_1), \dots, (x_N, t_N)\}$

Goal: learn a *predictive* function that yields good performance on *unseen* data

Data may need to be preprocessed to handle:

- Missing/wrong values
- Outliers
- Inconsistencies

# Features

Feature extraction = process that creates descriptive vectors from samples

- Features should be invariant to irrelevant input variations
- Selecting the *right* features is crucial!
- Features usually encode some domain knowledge
- Higher-dimensional features are more discriminative

Curse of dimensionality: complexity increases *exponentially* with the number of dimensions

# Terms, Concepts, Notation

Mostly based on statistics and probability theory

Notation:

- Scalars $x \in \mathbb{R}$
- Vectors $\mathbf{x} \in \mathbb{R}^D$
- Datasets $\mathcal{X} = \{x_1, \dots, x_N\}$
- Labelled datasets $\mathcal{D} = \{(x_1, t_1), \dots, (x_N, t_N)\}$
- Matrices $\mathbf{M} \in \mathbb{R}^{m \times n}$
- Dot product $\mathbf{w}^\mathsf{T}\mathbf{x} = \sum_{j=1}^D w_j x_j$

# Probability Basics

Probabilities are defined over random variables:

- Discrete case: $p(X = x_j) = \frac{n_j}{N}$
- Continuous case: $p(X \in (x_1, x_2)) = \int_{x_1}^{x_2} p(x)\, dx$, where $p(x)$ is the probability density function (pdf) of $x$

Some formulas: let $A \in \{a_i\}$, $B \in \{b_j\}$, and consider $N$ trials with counts

- $n_{ij} = \#\{A = a_i \land B = b_j\}$
- $c_i = \#\{A = a_i\}$
- $r_j = \#\{B = b_j\}$

Then we get:

- Joint probability $p(A=a_i, B=b_j) = \frac{n_{ij}}{N}$
- Marginal probability $p(A=a_i) = \frac{c_i}{N}$
- Conditional probability $p(B=b_j \mid A=a_i) = \frac{n_{ij}}{c_i}$
- Sum rule $p(A=a_i) = \frac{1}{N}\sum_j n_{ij} = \sum_{b_j} p(A=a_i, B=b_j)$
- Product rule $p(A=a_i, B=b_j) = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(B=b_j \mid A=a_i) \cdot p(A=a_i)$

In short:
- Sum rule: $p(A) = \sum_B p(A, B)$
- Product rule: $p(A, B) = p(B \mid A)\, p(A)$
- Bayes' theorem: $p(A \mid B) = \frac{p(B \mid A)\, p(A)}{\sum_A p(B \mid A)\, p(A)}$
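The count-based definitions above can be checked directly in code. This is a minimal sketch: the 2×3 table of counts `n[i][j]` is made up purely for illustration, and the helper names (`joint`, `marg_A`, `cond_B_given_A`, `bayes`) are my own, not standard API.

```python
# Sketch: verifying the sum rule, product rule and Bayes' theorem
# on a tiny made-up contingency table of counts n_ij.

n = [[10, 20, 30],   # n[i][j] = #{A = a_i and B = b_j}  (illustrative counts)
     [15,  5, 20]]
N = sum(sum(row) for row in n)          # total number of trials
c = [sum(row) for row in n]             # c_i = #{A = a_i}
r = [sum(col) for col in zip(*n)]       # r_j = #{B = b_j}

def joint(i, j):            # p(A=a_i, B=b_j) = n_ij / N
    return n[i][j] / N

def marg_A(i):              # p(A=a_i) = c_i / N
    return c[i] / N

def cond_B_given_A(j, i):   # p(B=b_j | A=a_i) = n_ij / c_i
    return n[i][j] / c[i]

def bayes(i, j):            # p(A=a_i | B=b_j) via Bayes' theorem
    num = cond_B_given_A(j, i) * marg_A(i)
    den = sum(cond_B_given_A(j, k) * marg_A(k) for k in range(len(n)))
    return num / den

# Sum rule: p(A=a_i) = sum_j p(A=a_i, B=b_j)
for i in range(2):
    assert abs(marg_A(i) - sum(joint(i, j) for j in range(3))) < 1e-12

# Product rule: p(A, B) = p(B | A) p(A)
for i in range(2):
    for j in range(3):
        assert abs(joint(i, j) - cond_B_given_A(j, i) * marg_A(i)) < 1e-12

# Bayes' theorem agrees with counting directly: p(A=a_i | B=b_j) = n_ij / r_j
for i in range(2):
    for j in range(3):
        assert abs(bayes(i, j) - n[i][j] / r[j]) < 1e-12

print("sum rule, product rule and Bayes' theorem all verified")
```

Note that Bayes' theorem here is just the product rule applied twice: the denominator is the sum rule marginal $p(B=b_j)$, which is why the code's `bayes` matches the direct count ratio $n_{ij}/r_j$.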