# Notes on Machine Learning 8: Naive Bayes

by **장승환**

#### (ML 8.1) Naive Bayes classification

Naive Bayes is a family of models that is not necessarily a “Bayesian” method!

Setup: Given data $D = ((x^{(1)}, y_1), \ldots, (x^{(n)}, y_n))$

with $x^{(i)} = (x_1^{(1)}, \ldots, x_1^{(d)}) \in \mathbb{R}^d$ and $y_i \in \mathcal{Y} = \{1, \ldots, m\}$.

Assume a family of joint distributions $p_\theta$ such that

Let $(X^{(1)}, Y_1) , \ldots, (X^{(n)}, Y_n) \sim p_\theta$, iid, for some $\theta$.

(If $(X, Y) \sim p_\theta$, then $X_1, \ldots, X_d$ are (conditionally) independent given $Y$.)

Goal: For new $x \in \mathbb{R}^d$, pridict it’s $y$.

Algorithm:

- Estimate $\hat{\theta}$ from $D$.
- Then compute

#### (ML 8.2) More about Naive Bayes

Assume that $(X, Y) \sim p_\theta$ with $Y \in \mathscr{Y} \in \{1, \ldots, m\}.$

**How to choose $p_\theta$?**

- $p_\theta(y) = p_\theta(Y = y) := \pi$ with $\pi = (\pi_1, \ldots, \pi_m)$.
- $p_\theta(x_i\vert y) = p_\theta(X_i = x \vert Y = y)$

$\theta =$ (all parameters of the distributions)

If $X_i \in \{1, \ldots, N\}$ then e.g. $p_\theta(x_i\vert y) := q(x_i, y)$.

If $X_i \in \{1, 2, \ldots\}$ then e.g. Poisson or Geometric or whatever.

If $X_i \in \mathbb{R}$ then e.g. Gaussian, Gamma or whatever.

**How to estimate $\theta$?**

MLE, or MAP (assuming priro on $\theta$)

Another option: “Bayesian” naive Nayes

**Why make conditional independence assumption?**

Can estimate $\theta$ more accurately with less data.

“Wrong but simple model can be better than correct but complicated model!”

#### (ML 8.3) (ML 8.4) Bayesian Naive Bayes

Assume that $(X, y) \sim p(x,y\vert \theta)$.

$D = ((x^{(1)}, y_1), \ldots, (x^{(n)},y))$ where

$x^{(i)}= (x_1^{(i)}, \ldots, x_d^{(i)})$ and *the $j$-th feature* $x_j^{(i)} \in A_j$
and $y \in \mathscr{Y} = \{1, \ldots, m\}$

$(X^{(1)}, Y_1), \ldots, (X^{(n)}, Y_n) \sim p(x,y\vert \theta)$ independent given $\theta = (\pi, \{r_jy\})$

Naive Bayes assumption says that the features are independent given the class & the parameter

- $p(y\vert \theta) = \pi(y)$ where $\pi = (\pi(1), \ldots, \pi(m))$
- $p(x_j\vert y, \theta) = P(X_J = x_j\vert Y=y, \theta) = r_{jy}(x_j)$

($\sum \pi(y)=1, \sum_{k \in A_j}r_{jy} \forall j, y$)

**Let’s put some prior: a natural choice is Dirichlet.**

(Dirichlet is a conjugate prior for categorical distribution.)

…

#### (ML 11.1) Estimators

Assume the data $D = (X_1, \ldots, X_n)$ are given as random variables.

**Definition.** A *statistic* is a random variable $S = f(D)$ that is a function of the data $D$.

**Terminology.** An *estimator* is a statistic intended to approximate a parameter governing the distribution of $D$.

**Notation.**

- $\hat{\theta}$ denotes an estimator of a parameter $\theta$.
- $\hat{\theta}_n$ emphasize (the dependence on) $n$

**Example.** $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ iid

(Sample mean) $\,\,$ $\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^nX_i$ $\,\,$ /cf. $\sigma^2 = \mathbb{E}((X - \mu)^2)$

(“Biased” sample variance) $\,\,$ $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i -\bar{X})^2$

(“unbiased” sample variance) $\,\,$ $s^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i -\bar{X})^2$

**Definition.**

- The
*bias*of an estimator $\hat{\theta}$ is $\,$ ${\rm bias}(\hat{\theta}) = \mathbb{E}(\hat{\theta}) - \theta$. - An estimator $\hat{\theta}$ is
*unbiased*if $\,$ ${\rm bias}(\theta) = 0$.

**Example.**

- $\hat{\mu}$ is unbiased: $\mathbb{E}(\hat{\mu}) = \mathbb{E}(\frac{1}{n}\sum_{i=1}^nX_i) =\frac{1}{n}\sum \mathbb{E}(X_i) = \frac{1}{n}\sum \mu = \mu$
- $\hat{\sigma}^2$ is biased. (Exercise)
- $s^2$ is unbiased. (Exercise)

#### (ML 11.2) Decision theory terminology in different contexts

General |
Estimators |
$^*$Regression/Classification |

Decision rule $\delta$ | $^*$Estimator function $g$ | Prediction function $f$ |

State $s$ (unknown) | Parameter $\theta$ (unknown) | Target value $Y$ (unknown) |

$^*$Data $D$ (observed) | Data $D$ (observed) | Point $X$ (observed) |

Action $a = \delta(D)$ | Estimator/Estimate $\hat{\theta}=g(D)$ | Prediction $\hat{Y} = f(X)$ |

Loss $L(s, a)$ | Loss $L(\theta, \hat{\theta})$ | Loss $L(Y, \hat{Y})$ |

**Example. (Estimators)**

An estimator is a random rariable: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i$.

An estimate is a number: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i = 2.3$.

(In some situation, the procedure $g$ is refered to as an estimator!)

#### (ML 11.3) Frequentist risk, Bayesian expected loss, and Bayes risk

**Loss and Risk.** Exciting session to clear up all the mud!

**Data:** $\,$ $D = (X_1, \ldots, X_n)$, $D \sim p_\theta$

**Parameter:** $\,$ $\theta \sim \pi$ $\,$, i.e., the parameter $\theta$ is a random variable.
**Estimator:** $\,$ $\hat{\theta} = f(D) = \delta(D)$

Everything begins with : $\,\,\,\,\,$ Loss $=L(\theta, f(D))$.

We wanna minimize the loss but it’s an RV!

Two option to deal with it:

- Averaging over $\theta$ given the data : $E(L(\theta, f(D)) \vert D) =:\rho(\pi, f(D))$ Bayesian expected loss
- Averaging over the data given $\theta$ : $E(L(\theta, f(D)) \vert \theta) =: R(\theta, f)$ (Frequentist) risk

#### (ML 11.4) Choosing a decision rule - Bayesian and frequentist

**How to choose $f$.**

**Bayesian:** Assume $\pi$

Case 1. Know $D$. Choose $f(D)$ to minimize $\rho(\pi, f(D))$

Case 2. Don’t know $D$. Choose $f$ to minimize $r(\pi, f)$

**Frequentist:** Introduce a furthere principle to guide your choice.

(a) Unbiasedness

(b) Admissibility
(c) Minimax

(d) Invariance

#### (ML 11.5) Bias-Variance decomposition (MSE $=$ bias$^2$ + var)

“A super impportant port of ML is what’s called model selection and a tool for model selection is the bias-variance decomposition.”

Almost trivial identity but extremely handy.

**Definition.** Let $D$ be random data. The MSE of an estimator $\hat{\theta} = f(D)$ for $\theta$ is

Put $\vert \theta$ emphasizing we’re not averagning over $\theta$ here (we don’t have a distribution over $\theta$). We’re just averaging over the data.

MSE$\theta$ is nothing but the risk $R(\theta, f)$ under square loss, i.e., when the loss function is the square of the deifference.

**Recall.** bias$(\hat{\theta}) = E(\theta) -\theta$.

**Proposition.** MSE$(\theta) = bias(\hat{\theta})^2 + {\rm var}(\hat{\theta})$

Proof:

**Silly example.**
$X \sim N(\theta, 1)$
$\theta$ nonrandom & unknown
$D = X

“Natural” estimate of $\theta$: $\delta_1(D) = X \leadsto$ bias$^2 = 0$, var$ = 1$, MSE$ =1$

“Silly” estimate of $\theta$: $\delta_0(D) = X \leadsto$ bias$^2 = \theta^2$, var$ = 0$, MSE$ = \theta^2$

cf. Shrinkage, Stein’s paradox

#### (ML 11.8) Bayesian decision theory

**Terminology.(Informal)**

- A
*generalized Bayes rule*is a decision rule $\delta$ that minimizes $\rho(\pi, \delta(D))$ for each $D$. - A
*Bayes rule*minimizes is a decision rule $\delta$ that minimizes $r(\pi, \delta)$.

**Remark**

- GBR $\Rightarrow$ BR, but BR $\not\Rightarrow$ GBR.
- If $r(\pi, \delta) = \infty$ for all $\delta$, then anything is a BR, but a GBR still makes sense.
- On a set of $\pi$-measure $0$, BR is arbitrary bit GBR is still sensible.

**Complete Class Theorems.** Under mild conditions:

- Every GBR (for a proper $\pi$) is admissible;
- Every admissible decision rule is a GBR for some (possibly improper) $\pi$

*Check out*:

- Complete Class Theorems for the Simplest Empirical Bayes Decision Problems
- A Complete Class Theorem for Statistical Problems with Finite Sample Spaces

#### (ML 12.1) Model selection - introduction and examples

**“Model” selection** really means “complexity” selection!

Here, *complexity* $\approx$ flexibility to fit/explain data

**Example** (Linaer regression with MLE for $w$) $f(x) = w^T\varphi(x)$

Given data $x \in \mathbb{R}$, consider polynomial basis $\varphi(x) = x^k$, $\varphi = (\varphi_0, \varphi_1, \ldots, \varphi_B)$

Turns out $B =$ “complexity parameter”

**Example** (Bayesia linear regression or MAP)

**Example** ($k$NN)

$k$ “controls” decesion boundaties.

#### (ML 12.2) Bias-variance in model selection

Bias-variance trade-off, as they say.

MSE $=$ bias$^2 +$ var / $\in$MSE $=$ $\int$bias$^2 +$ $\int$var (only applies for square loss)

#### (ML 15.1) Newton’s method (for optimization) - intuition

2nd order method!

(Gradient descent $x_{t+1} = x_t - \alpha_t \nabla f(x_t)$ : 1st order method)

**Analogy (1D).**

- zero-finding: $x_{t+1} = x_t - \frac{f(x_t)}{f’(x_t)}$

- min./maximizing: $x_{t+1} = x_t - \frac{f’(x_t)}{f’‘(x_t)}$

#### (ML 15.2) Newton’s method (for optimization) in multiple dimensions

Idea: “Make a 2nd order approximation and minimize tha.”

Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be (sufficiently) smooth.

**Taylor’s theorem:** for $x$ near a, letting $g = \nabla f(a)$ and $H = \nabla^2f(a) = \left(\frac{\partial^2}{\partial x_i \partial x_j}f(a)\right)_{ij}$,

Minimize: $0 = \nabla q = Hx + b \Rightarrow x = -H^{-1}b = a -H^{-1}g$

Critical to check: $\nabla^2 q = H$ $\Rightarrow$ minimum if $H$ is PSD.

**Algorithm.**

- Initialize $x \in \mathbb{R}^n$
- Iterate: $x_{t+1} = x_t - H^{-1}g$ where $g = \nabla f(x_t), H = \nabla^2 f(x_t)$

**Issues.**

- $H$ may fail to be PSD. (Option: switch gradient descent. A smart way to do it: Levenberg–Marquardt algorithm)
- Rather than invert $H$, sove $Hy = g$ for $y$, then use $x_{t+1} = x_t - y$. (More robust approach)
- $x_{t+1} = x_t - \alpha_t y$. (small “step size” $\alpha_t>0$)

#### (ML 17.1) Sampling methods - why sampling, pros and cons

Why sampling?

- For approximate expectations (estimate statistics / posterior infernce i.e. computing probability)
- For visualization

Why expectations?

- Any probability is an expectation: $P[X \in A] = E[I(X \in A)]$.
- Approximation is needed for intractable sums/integrals (can be expressed as expectations)

Pros.

- Easy (both to implement and understand)
- General purpose

Cons.

- Too easy - used inappropriately
- Slow
- Getting “good” samples may be dificult
- Difficult to assess

#### (ML 17.2) Monte Carlo methods - A little history

#### (ML 17.3) Monte Carlo approximation

Goal: Aprroximate $E[f(X)]$, when intractable.

Definition (Monte Carlo estimator): If $X_1, \ldots, X_n \sim p$ iid then

is a (basic) *Monte Carlo estimator* of $E[f(X)]$ where $X \sim p$. (sample mean)

Remark

(1) $E[\hat{\mu}_n] = E[f(X)]$ (i.e. $\hat{\mu}_n$ is an unbiased estimator)

(2)

#### (ML 17.5) Importance sampling - introduction

*It’s not a sampling method but an estimation technique!*

It can be thought of as a variant of MC estimation.

**Recall.** MC estimation (by sample mean):

under the BIG assumtion that $X \sim p$ and $X_i \sim p$.

Can we do something similar by drawing samples from an alternative distribution $q$?

*Yes*, and in some cases you can do much much better!

**Density $p$ case.**

holds for all pdf $q$ with respect to wchi $q$ is *absolutely continuous*, i.e., $p(x) = 0$ whenever $q(x)= 0$.

#### (ML 17.6) Importance sampling - intuition

What makes a good/bad choice of the *proposal distrubution* $q$?

*To be added..*

**Subscribe via RSS**