
Machine Learning Foundations: Study Notes 01

Lecture 01

Human learning:

Machine learning:

Goal of machine learning:

unknown target function:

$f : \mathcal{X} \rightarrow \mathcal{Y}$

Machine learning process:

$\mathcal{A} \text{ takes } \mathcal{D} \text{ and } \mathcal{H} \text{ to get } g$

$\mathcal{A}$ is the machine learning algorithm

$\mathcal{D}$ is the data

$\mathcal{H}$ is the hypothesis set

$g$ is the final hypothesis, with $g \approx f$

Lecture 02

Vector Form of Perceptron Hypothesis

$$\begin{aligned} h(\boldsymbol{x}) &=\operatorname{sign}\left(\left(\sum_{i=1}^{d} w_{i} x_{i}\right)-\text { threshold }\right) \\ &=\operatorname{sign}\left(\left(\sum_{i=1}^{d} w_{i} x_{i}\right)+\underbrace{(-\text { threshold })}_{w_{0}} \cdot \underbrace{(+1)}_{x_{0}}\right) \\ &=\operatorname{sign}\left(\sum_{i=0}^{d} w_{i} x_{i}\right) \\ &=\operatorname{sign}\left(\boldsymbol{w}^{T} \boldsymbol{x}\right) \end{aligned}$$

Perceptrons in $\mathbb{R}^{2}$

$\boldsymbol{x}$: points in the plane (or points in $\mathbb{R}^{d}$); $h$: lines (or hyperplanes in $\mathbb{R}^{d}$)

$h(\boldsymbol{x})=\operatorname{sign}\left(w_{0}+w_{1} x_{1}+w_{2} x_{2}\right)$

label $y$: $+1$ or $-1$

perceptrons $\Leftrightarrow$ linear (binary) classifiers
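The vector form $h(\boldsymbol{x})=\operatorname{sign}(\boldsymbol{w}^{T}\boldsymbol{x})$ can be sketched directly in code. A minimal NumPy illustration (the function and variable names here are my own, not from the notes):

```python
import numpy as np

def perceptron_predict(w, x):
    """Linear (binary) classifier: h(x) = sign(w^T x).

    Both w and x include the bias coordinate x0 = +1,
    so w[0] plays the role of (-threshold).
    """
    return 1 if np.dot(w, x) > 0 else -1

# A line w0 + w1*x1 + w2*x2 = 0 in R^2, here x1 + x2 = 1:
w = np.array([-1.0, 1.0, 1.0])                            # (w0, w1, w2)
print(perceptron_predict(w, np.array([1.0, 2.0, 2.0])))   # above the line -> 1
print(perceptron_predict(w, np.array([1.0, 0.0, 0.0])))   # below the line -> -1
```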

Select $g$ from $\mathcal{H}$

  • want: $g \approx f$
  • difficult: $\mathcal{H}$ is of infinite size
  • idea: start from some $g_{0}$, and correct its mistakes on $\mathcal{D}$

Perceptron Learning Algorithm

(figure: Perceptron Learning Algorithm pseudocode)

Find a mistake of $\boldsymbol{w}_t$, i.e. an example $(\boldsymbol{x}_{n(t)}, y_{n(t)})$ with

$\operatorname{sign}\left(\boldsymbol{w}_{t}^{T} \boldsymbol{x}_{n(t)}\right) \neq y_{n(t)}$

then correct the mistake by

$\boldsymbol{w}_{t+1} = \boldsymbol{w}_{t}+y_{n(t)} \boldsymbol{x}_{n(t)}$

*A fault confessed is half redressed.* :-)
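The correction rule above can be sketched as a small PLA implementation (a minimal sketch assuming the cyclic variant; names are my own):

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm (cyclic variant).

    X: (N, d+1) array with x0 = +1 already prepended; y: labels in {+1, -1}.
    Returns a weight vector w (the hypothesis g), assuming the data are
    linearly separable within max_iters passes.
    """
    w = np.zeros(X.shape[1])                     # start from w_0 = 0
    for _ in range(max_iters):
        mistake = False
        for x_n, y_n in zip(X, y):
            if np.sign(w @ x_n) != y_n:          # sign(w^T x_n) != y_n
                w = w + y_n * x_n                # correct the mistake
                mistake = True
        if not mistake:                          # halts: no mistakes left
            return w
    return w

# Tiny linearly separable toy set (bias coordinate first):
X = np.array([[1., 2., 2.], [1., 1., 3.], [1., -1., -1.], [1., -2., 0.]])
y = np.array([1, 1, -1, -1])
w = pla(X, y)
print(all(np.sign(w @ x_n) == y_n for x_n, y_n in zip(X, y)))  # True
```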

Definition 1.1: $\boldsymbol{w}_f$ satisfies $y_{n}=\operatorname{sign}\left(\boldsymbol{w}_{f}^{T} \boldsymbol{x}_{n}\right)$ for every $n$

Then $\boldsymbol{w}_t$ gets more aligned with $\boldsymbol{w}_f$:

(i)

$y_{n(t)} \boldsymbol{w}_{f}^{T} \boldsymbol{x}_{n(t)} \geq \min _{n} y_{n} \boldsymbol{w}_{f}^{T} \boldsymbol{x}_{n}>0$

$$\begin{aligned} \boldsymbol{w}_{f}^{T} \boldsymbol{w}_{t+1} &=\boldsymbol{w}_{f}^{T}\left(\boldsymbol{w}_{t}+y_{n(t)} \boldsymbol{x}_{n(t)}\right) \\ & \geq \boldsymbol{w}_{f}^{T} \boldsymbol{w}_{t}+\min _{n} y_{n} \boldsymbol{w}_{f}^{T} \boldsymbol{x}_{n} \\ &>\boldsymbol{w}_{f}^{T} \boldsymbol{w}_{t}+0 \end{aligned}$$

(ii)

$\boldsymbol{w}_t$ changed only when there is a mistake: $\operatorname{sign}\left(\boldsymbol{w}_{t}^{T} \boldsymbol{x}_{n(t)}\right) \neq y_{n(t)} \Leftrightarrow y_{n(t)} \boldsymbol{w}_{t}^{T} \boldsymbol{x}_{n(t)} \leq 0$

$$\begin{aligned}\left\|\boldsymbol{w}_{t+1}\right\|^{2} &=\left\|\boldsymbol{w}_{t}+y_{n(t)} \boldsymbol{x}_{n(t)}\right\|^{2} \\ &=\left\|\boldsymbol{w}_{t}\right\|^{2}+2 y_{n(t)} \boldsymbol{w}_{t}^{T} \boldsymbol{x}_{n(t)}+\left\|y_{n(t)} \boldsymbol{x}_{n(t)}\right\|^{2} \\ & \leq\left\|\boldsymbol{w}_{t}\right\|^{2}+0+\left\|y_{n(t)} \boldsymbol{x}_{n(t)}\right\|^{2} \\ & \leq\left\|\boldsymbol{w}_{t}\right\|^{2}+\max _{n}\left\|y_{n} \boldsymbol{x}_{n}\right\|^{2} \end{aligned}$$

start from $\boldsymbol{w}_0 = \boldsymbol{0}$; after $T$ mistake corrections,

$$\begin{aligned} \frac{\boldsymbol{w}_{f}^{T}}{\left\|\boldsymbol{w}_{f}\right\|} \frac{\boldsymbol{w}_{T}}{\left\|\boldsymbol{w}_{T}\right\|} & \geq \frac{T \cdot \min _{n} y_{n} \boldsymbol{w}_{f}^{T} \boldsymbol{x}_{n}}{\left\|\boldsymbol{w}_{f}\right\| \left\|\boldsymbol{w}_{T}\right\|} \\ & \geq \frac{T \cdot \min _{n} y_{n} \boldsymbol{w}_{f}^{T} \boldsymbol{x}_{n}}{\left\|\boldsymbol{w}_{f}\right\| \cdot \sqrt{T} \cdot \max _{n} \left\|\boldsymbol{x}_{n}\right\|} \\ & \geq \sqrt{T} \cdot \frac{\min _{n} y_{n} \boldsymbol{w}_{f}^{T} \boldsymbol{x}_{n}}{\left\|\boldsymbol{w}_{f}\right\| \cdot \max _{n} \left\|\boldsymbol{x}_{n}\right\|} \\ & \geq \sqrt{T} \cdot \text { constant } \end{aligned}$$
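Since the left-hand side is an inner product of two unit vectors, it is at most 1, so the chain yields an explicit bound on $T$ (writing $\rho$ for the margin and $R$ for the radius — symbols introduced here, not used elsewhere in the notes):

$$1 \geq \frac{\boldsymbol{w}_{f}^{T}}{\left\|\boldsymbol{w}_{f}\right\|} \frac{\boldsymbol{w}_{T}}{\left\|\boldsymbol{w}_{T}\right\|} \geq \sqrt{T} \cdot \frac{\rho}{R} \quad \Longrightarrow \quad T \leq \frac{R^{2}}{\rho^{2}}, \qquad \rho=\frac{\min _{n} y_{n} \boldsymbol{w}_{f}^{T} \boldsymbol{x}_{n}}{\left\|\boldsymbol{w}_{f}\right\|}, \quad R=\max _{n}\left\|\boldsymbol{x}_{n}\right\|$$

So PLA is guaranteed to halt on linearly separable data, though the bound itself depends on the unknown $\boldsymbol{w}_f$.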

Pros:

  • simple to implement, fast, works in any dimension $d$

Cons:

  • assumes the data are linearly separable
  • not fully sure how long halting takes, since the constant depends on the unknown $\boldsymbol{w}_f$

Learning with Noisy Data

Pocket Algorithm
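When the data are noisy (not linearly separable), PLA never halts; the pocket algorithm runs PLA-style corrections but keeps the best weight vector seen so far "in the pocket". A minimal sketch, assuming random mistake selection (names are my own):

```python
import numpy as np

def pocket(X, y, max_iters=100, rng=None):
    """Pocket algorithm: PLA-style updates on possibly noisy data,
    keeping the weight vector with the fewest training mistakes.

    X: (N, d+1) array with x0 = +1 prepended; y: labels in {+1, -1}.
    """
    rng = rng or np.random.default_rng(0)
    w = np.zeros(X.shape[1])

    def n_mistakes(w):
        return int(np.sum(np.sign(X @ w) != y))

    best_w, best_err = w.copy(), n_mistakes(w)
    for _ in range(max_iters):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:                      # separable case: done
            return w
        n = rng.choice(wrong)                    # pick a random mistake
        w = w + y[n] * X[n]                      # PLA correction
        err = n_mistakes(w)
        if err < best_err:                       # pocket: keep the best w
            best_w, best_err = w.copy(), err
    return best_w

# On separable data, pocket behaves like PLA:
X = np.array([[1., 2., 2.], [1., 1., 3.], [1., -1., -1.], [1., -2., 0.]])
y = np.array([1, 1, -1, -1])
w = pocket(X, y)
print(int(np.sum(np.sign(X @ w) != y)))
```

The trade-off versus plain PLA: each update requires a full pass over the data to count mistakes, so pocket is noticeably slower.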

(figure: Pocket Algorithm pseudocode)