4 Linear Regression with multiple variables
4-1 Multiple features
Multiple features (variables)
Notation:
$n$ = number of features
$m$ = number of training examples
$x^{(i)}$ = input (features) of the $i^{th}$ training example
$x_j^{(i)}$ = value of feature $j$ in the $i^{th}$ training example
Hypothesis
$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n$
Define $x_0=1$.
$x=\begin{bmatrix}x_0\\ x_1\\ x_2\\ \vdots\\ x_n\end{bmatrix}$  $\theta=\begin{bmatrix}\theta_0\\ \theta_1\\ \theta_2\\ \vdots\\ \theta_n\end{bmatrix}$
$h_\theta(x)=\theta^T x$
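With the convention $x_0=1$, the hypothesis for all training examples can be computed as one matrix-vector product. A minimal Octave sketch with made-up numbers (the values and variable names are illustrative only):

```octave
% Toy example: m = 3 training examples, n = 2 features (plus x0 = 1)
X = [1 2104 5;
     1 1416 3;
     1 1534 3];            % m x (n+1) design matrix, first column is x0
theta = [80; 0.1; 50];     % (n+1) x 1 parameter vector (made-up values)
h = X * theta;             % m x 1 predictions, h(i) = theta' * x^(i)
```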
4-2 Gradient descent for multiple variables
Cost function
$J(\theta)=J(\theta_0,\theta_1,\cdots,\theta_n)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$
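The cost can be computed in vectorized form. A sketch (the function name `computeCost` and the variable names are illustrative, not from the notes):

```octave
% J(theta) = (1/(2m)) * sum_i (h_theta(x^(i)) - y^(i))^2, vectorized
function J = computeCost(X, y, theta)
  m = length(y);                      % number of training examples
  errors = X * theta - y;             % m x 1 vector of prediction errors
  J = (1 / (2 * m)) * sum(errors .^ 2);
end
```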
Gradient descent
$\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$
Previously ($n=1$):
Repeat{
$\begin{cases}\theta_0 := \theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right) \\ \theta_1 := \theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x^{(i)}\end{cases}$
}
New algorithm ($n\ge 1$):
Repeat{
$\theta_j := \theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$ (simultaneously update $\theta_j$ for $j=0,\dots,n$)
}
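The per-parameter updates above can be done for all $\theta_j$ at once. A minimal sketch, assuming the caller supplies `X` (an $m\times(n+1)$ matrix with a leading column of ones), `y` ($m\times 1$), an initial `theta`, a learning rate `alpha`, and an iteration count (the function name is illustrative):

```octave
% Batch gradient descent for multivariate linear regression
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    % Simultaneous update of every theta_j: gradient = (1/m) * X' * (X*theta - y)
    theta = theta - (alpha / m) * (X' * (X * theta - y));
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);   % track J(theta)
  end
end
```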
4-3 Gradient descent in practice I: Feature Scaling
Feature Scaling
Idea: make sure features are on a similar scale.
Get every feature into approximately a $-1 \le x_i \le 1$ range ($x_0 = 1$ already satisfies this).
Mean normalization
Replace $x_i$ with $x_i-\mu_i$ to make features have approximately zero mean (do not apply to $x_0=1$).
$x_i \longleftarrow \dfrac{x_i-\mu_i}{s_i}$
$x_i$: value of feature $i$
$\mu_i$: average value of feature $i$ in the training set
$s_i$: range of feature $i$ (max $-$ min), or alternatively its standard deviation
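A sketch of mean normalization in Octave, applied to the feature columns only (the $x_0$ column of ones is added afterwards); the function name is illustrative:

```octave
% Mean normalization: x_i <- (x_i - mu_i) / s_i for every feature column
function [X_norm, mu, s] = featureNormalize(X)
  mu = mean(X);                 % 1 x n row vector of per-feature means
  s  = max(X) - min(X);         % 1 x n row vector of per-feature ranges (max - min)
  X_norm = (X - mu) ./ s;       % Octave broadcasting applies this column by column
end
```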
4-4 Gradient descent in practice II: Learning rate
Gradient descent
$\theta_j := \theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)$
Debugging: make sure gradient descent is working correctly. Plot $J(\theta)$ as a function of the number of iterations; the curve should decrease on every iteration.
How to choose the learning rate $\alpha$?
Convergence test:
Declare convergence if $J(\theta)$ decreases by less than some small threshold $\epsilon$ (e.g. $10^{-3}$) in one iteration.
Summary
For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration.
But if $\alpha$ is too small, gradient descent can be slow to converge; if $\alpha$ is too large, $J(\theta)$ may not decrease on every iteration and may not converge at all.
To choose $\alpha$, try a range of values spaced roughly $3\times$ apart:
$\cdots$, 0.001, 0.003, 0.01, 0.03, $\cdots$
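A sketch of this debugging procedure, reusing the illustrative `gradientDescentMulti` from section 4-2 and assuming `X` and `y` are already loaded; it plots one $J(\theta)$ curve per candidate $\alpha$:

```octave
% Compare J(theta) curves for several candidate learning rates
alphas = [0.001, 0.003, 0.01, 0.03];
num_iters = 400;
figure; hold on;
for k = 1:length(alphas)
  theta_init = zeros(size(X, 2), 1);
  [theta_k, J_history] = gradientDescentMulti(X, y, theta_init, alphas(k), num_iters);
  plot(1:num_iters, J_history);        % J(theta) should decrease every iteration
end
xlabel('number of iterations'); ylabel('J(\theta)');
legend('0.001', '0.003', '0.01', '0.03');
hold off;
```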
4-5 Features and polynomial regression
Housing price prediction
Polynomial regression
$h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3$
$\longrightarrow h_\theta(x)=\theta_0+\theta_1(\text{size})+\theta_2(\text{size})^2+\theta_3(\text{size})^3$
(i.e. $x_1=\text{size}$, $x_2=\text{size}^2$, $x_3=\text{size}^3$; with such features, feature scaling becomes very important, since $\text{size}^3$ has a far larger range than $\text{size}$.)
Choice of features
$h_\theta(x)=\theta_0+\theta_1(\text{size})+\theta_2\sqrt{\text{size}}$
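A sketch of constructing such derived features before running gradient descent, assuming a column vector `sz` of house sizes (the names are illustrative):

```octave
% Build polynomial / square-root features from the raw size feature
m = length(sz);
X_poly = [ones(m, 1), sz, sz .^ 2, sz .^ 3];   % cubic model of size
X_sqrt = [ones(m, 1), sz, sqrt(sz)];           % size + sqrt(size) model
% Apply feature scaling to the non-constant columns before gradient descent
```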
4-6 Normal equation
Intuition: if $\theta\in\mathbb{R}$ (a scalar), minimize $J(\theta)$ by setting $\frac{d}{d\theta}J(\theta)=0$; for $\theta\in\mathbb{R}^{n+1}$, set each partial derivative $\frac{\partial}{\partial\theta_j}J(\theta)=0$ and solve for $\theta_0,\theta_1,\dots,\theta_n$.
$\theta=(X^TX)^{-1}X^Ty$
theta = pinv(X' * X) * X' * y   % Octave
$m$ examples $(x^{(1)},y^{(1)}),\cdots,(x^{(m)},y^{(m)})$; $n$ features.
$x^{(i)}=\begin{bmatrix} x_0^{(i)}\\ x_1^{(i)}\\ x_2^{(i)}\\ \vdots\\ x_n^{(i)}\end{bmatrix}$  $X=\begin{bmatrix}\cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix}$
$y=\begin{bmatrix}y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)}\end{bmatrix}$
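Putting the pieces together as a sketch, assuming a matrix `data` whose first $n$ columns are the raw features and whose last column is the target (the variable names are illustrative):

```octave
% Build the design matrix and solve the normal equation in one shot
features = data(:, 1:end-1);          % m x n raw feature matrix
y        = data(:, end);              % m x 1 target vector
m        = size(data, 1);
X        = [ones(m, 1), features];    % m x (n+1), with x0 = 1 prepended
theta    = pinv(X' * X) * X' * y;     % no feature scaling or alpha needed
```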
| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | Doesn't need to iterate |
| Works well even when $n$ is large | Needs to compute $(X^TX)^{-1}$, roughly $O(n^3)$ |
| | Slow if $n$ is very large |
4-7 Normal equation and non-invertibility (optional)
What if $X^TX$ is non-invertible? (Non-invertible matrices are also called singular or degenerate matrices.)
Common causes:
Redundant features (linearly dependent)
e.g. $x_1$ = size in $\text{feet}^2$, $x_2$ = size in $\text{m}^2$ (then $x_1=(3.28)^2\,x_2$, a linear relation)
Too many features (e.g. $m\le n$)
Delete some features, or use regularization.
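A small sketch of the redundant-feature case, showing that Octave's `pinv` (pseudo-inverse) still returns a usable $\theta$ even though $X^TX$ is singular; the numbers are made up purely for illustration:

```octave
% Two linearly dependent feature columns make X'X singular
sz_ft2 = [1500; 2000; 2500; 3000];         % size in feet^2 (made-up data)
sz_m2  = sz_ft2 / (3.28 ^ 2);              % same sizes in m^2: x2 is a multiple of x1
y      = [300; 400; 500; 600];             % made-up prices
X      = [ones(4, 1), sz_ft2, sz_m2];
rank(X' * X)                               % prints 2 < 3: X'X is non-invertible
theta  = pinv(X' * X) * X' * y             % pinv still returns a usable solution
```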
Octave Tutorial
Working on and submitting programming exercises