Stanford Practical Machine Learning: Linear Models


This chapter mainly introduces linear models!

Linear Regression

  • A simple house price prediction
    • Assume 3 features: $x_1 = \#\text{beds}$, $x_2 = \#\text{baths}$, $x_3 = \#\text{living sqft}$
    • The predicted price is $\hat{y} = w_1x_1 + w_2x_2 + w_3x_3 + b$
    • Weights $w_1, w_2, w_3$ and bias $b$ will be learnt from training data
  • In general, given data $\mathbf{x} = [x_1, x_2, \dots, x_p]$, linear regression predicts $\hat{y} = \langle\mathbf{w}, \mathbf{x}\rangle + b$ (see the sketch below)
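
A minimal sketch of this prediction in PyTorch; all numbers below are made up for illustration:

import torch

# three illustrative features for one house: #beds, #baths, living sqft
x = torch.tensor([3.0, 2.0, 1500.0])
# made-up "learned" weights and bias
w = torch.tensor([50000.0, 30000.0, 200.0])
b = torch.tensor(10000.0)

y_hat = torch.dot(w, x) + b   # <w, x> + b
print(y_hat)                  # tensor(520000.)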

Objective Function

  • Objective: minimize the mean square error (MSE)
    • MSE over $n$ training examples: $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, i.e. take (actual price − predicted price), square it, sum over all examples, and average (a short numeric sketch follows)
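
A minimal numeric sketch of the MSE objective, with made-up prices:

import torch

y     = torch.tensor([300000.0, 450000.0, 250000.0])  # actual prices (made up)
y_hat = torch.tensor([310000.0, 440000.0, 260000.0])  # predicted prices (made up)

mse = ((y - y_hat) ** 2).mean()  # squared errors, averaged; here 1e8
print(mse)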

Use linear regression for classification problem

Here the model predicts a raw real number, which is awkward for classification: we only need the probability of the correct class to be close to 1, but with real-valued targets the model is forced to match the exact encoding and ends up "caring too much about the other classes".

  • Regression: continuous output over the real numbers
  • Multi-class classification:
    • One-hot label $y = [y_1, y_2, \dots, y_m]$, where $y_i = 1$ if $i = y$, otherwise 0
    • $\hat{y} = o$, where the i-th output $o_i$ is the confidence score for class $i$
    • Learn a linear model for each class: $o_i = \langle\mathbf{x}, \mathbf{w}_i\rangle + b_i$
    • Minimize the MSE loss $\frac{1}{m}||o - y||_{2}^{2}$
    • Predict the label $\arg\max_i \{o_i\}_{i=1}^{m}$
  • Problem: wastes model capacity on pushing $o_i$ toward 0 for the off labels (see the sketch below)
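
A minimal sketch of this setup (one linear model per class, MSE against one-hot labels, argmax prediction); the names and sizes below are illustrative:

import torch
import torch.nn.functional as F

n, p, m = 6, 4, 3                        # examples, features, classes (made-up sizes)
X = torch.randn(n, p)                    # inputs
y = torch.randint(0, m, (n,))            # integer class labels
Y = F.one_hot(y, num_classes=m).float()  # one-hot labels, shape (n, m)

W = 0.01 * torch.randn(p, m)             # column i is the weight vector w_i for class i
b = torch.zeros(m)                       # bias b_i for each class

O = X @ W + b                            # confidence scores o_i, shape (n, m)
loss = ((O - Y) ** 2).mean()             # MSE loss against the one-hot labels
pred = O.argmax(dim=1)                   # predicted label: argmax_i o_i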

Softmax Regression

Softmax regression handles this better: the other classes can still receive scores, but after normalization they cannot collectively take much probability mass. As long as the true class gets a large enough probability, the loss is small.

  • One-hot label $y = [y_1, y_2, \dots, y_m]$, where $y_i = 1$ if $i = y$, otherwise 0

  • $\hat{y} = \mathrm{softmax}(o)$, where $\hat{y}_i = \frac{\exp(o_i)}{\sum_{k=1}^{m}\exp(o_k)}$

    • Turns confidence scores into probabilities (non-negative, sum to 1)
    • Ideally we want $\hat{y} = \text{one-hot}(\arg\max_i o_i)$; softmax is a continuous approximation to that
    • Still a linear model: the decision is made on a linear transformation of the input, since $\arg\max_i \hat{y}_i = \arg\max_i o_i$
  • Cross-entropy loss between the two distributions $\hat{y}$ and $y$: $H(y, \hat{y}) = \sum_{i} - y_i \log(\hat{y}_i) = -\log \hat{y}_y$ (a short numeric sketch follows this list)

    • When the label class is $i$, the loss assigns little penalty to another score $o_j$ as long as $o_j \ll o_i$
  • Exercise: think about how to handle examples with multiple labels
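
A minimal numeric sketch of softmax plus cross-entropy for a single example; the scores are made up:

import torch

o = torch.tensor([2.0, 0.5, -1.0])    # confidence scores for m = 3 classes (made up)
y = torch.tensor([1.0, 0.0, 0.0])     # one-hot label, true class is 0

y_hat = torch.softmax(o, dim=0)       # probabilities: non-negative, sum to 1
loss = -(y * torch.log(y_hat)).sum()  # H(y, y_hat) = -log y_hat[true class]
print(y_hat, loss)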

Mini-batch Stochastic gradient descent (SGD)

  • Train by mini-batch SGD (other optimizers work as well)
    • Notation: model parameters $\mathbf{w}_t$, batch size $b$, learning rate $\eta_t$ at time $t$
    • Randomly initialize the model parameters $\mathbf{w}_1$
    • Repeat for $t = 1, 2, \dots$ until convergence:
      • Randomly sample a mini-batch $I_t \subset \{1, \dots, n\}$ with $|I_t| = b$
      • Update the model parameters: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla_{\mathbf{w}_t} \frac{1}{b}\sum_{i \in I_t} \ell(\mathbf{x}_i, y_i, \mathbf{w}_t)$
  • Pros: can optimize all the objectives in this course except for trees
  • Cons: sensitive to the batch size and learning rate hyper-parameters

Code

  • Train a linear regression model with mini-batch SGD
  • Hyperparameters
    • batch_size
    • learning_rate
    • num_epochs
  • Code fragment
import random
import torch

# `features` shape is (n, p), `labels` shape is (n, 1)
# assumes `p`, `batch_size`, `learning_rate`, `num_epochs` are defined elsewhere
def data_iter(batch_size, features, labels):
    """Yield mini-batches of (features, labels) in random order."""
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # read examples at random
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

# randomly initialize the model parameters
w = torch.normal(0, 0.01, size=(p, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        y_hat = X @ w + b                     # linear prediction
        loss = ((y_hat - y) ** 2 / 2).mean()  # mean squared error (halved)
        loss.backward()                       # compute gradients
        with torch.no_grad():                 # update without tracking gradients
            for param in [w, b]:
                param -= learning_rate * param.grad
                param.grad.zero_()
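
For comparison, a sketch of the same training loop using PyTorch's built-in nn.Linear, nn.MSELoss, and SGD optimizer; it assumes the same p, batch_size, learning_rate, num_epochs, features, labels, and data_iter as above:

import torch
from torch import nn

model = nn.Linear(p, 1)    # packs w and b into one module
loss_fn = nn.MSELoss()
trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        loss = loss_fn(model(X), y)
        trainer.zero_grad()  # clear old gradients
        loss.backward()      # compute new gradients
        trainer.step()       # apply the SGD update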

Summary

  • Linear methods linearly combine inputs to obtain predictions
  • Linear regression uses MSE as the loss function
  • Softmax regression is used for multiclass classification
    • Turn predictions into probabilities and use cross-entropy as loss
    • Cross-entropy loss measures the difference between two probability distributions
  • Mini-batch SGD can learn both models (and later neural networks as well)

