Stanford Practical Machine Learning - Linear Models


This chapter mainly introduces linear models!

Linear Regression

  • A simple house price prediction
    • Assume 3 features: $x_1 = \#\text{beds}$, $x_2 = \#\text{baths}$, $x_3 = \#\text{living sqft}$
    • The predicted price is $\hat{y} = w_1x_1 + w_2x_2 + w_3x_3 + b$
    • Weights $w_1, w_2, w_3$ and bias $b$ will be learnt from training data
  • In general, given data $\mathbf{x} = [x_1, x_2, \dots, x_p]$, linear regression predicts $\hat{y} = \langle\mathbf{w}, \mathbf{x}\rangle + b$
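
For example, with purely illustrative numbers (not from the course): weights $\mathbf{w} = [50000, 30000, 200]$, bias $b = 100000$, and a house with $\mathbf{x} = [3\ \text{beds}, 2\ \text{baths}, 1500\ \text{sqft}]$ give $\hat{y} = 50000 \cdot 3 + 30000 \cdot 2 + 200 \cdot 1500 + 100000 = 610000$.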

Objective Function

  • Objective: minimize the mean square error (MSE)
    • (actual price − predicted price), squared, summed over all training examples, then averaged (written out below)
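
Written out over $n$ training examples $(\mathbf{x}^{(i)}, y^{(i)})$ (the superscript notation is introduced here for clarity), the objective is

$$\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \langle\mathbf{w}, \mathbf{x}^{(i)}\rangle - b\right)^2$$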

Use linear regression for classification problems

Predicting a raw real number here is not ideal and is hard to tune. All we really need is for the predicted probability of the correct class to be high; forcing real-valued outputs to match the exact label values makes the model "care too much about the other classes".

  • Regression: continuous output over the real numbers
  • Multi-class classification:
    • One-hot label $y = [y_1, y_2, \dots, y_m]$, where $y_i = 1$ if $i = y$ and 0 otherwise
    • Output $\mathbf{o} = [o_1, o_2, \dots, o_m]$, where the i-th output $o_i$ is the confidence score for class i
    • Learn a linear model for each class: $o_i = \langle\mathbf{x}, \mathbf{w}_i\rangle + b_i$
    • Minimize the MSE loss $\frac{1}{m}\|\mathbf{o} - \mathbf{y}\|_{2}^{2}$
    • Predict the label $\arg\max_i \{o_i\}_{i=1}^{m}$ (see the sketch after this list)
  • Problem: wastes model capacity on pushing $o_i$ toward 0 for the off labels
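
A minimal PyTorch sketch of this setup, not from the course code; the names (`X`, `Y_onehot`, `W`, `B`) and shapes are assumptions for illustration:

import torch

n, p, m = 100, 5, 3                                    # examples, features, classes (assumed)
X = torch.randn(n, p)                                  # input features
y = torch.randint(0, m, (n,))                          # integer class labels
Y_onehot = torch.nn.functional.one_hot(y, m).float()   # one-hot targets

W = torch.randn(p, m) * 0.01                           # one linear model (column) per class
B = torch.zeros(m)

O = X @ W + B                          # confidence scores o_i, shape (n, m)
loss = ((O - Y_onehot) ** 2).mean()    # MSE between scores and one-hot labels
pred = O.argmax(dim=1)                 # predicted label: argmax_i o_i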

Softmax Regression

Softmax regression handles this better: the other classes may still get some score, as long as their total is not too large; it is enough for the predicted probability of the correct class to be high.

  • One-hot label $y = [y_1, y_2, \dots, y_m]$, where $y_i = 1$ if $i = y$ and 0 otherwise

  • $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})$, where $\hat{y}_i = \frac{\exp(o_i)}{\sum_{k=1}^{m}\exp(o_k)}$

    • Turns confidence scores into probabilities (non-negative, sum to 1)
    • Ideally we want $\hat{\mathbf{y}} = \text{one-hot}(\arg\max_i o_i)$; softmax is a continuous approximation of that
    • Still a linear model: the decision is made on a linear transformation of the input, since $\arg\max_i \hat{y}_i = \arg\max_i o_i$
  • Cross-entropy loss between the two distributions $\hat{\mathbf{y}}$ and $\mathbf{y}$: $H(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i} y_i \log \hat{y}_i = -\log \hat{y}_y$ (a numeric sketch follows this list)

    • When the label class is $i$, it assigns little penalty to $o_j$ as long as $o_j \ll o_i$
  • Exercise: think about how to handle examples with multiple labels
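
A small numeric sketch of softmax plus cross-entropy (the logits below are made-up values for illustration):

import torch

o = torch.tensor([2.0, 0.5, -1.0])   # confidence scores for m = 3 classes (made up)
y = 0                                # index of the true class

y_hat = torch.softmax(o, dim=0)      # non-negative probabilities that sum to 1
loss = -torch.log(y_hat[y])          # cross-entropy reduces to -log(y_hat_y)

# same value in one call; cross_entropy applies softmax internally
loss_check = torch.nn.functional.cross_entropy(o.unsqueeze(0), torch.tensor([y]))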

Mini-batch Stochastic gradient descent (SGD)

  • Train by mini-batch SGD (training by various other methods works as well)
    • Notation: $\mathbf{w}$ is the model parameter, $b$ the batch size, and $\eta_t$ the learning rate at time $t$
    • $\mathbf{w}_1 \leftarrow$ randomly initialized model parameters
    • Repeat $t = 1, 2, \dots$ until convergence
      • Randomly sample a subset $I_t \subset \{1, \dots, n\}$ with $|I_t| = b$ examples
      • Update the model parameters: $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla_{\mathbf{w}_t} \ell(\mathbf{X}_{I_t}, \mathbf{y}_{I_t}, \mathbf{w}_t)$
  • Pros: can optimize all objectives covered in this course except for trees
  • Cons: sensitive to the hyper-parameters batch size and learning rate

Code

  • Train a linear regression model with mini-batch SGD
  • Hyperparameters
    • batch_size
    • learning_rate
    • num_epochs
  • Code fragment
import random
import torch

# `features` shape is (n, p), `labels` shape is (n, 1)
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # read examples in random order
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i:min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

# p is the number of input features; hyper-parameters are defined outside this fragment
w = torch.normal(0, 0.01, size=(p, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        y_hat = X @ w + b
        loss = ((y_hat - y) ** 2 / 2).mean()
        loss.backward()
        with torch.no_grad():  # update in place without tracking the update in autograd
            for param in [w, b]:
                param -= learning_rate * param.grad
                param.grad.zero_()
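
A hedged usage sketch: the synthetic data and hyper-parameter values below are assumptions for illustration only, and would need to be defined before running the fragment above.

# synthetic regression data: labels = features @ true_w + true_b + noise
n, p = 1000, 3
true_w = torch.tensor([[2.0], [-3.4], [1.7]])
true_b = 4.2
features = torch.normal(0, 1, size=(n, p))
labels = features @ true_w + true_b + torch.normal(0, 0.01, size=(n, 1))

batch_size, learning_rate, num_epochs = 10, 0.03, 3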

Summary

  • Linear methods linearly combine inputs to obtain predictions
  • Linear regression uses MSE as the loss function
  • Softmax regression is used for multiclass classification
    • Turn predictions into probabilities and use cross-entropy as loss
    • Cross-entropy loss is defined between two probability distributions
  • Mini-batch SGD can learn both models (and later neural networks as well)

