Data Transformation

ML algorithms prefer inputs that are well defined, fixed length, well conditioned, and nicely distributed.
The following covers data transformation methods for different data types.

Normalization for Real Value Columns
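
Two common ways to normalize a real-valued column are min-max scaling and standardization (z-score). The sketch below is a minimal illustration, assuming a column stored as a plain Python list; the function names and the sample `prices` column are hypothetical, not taken from the course.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical column of house prices (in thousands of dollars).
prices = [100.0, 200.0, 300.0, 400.0]
scaled = min_max_scale(prices)
zscores = standardize(prices)
print(scaled)
print(zscores)
```

Min-max scaling keeps the shape of the distribution but is sensitive to outliers, while standardization is usually the safer default when the column has no natural bounds.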

Image Transformations

Our earlier web-scraping example would collect about 15 TB of images per year:
5 million houses sold in the US per year, ~20 images/house, ~153 KB per image, ~1041x732 resolution
Common transformations: cropping, downsampling, compression
- Saves storage cost and speeds up loading during training
  - At ~320x224 resolution, 15 TB -> 1.4 TB
- ML models work well on low-resolution images
- Be aware of lossy compression
  - Medium (80%-90%) JPEG compression may lead to a ~1% accuracy drop on ImageNet
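
The storage figures above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 TB = 10^9 KB) and that file size scales linearly with pixel count, which is a rough approximation for compressed images. The variable names are illustrative.

```python
# Numbers quoted in the notes.
HOUSES_PER_YEAR = 5_000_000
IMAGES_PER_HOUSE = 20
KB_PER_IMAGE = 153
ORIGINAL_RES = (1041, 732)
DOWNSAMPLED_RES = (320, 224)

KB_PER_TB = 1e9  # assumption: decimal units

n_images = HOUSES_PER_YEAR * IMAGES_PER_HOUSE
raw_tb = n_images * KB_PER_IMAGE / KB_PER_TB

# Assumption: bytes per image are proportional to the number of pixels.
pixel_ratio = (DOWNSAMPLED_RES[0] * DOWNSAMPLED_RES[1]) / (
    ORIGINAL_RES[0] * ORIGINAL_RES[1]
)
downsampled_tb = raw_tb * pixel_ratio

print(f"raw: {raw_tb:.1f} TB, downsampled: {downsampled_tb:.1f} TB")
```

Under these assumptions the raw total comes to roughly 15 TB and the downsampled total to roughly 1.4 TB, in line with the figures in the notes.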

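Data quality and data size must be traded off against each other. Data size directly determines storage cost, and data quality directly affects the accuracy of the trained model.

Encoding and decoding sometimes involve a trade-off too: stronger compression means less storage, but the cost of decoding and processing goes up as well.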

Text Transformations

Stemming and lemmatization: a word -> a common base form
- E.g. am, are, is -> be; car, cars, car's, cars' -> car
- Example use case: topic modeling
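
To make the mapping above concrete, here is a toy normalizer that combines a tiny lookup table for irregular forms with naive suffix stripping. It is only an illustration of the idea, not a real stemmer or lemmatizer (production systems use algorithms such as Porter stemming or dictionary-based lemmatizers); `IRREGULAR_LEMMAS` and `to_base_form` are hypothetical names.

```python
# Toy lookup for irregular forms; a real lemmatizer uses a full dictionary.
IRREGULAR_LEMMAS = {"am": "be", "are": "be", "is": "be"}

def to_base_form(word: str) -> str:
    """Map a word to a crude base form (illustration only)."""
    # Treat curly apostrophes like straight ones.
    w = word.lower().replace("\u2019", "'")
    if w in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[w]
    # Strip possessive endings: car's -> car, cars' -> cars.
    if w.endswith("'s"):
        w = w[:-2]
    elif w.endswith("s'"):
        w = w[:-1]
    # Strip a simple plural "s", leaving short words and "ss" endings alone.
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    return w

words = ["am", "are", "is", "car", "cars", "car's", "cars'"]
base_forms = [to_base_form(w) for w in words]
print(base_forms)  # ['be', 'be', 'be', 'car', 'car', 'car', 'car']
```
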
Tokenization: text string -> a list of tokens (smallest unit to ML algorithms)
- By word: text.split(' ')
- By char: list(text) (in Python, text.split('') raises ValueError because the separator is empty)
- By subwords:
  - e.g. "a new gpu!" -> "a", "new", "gp", "##u", "!"
  - Custom vocabulary learned from the text corpus (Unigram, WordPiece)
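
The three granularities can be sketched in a few lines of Python. The subword function below uses greedy longest-match-first lookup, the scheme commonly used to apply a WordPiece vocabulary. The vocabulary here is hand-written to reproduce the example above (a real one is learned from a corpus), and the function names and the `[UNK]` token are assumptions for illustration.

```python
import re

def tokenize_words(text):
    """Word-level tokens: split on single spaces."""
    return text.split(" ")

def tokenize_chars(text):
    """Character-level tokens."""
    return list(text)

def wordpiece_word(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize_subwords(text, vocab):
    """Pre-split into words and punctuation, then apply the subword split."""
    tokens = []
    for word in re.findall(r"\w+|[^\w\s]", text):
        tokens.extend(wordpiece_word(word, vocab))
    return tokens

# Hypothetical hand-written vocabulary, not one learned from data.
toy_vocab = {"a", "new", "gp", "##u", "!"}

print(tokenize_words("a new gpu!"))                # ['a', 'new', 'gpu!']
print(tokenize_chars("gpu"))                       # ['g', 'p', 'u']
print(tokenize_subwords("a new gpu!", toy_vocab))  # ['a', 'new', 'gp', '##u', '!']
```

Subword tokenization keeps the vocabulary small while still covering rare words such as "gpu" by composing them from known pieces.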

Vocabulary construction & tokenization

Summary

Transform data into formats preferred by ML algorithms
- Tabular: normalize real value features
- Images: cropping, downsampling, whitening
- Videos: clipping, sampling frames
- Text: stemming, lemmatization, tokenization
Need to balance storage, quality, and loading speed


Stanford Practical Machine Learning: Data Transformation

https://alexanderliu-creator.github.io/2023/08/24/stanford-pratical-machine-learning-shu-ju-bian-huan/

Author

Alexander Liu

Published on

August 24, 2023

License