Stanford Practical Machine Learning: Data Cleaning
This chapter introduces data cleaning.
Data Errors
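There is no guarantee that cleaning the data will improve model accuracy. Leaving the errors in is worse, though: putting a model trained on incorrect data into production is a real problem.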
- Data often have errors: mismatches with the ground truth (missing, erroneous, or extreme values)
- Good ML models are robust to errors
  - DNN trained with SGD vs. decision trees
- Consequences:
  - The training may still converge, but more slowly
  - Accuracy degradation, which can be hard to detect
  - Deploying these models may impact the quality of newly collected data
    - e.g. positive examples generated by poor recommendation / search results
Types of Data Errors
- Outliers: data values that deviate significantly from other observations
  - Outliers vs. undersampled rare events
- Rule violations: data values that violate integrity constraints such as “Not Null”, “Must be unique”, and “Non-negative”
- Pattern violations: data values that violate syntactic or semantic constraints, such as inconsistent formatting or misspellings
Outlier Detection
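- Inspect the data and do some basic processing
- For example, when category names are similar but spelled differently, handle them manually: either drop them or merge them into one specific category.
- Manually estimate a minimum and maximum from the range of the data and discard everything outside that range (treat it as an outlier; a boxplot can help decide the bounds)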
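The boxplot-based cutoff above can be sketched in a few lines. This is a minimal illustration, not the course's own code: it assumes the conventional boxplot whisker rule (1.5 times the interquartile range beyond the quartiles), and the function names and sample prices are made up for the example.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) boxplot whisker bounds for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # k=1.5 is the usual boxplot convention (an assumption, tune per dataset).
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(values, k=1.5):
    """Split values into (kept, outliers) using the IQR rule."""
    lower, upper = iqr_bounds(values, k)
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers

# Hypothetical house prices (in thousands); 9500 looks like a data-entry error.
prices = [310, 325, 298, 340, 315, 305, 9500, 330]
kept, outliers = split_outliers(prices)
print(outliers)
```

Whether a flagged value is a genuine error or an undersampled rare event still needs a human look, so the sketch returns the outliers instead of silently dropping them.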
Rule-based Detection
- Design rules to identify erroneous records
  - Functional dependencies: x -> y means a value of x determines a unique value of y
    - E.g. zip code -> state, EIN -> company name
  - Denial constraints: specified with the more flexible first-order logic
    - Phone number is not empty if vendor has an EIN
    - If two captures refer to the same animal, as indicated by the same tag number, then the first one must be marked as original
Specify a set of rules and use them to find and handle data errors.
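As a rough sketch of how the two kinds of rules could be checked in code: the first function tests a functional dependency such as zip code -> state, the second tests the "phone number is not empty if vendor has an EIN" constraint. The record layout, field names, and sample vendors are all hypothetical.

```python
from collections import defaultdict

def fd_violations(records, lhs, rhs):
    """Return {lhs_value: set_of_rhs_values} for lhs values mapped to more than one rhs value."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[lhs]].add(rec[rhs])
    # A functional dependency lhs -> rhs holds only if each lhs value has exactly one rhs value.
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

def missing_phone_with_ein(records):
    """Denial constraint: no record may have an EIN together with an empty phone number."""
    return [rec for rec in records if rec.get("ein") and not rec.get("phone")]

# Hypothetical vendor records; field names are illustrative only.
vendors = [
    {"name": "Acme", "zip": "94305", "state": "CA", "ein": "12-3456789", "phone": "650-555-0100"},
    {"name": "Bolt", "zip": "94305", "state": "CO", "ein": "98-7654321", "phone": ""},
    {"name": "Cask", "zip": "10001", "state": "NY", "ein": "", "phone": ""},
]

print(fd_violations(vendors, "zip", "state"))
print([rec["name"] for rec in missing_phone_with_ein(vendors)])
```

Note that the functional dependency check only reports that zip code 94305 maps to two states; it cannot tell which record is the wrong one, so the fix still needs another source of truth.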
Pattern-based Detection
Handle these interactively, using rules based on syntax, semantics, and the like.
- Syntactic patterns
  - e.g. Map a column to the most prominent data type and identify values that do not fit
  - eng, en, english -> English
- Semantic patterns
  - e.g. Add rules through a knowledge graph
  - Values in column “Country” need to have capitals, so a value “Stanford” is invalid
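Both kinds of pattern checks can be illustrated with a small sketch. It assumes raw string columns, uses a hand-written alias table for the eng / en / english case, and replaces the knowledge graph with a tiny hypothetical country-to-capital dictionary; none of these names come from the course.

```python
from collections import Counter

def infer_type(value):
    """Classify a raw string as 'int', 'float', or 'str'."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "str"

def type_misfits(column):
    """Return (dominant_type, values whose inferred type differs from it)."""
    types = [infer_type(v) for v in column]
    dominant = Counter(types).most_common(1)[0][0]
    return dominant, [v for v, t in zip(column, types) if t != dominant]

# Hypothetical alias table for a "language" column.
LANGUAGE_ALIASES = {"eng": "English", "en": "English", "english": "English"}

def normalize_language(value):
    """Map known spelling variants to one canonical label; leave other values unchanged."""
    return LANGUAGE_ALIASES.get(value.strip().lower(), value)

# Hypothetical stand-in for a knowledge graph lookup: country -> capital.
CAPITALS = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi"}

def invalid_countries(column):
    """Flag values in a "Country" column that have no known capital."""
    return [v for v in column if v not in CAPITALS]

print(type_misfits(["12", "7", "n/a", "40"]))
print([normalize_language(v) for v in ["eng", "en", "English", "French"]])
print(invalid_countries(["France", "Stanford", "Japan"]))
```

In practice the alias table and the lookup would be built up interactively, which is why tools with a graphical interface are a good fit for this step.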
Summary
- Types of data errors: outliers, rule violations, pattern violations
- Detect errors and fix them by: model distribution, integrity constraints (functional dependencies, denial constraints), syntactic/semantic patterns
- Multiple tools exist to help with data cleaning
  - Graphical interfaces for interactive cleaning
  - Automatic detection and fixing