Data Errors

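It is not certain in advance whether data cleaning will improve a model's accuracy. Skipping it is the bigger risk, though: putting a model trained on faulty data into production is a real problem.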

Data often contain errors: mismatches with the ground truth, such as missing, erroneous, or extreme values.
Good ML models are robust to errors
- DNNs trained with SGD vs. decision trees
Consequences:
- Training may still converge, but more slowly
- Accuracy degrades, and the degradation can be hard to detect
- Deploying these models may affect the quality of newly collected data
  - e.g. positive examples generated by poor recommendation / search results

Types of Data Errors

Outliers: data values that deviate significantly from other observations
- outliers vs. undersampled rare events
Rule violations: data values that violate integrity constraints such as “Not Null”, “Must be unique”, and “Non-negative”
Pattern violations: data values that violate syntactic or semantic constraints, such as formatting errors and misspellings

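Inspect the data and do some processing
- For example, category names that are similar but spelled differently can be handled manually: either drop them or merge them into one specific category.
- Estimate the minimum and maximum manually from the data range and discard everything outside that range (treat it as an outlier; a boxplot can help decide).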
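
The range-based filtering described above can be sketched with the usual boxplot rule, where values more than 1.5 times the interquartile range beyond the quartiles count as outliers. This is only a minimal sketch: the `prices` sample, the function names, and the 1.5 multiplier are illustrative assumptions, not values from the course.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) whisker bounds of a standard boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the boxplot whiskers."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Made-up sample with two suspicious entries (5000 and -3).
prices = [210, 225, 198, 240, 230, 205, 5000, 215, -3]
cleaned = drop_outliers(prices)
print(cleaned)
```

On this sample the rule drops 5000 and -3. Whether such values are real errors or undersampled rare events still needs a human check.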

Design rules to identify erroneous records
Functional dependencies: x -> y means each value of x determines a unique value of y
- E.g. zip code -> state, EIN -> company name
Denial constraints: rules specified in first-order logic, which is more flexible than functional dependencies
- E.g. the phone number must not be empty if the vendor has an EIN
- If two captures refer to the same animal, as indicated by the same tag number, then the first one must be marked as original
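
Both kinds of rules can be checked with a few lines of code. The sketch below is one possible way to do it, assuming records are plain dictionaries; the `vendors` rows, the field names, and the function names are made up for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return lhs values mapped to more than one rhs value (violating lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


def missing_phone_with_ein(rows):
    """Denial constraint: no row may have an EIN together with an empty phone."""
    return [row for row in rows if row.get("ein") and not row.get("phone")]


# Made-up vendor records: the second row breaks both rules.
vendors = [
    {"zip": "94305", "state": "CA", "ein": "12-3456789", "phone": "650-555-0100"},
    {"zip": "94305", "state": "NY", "ein": "98-7654321", "phone": ""},
    {"zip": "10001", "state": "NY", "ein": "", "phone": ""},
]

print(fd_violations(vendors, "zip", "state"))
print(missing_phone_with_ein(vendors))
```

A violated functional dependency only says that at least one of the conflicting rows is wrong, not which one, so the flagged rows still need review.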

Specify rules to detect and handle data errors.

Handle the errors interactively, using rules based on semantics and the like.

Syntactic patterns
- e.g. Map a column to its most prominent data type and identify values that do not fit
- eng, en, english -> English
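
As a rough illustration of the two syntactic checks, the sketch below infers the dominant type of a column of raw strings and maps spelling variants to one canonical label. The three-way type classification, the `LANGUAGE_ALIASES` table, and the sample `ages` column are assumptions made for this example.

```python
from collections import Counter


def infer_type(value):
    """Classify a raw string as 'int', 'float', or 'str'."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "str"


def type_mismatches(column):
    """Return the dominant type and the values that do not fit it."""
    types = [infer_type(v) for v in column]
    dominant = Counter(types).most_common(1)[0][0]
    return dominant, [v for v, t in zip(column, types) if t != dominant]


# Hypothetical alias table for the "eng, en, english -> English" case.
LANGUAGE_ALIASES = {"eng": "English", "en": "English", "english": "English"}


def canonical_language(value):
    """Map known spelling variants to one label; leave other values as-is."""
    return LANGUAGE_ALIASES.get(value.strip().lower(), value)


ages = ["34", "27", "n/a", "41", "19"]
dominant, bad = type_mismatches(ages)
print(dominant, bad)
print(canonical_language("ENG "))
```
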
Semantic patterns
- e.g. Add rules through a knowledge graph
  - Values in the “Country” column need to have capitals, so the value “Stanford” is invalid
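
A knowledge-graph rule like this one can be approximated with a lookup table. The sketch below assumes a tiny hand-written `CAPITAL_OF` dictionary as a stand-in for a real knowledge graph; the table and the function name are hypothetical.

```python
# Toy stand-in for a knowledge graph: country -> capital facts.
CAPITAL_OF = {"France": "Paris", "Japan": "Tokyo", "Canada": "Ottawa"}


def invalid_countries(values, capital_of=CAPITAL_OF):
    """Return values with no known capital, which fail the 'Country' rule."""
    return [v for v in values if v not in capital_of]


column = ["France", "Stanford", "Japan"]
flagged = invalid_countries(column)
print(flagged)
```

With such a small table, real countries missing from it would be flagged too, so a rule like this is only as good as the coverage of the knowledge source behind it.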

Types of data errors: outliers, rule violations, pattern violations
Detect errors and fix them using: mode distribution, integrity constraints (functional dependencies, denial constraints), syntactic/semantic patterns
Multiple tools exist to help with data cleaning
- Graphical interfaces for interactive cleaning
- Automatic detection and fixing

Stanford Practical Machine Learning: Data Cleaning

https://alexanderliu-creator.github.io/2023/08/24/stanford-pratical-machine-learning-shu-ju-qing-li/

Author: Alexander Liu

Published: August 24, 2023