Stanford Practical Machine Learning: Data Cleaning

Last updated: 1 year ago

This chapter covers data cleaning.

Data Errors

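There is no guarantee that data cleaning will improve a model's accuracy or make it perform better. Leaving the errors untreated is worse, though: putting an incorrect model into production is a real problem.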

  • Data often have errors - the mismatch with ground truth (missing, erroneous or extreme values)
  • Good ML models are robust to errors
    • DNN trained with SGD VS Decision trees
  • Consequences:
    • The training may still converge, but slower
    • Accuracy degradation, could be hard to detect
    • Deploying these models may degrade the quality of newly collected data
      • e.g. positive examples generated by poor recommendation / search results

Types of Data Errors

  • Outliers: data values that significantly deviate from other observations
    • Outliers vs. undersampled rare events
  • Rule violations: data values violate integrity constraints such as “Not Null”, “Must be unique”, and “Non negative”
  • Pattern violations: data values violate syntactic and semantic constraints, such as formatting rules or correct spelling
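
The three integrity constraints named above can be checked with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the field names `id`, `name`, and `price` and the sample rows are assumptions made up for illustration.

```python
def find_rule_violations(rows):
    """Return (row_index, rule_name) pairs for rows that break a rule."""
    violations = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # "Not Null": the (hypothetical) name field must be present and non-empty.
        if row.get("name") in (None, ""):
            violations.append((i, "not_null:name"))
        # "Must be unique": an id may appear only once in the table.
        if row.get("id") in seen_ids:
            violations.append((i, "unique:id"))
        seen_ids.add(row.get("id"))
        # "Non negative": a price cannot be below zero.
        price = row.get("price")
        if price is not None and price < 0:
            violations.append((i, "non_negative:price"))
    return violations


# Made-up sample rows; each of the last three breaks exactly one rule.
rows = [
    {"id": 1, "name": "apple", "price": 1.5},
    {"id": 2, "name": None, "price": 2.0},
    {"id": 2, "name": "pear", "price": 3.0},
    {"id": 3, "name": "plum", "price": -4.0},
]
print(find_rule_violations(rows))
```

Returning the violated rule together with the row index keeps the decision about what to do (drop, repair, or review by hand) separate from detection.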

Outlier Detection

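  • Inspect the data and apply some basic processing
    • For example, category names that are similar but spelled differently can simply be handled by hand: either delete them or merge them into one specific category.
    • Estimate the minimum and maximum by hand from the range of the data and discard everything outside that range (treating those values as outliers; a boxplot can help decide the bounds).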
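
One common way to turn the boxplot idea into a rule is the 1.5 × IQR whisker convention. The sketch below assumes that convention and uses only the standard library; the multiplier `k`, the function names, and the sample `prices` list are illustrative choices, not something prescribed by the course.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) whisker bounds using the k * IQR boxplot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Split values into (kept, dropped) according to the whisker bounds."""
    low, high = iqr_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    dropped = [v for v in values if v < low or v > high]
    return kept, dropped


# Made-up sample: one value is far away from the rest.
prices = [10, 12, 11, 13, 12, 11, 10, 300]
kept, dropped = split_outliers(prices)
print("kept:", kept)
print("dropped:", dropped)
```

Before discarding the dropped values, it is worth checking whether they are genuine errors or undersampled rare events, since the rule itself cannot tell the two apart.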

Rule-based Detection

  • Design rules to identify erroneous records
  • Functional dependencies: x -> y means that each value of x determines a unique value of y
    • E.g. zip code -> state, EIN -> company name
  • Denial constraints: specified with more flexible first-order logic
    • Phone number is not empty if vendor has an EIN
    • If two captures of the same animal are indicated by the same tag number, then the first one must be marked as original

Specify a set of rules and use them to detect and handle data errors.
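
Both kinds of rule can be sketched in a few lines. The example below checks the functional dependency zip code -> state and the denial constraint "phone number is not empty if the vendor has an EIN". The column names (`zip`, `state`, `ein`, `phone`) and the sample vendor rows are assumptions invented for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs.

    Returns {lhs_value: set_of_rhs_values} for every lhs value that maps
    to more than one rhs value, i.e. every violation of the dependency.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}


def missing_phone_with_ein(rows):
    """Denial constraint: no row may have an EIN together with an empty phone."""
    return [i for i, row in enumerate(rows) if row.get("ein") and not row.get("phone")]


# Made-up vendor records.
vendors = [
    {"zip": "94305", "state": "CA", "ein": "12-3456789", "phone": "650-555-0100"},
    {"zip": "94305", "state": "NY", "ein": "98-7654321", "phone": ""},
    {"zip": "10001", "state": "NY", "ein": "", "phone": ""},
]
print(fd_violations(vendors, "zip", "state"))
print(missing_phone_with_ein(vendors))
```

A functional dependency violation only says that the conflicting rows disagree; deciding which of them is wrong usually needs a majority vote or an external reference table.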

Pattern-based Detection

Use syntactic and semantic rules to detect and fix errors interactively.

  • Syntactic patterns
    • e.g. Map a column to its most prominent data type and identify the values that do not fit
    • eng, en, english -> English
  • Semantic patterns
    • e.g. Add rules through knowledge graph
      • Values in column “Country” need to have capitals, so a value “Stanford” is invalid
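
The pattern checks above could be sketched roughly as follows. Everything here is a simplification under stated assumptions: type inference is reduced to int / float / str, the alias table `LANGUAGE_ALIASES` is hand-written, and the small `CAPITALS` dictionary is only a stand-in for a real knowledge graph.

```python
from collections import Counter


def infer_type(value):
    """Classify a raw string as 'int', 'float', or 'str'."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "str"


def type_misfits(column):
    """Return (dominant_type, values whose inferred type differs from it)."""
    types = [infer_type(v) for v in column]
    dominant = Counter(types).most_common(1)[0][0]
    return dominant, [v for v, t in zip(column, types) if t != dominant]


# Syntactic pattern: map spelling variants to one canonical form.
LANGUAGE_ALIASES = {"eng": "English", "en": "English", "english": "English"}


def normalize_language(value):
    """Return the canonical name for a known alias, else the value unchanged."""
    return LANGUAGE_ALIASES.get(value.strip().lower(), value)


# Semantic pattern: a tiny stand-in for a knowledge graph of countries and capitals.
CAPITALS = {"France": "Paris", "Japan": "Tokyo", "Canada": "Ottawa"}


def invalid_countries(column):
    """Values with no known capital cannot be countries."""
    return [v for v in column if v not in CAPITALS]


print(type_misfits(["42", "17", "n/a", "8"]))
print([normalize_language(v) for v in ["eng", "EN", "english", "French"]])
print(invalid_countries(["France", "Stanford", "Japan"]))
```

Unknown values such as "French" are passed through unchanged rather than guessed at, which leaves them visible for interactive review.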

Summary

  • Types of data errors: outliers, rule violations, pattern violations
  • Detect errors and fix them by: model distribution, integrity constraints (functional dependencies, denial constraints), syntactic/semantic patterns
  • Multiple tools exist to help with data cleaning
    • Graphic interface for interactive cleaning
    • Automatically detect and fix


Stanford Practical Machine Learning: Data Cleaning
https://alexanderliu-creator.github.io/2023/08/24/stanford-pratical-machine-learning-shu-ju-qing-li/
Author
Alexander Liu
Published on
August 24, 2023
License