Stanford Practical Machine Learning - Data Acquisition

Last updated: 7 months ago

This section covers data acquisition.

Flow chart for data acquisition

  • Start a ML application
    • Have enough data?
      • Yes -> Will cover later
      • No -> Are there external datasets?
        • Yes -> Discover or integrate data
        • No -> Have data generation methods?
          • Yes -> Generate Data
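
The decision flow above can be sketched as a small function. This is only an illustration: the function name, the parameter names, and the returned strings are made up for this note and do not come from the slides. The chart does not show what happens when no generation method exists, so that case returns None rather than guessing.

```python
from typing import Optional


def next_step(have_enough_data: bool,
              external_datasets_exist: bool,
              can_generate_data: bool) -> Optional[str]:
    """Return the action the flow chart suggests for a new ML application."""
    if have_enough_data:
        return "proceed (covered later)"
    if external_datasets_exist:
        return "discover or integrate data"
    if can_generate_data:
        return "generate data"
    # The chart shows no branch for this case, so no action is suggested.
    return None


# Hypothetical situation: no data in house, but public datasets exist.
print(next_step(False, True, False))
```

The order of the checks mirrors the chart: existing data is preferred over external data, and generating data is the last resort.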


Discover what data is available

  • Identify existing datasets
  • Find benchmark datasets to evaluate a new idea
    • E.g. A diverse set of small to medium datasets for a new hyperparameter tuning algorithm
    • E.g. Large scale datasets for a very big deep neural network
  • Collect new data
    • E.g. driving videos covering different driving scenarios

Popular ML datasets

  • MNIST: digits written by employees of the US Census Bureau
  • ImageNet: millions of images from image search engines
  • AudioSet: YouTube sound clips for sound classification
  • LibriSpeech: 1000 hours of English speech from audiobooks
  • Kinetics: YouTube video clips for human action classification
  • KITTI: traffic scenarios recorded by cameras and other sensors
  • Amazon Review: customer reviews from Amazon online shopping
  • SQuAD: question-answer pairs derived from Wikipedia
  • More at https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

Where to Find Datasets

  • Papers with Code Datasets: academic datasets with leaderboards
  • Kaggle Datasets: ML datasets uploaded by data scientists
  • Google Dataset Search: search for datasets on the Web
  • Datasets bundled with toolkits: TensorFlow, Hugging Face
  • Various conference/company ML competitions
  • Open Data on AWS: 100+ large-scale raw datasets
  • Data lakes in your own organization

Datasets Comparison

Type                 | Pros                           | Cons
Academic datasets    | Clean, proper difficulty       | Limited choices, too simplified, usually small scale
Competition datasets | Closer to real ML applications | Still simplified, and only available for hot topics
Raw data             | Great flexibility              | Needs a lot of effort to process

  • You often need to deal with raw data in industrial settings
  • Data curation can be a big project involving multiple teams: processing pipeline, storage, legal issues, privacy, …

In industry, raw data is what you mostly end up working with!

Data Integration

Data is scattered across sources, each with its own quirks, so integrating it takes a lot of effort and many different strategies.

  • Combine data from multiple sources into a coherent dataset
  • Product data is often stored in multiple tables
    • E.g. a table for house information, a table for sales, a table for listing agents
  • Join tables by keys, which are often entity IDs
  • Key issues: identify IDs, missing rows, redundant columns, value conflicts
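
A minimal sketch of joining tables by entity IDs, using an in-memory SQLite database from the Python standard library. The table names, column names, and rows are all invented for illustration and follow the house / sales / listing agent example above. A LEFT JOIN keeps every house even when no matching sale exists, which makes one of the key issues (missing rows) visible as NULL values instead of silently dropping the house.

```python
import sqlite3

# Hypothetical tables: houses, sales, and listing agents, linked by IDs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (house_id INTEGER PRIMARY KEY, city TEXT, sqft INTEGER);
    CREATE TABLE sales  (house_id INTEGER, agent_id INTEGER, price INTEGER);
    CREATE TABLE agents (agent_id INTEGER PRIMARY KEY, name TEXT);

    INSERT INTO houses VALUES (1, 'Palo Alto', 1500), (2, 'San Jose', 2000),
                              (3, 'Fremont', 1800);
    -- House 3 has no sale row, to show the missing-rows issue.
    INSERT INTO sales  VALUES (1, 10, 2000000), (2, 11, 1500000);
    INSERT INTO agents VALUES (10, 'Alice'), (11, 'Bob');
""")

# Join on the entity IDs and select only the columns we need, which
# avoids carrying the redundant ID columns from every table.
rows = conn.execute("""
    SELECT h.house_id, h.city, h.sqft, s.price, a.name
    FROM houses AS h
    LEFT JOIN sales  AS s ON s.house_id = h.house_id
    LEFT JOIN agents AS a ON a.agent_id = s.agent_id
    ORDER BY h.house_id
""").fetchall()

# Houses whose price is NULL had no matching row in the sales table.
missing_sales = [house_id for house_id, _, _, price, _ in rows if price is None]

for row in rows:
    print(row)
print("houses without a sale:", missing_sales)
```

Whether to drop, keep, or impute such unmatched rows depends on the application; the sketch only surfaces them.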

Generate Synthetic Data

If we cannot find a dataset, we can generate one!

  • Use GANs
  • Data augmentations
    • Image augmentation
    • Back Translation
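
A minimal sketch of image augmentation, assuming an image is just a list of pixel rows. It uses two common transformations, a random horizontal flip and a random crop; the function names and the tiny 4x4 "image" are made up for illustration, and a real pipeline would normally rely on an image library instead. The idea is that each augmented copy keeps the original label, so one labeled image yields several training samples.

```python
import random


def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position inside the image."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, rng):
    """Flip with probability 0.5, then take a random crop."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, crop_h, crop_w, rng)


# Hypothetical 4x4 grayscale image.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]

rng = random.Random(0)  # seeded so that runs are repeatable
samples = [augment(image, 3, 3, rng) for _ in range(4)]
for sample in samples:
    print(sample)
```

Back translation follows the same pattern for text (translate a sentence to another language and back to get a paraphrase), but it needs a translation model and is not sketched here.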


https://alexanderliu-creator.github.io/2023/08/23/stanford-pratical-machine-learning-shu-ju-huo-qu/
Author: Alexander Liu
Published: August 23, 2023