Flow chart for data acquisition

Discover what data is available

Identify existing datasets
Find benchmark datasets to evaluate a new idea
- E.g. a diverse set of small to medium-sized datasets for a new hyperparameter tuning algorithm
- E.g. large-scale datasets for a very large deep neural network
Collect new data
- E.g. driving videos covering different driving scenarios

MNIST: handwritten digits from US Census Bureau employees and high school students
ImageNet: millions of images from image search engines
AudioSet: YouTube sound clips for sound classification
LibriSpeech: 1000 hours of English speech from audiobooks
Kinetics: YouTube video clips for human action classification
KITTI: traffic scenarios recorded by cameras and other sensors
Amazon Review: customer reviews from Amazon online shopping
SQuAD: question-answer pairs derived from Wikipedia
More at https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research

	Pros	Cons
Academic datasets	Clean, appropriate difficulty	Limited choices, oversimplified, usually small scale
Competition datasets	Closer to real ML applications	Still simplified, and only available for hot topics
Raw Data	Great flexibility	Needs a lot of effort to process

You often need to deal with raw data in industrial settings
Data curation can be a big project involving multiple teams: processing pipelines, storage, legal issues, privacy, …

Raw data is very common in industry!

The data is scattered and each source has its own quirks, so data integration takes a lot of effort and many different strategies.

Combine data from multiple sources into a coherent dataset
Product data is often stored in multiple tables
- E.g. a table for house information, a table for sales, a table for listing agents
Join tables by keys, which are often entity IDs
Key issues: identify IDs, missing rows, redundant columns, value conflicts
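
The join described above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the tables, column names, and the `left_join` helper are all hypothetical, and in practice a library such as pandas or a SQL database would usually do this work. The sketch touches each key issue: `house_id` and `agent_id` are the IDs, house 3 has no sale (a missing row), `city` appears in two tables (a redundant column), and the two `city` values for house 2 disagree (a value conflict).

```python
# Hypothetical toy tables; all names and values are made up for illustration.
houses = [
    {"house_id": 1, "city": "Palo Alto", "sqft": 1500},
    {"house_id": 2, "city": "San Jose", "sqft": 2100},
    {"house_id": 3, "city": "Fremont", "sqft": 1800},
]
sales = [
    {"house_id": 1, "agent_id": 10, "price": 2_000_000, "city": "Palo Alto"},
    {"house_id": 2, "agent_id": 11, "price": 1_500_000, "city": "San Jose, CA"},
    # House 3 has no sale record: a missing row.
]
agents = [
    {"agent_id": 10, "agent_name": "Alice"},
    {"agent_id": 11, "agent_name": "Bob"},
]


def left_join(left, right, key):
    """Left-join two lists of dicts on `key`.

    Left rows without a match are kept as-is. For columns present on both
    sides, the left value wins and any disagreement is recorded as a conflict.
    """
    index = {row[key]: row for row in right}  # assumes `key` is unique in `right`
    joined, conflicts = [], []
    for row in left:
        merged = dict(row)
        match = index.get(row.get(key))  # None if the key is absent or unmatched
        if match is not None:
            for col, value in match.items():
                if col in merged and merged[col] != value:
                    conflicts.append((row[key], col, merged[col], value))
                merged.setdefault(col, value)  # keep the left value if present
        joined.append(merged)
    return joined, conflicts


# Join houses with sales, then attach agent details.
house_sales, conflicts = left_join(houses, sales, "house_id")
full, _ = left_join(house_sales, agents, "agent_id")

# Rows that found no matching sale have no `price` column.
unsold = [row["house_id"] for row in house_sales if "price" not in row]

print(unsold)
print(conflicts)
```

Keeping the left value on a conflict is only one possible policy; the point is that conflicts are surfaced rather than silently overwritten, so they can be resolved deliberately.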