Stanford Practical Machine Learning: Exploratory Data Analysis
Last updated: 1 year ago
This chapter introduces exploratory data analysis (EDA).
Exploratory data analysis for house sales
- Import the libraries and load the data
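A minimal sketch of this step, assuming the data is a CSV file read with pandas. The file name is not given in the text, so a small in-memory CSV with hypothetical columns stands in for it here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the house sales file (the real file name is not given).
csv_text = """Id,Sold Price,Type,Zip
1,"$1,000,000",SingleFamily,94301
2,"$650,000",Condo,95112
3,"$820,000",Townhouse,94085
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))
print(data)
```

With a real file, the same call would take the path instead of the `StringIO` object.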
- Check the data shape and the first few examples
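As a rough illustration, `shape` and `head` cover this check. The tiny DataFrame below is made up for the example; its columns are assumptions, not the real schema.

```python
import pandas as pd

# Made-up sample rows standing in for the real data.
data = pd.DataFrame({
    'Sold Price': ['$1,000,000', '$650,000', '$820,000'],
    'Type': ['SingleFamily', 'Condo', 'Townhouse'],
    'Zip': [94301, 95112, 94085],
})

# shape is a (rows, columns) tuple; head(n) returns the first n rows.
n_rows, n_cols = data.shape
first_rows = data.head(2)
print(f'{n_rows} rows x {n_cols} columns')
print(first_rows)
```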
- Drop columns in which at least 30% of the values are null, to simplify our EDA
In other words, discard the columns that are missing too much information.
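One way to express the 30% rule is sketched below: compute the fraction of nulls per column and keep only the columns under the threshold. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented sample: 'Lot' is 20% null, 'Garage spaces' is 40% null.
data = pd.DataFrame({
    'Sold Price': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Lot': [1000.0, np.nan, 3000.0, 4000.0, 5000.0],
    'Garage spaces': [np.nan, np.nan, 2.0, 1.0, 1.0],
})

# isnull().mean() gives the fraction of null values in each column.
null_fraction = data.isnull().mean()

# Keep columns with fewer than 30% nulls; the rest are dropped.
data = data.loc[:, null_fraction < 0.3]
print(data.columns.tolist())
```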
- Check the data types
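A small sketch of the type check, on made-up columns. Listing the non-numeric columns is an extra convenience added here to show which ones still need conversion.

```python
import pandas as pd

# Made-up sample: prices are still strings at this point.
data = pd.DataFrame({
    'Sold Price': ['$1,000,000', '$650,000'],
    'Bedrooms': [3, 2],
    'Year built': [1969.0, 2004.0],
})

# dtypes lists the type pandas inferred for each column.
print(data.dtypes)

# Columns that are not numeric are candidates for cleaning.
text_columns = data.select_dtypes(exclude='number').columns.tolist()
print(text_columns)
```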
- Convert currency from string format such as `$1,000,000` to float.
That is, turn the money columns from string into float.
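The conversion could look roughly like this: strip the `$` and `,` characters, then parse what remains as a number. Only `'Sold Price'` is named in the text; applying the same steps to other money columns is an assumption.

```python
import pandas as pd

# Sample values in the string format described above, plus one missing entry.
data = pd.DataFrame({'Sold Price': ['$1,000,000', '$650,000', None]})

# Remove the dollar sign and thousands separators.
cleaned = data['Sold Price'].str.replace(r'[$,]', '', regex=True)

# Parse to float; anything unparseable (including missing values) becomes NaN.
data['Sold Price'] = pd.to_numeric(cleaned, errors='coerce').astype(float)
print(data)
```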
- Also convert areas from string format such as `1000 sqft` and `1 Acres` to float.
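A sketch of the area conversion, assuming everything should end up in square feet (1 acre = 43,560 sqft). The column name `'Lot'` is a placeholder chosen for the example.

```python
import pandas as pd

SQFT_PER_ACRE = 43560

# Sample values in the two string formats described above.
data = pd.DataFrame({'Lot': ['1000 sqft', '1 Acres', '2,500 sqft', '0.5 Acres']})

# Remember which rows are in acres before the unit text is removed.
is_acres = data['Lot'].str.contains('Acres', na=False)

# Keep only digits and the decimal point, then parse to float.
numbers = data['Lot'].str.replace(r'[^0-9.]', '', regex=True)
lot_sqft = pd.to_numeric(numbers, errors='coerce').astype(float)

# Convert the acre rows to square feet; leave the sqft rows unchanged.
data['Lot'] = lot_sqft.where(~is_acres, lot_sqft * SQFT_PER_ACRE)
print(data)
```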
- Now we can check the values of the numerical columns. The min and max values of several columns do not make sense.
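`describe` is one way to get these summary statistics. The sample below is invented, with deliberately implausible extremes to mimic the problem described; the column names are assumptions.

```python
import pandas as pd

# Invented sample with implausible minimum and maximum values.
data = pd.DataFrame({
    'Sold Price': [650000.0, 820000.0, 1000000.0, 90000000.0],
    'Total interior livable area': [1.0, 1200.0, 1800.0, 9999999.0],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = data.describe()

# The min and max rows are the quickest place to spot suspicious values.
extremes = summary.loc[['min', 'max']]
print(extremes)
```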
- Filter out houses whose living areas are too small or too large, to simplify the visualization later
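The filter can be written as a boolean mask. The thresholds below (under 10 sqft or over 10,000 sqft) and the column name are assumptions chosen for illustration; the text does not state the cutoffs.

```python
import pandas as pd

# Invented sample with one tiny and one huge living area.
data = pd.DataFrame({
    'Living area': [1.0, 1200.0, 1800.0, 250000.0],
    'Sold Price': [650000.0, 820000.0, 1000000.0, 900000.0],
})

# Assumed cutoffs: flag areas below 10 sqft or above 10,000 sqft.
abnormal = (data['Living area'] < 10) | (data['Living area'] > 1e4)
print(f'dropping {abnormal.sum()} of {len(data)} rows')

# Keep only the rows that were not flagged.
data = data[~abnormal]
```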
- Let’s check the histogram of the `'Sold Price'`, which is the target we want to predict.
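The sketch below computes the counts behind such a histogram with NumPy rather than drawing it. Taking `log10` of the price first is an assumption, a common choice because house prices tend to be right-skewed; the prices and bin edges are invented.

```python
import numpy as np
import pandas as pd

# Invented prices spanning roughly two orders of magnitude.
data = pd.DataFrame({
    'Sold Price': [1.5e5, 2e5, 5e5, 8e5, 1.2e6, 2e6, 3e6, 9e6],
})

# Work on a log10 scale so the long right tail does not dominate.
log_price = np.log10(data['Sold Price'])

# Count how many prices fall into each half-decade bin.
counts, bin_edges = np.histogram(log_price, bins=[5, 5.5, 6, 6.5, 7])
print(counts)

# With matplotlib installed, log_price.plot.hist(bins=bin_edges) would draw these bins.
```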
- Houses come in different types. Here are the most common ones:
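`value_counts` gives the types ordered by frequency. The column name `'Type'` and the type labels are assumptions made for this example.

```python
import pandas as pd

# Invented sample of house types.
data = pd.DataFrame({
    'Type': ['SingleFamily', 'Condo', 'SingleFamily', 'Townhouse',
             'SingleFamily', 'Condo', 'MultiFamily'],
})

# value_counts sorts by count in descending order; head keeps the top entries.
top_types = data['Type'].value_counts().head(2)
print(top_types)
```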
- Price density for different house types.
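As a stand-in for a smooth density plot, this sketch uses a normalized histogram per house type, which is a coarser estimate than a kernel density but needs only NumPy and pandas. The data, bin edges, and column names are invented.

```python
import numpy as np
import pandas as pd

# Invented prices for two house types.
data = pd.DataFrame({
    'Type': ['SingleFamily'] * 4 + ['Condo'] * 4,
    'Sold Price': [8e5, 1.2e6, 2e6, 3e6, 3e5, 5e5, 6e5, 1.5e6],
})
data['Log Price'] = np.log10(data['Sold Price'])

# Shared bins so the types can be compared directly.
bin_edges = np.array([5.0, 5.5, 6.0, 6.5, 7.0])

# density=True scales each histogram so that it integrates to 1.
densities = {}
for house_type, group in data.groupby('Type'):
    density, _ = np.histogram(group['Log Price'], bins=bin_edges, density=True)
    densities[house_type] = density

# One column per type, indexed by the left edge of each bin.
density_table = pd.DataFrame(densities, index=bin_edges[:-1])
print(density_table)
```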
- Another important measurement is the sale price per living sqft. Let’s check the differences between different house types.
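A sketch of the price-per-sqft comparison. It computes the quartiles that a boxplot would draw for each type; the data and the column names other than `'Sold Price'` are assumptions.

```python
import pandas as pd

# Invented sample: three sales per type.
data = pd.DataFrame({
    'Type': ['Condo', 'Condo', 'Condo',
             'SingleFamily', 'SingleFamily', 'SingleFamily'],
    'Sold Price': [500000.0, 600000.0, 700000.0,
                   800000.0, 1000000.0, 1200000.0],
    'Living area': [1000.0, 1000.0, 1000.0, 2000.0, 2000.0, 2000.0],
})

# Sale price per square foot of living area.
data['Price per living sqft'] = data['Sold Price'] / data['Living area']

# Quartiles per type: the box edges and median line of a boxplot.
box_stats = (
    data.groupby('Type')['Price per living sqft']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(box_stats)

# With matplotlib installed, data.boxplot(column='Price per living sqft', by='Type')
# would draw the corresponding boxes.
```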
A boxplot is a good way to compare different distributions.
- We know that location affects the price. Let’s check the price for the top 20 zip codes.
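The sketch below assumes "top" means the zip codes with the most sales, which the text does not spell out, and compares the median sold price across them. It uses the top 2 on a tiny invented sample; with real data `top_n` would be 20.

```python
import pandas as pd

# Invented sample: three zip codes with 3, 2 and 1 sales.
data = pd.DataFrame({
    'Zip': ['94301', '94301', '94301', '95112', '95112', '94085'],
    'Sold Price': [2e6, 3e6, 4e6, 6e5, 8e5, 1e6],
})

top_n = 2  # the text uses 20

# Assumption: rank zip codes by number of sales.
top_zips = data['Zip'].value_counts().head(top_n).index

# Restrict to those zip codes and compare the median price in each.
subset = data[data['Zip'].isin(top_zips)]
median_by_zip = (
    subset.groupby('Zip')['Sold Price'].median().sort_values(ascending=False)
)
print(median_by_zip)
```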
- Visualize the correlation matrix of several columns.
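The matrix itself can be computed with `DataFrame.corr`, as sketched here on invented numbers. Which columns the original used is not stated, so the ones below are placeholders; a heatmap would then be drawn from the resulting table.

```python
import pandas as pd

# Invented numeric columns (prices in thousands, areas in sqft).
data = pd.DataFrame({
    'Sold Price': [500.0, 700.0, 900.0, 1200.0],
    'Listed Price': [520.0, 690.0, 950.0, 1150.0],
    'Living area': [900.0, 1400.0, 1600.0, 2500.0],
    'Lot': [4000.0, 3500.0, 6000.0, 5000.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = data.corr()
print(corr.round(2))
```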
Summary
- This notebook demonstrates the basic techniques for EDA, including
  - Understanding column data types, values, and distributions
  - Understanding the interactions between columns
- We only explored a small aspect of the data. You are welcome to dive deep into more details.
References
Stanford Practical Machine Learning: Exploratory Data Analysis
https://alexanderliu-creator.github.io/2023/08/24/stanford-pratical-machine-learning-tan-suo-xing-shu-ju-fen-xi/