Exploratory data analysis for house sales

Import libraries and data.

# !pip install seaborn pandas matplotlib numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython import display
display.set_matplotlib_formats('svg')
# Alternative to set svg for newer versions
# import matplotlib_inline
# matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

data = pd.read_csv('house_sales.zip')

Check the data shape and the first few examples.

data.shape

data.head()

Drop columns in which more than 30% of the values are null, to simplify our EDA. Columns with too much missing information are discarded.

null_sum = data.isnull().sum()
data.columns[null_sum < len(data) * 0.3]  # columns to keep

data.drop(columns=data.columns[null_sum > len(data) * 0.3], inplace=True)
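As a minimal sketch of the thresholding step, the toy frame below computes the fraction of nulls per column and drops the columns above 30%. The columns `a`, `b`, and `c` are hypothetical and not part of the house sales data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'b' is 75% null, 'c' is 25% null.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [np.nan, 2.0, 3.0, 4.0],
})

null_frac = toy.isnull().mean()          # fraction of nulls per column
to_drop = toy.columns[null_frac > 0.3]   # same 30% rule as above
toy_clean = toy.drop(columns=to_drop)

print(null_frac.to_dict())
print(list(toy_clean.columns))
```

Using `isnull().mean()` gives the fraction directly, which is equivalent to comparing `isnull().sum()` against `len(data) * 0.3`.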

Next, we check the data types.

data.dtypes

Convert currency from string format such as $1,000,000 to float.

That is, the monetary columns go from string to float.

currency = ['Sold Price', 'Listed Price', 'Tax assessed value', 'Annual tax amount']
for c in currency:
    data[c] = data[c].replace(
        r'[$,-]', '', regex=True).replace(
        r'^\s*$', np.nan, regex=True).astype(float)
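To see what the two chained `replace` calls do, here is a small sketch on a hypothetical Series of price strings (the values are made up for illustration). It assumes the same patterns as the loop above.

```python
import numpy as np
import pandas as pd

# Hypothetical price strings, including an empty one and a bare dash.
prices = pd.Series(['$1,000,000', '$250,500', '', '-'])

cleaned = (prices
           .replace(r'[$,-]', '', regex=True)       # strip '$', ',' and '-'
           .replace(r'^\s*$', np.nan, regex=True)   # empty strings become NaN
           .astype(float))

print(cleaned.tolist())
```

Note that the character class also removes `-`, so a bare dash ends up as a missing value, and any negative sign would be dropped as well.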

Likewise, convert areas from string formats such as 1000 sqft and 1 Acres to float, expressed in square feet.

areas = ['Total interior livable area', 'Lot size']
for c in areas:
    acres = data[c].str.contains('Acres') == True
    col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b','', regex=True).astype(float)
    col[acres] *= 43560
    data[c] = col
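The unit handling can be checked on a few hand-made strings. This is only a sketch: the input values are hypothetical, and `SQFT_PER_ACRE` is a name introduced here for the constant 43560 used above.

```python
import pandas as pd

SQFT_PER_ACRE = 43560  # 1 acre = 43,560 square feet

# Hypothetical area strings in the two formats mentioned above.
raw = pd.Series(['1,000 sqft', '1 Acres', '0.5 Acres', '850 sqft'])

is_acres = raw.str.contains('Acres')
area = raw.replace(r'\b sqft\b|\b Acres\b|\b,\b', '', regex=True).astype(float)
area[is_acres] *= SQFT_PER_ACRE   # bring acre values onto the sqft scale

print(area.tolist())
```

The mask is computed before the unit text is stripped, since afterwards the two kinds of rows can no longer be told apart.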

Now we can check the values of the numerical columns. You can see that the min and max values of several columns do not make sense.

data.describe()

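Filter out houses whose area is too small or too large, to simplify the visualization later. Note that the code below filters on `areas[1]`, which is 'Lot size'; use `areas[0]` to filter on 'Total interior livable area' instead.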

abnormal = (data[areas[1]] < 10) | (data[areas[1]] > 1e4)
data = data[~abnormal]
sum(abnormal)
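A small sketch of how this mask behaves, on hypothetical lot sizes. One detail worth noticing is that comparisons with NaN are False, so rows with a missing area are not flagged and stay in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical lot sizes in sqft; 5 and 50000 fall outside [10, 1e4].
toy = pd.DataFrame({'Lot size': [5.0, 1200.0, 50000.0, 8000.0, np.nan]})

abnormal = (toy['Lot size'] < 10) | (toy['Lot size'] > 1e4)
kept = toy[~abnormal]

print(int(abnormal.sum()), len(kept))
```

If rows with missing areas should be removed too, an extra `toy['Lot size'].isnull()` term could be added to the mask.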

Let’s check the histogram of the 'Sold Price', which is the target we want to predict.

ax = sns.histplot(np.log10(data['Sold Price']))
ax.set_xlim([3, 8])
ax.set_xticks(range(3, 9))
ax.set_xticklabels(['%.0e'%a for a in 10**ax.get_xticks()]);
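The plot bins prices on a log10 scale and then relabels the ticks with the prices they stand for. The same idea can be sketched without plotting, using `np.histogram` on a handful of hypothetical prices:

```python
import numpy as np

# Hypothetical sale prices spanning several orders of magnitude.
prices = np.array([5e3, 8e4, 3e5, 4.5e5, 6e5, 9e5, 1.2e6, 2e7])

# Bin edges at 10^3, 10^4, ..., 10^8, matching the x-axis range above.
counts, edges = np.histogram(np.log10(prices), bins=np.arange(3, 9))

# Label each edge with the price it stands for, as the tick labels do.
labels = ['%.0e' % 10.0**e for e in edges]

print(counts.tolist())
print(labels)
```

On a linear axis the single 2e7 sale would squeeze every other price into the first bin; on the log scale each decade gets its own bin.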

Houses come in different types. Here are the most common ones:

data['Type'].value_counts()[0:20]

Price density for different house types.

types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
sns.displot(pd.DataFrame({'Sold Price':np.log10(data[types]['Sold Price']),
                          'Type':data[types]['Type']}),
            x='Sold Price', hue='Type', kind='kde');

Another important measurement is the sale price per living sqft. Let’s check the differences between different house types.


data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types], fliersize=0)
ax.set_ylim([0, 2000]);

A box plot is a good way to compare different distributions.
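The numbers a box plot draws can also be computed directly. The sketch below uses hypothetical sales with the column names from above and takes the quartiles per type: the 25% and 75% values are the box edges and the 50% value is the median line.

```python
import pandas as pd

# Hypothetical sales; column names follow the ones used above.
toy = pd.DataFrame({
    'Type': ['Condo'] * 3 + ['SingleFamily'] * 3,
    'Sold Price': [400_000, 600_000, 800_000, 900_000, 1_200_000, 1_500_000],
    'Total interior livable area': [1000, 1000, 1000, 2000, 2000, 2000],
})
toy['Price per living sqft'] = toy['Sold Price'] / toy['Total interior livable area']

# Quartiles per type: one row per type, one column per quantile.
quartiles = (toy.groupby('Type')['Price per living sqft']
                .quantile([0.25, 0.5, 0.75])
                .unstack())

print(quartiles)
```

In this toy data both types share the same median, but the condo prices are more spread out, which would show up as a taller box.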

We know that location affects the price. Let’s check the price per living sqft for the 20 most frequent zip codes.

d = data[data['Zip'].isin(data['Zip'].value_counts()[:20].keys())]
ax = sns.boxplot(x='Zip', y='Price per living sqft', data=d, fliersize=0)
ax.set_ylim([0, 2000])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
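The selection of the most frequent zip codes relies on `value_counts` returning counts in descending order. A minimal sketch with hypothetical zip codes, keeping the top 2 instead of the top 20:

```python
import pandas as pd

# Hypothetical zip codes; '94301' and '95014' are the two most frequent.
toy = pd.DataFrame({
    'Zip': ['94301', '94301', '94301', '95014', '95014', '94022'],
    'Price per living sqft': [1500.0, 1600.0, 1400.0, 1100.0, 1000.0, 1300.0],
})

top_zips = toy['Zip'].value_counts()[:2].keys()   # top 2 here, top 20 above
subset = toy[toy['Zip'].isin(top_zips)]

print(list(top_zips))
print(len(subset))
```

Restricting the plot to frequent zip codes keeps each box backed by enough sales to be meaningful.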

Finally, we visualize the correlation matrix of several columns.

_, ax = plt.subplots(figsize=(6,6))
columns = ['Sold Price', 'Listed Price', 'Annual tax amount', 'Price per living sqft', 'Elementary School Score', 'High School Score']
sns.heatmap(data[columns].corr(),annot=True,cmap='RdYlGn', ax=ax);
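The heatmap colors the values returned by `DataFrame.corr()`, which computes pairwise Pearson correlations by default. As a sketch on deliberately extreme, hypothetical values:

```python
import pandas as pd

# Hypothetical values: 'Listed Price' is exactly twice 'Sold Price',
# while 'High School Score' falls as the price rises.
toy = pd.DataFrame({
    'Sold Price': [1.0, 2.0, 3.0, 4.0],
    'Listed Price': [2.0, 4.0, 6.0, 8.0],
    'High School Score': [8.0, 6.0, 4.0, 2.0],
})

corr = toy.corr()   # Pearson correlation by default

print(corr.round(2))
```

A perfectly linear increasing relation gives 1, a perfectly linear decreasing one gives -1, and the diagonal is always 1. Real columns land somewhere in between.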

Summary

This notebook demonstrates the basic techniques for EDA, including
- Understanding column data types, values, and distributions
- Understanding the interactions between columns

We only explored a small aspect of the data. You are welcome to dive deeper into the details.


Stanford Practical Machine Learning - Exploratory Data Analysis

https://alexanderliu-creator.github.io/2023/08/24/stanford-pratical-machine-learning-tan-suo-xing-shu-ju-fen-xi/

Author: Alexander Liu

Published: August 24, 2023
