Stanford Practical Machine Learning - Web Scraping


This chapter mainly covers web data scraping.

Web Scraping

  • The goal is to extract data from websites
    • Noisy, weak labels, can be spammy
    • Available at scale
    • E.g. price comparison/tracking websites
  • Many ML datasets are obtained by web scraping
    • E.g. ImageNet, Kinetics
  • Web crawling vs. scraping
    • Crawling: indexing whole pages on the Internet
    • Scraping: extracting particular data from the web pages of a website

Tools

  • “curl” often doesn’t work
    • Website owners use various ways to stop bots
  • Use headless browser: a web browser without a GUI
  • You need a lot of fresh IPs, which are easy to get through public clouds
    • Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, GCP 0.25%

You can use Selenium to do the scraping!

from selenium import webdriver

# Launch Chrome in headless mode (no GUI)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome = webdriver.Chrome(options=chrome_options)

chrome.get(url)            # load the page; url is defined elsewhere
html = chrome.page_source  # the rendered HTML for later parsing

Case Study

Crawl individual pages

  • Get the house IDs from the index pages
from bs4 import BeautifulSoup

# Parse a saved index page and pull the house IDs out of the listing links
page = BeautifulSoup(open(html_path, 'r'), 'html.parser')
links = [a['href'] for a in page.find_all('a', 'list-card-link')]
ids = [l.split('/')[-2].split('_')[0] for l in links]

Concatenate each ID into a URL, and you can reach every listing page and its information, as sketched below.
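A minimal sketch of that concatenation; the exact homedetails path pattern here is an assumption in the style of Zillow listing URLs, not something given in the slides:

# Hypothetical URL pattern; verify against the actual site
urls = [f'https://www.zillow.com/homedetails/{i}_zpid/' for i in ids]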

Extract data

  • Identify the HTML elements through Inspect

Use Inspect in Chrome Developer Tools to look at the page structure, then extract the content we need from the page.

# The chip element holds "Sold: $xxx" and "Sold on xx/xx" spans
result = {}
sold_items = [a.text for a in page.find(
    'div', 'ds-home-details-chip').find('p').find_all('span')]

for item in sold_items:
    if 'Sold:' in item:
        result['Sold Price'] = item.split(' ')[1]
    if 'Sold on' in item:
        result['Sold On'] = item.split(' ')[-1]
  • Repeat the previous process to extract the other fields, as sketched below
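A sketch of that repetition for one more field; the 'ds-summary-row' class name is a hypothetical example of what you would find via Inspect, not one taken from the slides:

# Hypothetical class name; locate the real one with Inspect
summary = page.find('div', 'ds-summary-row')
if summary is not None:
    result['Listed Price'] = summary.find('span').text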

Cost

  • Use AWS EC2 t3.small (2GB memory, 2 vCPUs, $0.02 per hour)
    • 2GB is necessary as the browser needs a lot of memory; CPU and bandwidth are usually not an issue
    • Can use spot instances to reduce the price
  • The cost to crawl 1M houses is $16.6 (a back-of-envelope calculation follows the list)
    • The speed is about 3s per page.
    • 8.3 hours if using 100 instances.
    • Extra costs include storage and restarting instances when an IP is banned.
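The headline figure follows from a quick back-of-envelope calculation:

pages = 1_000_000
sec_per_page = 3
rate = 0.02                                   # $/hour for a t3.small
instance_hours = pages * sec_per_page / 3600  # ≈ 833 instance-hours
total_cost = instance_hours * rate            # ≈ $16.7
wall_clock = instance_hours / 100             # ≈ 8.3 hours on 100 instances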

The overall cost is low.

Crawl Images

  • Get all image URLs
import re

# Image URLs are JSON-escaped (\/) in the page source, so the pattern matches the backslashes too
p = r'https:\\/\\/photos\.zillowstatic\.com\\/fp\\/([\d\w\-\_]+)\.jpg'

ids = [s.split('-')[0] for s in re.findall(p, html)]
urls = [f'https://photos.zillowstatic.com/fp/{id}-uncropped_scaled_within_1536_1152.jpg' for id in ids]
  • A house listing has ~20 images
    • The crawling cost is still reasonable: ~$300
    • Storing these images is expensive: ~$300 per month (a rough sanity check follows the list)
      • You can reduce the image resolution, or move the data back out of cloud storage
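A rough sanity check on the storage figure, assuming S3-standard pricing of about $0.023 per GB-month and an average image size of ~0.65MB (both numbers are assumptions, not from the slides):

images = 1_000_000 * 20                  # ~20 images per listing
mb_per_image = 0.65                      # assumed average size
gb_total = images * mb_per_image / 1024  # ≈ 12,700 GB (~12.4 TB)
s3_rate = 0.023                          # assumed $/GB-month
monthly_cost = gb_total * s3_rate        # ≈ $292 per month, matching ~$300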

Legal Considerations

  • Web scraping isn’t illegal by itself
  • But you should
    • NOT scrape data that contains sensitive information (E.g. private data involving usernames/passwords, personal health/medical information)
    • NOT scrape copyrighted data (E.g. YouTube videos, Flickr photos)
    • Follow the Terms of Service when it explicitly prohibits web scraping
  • Consult a lawyer if you are doing it for profit

References

  1. slides
