Scraping one web site for information is easy, scraping 10000 different sites is hard. Beyond page-specific scraping, how do you build a program than can extract the publication date of (almost) any news article online, no matter the web site?
We’ll cover when to use machine learning vs. humans or heuristics for data extraction, the different steps of how to phrase the problem in terms of machine learning, including feature selection on HTML documents, and issues that arise when turning research into production code.
I am looking for editors/curators to help with branches of the tree. Please send me an email if you are interested.