Art of Web Scraping

Author

jpt

Introduction

I’ve found that web scraping can be quite polarizing among developers. A lot of people hate it, finding it frustrating and tedious. On the other hand, many get a sense of satisfaction out of tackling the challenges it presents and the end result of having machine-readable data where it didn’t exist before.

For thirteen years (2009-2022), I led the Open States project. Open States is a public resource which provides legislative data for all 50 states (and DC and Puerto Rico!) that updates regularly throughout the day. The project relies on around 200 custom scrapers, and then uses that data to power a variety of services including a public API and website. Millions of individuals have used Open States over the years to find their representatives & track legislation, and the data is widely used by researchers, journalists, and nonprofits.

So, it makes sense that I’m definitely in the latter camp, I enjoy writing web scrapers & seeing the difference the final product can make. I also believe that a lot of people that hate web scraping hate it because their experiences were terrible. The commonly used libraries are not ergonomic, most scraper code is ugly and hard to maintain, and the write-run-wait-debug cycle is slow and painful. Additionally, many people approach web scraping with the same mindset they use for writing web applications or other software, which no doubt leads to a frustrating experience.

I believe these factors are exacerbated by the fact that a lot of web scrapers are built as throwaway one-offs (whether or not they actually are is a different story), and as a result not a lot of thought is put into their design. This makes sense, but creates a sort of feedback loop where most scraper code is fragile and hard to maintain, so people looking to write more reliable scrapers wind up learning from code that is fragile and hard to maintain. Over the years I onboarded dozens of new contributors, and often wished for better resources on web scraping that reflected best practices. My hope is that this resource can help to fill that gap.

Given the way Data Engineering has taken off as a field in recent years, I think we’re overdue to take a second look at web scraping as a discipline. The scrapers are often the most fragile, least maintainable part of the pipeline. We’ll take a look at why that is, and how we can improve it.

About This Book

Note

This is a work in progress, the table of contents is aspirational.

If you’re interested in knowing when new chapters are ready, you can follow me on Mastodon.

I’ve also set up a mailing list for updates. I’ll send out an email when new chapters are ready.

One of the goals of this project is to provide a resource for people looking to write reliable scrapers. While code examples will be in Python, many of the concepts should be applicable regardless of language. Part 1 in particular will focus on high-level concepts and design patterns, and be broadly applicable.

Part 2 is explicitly focused on the Python scraping ecosystem. Whatever language you use, picking the right tools is essential. For a lot of people 2-3 Python libraries seem like the end-all-be-all, but there are a lot of options out there, and I think it’s worth exploring them. We’ll also take a look at some oft-neglected parts of the scraping stack, like data validation and logging.

Right now I’m also grouping more advanced topics into Part 3. That’s the least certain part of the plan right now. If you have suggestions of things you’d like to see covered, please let me know.

Part 1: Understanding Web Scraping

This section aims to cover broad topics in web scraping. This isn’t particularly focused on Python, but rather on the general concepts of web scraping.

Web Scraping 101 - An overview of the basics of web scraping.
Scraping Philosophy - A discussion of the philosophy behind writing scrapers.
Best Practices - Writing resilient & sustainable scrapers.
Ethical & Legal Guidelines - A discussion of ethical & legal guidelines for scraping.

Part 2: Python Scraping Ecosystem

These chapters are focused on the Python scraping ecosystem. Each chapter will compare a few libraries, and discuss the pros & cons of each.

Making Requests - Various libraries for making HTTP requests.
Parsing HTML - Comparing various libraries for parsing HTML.
Other Libraries - Other libraries that are useful for scraping.

Part 3: Advanced Concepts

These chapters discuss more advanced topics in web scraping. Not every scraper will need to use these, but they’re useful to know about as your scrapers grow more advanced.

Deploying Scrapers - A discussion of deploying scrapers.

Appendices

Appendices that will contain reference information useful to people that are writing scrapers.

Appendix A: XPath & CSS Selectors