AI-assisted code generation
When using LLMs to write Python web scraping code, consider these approaches:
- Plain Python functions or classes
Pros: Simple, dependency-free, and easy for LLMs to produce.
Cons: You must define your own conventions and testing practices; integration across teams and tools can be ad-hoc.
Use when you need a quick extractor or the logic is small and unlikely to be reused.
- Scrapy spiders
Pros: Built-in crawling, request scheduling, retries, and many other utilities.
Cons: A large API surface for an LLM to generate correctly. Spiders mix crawling, error handling, and extraction, which makes testing extraction in isolation difficult.
Avoid generating full spiders with an LLM; prefer generating extraction logic separately.
- web-poet page objects
Pros: Small, standard contract for extraction with field-level decomposition, first-class testing support, and framework integration.
Cons: Requires adopting web-poet idioms and adds some framework overhead, which can be unnecessary for trivial scripts.
Use when you want maintainability, testability, and a predictable contract that can be used by tools and teams.
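To make the difference in contract concrete, here is a minimal sketch using only the standard library (the HTML, class names, and helper names are hypothetical, and the page-object class only mirrors the *pattern* of web-poet's field decomposition, not its actual API): a one-off extractor function versus a page-object-style class with one method per field, so each field can be tested in isolation.

```python
from html.parser import HTMLParser

# Hypothetical product HTML used for the example.
HTML = '<html><h1 class="name">Widget</h1><span class="price">9.99</span></html>'


class _TextByClass(HTMLParser):
    """Collect the text content of elements, keyed by their class attribute."""

    def __init__(self):
        super().__init__()
        self._current = None
        self.texts = {}  # class attribute -> text content

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("class")

    def handle_data(self, data):
        if self._current:
            self.texts[self._current] = data
            self._current = None


def _texts(html):
    parser = _TextByClass()
    parser.feed(html)
    return parser.texts


# Approach 1: a plain extractor function -- quick to generate,
# but the output shape is an ad-hoc convention.
def extract_product(html):
    texts = _texts(html)
    return {"name": texts.get("name"), "price": float(texts["price"])}


# Approach 3 (pattern only): a page-object-style class with one method
# per field, mirroring web-poet's field-level decomposition so that an
# LLM can generate, and a test can exercise, each field separately.
class ProductPage:
    def __init__(self, html):
        self._texts = _texts(html)

    def name(self):
        return self._texts.get("name")

    def price(self):
        return float(self._texts["price"])

    def to_item(self):
        return {"name": self.name(), "price": self.price()}


print(extract_product(HTML))          # {'name': 'Widget', 'price': 9.99}
print(ProductPage(HTML).to_item())    # same item, built field by field
```

Either form produces the same item; the page-object version simply fixes the contract (a class per page type, a method per field, `to_item()` to assemble the result), which is what makes generated extraction code predictable for tools and reviewers.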
Note
scrapy-poet lets you use web-poet page objects within Scrapy spiders, combining the benefits of both approaches.