Description
This package provides kind of a framework and a lot of ready to use, so-called steps, that you can combine to build your own crawlers or scrapers with.
Library for Rapid (Web) Crawler and Scraper Development alternatives and similar libraries
Based on the "Scraping" category.
Alternatively, view crawler alternatives based on common mentions on social networks and blogs.
InfluxDB - Power Real-Time Data Analytics at Scale
Do you think we are missing an alternative of Library for Rapid (Web) Crawler and Scraper Development or a related project?
README
Library for Rapid (Web) Crawler and Scraper Development
This library provides kind of a framework and a lot of ready to use, so-called steps, that you can use as building blocks, to build your own crawlers and scrapers with.
To give you an overview, here's a list of things that it helps you with:
- Crawler Politeness 😇 (respecting robots.txt, throttling,...)
- Load URLs using
- a (PSR-18) HTTP client (default is of course Guzzle)
- or a headless browser (chrome) to get source after Javascript execution
- Get absolute links from HTML documents 🔗
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website 🕷
- Use cookies (or don't) 🍪
- Use any HTTP methods (GET, POST,...) and send any headers or body
- Iterate over paginated list pages 🔁
- Extract data from:
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators 💪
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
- And a lot more...
Documentation
You can find the documentation at crwlr.software.
Contributing
If you consider contributing something to this package, read the [contribution guide (CONTRIBUTING.md)](CONTRIBUTING.md).