Description
This package provides a framework and many ready-to-use building blocks, so-called steps, that you can combine to build your own crawlers and scrapers.
README
Library for Rapid (Web) Crawler and Scraper Development
This library provides a framework and many ready-to-use, so-called steps that you can use as building blocks for your own crawlers and scrapers.
To give you an overview, here's a list of things that it helps you with:
- Crawler Politeness 😇 (respecting robots.txt, throttling,...)
- Load URLs using
  - a (PSR-18) HTTP client (Guzzle by default)
  - or a headless browser (Chrome) to get the source after JavaScript execution
- Get absolute links from HTML documents 🔗
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website 🕷
- Use cookies (or don't) 🍪
- Use any HTTP method (GET, POST, ...) and send any headers or body
- Iterate over paginated list pages 🔁
- Extract data from:
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators 💪
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
- And a lot more...
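
To illustrate the point about generators, here is a minimal sketch in plain PHP. It does not use this library's API; `processUrls()` and the example URLs are hypothetical names made up for the illustration. The idea is that each result is yielded and consumed one at a time, so the full result set never has to be held in memory.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of this library): lazily yields one
 * result per URL instead of building the whole result array up front.
 *
 * @param iterable<string> $urls
 * @return Generator<int, array{url: string, length: int}>
 */
function processUrls(iterable $urls): Generator
{
    foreach ($urls as $url) {
        // A real crawler would load and parse the page here. This sketch
        // only measures the URL string so that it stays self-contained.
        yield ['url' => $url, 'length' => strlen($url)];
    }
}

// Only one result exists in memory per loop iteration.
foreach (processUrls(['https://www.example.com/a', 'https://www.example.com/b']) as $result) {
    echo $result['url'] . ' (' . $result['length'] . ")\n";
}
```

Because `processUrls()` returns a `Generator`, its body runs only as far as the consumer iterates, which is what keeps memory usage flat even for very long URL lists.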
Documentation
You can find the documentation at crwlr.software.
Contributing
If you are considering contributing to this package, please read the [contribution guide (CONTRIBUTING.md)](CONTRIBUTING.md).