Library for Rapid (Web) Crawler and Scraper Development v0.6.0 Release Notes

Release Date: 2022-10-03
  • ➕ Added

    • 🆕 New step Http::crawl() (class HttpCrawl extending the normal Http step class) for conventional crawling. It loads all pages of a website (same host or domain) by following links. There are also many options, like limiting the crawl depth, filtering by paths, and so on.
    • 🆕 New steps Sitemap::getSitemapsFromRobotsTxt() (GetSitemapsFromRobotsTxt) and Sitemap::getUrlsFromSitemap() (GetUrlsFromSitemap) to get the sitemap URLs from a robots.txt file and to get all the URLs from those sitemaps.
    • 🆕 New step Html::metaData() to get data from meta tags (and the title tag) in HTML documents.
    • 🆕 New step Html::schemaOrg() (SchemaOrg) to get schema.org structured data in JSON-LD format from HTML documents.
    • The abstract DomQuery class (parent of the CssSelector and XPathQuery classes) now has some methods to narrow the selected matches further: first(), last(), nth(n), even(), odd().
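
The new steps can be combined in a single crawling procedure, for example like this (a minimal sketch; the MyCrawler class and the target URL are assumptions for illustration):

```php
<?php

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

// MyCrawler is assumed to be your own crawler class built as described
// in the library's documentation.
$crawler = new MyCrawler();

$crawler->input('https://www.example.com/robots.txt')
    ->addStep(Sitemap::getSitemapsFromRobotsTxt()) // sitemap URLs from robots.txt
    ->addStep(Sitemap::getUrlsFromSitemap())       // all URLs from those sitemaps
    ->addStep(Http::get())                         // load each URL
    ->addStep(Html::metaData());                   // title and meta tag data

$crawler->runAndTraverse();
```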

  • 🔄 Changed

    • 💥 BREAKING: Removed PoliteHttpLoader and the traits WaitPolitely and CheckRobotsTxt. The traits were converted to the classes Throttler and RobotsTxtHandler, which are dependencies of the HttpLoader, and the HttpLoader internally gets default instances of those classes. The RobotsTxtHandler respects robots.txt rules by default if you use a BotUserAgent, and doesn't if you use a normal UserAgent. You can access the loader's RobotsTxtHandler via HttpLoader::robotsTxt(). You can also pass your own instance of the Throttler to the loader and access it via HttpLoader::throttle() to change settings.
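
Accessing the new loader dependencies could look like this (a sketch; how you obtain the HttpLoader instance depends on your crawler class, so getLoader() is an assumption here):

```php
<?php

// Access the loader's RobotsTxtHandler and Throttler instances via the
// new HttpLoader methods mentioned above.
$loader = $crawler->getLoader();

$robotsTxtHandler = $loader->robotsTxt(); // default RobotsTxtHandler instance
$throttler = $loader->throttle();         // default Throttler instance

// Remember: with a BotUserAgent the RobotsTxtHandler respects robots.txt
// rules by default; with a normal UserAgent it doesn't.
```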

  • 🛠 Fixed

    • The GetLink and GetLinks steps and the toAbsoluteUrl() method of the CssSelector and XPathQuery classes now also look for <base> tags in HTML when resolving absolute URLs.
    • The SimpleCsvFileStore can now also save results with nested data (second level only). It simply concatenates the values, separated by a |.

Previous changes from v0.5.0

  • ➕ Added

    • 💻 You can now call the new useHeadlessBrowser method on the HttpLoader class to use a headless Chrome browser to load pages. This is sufficient to get the HTML after JavaScript has been executed in the browser. For more sophisticated tasks, it's better to create a separate Loader and/or Steps.
    • 🚚 With the maxOutputs() method of the abstract Step class you can now limit how many outputs a certain step may yield at most. That's helpful, for example, during development, when you want to run the crawler with only a small subset of the data/requests it will actually have to process once you remove the limits. When a step has reached its limit, it won't even call the invoke() method any longer, until the step is reset after a run.
    • With the new outputHook() method of the abstract Crawler class you can set a closure that receives all the outputs from all the steps. It's intended for debugging purposes only.
    • The extract() method of the Html and Xml (children of Dom) steps now also works with a single selector instead of an array with a mapping. Sometimes you'll want just a simple string output, e.g. for a next step, instead of an array with mapped extracted data.
    • In addition to uniqueOutputs() there is now also uniqueInputs(). It works exactly like uniqueOutputs(), but filters duplicate input values instead. Optionally also by a key, when the expected input is an array or an object.
    • In order to also get absolute links when using the extract() method of Dom steps, the abstract DomQuery class now has a method toAbsoluteUrl(). The Dom step automatically provides the DomQuery instance with the base URL, provided that the input was an instance of the RespondedRequest class, and resolves the selected value against that base URL.
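
Several of these additions can be combined in one sketch (MyCrawler, getLoader(), the CSS selectors, and the closure signature are assumptions for illustration):

```php
<?php

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->getLoader()->useHeadlessBrowser(); // render pages in headless Chrome

$crawler->input('https://www.example.com')
    ->addStep(Http::get())
    ->addStep(Html::getLinks()->maxOutputs(20))       // stop after 20 link outputs
    ->addStep(Http::get()->uniqueInputs())            // skip duplicate URLs
    ->addStep(Html::first('article')->extract('h1')); // single selector => string

// Receives all outputs from all steps; for debugging only.
$crawler->outputHook(function ($output) {
    var_dump($output);
});

$crawler->runAndTraverse();
```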

  • 🔄 Changed

    • ✂ Removed some less important log messages.
    • 👌 Improved the behavior of the group step's combineToSingleOutput(). When steps yield multiple outputs, they are no longer all combined into one output. Instead, the first output from the first step is combined with the first output from the second step, and so on.
    • When results are not explicitly composed, but the outputs of the last step are arrays with string keys, those keys are now set on the Result object, instead of setting a single key unnamed with the whole array as its value.

  • 🛠 Fixed

    • The static methods Html::getLink() and Html::getLinks() now also work without argument, like the GetLink and GetLinks classes.
    • When a DomQuery (CSS selector or XPath query) doesn't match anything, its apply() method now returns null (instead of an empty string). When the Html(/Xml)::extract() method is used with a single, non-matching selector/query, nothing is yielded. When it's used with an array with a mapping, it yields an array with null values. If the selector for one of the methods Html(/Xml)::each(), Html(/Xml)::first() or Html(/Xml)::last() doesn't match anything, that no longer causes an error; it just won't yield anything.
    • ✂ Removed the (unnecessary) second argument from the Loop::withInput() method, because when withInput() was called after a call to keepLoopingWithoutOutput(), it reset that behavior.
    • Fixed an issue when the format of a cookie's expires date doesn't have dashes, i.e. d M Y instead of d-M-Y.