Changelog History

6 versions · Avg release cycle: 32 days · Latest release: 717 days ago
v0.6.0 Changes

October 03, 2022

Added

- New step `Http::crawl()` (class `HttpCrawl` extending the normal `Http` step class) for conventional crawling. It loads all pages of a website (same host or domain) by following links. There are also a lot of options, like depth, filtering by paths, and so on (see the sketch after this list).
- New steps `Sitemap::getSitemapsFromRobotsTxt()` (`GetSitemapsFromRobotsTxt`) and `Sitemap::getUrlsFromSitemap()` (`GetUrlsFromSitemap`) to get the sitemap URLs from a robots.txt file and to get all the URLs from those sitemaps.
- New step `Html::metaData()` to get data from meta tags (and the title tag) in HTML documents.
- New step `Html::schemaOrg()` (`SchemaOrg`) to get schema.org structured data in JSON-LD format from HTML documents.
- The abstract `DomQuery` class (parent of the `CssSelector` and `XPathQuery` classes) now has some methods to narrow the selected matches further: `first()`, `last()`, `nth(n)`, `even()`, `odd()`.
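
For illustration, a minimal sketch of how the new crawl and meta data steps might be wired together. Only `Http::crawl()`, `Html::metaData()` and `runAndTraverse()` come from the changelog; the crawler subclass, bot name and start URL are placeholders, and the subclass boilerplate follows the library's usual README pattern, which may differ slightly between versions.

```php
<?php

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

// Placeholder crawler class; the bot name is purely illustrative.
class ExampleCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('ExampleBot');
    }
}

$crawler = new ExampleCrawler();

$crawler->input('https://www.example.com');

// Conventional crawling: load all pages of the site by following links.
$crawler->addStep(Http::crawl());

// Get meta tag (and title tag) data from every loaded page.
$crawler->addStep(Html::metaData());

$crawler->runAndTraverse();
```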
Changed

- BREAKING: Removed `PoliteHttpLoader` and the traits `WaitPolitely` and `CheckRobotsTxt`. The traits were converted to the classes `Throttler` and `RobotsTxtHandler`, which are dependencies of the `HttpLoader`. The `HttpLoader` internally gets default instances of those classes. The `RobotsTxtHandler` respects robots.txt rules by default if you use a `BotUserAgent`, and it doesn't if you use a normal `UserAgent`. You can access the loader's `RobotsTxtHandler` via `HttpLoader::robotsTxt()`. You can also pass your own instance of the `Throttler` to the loader and access it via `HttpLoader::throttle()` to change settings (see the sketch after this list).
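
A hedged sketch of the new loader dependencies, assuming the `HttpLoader` is constructed manually; only `robotsTxt()` and `throttle()` come from this entry, the rest is placeholder boilerplate.

```php
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\UserAgents\BotUserAgent;

// The loader internally creates default RobotsTxtHandler and Throttler instances.
$loader = new HttpLoader(new BotUserAgent('ExampleBot'));

// With a BotUserAgent, robots.txt rules are respected by default;
// the handler can be accessed to inspect or adjust that behavior.
$robotsTxtHandler = $loader->robotsTxt();

// The Throttler (replacing the former WaitPolitely trait) is also accessible
// in case you want to change its settings.
$throttler = $loader->throttle();
```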
Fixed

- Getting absolute links via the `GetLink` and `GetLinks` steps and the `toAbsoluteUrl()` method of the `CssSelector` and `XPathQuery` classes now also looks for `<base>` tags in the HTML when resolving URLs.
- The `SimpleCsvFileStore` can now also save results with nested data (but only down to the second level). It just concatenates the values, separated with a `|`.
v0.5.0 Changes

September 03, 2022

Added

- You can now call the new `useHeadlessBrowser()` method on the `HttpLoader` class to use a headless Chrome browser to load pages. This is enough to get the HTML after executing javascript in the browser. For more sophisticated tasks, it's better to create a separate loader and/or steps (see the sketch after this list).
- With the `maxOutputs()` method of the abstract `Step` class you can now limit how many outputs a certain step should yield at most. That's helpful during development, for example when you want to run the crawler with only a small subset of the data/requests it will actually have to process once you remove the limits. When a step has reached its limit, it won't even call the `invoke()` method any longer, until the step is reset after a run.
- With the new `outputHook()` method of the abstract `Crawler` class you can set a closure that receives all the outputs from all the steps. It should only be used for debugging purposes.
- The `extract()` method of the `Html` and `Xml` steps (children of `Dom`) now also works with a single selector instead of an array with a mapping. Sometimes you just want a simple string output, e.g. for a next step, instead of an array with mapped extracted data.
- In addition to `uniqueOutputs()` there is now also `uniqueInputs()`. It works exactly like `uniqueOutputs()`, filtering duplicate input values instead, optionally also by a key when the expected input is an array or an object.
- To also get absolute links when using the `extract()` method of Dom steps, the abstract `DomQuery` class now has a `toAbsoluteUrl()` method. The Dom step automatically provides the `DomQuery` instance with the base URL (provided the input was an instance of the `RespondedRequest` class) and resolves the selected value against that base URL.
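
A hedged sketch of a few of these additions; `$crawler` is assumed to be an instance of a crawler subclass like the one sketched under v0.6.0, and the user agent string and selectors are made up.

```php
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\UserAgent;

// Load pages via a headless Chrome browser, so javascript is executed first.
$loader = new HttpLoader(new UserAgent('Mozilla/5.0 (compatible; ExampleBot)'));
$loader->useHeadlessBrowser();

// Limit a step's outputs during development and skip duplicate inputs.
$crawler->addStep(
    Http::get()->maxOutputs(20)->uniqueInputs()
);

// extract() now also accepts a single selector and yields a plain string.
$crawler->addStep(
    Html::first('article')->extract('h1')
);

// Inspect all outputs of all steps (debugging only).
$crawler->outputHook(function ($output) {
    var_dump($output);
});
```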
Changed

- Removed some not so important log messages.
- Improved the behavior of the group step's `combineToSingleOutput()`. When steps yield multiple outputs, it no longer combines all yielded outputs into one. Instead, it combines the first output from the first step with the first output from the second step, and so on (see the sketch after this list).
- When results are not explicitly composed, but the outputs of the last step are arrays with string keys, those keys are now set on the Result object, instead of setting a key `unnamed` with the whole array as its value.
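
For illustration, a hedged sketch of the combining behavior, assuming a group step created via `Crawler::group()` with two extraction steps running against the same loaded page; the selectors are illustrative and not taken from this changelog.

```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;

// Outputs are combined element-wise: the first name with the first price,
// the second name with the second price, and so on.
$crawler->addStep(
    Crawler::group()
        ->addStep(Html::each('.product')->extract(['name' => '.name']))
        ->addStep(Html::each('.product')->extract(['price' => '.price']))
        ->combineToSingleOutput()
);
```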
Fixed

- The static methods `Html::getLink()` and `Html::getLinks()` now also work without an argument, like the `GetLink` and `GetLinks` classes.
- When a `DomQuery` (CSS selector or XPath query) doesn't match anything, its `apply()` method now returns `null` (instead of an empty string). When the `Html(/Xml)::extract()` method is used with a single, non-matching selector/query, nothing is yielded. When it's used with an array with a mapping, it yields an array with null values. If the selector for one of the methods `Html(/Xml)::each()`, `Html(/Xml)::first()` or `Html(/Xml)::last()` doesn't match anything, that no longer causes an error; it just won't yield anything.
- Removed the (unnecessary) second argument from the `Loop::withInput()` method, because when `keepLoopingWithoutOutput()` is called and `withInput()` is called after that call, it resets the behavior.
- Fixed an issue when the date format of a cookie's expires date doesn't have dashes, i.e. `d M Y` instead of `d-M-Y`.
v0.4.1 Changes

May 10, 2022

Fixed

- The `Json` step now also works with HTTP responses as input.
v0.4.0 Changes

May 06, 2022

Added

- The `BaseStep` class now has `where()` and `orWhere()` methods to filter step outputs. You can set multiple filters that will be applied to all outputs. When a filter is set using `orWhere()`, it's linked to the previously added filter with "OR". Outputs not matching the filters are not yielded. The available filters can be accessed through static methods on the new `Filter` class. Currently available filters are comparison filters (equal, greater/less than, ...), a few string filters (contains, starts/ends with) and URL filters (scheme, domain, host, ...).
- The `GetLink` and `GetLinks` steps now have the methods `onSameDomain()`, `notOnSameDomain()`, `onDomain()`, `onSameHost()`, `notOnSameHost()` and `onHost()` to restrict which links to find (see the sketch after this list).
- The crawler's logger is now automatically added to the `Store`, so you can also log messages from there. This can be breaking, as the `StoreInterface` now also requires the `addLogger()` method. The new abstract `Store` class already implements it, so you can just extend it.
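
A hedged sketch combining a link restriction with the new output filters; the concrete filter method name (`stringContains()`) and the `Filter` class namespace follow the description above but should be treated as assumptions, and the URL fragments are made up.

```php
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

// Only find links on the same domain, and only yield those whose URL
// contains /blog/ or /news/.
$crawler->addStep(
    Html::getLinks()
        ->onSameDomain()
        ->where(Filter::stringContains('/blog/'))
        ->orWhere(Filter::stringContains('/news/'))
);
```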
Changed

- The `Csv` step can now also be used without defining a column mapping. In that case it uses the values from the first line (so this makes sense when there are column headers) as output array keys.
v0.3.0 Changes

April 27, 2022

Added

- By calling `monitorMemoryUsage()` you can tell the crawler to add log messages with the current memory usage after every step invocation. You can also set a limit in bytes at which monitoring starts; below that limit it won't log memory usage (see the sketch after this list).
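
A quick illustration; that the limit is passed as a plain byte value argument is an assumption based on the description above.

```php
// Log memory usage after every step invocation,
// but only once usage exceeds roughly 256 MB (the limit is optional).
$crawler->monitorMemoryUsage(256 * 1024 * 1024);
```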
Fixed

- Previously the use of Generators didn't actually make a lot of sense, because the outputs of one step were only iterated and passed on to the next step after the current step had been invoked with all of its inputs. That makes steps with a lot of inputs bottlenecks and causes bigger memory consumption. So the crawler was changed to immediately pass on the outputs of one step to the next step, if there is one.
v0.2.0 Changes

April 25, 2022

Added

- `uniqueOutputs()` method on steps to get only unique output values. If outputs are arrays or objects, you can provide a key that will be used as an identifier to check for uniqueness. Otherwise the arrays or objects will be serialized for comparison, which will probably be slower.
- `runAndTraverse()` method on the Crawler, so you don't need to manually traverse the Generator if you don't need the results where you're calling the crawler (see the sketch after this list).
- Implemented the behavior for when a `Group` step should add something to the Result using `setResultKey()` or `addKeysToResult()`, which was still missing. For groups this only works when using `combineToSingleOutput()`.
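
A short sketch of the two new methods, assuming a `$crawler` instance like the one sketched under v0.6.0; the `Html::getLinks()` step just serves as an example of a step producing outputs.

```php
use Crwlr\Crawler\Steps\Html;

// Yield each link only once, even if it appears multiple times on a page.
$crawler->addStep(Html::getLinks()->uniqueOutputs());

// Run the crawler when you don't need the results at the call site.
$crawler->runAndTraverse();
```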