- automatically convert Google Docs share URLs to CSV download ones in request list (#1174)
- use Puppeteer emulating scrolls instead of `window.scrollBy()` (#1170)
- warn if Apify Proxy is used in `proxyUrls` (#1173)
- fix `YOUTUBE_REGEX_STRING` being too greedy (#1171)
- add `purgeLocalStorage` utility method (#1187)
- catch errors inside request interceptors (#1188, #1190)
- add support for cgroups v2 (#1177)
- fix incorrect offset in `fixUrl` function (#1184)
- support channel and user links in YouTube regex (#1178)
- fix: allow passing `requestsFromUrl` to `RequestListOptions` in TS (#1191)
- allow passing `forceCloud` down to the KV store (#1186), closes #752
- merge cookies from session with user provided ones (#1201), closes #1197
- use `ApifyClient` v2 (full rewrite to TS)
- Fix casting of int/bool environment variables (e.g. `APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE`), closes #956
- Fix incognito pages and user data dir (#1145)
- Add `@ts-ignore` comments to imports of optional peer dependencies (#1152)
- Use config instance in `sdk.openSessionPool()` (#1154)
- Add a breaking callback to `infiniteScroll` (#1140)
- Fix deprecation messages logged from `ProxyConfiguration` and `CheerioCrawler`.
- Update `got-scraping` to receive multiple improvements.
- Fix error handling in puppeteer crawler
- Use `sessionToken` with `got-scraping`
- BREAKING IN EDGE CASES - We removed `forceUrlEncoding` in `requestAsBrowser` because we found out that recent versions of the underlying HTTP client `got` already encode URLs and `forceUrlEncoding` could lead to weird behavior. We think of this as fixing a bug, so we're not bumping the major version.
- Limit `handleRequestTimeoutMillis` to max valid value to prevent Node.js fallback to `1`.
- Use `got-scraping@^3.0.1`
- Disable SSL validation on MITM proxies
- Limit `handleRequestTimeoutMillis` to max valid value
- Fix serialization issues in `CheerioCrawler` caused by parser conflicts in recent versions of `cheerio`.
- Use `got-scraping` 2.0.1 until fully compatible.
- BREAKING: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy.
- BREAKING: Bump `cheerio` to `1.0.0-rc.10` from `rc.3`. There were breaking changes in `cheerio` between the versions so this bump might be breaking for you as well.
- Remove `LiveViewServer` which was deprecated before release of SDK v1.
- We no longer tag beta releases.
- Fix issues with TS builds caused by incomplete `browser-pool` rewrite
- Fix public URL getter of key-value stores
- Fix `headerGeneratorOptions` not being passed to `got-scraping` in `requestAsBrowser`.
- Fix client `/v2` duplication in `apiBaseUrl`.
`CheerioCrawler` downloads web pages using the `requestAsBrowser` utility function.
Unlike the browser-based crawlers, which encode URLs automatically, the
`requestAsBrowser` function does not. We either need to encode the URLs manually
with the `encodeURI()` function, or set `forceUrlEncoding: true` in the
`requestAsBrowserOptions`, which will automatically encode all the URLs before
accessing them.

We can either use `forceUrlEncoding` or encode manually, but not both - it would result in double encoding and therefore lead to invalid URLs.
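The double-encoding pitfall is easy to demonstrate with plain JavaScript (no SDK involved): `encodeURI()` also escapes `%` characters, so encoding an already encoded URL corrupts it.

```javascript
// Manual encoding with encodeURI(): escapes unsafe characters
// while leaving reserved URL syntax (:, /, ?, =) intact.
const url = 'https://example.com/search?q=café latte';
const encoded = encodeURI(url);
console.log(encoded); // https://example.com/search?q=caf%C3%A9%20latte

// Encoding a second time escapes the % signs themselves,
// producing an invalid, double-encoded URL.
const doubleEncoded = encodeURI(encoded);
console.log(doubleEncoded); // https://example.com/search?q=caf%25C3%25A9%2520latte
```

This is why combining manual `encodeURI()` calls with `forceUrlEncoding: true` must be avoided.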
We can use the `preNavigationHooks` to adjust `requestAsBrowserOptions`:

```javascript
preNavigationHooks: [
    (crawlingContext, requestAsBrowserOptions) => {
        requestAsBrowserOptions.forceUrlEncoding = true;
    },
]
```
Adds two new named exports:

- `Configuration` class that serves as the main configuration holder, replacing explicit usage of environment variables.
- `Apify` class that allows configuring the SDK. Env vars still have precedence over the SDK configuration.

When using the `Apify` class, there should be no side effects.
Also adds new configuration for WAL mode in `ApifyStorageLocal`.
As opposed to using global helper functions like `main`, there is an alternative approach using the `Apify` class.
It has mostly the same API, but the methods on an `Apify` instance will use the configuration provided in the constructor.
Environment variables will have precedence over this configuration.
```javascript
const { Apify } = require('apify'); // use named export to get the class

const sdk = new Apify({ token: '123' });
console.log(sdk.config.get('token')); // '123'

// the token will be passed to the `call` method automatically
const run = await sdk.call('apify/hello-world', { myInput: 123 });
console.log(`Received message: ${run.output.body.message}`);
```

Another example shows how the default dataset name can be changed:

```javascript
const { Apify } = require('apify'); // use named export to get the class

const sdk = new Apify({ defaultDatasetId: 'custom-name' });
await sdk.pushData({ myValue: 123 });
```

is equivalent to:

```javascript
const Apify = require('apify'); // use default export to get the helper functions

const dataset = await Apify.openDataset('custom-name');
await dataset.pushData({ myValue: 123 });
```

- Add `Configuration` class and `Apify` named export, see above.
- Fix `proxyUrl` without a port throwing an error when launching browsers.
- Fix `maxUsageCount` of a `Session` not being persisted.
- Update `puppeteer` and `playwright` to match stable Chrome (90).
- Fix support for building TypeScript projects that depend on the SDK.
- add `taskTimeoutSecs` to allow control over timeout of `AutoscaledPool` tasks
- add `forceUrlEncoding` to `requestAsBrowser` options
- add `preNavigationHooks` and `postNavigationHooks` to `CheerioCrawler`
- deprecated `prepareRequestFunction` and `postResponseFunction` methods of `CheerioCrawler`
- Added new event `aborting` for handling gracefully aborted runs from the Apify platform.
- Fix `requestAsBrowser` behavior with various combinations of the `json` and `payload` legacy options.
This release brings the long-awaited HTTP2 capabilities to `requestAsBrowser`. It could make HTTP2 requests even before, but it was not very helpful in making browser-like ones. This is very important for disguising as a browser and reducing the number of blocked requests. `requestAsBrowser` now uses `got-scraping`.

The most important new feature is that the full set of headers `requestAsBrowser` uses will now be generated using live data about browser headers that we collect. This means that the "header fingerprint" will always match existing browsers and should be indistinguishable from a real browser request. The header sets will be automatically rotated for you to further reduce the chances of blocking.

We also switched the default HTTP version from 1 to 2 in `requestAsBrowser`. We don't expect this change to be breaking, and we took precautions, but we're aware that there are always some edge cases, so please let us know if it causes trouble for you.
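A minimal sketch of a call using the new defaults (only `utils.requestAsBrowser()` and the `useHttp2` option from these notes are used; `useHttp2` is shown explicitly even though it is now the default):

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // requestAsBrowser now routes through got-scraping and
    // generates a browser-like header set automatically.
    const response = await Apify.utils.requestAsBrowser({
        url: 'https://example.com',
        useHttp2: true, // the new default, shown for clarity
    });
    console.log(response.statusCode);
});
```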
- Replace the underlying HTTP client of `utils.requestAsBrowser()` with `got-scraping`.
- Make `useHttp2` `true` by default with `utils.requestAsBrowser()`.
- Fix `Apify.call()` failing with empty `OUTPUT`.
- Update `puppeteer` to `8.0.0` and `playwright` to `1.10.0` with Chromium 90 in Docker images.
- Update `@apify/ps-tree` to support Windows better.
- Update `@apify/storage-local` to support Node.js 16 prebuilds.
- DEPRECATED: `utils.waitForRunToFinish`. Please use the `apify-client` package and its `waitForFinish` functions. Sorry, we forgot to deprecate this with the v1 release.
- Fix internal `require` that broke the SDK with the `underscore` 1.13 release.
- Update `@apify/storage-local` to v2, written in TypeScript.
- Fix `SessionPoolOptions` not being correctly used in `BrowserCrawler`.
- Improve error messages for missing `puppeteer` or `playwright` installations.
In this minor release we focused on the SessionPool. Besides fixing a few bugs, we added one important feature: setting and getting of sessions by ID.
```javascript
// Now you can add specific sessions to the pool,
// instead of relying on random generation.
await sessionPool.addSession({
    id: 'my-session',
    // ... some config
});

// Later, you can retrieve the session. This is useful
// for example when you need a specific login session.
const session = await sessionPool.getSession('my-session');
```

- Add `sessionPool.addSession()` function to add a new session to the session pool (possibly with the provided options, e.g. with a specific session id).
- Add optional parameter `sessionId` to `sessionPool.getSession()` to be able to retrieve a session from the session pool with the specific session id.
- Fix `SessionPool` not working properly in both `PuppeteerCrawler` and `PlaywrightCrawler`.
- Fix `Apify.call()` and `Apify.callTask()` output - make it backwards compatible with previous versions of the client.
- Improve handling of browser executable paths when using the official SDK Docker images.
- Update `browser-pool` to fix issues with failing hooks causing browsers to get stuck in limbo.
- Removed `proxy-chain` dependency because now it's covered in `browser-pool`.
- Add the ability to override the `ProxyConfiguration` status check URL with the `APIFY_PROXY_STATUS_URL` env var.
- Fix inconsistencies in cookie handling when `SessionPool` was used.
- Fix TS types in multiple places. TS is still not a first class citizen, but this should improve the experience.
- Fix `dataset.pushData()` validation which would not allow anything other than plain objects.
- Fix `PuppeteerLaunchContext.stealth` throwing when used in `PuppeteerCrawler`.
After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result - Apify SDK v1. There were two goals for this release: stability, and adding support for more browsers - Firefox and Webkit (Safari).

The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in, and by releasing SDK v1, we commit to making breaking changes only once a year, with a new major release.
We added support for more browsers by replacing `PuppeteerPool` with
`browser-pool`, a new library that we created specifically for this purpose.
It builds on the ideas from `PuppeteerPool` and extends them to support
Playwright. Playwright is a browser automation library similar to Puppeteer.
It works with all well-known browsers and uses almost the same interface as
Puppeteer, while adding useful features and simplifying common tasks. Don't
worry, you can still use Puppeteer with the new `BrowserPool`.

A large breaking change is that neither `puppeteer` nor `playwright` are bundled with
SDK v1. To make the choice of a library easier and installs faster, users will
have to install the selected modules and versions themselves. This allows us to add
support for even more libraries in the future.
Thanks to the addition of Playwright, we now have a `PlaywrightCrawler`. It is very similar
to `PuppeteerCrawler` and you can pick the one you prefer. It also means we needed to make
some interface changes. The `launchPuppeteerFunction` option of `PuppeteerCrawler` is gone
and `launchPuppeteerOptions` was replaced by `launchContext`. We also moved things around
in the `handlePageFunction` arguments. See the migration guide for a more detailed
explanation and migration examples.
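To make the new interface concrete, here is a minimal sketch of a `PlaywrightCrawler` using `launchContext` (option names are taken from these notes; `playwright` must be installed separately, as described above):

```javascript
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' });

    const crawler = new Apify.PlaywrightCrawler({
        requestQueue,
        // launchContext replaces the old launchPuppeteerOptions
        launchContext: {
            launchOptions: { headless: true },
        },
        handlePageFunction: async ({ page, request }) => {
            console.log(`Visited ${request.url}: ${await page.title()}`);
        },
    });

    await crawler.run();
});
```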
What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well.
- BREAKING: Removed `puppeteer` from dependencies. If you want to use Puppeteer, you must install it yourself.
- BREAKING: Removed `PuppeteerPool`. Use `browser-pool`.
- BREAKING: Removed `PuppeteerCrawlerOptions.launchPuppeteerOptions`. Use `launchContext`.
- BREAKING: Removed `PuppeteerCrawlerOptions.launchPuppeteerFunction`. Use `PuppeteerCrawlerOptions.preLaunchHooks` and `postLaunchHooks`.
- BREAKING: Removed `args.autoscaledPool` and `args.puppeteerPool` from `handle(Page/Request)Function` arguments. Use `args.crawler.autoscaledPool` and `args.crawler.browserPool`.
- BREAKING: The `useSessionPool` and `persistCookiesPerSession` options of crawlers are now `true` by default. Explicitly set them to `false` to override the behavior.
- BREAKING: `Apify.launchPuppeteer()` no longer accepts `LaunchPuppeteerOptions`. It now accepts `PuppeteerLaunchContext`.
- DEPRECATED: `PuppeteerCrawlerOptions.gotoFunction`. Use `PuppeteerCrawlerOptions.preNavigationHooks` and `postNavigationHooks`.
- BREAKING: Removed `Apify.utils.puppeteer.enqueueLinks()`. Deprecated in 01/2019. Use `Apify.utils.enqueueLinks()`.
- BREAKING: Removed `autoscaledPool.(set|get)MaxConcurrency()`. Deprecated in 2019. Use `autoscaledPool.maxConcurrency`.
- BREAKING: Removed `CheerioCrawlerOptions.requestOptions`. Deprecated in 03/2020. Use `CheerioCrawlerOptions.prepareRequestFunction`.
- BREAKING: Removed `Launch.requestOptions`. Deprecated in 03/2020. Use `CheerioCrawlerOptions.prepareRequestFunction`.
- Added `Apify.PlaywrightCrawler` which is almost identical to `PuppeteerCrawler`, but it crawls with the `playwright` library.
- Added the `Apify.launchPlaywright(launchContext)` helper function.
- Added `browserPoolOptions` to `PuppeteerCrawler` to configure `BrowserPool`.
- Added `crawler` to `handle(Request/Page)Function` arguments.
- Added `browserController` to `handlePageFunction` arguments.
- Added `crawler.crawlingContexts` `Map` which includes all running `crawlingContext`s.
- Fix issues with `Apify.pushData()` and `keyValueStore.forEachKey()` by updating `@apify/storage-local` to `1.0.2`.
- Fix `puppeteerPool` missing in handle page arguments.
- Pinned `cheerio` to `1.0.0-rc.3` to avoid install problems in some builds.
- Increased the default `maxEventLoopOverloadedRatio` in `SystemStatusOptions` to 0.6.
- Updated packages and improved docs.
This is the last major release before SDK v1.0.0. We're committed to delivering v1 at the
end of 2020, so stay tuned. Besides Playwright integration via a new `BrowserPool`,
it will be the first release of the SDK that we'll support for an extended period of time.
We will not make any breaking changes until 2.0.0, which will come at the end of
2021. But enough about v1, let's see the changes in 0.22.0.
In this release we've changed a lot of code, but you may not even notice.
We've updated the underlying `apify-client` package, which powers all communication with
the Apify API, to version 1.0.0. This means a completely new API for all internal calls.
If you use `Apify.client` calls in your code, this will be a large breaking change for you.
Visit the client docs to see what's new in the client, but also note that we removed the
default client available under `Apify.client` and replaced it with an `Apify.newClient()`
function. We think it's better to have separate clients for users and internal use.
Until now, local emulation of Apify Storages has been a part of the SDK. We moved the logic
into a separate package, `@apify/storage-local`, which shares its interface with `apify-client`.
`RequestQueue` is now powered by SQLite3 instead of the file system, which improves
reliability and performance quite a bit. `Dataset` and `KeyValueStore` still use the file
system, for easy browsing of data. The structure of the `apify_storage` folder remains unchanged.
After collecting common developer mistakes, we've decided to make argument validation stricter.
You will no longer be able to pass extra arguments to functions and constructors. This is
to alleviate frustration when you mistakenly pass `useChrome` to `PuppeteerPoolOptions`
instead of `LaunchPuppeteerOptions` and don't realize it. Before this version, the SDK wouldn't
let you know and would silently continue with Chromium. Now, it will throw an error saying
that `useChrome` is not an allowed property of `PuppeteerPoolOptions`.
Based on developer feedback, we decided to remove `--no-sandbox` from the default Puppeteer
launch args. It will only be used on the Apify Platform. This gives you the chance to use
your own sandboxing strategy.
`LiveViewServer` and `puppeteerPoolOptions.useLiveView` were never very user-friendly
or performant solutions, due to the inherent performance issues with rapidly taking many
screenshots in Puppeteer. We've decided to remove them. If you need similar functionality,
try the `devtools-server` NPM package, which utilizes the Chrome DevTools Frontend for
screen-casting a live view of the running browser.
Full list of changes:
- BREAKING: Updated `apify-client` to `1.0.0` with a completely new interface. We also removed the `Apify.client` property and replaced it with an `Apify.newClient()` function that creates a new `ApifyClient` instance.
- BREAKING: Removed `--no-sandbox` from default Puppeteer launch arguments. This will most likely be breaking for Linux and Docker users.
- BREAKING: Function argument validation is now more strict and will not accept extra parameters which are not defined by the functions' signatures.
- DEPRECATED: `puppeteerPoolOptions.useLiveView` is now deprecated. Use the `devtools-server` NPM package instead.
- Added `postResponseFunction` to `CheerioCrawlerOptions`. It allows you to override properties on the HTTP response before processing by `CheerioCrawler`.
- Added HTTP2 support to `utils.requestAsBrowser()`. Set `useHttp2` to `true` in `RequestAsBrowserOptions` to enable it.
- Fixed handling of XML content types in `CheerioCrawler`.
- Fixed capitalization of headers when using `utils.puppeteer.addInterceptRequestHandler`.
- Fixed `utils.puppeteer.saveSnapshot()` overwriting screenshots with HTML on local.
- Updated `puppeteer` to version `5.4.1` with Chrom(ium) 87.
- Removed `RequestQueueLocal` in favor of the `@apify/storage-local` API emulator.
- Removed `KeyValueStoreLocal` in favor of the `@apify/storage-local` API emulator.
- Removed `DatasetLocal` in favor of the `@apify/storage-local` API emulator.
- Removed the `userData` option from `Apify.utils.enqueueLinks` (deprecated in Jun 2019). Use `transformRequestFunction` instead.
- Removed `instanceKillerIntervalMillis` and `killInstanceAfterMillis` (deprecated in Feb 2019). Use `instanceKillerIntervalSecs` and `killInstanceAfterSecs` instead.
- Removed the `memory` option from `Apify.call` `options` (deprecated in 2018). Use `memoryMbytes` instead.
- Removed `delete()` methods from `Dataset`, `KeyValueStore` and `RequestQueue` (deprecated in Jul 2019). Use `.drop()`.
- Removed `utils.puppeteer.hideWebDriver()` (deprecated in May 2019). Use `LaunchPuppeteerOptions.stealth`.
- Removed `utils.puppeteer.enqueueRequestsFromClickableElements()` (deprecated in 2018). Use `utils.puppeteer.enqueueLinksByClickingElements`.
- Removed `request.doNotRetry()` (deprecated in June 2019). Use `request.noRetry = true`.
- Removed `RequestListOptions.persistSourcesKey` (deprecated in Feb 2020). Use `persistRequestsKey`.
- Removed the `The function passed to Apify.main() threw an exception` error message, because it was confusing to users.
- Removed automatic injection of `charset=utf-8` in `keyValueStore.setValue()` to the `contentType` option.
- Technical release, see 0.22.1
- Pinned `cheerio` to `1.0.0-rc.3` to avoid install problems in some builds.
- Bump Puppeteer to 5.5.0 and Chrom(ium) 88.
- Fix various issues in `stealth`.
- Fix `SessionPool` not retiring sessions immediately when they become unusable. It fixes a problem where `PuppeteerPool` would not retire browsers with bad sessions.
- Make `PuppeteerCrawler` safe against malformed Puppeteer responses.
- Update default user agent to Chrome 86
- Bump Puppeteer to 5.3.1 with Chromium 86
- Fix an error in `PuppeteerCrawler` caused by `page.goto()` randomly returning `null`.
It appears that `CheerioCrawler` was correctly retiring sessions on timeouts
and blocked status codes (401, 403, 429), whereas `PuppeteerCrawler` did not.
Apologies for the omission; this release fixes the problem.
- Fix sessions not being retired on blocked status codes in `PuppeteerCrawler`.
- Fix sessions not being marked bad on navigation timeouts in `PuppeteerCrawler`.
- Update `apify-shared` to version `0.5.0`.
This is a very minor release that fixes some issues that were preventing use of the SDK with Node 14.
- Update the request serialization process which is used in `RequestList` to work with Node 10+.
- Update some TypeScript types that were preventing builds due to changes in typed dependencies.
The statistics that you may remember from logs are now persisted in the key-value store, so you won't lose count when your actor restarts. We've also added a lot of new stats in there, which can be useful to you after a run finishes. Besides that, we fixed some bugs and annoyances and improved the TypeScript experience a bit.

- Add persistence to the `Statistics` class and automatically persist it in `BasicCrawler`.
- Fix an issue where an inaccessible Apify Proxy would cause `ProxyConfiguration` to throw a timeout error.
- Update default user agent to Chrome 85
- Bump Puppeteer to 5.2.1 which uses Chromium 85
- TypeScript: Fix `RequestAsBrowserOptions` missing some values and add `RequestQueueInfo` as a return value from `requestQueue.getInfo()`
- Fix useless logging in `Session`.
- Fix cookies with a leading dot in the domain (as extracted from Puppeteer) not being correctly added to Sessions.
We fixed some bugs, improved a few things and bumped Puppeteer to match latest Chrome 84.
- Allow `Apify.createProxyConfiguration` to be used seamlessly with the proxy component of Actor Input UI.
- Fix integration of plugins into `CheerioCrawler` with the `crawler.use()` function.
- Fix a race condition which caused `RequestQueueLocal` to fail handling requests.
- Fix broken debug logging in `SessionPool`.
- Improve `ProxyConfiguration` error message for missing password / token.
- Update Puppeteer to 5.2.0
- Improve docs, update packages and so on.
This release comes with breaking changes that will affect most, if not all of your projects. See the migration guide for more information and examples.
The first large change is a redesigned proxy configuration. Cheerio and Puppeteer crawlers
now accept a `proxyConfiguration` parameter, which is an instance of `ProxyConfiguration`.
This class now exclusively manages both Apify Proxy and custom proxies. Visit the new
proxy management guide for more information.
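A minimal sketch using only the calls named in these notes (`Apify.createProxyConfiguration()` and `proxyConfiguration.newUrl([sessionId])`); it assumes Apify Proxy credentials are configured in the environment:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // The factory function creates a ProxyConfiguration instance;
    // the class itself is not exposed.
    const proxyConfiguration = await Apify.createProxyConfiguration();

    // Crawlers accept it directly via the proxyConfiguration option,
    // and you can also generate proxy URLs manually, optionally
    // tied to a session:
    const proxyUrl = proxyConfiguration.newUrl('my-session-id');
    console.log(proxyUrl);
});
```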
We also removed `Apify.utils.getRandomUserAgent()` as it was no longer effective
in avoiding bot detection, and changed the default values for empty properties in
`Request` instances.
- BREAKING: Removed `Apify.getApifyProxyUrl()`. To get an Apify Proxy URL, use `proxyConfiguration.newUrl([sessionId])`.
- BREAKING: Removed `useApifyProxy`, `apifyProxyGroups` and `apifyProxySession` parameters from all applications in the SDK. Use `proxyConfiguration` in crawlers and `proxyUrl` in `requestAsBrowser` and `Apify.launchPuppeteer`.
- BREAKING: Removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection.
- BREAKING: `Request` instances no longer initialize empty properties with `null`, which means that:
  - empty `errorMessages` are now represented by `[]`, and
  - empty `loadedUrl`, `payload` and `handledAt` are `undefined`.
- Add the `Apify.createProxyConfiguration()` `async` function to create `ProxyConfiguration` instances. `ProxyConfiguration` itself is not exposed.
- Add `proxyConfiguration` to `CheerioCrawlerOptions` and `PuppeteerCrawlerOptions`.
- Add `proxyInfo` to `CheerioHandlePageInputs` and `PuppeteerHandlePageInputs`. You can use this object to retrieve information about the currently used proxy in `Puppeteer` and `Cheerio` crawlers.
- Add click buttons and scroll up options to `Apify.utils.puppeteer.infiniteScroll()`.
- Fixed a bug where intercepted requests would never continue.
- Fixed a bug where `Apify.utils.requestAsBrowser()` would get into redirect loops.
- Fix `Apify.utils.getMemoryInfo()` crashing the process on AWS Lambda and on systems running in Docker without memory cgroups enabled.
- Update Puppeteer to 3.3.0.
- Add `Apify.utils.waitForRunToFinish()` which simplifies waiting for an actor run to finish.
- Add standard prefixes to log messages to improve readability and orientation in logs.
- Add support for `async` handlers in `Apify.utils.puppeteer.addInterceptRequestHandler()`.
- EXPERIMENTAL: Add the `cheerioCrawler.use()` function to enable attaching a `CrawlerExtension` (a plugin that extends functionality) to the crawler to modify its behavior.
- Fix bug with cookie expiry in `SessionPool`.
- Fix issues in documentation.
- Updated `@apify/http-request` to fix an issue in the `proxy-agent` package.
- Updated Puppeteer to 3.0.2
- DEPRECATED: `CheerioCrawlerOptions.requestOptions` is now deprecated. Please use `CheerioCrawlerOptions.prepareRequestFunction` instead.
- Add `limit` option to `Apify.utils.enqueueLinks()` for situations when full crawls are not needed.
- Add `suggestResponseEncoding` and `forceResponseEncoding` options to `CheerioCrawler` to allow users to provide a fall-back or forced encoding of responses in situations where websites serve invalid encoding information in their headers.
- Add a number of new examples and update existing ones in documentation.
- Fix duplicate file extensions in `Apify.utils.puppeteer.saveSnapshot()` when used locally.
- Fix encoding of multi-byte characters in `CheerioCrawler`.
- Fix formatting of navigation buttons in documentation.
- Fix an error where persistence of `SessionPool` would fail if a cookie included an invalid `expires` value.
- Jumping a patch version because of an error in publishing via CI.
- BREAKING: `Apify.utils.requestAsBrowser()` no longer aborts requests on status code 406 or when a type other than `text/html` is received. Use `options.abortFunction` if you want to retain this functionality.
- BREAKING: Added the `useInsecureHttpParser` option to `Apify.utils.requestAsBrowser()`, which is `true` by default and forces the function to use an HTTP parser that is less strict than the default Node 12 parser, but also less secure. It is needed to be able to bypass certain anti-scraping walls and fetch websites that do not comply with the HTTP spec.
- BREAKING: `RequestList` now removes all the elements from the `sources` array on initialization. If you need to use the sources somewhere else, make a copy. This change was added as one of several measures to improve memory management of `RequestList` in scenarios with a very large number of `Request` instances.
- DEPRECATED: `RequestListOptions.persistSourcesKey` is now deprecated. Please use `RequestListOptions.persistRequestsKey`.
- `RequestList.sources` can now be an array of `string` URLs as well.
- Added `sourcesFunction` to `RequestListOptions`. It enables dynamic fetching of sources and will only be called if persisted `Requests` were not retrieved from the key-value store. Use it to reduce memory spikes and also to make sure that your sources are not re-created on actor restarts.
- Updated `stealth` hiding of `webdriver` to avoid recent detections.
- `Apify.utils.log` now points to an updated logger instance which prints colored logs (in TTY) and supports overriding with custom loggers.
- Improved `Apify.launchPuppeteer()` code to prevent triggering bugs in Puppeteer by passing more than the required options to `puppeteer.launch()`.
- Documented the `BasicCrawler.autoscaledPool` property, and added `CheerioCrawler.autoscaledPool` and `PuppeteerCrawler.autoscaledPool` properties.
- `SessionPool` now persists state on `teardown`. Before, it only persisted state every minute. This ensures that after a crawler finishes, the state is correctly persisted.
- Added TypeScript typings and typedef documentation for all entities used throughout the SDK.
- Upgraded the `proxy-chain` NPM package from 0.2.7 to 0.4.1 and many other dependencies.
- Removed all usage of the now deprecated `request` package.
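As a minimal illustration of string sources, here is a sketch against this era's API (the constructor-plus-`initialize()` flow is an assumption, not confirmed by these notes):

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // `sources` may now mix plain string URLs with request objects.
    const requestList = new Apify.RequestList({
        sources: [
            'https://example.com/page-1',
            { url: 'https://example.com/page-2', userData: { label: 'DETAIL' } },
        ],
    });
    await requestList.initialize();
});
```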
- BREAKING (EXPERIMENTAL): `session.checkStatus()` -> `session.retireOnBlockedStatusCodes()`. The `Session` API is no longer considered experimental.
- Updates documentation and introduces a few internal changes.
- BREAKING: The `APIFY_LOCAL_EMULATION_DIR` env var is no longer supported (deprecated on 2018-09-11). Use `APIFY_LOCAL_STORAGE_DIR` instead.
- `SessionPool` API updates and fixes. The API is no longer considered experimental.
- Logging of system info moved from `require` time to `Apify.main()` invocation.
- Use native `RegExp` instead of `xregexp` for unicode property escapes.
- Fix `SessionPool` not automatically working in `CheerioCrawler`.
- Fix incorrect management of page count in `PuppeteerPool`.
- BREAKING: `CheerioCrawler` ignores SSL errors by default - `options.ignoreSslErrors: true`.
- Add `SessionPool` implementation to `CheerioCrawler`.
- Add `SessionPool` implementation to `PuppeteerPool` and `PuppeteerCrawler`.
- Fix `Request` constructor not making a copy of objects such as `userData` and `headers`.
- Fix `desc` option not being applied in local `dataset.getData()`.
- BREAKING: Node 8 and 9 are no longer supported. Please use Node 10.17.0 or higher.
- DEPRECATED: `Apify.callTask()` `body` and `contentType` options are now deprecated. Use `input` instead. It must be of `content-type: application/json`.
- Add default `SessionPool` implementation to `BasicCrawler`.
- Add the ability to create ad-hoc webhooks via `Apify.call()` and `Apify.callTask()`.
- Add an example of form filling with `Puppeteer`.
- Add `country` option to `Apify.getApifyProxyUrl()`.
- Add `Apify.utils.puppeteer.saveSnapshot()` helper to quickly save HTML and a screenshot of a page.
- Add the ability to pass `got` supported options to `requestOptions` in `CheerioCrawler`, thus supporting things such as `cookieJar` again.
- Switch Puppeteer to web socket again due to suspected `pipe` errors.
- Fix an issue where some encodings were not correctly parsed in `CheerioCrawler`.
- Fix parsing of bad Content-Type headers for `CheerioCrawler`.
- Fix custom headers not being correctly applied in `Apify.utils.requestAsBrowser()`.
- Fix dataset limits not being correctly applied.
- Fix a race condition in `RequestQueueLocal`.
- Fix `RequestList` persistence of downloaded sources in the key-value store.
- Fix `Apify.utils.puppeteer.blockRequests()` always including default patterns.
- Fix inconsistent behavior of `Apify.utils.puppeteer.infiniteScroll()` on some websites.
- Fix retry histogram statistics sometimes showing invalid counts.
- Added regexps for YouTube videos (`YOUTUBE_REGEX`, `YOUTUBE_REGEX_GLOBAL`) to `utils.social`.
- Added documentation for the `json` option in `handlePageFunction` of `CheerioCrawler`.
- Bump Puppeteer to 2.0.0 and use `{ pipe: true }` again because the upstream bug has been fixed.
- Add `useIncognitoPages` option to `PuppeteerPool` to enable opening new pages in incognito browser contexts. This is useful to keep cookies and cache unique for each page.
- Added options to load every content type in `CheerioCrawler`. There are new options `body` and `contentType` in `handlePageFunction` for this purpose.
- DEPRECATED: The `CheerioCrawler` `html` option in `handlePageFunction` was replaced with the `body` option.
- This release updates `@apify/http-request` to version 1.1.2.
- Update `CheerioCrawler` to use `requestAsBrowser()` to better disguise as a real browser.
- This release just updates some dependencies (not Puppeteer).
- DEPRECATED: `dataset.delete()`, `keyValueStore.delete()` and `requestQueue.delete()` methods have been deprecated in favor of `*.drop()` methods, because the `drop` name more clearly communicates the fact that those methods drop / delete the storage itself, not individual elements in the storage.
- Added the `Apify.utils.requestAsBrowser()` helper function that enables you to make HTTP(S) requests disguised as a browser (Firefox). This may help in overcoming certain anti-scraping and anti-bot protections.
- Added `options.gotoTimeoutSecs` to `PuppeteerCrawler` to enable easier setting of navigation timeouts.
- `PuppeteerPool` options that were deprecated from the `PuppeteerCrawler` constructor were finally removed. Please use `maxOpenPagesPerInstance`, `retireInstanceAfterRequestCount`, `instanceKillerIntervalSecs`, `killInstanceAfterSecs` and `proxyUrls` via the `puppeteerPoolOptions` object.
- On the Apify Platform a warning will now be printed when using an outdated `apify` package version.
- `Apify.utils.puppeteer.enqueueLinksByClickingElements()` will now print a warning when the nodes it tries to click become modified (detached from DOM). This is useful to debug unexpected behavior.
- `Apify.launchPuppeteer()` now accepts `proxyUrl` with the `https`, `socks4` and `socks5` schemes, as long as it doesn't contain a username or password. This is to fix Issue #420.
- Added the `desiredConcurrency` option to the `AutoscaledPool` constructor, and removed an unnecessary bound check from the setter property.
- Fix an error where Puppeteer would fail to launch when pipes are turned off.
- Switch back to default Web Socket transport for Puppeteer due to upstream issues.
- BREAKING CHANGE: Removed support for Web Driver (Selenium) since no further updates are planned. If you wish to continue using Web Driver, please stay on Apify SDK version ^0.14.15.
- BREAKING CHANGE: `Dataset.getData()` throws an error if the user provides an unsupported option when using local disk storage.
- DEPRECATED: `options.userData` of `Apify.utils.enqueueLinks()` is deprecated. Use `options.transformRequestFunction` instead.
- Improve logging of memory overload errors.
- Improve error message in `Apify.call()`.
- Fix multiple log lines appearing when a crawler was about to finish.
- Add `Apify.utils.puppeteer.enqueueLinksByClickingElements()` function which enables you to add requests to the queue from pure JavaScript navigations, form submissions etc.
- Add `Apify.utils.puppeteer.infiniteScroll()` function which helps you with scrolling to the bottom of websites that auto-load new content.
- The `RequestQueue.handledCount()` function has been resurrected from deprecation, in order to have a compatible interface with `RequestList`.
- Add `useExtendedUniqueKey` option to the `Request` constructor to include `method` and `payload` in the `Request`'s computed `uniqueKey`.
- Updated Puppeteer to 1.18.1
- Updated `apify-client` to 0.5.22
- Fixes in `RequestQueue` to deal with inconsistencies in the underlying data storage
- BREAKING CHANGE: `RequestQueue.addRequest()` now sets the ID of the newly added request to the passed `Request` object
- The `RequestQueue.handledCount()` function has been deprecated, please use `RequestQueue.getInfo()` instead.
- Fix error where live view would crash when started with concurrency already higher than 1.
- Fix `POST` requests in Puppeteer.
- `Snapshotter` will now log critical memory overload warnings at most once per 10 seconds.
- Live view snapshots are now made right after navigation finishes, instead of right before page close.
- Add `Statistics` class to track crawler run statistics.
- Use pipes instead of web sockets in Puppeteer to improve performance and stability.
- Add warnings to all functions using Puppeteer's request interception to inform users about its performance impact caused by automatic cache disabling.
- DEPRECATED: `Apify.utils.puppeteer.blockResources()` because of negative impact on performance. Use `.blockRequests()` (see below).
- Add `Apify.utils.puppeteer.blockRequests()` to enable blocking URL patterns without request interception involved. This is a replacement for `.blockResources()` until performance issues with request interception resolve.
- Update `Puppeteer` to 1.17.0.
- Add `idempotencyKey` parameter to `Apify.addWebhook()`.
- Better logs from the `AutoscaledPool` class
- Replace `cpuInfo` Apify event with new `systemInfo` event in `Snapshotter`.
- Bump `apify-client` to 0.5.17
- Bump `apify-client` to 0.5.16
- Stringification to JSON of actor input in `Apify.call()`, `Apify.callTask()` and `Apify.metamorph()` now also supports functions via `func.toString()`. The same holds for the record body in the `setValue()` method of the key-value store.
- The request queue now monitors the number of clients that accessed the queue, which allows crawlers to finish without the 10s wait if the run was not migrated during its lifetime.
- Update Puppeteer to 1.15.0.
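The function-stringification idea above can be sketched in plain JavaScript. This is an illustration of the concept only (a hypothetical `stringifyWithFunctions` helper, not the SDK's internal code): a `JSON.stringify` replacer falls back to `func.toString()` for values that are functions, so an input object carrying e.g. a `pageFunction` survives serialization.

```javascript
// Replacer-based sketch: functions become their source-code strings
// instead of being dropped by JSON.stringify.
function stringifyWithFunctions(input) {
    return JSON.stringify(input, (key, value) => (
        typeof value === 'function' ? value.toString() : value
    ), 2);
}

const json = stringifyWithFunctions({
    startUrl: 'https://example.com',
    pageFunction: function pageFunction(context) { return context.title; },
});
console.log(json.includes('function pageFunction')); // true — source text preserved
```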
- Added the `stealth` option to `launchPuppeteerOptions` which decreases the chance of headless browser detection.
- DEPRECATED: `Apify.utils.puppeteer.hideWebDriver`, use `launchPuppeteerOptions.stealth` instead.
- `CheerioCrawler` now parses HTML using streams. This improves performance and memory usage in most cases.
- The request queue now allows crawlers to finish quickly without waiting, in case the queue was used by a single client.
- Better logging of errors in `Apify.main()`
- Fix invalid type check in `puppeteerModule`.
- Made UI and UX improvements to `LiveViewServer` functionality.
- `launchPuppeteerOptions.puppeteerModule` now supports `Object` (pre-required modules).
- Removed the `--enable-resource-load-scheduler=false` Chromium command line flag, it has no effect. See https://bugs.chromium.org/p/chromium/issues/detail?id=723233
- Fixed inconsistency in `prepareRequestFunction` of `CheerioCrawler`.
- Update Puppeteer to 1.14.0
- BREAKING CHANGE: Live View is no longer available by passing `liveView = true` to `launchPuppeteerOptions`.
- A new version of Live View is available by passing the `useLiveView = true` option to `PuppeteerPool`.
  - Only shows snapshots of a single page from a single browser.
  - Only makes snapshots when a client is connected, having very low performance impact otherwise.
- Added `Apify.utils.puppeteer.addInterceptRequestHandler` and `removeInterceptRequestHandler` which can be used to add multiple request interception handlers to Puppeteer's pages.
- Added `puppeteerModule` to `LaunchPuppeteerOptions` which enables the use of other Puppeteer modules, such as `puppeteer-extra`, instead of plain `puppeteer`.
- Fix a bug where an invalid response from `RequestQueue` would occasionally cause crawlers to crash.
- Fix `RequestQueue` throttling at high concurrency.
- Fix bug in `addWebhook` invocation.
- Fix `puppeteerPoolOptions` object not being used in `PuppeteerCrawler`.
- Fix `REQUEST_QUEUE_HEAD_MAX_LIMIT is not defined` error.
- `Snapshotter` now marks the Apify Client overloaded on the basis of 2nd retry errors.
- Added `Apify.addWebhook()` to invoke a webhook when an actor run ends. Currently this only works on the Apify Platform and will print a warning when run locally.
- BREAKING CHANGE: Added `puppeteerOperationTimeoutSecs` option to `PuppeteerPool`. It defaults to 15 seconds and all Puppeteer operations such as `browser.newPage()` or `puppeteer.launch()` will now time out. This is to prevent hanging requests.
- BREAKING CHANGE: Added `handleRequestTimeoutSecs` option to `BasicCrawler` with a 60 second default.
- DEPRECATED: `PuppeteerPool` options in the `PuppeteerCrawler` constructor are now deprecated. Please use the new `puppeteerPoolOptions` argument of type `Object` to pass them. `launchPuppeteerFunction` and `launchPuppeteerOptions` are still available as shortcuts for convenience.
- `CheerioCrawler` and `PuppeteerCrawler` now automatically set `handleRequestTimeoutSecs` to 10 times their `handlePageTimeoutSecs`. This is a precaution that should keep requests from hanging forever.
- Added `options.prepareRequestFunction()` to the `CheerioCrawler` constructor to enable modification of `Request` before the HTTP request is made to the target URL.
- Added back the `recycleDiskCache` option to `PuppeteerPool` now that it is supported even in headless mode (read more)
- Parameters `input` and `options` added to `Apify.callTask()`.
- Added oldest active tab focusing to `PuppeteerPool` to combat resource throttling in Chromium.
- Added `Apify.metamorph()`, see documentation for more information.
- Added `Apify.getInput()`
- BREAKING CHANGE: Reduced default `handlePageTimeoutSecs` for both `CheerioCrawler` and `PuppeteerCrawler` from 300 to 60 seconds, in order to prevent stalling crawlers.
- BREAKING CHANGE: `PseudoUrl` now performs case-insensitive matching, even for the query string part of the URLs. If you need case-sensitive matching, use an appropriate `RegExp` in place of a Pseudo URL string.
- Upgraded to puppeteer@1.12.2 and xregexp@4.2.4
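To make the `PseudoUrl` breaking change concrete, here is a plain-`RegExp` sketch (illustrative only, not `PseudoUrl` internals) of the difference between case-insensitive and case-sensitive matching of a URL's query string:

```javascript
// A case-insensitive pattern (the new PseudoUrl-like behavior) matches the
// query string regardless of letter case; a plain case-sensitive RegExp is
// the escape hatch when exact case matters.
const insensitive = /^https:\/\/example\.com\/search\?q=books$/i;
const sensitive = /^https:\/\/example\.com\/search\?q=books$/;

const url = 'https://example.com/search?Q=BOOKS';
console.log(insensitive.test(url)); // true
console.log(sensitive.test(url));   // false
```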
- Added `loadedUrl` property to `Request` that contains the final URL of the loaded page after all redirects.
- Added memory overload warning log message.
- Added `keyValueStore.getPublicUrl` function.
- Added `minConcurrency`, `maxConcurrency`, `desiredConcurrency` and `currentConcurrency` properties to `AutoscaledPool`, improved docs
- Deprecated `AutoscaledPool.setMinConcurrency` and `AutoscaledPool.setMaxConcurrency` functions
- Updated `DEFAULT_USER_AGENT` and `USER_AGENT_LIST` with new User Agents
- Bugfix: `LocalRequestQueue.getRequest()` threw an exception if request was not found
- Added `RequestQueue.getInfo()` function
- Improved `Apify.main()` to provide nicer stack traces on errors
- `Apify.utils.puppeteer.injectFile()` now supports injection that survives page navigations and caches file contents.
- Fix the `keyValueStore.forEachKey()` method.
- Fix version of `puppeteer` to prevent errors with automatic updates.
- Apify SDK now logs basic system info when required.
- Added `utils.createRequestDebugInfo()` function to create standardized debug info from request and response.
- `PseudoUrl` can now be constructed with a `RegExp`.
- `Apify.utils.enqueueLinks()` now accepts `RegExp` instances in its `pseudoUrls` parameter.
- `Apify.utils.enqueueLinks()` now accepts a `baseUrl` option that enables resolution of relative URLs when parsing a Cheerio object. (It's done automatically in the browser when using Puppeteer.)
- Better error message for an invalid `launchPuppeteerFunction` passed to `PuppeteerPool`.
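What the `baseUrl` option enables can be sketched with the standard WHATWG `URL` class: relative hrefs scraped from static HTML get resolved against the page's URL. The `resolveLinks` helper is hypothetical, for illustration only (in a browser context Puppeteer resolves links automatically):

```javascript
// Resolve relative hrefs (as extracted from a Cheerio-parsed page)
// against the page URL, the way a browser would.
function resolveLinks(hrefs, baseUrl) {
    return hrefs.map((href) => new URL(href, baseUrl).href);
}

const resolved = resolveLinks(
    ['/about', 'contact.html', 'https://other.com/'],
    'https://example.com/dir/',
);
console.log(resolved);
// ['https://example.com/about', 'https://example.com/dir/contact.html', 'https://other.com/']
```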
- DEPRECATION WARNING: `Apify.utils.puppeteer.enqueueLinks()` was moved to `Apify.utils.enqueueLinks()`.
- `Apify.utils.enqueueLinks()` now supports an `options.$` property to enqueue links from a Cheerio object.
- Disabled the `PuppeteerPool` `reusePages` option for now, due to a memory leak.
- Added a `keyValueStore.forEachKey()` method to iterate all keys in the store.
- Improvements in `Apify.utils.social.parseHandlesFromHtml` and `Apify.utils.htmlToText`
- Updated docs
- Fix `reusePages` causing Puppeteer to fail when used together with request interception.
- Fix missing `reusePages` configuration parameter in `PuppeteerCrawler`.
- Fix a memory leak where `reusePages` would prevent browsers from closing.
- Fix missing `autoscaledPool` parameter in `handlePageFunction` of `PuppeteerCrawler`.
- BREAKING CHANGE: `basicCrawler.abort()`, `cheerioCrawler.abort()` and `puppeteerCrawler.abort()` functions were removed in favor of a single `autoscaledPool.abort()` function.
- Added a reference to the running `AutoscaledPool` instance to the options object of `BasicCrawler`'s `handleRequestFunction` and to the `handlePageFunction` of `CheerioCrawler` and `PuppeteerCrawler`.
- Added a sources persistence option to `RequestList` that works best in conjunction with the state persistence, but can be toggled separately too.
- Added `Apify.openRequestList()` function to place it in line with `RequestQueue`, `KeyValueStore` and `Dataset`. A `RequestList` created using this function will automatically persist state and sources.
- Added `pool.pause()` and `pool.resume()` functions to `AutoscaledPool`. You can now pause the pool, which will prevent additional tasks from being run and wait for the running ones to finish.
- Fixed a memory leak in `CheerioCrawler` and potentially other crawlers.
- Added `Apify.utils.htmlToText()` function to convert HTML to text and removed the unnecessary `html-to-text` dependency. The new function is now used in `Apify.utils.social.parseHandlesFromHtml()`.
- Updated `DEFAULT_USER_AGENT`
- `autoscaledPool.isFinishedFunction()` and `autoscaledPool.isTaskReadyFunction()` exceptions will now cause the `Promise` returned by `autoscaledPool.run()` to reject instead of just logging a message. This is in line with the `autoscaledPool.runTaskFunction()` behavior.
- Bugfix: `PuppeteerPool` was incorrectly overriding `proxyUrls` even if they were not defined.
- Fixed an issue where an error would be thrown when `datasetLocal.getData()` was invoked with an overflowing offset. It now correctly returns an empty `Array`.
- Added the `reusePages` option to `PuppeteerPool`. It will now reuse existing tabs instead of opening new ones for each page when enabled.
- `BasicCrawler` (and therefore all Crawlers) now logs a message explaining why it finished.
- Fixed an issue where the `maxRequestsPerCrawl` option would not be honored after restart or migration.
- Fixed an issue with timeout promises that would sometimes keep the process hanging.
- `CheerioCrawler` now accepts `gzip` and `deflate` compressed responses.
- Upgraded Puppeteer to 1.11.0
- DEPRECATION WARNING: `Apify.utils.puppeteer.enqueueLinks()` now uses an options object instead of individual parameters and supports passing of `userData` to the enqueued `request`. Previously: `enqueueLinks(page, selector, requestQueue, pseudoUrls)`. Now: `enqueueLinks({ page, selector, requestQueue, pseudoUrls, userData })`. Using individual parameters is DEPRECATED.
- Added API response tracking to `AutoscaledPool`, leveraging the `Apify.client.stats` object. It now considers the system overloaded when a large number of 429 - Too Many Requests responses is received.
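The rate-limit tracking above can be sketched with a sliding-window counter. The `createRateLimitTracker` helper is a hypothetical illustration of the idea, not `AutoscaledPool`'s real implementation:

```javascript
// Count 429 responses within a time window; past a threshold, report the
// system as overloaded so a pool could scale down instead of retrying harder.
function createRateLimitTracker({ windowMillis, maxErrors }) {
    const timestamps = [];
    return {
        record429: (now = Date.now()) => timestamps.push(now),
        isOverloaded: (now = Date.now()) => {
            // Drop 429s that fell out of the sliding window.
            while (timestamps.length && timestamps[0] < now - windowMillis) timestamps.shift();
            return timestamps.length >= maxErrors;
        },
    };
}

const tracker = createRateLimitTracker({ windowMillis: 1000, maxErrors: 3 });
[0, 10, 20].forEach((t) => tracker.record429(t));
console.log(tracker.isOverloaded(30));   // true — three 429s inside the window
console.log(tracker.isOverloaded(2000)); // false — the window has passed
```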
- Updated NPM packages to fix a vulnerability reported at dominictarr/event-stream#116
- Added a warning if Node.js is an older version that doesn't support the regular expression syntax used by the tools in the `Apify.utils.social` namespace, instead of failing to start.
- Added back support for the `memory` option in `Apify.call()`; a deprecation warning is written instead of silently failing.
- Improvements in `Apify.utils.social` functions and tests
- Added new `Apify.utils.social` namespace with functions to extract emails, phone numbers and social profile URLs from HTML and text documents. Specifically, it supports Twitter, LinkedIn, Instagram and Facebook profiles.
- Updated NPM dependencies
- `Apify.launchPuppeteer()` now sets the `defaultViewport` option if not provided by the user, to improve screenshots and debugging experience.
- Bugfix: `Dataset.getInfo()` sometimes returned an object with an `itemsCount` field instead of `itemCount`
- Improvements in deployment script.
- Bugfix: `Apify.call()` was causing a permissions error.
- Automatically adding the `--enable-resource-load-scheduler=false` Chrome flag in `Apify.launchPuppeteer()` to make crawling of pages in all tabs run equally fast.
- Bug fixes and improvements of internals.
- Package updates.
- Added the ability of `CheerioCrawler` to request and download only `text/html` responses.
- Added a workaround for a long-standing `tunnel-agent` package error to `CheerioCrawler`.
- Added `request.doNotRetry()` function to prevent further retries of a `request`.
- Deprecated the `request.ignoreErrors` option. Use `request.doNotRetry`.
- Fixed `Apify.utils.puppeteer.enqueueLinks` to allow a `null` value for the `pseudoUrls` param
- Fixed `RequestQueue.addRequest()` to gracefully handle invalid URLs
- Renamed `RequestOperationInfo` to `QueueOperationInfo`
- Added `request` field to `QueueOperationInfo`
- DEPRECATION WARNING: The `timeoutSecs` parameter of `Apify.call()` is used for the actor run timeout. To set the time to wait for the run to finish, use the `waitSecs` parameter.
- DEPRECATION WARNING: The `memory` parameter of `Apify.call()` was renamed to `memoryMbytes`.
- Added `Apify.callTask()` that enables starting an actor task and fetching its output.
- Added an option enforcing cloud storage to be used in `openKeyValueStore()`, `openDataset()` and `openRequestQueue()`
- Added `autoscaledPool.setMinConcurrency()` and `autoscaledPool.setMaxConcurrency()`
- Fix a bug in `CheerioCrawler` where `useApifyProxy` would only work with `apifyProxyGroups`.
- Reworked `request.pushErrorMessage()` to support any message and not throw.
- Added Apify Proxy (`useApifyProxy`) support to `CheerioCrawler`.
- Added custom `proxyUrls` support to `PuppeteerPool` and `CheerioCrawler`.
- Added Actor UI `pseudoUrls` output support to `Apify.utils.puppeteer.enqueueLinks()`.
- Created dedicated project page at https://sdk.apify.com
- Improved docs, guides and other texts, pointed links to the new page
- Bugfix in `PuppeteerPool`: pages were sometimes considered closed even though they weren't
- Improvements in documentation
- Upgraded Puppeteer to 1.9.0
- Added `Apify.utils.puppeteer.cacheResponses` to enable response caching in headless Chromium.
- Fixed `AutoscaledPool` terminating before all tasks are finished.
- Migrated to v0.1.0 of `apify-shared`.
- Allow `AutoscaledPool` to run tasks up to `minConcurrency` even when the system is overloaded.
- Upgraded `@apify/ps-tree` dependency (fixes "Error: spawn ps ENFILE"), upgraded other NPM packages
- Updated documentation and README, consolidated images.
- Added CONTRIBUTING.md
- Updated documentation and README.
- Bugfixes in `RequestQueueLocal`
- Updated documentation and README.
- Optimized autoscaled pool default configuration.
- BREAKING CHANGES IN AUTOSCALED POOL
  - It has been completely rebuilt for better performance.
  - It also now works locally.
  - See the Migration Guide for more information.
- Updated to apify-shared@0.0.58
- Bug fixes and documentation improvements.
- Upgraded Puppeteer to 1.8.0
- Upgraded NPM dependencies, fixed lint errors
- `Apify.main()` now sets the `APIFY_LOCAL_STORAGE_DIR` env var to a default value if neither `APIFY_LOCAL_STORAGE_DIR` nor `APIFY_TOKEN` is defined
- Updated `DEFAULT_USER_AGENT` and `USER_AGENT_LIST`
- Added `recycleDiskCache` option to `PuppeteerPool` to enable reuse of disk cache and thus speed up browsing
- WARNING: `APIFY_LOCAL_EMULATION_DIR` environment variable was renamed to `APIFY_LOCAL_STORAGE_DIR`.
- Environment variables `APIFY_DEFAULT_KEY_VALUE_STORE_ID`, `APIFY_DEFAULT_REQUEST_QUEUE_ID` and `APIFY_DEFAULT_DATASET_ID` now have the default value `default`, so there is no need to define them when developing locally.
- Added `compileScript()` function to `utils.puppeteer` to enable use of external scripts at runtime.
- Fixed persistent deprecation warning of `pageOpsTimeoutMillis`.
- Moved `cheerio` to dependencies.
- Fixed `keepDuplicateUrls` errors with a persistent `RequestList`.
- Added `getInfo()` method to `Dataset` to get meta-information about a dataset.
- Added `CheerioCrawler`, a specialized class for crawling the web using `cheerio`.
- Added `keepDuplicateUrls` option to `RequestList` to allow duplicate URLs.
- Added `.abort()` method to all Crawler classes to enable stopping the crawl programmatically.
- Deprecated `pageOpsTimeoutMillis` option. Use `handlePageTimeoutSecs`.
- Bluebird promises are being phased out of `apify` in favor of `async-await`.
- Added `log` to `Apify.utils` to improve logging experience.
- Replaced git-hosted version of our fork of ps-tree with @apify/ps-tree package
- Removed old unused `Apify.readyFreddy()` function
- Improved logging of URL and port in `PuppeteerLiveViewBrowser`.
- `PuppeteerCrawler`'s default page load timeout changed from 30 to 60 seconds.
- Added `Apify.utils.puppeteer.blockResources()` function
- More efficient implementation of the `getMemoryInfo` function
- Puppeteer upgraded to 1.7.0
- Upgraded NPM dependencies
- Dropped support for Node 7
- Fixed unresponsive magnifying glass and improved status tracking in LiveView frontend
- Fixed invalid URL parsing in RequestList.
- Added support for non-Latin language characters (unicode) in URLs.
- Added validation of payload size and automatic chunking to `dataset.pushData()`.
- Added support for all content types and their known extensions to `KeyValueStoreLocal`.
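The automatic-chunking idea can be sketched as follows. The `chunkBySerializedSize` helper is hypothetical (not the SDK's actual code): it splits an array of records into chunks whose serialized JSON stays under a byte limit, so each upload stays within an API payload cap.

```javascript
// Greedy chunking by serialized size: start a new chunk whenever adding the
// next item would push the current chunk's JSON past maxBytes. (A single
// record larger than maxBytes still forms its own oversized chunk.)
function chunkBySerializedSize(items, maxBytes) {
    const chunks = [[]];
    let currentSize = 2; // '[]'
    for (const item of items) {
        const itemSize = Buffer.byteLength(JSON.stringify(item)) + 1; // + separator
        if (currentSize + itemSize > maxBytes && chunks[chunks.length - 1].length > 0) {
            chunks.push([]);
            currentSize = 2;
        }
        chunks[chunks.length - 1].push(item);
        currentSize += itemSize;
    }
    return chunks;
}

const chunks = chunkBySerializedSize(Array.from({ length: 100 }, (_, i) => ({ i })), 64);
console.log(chunks.every((c) => Buffer.byteLength(JSON.stringify(c)) <= 64)); // true
```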
- Puppeteer upgraded to 1.6.0.
- Removed `pageCloseTimeoutMillis` option from `PuppeteerCrawler` since it only affects debug logging.
- Fixed a bug where a failed `page.close()` in `PuppeteerPool` was causing the request to be retried.
- Added `memory` parameter to `Apify.call()`.
- Added `PuppeteerPool.retire(browser)` method allowing a browser to be retired before it reaches its limits. This is useful when its IP address gets blocked by anti-scraping protection.
- Added option `liveView: true` to `Apify.launchPuppeteer()` that will start a live view server providing a web page with an overview of all running Puppeteer instances and their screenshots.
- `PuppeteerPool` now kills opened Chrome instances on the `SIGINT` signal.
- Bugfix in `BasicCrawler`: native `Promise` doesn't have a `finally()` function
- Parameter `maxRequestsPerCrawl` added to `BasicCrawler` and `PuppeteerCrawler` classes.
- Reverted: `Apify.getApifyProxyUrl()` again accepts `session` and `groups` options instead of `apifyProxySession` and `apifyProxyGroups`
- Parameter `memory` added to `Apify.call()`.
- `PseudoUrl` class can now contain a template for `Request` object creation and a `PseudoUrl.createRequest()` method.
- Added `Apify.utils.puppeteer.enqueueLinks()` function which enqueues requests created from links matching given pseudo-URLs.
- Added a 30s timeout to the `page.close()` operation in `PuppeteerCrawler`.
- Added `dataset.getData()`, `dataset.map()`, `dataset.forEach()` and `dataset.reduce()` functions.
- Added `delete()` method to `RequestQueue`, `Dataset` and `KeyValueStore` classes.
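The semantics of the new Dataset iteration helpers can be sketched over an in-memory store. The `InMemoryDataset` class is a hypothetical stand-in for illustration only; the real methods are asynchronous and paginate over persisted storage:

```javascript
// Minimal async map / forEach / reduce over stored items, mirroring the
// shape of the Dataset helpers (item plus index passed to the callback).
class InMemoryDataset {
    constructor(items = []) { this.items = items; }
    async map(fn) { return Promise.all(this.items.map((item, i) => fn(item, i))); }
    async forEach(fn) { for (const [i, item] of this.items.entries()) await fn(item, i); }
    async reduce(fn, memo) {
        let acc = memo;
        for (const [i, item] of this.items.entries()) acc = await fn(acc, item, i);
        return acc;
    }
}

(async () => {
    const dataset = new InMemoryDataset([{ price: 10 }, { price: 25 }]);
    const prices = await dataset.map((item) => item.price);
    const total = await dataset.reduce((sum, item) => sum + item.price, 0);
    console.log(prices, total); // [ 10, 25 ] 35
})();
```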
- Added `loggingIntervalMillis` option to `AutoscaledPool`
- Bugfix: the `utils.isProduction` function was incorrect
- Added `RequestList.length()` function
- Bugfix in `RequestList` - skip invalid in-progress entries when restoring state
- Added `request.ignoreErrors` option. See documentation for more info.
- Bugfix in `Apify.utils.puppeteer.injectXxx` functions
- Puppeteer updated to v1.4.0
- Added `Apify.utils` and `Apify.utils.puppeteer` namespaces for various helper functions.
- Autoscaling feature of `AutoscaledPool`, `BasicCrawler` and `PuppeteerCrawler` is disabled on the Apify platform until all issues are resolved.
- Added `Apify.isAtHome()` function that returns `true` when code is running on the Apify platform and `false` otherwise (for example locally).
- Added `ignoreMainProcess` parameter to `AutoscaledPool`. Check documentation for more info.
- `pageOpsTimeoutMillis` of `PuppeteerCrawler` increased to 300 seconds.
- Parameters `session` and `groups` of `getApifyProxyUrl()` renamed to `apifyProxySession` and `apifyProxyGroups` to match the naming of the same parameters in other classes.
- `RequestQueue` now caches known requests and their state to avoid unneeded API calls.
- WARNING: `disableProxy` configuration of `PuppeteerCrawler` and `PuppeteerPool` removed. By default no proxy is used. You must either use the new configuration `launchPuppeteerOptions.useApifyProxy = true` to use Apify Proxy or provide your own proxy via `launchPuppeteerOptions.proxyUrl`.
- WARNING: `groups` parameter of `PuppeteerCrawler` and `PuppeteerPool` removed. Use `launchPuppeteerOptions.apifyProxyGroups` instead.
- WARNING: `session` and `groups` parameters of `Apify.getApifyProxyUrl()` are now validated to contain only alphanumeric characters and underscores.
- `Apify.call()` now throws an `ApifyCallError` error if the run doesn't succeed
- Renamed the `abortInstanceAfterRequestCount` option of `PuppeteerPool` and `PuppeteerCrawler` to `retireInstanceAfterRequestCount`
- Logs are now in plain text instead of JSON for better readability.
- WARNING: `AutoscaledPool` was completely redesigned. Check documentation for reference. It still supports previous configuration parameters for backwards compatibility, but in the future compatibility will break.
- `handleFailedRequestFunction` in both `BasicCrawler` and `PuppeteerCrawler` now also has the error object available in `ops.error`.
- Request Queue storage type implemented. See documentation for more information.
- `BasicCrawler` and `PuppeteerCrawler` now support both `RequestList` and `RequestQueue`.
- `launchPuppeteer()` changes `User-Agent` only when in headless mode or if not using full Google Chrome, to reduce the chance of detection of the crawler.
- Apify package now supports Node 7 and newer.
- `AutoscaledPool` now scales down less aggressively.
- `PuppeteerCrawler` and `BasicCrawler` now allow their underlying `AutoscaledPool` function `isFunction` to be overridden.
- New events `persistState` and `migrating` added. Check documentation of `Apify.events` for more information.
- `RequestList` has a new parameter `persistStateKey`. If this is used then `RequestList` persists its state in the default key-value store at regular intervals.
- Improved `README.md` and `/examples` directory.
- Added `useChrome` flag to `launchPuppeteer()` function
- Bugfixes in `RequestList`
- Removed again the `--disable-dev-shm-usage` flag when launching headless Chrome; it might be causing issues with high IO overheads
- Upgraded Puppeteer to version 1.2.0
- Added `finishWhenEmpty` and `maybeRunPromiseIntervalMillis` options to the `AutoscaledPool` class.
- Fixed false positive errors logged by the `PuppeteerPool` class.
- Added back `--no-sandbox` to the launch of Puppeteer to avoid issues on older kernels
- If the `APIFY_XVFB` env var is set to `1`, then avoid headless mode and use Xvfb instead
- Updated `DEFAULT_USER_AGENT` to Linux Chrome
- Consolidated startup options for Chrome - use `--disable-dev-shm-usage`, skip `--no-sandbox`, use `--disable-gpu` only on Windows
- Updated docs and package description
- Puppeteer updated to `1.1.1`
- A lot of new stuff. Everything is backwards compatible. Check https://sdk.apify.com/ for reference
- `Apify.setPromiseDependency()` / `Apify.getPromiseDependency()` / `Apify.getPromisePrototype()` removed
- A bunch of classes such as `AutoscaledPool` or `PuppeteerCrawler` added, check documentation
- Renamed GitHub repo
- Changed links to Travis CI
- Changed links to Apify GitHub repo
- `Apify.pushData()` added
- Upgraded `puppeteer` optional dependency to version `^1.0.0`
- Initial development, a lot of new stuff