Fetching

Fetching an HTTP request from Cache, Network or Browser

When processing a FetchJob:

  • findAdapter :: URL -> Adapter
  • setupRequest :: URL -> Request
  • fetchRequest :: Request -> Content
  • parseContent :: Content -> Information
  • storeInformation :: Information -> Collection | FileCollection
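
The five FetchJob steps above could be chained roughly as follows. This is only a sketch: the adapter shape (matches, parse, collection), the canned HTML returned by fetchRequest, the in-memory collections Map and the processFetchJob name are all assumptions made for illustration, since the outline gives nothing beyond the type signatures.

```javascript
// Hypothetical adapter registry: each adapter says which URLs it handles,
// how to parse their content, and which collection receives the result.
const adapters = [
  {
    matches: (url) => url.startsWith("https://example.com/"),
    parse: (content) => ({
      title: content.match(/<title>(.*?)<\/title>/)?.[1] ?? null,
    }),
    collection: "pages",
  },
];

// findAdapter :: URL -> Adapter
function findAdapter(url) {
  const adapter = adapters.find((a) => a.matches(url));
  if (!adapter) throw new Error(`No adapter for ${url}`);
  return adapter;
}

// setupRequest :: URL -> Request
function setupRequest(url) {
  return { url, method: "GET", options: { checkCacheAge: { days: 7 } } };
}

// fetchRequest :: Request -> Content
// Stubbed: returns canned HTML so the sketch runs without network access.
async function fetchRequest(request) {
  return `<html><title>Stub for ${request.url}</title></html>`;
}

// parseContent :: Content -> Information
function parseContent(adapter, content) {
  return adapter.parse(content);
}

// storeInformation :: Information -> Collection
// An in-memory Map of Maps stands in for the real collections (assumption).
const collections = new Map();
function storeInformation(adapter, url, information) {
  if (!collections.has(adapter.collection)) {
    collections.set(adapter.collection, new Map());
  }
  collections.get(adapter.collection).set(url, information);
  return information;
}

// Hypothetical driver that runs the five steps in order.
async function processFetchJob(url) {
  const adapter = findAdapter(url);
  const request = setupRequest(url);
  const content = await fetchRequest(request);
  const information = parseContent(adapter, content);
  return storeInformation(adapter, url, information);
}
```

Calling processFetchJob("https://example.com/a") resolves to the parsed information and leaves a copy in the "pages" collection. A URL that no adapter matches rejects with an error.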

When processing an InteractJob:

  • interactWithBrowser ::
  • storeInformation :: Information -> Collection | FileCollection

When fetching a Request:

  • readFromCache ::
  • readFromNetwork ::
  • readFromBrowser ::

This module provides a managed approach to fetching HTML content over the network, implementing the following three best practices:

  1. Minimal Impact - Don't spam servers by asking for the same content. Cache all requests.
  2. Minimal Latency - Wherever possible, use a single HTTP request for the content.
  3. Dynamic Content - Cater for pages that generate content using client-side JavaScript.

When a URL is requested, the module checks whether a copy of it is stored in the MongoDB cache. If a cached copy exists and is not stale, it is returned instead of going out over the network.
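
The cache-first behaviour described here might look something like the sketch below. A Map stands in for the MongoDB cache, and the fetchCached name, the { content, fetchedAt } entry layout, the millisecond maxAgeMs parameter and the injected fetchFromNetwork callback are all assumptions for illustration rather than the module's actual API.

```javascript
// In-memory stand-in for the MongoDB cache, keyed by URL (assumption).
const cache = new Map();

// Hypothetical freshness rule: an entry is fresh while younger than maxAgeMs.
function isFresh(entry, maxAgeMs, now) {
  return now - entry.fetchedAt < maxAgeMs;
}

// fetchFromNetwork is passed in so the sketch runs without network access.
// `now` is a parameter so the behaviour can be exercised deterministically.
async function fetchCached(url, maxAgeMs, fetchFromNetwork, now = Date.now()) {
  const entry = cache.get(url);
  if (entry && isFresh(entry, maxAgeMs, now)) {
    // Fresh copy available: skip the network entirely.
    return { content: entry.content, source: "cache" };
  }
  // Missing or stale: fetch once, then cache the response for next time.
  const content = await fetchFromNetwork(url);
  cache.set(url, { content, fetchedAt: now });
  return { content, source: "network" };
}
```

With this shape, repeated requests for the same URL inside the freshness window reach the server only once, which is the "Minimal Impact" goal listed above.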

readFromCache

Checks whether a fresh copy of the contents of an HTTP request has been cached.

If request.options.checkCacheAge has been specified, checks whether there is a cached result younger than this age. Returns undefined when there is no fresh cached result or when request.options.checkCacheAge is not present.

Arguments

request Object

The HTTP method, URL and options being requested.

request.url String

The URL to retrieve, including query params. Used as an index for the cache and should be unique per page.

request.method String

The HTTP method used for the original request. Either "GET" or "POST".

request.options Object

The HTTP request options object. May contain various fields, but only checkCacheAge is checked.

request.options.checkCacheAge Object, Number, or String

The maximum age of a cached result to return, e.g. { days: 7 } or { seconds: 10 }, in any form usable as a duration by moment. If nothing in the cache is younger than this duration, undefined is returned.

Returns

Object or undefined

The cached content as an object. Alternatively returns undefined when it cannot find a valid cache entry.
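
As a rough illustration of the contract documented above, the following sketch returns cached content only when checkCacheAge is present and the entry is younger than that duration. It makes several assumptions: the cache is passed in as a Map from URL to { content, cachedAt }, the current time is injectable, and durationToMs is a simplified stand-in for moment's duration parsing that handles only numbers (milliseconds) and unit objects, not strings.

```javascript
// Milliseconds per unit for the object form, e.g. { days: 7 }.
const MS_PER_UNIT = {
  milliseconds: 1,
  seconds: 1000,
  minutes: 60 * 1000,
  hours: 60 * 60 * 1000,
  days: 24 * 60 * 60 * 1000,
  weeks: 7 * 24 * 60 * 60 * 1000,
};

// Simplified stand-in for moment's duration handling (assumption):
// a number is taken as milliseconds, an object as a sum of units.
function durationToMs(duration) {
  if (typeof duration === "number") return duration;
  if (duration === null || typeof duration !== "object") {
    throw new TypeError("Unsupported duration form in this sketch");
  }
  return Object.entries(duration).reduce((total, [unit, amount]) => {
    if (!(unit in MS_PER_UNIT)) throw new RangeError(`Unknown unit: ${unit}`);
    return total + amount * MS_PER_UNIT[unit];
  }, 0);
}

// store: Map from URL to { content, cachedAt } (hypothetical layout).
function readFromCache(store, request, now = Date.now()) {
  const maxAge = request.options && request.options.checkCacheAge;
  // No checkCacheAge means the cache is not consulted at all.
  if (maxAge === undefined || maxAge === null) return undefined;
  const entry = store.get(request.url);
  if (!entry) return undefined;
  // Return the content only if the entry is younger than the maximum age.
  const ageMs = now - entry.cachedAt;
  return ageMs < durationToMs(maxAge) ? entry.content : undefined;
}
```

Under these assumptions, an entry cached 5 seconds ago is returned for { seconds: 10 }, while the same entry at exactly 10 seconds old yields undefined because the comparison is strict.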

readFromNetwork

readFromBrowser
