Web Connector: Configuration Settings

The Web connector has many settings. Most of them belong to the Abot crawler, which the connector uses internally; connector-specific settings have been added on top of the Abot settings. Plugins, which are installed separately, use the settings Custom 1 through Custom 10.
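Because most of these settings are pass-throughs to Abot, it can help to see how a few of them correspond to Abot's CrawlConfiguration object. The following C# fragment is a minimal sketch for orientation only, assuming Abot 2.x namespaces and property names; the connector normally populates these values from its own settings (listed in the table below), so you would not write this code yourself.

```csharp
using System;
using System.Threading.Tasks;
using Abot2.Crawler;   // assumption: Abot 2.x packages
using Abot2.Poco;

class CrawlSettingsSketch
{
    static async Task Main()
    {
        // Illustrative values only; each property mirrors a setting from the table below.
        var config = new CrawlConfiguration
        {
            CrawlTimeoutSeconds = 0,                 // Crawl Timeout In Seconds (0 = disabled)
            MaxConcurrentThreads = 5,                // Max Concurrent Threads
            MaxCrawlDepth = 100,                     // Max Crawl Depth
            MaxPagesToCrawlPerDomain = 0,            // Max Pages To Crawl Per Domain (0 = no limit)
            MinCrawlDelayPerDomainMilliSeconds = 0,  // Min Crawl Delay Per Domain Milliseconds
            IsExternalPageCrawlingEnabled = false,   // Is External Page Crawling Enabled
            IsRespectRobotsDotTextEnabled = true,    // Is Respect Robots.Txt Enabled
            MaxPageSizeInBytes = 10485760,           // Max Page Size In Bytes (10 MB)
            UserAgentString = "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0"
        };

        var crawler = new PoliteWebCrawler(config);
        var result = await crawler.CrawlAsync(new Uri("https://example.com/")); // a seed URL
        Console.WriteLine($"Crawl finished. Error occurred: {result.ErrorOccurred}");
    }
}
```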

Settings

| Name | Description | Data Type | Default | Remarks |
|------|-------------|-----------|---------|---------|
| Crawl Timeout In Seconds | Maximum number of seconds before the crawl times out and stops. | int | 0 (disabled) | Useful for demos if you just want to crawl a few pages before halting the crawl. |
| Downloadable Content Types | The MIME types you want to download. | string[] | text/html | |
| Http Request Max Auto Redirects | Maximum number of automatic redirects that the HTTP request follows. | int | 3 | |
| Http Request Timeout In Seconds | Timeout for the HTTP request, in seconds. | int | 30 | 0 = disabled. |
| Http Service Point Connection Limit | Number of concurrent http(s) connections that can be open to the same host. | int | 200 | |
| Is External Page Crawling Enabled | Whether pages external to the root URL are allowed to be crawled. | bool | false | |
| Is External Page Links Crawling Enabled | Whether pages external to the root URL should have their links crawled. | bool | false | "Is External Page Crawling Enabled" must be true for this setting to have any effect. |
| Is Forced Link Parsing Enabled | Whether the crawler should parse a page's links even if a crawl decision determines that those links will not be crawled. | bool | false | |
| Is Http Request Auto Redirects Enabled | Whether the request should follow redirects. | bool | true | |
| Is Http Request Automatic Decompression Enabled | Whether gzip and deflate content will be automatically accepted and decompressed. | bool | false | |
| Is Ignore Robots.Txt If Root Disallowed Enabled | If true, the robots.txt file is ignored if it disallows crawling the root URI. | bool | false | |
| Is Respect Anchor Rel No Follow Enabled | Whether the crawler should ignore links that have rel="nofollow" on the anchor tag, e.g. <a href="whatever" rel="nofollow">. | bool | true | |
| Is Respect Http X Robots Tag Header No Follow Enabled | Whether the crawler should ignore links on pages that have an HTTP X-Robots-Tag header of nofollow. | bool | false | |
| Is Respect Meta Robots No Follow Enabled | Whether the crawler should ignore links on pages that have a <meta name="robots" content="nofollow" /> tag. | bool | true | See https://en.wikipedia.org/wiki/Nofollow |
| Is Respect Url Named Anchor Or Hashbang Enabled | Whether URL named anchors or hashbangs are considered part of the URL. If false, they are ignored; if true, they are treated as part of the URL. | bool | false | |
| Is Respect Robots.Txt Enabled | Whether the crawler should retrieve and respect the robots.txt file. | bool | true | |
| Is Uri Recrawling Enabled | Whether URIs should be crawled more than once. | bool | false | This is not common and should be false for most scenarios. |
| Is Ssl Certificate Validation Enabled | Whether to validate the server SSL certificate. If true, the default validation is performed; if false, certificate validation is bypassed. | bool | true | Disabling this setting is useful for crawling sites with an invalid or expired SSL certificate. |
| Build page if canonical is not pointing to url | Used to skip indexing of pages whose rel canonical does not match the page URL. | bool | true | If this custom value is false, the rel canonical is checked and the page is not built when the canonical differs from the page URL. Can be used to avoid page duplication. |
| Max Concurrent Threads | Maximum number of concurrent threads to use for http(s) requests. | int | 5 | Must be between 1 and 100. |
| Max Crawl Depth | Maximum number of levels below the root page to crawl. | int | 100 | If the value is 0, the homepage is crawled but none of its links. If the value is 1, the homepage and its links are crawled, but none of the links' links. |
| Max Links Per Page | Maximum number of links to crawl per page. | int | 0 | If zero, this setting has no effect. |
| Max Memory Usage Cache Time In Seconds | The maximum amount of time before refreshing the cached value used to determine how much memory the process that hosts the crawler instance is using. | int | 300 (5 minutes) | |
| Max Memory Usage In MB | The maximum amount of memory the process is allowed to use. | int | 500 | If this limit is exceeded, the crawler stops prematurely. If zero, this setting has no effect. |
| Max Page Size In Bytes | Maximum size of a page, in bytes. | int | 10485760 (10 MB) | If the page size is above this value, the page is not downloaded or processed. If zero, this setting has no effect. |
| Max Pages To Crawl Per Domain | Maximum number of pages to crawl per domain. | int | 0 | If zero, this setting has no effect. |
| Max Retry Count | The maximum number of retries for a URL if a web exception is encountered. | int | 3 | If zero, no retries are made. |
| Max Robots.Txt Crawl Delay In Seconds | The maximum number of seconds to respect from the robots.txt "Crawl-delay: X" directive. | int | 1 | "Is Respect Robots.Txt Enabled" must be true for this setting to have any effect. If zero, the crawler uses whatever crawl delay robots.txt requests, no matter how high that value is. See the robots.txt example after this table. |
| Min Available Memory Required In MB | Uses the closest multiple of 16 to the value set. If at least this much memory is not available before starting a crawl, an InsufficientMemoryException is thrown. | int | 0 | |
| Min Crawl Delay Per Domain Milliseconds | The number of milliseconds to wait between HTTP requests to the same domain. | int | 0 | |
| Min Retry Delay In Milliseconds | The minimum delay between a failed HTTP request and the next retry. | int | 10000 (10 seconds) | |
| Robots.Txt User Agent String | The user agent string to use when checking the robots.txt file for specific directives. | string | Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0 | Examples of other crawlers' user agent values are "googlebot", "slurp", etc. |
| Seed urls | The URLs used to seed the crawler. | string[] | | |
| User Agent String | The user agent string to use for http(s) requests. | string | Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0 | |
| Custom 1 … Custom 10 | Custom settings 1 through 10, which are also passed to plug-ins. | string | | |
| Custom Plugin Path | Relative path to custom plug-ins. | string | Custom | |
| Enable Store Pages To Disk For Debugging | Stores the downloaded pages to disk. | bool | false | For debugging purposes only. |
| Persist state to disk | Persists the state of the web crawler to disk. | bool | true | This reduces the memory pressure on the crawler. |
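
Several of the settings above ("Is Respect Robots.Txt Enabled", "Max Robots.Txt Crawl Delay In Seconds", "Robots.Txt User Agent String" and "Is Ignore Robots.Txt If Root Disallowed Enabled") control how the crawler reacts to a site's robots.txt file. As a purely hypothetical illustration, suppose the crawled site serves the following robots.txt:

```
# Hypothetical robots.txt served by the crawled site
User-agent: *
Crawl-delay: 30
Disallow: /private/
```

With the default settings, the crawler retrieves this file, skips everything under /private/, and caps the requested 30-second delay at 1 second between requests, because "Max Robots.Txt Crawl Delay In Seconds" defaults to 1. Setting that value to 0 makes the crawler honor the full 30-second delay. The directives are matched against the "Robots.Txt User Agent String", not the regular "User Agent String".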
