Web Connector: Configuration Settings
There are many settings for the Web connector. Most of them belong to the Abot crawler, which the connector uses internally; connector-specific settings have been added on top of the Abot settings. The plug-ins, which are installed separately, use the settings Custom 1 through Custom 10.
Settings
Name | Description | Data Type | Default | Remarks |
---|---|---|---|---|
Crawl Timeout In Seconds | Maximum seconds before the crawl times out and stops. | int | 0 (disabled) | Great for demos if you just want to crawl some pages before halting the crawl. |
Downloadable Content Types | The MIME types that the crawler should download. | string[] | text/html | |
Http Request Max Auto Redirects | Maximum number of automatic redirects that the HTTP request follows. | int | 3 | |
Http Request Timeout In Seconds | Number of seconds before the HTTP request times out. | int | 30 | 0 = disabled |
Http Service Point Connection Limit | Number of concurrent http(s) connections that can be open to the same host. | int | 200 | |
Is External Page Crawling Enabled | Whether pages external to the root URL may be crawled. | bool | false | |
Is External Page Links Crawling Enabled | Whether pages external to the root URL should have their links crawled. | bool | false | "Is External Page Crawling Enabled" must be true for this setting to have any effect. |
Is Forced Link Parsing Enabled | Whether the crawler should parse the page's links even if a crawl decision has already determined that those links will not be crawled. | bool | false | |
Is Http Request Auto Redirects Enabled | Whether the HTTP request should automatically follow redirects. | bool | true | |
Is Http Request Automatic Decompression Enabled | Whether gzip and deflate will be automatically accepted and decompressed. | bool | false | |
Is Ignore Robots.Txt If Root Disallowed Enabled | If true, will ignore the robots.txt file if it disallows crawling the root uri. | bool | false | |
Is Respect Anchor Rel No Follow Enabled | Whether the crawler should ignore links marked with `rel="nofollow"`, e.g. `<a href="whatever" rel="nofollow">`. | bool | true | |
Is Respect Http X-Robots-Tag Header No Follow Enabled | Whether the crawler should ignore links on pages that have an HTTP `X-Robots-Tag: nofollow` header. | bool | false | |
Is Respect Meta Robots No Follow Enabled | Whether the crawler should ignore links on pages that have a `<meta name="robots" content="nofollow" />` tag. | bool | true | See https://en.wikipedia.org/wiki/Nofollow |
Is Respect Url Named Anchor Or Hashbang Enabled | Whether URL named anchors or hashbangs are considered part of the URL. If false, they are ignored. | bool | false | |
Is Respect Robots.Txt Enabled | Whether the crawler should retrieve and respect the robots.txt file. | bool | true | |
Is Uri Recrawling Enabled | Whether URIs should be crawled more than once. | bool | false | This is not common and should be false for most scenarios. |
Is Ssl Certificate Validation Enabled | Whether or not to validate the server SSL certificate. If true, the default validation will be made. If false, the certificate validation is bypassed. | bool | true | This setting is useful to crawl sites with an invalid or expired SSL certificate. |
Build page if canonical is not pointing to url | Used to skip indexing of pages where the rel canonical is not equal to the page URL. | bool | true | If this custom value is false, the rel canonical is checked and the page is not built when the canonical differs from the page URL. Can be used to avoid page duplication. |
Max Concurrent Threads | Maximum number of concurrent threads to use for http(s) requests. | int | 5 | Must be between 1 and 100. |
Max Crawl Depth | Maximum number of levels below the root page to crawl. | int | 100 | If the value is 0, the homepage will be crawled but none of its links will be crawled. If the value is 1, the homepage and its links will be crawled, but none of those links' links will be crawled. |
Max Links Per Page | Maximum links to crawl per page. | int | 0 | If value is zero, this setting has no effect |
Max Memory Usage Cache Time In Seconds | The max amount of time before refreshing the value used to determine the amount of memory being used by the process that hosts the crawler instance. | int | 5 minutes (300) | |
Max Memory Usage In MB | The max amount of memory to allow the process to use. | int | 500 | If this limit is exceeded the crawler will stop prematurely. |
Max Page Size In Bytes | Maximum page size, in bytes. | int | 10 MB (10485760) | If the page size is above this value, it will not be downloaded or processed. If zero, this setting has no effect. |
Max Pages To Crawl Per Domain | Maximum number of pages to crawl per domain. | int | 0 | If zero, this setting has no effect |
Max Retry Count | The max number of retries for a URL if a web exception is encountered. | int | 3 | If zero, no retries will be made. |
Max Robots.Txt Crawl Delay In Seconds | The maximum number of seconds to respect in the robots.txt "Crawl-delay: X" directive. | int | 1 | "Is Respect Robots.Txt Enabled" must be true for this to have any effect. If zero, the crawler will honor whatever crawl delay the robots.txt file requests, no matter how high the value is. |
Min Available Memory Required In MB | The minimum amount of memory that must be available before a crawl starts; the value is rounded to the closest multiple of 16. If there is not at least this much memory available, the crawl will not start. | int | 0 | |
Min Crawl Delay Per Domain Milliseconds | The number of milliseconds to wait in between http requests to the same domain. | int | 0 | |
Min Retry Delay In Milliseconds | The minimum delay between a failed http request and the next retry. | int | 10 seconds (10000) | |
Robots.Txt User Agent String | The user agent string to use when checking the robots.txt file for specific directives. | string | Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0 | Examples of other crawlers' user agent values are "googlebot", "slurp", etc. |
Seed urls | The URLs used to seed the crawler. | string[] | | |
User Agent String | The user agent string to use for http(s) requests. | string | Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0 | |
Custom 1 | Custom setting 1 that will also be passed to plug-ins. | string | | |
... | ... | ... | | |
Custom 10 | Custom setting 10 that will also be passed to plug-ins. | string | | |
Custom Plugin Path | Relative path to custom plug-ins. | string | Custom | |
Enable Store Pages To Disk For Debugging | Stores downloaded pages to disk. | bool | false | For debugging purposes only. |
Persist state to disk | Persists the state of the web crawler to disk. | bool | true | This reduces the memory pressure on the crawler. |
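
To make the table more concrete, the sketch below shows how a handful of these settings could be combined for a small, polite demo crawl. It is only an illustration: the actual configuration format used by the Web connector is not shown in this document, so the Python dictionary and its keys (which simply mirror the setting names above) are hypothetical, and the values are arbitrary examples.

```python
# Hypothetical sketch only: the Web connector's real configuration format is not
# documented here, so this dictionary just mirrors the setting names from the table above.
demo_crawl_settings = {
    "Seed urls": ["https://www.example.com/"],        # where the crawler starts
    "Crawl Timeout In Seconds": 120,                  # stop the whole crawl after two minutes (handy for demos)
    "Max Crawl Depth": 2,                             # homepage, its links, and those links' links
    "Max Pages To Crawl Per Domain": 500,             # 0 would disable this limit
    "Is Respect Robots.Txt Enabled": True,            # retrieve and honor robots.txt
    "Min Crawl Delay Per Domain Milliseconds": 1000,  # at most one request per second per domain
    "Is Ssl Certificate Validation Enabled": True,    # set to False only for sites with invalid or expired certificates
    "Custom 1": "passed-through-to-plug-ins",         # Custom 1 - Custom 10 are handed to installed plug-ins
}
```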