Web Connector: Configuration Settings
There are many settings for the Web connector. Most of them belong to the Abot crawler, which the connector uses internally; connector-specific settings have been added on top of the Abot settings. The plug-ins, which are installed separately, use the settings Custom 1 through Custom 10.
Settings
Name | Description | Data Type | Default | Remarks |
---|---|---|---|---|
Crawl Timeout In Seconds | Maximum seconds before the crawl times out and stops. | int | 0 (disabled) | Great for demos if you just want to crawl some pages before halting the crawl. |
Downloadable Content Types | The MIME types that the crawler should download. | string[] | text/html | |
Http Request Max Auto Redirects | Maximum number of automatic redirects that the HTTP request follows. | int | 3 | |
Http Request Timeout In Seconds | Number of seconds before the HTTP request times out. | int | 30 | 0 = disabled |
Http Service Point Connection Limit | Number of concurrent http(s) connections that can be open to the same host. | int | 200 | |
Is External Page Crawling Enabled | Whether pages external to the root URL may be crawled. | bool | false | |
Is External Page Links Crawling Enabled | Whether pages external to the root URL should have their links crawled. | bool | false | "Is External Page Crawling Enabled" must be true for this setting to have any effect. |
Is Forced Link Parsing Enabled | Whether the crawler should parse the page's links even if a crawl decision has already determined that those links will not be crawled. | bool | false | |
Is Http Request Auto Redirects Enabled | Whether the HTTP request should automatically follow redirects. | bool | true | |
Is Http Request Automatic Decompression Enabled | Whether gzip and deflate will be automatically accepted and decompressed. | bool | false | |
Is Ignore Robots.Txt If Root Disallowed Enabled | If true, will ignore the robots.txt file if it disallows crawling the root uri. | bool | false | |
Is Respect Anchor Rel No Follow Enabled | Whether the crawler should ignore links marked with `rel="nofollow"`, e.g. `<a href="whatever" rel="nofollow">`. | bool | true | |
Is Respect Http X-Robots-Tag Header No Follow Enabled | Whether the crawler should ignore links on pages that have an HTTP `X-Robots-Tag: nofollow` header. | bool | false | |
Is Respect Meta Robots No Follow Enabled | Whether the crawler should ignore links on pages that have a `<meta name="robots" content="nofollow" />` tag. | bool | true | See https://en.wikipedia.org/wiki/Nofollow |
Is Respect Url Named Anchor Or Hashbang Enabled | Whether URL named anchors or hashbangs are considered part of the URL. If false, they are ignored. | bool | false | |
Is Respect Robots.Txt Enabled | Whether the crawler should retrieve and respect the robots.txt file. | bool | true | |
Is Uri Recrawling Enabled | Whether URIs should be crawled more than once. | bool | false | This is not common and should be false for most scenarios. |
Is Ssl Certificate Validation Enabled | Whether or not to validate the server SSL certificate. If true, the default validation will be made. If false, the certificate validation is bypassed. | bool | true | This setting is useful to crawl sites with an invalid or expired SSL certificate. |
Build page if canonical is not pointing to url | Used to skip indexing of pages where the rel canonical is not equal to the page URL. | bool | true | If this custom value is false, the rel canonical is checked and the page is not built when the canonical differs from the page URL. Can be used to avoid page duplication. |
Max Concurrent Threads | Maximum number of concurrent threads to use for http(s) requests. | int | 5 | Must be between 1 and 100. |
Max Crawl Depth | Maximum number of levels below the root page to crawl. | int | 100 | If the value is 0, the homepage will be crawled but none of its links will be crawled. If the value is 1, the homepage and its links will be crawled, but none of those links' links will be crawled. |
Max Links Per Page | Maximum links to crawl per page. | int | 0 | If value is zero, this setting has no effect |
Max Memory Usage Cache Time In Seconds | The max amount of time before refreshing the value used to determine the amount of memory being used by the process that hosts the crawler instance. | int | 5 minutes (300) | |
Max Memory Usage In MB | The max amount of memory to allow the process to use. | int | 500 | If this limit is exceeded the crawler will stop prematurely. |
Max Page Size In Bytes | Maximum page size, in bytes. | int | 10 MB (10485760) | If the page size is above this value, it will not be downloaded or processed. If zero, this setting has no effect. |
Max Pages To Crawl Per Domain | Maximum number of pages to crawl per domain. | int | 0 | If zero, this setting has no effect |
Max Retry Count | The max number of retries for a URL if a web exception is encountered. | int | 3 | If zero, no retries will be made. |
Max Robots.Txt Crawl Delay In Seconds | The maximum number of seconds to respect in the robots.txt "Crawl-delay: X" directive. | int | 1 | "Is Respect Robots.Txt Enabled" must be true for this to have any effect. If zero, the crawler will honor whatever crawl delay the robots.txt file requests, no matter how high the value is. |
Min Available Memory Required In MB | The minimum amount of memory that must be available before a crawl starts; the value is rounded to the closest multiple of 16. If there is not at least this much memory available, the crawl will not start. | int | 0 | |
Min Crawl Delay Per Domain Milliseconds | The number of milliseconds to wait in between http requests to the same domain. | int | 0 | |
Min Retry Delay In Milliseconds | The minimum delay between a failed http request and the next retry. | int | 10 seconds (10000) | |
Robots.Txt User Agent String | The user agent string to use when checking the robots.txt file for specific directives. | string | Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0 | Examples of other crawlers' user agent values are "googlebot", "slurp", etc. |
Seed urls | The URLs used to seed the crawler. | string[] | | |
User Agent String | The user agent string to use for http(s) requests. | string | Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0 | |
Custom 1 | Custom setting 1 that will also be passed to plug-ins. | string | | |
... | ... | ... | | |
Custom 10 | Custom setting 10 that will also be passed to plug-ins. | string | | |
Custom Plugin Path | Relative path to custom plug-ins. | string | Custom | |
Enable Store Pages To Disk For Debugging | Stores downloaded pages to disk. | bool | false | For debugging purposes only. |
Persist state to disk | Persists the state of the web crawler to disk. | bool | true | This reduces the memory pressure on the crawler. |
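
To make the table more concrete, the sketch below shows how a handful of these settings could be combined for a small, polite demo crawl. It is only an illustration: the actual configuration format used by the Web connector is not shown in this document, so the Python dictionary and its keys (which simply mirror the setting names above) are hypothetical, and the values are arbitrary examples.

```python
# Hypothetical sketch only: the Web connector's real configuration format is not
# documented here, so this dictionary just mirrors the setting names from the table above.
demo_crawl_settings = {
    "Seed urls": ["https://www.example.com/"],        # where the crawler starts
    "Crawl Timeout In Seconds": 120,                  # stop the whole crawl after two minutes (handy for demos)
    "Max Crawl Depth": 2,                             # homepage, its links, and those links' links
    "Max Pages To Crawl Per Domain": 500,             # 0 would disable this limit
    "Is Respect Robots.Txt Enabled": True,            # retrieve and honor robots.txt
    "Min Crawl Delay Per Domain Milliseconds": 1000,  # at most one request per second per domain
    "Is Ssl Certificate Validation Enabled": True,    # set to False only for sites with invalid or expired certificates
    "Custom 1": "passed-through-to-plug-ins",         # Custom 1 - Custom 10 are handed to installed plug-ins
}
```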