Locator Best Practices and Tips

Preface

Installing and setting up Locator is in most cases a rather straightforward process. However, there are a few things that you should keep in mind when planning and performing a Locator installation.

This document is meant to give you, as the person performing the installation and setup of Locator, some basic guidelines to follow so that things go as smoothly as possible.

Introduction

Do not index everything!

We'll start with the most important practice and tip, which everyone should follow: do NOT index everything!

Now you're probably asking yourself: why?

Well, there are many reasons for this, but we'll highlight the most important ones:

- Our document filters and OCR engine do not support every file type imaginable. Text will only be extracted from the file types supported by our platform.
- Most companies have a wide range of file types that aren't really useful and that people never search for. These include, but are not limited to, temporary files, backup files and archive files (ZIP, ARJ, TAR etc.).
- If you choose to index everything, you will get a lot of files that "clutter" up the search hit list.
- It will increase the time to completion, often drastically, meaning delivery to the customer will take longer.

What do all connectors require?

All connectors require a service account. This account is used to discover and download the files/documents/objects from the data sources set up for indexing.
This service account requires full read access to all files in the source system. When the service account is created, the system administrator needs to grant it this access explicitly.
A dedicated service account is recommended per data source. In other words, you should have one service account for the file server, one for Exchange, one for SharePoint and so forth.

File Server Connector

Do not index all file types! The default selection in the connector wizard should work for most installations.
If the customer has special requirements or file types they wish to include that are not part of the default file type selection, these should be added manually in the wizard.
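
To illustrate the include-list approach described above, here is a minimal Python sketch that keeps only files whose extension is in a default selection plus any customer-specific additions. The extension set and the function name should_index are assumptions made for this example; they are not Locator's actual defaults or API.

```python
from pathlib import PurePath

# Hypothetical default selection; Locator's real default list may differ.
DEFAULT_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf", ".txt"}


def should_index(path, extra_extensions=frozenset()):
    """Return True if the file's extension is in the default or extra selection."""
    allowed = DEFAULT_EXTENSIONS | {ext.lower() for ext in extra_extensions}
    return PurePath(path).suffix.lower() in allowed


files = ["report.docx", "backup.bak", "archive.zip", "drawing.dwg"]
# In this example the customer asked for .dwg files in addition to the defaults.
selected = [f for f in files if should_index(f, extra_extensions={".dwg"})]
print(selected)  # ['report.docx', 'drawing.dwg']
```

Starting from an include list rather than an exclude list means temporary, backup and archive files are left out without having to enumerate them.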

Exchange Connector

You should set up Application Impersonation for the service account used for Exchange. This gives the service account explicit access to all mailboxes on the Exchange server, which means that no additional configuration is required. This also applies to any mailboxes that are created after the connection is set up.
If you do not wish to index all mailboxes, it's highly recommended to create a security group that holds all the accounts that you want to index. This makes for a more manageable solution both for you as the Locator administrator and, more importantly, for the customer's IT department, as they only have to add users to this group when they require a mailbox to be searchable.
You should enable push for the Exchange connection, as this will put significantly less pressure on the Exchange server compared to not using it. Push means that once a full discovery of a mailbox has been performed, the Exchange connector will only ask the Exchange server for the changes that have been applied to the mailbox since the last discovery run, much like any modern mail client does.

SharePoint Connector

Enable the option Index All Site Collections, as this will make it easier to maintain the configuration of the SharePoint indexing. If there are site collections that you do not want to index, filter those out later in the wizard. A further benefit is that the connector is able to detect and index any new site collections that are added at a later date.
Enable the option Use SiteData Web service to index by change sets, as this will decrease the number of API calls the connector makes to SharePoint. This works in much the same way as push for Exchange: once a full discovery has been performed, the connector will only ask for what has changed since the last discovery run.
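
The idea behind both push for Exchange and change sets for SharePoint can be sketched in a few lines of Python. This is only a conceptual illustration: the change log, the numeric token and the function discover_changes are all hypothetical stand-ins, not the actual Exchange or SharePoint APIs or the way the connectors are implemented.

```python
def discover_changes(change_log, last_token):
    """Return (changes newer than last_token, new token).

    change_log is a list of (sequence_number, item_id, action) tuples,
    standing in for whatever change feed the source system exposes.
    """
    changes = [entry for entry in change_log if entry[0] > last_token]
    new_token = max((entry[0] for entry in changes), default=last_token)
    return changes, new_token


change_log = [
    (1, "doc-a", "added"),
    (2, "doc-b", "added"),
    (3, "doc-a", "updated"),
    (4, "doc-b", "deleted"),
]

# Full discovery: start from token 0 and read everything.
first_run, token = discover_changes(change_log, last_token=0)

# A later run only sees what changed after the stored token.
change_log.append((5, "doc-c", "added"))
second_run, token = discover_changes(change_log, last_token=token)
print(second_run)  # [(5, 'doc-c', 'added')]
```

The second run returns a single entry instead of the whole log, which is why incremental discovery puts far less load on the source system.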

The initial indexing phase

In this phase, we generally recommend that you allocate as much CPU and RAM as possible, as this will reduce the time to completion. This is a practical approach, as most customers run on virtualised hosts where reallocating resources is an easy task.
It's often hard to give an estimate of the disk space required. Many file types are large in size, but once we extract the text from the actual document, the space we consume is small. To give an example, you can have a PowerPoint presentation file that's 20MB, but once converted we end up with only 200KB of raw text, which is the space we consume. Again, if Locator runs on a virtual machine, allocate more than you think you need and then resize once the initial indexing is complete.
The system requires that you have enough free space on the data partition used for Locator's data files. This ensures that both the database service (if Postgres is used) and the SOLR Index Service have enough space for temporary files. Our general recommendation is that you follow this formula:
- (SOLR Index Size * 1.0) + (Postgres database size * 0.5)
- So let's take a real life example: we have a Postgres database that consumes 84GB and a SOLR index that consumes 140GB. The equation is then: (140 * 1.0) + (84 * 0.5) = 182GB.
- With the above example we would need at least 182GB of free disk space to be sure that both SOLR and Postgres have enough space to operate normally.
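
The formula above is easy to turn into a quick check. The following Python sketch reproduces the worked example; the function names are our own for illustration and are not part of Locator.

```python
def required_free_space_gb(solr_index_gb, postgres_db_gb):
    """Recommended free space: (SOLR index size * 1.0) + (Postgres size * 0.5)."""
    return solr_index_gb * 1.0 + postgres_db_gb * 0.5


def has_enough_free_space(free_gb, solr_index_gb, postgres_db_gb):
    """Return True if the data partition has at least the recommended free space."""
    return free_gb >= required_free_space_gb(solr_index_gb, postgres_db_gb)


# The example from the text: 140GB SOLR index, 84GB Postgres database.
print(required_free_space_gb(140, 84))      # 182.0
print(has_enough_free_space(150, 140, 84))  # False
print(has_enough_free_space(200, 140, 84))  # True
```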

Terminology

Connector: a plugin to the Locator framework, which has been specifically developed to discover, download and convert data from a specific data source. This is usually done by using an API endpoint which the source system offers, but can be achieved by other methods. Examples of connectors are the File Server Connector, Exchange Connector and SharePoint Connector.
Data source: a definition for the system being indexed. This could be a File Server share, an Exchange Server, a SharePoint server or any other system supported by our connectors.
Connection: one working configuration for a connector, which specifies what data source is to be indexed, what file types are to be included and what filters, if any, are required to limit the types of files/objects indexed.
Discovery phase: this phase is when the connector performs a discovery of the documents and objects stored in the source system. This is also commonly referred to as a crawl. During the discovery phase, only documents and objects that match the configuration and/or filters defined in the connector setup are indexed. Basic information about the documents/objects is stored in the database, such as title, location, size, created date and last modified date. New and updated documents create a fetch request in the fetch queue, while deleted documents are marked for deletion both in the database and in the index.
Fetch/conversion phase: this phase is when the connector checks the fetch queue for work items, in other words documents and files stored in the source system, downloads them to a temporary location and performs a conversion of them. Once a document has been converted by our document filters or OCR engine, the extracted text is stored in the database and later updated in the index.
Initial indexing: this is the phase you enter after having set up one or more connections. When a full discovery/crawl has been performed on all connections set up, the system will start converting all the documents in the fetch queue. Once all documents have been converted and the system has updated the index with the converted data, which can be verified from the Web Dashboard, the initial indexing phase is complete. From this point, the system will only deal with new/updated/deleted documents, which commonly requires fewer resources than the initial indexing phase.
Time to completion: a term we often refer to, meaning the time it takes until Locator is installed, set up and done with discovery, fetch and the initial indexing. In other words, the point where all documents from the data sources are searchable for end users.
Database service: Locator requires a database backend, which stores the configuration of the system and keeps track of the discovery and fetch states, as well as the extracted text from the documents and files converted by the connectors.
Index service: whenever an end user performs a search in Locator, the web application handles the search and sends the search string to the Index Service, also referred to as the SOLR Index Service.
Document filter: a program that processes a file and attempts to extract the text found within that file. It supports a wide range of file types, all documented here.
OCR: short for Optical Character Recognition, a process where a program processes either a PDF or an image file and attempts to extract the text found within that file.
Search hit list: the search results returned by the web application when the user performs a search.
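
To make the relationship between the discovery phase, the fetch queue and the fetch/conversion phase more concrete, here is a simplified Python sketch. It is a conceptual model only: the dictionaries, the queue and the functions discovery and fetch are hypothetical and do not reflect how Locator is implemented internally.

```python
from collections import deque


def discovery(source, known, fetch_queue):
    """Compare the source with known metadata and queue new/updated documents.

    source and known map a document id to its last modified marker.
    Returns the ids of documents that no longer exist in the source.
    """
    for doc_id, modified in source.items():
        if known.get(doc_id) != modified:
            known[doc_id] = modified
            fetch_queue.append(doc_id)
    deleted = [doc_id for doc_id in known if doc_id not in source]
    for doc_id in deleted:
        del known[doc_id]
    return deleted


def fetch(fetch_queue, convert):
    """Drain the fetch queue and return the extracted text per document."""
    extracted = {}
    while fetch_queue:
        doc_id = fetch_queue.popleft()
        extracted[doc_id] = convert(doc_id)
    return extracted


known, queue = {}, deque()

# Initial indexing: everything is new, so everything is queued and converted.
deleted = discovery({"a.docx": 1, "b.pdf": 1}, known, queue)
texts = fetch(queue, lambda doc_id: f"text of {doc_id}")
print(sorted(texts))  # ['a.docx', 'b.pdf']

# Later run: a.docx was modified and b.pdf was removed from the source.
deleted = discovery({"a.docx": 2}, known, queue)
print(list(queue), deleted)  # ['a.docx'] ['b.pdf']
```

After the first run only changed documents end up in the queue, which matches the point made above that ongoing operation commonly needs fewer resources than the initial indexing.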