Locator best practices and tips

Preface

Installing and setting up Locator is in most cases a straightforward process; however, there are a few things you should keep in mind when planning and performing a Locator installation.

This document is meant to give you, the person performing the installation and setup of Locator, some basic guidelines to follow to ensure things go as smoothly as possible.

Introduction

Do not index everything!

We'll start with the most important practice and tip, which everyone should follow: do NOT index everything!

Now you're probably asking yourself: why?

Well, there are many reasons for this, but we'll highlight the most important ones:

  • our document filters and OCR engine do not support every file type imaginable. Text will only be extracted from the file types supported by our platform.
  • most companies have a wide range of file types that aren't really useful and that people never search for. These include, but are not limited to, temporary files, backup files and archive files (ZIP, ARJ, TAR etc.).
  • if you choose to index everything, you will get a lot of files that "clutter up" the search hit list.
  • it will increase the time to completion, often drastically - meaning delivery to the customer will take longer.
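
To make this concrete, here is a minimal sketch of the kind of file type filtering a connection implies. The extension set is an illustrative assumption, not Locator's actual default selection:

    import os

    # Illustrative only: this set is an assumption, not Locator's actual defaults.
    INCLUDED_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".pdf", ".msg", ".txt"}

    def should_index(filename: str) -> bool:
        """Return True only for file types worth indexing."""
        _, ext = os.path.splitext(filename)
        return ext.lower() in INCLUDED_EXTENSIONS

    print(should_index("report.docx"))  # True
    print(should_index("backup.zip"))   # False: archives only clutter the hit list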

Antivirus software

We generally recommend that no antivirus software is installed on the server(s) that run Locator, as this can often lead to problems. These include, but are not limited to, file locks on program data files used by Locator, which can cause failed conversions or, in a worst case scenario, failed updates to the database and/or index. Another known problem is that antivirus software can have a huge impact on the performance of core components of Locator, such as the database service and index service. Both of these services rely on frequent disk access, and often operate on huge data files.

As a rule of thumb, we expect that the data to be indexed by Locator has been virus scanned beforehand.

If a company policy prohibits a server from running without antivirus software, it's extremely important that an exclusion is added for both the files and sub-folders of the Program Files and ProgramData folders used by Locator.

IMPORTANT: if you are running Trend Micro antivirus software, we highly recommend that it is completely disabled or uninstalled, due to the way Trend Micro scans and locks files used by Locator. We have experienced this software suite locking files in such a way that the end result has been data corruption.

System performance / Hardware requirements

Locator is a resource-demanding application, and as such requires that the system it runs on is adequately specified in terms of CPU, RAM and disk storage. Below we'll highlight the key potential performance bottlenecks and how to improve them.

Database

The database is mostly bound by disk and CPU. Under normal circumstances, CPU utilization is not a problem, as most servers provisioned for Locator have CPU cycles freely available to the database service whenever required. The biggest potential bottleneck for the database service is the underlying storage medium. All of the components that Locator consists of use the database service, and as such, the database sees a lot of inserts, updates and deletes - requiring frequent disk access.

As such, if the storage medium used for the program data files performs poorly, this will have an impact on all Locator services. It's recommended to install Locator on disks that are as fast as possible, with a strong preference for SSDs with high IOPS throughput.

SOLR index service

The SOLR index service is bound by CPU, RAM and disk, where disk and RAM are the two critical points. Just like with the database service, the program data files SOLR uses see a lot of read/write operations. As such, it's recommended to use disks that are as fast as possible, with a strong preference for SSDs with high IOPS throughput. SOLR also requires a lot of RAM; the minimum configuration of SOLR is set up to use 4GB.

However, with larger installations of Locator, where the number of documents and the total amount of indexed text is high, a higher allocation is necessary to ensure that SOLR performs properly.

Connectors

The connectors are bound firstly by the performance of the database service, as this is used extensively to retrieve the current working configuration as well as update the state of discovery/fetch. So as long as the database service performs well, this part of the connector will perform as expected. Secondly, the connectors are bound by the number of CPU cores and the amount of system RAM. Each available CPU core on the system can serve one fetch job, and each fetch job will consume up to 2GB of RAM.

On any given Locator server, you should allocate RAM according to the following formula: (CPU cores * 2GB) + 8GB for the operating system and other processes. Using a real-life example with a system of 12 CPU cores, we would have to allocate (12 x 2GB) + 8GB = 32GB in total.
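
As a quick sanity check, the formula can be expressed as a small helper. This is just the arithmetic from the rule of thumb above, not an official sizing tool:

    def required_ram_gb(cpu_cores: int, os_overhead_gb: int = 8) -> int:
        """RAM for a Locator server: 2GB per fetch job (one per CPU core),
        plus headroom for the operating system and other processes."""
        return cpu_cores * 2 + os_overhead_gb

    print(required_ram_gb(12))  # (12 x 2GB) + 8GB = 32GB, matching the example above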

General recommendations

As previously noted, Locator is a resource-demanding application, and it performs better the more resources it has available. For example, if the underlying storage medium used for Locator is slow, this will have a huge impact on the overall performance of the product. Generally speaking, the more connections you add and the more data you index, the more resources you will require. If you ensure that the system has enough resources, you and your users will be happy with the performance. We have some general recommendations on hardware configurations for different document volumes and numbers of users; however, it's important to point out that no two installations are completely alike. We often see that just looking at the raw number of users and documents alone isn't enough.

We'll try to illustrate with an example. Imagine we have installation A and installation B. Both installations have 100 users and 15 million documents indexed. Installation A has small documents, mostly Office documents based on templates that consist of many similar words, terms and phrases, and its users have a low frequency of search requests. Installation B has huge documents, many of them technical documents, large user guides and research papers, and its users have a high frequency of search requests. Given that the documents are larger, this also means more extracted text, and a lot more unique words, terms and phrases.

This means that installation A consumes a lot less disk space, both for the database and for the SOLR index data files - which in effect also means that SOLR consumes less memory, as the index is smaller. Seeing as the search frequency is also low, the web service, SOLR and the database service all require fewer resources.

Installation B, on the other hand, has both a larger SOLR index and database and a higher search frequency. This means it consumes more RAM, as the index is larger, and more CPU, as the index service, web service and database are used more frequently and often return larger data sets.

As you can see, there is no "one size fits all" principle to follow here, and there are quite a few factors in play. But by keeping all of the above in mind, and keeping an eye on system performance, you can always allocate more or less resources as needed.

What do all connectors require?

  • All connectors require a service account. This account is used to discover and download the files/documents/objects from the data sources set up for indexing.
  • This service account requires full read access to all files in the source system. When the service account is created, the system administrator needs to give it this access explicitly.
  • A dedicated service account is recommended per data source; in other words, you should have one service account for the file server, one for Exchange, one for SharePoint and so forth.

File Server Connector

  • Do not index all file types! The default selection in the connector wizard should work for most installations.
  • If the customer has special requirements or file types they wish to include that are not part of the default selection, these should be added manually in the wizard.

Exchange Connector

  • You should set up Application Impersonation for the service account used for Exchange. This gives the service account explicit access to all mailboxes on the Exchange server, which means that no additional configuration is required. This also applies to any mailboxes that are created after the connection is set up.
  • If you do not wish to index all mailboxes, it's highly recommended to create a security group that holds all the accounts you want to index. This makes for a more manageable solution both for you as the Locator administrator and, more importantly, for the customer's IT department, as they only have to add users to this group when they require a mailbox to be searchable.
  • You should enable push for the Exchange connection, as this will put significantly less pressure on the Exchange server compared to not using it. Push basically means that once a full discovery of a mailbox has been performed, the Exchange connector will only ask the Exchange server for any changes that have been applied to the mailbox since the last discovery run, much like any modern mail client does. A generic sketch of this incremental pattern follows this list.
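
The push pattern boils down to remembering a per-mailbox synchronisation marker and asking only for changes since the last run. The sketch below is a generic illustration of that pattern; the names (server.full_discovery, server.changes_since) are hypothetical and do not reflect the actual Exchange connector internals:

    # Hypothetical illustration of push-style (incremental) discovery.
    # sync_state maps each mailbox to the marker returned by its last discovery run.
    sync_state: dict[str, str] = {}

    def discover(mailbox: str, server) -> list:
        """Full crawl on the first run; afterwards, only ask for changes."""
        last_marker = sync_state.get(mailbox)
        if last_marker is None:
            changes, marker = server.full_discovery(mailbox)              # expensive, runs once
        else:
            changes, marker = server.changes_since(mailbox, last_marker)  # cheap, runs often
        sync_state[mailbox] = marker
        return changes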

SharePoint Connector

  • Enable the option Index All Site Collections, as this will make it easier to maintain the configuration of the SharePoint indexing. If there are site collections that you do not want to index, filter those out later in the wizard. This also has the added benefit that the connector is able to detect and index any new site collections that are added at a later date.
  • Enable the option Use SiteData Web service to index by change sets, as this will decrease the number of API calls the connector makes to SharePoint. This works in much the same way as push for Exchange, in that once a full discovery has been performed, the connector will only ask for what has changed since the last discovery run.

The initial indexing phase

  • In this phase, we generally recommend that you allocate as much CPU and RAM as possible, as this will help reduce the time to completion. This is a good approach as most customers tend to run on virtualisation-based hosts, where reallocating resources is an easy task.
  • It's often hard to estimate the disk space required: many file types are large in size, yet once we extract the text from the actual document, the space we consume is small. To give an example, you can have a PowerPoint presentation file that's 20MB, but once converted we end up with only 200KB of raw text, which is the space we consume. Again, if Locator runs on a virtual machine, allocate more than you think you need and resize once the initial indexing is complete.
  • The system requires that you have enough free space on the data partition used for Locator's data files. This ensures that both the database service (if Postgres is used) and the SOLR Index Service have enough space for temporary files. Our general recommendation is to follow this formula (see the sketch after this list):
    • (SOLR Index Size * 1.0) + (Postgres database size * 0.5)
    • Let's take a real-life example: we have a Postgres database that consumes 84GB and a SOLR index that consumes 140GB. The equation is then: (140 * 1.0) + (84 * 0.5) = 182GB.
    • With the above example, we would need at least 182GB of free disk space to be sure that both SOLR and Postgres have enough space to operate normally.
  • During this phase, it's also possible to set up additional fetch servers. These are servers where you install the Locator program files and configure them to act as "dumb" slave nodes. These servers are set up to use the database and license service on the primary Locator server, and their sole task is to check the fetch queue on the primary server for fetch jobs. Whenever there are fetch jobs in the queue, they download the documents/files from the source system, run conversion on them to extract the text, and feed the document text back into the database on the primary server. The IndexBuilder service running on the primary server then updates the SOLR index based on the converted text and metadata from the documents. Setting up additional fetch servers will greatly reduce the time it takes to complete conversion and indexing, which means you will be able to deliver a fully working installation to the customer more quickly. The great thing about these additional fetch servers is that once the initial indexing is complete, they can be removed, freeing up resources and cost.
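
For the free-space recommendation in the list above, the arithmetic looks like this; a sketch of the rule of thumb, not an official sizing tool:

    def required_free_space_gb(solr_index_gb: float, postgres_db_gb: float) -> float:
        """Free space to keep on the data partition for temporary files:
        (SOLR index size * 1.0) + (Postgres database size * 0.5)."""
        return solr_index_gb * 1.0 + postgres_db_gb * 0.5

    print(required_free_space_gb(140, 84))  # (140 * 1.0) + (84 * 0.5) = 182.0GB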

Terminology

  • Connector: a plugin to the Locator framework which has been specifically developed to discover, download and convert data from a specific data source. This is usually done by using an API endpoint offered by the source system, but can be achieved by other methods. Examples of connectors are the File Server Connector, Exchange Connector and SharePoint Connector.
  • Data source: a definition of the system being indexed; this could be a file server share, an Exchange server, a SharePoint server or any other system supported by our connectors.
  • Connection: one working configuration for a connector, which specifies which data source is to be indexed, which file types are to be included, and which filters, if any, are required to limit the types of files/objects indexed.
  • Discovery phase: the phase in which the connector performs a discovery of the documents and objects stored in the source system. This is also commonly referred to as a crawl. During the discovery phase, only documents and objects that match the configuration and/or filters defined in the connector setup are indexed. Basic information about the documents/objects is stored in the database, such as title, location, size, created date and last modified date. New and updated documents create fetch requests in the fetch queue, while deleted documents are marked for deletion in both the database and the index. (A simplified sketch of this pipeline follows this list.)
  • Fetch/conversion phase: the phase in which the connector checks the fetch queue for work items, in other words documents and files stored in the source system, downloads them to a temporary location and converts them. Once a document has been converted by our document filters or OCR engine, the extracted text is stored in the database and later updated in the index.
  • Initial indexing: the phase you enter after having set up one or more connections. When a full discovery/crawl has been performed on all connections, the system will start converting all the documents in the fetch queue. Once all documents have been converted and the system has updated the index with the converted data (which can be verified from the Web Dashboard), the initial indexing phase is complete. From this point, the system will only deal with new/updated/deleted documents, which commonly requires fewer resources than the initial indexing phase.
  • Time to completion: a term we often refer to, meaning the time it takes from when Locator is installed and set up until discovery, fetch and the initial indexing are complete. In other words, the point where all documents from the data sources are searchable for end users.
  • Database service: Locator requires a database backend, which stores the configuration of the system, keeps track of the discovery and fetch states, and holds the extracted text from the documents and files converted by the connectors. The database also stores all personalised user-related settings, such as labels, favourites, scopes, search context and language settings.
  • Index service: whenever an end user performs a search in Locator, the web application sends the search string to the Index Service, also referred to as the SOLR Index Service, which performs the search.
  • Document filter: a program that processes a file, and attempts to extract the text found within that file. Supports a wide range of file types, all documented here.
  • OCR: also known as Optical Character Recognition, a process where a program processes either a PDF or image file, and attempts to extract the text found within that file.
  • Search hit list: the search results returned by the web application when the user performs a search.
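
To tie the discovery and fetch/conversion terms together, here is a deliberately simplified sketch of the pipeline they describe. All names (source, config, filters, database) are hypothetical; in reality these are separate Locator services coordinated through the database:

    from queue import Queue

    fetch_queue: Queue = Queue()

    def discovery_phase(source, config):
        """Crawl the source system; queue new/updated documents for fetching."""
        for doc in source.list_documents():
            if config.matches(doc):       # file type and filter checks
                fetch_queue.put(doc)      # basic metadata is stored in the database

    def fetch_phase(filters, database):
        """Download queued documents, extract their text and store it."""
        while not fetch_queue.empty():
            doc = fetch_queue.get()
            raw = doc.download()              # temporary local copy
            text = filters.extract_text(raw)  # document filter or OCR engine
            database.store_text(doc, text)    # IndexBuilder later updates SOLR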

ayfie