Introduction to Multi-Node Installations
Introduction
There can be several reasons for running Locator on more than one server:
- Too much data to process, index and store on a single server
- Too many incoming search queries for a single server to handle
- Too slow search query processing due to an overly large index
- No failover with just a single server
- Initial feeding and indexing takes too long
- Operation latency due to a single job queue
This document aims to give an understanding of what a multi-node installation entails. For detailed step-by-step procedures, consult the documentation under Multi-Server Environments.
This document assumes that one is familiar with all the components of a single-node ayfie Locator installation. If not, first read ayfie Locator Architectural Overview.
Single Node Installation
All components of a single node installation are described in detail in ayfie Locator Architectural Overview.
Below we see a single node installation where all ayfie Locator components reside on the same host.
A single node installation does not support failover, and the only way to increase capacity is by scaling up (that is, adding more resources to the host). As a rule of thumb, given a normal document type distribution (70 % email / 30 % office documents), a single node can handle up to 20 million documents. The table below shows the amount of RAM, CPU and disk required depending on the number of documents (the rightmost column, with the resource requirements for an extra fetch server, is covered in the next section):
| | Locator | Locator & Supervisor | Extra Fetch Server |
|---|---|---|---|
| Docs in Millions | < 20 | < 20 | < 20 |
| CPU cores | 4 - 16 | 8 - 24 | 4 - 16 |
| RAM (GB) | 16 - 64 | 32 - 96 | 16 - 64 |
| System Drive (GB) | 20 | 20 | 20 |
| Program Files (GB) | 60 | 60 | 60 |
| Program Data (GB) | 200 - 500 | 250 - 600 | 60 |
The numbers given as ranges all correspond to each other. That is, if the number of documents is in the lower, middle or higher end of its range, then the number of CPU cores and the amount of RAM will also fall in the corresponding part of their respective ranges.
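As a rough illustration of how these ranges correspond, the following Python sketch linearly interpolates CPU, RAM and Program Data for a Locator-only node from the document count. The endpoints are taken from the table above; the linear interpolation itself is an assumption used only for illustration.

```python
def estimate_locator_node(doc_count_millions: float) -> dict:
    """Rough sizing estimate for a Locator-only node (< 20 million documents).

    Assumes a linear relationship between the position of the document count
    within its range and the position of CPU/RAM/disk within their ranges,
    as described above. Endpoints are taken from the sizing table.
    """
    max_docs = 20.0                                   # upper end of the range
    fraction = min(doc_count_millions / max_docs, 1.0)

    cpu_cores = 4 + fraction * (16 - 4)               # 4 - 16 cores
    ram_gb = 16 + fraction * (64 - 16)                # 16 - 64 GB
    program_data_gb = 200 + fraction * (500 - 200)    # 200 - 500 GB

    return {
        "cpu_cores": round(cpu_cores),
        "ram_gb": round(ram_gb),
        "program_data_gb": round(program_data_gb),
    }


# Example: a 10 million document corpus lands in the middle of each range.
print(estimate_locator_node(10))  # {'cpu_cores': 10, 'ram_gb': 40, 'program_data_gb': 350}
```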
Secondary Fetch Server
The step-by-step procedure for the scenario described in this section is found at Installing/Configuring an Additional Fetch Server.
Document fetching is a very CPU-intensive activity, with binary-to-text conversion being the biggest consumer, especially OCR conversion. For more information about document conversion in particular, see Document Conversions & Estimating Initial Indexing Time.
The single node described in the previous section would normally be able to handle up to 20 million documents. However, the initial feeding, processing and indexing of that amount of data may take a very long time to complete. Depending on the size of the documents and the proportion of them being office documents that require conversion, 20 million documents may very well take some 3-4 months. To speed up this initial upload of data, an additional fetch node can be added as shown in the next graphic. This adds considerably more CPU to the operation.
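To get a feel for the numbers, here is a small Python sketch that estimates the initial indexing time. The default per-server throughput is an assumption back-calculated from the 3-4 month figure above, and linear scaling across fetch servers is likewise an assumption.

```python
def estimate_initial_indexing_days(doc_count_millions: float,
                                   fetch_servers: int = 1,
                                   docs_per_second_per_server: float = 2.5) -> float:
    """Rough estimate of initial feeding/indexing time in days.

    The default throughput of ~2.5 documents per second per fetch server is
    an assumption back-calculated from the '20 million documents in roughly
    3-4 months' figure above. Actual throughput depends heavily on document
    sizes, the share of office documents and the amount of OCR conversion.
    Linear scaling across fetch servers is also an assumption.
    """
    total_docs = doc_count_millions * 1_000_000
    docs_per_day = docs_per_second_per_server * fetch_servers * 86_400
    return total_docs / docs_per_day


# One fetch server: roughly 93 days (about 3 months) for 20 million documents.
print(round(estimate_initial_indexing_days(20, fetch_servers=1)))
# A secondary fetch server roughly halves that, to about 46 days.
print(round(estimate_initial_indexing_days(20, fetch_servers=2)))
```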
The use of a secondary fetch server can either be limited to the initial round of data uploading or made permanent. In the latter case, one can turn off fetching on the primary server and thereby ensure that document fetching does not negatively impact query response time.
The resource requirements for such a secondary fetch server are shown in the rightmost column of the table in the previous section.
Secondary Search Server
The step-by-step procedure for the scenario described in this section is found at Web Server and Solr Cloud Configuration.
Search is a RAM-intensive operation, and all search nodes must therefore be set up with enough RAM.
Here we have added a secondary search server and thereby doubled the number of documents we are able to handle, from 20 million to 40 million.
Everything in the table above remains the same except for the very first row:
| | Locator | Locator & Supervisor |
|---|---|---|
| Docs in Millions | < 40 | < 40 |
All the other rows in the previous table are per host. The document capacity repeated here is for the installation as a whole.
The previous graphic only showed the high-level components. In the next graphic we will have a peek inside the Index Services to see how documents and queries are distributed between them:
The first thing to decide when configuring ZooKeeper is whether or not to support failover. If failover is not to be supported, one can simply not configure the blue replica shards.
All incoming documents are indexed by the Index Builder and handed over to the Solr instance on the primary server. The Solr node consults ZooKeeper about which shard and replica to use. Based on this information, the Solr nodes distribute the documents between them and insert them into the right shards and/or replicas. On the query side it is the same: the Rest Service passes the queries on to the local Solr instance, which then, under the direction of ZooKeeper, distributes the query to the right shards/replicas and merges the returned search results.
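For readers who want to see what this shard/replica layout corresponds to at the SolrCloud level, the sketch below creates a collection through Solr's standard Collections API. The collection name and Solr URL are placeholders, and in a real Locator installation this configuration is normally handled by the installer rather than by hand.

```python
import requests

# Placeholder values; a real Locator installation defines its own
# collection name and Solr endpoint.
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "locator"


def create_collection(num_shards: int, replication_factor: int) -> None:
    """Create a SolrCloud collection; ZooKeeper then keeps track of which
    shard and replica each document and query should be routed to."""
    response = requests.get(
        f"{SOLR_URL}/admin/collections",
        params={
            "action": "CREATE",
            "name": COLLECTION,
            "numShards": num_shards,                   # how the data is split up
            "replicationFactor": replication_factor,   # copies of each shard
        },
    )
    response.raise_for_status()
    print(response.json())


# replication_factor=1 means a single copy of each shard, i.e. the
# "no blue replica shards" / no-failover case described above.
create_collection(num_shards=2, replication_factor=1)
```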
Large Multi-Node Setups
Below we see the standard node layout for a Locator deployment with more than 40 million documents to be indexed.
The difference here relative to the two previous examples is that we have now also placed the database on a server of its own. Here are the resource requirements for the various servers in this setup:
| | Database Node | Locator Primary Node | Locator Secondary Search Node | Locator & Supervisor Primary Node | Locator & Supervisor Secondary Search Node | Extra Fetch Server |
|---|---|---|---|---|---|---|
| Docs in Millions | 20 - 100 | 20 - 100 | 20 - 100 | 20 - 100 | 20 - 100 | 20 - 100 |
| CPU cores | 32 | 12 | Next Table | 24 | Next Table | 32 |
| RAM (GB) | 100 | 100 - 128 | Next Table | 128 - 148 | Next Table | 64 |
| System Drive (GB) | 20 | 20 | Next Table | 30 | Next Table | 20 |
| Program Files (GB) | 60 | 60 | Next Table | 60 | Next Table | 60 |
| Program Data (GB) | 1500 - 3000 | 800 - 1500 | Next Table | 1000 - 1700 | Next Table | 100 |
Index Capacity, Query Performance & Failover
As the number of documents and/or the incoming query traffic increases, a single search node will no longer be enough. Below we see how we can add columns in order to increase data capacity and rows in order to increase query performance.
Each search index positioned in a (horizontal) row holds a different set of data than the other indexes in the same row. For instance, a row with 2 columns doubles the amount of data that can be indexed. All search indexes residing within the same column hold a copy of the same set of data. A setup with, say, 3 rows will thus triple the number of queries Locator is able to respond to.
If one adds one more row than is needed to handle peak query traffic, that extra redundant row ensures full failover with no drop in quality if any single search server goes down.
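As a back-of-the-envelope sketch of this row/column reasoning, the function below estimates the capacity of a given search matrix. The 20 million documents per column and the assumption that query throughput scales linearly with the number of rows are taken from the rules of thumb above; the function itself is only an illustration.

```python
def search_matrix_capacity(columns: int, rows: int,
                           docs_per_column_millions: int = 20,
                           failover_row: bool = True) -> dict:
    """Back-of-the-envelope capacity estimate for a search matrix.

    Assumes each column adds roughly 20 million documents of capacity (the
    single-node rule of thumb above) and that query throughput scales
    linearly with the number of rows. If one row is reserved for failover,
    it does not count towards peak query capacity.
    """
    query_rows = rows - 1 if failover_row else rows
    return {
        "document_capacity_millions": columns * docs_per_column_millions,
        "query_throughput_multiplier": max(query_rows, 1),
        "survives_single_search_node_failure": failover_row and rows > 1,
    }


# The 2 x 2 matrix example shown below: 40 million documents of capacity,
# with one row kept as a redundant failover row.
print(search_matrix_capacity(columns=2, rows=2))
```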
Below we see a 2 x 2 search matrix example. Each search row is marked with a yellow rectangle and each search column with a dark red rectangle.
And the table below has the machine specification for each of the 3 search nodes that have been added:
| | Locator | Locator & Supervisor |
|---|---|---|
| Docs in Millions | 20 - 100 | 20 - 100 |
| CPU cores | 12 | 24 |
| RAM (GB) | 100 | 128 - 148 |
| System Drive (GB) | 20 | 30 |
| Program Files (GB) | 60 | 60 |
| Program Data (GB) | 800 - 1500 | 1000 - 1700 |
Single Points of Failure
With the current version of ayfie Locator, the following components cannot be duplicated and remain single points of failure:
- Index Builder - The Index Builder is stateless, hence the worst-case scenario is a temporary interruption in the indexing of new and updated documents.
- Linguistic Service - The Linguistic Services (a.k.a. the Lingo Services) are not part of the out-of-the-box installation. If installed later, they will be called by the Index Builder to perform entity extraction on data extracted from the database. Failure of this component will slow the Index Builder down, as it will wait for a timeout for each document before producing the index, and no entities will be extracted.
- License Service - A License Service failure will be fatal. Without the license server, connectors will stop operating and no user will be allowed to submit queries.
- Primary Search Node - If one operates with a search matrix to provide failover, that will work as long as the failing search node is not the primary server. The reason for this is that the license server resides on the primary node, and if that node goes down, no Rest Service will be able to authenticate the user.
- ZooKeeper - A ZooKeeper failure will render search nonoperational, as ZooKeeper is responsible for directing searches to the correct shards within the collection for the search node in question.
- IIS - If the Windows IIS Web Server were to go down, it would no longer be possible to post queries. However, as that service is stateless, as soon as it is up again it would be back to business as usual; no data would be lost beyond the failed queries posted while the service was down. Using a load balancer and two or more IIS nodes, as shown in the graphic above, would not eliminate this single point of failure, as all of them would depend on the license server, which would continue to be a single point of failure.
- Database - If the database were to go down, all searches would stop working, as the required user authentication depends on the database. However, it is possible to set up an external standalone PostgreSQL or Microsoft SQL Server database with its own failover solution independent of ayfie Locator.
- Connector Services - The connectors have two phases: discovery and fetch. Unlike fetch, which can be done from any number of servers, discovery can only be done from a single server. For collections served by a connector performing discovery on a particular server, the failure of that server or connector will prevent any further updates to the index. The server assigned to perform discovery can be changed, but only one server can be assigned to perform discovery.