Configuring SOLR to search on non-alphanumeric characters

Introduction

ViaWorks uses SOLR/Lucene technology under the covers to make document text searchable. One important function performed by SOLR is to break the document text up into tokens before it is added to the index. SOLR provides the capability to define Analyzers which give a user fine-grained control over how the text will be tokenized, both at index time and at query time. Analyzers are associated with Field Types, which are in turn associated with Fields. Text from documents will always be associated with one or more SOLR fields. SOLR's schema.xml file is used to define and edit Analyzers, Field Types, Fields, and many other components of the SOLR engine.

When you define a Field in the SOLR schema, you must also specify a Field Type. The Field Type determines how the field text will be parsed and analyzed. Out of the box, schema.xml includes sensible defaults for many field types.

The Problem

Sometimes, however, those defaults do not support special requirements. For example, the default configuration uses Analyzers/Tokenizers that effectively strip the text of most non-alphanumeric characters, such as quote characters, parentheses, hyphens, and other punctuation. In most cases this is the desired behavior, but occasionally we need to search on characters that the default processing removes.

One example of such a case is the quote characters (', "). While their well-known standard use is to demarcate quoted passages, they may also be used in the context of describing linear measurements, as the foot (') and inch (") symbols. SOLR's default configuration strips these characters out, so for example, searches for the string 6' 2" (six feet, two inches) would fail to return any results, even if we were sure that a document containing such a string exists.

The Solution: Use the String field type

Most text fields in SOLR are defined as general-purpose field types, such as text_general. These field types are associated with analyzers/tokenizers that perform the aforementioned stripping of non-alphanumeric characters. However, a user can define fields to use other field types, some of which do not strip these characters out. One such field type is string. The string field type does not run the text through a tokenizing analyzer; it captures the entire sequence of characters verbatim and puts it in the index as a single token. As a consequence, a search against a string field generally matches only when the query matches the whole field value, unless wildcards are used.
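
As a point of reference, the field types themselves are declared elsewhere in schema.xml. The following is a sketch modeled on the example schema that ships with stock SOLR; the exact tokenizer and filters used in a ViaWorks installation are an assumption here and may differ, so compare against the existing definitions rather than pasting these in.

```xml
<!-- Sketch only: modeled on the stock SOLR example schema, not taken from ViaWorks. -->

<!-- string: solr.StrField indexes the whole value as a single, unmodified token. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- text_general: solr.TextField runs an analyzer chain. StandardTokenizerFactory
     splits on word boundaries and drops most punctuation, such as the quote
     marks in 6' 2". -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```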

For example, consider the hypothetical field "name". The "name" field would usually be defined with the text_general field type, and the typical processing (character stripping) would occur. However, if the field is redefined with type string, no such processing occurs and all characters are retained. Here's an example of what such a configuration might look like in schema.xml:

<field name="name" type="string" indexed="true" stored="true"/>
<field name="manu" type="text_general" indexed="true" stored="true"/>

The first field, "name", is defined as type string, while the second field, "manu", uses the more general type text_general. As a result, all characters from "name" will be included in the index, while the text in the "manu" field will undergo the usual analysis and processing.

Sometimes we may not want to redefine existing fields. Another approach is to create a new field, for example "name_exact", and define it with type string. We then add a "copyField" directive in schema.xml, instructing SOLR to copy the "name" field into "name_exact". With this mechanism the "name" text is indexed twice, once as text_general and once as string, and the original "name" field does not need to be modified.

The following example copies the "document_title" field into the "title_exact" field. The "title_exact" field is assumed to have already been defined with type string.

<copyField source="document_title" dest="title_exact"/>
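
Put together, the relevant schema.xml entries might look like the sketch below. The type and the indexed/stored attributes shown for "document_title" are assumptions for illustration only. "title_exact" is set to stored="false" on the assumption that it is only searched and never displayed, which avoids storing a second copy of the text.

```xml
<!-- Sketch only: attribute values are illustrative assumptions. -->

<!-- Existing analyzed field, left unchanged. -->
<field name="document_title" type="text_general" indexed="true" stored="true"/>

<!-- New verbatim field; searched but not returned, so stored="false". -->
<field name="title_exact" type="string" indexed="true" stored="false"/>

<!-- Copy the raw title text into the verbatim field at index time. -->
<copyField source="document_title" dest="title_exact"/>
```

Queries that need the literal characters would then target "title_exact", while ordinary keyword searches continue to use "document_title".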

A note of caution: Using the "copyField" technique described above may cause the index size to grow substantially, which may impact search performance. Much depends on the fields to be copied and the amount of text contained in those fields. Run performance tests before and after adding the "copyField" entries to schema.xml to see whether there is any measurable performance impact.

Restart the index service

You must restart the index service after any changes are made to the schema.xml file in order for those changes to take effect. Go to the Windows Services console (click the Start button and type "services" in the search box) and restart the item called ViaWorks Index Service.

Escaping the query input

Once the text is in the index, we can search against it. To query the index for literal quote characters (' and ") or other non-alphanumeric characters, escape each one with a backslash. Failing to do so can result in a query error, because characters such as the double quote have a special meaning in the query syntax. For example, to search for:

     6'2"

...enter the string:

     6\'2\"

Procedure

1. Identify the field(s) that will contain text with non-alphanumeric characters.
2. In the schema.xml file, either:
   - Define the field with type string, or
   - Leave the field as is and do the following:
     - Create a new field with an "_exact" suffix. This new field should be of type string.
     - Add a "copyField" entry to specify that the original field should be copied to the new "_exact" field.
3. Restart the index service.
4. Re-index documents that have already been indexed. See the command line tool Via.Repository.exe and the "ReIndex" command for more details.
5. When performing searches, escape the non-alphanumeric characters in the query with a backslash (\).