How to add custom Lingo fields to Locator

Preface

From time to time, you may want to add custom fields to Locator that can be used in reports generated by Supervisor/Insight. Locator and Supervisor/Insight come with a large number of pre-defined fields out of the box, but customers often have specific data they need to report on that is not covered by the extraction rules provided. This guide walks through a real-life example from a customer request and should give you the means to configure this yourself.

Prerequisites

You should have some experience with the Rules Engine, including how to access it and how to test your rules. You should also understand how regular expressions work and how to create them; an online regular expression test tool is very valuable for trying out your expressions. In addition, you should know how to make inserts into the database, and how to use the command line tool Via.Repository.exe, which is used to REINDEX the data from the database into SOLR. Finally, you need to be able to edit the SOLR schema_overrides.xml file to add the new field(s) to SOLR, and to apply those changes.

What is a Lingo field?

We use SOLR to index data, and SOLR requires that all the data are stored in defined fields. These fields can contain either normal text, date and time data, geographical positions or a whole range of other data types/content. For instance, the actual text of an indexed document is stored in a field called document_text.

When Supervisor/Insight is installed, we run the content of the document text through a wide range of linguistic text filters and rules to extract information that pertains to personally identifiable information or text that matches other identifiable information. This ranges from names, addresses, social security numbers, bank account numbers, city names etc.

A Lingo field is then basically a storage location in SOLR for a specific set of data we want to identify and report on.

How to create a new Lingo field

In this real life example, our customer wanted to have four custom Lingo fields added to their Locator setup.

  • Italian IBAN numbers
  • Italian drivers license numbers
  • Italian identification ID numbers
  • Italian tax identification numbers

Below is a list of the fields we needed to create, and some sample text along with the regular expressions we use to extract the data we need.

Field name: lingo_kmit_iban
Sample text:
  IT60 X054 2811 1010 0000 0123 456
  IT28 W800 0000 2921 0064 5211 151
  IT28W8000000292100645211151
  IT28-W800-0000-2921-0064-5211-151
Regular expression:
  IT\d{2}[ -][a-zA-Z]\d{3}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{3}|IT\d{2}[a-zA-Z]\d{22}

Field name: lingo_kmit_driverlic
Sample text:
  Numero patente U1A34Y7DTA
  Driving license U11A2B3C4A
Regular expression:
  ([Nn][Uu][Mm][Ee][Rr][Oo] [Pp][Aa][Tt][Ee][Nn][Tt][Ee]|[Dd][Rr][Ii][Vv][Ii][Nn][Gg] [Ll][Ii][Cc][Ee][Nn][Ss][Ee])\s{1,}(U1[A-Z0-9]{7}[A-Z]|[A-Z]\d{6})

Field name: lingo_kmit_idcard
Sample text:
  Carta di identità TD4563704
  Identity card KM6557454
Regular expression:
  ([Cc][Aa][Rr][Tt][Aa] [Dd][Ii] [Ii][Dd][Ee][Nn][Tt][Ii][Tt].{1}?|[Ii][Dd][Ee][Nn][Tt][Ii][Tt][Yy] [Cc][Aa][Rr][Dd])\s{1,}[A-Z]{2}\d{7}

Field name: lingo_kmit_taxcode
Sample text:
  Codice fiscale FRNLSN78P18H501L
  Tax Code ALSMNN81E64G607Y
Regular expression:
  ([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]
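
Before wiring these expressions into Locator, it can be worth sanity-checking them against the sample text. The snippet below is a minimal sketch using Python's re module, purely for verification outside Locator; the Rules Engine evaluates the same patterns itself, so confirm the final behaviour there.

import re

# Two of the field names and patterns from the table above.
patterns = {
    "lingo_kmit_iban": r"IT\d{2}[ -][a-zA-Z]\d{3}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{3}|IT\d{2}[a-zA-Z]\d{22}",
    "lingo_kmit_taxcode": r"([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]",
}

# A couple of lines taken from the sample text above.
sample = "Codice fiscale FRNLSN78P18H501L ... IT28W8000000292100645211151"

for field, pattern in patterns.items():
    # Collect every full match for this pattern in the sample text.
    matches = [m.group(0) for m in re.finditer(pattern, sample)]
    print(field, "->", matches)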

Important notice

All paths within this document point to a default installation of Locator, using the following paths

  • Program Files → C:\Program Files\ayfie\Locator\
  • ProgramData → C:\ProgramData\ayfie\Locator\

If you have chosen different paths for your installation of Locator, you need to adjust the paths used in the commands below.

First off, we need to make the SOLR Index Service aware of the fields we require. Supervisor will report on any SOLR field that is prefixed with lingo_, which is why all our custom fields follow this naming pattern. To add these fields to our SOLR configuration, we need to edit the file schema_overrides.xml located in %ProgramData%\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf (or %ProgramData%\Konica Minolta\dokoni FIND\Solr\configsets\ViaWorksCloud\conf for dokoni FIND).

From Lingo 2.2 the default Lingo fields are stored in the additional override file %ProgramData%\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf\schema_overrides\schema_overrides_for_lingo.xml (%ProgramData%\Konica Minolta\dokoni FIND\Solr\configsets\ViaWorksCloud\conf\schema_overrides\schema_overrides_for_lingo.xml for dokoni FIND). Before you add the new fields to schema_overrides.xml, please check whether the file schema_overrides_for_lingo.xml exists and confirm that the fields you want to add are not already stored in this file. If the same Lingo field line is present in both override files, the Lingo field will be duplicated in schema.xml when Locator (or dokoni FIND) is upgraded. The functionality of additional override files on upgrades was introduced in Locator (or dokoni FIND) 2.11 SR1.

Open this file with your favorite text editor, and add the following content inside the <diff> </diff> XML code.

  <add sel="/schema/fields">
    <field name="lingo_kmit_iban" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_driverlic" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_idcard" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_taxcode" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
  </add>

Then save the file. Your end result should look something like this (depending on whether or not other custom changes have been made to SOLR).

<?xml version="1.0" encoding="UTF-8"?>
<diff>
  <add sel="/schema/fields">
    <field name="lingo_kmit_iban" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_driverlic" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_idcard" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_taxcode" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
  </add>
</diff>

Now that we have our overrides file in place, we have to apply these overrides to the schema.xml file. This is done by using the command line tool Via.SolrUpdate.exe. Open up a CMD session with administrative privileges and issue the following command:

"c:\Program Files\ayfie\Locator\Tools\Via.SolrUpdate.exe" APPLY c:\ProgramData\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf\schema_base.xml c:\ProgramData\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf\schema_overrides.xml c:\ProgramData\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf\schema.xml

If everything went smoothly, a new schema.xml file should now be ready with the required SOLR fields. To enable the new configuration, we have to upload the changes to SOLR using ZooKeeper. Again, using the already open CMD session, issue the following command:

C:\>cd "c:\Program Files\ayfie\Locator\SOLR\bin"

solr zk upconfig -z localhost:9983 -n ViaWorksCloud -d c:\ProgramData\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf

This will upload the configuration to ZooKeeper, and we now have to reload the ViaWorksCloud core in the SOLR Admin Dashboard. Point your web browser to http://localhost:8983/solr/#/~collections/ViaWorksCloud and press the Reload button.
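
If you prefer not to use the Admin Dashboard, the same reload can normally also be triggered through SOLR's Collections API by pointing your browser at the following URL (assuming the default SOLR port 8983 and the ViaWorksCloud collection name):

http://localhost:8983/solr/admin/collections?action=RELOAD&name=ViaWorksCloud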

At this point, SOLR is now running with our new configuration and is aware of the new Lingo fields. Now we move over to the next step, which is to enable the fields in the database.

Enable Lingo fields in the database

We are now ready to make Locator aware that these fields should be indexed. This is done by adding the index fields to the index.index_field table in the database. Start the Postgres Admin tool located at %Program Files%\ayfie\Locator\Postgres\bin\pgAdmin3.exe.

Connect the Admin tool to your Locator database and issue the following SQL query.

DO $$ BEGIN
	IF NOT EXISTS(SELECT index_field_name
		FROM index.index_field
		WHERE index_field_name LIKE 'lingo_kmit_iban')
	THEN
		INSERT INTO index.index_field
		(index_field_name, enabled, can_have_multiple_values, description)
		VALUES
		('lingo_kmit_iban', true, true, 'Lingo KM IT IBAN Number');
	ELSE
		raise notice 'Index field lingo_kmit_iban already exists, not adding...';
	END IF;

	IF NOT EXISTS(SELECT index_field_name
		FROM index.index_field
		WHERE index_field_name LIKE 'lingo_kmit_driverlic')
	THEN
		INSERT INTO index.index_field
		(index_field_name, enabled, can_have_multiple_values, description)
		VALUES
		('lingo_kmit_driverlic', true, true, 'Lingo KM IT Drivers License');
	ELSE
		raise notice 'Index field lingo_kmit_driverlic already exists, not adding...';
	END IF;

	IF NOT EXISTS(SELECT index_field_name
		FROM index.index_field
		WHERE index_field_name LIKE 'lingo_kmit_idcard')
	THEN
		INSERT INTO index.index_field
		(index_field_name, enabled, can_have_multiple_values, description)
		VALUES
		('lingo_kmit_idcard', true, true, 'Lingo KM IT ID Card Number');
	ELSE
		raise notice 'Index field lingo_kmit_idcard already exists, not adding...';
	END IF;
	
	IF NOT EXISTS(SELECT index_field_name
		FROM index.index_field
		WHERE index_field_name LIKE 'lingo_kmit_taxcode')
	THEN
		INSERT INTO index.index_field
		(index_field_name, enabled, can_have_multiple_values, description)
		VALUES
		('lingo_kmit_taxcode', true, true, 'Lingo KM IT Tax Code Number');
	ELSE
		raise notice 'Index field lingo_kmit_taxcode already exists, not adding...';
	END IF;

END$$;
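
If you want to confirm that the rows were added, a simple query against the same table should list the new fields along with their descriptions:

SELECT index_field_name, enabled, can_have_multiple_values, description
FROM index.index_field
WHERE index_field_name LIKE 'lingo_kmit_%';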

That takes care of the database and the framework. Now for the final part: adding the rules we need to extract the data from the documents we have indexed.

Adding the Index Rule to Locator

First of all, whenever we work with the Rules Engine, we need the ID of a document that contains the data we want to extract and/or manipulate. To do this, we need to perform a search to find the relevant document. Once you have found the document, you append &debug=true to the query string in the address bar of the open browser session. This will add some extra debug information to your search hit list.

Notice that we now get an extra display item under each search hit in the green box, and what we are looking for here is the Doc ID. Make a note of this ID and point your browser to http://localhost/Dashboard/Rules/List?targetFilter=index. Then press the Test Index Rules button.

This will direct you to this page.

Enter the document ID that you made a note of previously (in our example this would be 77247) and press Next. This will present you with the following page, showing you the Index Pipeline.

Now in the Index Pipeline, press the Rules Engine Editor button to enter the Rules Engine Editor. This will load the Rules Engine editor, which by default loads up with an empty rule.

Now in the text box, you replace the existing text with the following rule.

<rules>
  <rule name="index_kmit_custom_insight_fields">
    <actions>
      <copy from="document_text" to="temp" ignorecase="false" duplicate="false">
        <formatName />
      </copy>
      <explodematches field="temp" pattern="([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]" />
      <copy from="temp" to="lingo_kmit_taxcode" ignorecase="false" duplicate="false">
        <replace pattern="([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee])" />
        <formatName />
      </copy>
      <copy from="document_text" to="temp" ignorecase="false" duplicate="false">
        <formatName />
      </copy>
      <explodematches field="temp" pattern="([Nn][Uu][Mm][Ee][Rr][Oo] [Pp][Aa][Tt][Ee][Nn][Tt][Ee]|[Dd][Rr][Ii][Vv][Ii][Nn][Gg] [Ll][Ii][Cc][Ee][Nn][Ss][Ee])\s{1,}(U1[A-Z0-9]{7}[A-Z]|[A-Z]\d{6})" />
      <copy from="temp" to="lingo_kmit_driverlic" ignorecase="false" duplicate="false">
        <replace pattern="([Nn][Uu][Mm][Ee][Rr][Oo] [Pp][Aa][Tt][Ee][Nn][Tt][Ee]|[Dd][Rr][Ii][Vv][Ii][Nn][Gg] [Ll][Ii][Cc][Ee][Nn][Ss][Ee])" />
        <formatName />
      </copy>
      <copy from="document_text" to="temp" ignorecase="false" duplicate="false">
        <formatName />
      </copy>
      <explodematches field="temp" pattern="([Cc][Aa][Rr][Tt][Aa] [Dd][Ii] [Ii][Dd][Ee][Nn][Tt][Ii][Tt].{1}?|[Ii][Dd][Ee][Nn][Tt][Ii][Tt][Yy] [Cc][Aa][Rr][Dd])\s{1,}[A-Z]{2}\d{7}" />
      <copy from="temp" to="lingo_kmit_idcard" ignorecase="false" duplicate="false">
        <replace pattern="([Cc][Aa][Rr][Tt][Aa] [Dd][Ii] [Ii][Dd][Ee][Nn][Tt][Ii][Tt].{1}?|[Ii][Dd][Ee][Nn][Tt][Ii][Tt][Yy] [Cc][Aa][Rr][Dd])" />
        <formatName />
      </copy>
      <copy from="document_text" to="lingo_kmit_iban" ignorecase="false" duplicate="false">
        <formatName />
      </copy>
      <explodematches field="lingo_kmit_iban" pattern="IT\d{2}[ -][a-zA-Z]\d{3}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{3}|IT\d{2}[a-zA-Z]\d{22}" />
    </actions>
  </rule>
</rules>

If you have indexed a document containing the sample text included at the bottom of this page under Addendum, we can easily see whether or not the rule works - and I will use this in our example below. To test the rule, press the Test -> button. This will show you the Post-Engine Document, in other words how the document will be stored in the SOLR index. If we scroll down on the page until we find our lingo_ fields, we should see the following:

This shows that the rule works as intended: the text is extracted and added to our lingo_ fields, and the rule is now ready to be saved. To do this, scroll up to the top of the current page and press the </> Temporary Rule button. This brings you back to the Rules Engine editor. We now have to enter a name for our rule; in our example we have chosen index_kmit_custom_insight_fields. This naming scheme both indicates that this is an index rule and gives a high-level textual description of what the rule does. Once you have given the rule a name, press the Save New Rule button.

Now that we have saved the rule, the rule is active and enabled. However, to populate the fields in the SOLR index we need to REINDEX the data from the Postgres database to the SOLR index. To do this, you need to open a CMD session with administrative privileges and run the following command:

"c:\Program Files\ayfie\Locator\Tools\Via.Repository.exe" REINDEX /ALL

PLEASE NOTE: in the above command, we have used the command parameter /ALL, which denotes that we want to REINDEX all the documents in the database. However, if only one or more specific repositories contain the text you want these fields populated from, you can replace /ALL with /repoid:X, where you replace X with the ID of the repository, as shown below. A list of repository IDs can be found on the Repositories page in the Web Dashboard, located at http://localhost/Dashboard/Repository/List.
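
For example, to REINDEX only the repository with ID 3 (a purely illustrative ID - use the ID shown on your own Repositories page), the command would be:

"c:\Program Files\ayfie\Locator\Tools\Via.Repository.exe" REINDEX /repoid:3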

The system will now start to REINDEX the documents and run them through the new rule. You can monitor the progress of the REINDEX from the Repositories page in the Web Dashboard (ref: http://localhost/Dashboard/Repository/List). Once the % Indexed column has reached 100% for all repositories, you know the process has finished.

At this point, you should be able to generate a report on these fields using Supervisor/Insight. Log into Supervisor/Insight and create a new report. In the report wizard under Please select required fields, you should now be able to see the new fields per our example below.

Congratulations, you have now successfully added new Lingo fields to Locator that you can report on in Supervisor/Insight. Good job!

Explanation of one of the rules used

You might be wondering what goes on behind the scenes in the above rule, so I'm going to explain one part of it, namely the lingo_kmit_taxcode extraction. First, let's look at that part of the rule.

      <copy from="document_text" to="temp" ignorecase="false" duplicate="false">
        <formatName />
      </copy>
      <explodematches field="temp" pattern="([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]" />
      <copy from="temp" to="lingo_kmit_taxcode" ignorecase="false" duplicate="false">
        <replace pattern="([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee])" />
        <formatName />
      </copy>

Here is a brief explanation of what the Rules Engine does in order of operation:

  1. First we copy the content of the document found in the field document_text to a temporary object which we call temp, so that we do not change the document content.
  2. We then use the Rules Engine action called explodematches, which searches our temporary object temp for the text that matches our regular expression.
  3. Our regular expression is as follows: ([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]
  4. The matches in our temporary object temp are then written to a list - if there is more than one match, this will result in a multiple value list.
  5. Because our regular expression also matches the leading label text ("Codice fiscale" or "Tax Code"), each match includes text before the actual tax code - this is not something we want, so we need to remove it.
  6. We then copy the content from our list in the object temp to a new object called lingo_kmit_taxcode, where we use another Rules Engine action called replace.
  7. The replace action is instructed to look for text matching our regular expression ([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) - if this text is found, we simply remove it.
  8. The rule has now built the list of all matches and removed the unwanted text, leaving us with a list containing only the tax codes. A rough sketch of these steps is shown below.
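
To make the match-then-strip logic easier to follow, here is a rough Python approximation of what these steps produce for the sample text in the Addendum. Note that explodematches and replace are Rules Engine actions, not Python, so this is only an illustration of the logic, not how Locator actually executes the rule.

import re

# A line of "document_text" containing two of the sample tax codes.
document_text = "Codice fiscale FRNLSN78P18H501L and later Tax Code ALSMNN81E64G607Y"

match_pattern = (r"([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|"
                 r"[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]")
label_pattern = r"([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee])"

# Steps 2-4 (explodematches): build a possibly multi-valued list of matching text.
temp = [m.group(0) for m in re.finditer(match_pattern, document_text)]
# -> ['Codice fiscale FRNLSN78P18H501L', 'Tax Code ALSMNN81E64G607Y']

# Steps 6-7 (replace): strip the leading label, leaving only the tax codes.
lingo_kmit_taxcode = [re.sub(label_pattern, "", value).strip() for value in temp]
print(lingo_kmit_taxcode)
# -> ['FRNLSN78P18H501L', 'ALSMNN81E64G607Y']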

Addendum

For reference, I've included some sample text below which you can use to create a text file or Word document to test the above solution with.

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam eros augue, pulvinar non nisi vitae, varius cursus orci. Etiam ullamcorper a neque at bibendum. Aliquam condimentum, tortor vel tincidunt imperdiet, neque massa imperdiet urna, a euismod magna ipsum non massa. Donec in mauris viverra, dignissim neque vel, laoreet purus. Phasellus aliquet enim quis nisi fermentum venenatis. Fusce ac gravida velit, et lobortis diam. Sed pharetra odio quis accumsan feugiat. In venenatis ac urna id consequat. Etiam a magna elit. Curabitur at odio libero. Nunc tincidunt faucibus lacus, at semper purus ultrices in. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Donec quis eleifend velit, ut cursus augue. Sed ut lacinia ligula, quis mattis urna. Pellentesque urna ligula, euismod molestie rutrum et, feugiat quis arcu. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Codice fiscale FRNLSN78P18H501L

Tax Code ALSMNN81E64G607Y

Vestibulum at tincidunt nisl. In luctus leo a vestibulum porttitor. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Morbi vel pellentesque ipsum. Quisque fringilla, orci id condimentum laoreet, orci ligula convallis est, quis faucibus lectus magna ut enim. Curabitur est lacus, vulputate ut ante non, ultricies facilisis ex. Nulla consequat felis commodo quam tempor luctus. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Quisque tincidunt felis et sapien dictum ullamcorper. Praesent posuere augue sed maximus molestie. Aliquam erat volutpat. Nulla scelerisque eget elit vitae finibus. Vivamus tempor eget odio non imperdiet. Curabitur et sem varius, ornare lorem non, auctor ipsum. Nam ullamcorper ligula sit amet rhoncus pulvinar. Donec imperdiet pharetra malesuada.

Numero patente U1A34Y7DTA

Driving license U11A2B3C4A

Praesent at lorem varius, congue nulla eget, lacinia elit. Curabitur commodo mi ut velit blandit laoreet. Maecenas pretium placerat ante, quis pretium arcu. Donec quam tortor, posuere at nibh quis, pellentesque hendrerit sapien. Vestibulum ornare sagittis nisl id congue. Maecenas sed gravida mauris. Phasellus placerat ipsum sit amet sapien lacinia, eget imperdiet mi maximus. Phasellus sodales dui leo, eget placerat odio aliquet eu. Donec sed sem mauris. Sed purus risus, luctus et massa vitae, ullamcorper vehicula mi. Donec consequat orci vel dui consequat, et tincidunt erat malesuada. Morbi gravida quam in dolor hendrerit gravida. Fusce ac convallis nisl. Sed sed gravida lorem, id consectetur elit. Praesent aliquet fermentum dui nec porttitor.

Carta di identità TD4563704

Identity card KM6557454

In finibus, lectus id ornare varius, sem erat vehicula elit, vitae luctus elit sapien sed elit. Proin blandit metus sit amet nibh aliquet, vel consequat velit efficitur. Etiam aliquam dapibus enim a porttitor. Sed laoreet mi erat, ac commodo est ornare ac. Maecenas non porta diam. Pellentesque mi ante, sagittis et elit in, dignissim placerat sapien. Praesent dolor erat, mattis sed metus eget, bibendum malesuada lorem. Praesent mollis in nibh non vulputate. Aliquam nec luctus nisi. Fusce convallis imperdiet justo et viverra. Suspendisse turpis lorem, semper a mauris viverra, sollicitudin hendrerit erat. Nam mollis in lectus sit amet tincidunt. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eu orci iaculis, ornare augue placerat, venenatis lacus.

### start IBAN test ###

IT60 X054 2811 1010 0000 0123 456
IT28 W800 0000 2921 0064 5211 151
IT28W8000000292100645211151
IT28-W800-0000-2921-0064-5211-151

### end IBAN test ###
