Preface
From time to time, you may want to add custom fields to Locator that can be used in reports generated by Supervisor/Insight. Locator and Supervisor/Insight come with many pre-defined fields out of the box, but customers often have specific data they need to report on that is not covered by the extraction rules provided. This guide walks through a real-life example from a customer request, which should give you the means to configure this yourself.
...
You should have some experience with the rules engine, including how to access it and how to test your rules. You should also understand how regular expressions work and how to write them; we find an online regular expression test tool very valuable for testing them. You should also know how to make inserts into the database, and how to use the command line tool Via.Repository.exe, which is used to REINDEX the data from the database into SOLR. Finally, you need to know how to edit the SOLR schema_overrides.xml file to add the new field(s) to SOLR, and how to apply these changes.
...
...
...
We use SOLR to index data, and SOLR requires that all data is stored in defined fields. These fields can contain normal text, date and time data, geographical positions or a whole range of other data types. For instance, the actual text of an indexed document is stored in a field called document_text. When Supervisor/Insight is installed, we run the content of the document text through a wide range of linguistic text filters and rules to extract personally identifiable information or text that matches other identifiable information. This ranges from names, addresses, social security numbers and bank account numbers to city names, etc. A Lingo field is then basically a storage location in SOLR for a specific set of data we want to identify and report on.
How to create a new Lingo field
In this real life example, our customer wanted to have four custom Lingo fields added to their Locator setup.
- Italian IBAN numbers
- Italian drivers license numbers
- Italian identification ID numbers
- Italian tax identification numbers
Below is a list of the fields we needed to create, and some sample text along with the regular expressions we use to extract the data we need.
Field name | Sample text | Regular expression |
---|---|---|
lingo_kmit_iban | IT60 X054 2811 1010 0000 0123 456 IT28 W800 0000 2921 0064 5211 151 IT28W8000000292100645211151 IT28-W800-0000-2921-0064-5211-151 | IT\d{2}[ -][a-zA-Z]\d{3}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{3}|IT\d{2}[a-zA-Z]\d{22} |
lingo_kmit_driverlic | Numero patente U1A34Y7DTA | ([Nn][Uu][Mm][Ee][Rr][Oo] [Pp][Aa][Tt][Ee][Nn][Tt][Ee]|[Dd][Rr][Ii][Vv][Ii][Nn][Gg] [Ll][Ii][Cc][Ee][Nn][Ss][Ee])\s{1,}(U1[A-Z0-9]{7}[A-Z]|[A-Z]\d{6}) |
lingo_kmit_idcard | Carta di identità TD4563704 | ([Cc][Aa][Rr][Tt][Aa] [Dd][Ii] [Ii][Dd][Ee][Nn][Tt][Ii][Tt].{1}?|[Ii][Dd][Ee][Nn][Tt][Ii][Tt][Yy] [Cc][Aa][Rr][Dd])\s{1,}[A-Z]{2}\d{7} |
lingo_kmit_taxcode | Codice fiscale FRNLSN78P18H501L | ([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z] |
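Before wiring these expressions into a rule, it can help to confirm that each one matches its sample text. The following is a minimal Python sketch and not part of Locator. It assumes that Python's `re` module treats these patterns the same way the Rules Engine does, which has not been verified here; the names `FIELDS` and `find_matches` are illustrative only.

```python
import re

# Field name -> (sample text, regular expression), copied from the table above.
FIELDS = {
    "lingo_kmit_iban": (
        "IT60 X054 2811 1010 0000 0123 456 IT28 W800 0000 2921 0064 5211 151 "
        "IT28W8000000292100645211151 IT28-W800-0000-2921-0064-5211-151",
        r"IT\d{2}[ -][a-zA-Z]\d{3}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{3}"
        r"|IT\d{2}[a-zA-Z]\d{22}",
    ),
    "lingo_kmit_driverlic": (
        "Numero patente U1A34Y7DTA",
        r"([Nn][Uu][Mm][Ee][Rr][Oo] [Pp][Aa][Tt][Ee][Nn][Tt][Ee]"
        r"|[Dd][Rr][Ii][Vv][Ii][Nn][Gg] [Ll][Ii][Cc][Ee][Nn][Ss][Ee])"
        r"\s{1,}(U1[A-Z0-9]{7}[A-Z]|[A-Z]\d{6})",
    ),
    "lingo_kmit_idcard": (
        "Carta di identit\u00e0 TD4563704",
        r"([Cc][Aa][Rr][Tt][Aa] [Dd][Ii] [Ii][Dd][Ee][Nn][Tt][Ii][Tt].{1}?"
        r"|[Ii][Dd][Ee][Nn][Tt][Ii][Tt][Yy] [Cc][Aa][Rr][Dd])\s{1,}[A-Z]{2}\d{7}",
    ),
    "lingo_kmit_taxcode": (
        "Codice fiscale FRNLSN78P18H501L",
        r"([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]"
        r"|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]",
    ),
}


def find_matches(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    for name, (sample, pattern) in FIELDS.items():
        print(name, find_matches(pattern, sample))
```

Note that the driver's license, ID card and tax code expressions include the label in front of the number, so each full match contains the label as well as the value.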
All paths in this document assume a default installation of Locator. If you have chosen different paths for your installation of Locator, you need to adjust the paths used in the commands below.
First off, we need to make the SOLR Index Service aware of the fields we require. Supervisor will report on any SOLR field that is prefixed with lingo_, which is why all our custom fields follow this naming pattern. To add these fields to our SOLR configuration, we need to edit the file schema_overrides.xml located in %ProgramData%\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf (or %ProgramData%\Konica Minolta\dokoni FIND\Solr\configsets\ViaWorksCloud\conf for dokoni FIND).
From Lingo 2.2 onwards, the default lingo fields are stored in the additional override file %ProgramData%\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf\schema_overrides\schema_overrides_for_lingo.xml (%ProgramData%\Konica Minolta\dokoni FIND\Solr\configsets\ViaWorksCloud\conf\schema_overrides\schema_overrides_for_lingo.xml for dokoni FIND). Before you add the new fields to schema_overrides.xml, please check whether the file schema_overrides_for_lingo.xml exists and confirm that the fields you want to add are not already stored in this file. If the same lingo field line is present in both override files, the lingo field will be duplicated in schema.xml when Locator (or dokoni FIND) is upgraded. Support for additional override files on upgrades was introduced in Locator (or dokoni FIND) 2.11 SR1.
Open schema_overrides.xml with your favorite text editor, and add the following content inside the <diff> </diff> element.
    <add sel="/schema/fields">
      <field name="lingo_kmit_iban" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
      <field name="lingo_kmit_driverlic" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
      <field name="lingo_kmit_idcard" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
      <field name="lingo_kmit_taxcode" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    </add>
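Two mistakes are easy to make at this step: declaring a field without the lingo_ prefix, and declaring a field that already exists in schema_overrides_for_lingo.xml. The following Python sketch shows one way to check for both before applying the overrides. It is not a Locator tool. It assumes both files use a `<diff>` root with `<field>` elements and no XML namespaces; the content of `LINGO_DEFAULTS`, including the field name `lingo_example_default`, is hypothetical, and in practice you would read the two files from disk instead of using inline strings.

```python
import xml.etree.ElementTree as ET

# The custom fields from this guide, wrapped in the <diff> element.
OVERRIDES = """<diff>
  <add sel="/schema/fields">
    <field name="lingo_kmit_iban" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_driverlic" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_idcard" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
    <field name="lingo_kmit_taxcode" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
  </add>
</diff>"""

# Hypothetical stand-in for schema_overrides_for_lingo.xml (assumed structure).
LINGO_DEFAULTS = """<diff>
  <add sel="/schema/fields">
    <field name="lingo_example_default" type="string" indexed="true" stored="false" multiValued="true" docValues="true" />
  </add>
</diff>"""


def field_names(xml_text):
    """Return the name attribute of every <field> element in the XML text."""
    root = ET.fromstring(xml_text)
    return [field.get("name") for field in root.iter("field")]


def check_overrides(custom_xml, lingo_xml):
    """Return (names missing the lingo_ prefix, names declared in both files)."""
    custom = field_names(custom_xml)
    bad_prefix = [name for name in custom if not name.startswith("lingo_")]
    duplicates = sorted(set(custom) & set(field_names(lingo_xml)))
    return bad_prefix, duplicates


if __name__ == "__main__":
    bad_prefix, duplicates = check_overrides(OVERRIDES, LINGO_DEFAULTS)
    print("Missing lingo_ prefix:", bad_prefix)
    print("Already in the lingo override file:", duplicates)
```

Both lists should be empty before you continue; a parse error from `ET.fromstring` would also indicate that the edited file is not well-formed XML.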
...
Now that we have our overrides file in place, we have to apply these overrides to the schema.xml file. This is done by using the command line tool Via.SolrUpdate.exe. Open up a CMD session with administrative privileges and issue the following command:
    "c:\Program Files\ayfie\Locator\Tools\Via.SolrUpdate.exe" APPLY c:\ProgramData\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf\schema_base.xml c:\ProgramData\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf\schema_overrides.xml c:\ProgramData\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf\schema.xml
If everything went smoothly, a new schema.xml file should now be ready with the required SOLR fields. To enable the new configuration, we have to upload the changes to SOLR using ZooKeeper. Again, using the already open CMD session, issue the following command:
    cd "c:\Program Files\ayfie\Locator\SOLR\bin"
    solr zk upconfig -z localhost:9983 -n ViaWorksCloud -d c:\ProgramData\ayfie\Locator\Solr\configsets\ViaWorksCloud\conf
...
SOLR is now running with our new configuration and is aware of the new Lingo fields. The next step is to enable the fields in the database.
...
We are now ready to make Locator aware that these fields should be indexed. This is done by adding the index fields to the index.index_field table in the database. Start the Postgres Admin tool located in %ProgramFiles%\ayfie\Locator\Postgres\bin\pg3admin.exe.
Connect the Admin tool to your ViaWorks Locator database and issue the following SQL query.
...
If you've added the content of the file included at the bottom of this page under Addendum, you can easily see whether the rule works or not, and we will use this in our example below. To test the rule, press the Test -> button. This shows you the Post-Engine Document, in other words how the document will be stored in the SOLR index. If we scroll down the page until we find our lingo_ fields, we should see the following:
...
This shows that the rule works as intended: the text is extracted and added to our lingo_ fields, and the rule is now ready to be saved. To do this, scroll up to the top of the current page and press the </> Temporary Rule button. This brings you back to the Rules Engine editor. We now have to enter a name for our rule, and in our example we have chosen to name it index_kmit_custom_insight_fields. The reason for this naming scheme is both to indicate that this is an index rule and to give a high-level textual explanation of what the rule does. Once you have given the rule a name, you can press the Save New Rule button.
...
    "c:\Program Files\ayfie\Locator\Tools\Via.Repository.exe" REINDEX /ALL
...
At this point, you should be able to generate a report on these fields using Supervisor/Insight. Log into Supervisor/Insight and create a new report. In the report wizard, under Please select required fields, you should now see the new fields, as in our example below.
...
You might be wondering what goes on behind the scenes in the above rule, so we will explain one part of it, namely the lingo_kmit_taxcode rule. First, let's look at the rule.
...
- First we copy the content of the document, found in the field document_text, to a temporary object which we call temp, so that we do not change the document content.
- We then use the Rules Engine action called explodematches, which searches our temporary object temp for text that matches our regular expression.
- Our regular expression is as follows: ([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]) [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]
- The matches in our temporary object temp are then written to a list. If there is more than one match, this results in a multiple value list.
- Because our regular expression also matches the label in front of the tax code (Codice fiscale or Tax code), each match includes that label text. This is not something we want, so we need to remove it.
- We then copy the content from our list in the object temp to a new object called lingo_kmit_taxcode, where we use another Rules Engine action called replace.
- The replace action is instructed to look for text matching the regular expression ([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee]). If this text is found, we simply remove it.
- The rule has now made a list of all matches, removed the unwanted text, and left us with a list of the tax codes.
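The steps above can be sketched in Python to make the data flow concrete. This is an emulation for illustration only and is not Rules Engine syntax: `re.finditer` stands in for explodematches and `re.sub` stands in for replace. The function name `extract_tax_codes` is hypothetical, the second tax code in the sample text is made up for the example, and the final `strip()` is our own addition to drop the space left behind once the label is removed.

```python
import re

# Label part of the expression: "Codice fiscale" or "Tax code", any letter case.
LABEL = (
    r"([Cc][Oo][Dd][Ii][Cc][Ee] [Ff][Ii][Ss][Cc][Aa][Ll][Ee]"
    r"|[Tt][Aa][Xx] [Cc][Oo][Dd][Ee])"
)
# Full expression: the label, a space, then the 16-character tax code.
TAXCODE = LABEL + r" [A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]"


def extract_tax_codes(document_text):
    """Emulate the lingo_kmit_taxcode rule on a piece of document text."""
    # Work on a copy so the original document text is left untouched.
    temp = document_text
    # Stand-in for explodematches: collect every full match (label + code).
    matches = [m.group(0) for m in re.finditer(TAXCODE, temp)]
    # Stand-in for replace: remove the label from each match, keep the code.
    return [re.sub(LABEL, "", match).strip() for match in matches]


if __name__ == "__main__":
    text = "Codice fiscale FRNLSN78P18H501L and also Tax code RSSMRA85T10A562S."
    print(extract_tax_codes(text))
```

A document with several labelled tax codes yields a list with one entry per code, which corresponds to the multiple value list described above.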
...