Markdown |
---|
- [Introduction](#introduction)
- [What Is a Smart Classifier?](#what-is-a-smart-classifier)
- [How Does a Smart Classifier Operate?](#how-does-a-smart-classifier-operate)
- [What Does a Smart Classifier Provide That a Regular Keyword Search Cannot?](#what-does-a-smart-classifier-provide-that-a-regular-keyword-search-cannot)
- [How Does One Train the Smart Classifier?](#how-does-one-train-the-smart-classifier)
- [Training Dataset Example](#training-dataset-example)
- [The Inner Workings of the Smart Classifier](#the-inner-workings-of-the-smart-classifier)
- [Calculating the Accuracy](#calculating-the-accuracy)
- [Smart Classifier States](#smart-classifier-states)

# Introduction

Ayfie Locator incorporates a licensed AI-based feature known as the Smart Classifier. This documentation provides both a conceptual introduction to its operation and practical setup instructions.

**The** Smart Classifier is a feature, but in this documentation we also refer to each Smart Classifier-based search filter as **a** smart classifier.

## What Is a Smart Classifier?

A smart classifier, once ready to be used, is nothing more than a search filter. Like any other search filter, it allows a user to remove documents that are not of interest from the search result. For instance, a smart classifier trained to detect contracts can be used to filter the search result to only include contracts.

## How Does a Smart Classifier Operate?

A smart classifier undergoes training to identify words, word sequences, and sentence patterns that are more prevalent in the target data than in the broader dataset. It leverages this acquired knowledge to make predictions about the category to which indexed data belongs.

## What Does a Smart Classifier Provide That a Regular Keyword Search Cannot?

Imagine trying to locate a specific piece of information within a contract you've seen before.
You might recall one or two of the parties involved, but simply searching for their names could yield hundreds of results from various other types of documents where those same parties are also mentioned. To narrow down the results, you might try adding some common contract-related terms to your search. However, even if this succeeds, it is likely to take a few attempts and cost you some time and possibly some frustration.

A smart classifier trained on contracts simplifies the search by offering a contract filter option. The smart classifier does not depend on specific terms being present in a document to know that it is a contract. It recognizes a contract by its nature (contexts and patterns) and not by specific words alone. It will know that a contract is a contract even if the word *contract* is never mentioned anywhere within the document.

## How Does One Train the Smart Classifier?

A smart classifier learns from examples. For that reason one needs to provide a smart classifier with the following two training sets:

- **Positive examples**: Examples of the data that we are interested in.
- **Negative examples**: Examples of the data that we are **not** interested in.

## The Principles and Rules of Classifier Training

This is the first principle of classifier training:

> No classifier performs better than the quality of its training set

And here are the rules for how to go about implementing that principle:

- Collectively, the examples should be representative of the entire dataset.
- Negative examples are as important as positive examples.
- The more examples the merrier.

Needless to say, mislabeling positive examples as negative examples, or vice versa, is very harmful to the quality of the trained smart classifier.

This is the second principle of classifier training:

> Training a classifier is an iterative process

Once a smart classifier is trained and put to use, one will discover erroneous classifications.
One can improve the smart classifier by adding some of the erroneously classified documents to the training set and then retraining the classifier.

# Training Dataset Example

In this section we introduce an imaginary dataset for which we are to train a smart classifier. The overall dataset consists of newspaper articles about the following topics:

- **Football** - Football match reports, team and player analyses, and commentaries
- **Sport (not Football)** - Match/event reporting and other commentaries from sports other than football, for instance articles about tennis or swimming
- **Football Player Transfers** - Articles discussing the financial aspects of football clubs and their player trading activities
- **Non-Football Player Transfers** - Financial news about athletes who are traded in sports other than football, for instance the trading of NBA players
- **Famous People Gossip (including athletes)** - Gossip about movie stars, top athletes, and politicians, for instance a story about a player being divorced or seen with a new flame
- **Financial News** - General financial news, for instance about stocks and company acquisitions
- **Football Stadium Constructions** - Specific financial news about the construction of football stadiums
- **Local News** - Anything local, for instance the opening of a new book store or a series of recent local burglaries
- **Local Weather** - Today’s temperature, tomorrow and next week’s forecast, etc.

Here we see a graphical presentation of the same overall dataset:

![dataset][dataset]{width=50%}

We will now create a dataset of positive and negative examples to train a smart classifier that can separate the green-circled football-related data from the rest.

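
As a rough illustration, the topic list above can be turned into labeled training examples. The sketch below is not part of Ayfie Locator; the names `TOPIC_LABELS`, `build_training_set`, and `topic_coverage` and the article snippets are hypothetical. It labels only the topics whose role is spelled out in the discussion of the rules in this section, and leaves out the topics whose role depends on the figure.

```python
from collections import Counter

# Topic -> label. Assumption: only topics whose role is stated in the text
# are listed; "Non-Football Player Transfers" and "Football Stadium
# Constructions" are left out because their role depends on the figure.
TOPIC_LABELS = {
    "Football": "positive",
    "Football Player Transfers": "positive",
    "Sport (not Football)": "negative",
    "Famous People Gossip": "negative",
    "Financial News": "negative",
    "Local News": "negative",
    "Local Weather": "negative",
}


def build_training_set(articles):
    """Turn (topic, text) pairs into (text, label) pairs, skipping unlabeled topics."""
    return [
        (text, TOPIC_LABELS[topic])
        for topic, text in articles
        if topic in TOPIC_LABELS
    ]


def topic_coverage(articles):
    """Count examples per labeled topic to spot topics that are missing or thin."""
    counts = Counter(topic for topic, _ in articles if topic in TOPIC_LABELS)
    return {topic: counts.get(topic, 0) for topic in TOPIC_LABELS}


# Hypothetical article snippets, one per topic that happens to be available.
articles = [
    ("Football", "United won 2-1 after a late penalty."),
    ("Football Player Transfers", "The club agreed a fee for the striker."),
    ("Sport (not Football)", "She took the tennis title in straight sets."),
    ("Famous People Gossip", "The midfielder was seen with a new partner."),
    ("Local Weather", "Rain is expected tomorrow afternoon."),
]

training_set = build_training_set(articles)
coverage = topic_coverage(articles)

# Topics with no examples yet: the training set is not representative of them.
missing = [topic for topic, count in coverage.items() if count == 0]
```

A coverage check of this kind is one simple way to notice that a topic has no examples yet, which is what the representativeness rule is meant to prevent.
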
Let’s see how the rules we introduced above now come into play:

- **Collectively, the examples should be representative of the entire dataset** - We must make sure that the positive examples include both pure football articles and financial news articles related to the trading of players.
- **Negative examples are as important as positive examples** - Adding local news and local weather articles as negative examples will help the smart classifier understand what distinguishes the football articles from any other article. In addition, it is important to add borderline articles that are closer to the positive data. For instance, adding other sports articles will help the smart classifier understand that an article about a team winning a match is not necessarily a football article; it could, for instance, be a basketball article. By adding gossip articles, the smart classifier learns that the mention of a known football player does not necessarily make an article about football. By adding some financial news articles about the selling and purchasing of products, the smart classifier learns that not all trading articles are related to the trading of football players. And so on.
- **The more examples the merrier** - The more examples, positive and negative, the better. However, adding identical or very similar examples multiple times adds no value. What truly enhances the smart classifier's performance is the inclusion of a large number of diverse examples spread across the full dataset.

# The Inner Workings of the Smart Classifier

At this point we have provided the smart classifier with training data consisting of positive and negative examples. In this section we will learn what a smart classifier does with it.

There are many different algorithms that can be used to train a smart classifier. Depending on the type of data, some work better than others.
The smart classifier knows many of them, and it tries them one by one to see which one gives the best overall result. These are the steps that are carried out:

1. The smart classifier splits the training data into two sets: a large set used to train the smart classifier and a small set used afterwards to measure the accuracy of the trained smart classifier.
2. The smart classifier repeats the steps below for each algorithm that it tries:
    1. The smart classifier is trained with the large set.
    2. The now trained smart classifier classifies the small set.
    3. It calculates the accuracy of the classification it just did.
3. Once it has tried all the algorithms, it selects the algorithm with the highest accuracy.
4. The smart classifier does a final training with the winning algorithm using all the training data.

## Calculating the Accuracy

In step 2.3, the accuracy of each tried algorithm is calculated in the form of a single number called the *Macro F1 Score*. The Macro F1 Score balances two important aspects: how good the Smart Classifier is at not including the wrong documents (precision) and how good the Smart Classifier is at including all of the correct documents (recall). The higher the Macro F1 Score, the better the system is at getting both precision and recall right.

## Smart Classifier States

Once one has created a new smart classifier in the search UI, that smart classifier will move through the following five states:

- **Not Ready to Train** – This is the start state, in which the smart classifier remains until one has provided enough positive and negative examples. Once that happens, the state automatically changes to the next state below (Ready to Train). This does not mean, however, that one needs to or should start training right away, as there is no upper limit on the number of examples one can provide. The more, the better.
- **Ready to Train** – One has now entered the minimum required number of example documents and can choose either to continue adding examples or to start training the smart classifier. In the latter case, the smart classifier enters the next state below (Training).
- **Training** – The smart classifier is being trained. That is, it performs the steps described above in the section **The Inner Workings of the Smart Classifier**. It remains here until the training is complete, at which point it automatically enters the next state (Trained).
- **Trained** – The smart classifier training is complete, and one can now choose to activate the smart classifier, which brings it into the next state below (Active).
- **Active** – Once activated, the smart classifier starts to classify all or parts of the search index data, depending on user selection.

[//]: # (Embedded Images)
[dataset]:  |
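
The five states in the Smart Classifier States section form a simple linear progression, which can be sketched as a small state machine. This is only an illustration of the transitions described there, not Ayfie Locator code; the event names (`enough_examples`, `start_training`, `training_complete`, `activate`) are hypothetical labels for the triggers mentioned in the text.

```python
from enum import Enum


class State(Enum):
    NOT_READY_TO_TRAIN = "Not Ready to Train"
    READY_TO_TRAIN = "Ready to Train"
    TRAINING = "Training"
    TRAINED = "Trained"
    ACTIVE = "Active"


# (current state, event) -> next state. Event names are assumptions made
# for this sketch; the documentation describes the triggers only in prose.
TRANSITIONS = {
    (State.NOT_READY_TO_TRAIN, "enough_examples"): State.READY_TO_TRAIN,  # automatic
    (State.READY_TO_TRAIN, "start_training"): State.TRAINING,             # user action
    (State.TRAINING, "training_complete"): State.TRAINED,                 # automatic
    (State.TRAINED, "activate"): State.ACTIVE,                            # user action
}


def next_state(state, event):
    """Return the state that follows `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.value!r}") from None


# Walk a new smart classifier through its whole life cycle.
state = State.NOT_READY_TO_TRAIN
for event in ("enough_examples", "start_training", "training_complete", "activate"):
    state = next_state(state, event)
```

Two of the transitions happen automatically and two require a user action, which the comments in the table mark.
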
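
The training steps in The Inner Workings of the Smart Classifier and the Macro F1 Score from Calculating the Accuracy can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the Smart Classifier's actual implementation: the document does not name the algorithms that are tried, so the two "algorithms" here (`fit_always_negative`, `fit_keyword`) are toy stand-ins, the data is invented, and the split in step 1 is given up front rather than computed. Macro F1 is taken in its standard sense, the unweighted mean of the per-class F1 scores.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def fit_always_negative(examples):
    """Toy baseline: ignores the examples and always answers 'negative'."""
    return lambda text: "negative"


def fit_keyword(examples):
    """Toy model: 'positive' if the text shares a word seen only in positives."""
    pos_words, neg_words = set(), set()
    for text, label in examples:
        target = pos_words if label == "positive" else neg_words
        target.update(text.lower().split())
    only_pos = pos_words - neg_words
    return lambda text: (
        "positive" if only_pos & set(text.lower().split()) else "negative"
    )


def select_best(algorithms, train, held_out):
    """Steps 2 to 4: score every algorithm, pick the best, retrain on all data."""
    texts = [text for text, _ in held_out]
    labels = [label for _, label in held_out]
    scores = {}
    for name, fit in algorithms.items():
        predict = fit(train)                           # step 2.1
        predictions = [predict(t) for t in texts]      # step 2.2
        scores[name] = macro_f1(labels, predictions)   # step 2.3
    best = max(scores, key=scores.get)                 # step 3
    final_model = algorithms[best](train + held_out)   # step 4
    return best, scores, final_model


# Step 1, assumed already done: a larger training set and a small held-out set.
train = [
    ("the striker scored a goal", "positive"),
    ("the keeper saved a penalty", "positive"),
    ("rain expected tomorrow", "negative"),
    ("stocks fell on monday", "negative"),
]
held_out = [
    ("a late goal won the match", "positive"),
    ("sunny weather tomorrow", "negative"),
]
algorithms = {"always_negative": fit_always_negative, "keyword": fit_keyword}
best, scores, final_model = select_best(algorithms, train, held_out)
```

The always-negative baseline gets every negative document right but finds none of the positive ones, so its F1 for the positive class is zero and its Macro F1 is low. This is why a score that averages over both classes is a more useful selection criterion than the plain share of correct answers.
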