Document Conversions & Estimating Initial Indexing Time

ViaWorks Converters

ViaWorks utilizes three different converters to extract text from different file formats:

The Text Converter
The Document Filters Converter
The Optical Character Recognition (OCR) Converter

Text Converter

The text converter is used on files that contain plain ASCII text, such as .txt and .csv files. Conversions are very quick regardless of file size, so they account for only a small share of the processing time during the initial indexing period.

Document Filters Converter

The Document Filters converter extracts text from formatted documents, such as files saved from Word, Excel, and Adobe Acrobat (e.g. .doc, .docx, .xls, .xlsx, .pdf), as well as from email messages. To do so, ViaWorks uses Document Filters from Perceptive Software. With the current release, ViaWorks can index the file formats described in Perceptive Document Filters Supported Formats 11.1.

Currently, we use a rate of 60 documents per CPU core per minute to estimate the time required to convert documents. This assumes well-performing CPUs, such as recent Intel Xeon processors. As an example, assume there are 50,000 documents and a single ViaWorks server configured with 2 CPUs (each with 4 cores). The estimated time required to convert these files is calculated as follows:

Total #Documents: 50,000

#CPU Cores: 8

Conversion Rate: 60 documents/CPU Core/Minute

Total Minutes: 104 (50,000 / (8 * 60))

Total Hours: 1.7
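
The arithmetic above can be sketched as a small Python function. The function name and its parameters are illustrative only and are not part of ViaWorks; the rate of 60 documents per core per minute is the planning figure quoted above, and the sketch assumes every core converts at that same constant rate.

```python
def estimate_conversion_minutes(item_count, cpu_cores, rate_per_core_per_minute):
    """Return the estimated conversion time in minutes.

    Assumes a constant rate per core and a server that does nothing
    but conversions. This is a planning estimate, not a measurement.
    """
    if cpu_cores <= 0 or rate_per_core_per_minute <= 0:
        raise ValueError("cpu_cores and rate_per_core_per_minute must be positive")
    return item_count / (cpu_cores * rate_per_core_per_minute)


# Worked example from the text: 50,000 documents, 8 cores, 60 docs/core/minute.
minutes = estimate_conversion_minutes(50_000, cpu_cores=8, rate_per_core_per_minute=60)
hours = minutes / 60
print(f"{minutes:.0f} minutes, about {hours:.1f} hours")
```

Note that the core count and the rate are multiplied together before dividing, which is why the divisor is written as (8 * 60).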

OCR Converter

The OCR converter is used to extract text from image files (e.g. .jpg, .gif, .png) as well as from .pdf and .tif files with embedded images. Of the three converters, it requires by far the most processing time. For TIF/TIFF and PDF files with embedded images, OCR conversion can be very time consuming, depending on the number of image pages in each file and the amount of information on each page.

As an example, let’s say there are over 100,000 PDF files across all of the repositories that will be indexed, with each PDF file containing anywhere from 10 to 300 pages of images. Let’s also assume each page contains a good amount of text within the scanned image, such as a completed legal form. In this case, the company has done what many companies do to convert physical paper documents to digital format: it has scanned all of its documents and saved them as multi-page PDF files, with each scanned page saved as a separate page/image within the PDF.

To estimate the time ViaWorks will require to index and convert these documents, it is important to first obtain a total page count. VirtualWorks provides the Indexing Metrics and Analytics Tool (IMaAT) for this purpose. The tool reports the file count and total size in bytes, by file extension, for all files found on the target file share. It also counts the number of pages in PDF and TIF/TIFF files. See the article Determining Document Counts and Types for details on using the tool and to download it.
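
To illustrate the kind of per-extension summary described above, here is a minimal Python sketch. It is not IMaAT and does not reproduce its output format; the function name is hypothetical, and it only counts files and bytes. It does not count pages inside PDF or TIF/TIFF files, which IMaAT does.

```python
import os
from collections import defaultdict


def summarize_by_extension(root):
    """Walk root and return {extension: [file_count, total_bytes]}."""
    summary = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Normalize case so .PDF and .pdf are counted together.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that cannot be read
            summary[ext][0] += 1
            summary[ext][1] += size
    return dict(summary)


if __name__ == "__main__":
    # "." is a placeholder; point this at the file share to be indexed.
    for ext, (count, size) in sorted(summarize_by_extension(".").items()):
        print(f"{ext}: {count} files, {size} bytes")
```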

Currently, we use a rate of 4 pages per CPU core per minute to estimate the time required for OCR conversion. This assumes well-performing CPUs, such as recent Intel Xeon processors. Continuing with this customer example, let’s assume IMaAT reported a total of 2 million pages within all of the PDF and TIF/TIFF files found across the targeted file shares. Let’s also assume there will be a single ViaWorks server configured with 2 CPUs (each with 4 cores). The estimated time required to convert these files is calculated as follows:

Total #Pages:     2,000,000

#CPU Cores:       8

Conversion Rate:  4 pages/CPU Core/Minute

Total Minutes:    62,500 (2,000,000 / (8 * 4))

Total Hours:      1,042

Total Days:       43
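
The same calculation can be sketched in Python, carried through to hours and days. The function name is illustrative and not part of ViaWorks; the default of 4 pages per core per minute is the planning rate quoted above.

```python
def estimate_ocr_time(total_pages, cpu_cores, pages_per_core_per_minute=4):
    """Return (minutes, hours, days) for OCR conversion.

    Assumes a constant per-core rate and a server dedicated to
    conversions around the clock.
    """
    minutes = total_pages / (cpu_cores * pages_per_core_per_minute)
    hours = minutes / 60
    days = hours / 24
    return minutes, hours, days


# Worked example from the text: 2,000,000 pages on 8 cores.
minutes, hours, days = estimate_ocr_time(2_000_000, cpu_cores=8)
print(f"{minutes:,.0f} minutes, {hours:,.0f} hours, {days:.0f} days")
```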

The above formula provides only a rough estimate of the processing time required to convert all of these files. Indexing progress should be monitored for several days to refine the estimate based on how the conversions are actually proceeding. Furthermore, the number of new documents added to the repositories each day also needs to be factored into the daily conversions.
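
One possible way to refine the estimate after a few days of monitoring is to replace the planning rate with the observed throughput, less the pages that arrive each day. The sketch below is an assumption about how such a projection could be done, not a ViaWorks feature; the function name and all example figures (150,000 pages converted in 3 days, 5,000 new pages per day) are hypothetical.

```python
import math


def projected_days_remaining(pages_remaining, observed_pages_per_day, new_pages_per_day=0):
    """Project the days left, given observed throughput and daily arrivals.

    Returns math.inf when new pages arrive at least as fast as they are
    converted, because the backlog would never be cleared.
    """
    net_rate = observed_pages_per_day - new_pages_per_day
    if net_rate <= 0:
        return math.inf
    return pages_remaining / net_rate


# Hypothetical monitoring data for the 2,000,000-page example.
pages_converted = 150_000
days_elapsed = 3
observed_rate = pages_converted / days_elapsed
pages_remaining = 2_000_000 - pages_converted

days_left = projected_days_remaining(pages_remaining, observed_rate, new_pages_per_day=5_000)
print(f"About {days_left:.0f} more days")
```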

The formula also assumes that the ViaWorks server is 100% dedicated to conversions, so no user searches would be performed. If users must be able to search during this time, ViaWorks should be configured to run conversions only during hours when users are not searching. This would extend the projected completion time.

Additional ViaWorks servers can be configured during the initial indexing phase to help with the file conversions and reduce the total time needed to convert all files. For example, a secondary ViaWorks server can be configured to perform fetch/conversions 100% of the time, while the primary ViaWorks server performs fetch/conversions for 16 of the 24 hours each day, leaving it free to respond to search requests during the remaining 8 hours.
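
The effect of conversion schedules and additional servers on the OCR estimate can be sketched as follows. This is a rough model under stated assumptions: the function name is illustrative, the secondary server is assumed to have the same 8 cores as the primary, and the 4 pages per core per minute rate is assumed to hold on both servers.

```python
MINUTES_PER_HOUR = 60


def pages_per_day(servers, pages_per_core_per_minute=4):
    """Return the combined daily OCR capacity for a list of servers.

    Each server is a (cpu_cores, conversion_hours_per_day) tuple.
    """
    return sum(
        cores * pages_per_core_per_minute * MINUTES_PER_HOUR * hours
        for cores, hours in servers
    )


total_pages = 2_000_000

# Primary server alone, converting 16 hours per day.
primary_only = pages_per_day([(8, 16)])

# Primary (16 hours/day) plus a secondary converting 24 hours/day.
# The secondary's 8 cores are an assumption for this example.
both = pages_per_day([(8, 16), (8, 24)])

print(f"Primary only: {total_pages / primary_only:.0f} days")
print(f"Primary + secondary: {total_pages / both:.0f} days")
```

Under these assumptions, restricting a single server to 16 hours per day stretches the 43-day estimate to roughly 65 days, while adding a full-time secondary server brings it down to roughly 26 days.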

OCR Conversion Accuracy

In addition to conversion time, it is very important to understand the issue of text quality within the images. If the text characters in an image are unclear, distorted, or too small, or if the individual characters in a word are not on the same plane as one another, the OCR converter will not be able to extract the text, and the data will not be entered into the index. This is especially true of handwritten text, and even of certain electronic fonts.

It is strongly recommended to obtain sample files from the repositories that will be indexed to test for OCR converter accuracy. ViaWorks includes a graphical tool that can be used to perform conversions on individual files to determine what text, if any, can be extracted from a file. This utility can be found within the “Program Files\VirtualWorks\ViaWorks\Tools” folder on the ViaWorks server. Run the “Via.Platform.Converter.Utility.exe” program to test the conversion of any file.

Summary

Understanding the number and composition of documents in the various repositories is essential to providing guidance on the hardware required for the ViaWorks server and to estimating the time required to complete the initial indexing. This means obtaining not only valid document counts but also the file formats and, for PDF and TIF/TIFF files with embedded images, the page counts within those files.

Furthermore, conversion accuracy is highly dependent upon the quality of the text contained within the images. Sampling and testing individual files will help determine what can be expected.