Single-Page Search-Indexing for Multipage DjVu

a Report by PlanetDjVu, May 7, 2003

Single-page search-indexing is a method of full-text indexing and searching that treats each page of a multipage document as a separate document, rather than treating all of its pages as one document. A search therefore returns the individual pages that contain the query term, not just the documents that contain it somewhere.
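A minimal sketch of the idea in Python, assuming the page text has already been extracted. The file names and page texts below are hypothetical, and this is not the indexing code used by either example in this report; it only shows what it means to treat each page as its own searchable unit.

```python
import re
from collections import defaultdict


def build_page_index(documents):
    """Map each word to the set of (document name, page number) pairs containing it.

    `documents` maps a document name to a list of page texts, one per page.
    """
    index = defaultdict(set)
    for doc_name, pages in documents.items():
        for page_number, text in enumerate(pages, start=1):
            for word in re.findall(r"[a-z0-9]+", text.lower()):
                index[word].add((doc_name, page_number))
    return index


def search(index, term):
    """Return the sorted list of (document name, page number) hits for one term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical two-document collection, used only for illustration.
documents = {
    "book.djvu": ["The ship sailed at dawn.", "Gravel covered the plain."],
    "news.djvu": ["A ship was lost.", "Markets were quiet.", "The ship returned."],
}
index = build_page_index(documents)
hits = search(index, "ship")
```

Each hit names a page, not just a document, so a result list can link straight to the page where the term occurs.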

We are pleased to present two examples of single-page search-indexing. Both use the INDIRECT storage format of DjVu files, in which each page is stored as a separate file referenced by a small index file. This lets a search result open a single page WITHIN a multipage document, and you can still navigate to the other pages of the document after the selected page is opened.
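As an illustration of how a search result can point at one page of an INDIRECT document, the sketch below builds a link using the `djvuopts` query syntax that DjVu browser plugins accept for viewer options such as `page`. The host name and path are hypothetical, and this is an assumption about how such links can be formed, not a description of how either example site generates them.

```python
def page_url(index_url, page_number):
    """Build a link that asks the DjVu viewer to open a given page.

    `index_url` is the URL of the INDIRECT document's index file.
    The viewer fetches only the requested page file first, and the
    remaining pages stay reachable through normal navigation.
    """
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    return f"{index_url}?djvuopts&page={page_number}"


# Hypothetical location of an INDIRECT document's index file.
link = page_url("http://example.org/books/ragnarok/index.djvu", 42)
```

A search result list would emit one such link per hit, so every hit opens inside the complete book.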

The first example is from the University of Georgia. It is a book titled Ragnarok: The Age of Fire and Gravel.

Click here to open the search page for this book.

Go ahead and run a search query. We suggest that you use the word "ship". You will see that a set of individual pages is listed. When you open one of them, it opens at that page within the complete book.

The second example is from JRA, using our SearchPDF product. It is a small collection of historic newspapers: The Investor's Monthly Report.

Click here to open the search page for this collection.

Go ahead and run a search query. We suggest that you use the word "ship". You will see that a set of individual pages is listed. As in the University of Georgia example, when you open one of them, it opens at that page within the complete newspaper.

Whereas the University of Georgia example displays the page file name as the title of each page, in the historic newspaper example each page has its own Title field, which is stored in the page and indexed from it. This is an example of "page-level metadata", which can be created in DjVu files only with our JRAPublish product.

In fact, three types of stored metadata are used in the historic newspaper example:

Unique Page-level metadata
Shared (common) Page-level metadata
Document-level metadata

The page-level metadata was generated automatically by JRAConvert, a utility application that is bundled with JRAPublish.
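To make the three metadata types concrete, here is a small sketch of how an indexer might combine them into the fields for one page. The field names and values are hypothetical, and the precedence rule (unique page-level over shared page-level over document-level) is our assumption for illustration, not a description of how JRAPublish or SearchPDF works internally.

```python
def effective_metadata(document_meta, shared_page_meta, unique_page_meta):
    """Merge the three metadata layers for a single page.

    Later layers override earlier ones (an assumed rule):
    document-level < shared page-level < unique page-level.
    """
    merged = dict(document_meta)
    merged.update(shared_page_meta)
    merged.update(unique_page_meta)
    return merged


# Hypothetical field names and values, for illustration only.
document_meta = {"Title": "Sample Newspaper", "Publisher": "Sample Press"}
shared_page_meta = {"Issue": "January"}
unique_page_meta = {"Title": "Sample Newspaper, January, page 3", "Page": "3"}

page_fields = effective_metadata(document_meta, shared_page_meta, unique_page_meta)
```

Under this rule the page keeps its own Title while still inheriting the fields it does not define itself.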

You will also notice that, in the historic newspaper example, the pages open with the search terms highlighted.


The main point of this article is to show how the single pages of a multipage document can be indexed without actually breaking the document apart into separately stored single pages, as has been done in the past with PDF files. The INDIRECT storage format of DjVu makes this a straightforward task. JRA has taken this one step further by adding metadata to individual pages as well as to the full document.

JRA is the only vendor offering software for page-level metadata indexing. For document-level metadata in DjVu files, you can also use "djvused" from the DjVuLibre library. LizardTech does not offer any software for storing metadata in DjVu files, or for searching DjVu document collections.
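For readers who want to try the djvused route, the sketch below prepares a metadata file and the matching command line without running anything. It assumes the `set-meta` command and the `-e` and `-s` options described in the djvused manual, with one `key "value"` entry per line in the metadata file. The file names and metadata values are hypothetical, and which component `set-meta` writes to depends on the selection in effect, so check the manual for your DjVuLibre version before relying on this.

```python
def meta_file_text(metadata):
    """Render metadata as one `key "value"` line per entry.

    Backslashes and double quotes inside values are escaped.
    """
    lines = []
    for key, value in metadata.items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{key}\t"{escaped}"')
    return "\n".join(lines) + "\n"


def djvused_set_meta_command(djvu_path, meta_path):
    """Return the argument list for a djvused call that loads the metadata
    file and saves the document (-s). Assumes `meta_path` has no spaces."""
    return ["djvused", djvu_path, "-e", f"set-meta {meta_path}", "-s"]


# Hypothetical values, for illustration only.
text = meta_file_text({"Title": 'The "Sample" Book', "Author": "A. Writer"})
command = djvused_set_meta_command("book.djvu", "meta.txt")
```

The returned list can be passed to `subprocess.run` once the metadata file has been written to disk.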


Adobe will be releasing Acrobat 6 at the end of this month, and one of the many new features will be the ability to store page-level metadata!  Because there is no equivalent to the INDIRECT format for PDF, Adobe uses an XML metadata tree to identify the page-level metadata for each page of the PDF file.

JRA will be working on adding support for page-level metadata in PDF to the SearchPDF product. Until this happens, DjVu has the edge on this approach to document search and retrieval.

For further reading and consideration, see our earlier, related article: Multiple Index Files for a Single INDIRECT DjVu.


Postscript:

While preparing this review, we noticed that the OCRed text created by the University of Georgia was fairly poor. It was produced with the Expervision OCR engine that is part of the LizardTech product offering. Here is an example:

"Vhgtever may be the cue, the fgct is certain that
over large regs in Scotland, Ireland, gnd gg]cs, I might
add throughout the northern hemisphere, on both sides
the Atlantic, the stratied drift of the glacial perio is
very commonly devoid of fossils."

We re-OCRed the book with JRAPublish, which uses the ABBYY FineReader OCR engine, and here is the same paragraph:

"Whatever may be the cause, the fact is certain that
over large areas in Scotland, Ireland, and Wales, I might
add throughout the northern hemisphere, on both sides of
the Atlantic, the stratified drift of the glacial period is
very commonly devoid of fossils."

What a great improvement a top-notch OCR engine makes!
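One rough way to put a number on such a comparison is to count how many words of a reference text survive unchanged in the OCR output. The sketch below uses Python's standard difflib for this. It treats the second paragraph above as the reference, which is an assumption made only for illustration, since that paragraph is itself OCR output and has not been checked against the printed page.

```python
from difflib import SequenceMatcher


def word_accuracy(ocr_text, reference_text):
    """Fraction of reference words that appear, in order, in the OCR text."""
    ref_words = reference_text.split()
    ocr_words = ocr_text.split()
    if not ref_words:
        raise ValueError("reference text is empty")
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


first_ocr = (
    "Vhgtever may be the cue, the fgct is certain that "
    "over large regs in Scotland, Ireland, gnd gg]cs, I might "
    "add throughout the northern hemisphere, on both sides "
    "the Atlantic, the stratied drift of the glacial perio is "
    "very commonly devoid of fossils."
)
second_ocr = (
    "Whatever may be the cause, the fact is certain that "
    "over large areas in Scotland, Ireland, and Wales, I might "
    "add throughout the northern hemisphere, on both sides of "
    "the Atlantic, the stratified drift of the glacial period is "
    "very commonly devoid of fossils."
)

accuracy = word_accuracy(first_ocr, second_ocr)
```

Words that fail to match are also words a full-text search cannot find, which is why OCR quality matters so much for page-level indexing.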














