Package org.apache.lucene.search.vectorhighlight

This is another highlighter implementation.


Interface Summary
BoundaryScanner  
FragListBuilder FragListBuilder is an interface for FieldFragList builder classes.
FragmentsBuilder FragmentsBuilder is an interface for fragments (snippets) builder classes.
 

Class Summary
BaseFragmentsBuilder  
BreakIteratorBoundaryScanner A BoundaryScanner implementation that uses BreakIterator to find boundaries in the text.
FastVectorHighlighter Another highlighter implementation.
FieldFragList FieldFragList has a list of "frag info" that is used by FragmentsBuilder class to create fragments (snippets).
FieldFragList.WeightedFragInfo  
FieldFragList.WeightedFragInfo.SubInfo  
FieldPhraseList FieldPhraseList has a list of WeightedPhraseInfo that is used by FragListBuilder to create a FieldFragList object.
FieldPhraseList.WeightedPhraseInfo  
FieldPhraseList.WeightedPhraseInfo.Toffs  
FieldQuery FieldQuery breaks down a query object into terms/phrases and keeps them in a QueryPhraseMap structure.
FieldQuery.QueryPhraseMap  
FieldTermStack FieldTermStack is a stack that keeps query terms in the specified field of the document to be highlighted.
FieldTermStack.TermInfo  
ScoreOrderFragmentsBuilder An implementation of FragmentsBuilder that outputs score-order fragments.
ScoreOrderFragmentsBuilder.ScoreComparator  
SimpleBoundaryScanner  
SimpleFragListBuilder A simple implementation of FragListBuilder.
SimpleFragmentsBuilder A simple implementation of FragmentsBuilder.
SingleFragListBuilder An implementation class of FragListBuilder that generates one FieldFragList.WeightedFragInfo object.
 

Package org.apache.lucene.search.vectorhighlight Description

This is another highlighter implementation.

Features

Algorithm

To explain the algorithm, let's use the following sample text (to be highlighted) and user query:

Sample Text: Lucene is a search engine library.
User Query:  Lucene^2 OR "search library"~1

The user query is a BooleanQuery that consists of TermQuery("Lucene") with boost of 2 and PhraseQuery("search library") with slop of 1.

For your convenience, here are the offsets and positions of the sample text.

+--------+-----------------------------------+
|        |          1111111111222222222233333|
|  offset|01234567890123456789012345678901234|
+--------+-----------------------------------+
|document|Lucene is a search engine library. |
+--------+-----------------------------------+
|position|0      1  2 3      4      5        |
+--------+-----------------------------------+
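
The table can be reproduced with a few lines of code. The following is a minimal sketch, not Lucene's analyzer: OffsetSketch and its Token class are hypothetical names, and the tokenizer simply treats every run of letters or digits as one term. End offsets are exclusive, as in the table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): records offsets and positions
// for each run of letters/digits in the text.
class OffsetSketch {
    static final class Token {
        final String text;
        final int start;     // start offset, inclusive
        final int end;       // end offset, exclusive
        final int position;  // 0-based term position

        Token(String text, int start, int end, int position) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.position = position;
        }

        @Override
        public String toString() {
            return "\"" + text + "\"(" + start + "," + end + "," + position + ")";
        }
    }

    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;  // skip separators such as spaces and the final period
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i), start, i, position++));
        }
        return tokens;
    }
}
```

Calling OffsetSketch.tokenize("Lucene is a search engine library.") yields six tokens, including "search"(12,18,3) and "library"(26,33,5).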

Step 1.

In Step 1, Fast Vector Highlighter generates FieldQuery.QueryPhraseMap from the user query. QueryPhraseMap consists of the following members:

public class QueryPhraseMap {
  boolean terminal;
  int slop;   // valid if terminal == true and phraseHighlight == true
  float boost;  // valid if terminal == true
  Map<String, QueryPhraseMap> subMap;
} 

Each QueryPhraseMap has a subMap. Each key of the subMap is the text of a term in the user query, and the corresponding value is the subsequent QueryPhraseMap. If the query is a single term (not a phrase), the subsequent QueryPhraseMap is marked as terminal. If the query is a phrase, the subsequent QueryPhraseMap is not terminal; its own subMap holds the next term of the phrase, and only the QueryPhraseMap reached after the last term is terminal.

From the sample user query, the following QueryPhraseMap will be generated:

   QueryPhraseMap
+--------+-+  +-------+-+
|"Lucene"|o+->|boost=2|*|  * : terminal
+--------+-+  +-------+-+

+--------+-+  +---------+-+  +-------+------+-+
|"search"|o+->|"library"|o+->|boost=1|slop=1|*|
+--------+-+  +---------+-+  +-------+------+-+
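
The structure in the diagram can be sketched as a small trie. This is an illustration only, under the assumption that nothing more than the members listed above is needed; PhraseMapSketch and its add, lookup and sampleQuery methods are hypothetical names, not the API of FieldQuery.QueryPhraseMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for QueryPhraseMap: a trie keyed by term text.
class PhraseMapSketch {
    boolean terminal;
    int slop;     // meaningful only on terminal nodes
    float boost;  // meaningful only on terminal nodes
    final Map<String, PhraseMapSketch> subMap = new LinkedHashMap<>();

    // Adds a term (one-element array) or a phrase; only the node reached
    // after the last term is marked terminal.
    void add(String[] terms, int slop, float boost) {
        PhraseMapSketch node = this;
        for (String term : terms) {
            node = node.subMap.computeIfAbsent(term, k -> new PhraseMapSketch());
        }
        node.terminal = true;
        node.slop = slop;
        node.boost = boost;
    }

    // Follows the given terms from this node; returns null if the path is absent.
    PhraseMapSketch lookup(String... terms) {
        PhraseMapSketch node = this;
        for (String term : terms) {
            node = node.subMap.get(term);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    // Builds the map for the sample query: Lucene^2 OR "search library"~1
    static PhraseMapSketch sampleQuery() {
        PhraseMapSketch root = new PhraseMapSketch();
        root.add(new String[] {"Lucene"}, 0, 2f);
        root.add(new String[] {"search", "library"}, 1, 1f);
        return root;
    }
}
```

In this sketch, sampleQuery().lookup("search") is a non-terminal node, while lookup("search", "library") is terminal with slop 1 and boost 1.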

Step 2.

In Step 2, Fast Vector Highlighter generates FieldTermStack. It reads the field's TermFreqVector data to do so, which means the field must be indexed with Field.TermVector.WITH_POSITIONS_OFFSETS. FieldTermStack keeps only those terms of the field that appear in the user query. Therefore, in this sample case, Fast Vector Highlighter generates the following FieldTermStack:

   FieldTermStack
+------------------+
|"Lucene"(0,6,0)   |
+------------------+
|"search"(12,18,3) |
+------------------+
|"library"(26,33,5)|
+------------------+
where : "termText"(startOffset,endOffset,position)
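
The filtering in this step can be illustrated as follows. This is a rough sketch with one important simplification: the real highlighter reads offsets and positions from the term vector, whereas the hypothetical TermStackSketch below re-tokenizes the text with a regular expression to stay self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): keeps only the terms of the
// text that occur in the query, formatted as in the diagram above.
class TermStackSketch {
    static List<String> build(String text, Set<String> queryTerms) {
        List<String> stack = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(text);
        int position = 0;
        while (m.find()) {
            if (queryTerms.contains(m.group())) {
                stack.add("\"" + m.group() + "\"("
                        + m.start() + "," + m.end() + "," + position + ")");
            }
            position++;  // every term advances the position, matched or not
        }
        return stack;
    }
}
```

For the sample text and the query terms Lucene, search and library, build returns the three entries shown in the diagram; "is", "a" and "engine" are skipped but still consume positions 1, 2 and 4.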

Step 3.

In Step 3, Fast Vector Highlighter generates FieldPhraseList from QueryPhraseMap and FieldTermStack.

   FieldPhraseList
+----------------+-----------------+---+
|"Lucene"        |[(0,6)]          |w=2|
+----------------+-----------------+---+
|"search library"|[(12,18),(26,33)]|w=1|
+----------------+-----------------+---+

Each entry is a WeightedPhraseInfo, which consists of an array of term offsets and a weight. Fast Vector Highlighter calculates the weight from the query boost, and takes it into account when it creates FieldFragList in the next step.
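
To make the matching concrete, here is a simplified sketch of how a phrase might be checked against the term stack. The slop rule used (the gap between consecutive phrase terms may not exceed the slop) is an assumption made for illustration, and PhraseListSketch, Term, Phrase and match are hypothetical names rather than Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): matches one query phrase
// against the entries of a term stack.
class PhraseListSketch {
    static final class Term {
        final String text;
        final int start, end, position;

        Term(String text, int start, int end, int position) {
            this.text = text;
            this.start = start;
            this.end = end;
            this.position = position;
        }
    }

    static final class Phrase {
        final List<int[]> offsets = new ArrayList<>();  // {start, end} per term
        final float weight;                             // taken from the query boost

        Phrase(float weight) {
            this.weight = weight;
        }
    }

    // Returns the first match of the phrase on the stack, or null if none.
    static Phrase match(List<Term> stack, String[] phrase, int slop, float boost) {
        for (int i = 0; i + phrase.length <= stack.size(); i++) {
            Phrase candidate = new Phrase(boost);
            boolean ok = true;
            for (int j = 0; j < phrase.length && ok; j++) {
                Term t = stack.get(i + j);
                ok = t.text.equals(phrase[j]);
                if (ok && j > 0) {
                    // Assumed slop rule: terms skipped between neighbours <= slop.
                    int gap = t.position - stack.get(i + j - 1).position - 1;
                    ok = Math.abs(gap) <= slop;
                }
                if (ok) {
                    candidate.offsets.add(new int[] {t.start, t.end});
                }
            }
            if (ok) {
                return candidate;
            }
        }
        return null;
    }

    // The term stack from Step 2.
    static List<Term> sampleStack() {
        List<Term> stack = new ArrayList<>();
        stack.add(new Term("Lucene", 0, 6, 0));
        stack.add(new Term("search", 12, 18, 3));
        stack.add(new Term("library", 26, 33, 5));
        return stack;
    }
}
```

With the sample stack, the phrase "search library" matches at slop 1 because one term ("engine", position 4) lies between positions 3 and 5; with slop 0 the same call returns null.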

Step 4.

In Step 4, Fast Vector Highlighter creates FieldFragList from FieldPhraseList. In this sample case, the following FieldFragList will be generated:

   FieldFragList
+---------------------------------+
|"Lucene"[(0,6)]                  |
|"search library"[(12,18),(26,33)]|
|totalBoost=3                     |
+---------------------------------+
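
The totalBoost value is the sum of the weights of the phrases in the fragment (2 + 1 = 3 here). A minimal sketch of that sum follows, assuming a phrase counts toward a fragment only when it lies entirely inside the fragment's character window; FragBoostSketch and the window parameters are hypothetical, and how the real FragListBuilder implementations choose the window is not covered here.

```java
// Hypothetical helper (not part of Lucene): sums the weights of the
// phrases that fall inside one fragment window.
class FragBoostSketch {
    // spans[i] = {startOffset, endOffset} of phrase i; weights[i] is its weight.
    static float totalBoost(int[][] spans, float[] weights, int fragStart, int fragEnd) {
        float total = 0f;
        for (int i = 0; i < spans.length; i++) {
            boolean inside = spans[i][0] >= fragStart && spans[i][1] <= fragEnd;
            if (inside) {
                total += weights[i];
            }
        }
        return total;
    }
}
```

For the sample, the spans are {0,6} for "Lucene" and {12,33} for "search library" with weights 2 and 1. A window of 0 to 100 gives a total of 3, while a window of 0 to 20 would contain only "Lucene" and give 2.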

Step 5.

In Step 5, Fast Vector Highlighter uses FieldFragList and the stored field data to create the highlighted snippets.
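
As an illustration of this final step, the sketch below wraps each recorded offset range of the stored text in a pair of tags. SnippetSketch is a hypothetical name and the tags are passed in as parameters; the offsets are assumed to be sorted and non-overlapping. It is not the FragmentsBuilder implementation.

```java
// Hypothetical helper (not part of Lucene): inserts tags around the
// given offset ranges of the stored field text.
class SnippetSketch {
    // offsets[i] = {start, end}, sorted ascending and non-overlapping.
    static String highlight(String text, int[][] offsets, String preTag, String postTag) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] o : offsets) {
            sb.append(text, last, o[0])   // untouched text before the term
              .append(preTag)
              .append(text, o[0], o[1])   // the highlighted term
              .append(postTag);
            last = o[1];
        }
        sb.append(text, last, text.length());  // remainder after the last term
        return sb.toString();
    }
}
```

Applied to the sample text with the offsets (0,6), (12,18) and (26,33) and the tags <b> and </b>, this produces: <b>Lucene</b> is a <b>search</b> engine <b>library</b>.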