org.apache.lucene.index.pruning
Class TermPruningPolicy

java.lang.Object
  extended by org.apache.lucene.index.pruning.PruningPolicy
      extended by org.apache.lucene.index.pruning.TermPruningPolicy
Direct Known Subclasses:
CarmelTopKTermPruningPolicy, CarmelUniformTermPruningPolicy, RIDFTermPruningPolicy, TFTermPruningPolicy

public abstract class TermPruningPolicy
extends PruningPolicy

Policy for producing a smaller index out of an input index by examining its terms and removing some or all of their data from the index.

For many types of queries, the pruned, smaller index returns nearly identical top-N results compared with the original index, but with increased performance.

Pruning is handy for producing small first-tier indexes that fit completely in RAM. Such indexes can be stored using IndexWriter.addIndexes(IndexReader...).

Interestingly, if the input index is optimized (i.e. doesn't contain deletions), then the index produced via IndexWriter.addIndexes(IndexReader[]) will preserve internal document ids so that they stay in sync with the original index. This means that all other auxiliary information not necessary for first-tier processing, such as some stored fields, can also be removed and then quickly retrieved on demand from the original index using the same internal document id. See StorePruningPolicy for information about removing stored fields.

Please note that while this family of policies produces good results for term queries, it often leads to poor results for phrase queries (because postings are removed without considering whether they belong to an important phrase).

Aggressive pruning policies produce smaller indexes: search performance increases, but recall decreases (i.e. search quality deteriorates).

See the following papers for a discussion of this problem and the proposed solutions to improve the quality of a pruned index (not implemented here):


Field Summary
protected  Map<String,Integer> fieldFlags
          Pruning operations to be conducted on fields.
protected  IndexReader in
           
 
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR
 
Constructor Summary
protected TermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags)
          Construct a policy.
 
Method Summary
abstract  void initPositionsTerm(TermPositions in, Term t)
          Called when moving TermPositions to a new Term.
 boolean pruneAllFieldPostings(String field)
          Pruning of all postings for a field.
abstract  boolean pruneAllPositions(TermPositions termPositions, Term t)
          Prune all postings per term (invoked once per term per doc).
 boolean prunePayload(TermPositions in, Term curTerm)
          Called when checking for the presence of a payload for the current term at the current position.
abstract  int pruneSomePositions(int docNum, int[] positions, Term curTerm)
          Prune some postings per term (invoked once per term per doc).
abstract  boolean pruneTermEnum(TermEnum te)
          Pruning of all postings for a term (invoked once per term).
abstract  int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector v)
          Pruning of individual terms in term vectors.
 boolean pruneWholeTermVector(int docNumber, String field)
          Term vector pruning.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

fieldFlags

protected Map<String,Integer> fieldFlags
Pruning operations to be conducted on fields.


in

protected IndexReader in

Constructor Detail

TermPruningPolicy

protected TermPruningPolicy(IndexReader in,
                            Map<String,Integer> fieldFlags)
Construct a policy.

Parameters:
in - input reader
fieldFlags - a map, where keys are field names and values are bitwise-OR flags of operations to be performed (see PruningPolicy for more details).
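
As a rough illustration of the fieldFlags convention, the sketch below builds such a map and tests a flag with a bitwise AND. It deliberately avoids a Lucene dependency: the DEL_* constants are local stand-ins named after those inherited from PruningPolicy, their numeric values are assumptions, and the hasFlag helper and the field names are hypothetical rather than part of the library.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone sketch. The constant names mirror PruningPolicy, but the
// numeric values are assumptions chosen only so that the bits are distinct.
class FieldFlagsSketch {
    static final int DEL_POSTINGS = 0x01;
    static final int DEL_STORED   = 0x02;
    static final int DEL_VECTOR   = 0x04;
    static final int DEL_PAYLOADS = 0x08;

    // Hypothetical helper: true if the given operation is requested for the field.
    static boolean hasFlag(Map<String, Integer> fieldFlags, String field, int flag) {
        Integer flags = fieldFlags.get(field);
        return flags != null && (flags & flag) != 0;
    }

    // Hypothetical configuration; "body" and "title" are made-up field names.
    static Map<String, Integer> exampleFlags() {
        Map<String, Integer> fieldFlags = new HashMap<>();
        // Several operations for one field are combined with bitwise OR.
        fieldFlags.put("body", DEL_POSTINGS | DEL_VECTOR);
        fieldFlags.put("title", DEL_PAYLOADS);
        return fieldFlags;
    }
}
```

A field that is absent from the map has no operations requested in this sketch, so hasFlag returns false for it.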

Method Detail

pruneWholeTermVector

public boolean pruneWholeTermVector(int docNumber,
                                    String field)
                             throws IOException
Term vector pruning.

Parameters:
docNumber - document number
field - field name
Returns:
true if the complete term vector for this field should be removed (as specified by PruningPolicy.DEL_VECTOR flag).
Throws:
IOException

pruneAllFieldPostings

public boolean pruneAllFieldPostings(String field)
                              throws IOException
Pruning of all postings for a field.

Parameters:
field - field name
Returns:
true if all postings for all terms in this field should be removed (as specified by PruningPolicy.DEL_POSTINGS).
Throws:
IOException

initPositionsTerm

public abstract void initPositionsTerm(TermPositions in,
                                       Term t)
                                throws IOException
Called when moving TermPositions to a new Term.

Parameters:
in - input term positions
t - current term
Throws:
IOException

prunePayload

public boolean prunePayload(TermPositions in,
                            Term curTerm)
Called when checking for the presence of a payload for the current term at the current position.

Parameters:
in - positioned term positions
curTerm - current term associated with these positions
Returns:
true if the payload should be removed, false otherwise.

pruneTermVectorTerms

public abstract int pruneTermVectorTerms(int docNumber,
                                         String field,
                                         String[] terms,
                                         int[] freqs,
                                         TermFreqVector v)
                                  throws IOException
Pruning of individual terms in term vectors.

Parameters:
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
v - the original term frequency vector
Returns:
0 if no terms are to be removed, positive number to indicate how many terms need to be removed. The same number of entries in the terms array must be set to null to indicate which terms to remove.
Throws:
IOException
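
The return contract above (null out the removed entries and return their count) can be sketched without any Lucene types. The class name, the method name and the minFreq threshold below are hypothetical; the selection rule is only one possible criterion, not the behaviour of any shipped subclass.

```java
// Stand-alone sketch of the "set to null and return the count" contract.
class TermVectorPruneSketch {
    // Nulls out every term whose frequency is below minFreq and returns
    // how many entries were nulled. The terms and freqs arrays are parallel.
    static int pruneLowFrequencyTerms(String[] terms, int[] freqs, int minFreq) {
        int removed = 0;
        for (int i = 0; i < terms.length; i++) {
            if (terms[i] != null && freqs[i] < minFreq) {
                terms[i] = null; // marks this term for removal
                removed++;
            }
        }
        return removed; // 0 means nothing is to be removed
    }
}
```

The returned count always equals the number of null entries written by this call, which is what the contract requires of an implementation.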

pruneTermEnum

public abstract boolean pruneTermEnum(TermEnum te)
                               throws IOException
Pruning of all postings for a term (invoked once per term).

Parameters:
te - positioned term enum.
Returns:
true if all postings for this term should be removed, false otherwise.
Throws:
IOException

pruneAllPositions

public abstract boolean pruneAllPositions(TermPositions termPositions,
                                          Term t)
                                   throws IOException
Prune all postings per term (invoked once per term per doc).

Parameters:
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
Returns:
true if the current posting should be removed, false otherwise.
Throws:
IOException

pruneSomePositions

public abstract int pruneSomePositions(int docNum,
                                       int[] positions,
                                       Term curTerm)
Prune some postings per term (invoked once per term per doc).

Parameters:
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term
Returns:
0 if no postings are to be removed, or positive number to indicate how many postings need to be removed. The same number of entries in the positions array must be set to -1 to indicate which positions to remove.
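
A minimal sketch of this contract, written against a plain int[] so that it runs without Lucene. The class name, the method name and the maxKeep parameter are hypothetical, and keeping only the first few positions is an arbitrary rule picked for illustration.

```java
// Stand-alone sketch of the "set to -1 and return the count" contract.
class PositionPruneSketch {
    // Keeps the first maxKeep positions, marks every later position with -1
    // and returns how many positions were marked.
    static int keepFirstPositions(int[] positions, int maxKeep) {
        int removed = 0;
        for (int i = Math.max(0, maxKeep); i < positions.length; i++) {
            if (positions[i] != -1) {
                positions[i] = -1; // marks this position for removal
                removed++;
            }
        }
        return removed; // 0 means nothing is to be removed
    }
}
```

For example, calling keepFirstPositions on {3, 17, 42, 99} with maxKeep = 2 leaves {3, 17, -1, -1} in the array and returns 2.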