org.apache.lucene.index.pruning
Class CarmelTopKTermPruningPolicy

java.lang.Object
  extended by org.apache.lucene.index.pruning.PruningPolicy
      extended by org.apache.lucene.index.pruning.TermPruningPolicy
          extended by org.apache.lucene.index.pruning.CarmelTopKTermPruningPolicy

public class CarmelTopKTermPruningPolicy
extends TermPruningPolicy

Pruning policy with a search quality parameterized guarantee - configuration of this policy allows to specify two parameters: k and ε such that:

For any OR query with r terms, the score of each of the top k results in the original index, should be "practically the same" as the score that document in the pruned index: the scores difference should not exceed r * ε.

See the following paper for more details about this method: Static index pruning for information retrieval systems, D. Carmel at al, ACM SIGIR 2001 .

The claim of this pruning technique is, quoting from the above paper:

Prune the index in such a way that a human "cannot distinguish the difference" between the results of a search engine whose index is pruned and one whose index is not pruned.

For indexes with a large number of terms this policy might be too slow. In such situations, the uniform pruning approach in CarmelUniformTermPruningPolicy will be faster, though it might produce inferior search quality, as that policy does not pose a theoretical guarantee on resulted search quality.

TODO implement also CarmelTermPruningDeltaTopPolicy


Nested Class Summary
static class CarmelTopKTermPruningPolicy.ByDocComparator
           
 
Field Summary
static float DEFAULT_EPSILON
          Default largest meaningless score difference
static int DEFAULT_R
          Default number of query terms
static int DEFAULT_TOP_K
          Default number of guaranteed top K scores
 
Fields inherited from class org.apache.lucene.index.pruning.TermPruningPolicy
fieldFlags, in
 
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR
 
Constructor Summary
CarmelTopKTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags)
          Constructor with default parameters
CarmelTopKTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags, int k, float epsilon, int r, Similarity sim)
          Constructor with specific settings
 
Method Summary
 void initPositionsTerm(TermPositions tp, Term t)
          Called when moving TermPositions to a new Term.
 boolean pruneAllPositions(TermPositions termPositions, Term t)
          Prune all postings per term (invoked once per term per doc)
 int pruneSomePositions(int docNum, int[] positions, Term curTerm)
          Prune some postings per term (invoked once per term per doc).
 boolean pruneTermEnum(TermEnum te)
          Pruning of all postings for a term (invoked once per term).
 int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector tfv)
          Pruning of individual terms in term vectors.
 
Methods inherited from class org.apache.lucene.index.pruning.TermPruningPolicy
pruneAllFieldPostings, prunePayload, pruneWholeTermVector
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_TOP_K

public static final int DEFAULT_TOP_K
Default number of guaranteed top K scores

See Also:
Constant Field Values

DEFAULT_R

public static final int DEFAULT_R
Default number of query terms

See Also:
Constant Field Values

DEFAULT_EPSILON

public static final float DEFAULT_EPSILON
Default largest meaningless score difference

See Also:
Constant Field Values
Constructor Detail

CarmelTopKTermPruningPolicy

public CarmelTopKTermPruningPolicy(IndexReader in,
                                   Map<String,Integer> fieldFlags)
Constructor with default parameters

See Also:
DEFAULT_TOP_K, DEFAULT_EPSILON, DEFAULT_R, DefaultSimilarity, CarmelTopKTermPruningPolicy(IndexReader, Map, int, float, int, Similarity)

CarmelTopKTermPruningPolicy

public CarmelTopKTermPruningPolicy(IndexReader in,
                                   Map<String,Integer> fieldFlags,
                                   int k,
                                   float epsilon,
                                   int r,
                                   Similarity sim)
Constructor with specific settings

Parameters:
in - reader for original index
k - number of guaranteed top scores. Each top K results in the pruned index is either also an original top K result or its original score is indistinguishable from some original top K result.
epsilon - largest meaningless score difference Results whose scores difference is smaller or equal to epsilon are considered indistinguishable.
r - maximal number of terms in a query for which search quaility in pruned index is guaranteed
sim - similarity to use when selecting top docs fir each index term. When null, DefaultSimilarity is used.
Method Detail

pruneTermEnum

public boolean pruneTermEnum(TermEnum te)
                      throws IOException
Description copied from class: TermPruningPolicy
Pruning of all postings for a term (invoked once per term).

Specified by:
pruneTermEnum in class TermPruningPolicy
Parameters:
te - positioned term enum.
Returns:
true if all postings for this term should be removed, false otherwise.
Throws:
IOException

initPositionsTerm

public void initPositionsTerm(TermPositions tp,
                              Term t)
                       throws IOException
Description copied from class: TermPruningPolicy
Called when moving TermPositions to a new Term.

Specified by:
initPositionsTerm in class TermPruningPolicy
Parameters:
tp - input term positions
t - current term
Throws:
IOException

pruneAllPositions

public boolean pruneAllPositions(TermPositions termPositions,
                                 Term t)
                          throws IOException
Description copied from class: TermPruningPolicy
Prune all postings per term (invoked once per term per doc)

Specified by:
pruneAllPositions in class TermPruningPolicy
Parameters:
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
Returns:
true if the current posting should be removed, false otherwise.
Throws:
IOException

pruneTermVectorTerms

public int pruneTermVectorTerms(int docNumber,
                                String field,
                                String[] terms,
                                int[] freqs,
                                TermFreqVector tfv)
                         throws IOException
Description copied from class: TermPruningPolicy
Pruning of individual terms in term vectors.

Specified by:
pruneTermVectorTerms in class TermPruningPolicy
Parameters:
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
tfv - the original term frequency vector
Returns:
0 if no terms are to be removed, positive number to indicate how many terms need to be removed. The same number of entries in the terms array must be set to null to indicate which terms to remove.
Throws:
IOException

pruneSomePositions

public int pruneSomePositions(int docNum,
                              int[] positions,
                              Term curTerm)
Description copied from class: TermPruningPolicy
Prune some postings per term (invoked once per term per doc).

Specified by:
pruneSomePositions in class TermPruningPolicy
Parameters:
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term
Returns:
0 if no postings are to be removed, or positive number to indicate how many postings need to be removed. The same number of entries in the positions array must be set to -1 to indicate which positions to remove.