|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object org.apache.lucene.index.pruning.PruningPolicy org.apache.lucene.index.pruning.TermPruningPolicy org.apache.lucene.index.pruning.CarmelTopKTermPruningPolicy
public class CarmelTopKTermPruningPolicy
Pruning policy with a search quality parameterized guarantee - configuration of this policy allows to specify two parameters: k and ε such that:
For any OR query with r terms, the score of each of the top k results in the original index, should be "practically the same" as the score that document in the pruned index: the scores difference should not exceed r * ε. |
See the following paper for more details about this method: Static index pruning for information retrieval systems, D. Carmel at al, ACM SIGIR 2001 .
The claim of this pruning technique is, quoting from the above paper:
Prune the index in such a way that a human "cannot distinguish the difference" between the results of a search engine whose index is pruned and one whose index is not pruned. |
For indexes with a large number of terms this policy might be too slow. In
such situations, the uniform pruning approach in
CarmelUniformTermPruningPolicy
will be faster, though it might
produce inferior search quality, as that policy does not pose a theoretical
guarantee on resulted search quality.
TODO implement also CarmelTermPruningDeltaTopPolicy
Nested Class Summary | |
---|---|
static class |
CarmelTopKTermPruningPolicy.ByDocComparator
|
Field Summary | |
---|---|
static float |
DEFAULT_EPSILON
Default largest meaningless score difference |
static int |
DEFAULT_R
Default number of query terms |
static int |
DEFAULT_TOP_K
Default number of guaranteed top K scores |
Fields inherited from class org.apache.lucene.index.pruning.TermPruningPolicy |
---|
fieldFlags, in |
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy |
---|
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR |
Constructor Summary | |
---|---|
CarmelTopKTermPruningPolicy(IndexReader in,
Map<String,Integer> fieldFlags)
Constructor with default parameters |
|
CarmelTopKTermPruningPolicy(IndexReader in,
Map<String,Integer> fieldFlags,
int k,
float epsilon,
int r,
Similarity sim)
Constructor with specific settings |
Method Summary | |
---|---|
void |
initPositionsTerm(TermPositions tp,
Term t)
Called when moving TermPositions to a new Term . |
boolean |
pruneAllPositions(TermPositions termPositions,
Term t)
Prune all postings per term (invoked once per term per doc) |
int |
pruneSomePositions(int docNum,
int[] positions,
Term curTerm)
Prune some postings per term (invoked once per term per doc). |
boolean |
pruneTermEnum(TermEnum te)
Pruning of all postings for a term (invoked once per term). |
int |
pruneTermVectorTerms(int docNumber,
String field,
String[] terms,
int[] freqs,
TermFreqVector tfv)
Pruning of individual terms in term vectors. |
Methods inherited from class org.apache.lucene.index.pruning.TermPruningPolicy |
---|
pruneAllFieldPostings, prunePayload, pruneWholeTermVector |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int DEFAULT_TOP_K
public static final int DEFAULT_R
public static final float DEFAULT_EPSILON
Constructor Detail |
---|
public CarmelTopKTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags)
DEFAULT_TOP_K
,
DEFAULT_EPSILON
,
DEFAULT_R
,
DefaultSimilarity
,
CarmelTopKTermPruningPolicy(IndexReader, Map, int, float, int, Similarity)
public CarmelTopKTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags, int k, float epsilon, int r, Similarity sim)
in
- reader for original indexk
- number of guaranteed top scores. Each top K results in the pruned
index is either also an original top K result or its original
score is indistinguishable from some original top K result.epsilon
- largest meaningless score difference Results whose scores
difference is smaller or equal to epsilon are considered
indistinguishable.r
- maximal number of terms in a query for which search quaility in
pruned index is guaranteedsim
- similarity to use when selecting top docs fir each index term.
When null, DefaultSimilarity
is used.Method Detail |
---|
public boolean pruneTermEnum(TermEnum te) throws IOException
TermPruningPolicy
pruneTermEnum
in class TermPruningPolicy
te
- positioned term enum.
IOException
public void initPositionsTerm(TermPositions tp, Term t) throws IOException
TermPruningPolicy
TermPositions
to a new Term
.
initPositionsTerm
in class TermPruningPolicy
tp
- input term positionst
- current term
IOException
public boolean pruneAllPositions(TermPositions termPositions, Term t) throws IOException
TermPruningPolicy
pruneAllPositions
in class TermPruningPolicy
termPositions
- positioned term positions. Implementations MUST NOT
advance this by calling TermPositions
methods that advance either
the position pointer (next, skipTo) or term pointer (seek).t
- current term
IOException
public int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector tfv) throws IOException
TermPruningPolicy
pruneTermVectorTerms
in class TermPruningPolicy
docNumber
- document numberfield
- field nameterms
- array of termsfreqs
- array of term frequenciestfv
- the original term frequency vector
IOException
public int pruneSomePositions(int docNum, int[] positions, Term curTerm)
TermPruningPolicy
pruneSomePositions
in class TermPruningPolicy
docNum
- current document numberpositions
- original term positions in the document (and indirectly
term frequency)curTerm
- current term
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |