java.lang.Object
  org.apache.lucene.index.pruning.PruningPolicy
    org.apache.lucene.index.pruning.TermPruningPolicy
      org.apache.lucene.index.pruning.CarmelUniformTermPruningPolicy
public class CarmelUniformTermPruningPolicy
Enhanced implementation of Carmel Uniform Pruning.
TermPositions whose in-document frequency is below a specified threshold are pruned.
See CarmelTopKTermPruningPolicy for a link to the paper describing this policy.
The conclusions of that paper indicate that it is best to compute per-term
thresholds, as we do in CarmelTopKTermPruningPolicy. However, for
large indexes with a large number of terms that method might be too slow, and
the (enhanced) uniform approach implemented here may be faster, although
it might produce inferior search quality.
This implementation enhances the Carmel uniform pruning approach by allowing three levels of thresholds to be specified: a default threshold, a per-field threshold, and a per-term threshold.
The most specific threshold always takes precedence: a per-term threshold is used if present, then a per-field threshold if present, and finally the default threshold.
Thresholds are maintained in a map, keyed by either field names or terms in field:text format.
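The precedence rule described above can be sketched as a simple map lookup. This is only an illustration under stated assumptions, not the class's actual code: the helper name resolveThreshold and the sample field and term names are hypothetical, and only the map key conventions (a field name, or field:text) come from this page.

```java
import java.util.HashMap;
import java.util.Map;

public class ThresholdLookupSketch {

    /**
     * Hypothetical helper: picks the most specific threshold available.
     * Order: per-term ("field:text"), then per-field ("field"), then default.
     */
    static float resolveThreshold(Map<String, Float> thresholds, float defThreshold,
                                  String field, String text) {
        Float perTerm = thresholds.get(field + ":" + text);
        if (perTerm != null) {
            return perTerm;
        }
        Float perField = thresholds.get(field);
        if (perField != null) {
            return perField;
        }
        return defThreshold;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        Map<String, Float> thresholds = new HashMap<>();
        thresholds.put("body", 0.5f);         // per-field entry
        thresholds.put("body:lucene", 0.9f);  // per-term entry in field:text format

        System.out.println(resolveThreshold(thresholds, 0.2f, "body", "lucene")); // per-term entry wins
        System.out.println(resolveThreshold(thresholds, 0.2f, "body", "index"));  // falls back to per-field
        System.out.println(resolveThreshold(thresholds, 0.2f, "title", "index")); // falls back to default
    }
}
```

Keeping both kinds of key in one map means a caller only has to follow the key format; the lookup order alone decides which entry applies.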
Thresholds in this method of pruning are expressed as the percentage of the
top-N scoring documents per term that are retained. The list of top-N
documents is established by using a regular IndexSearcher and Similarity
to run a simple TermQuery.
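To make the idea of retaining a share of the top-scoring documents concrete, here is a minimal sketch that works on plain (document, score) pairs rather than a real IndexSearcher. Several things are assumptions: the Hit class and retainedDocs method are hypothetical, the threshold is treated as a fraction between 0 and 1, and the count is rounded up. This page does not state the scale or rounding the class actually uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TopFractionSketch {

    /** Hypothetical stand-in for one scored search hit of a term query. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Returns the ids of the documents to keep for one term: the best-scoring
     * share of the hits. Assumes threshold is a fraction in [0, 1] and rounds up.
     */
    static Set<Integer> retainedDocs(List<Hit> hits, float threshold) {
        List<Hit> sorted = new ArrayList<>(hits);
        // Highest score first.
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        int keep = (int) Math.ceil(threshold * sorted.size());
        keep = Math.max(0, Math.min(keep, sorted.size()));

        Set<Integer> docs = new TreeSet<>();
        for (int i = 0; i < keep; i++) {
            docs.add(sorted.get(i).doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(1, 0.2f));
        hits.add(new Hit(2, 0.9f));
        hits.add(new Hit(3, 0.5f));
        hits.add(new Hit(4, 0.1f));
        // Postings of this term in documents outside the returned set would be pruned.
        System.out.println(retainedDocs(hits, 0.5f));
    }
}
```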
A smaller threshold value produces a smaller index. See TermPruningPolicy
for size vs performance considerations.
For indexes with a large number of terms this policy might still be too slow,
since it issues a term query for each term in the index. In such situations,
the term frequency pruning approach in TFTermPruningPolicy will be
faster, though it might produce inferior search quality.
Nested Class Summary | |
---|---|
static class | CarmelUniformTermPruningPolicy.ByDocComparator |
Field Summary |
---|
Fields inherited from class org.apache.lucene.index.pruning.TermPruningPolicy |
---|
fieldFlags, in |
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy |
---|
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR |
Constructor Summary | |
---|---|
 | CarmelUniformTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags, Map<String,Float> thresholds, float defThreshold, Similarity sim) |
Method Summary | |
---|---|
void | initPositionsTerm(TermPositions tp, Term t): Called when moving TermPositions to a new Term. |
boolean | pruneAllPositions(TermPositions termPositions, Term t): Prune all postings per term (invoked once per term per doc). |
int | pruneSomePositions(int docNum, int[] positions, Term curTerm): Prune some postings per term (invoked once per term per doc). |
boolean | pruneTermEnum(TermEnum te): Pruning of all postings for a term (invoked once per term). |
int | pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector tfv): Pruning of individual terms in term vectors. |
Methods inherited from class org.apache.lucene.index.pruning.TermPruningPolicy |
---|
pruneAllFieldPostings, prunePayload, pruneWholeTermVector |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public CarmelUniformTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags, Map<String,Float> thresholds, float defThreshold, Similarity sim)
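A hedged sketch of how the map arguments for this constructor might be assembled. The field names, threshold values, and the DEL_POSTINGS_PLACEHOLDER constant are illustrative assumptions. This page does not show the values of the DEL_* constants or say how fieldFlags entries are interpreted, so the real PruningPolicy constants should be used in practice. The constructor call itself appears only in a comment because it needs Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

public class PolicyArgumentsSketch {

    // Placeholder standing in for a PruningPolicy.DEL_* constant; the value is an assumption.
    static final int DEL_POSTINGS_PLACEHOLDER = 1;

    /** Builds a fieldFlags map for a hypothetical "body" field. */
    static Map<String, Integer> buildFieldFlags() {
        Map<String, Integer> fieldFlags = new HashMap<>();
        fieldFlags.put("body", DEL_POSTINGS_PLACEHOLDER);
        return fieldFlags;
    }

    /** Builds a thresholds map with one per-field and one per-term (field:text) entry. */
    static Map<String, Float> buildThresholds() {
        Map<String, Float> thresholds = new HashMap<>();
        thresholds.put("body", 0.5f);
        thresholds.put("body:lucene", 0.9f);
        return thresholds;
    }

    public static void main(String[] args) {
        Map<String, Integer> fieldFlags = buildFieldFlags();
        Map<String, Float> thresholds = buildThresholds();
        float defThreshold = 0.3f; // used when neither a per-term nor a per-field entry exists

        // With an open IndexReader `reader` and a Similarity `sim` (both assumed), the call would be:
        // new CarmelUniformTermPruningPolicy(reader, fieldFlags, thresholds, defThreshold, sim);
        System.out.println(fieldFlags.size() + " field flag(s), " + thresholds.size()
                + " threshold entries, default " + defThreshold);
    }
}
```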
Method Detail |
---|
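public boolean pruneTermEnum(TermEnum te) throws IOException
Description copied from class: TermPruningPolicy
Pruning of all postings for a term (invoked once per term).
Specified by:
pruneTermEnum in class TermPruningPolicy
Parameters:
te - positioned term enum.
Throws:
IOException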
public void initPositionsTerm(TermPositions tp, Term t) throws IOException
Description copied from class: TermPruningPolicy
Called when moving TermPositions to a new Term.
Specified by:
initPositionsTerm in class TermPruningPolicy
Parameters:
tp - input term positions
t - current term
Throws:
IOException
public boolean pruneAllPositions(TermPositions termPositions, Term t) throws IOException
Description copied from class: TermPruningPolicy
Prune all postings per term (invoked once per term per doc).
Specified by:
pruneAllPositions in class TermPruningPolicy
Parameters:
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
Throws:
IOException
public int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector tfv) throws IOException
Description copied from class: TermPruningPolicy
Pruning of individual terms in term vectors.
Specified by:
pruneTermVectorTerms in class TermPruningPolicy
Parameters:
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
tfv - the original term frequency vector
Throws:
IOException
public int pruneSomePositions(int docNum, int[] positions, Term curTerm)
Description copied from class: TermPruningPolicy
Prune some postings per term (invoked once per term per doc).
Specified by:
pruneSomePositions in class TermPruningPolicy
Parameters:
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term