java.lang.Object
  org.apache.lucene.index.pruning.PruningPolicy
    org.apache.lucene.index.pruning.TermPruningPolicy
      org.apache.lucene.index.pruning.RIDFTermPruningPolicy
public class RIDFTermPruningPolicy
Implementation of TermPruningPolicy that uses the "residual IDF" metric to decide which terms' postings to keep or remove, as defined in http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf.
Residual IDF measures the difference between the collection-wide IDF of a term (which assumes a uniform distribution of occurrences) and the actual observed total number of occurrences of that term across all documents. Positive values indicate that a term is informative (e.g. a rare term); negative values indicate that a term is not informative (e.g. too common to offer good selectivity).
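For reference, residual IDF is usually written as the gap between the observed IDF of a term and the IDF that would be expected if its occurrences were spread by a Poisson model. The symbols below are notation introduced here, not names from this class: N is the number of documents, df_t the number of documents containing term t, and cf_t the total number of occurrences of t. The log base used by this class is not stated on this page, so treat the formula as a sketch of the general definition.

```latex
\[
\mathrm{RIDF}(t)
  = \underbrace{-\log\frac{df_t}{N}}_{\text{observed IDF}}
  - \underbrace{\Bigl(-\log\bigl(1 - e^{-cf_t/N}\bigr)\Bigr)}_{\text{expected IDF}}
\]
```

With this sign convention, a term whose occurrences cluster in few documents (cf_t much larger than df_t) gets a positive value, and a term that appears about once per matching document gets a value near or slightly below zero.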
This metric produces small values, roughly in the range [-1, 1], so useful thresholds for this metric lie somewhere in [0, 1]. The higher the threshold, the more informative (and rarer) the retained terms will be. For filtering out common words, a value close to or slightly below 0 (e.g. -0.1) should be a good starting point.
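The threshold logic described above can be sketched in plain Java, without any Lucene dependency. This is a minimal illustration, not this class's implementation: the class name RidfSketch, both method names, the use of natural logarithms, the Poisson-based expected IDF, and the assumption that the thresholds map is keyed by field name are all hypothetical choices made for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class RidfSketch {

    /**
     * Residual IDF under the common definition: observed IDF minus the IDF
     * expected from a Poisson model (assumption; natural log is used here).
     */
    static double residualIdf(long docFreq, long totalTermFreq, long numDocs) {
        double observedIdf = -Math.log((double) docFreq / numDocs);
        double expectedIdf = -Math.log(1.0 - Math.exp(-(double) totalTermFreq / numDocs));
        return observedIdf - expectedIdf;
    }

    /**
     * Hypothetical decision rule: prune a term when its RIDF falls below the
     * threshold for its field, falling back to the default threshold.
     */
    static boolean shouldPrune(String field, double ridf,
                               Map<String, Double> thresholds, double defThreshold) {
        double threshold = thresholds.getOrDefault(field, defThreshold);
        return ridf < threshold;
    }

    public static void main(String[] args) {
        Map<String, Double> thresholds = new HashMap<>();
        thresholds.put("body", 0.5);   // stricter threshold for one field
        double defThreshold = -0.1;    // starting point suggested for common words

        // 10 documents out of 1000 contain the term in both cases.
        double clustered = residualIdf(10, 100, 1000); // many occurrences per document
        double spread = residualIdf(10, 10, 1000);     // one occurrence per document

        System.out.println("clustered: " + clustered
                + " prune in body? " + shouldPrune("body", clustered, thresholds, defThreshold));
        System.out.println("spread: " + spread
                + " prune in body? " + shouldPrune("body", spread, thresholds, defThreshold));
    }
}
```

Raising the threshold for a field prunes more aggressively, since only terms scoring at or above it survive; the default applies to any field without its own entry.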
Field Summary |
---|
Fields inherited from class org.apache.lucene.index.pruning.TermPruningPolicy |
---|
fieldFlags, in |
Fields inherited from class org.apache.lucene.index.pruning.PruningPolicy |
---|
DEL_ALL, DEL_PAYLOADS, DEL_POSTINGS, DEL_STORED, DEL_VECTOR |
Constructor Summary |
---|
RIDFTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags, Map<String,Double> thresholds, double defThreshold) |
Method Summary | |
---|---|
void | initPositionsTerm(TermPositions tp, Term t) Called when moving TermPositions to a new Term. |
boolean | pruneAllPositions(TermPositions termPositions, Term t) Prune all postings per term (invoked once per term per doc). |
int | pruneSomePositions(int docNum, int[] positions, Term curTerm) Prune some postings per term (invoked once per term per doc). |
boolean | pruneTermEnum(TermEnum te) Pruning of all postings for a term (invoked once per term). |
int | pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector v) Pruning of individual terms in term vectors. |
Methods inherited from class org.apache.lucene.index.pruning.TermPruningPolicy |
---|
pruneAllFieldPostings, prunePayload, pruneWholeTermVector |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public RIDFTermPruningPolicy(IndexReader in, Map<String,Integer> fieldFlags, Map<String,Double> thresholds, double defThreshold)
Method Detail |
---|
public void initPositionsTerm(TermPositions tp, Term t) throws IOException
Description copied from class: TermPruningPolicy
Called when moving TermPositions to a new Term.
initPositionsTerm in class TermPruningPolicy
Parameters:
tp - input term positions
t - current term
Throws:
IOException
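public boolean pruneTermEnum(TermEnum te) throws IOException
Description copied from class: TermPruningPolicy
Pruning of all postings for a term (invoked once per term).
pruneTermEnum in class TermPruningPolicy
Parameters:
te - positioned term enum.
Throws:
IOException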
public boolean pruneAllPositions(TermPositions termPositions, Term t) throws IOException
Description copied from class: TermPruningPolicy
Prune all postings per term (invoked once per term per doc).
pruneAllPositions in class TermPruningPolicy
Parameters:
termPositions - positioned term positions. Implementations MUST NOT advance this by calling TermPositions methods that advance either the position pointer (next, skipTo) or term pointer (seek).
t - current term
Throws:
IOException
public int pruneTermVectorTerms(int docNumber, String field, String[] terms, int[] freqs, TermFreqVector v) throws IOException
Description copied from class: TermPruningPolicy
Pruning of individual terms in term vectors.
pruneTermVectorTerms in class TermPruningPolicy
Parameters:
docNumber - document number
field - field name
terms - array of terms
freqs - array of term frequencies
v - the original term frequency vector
Throws:
IOException
public int pruneSomePositions(int docNum, int[] positions, Term curTerm)
Description copied from class: TermPruningPolicy
Prune some postings per term (invoked once per term per doc).
pruneSomePositions in class TermPruningPolicy
Parameters:
docNum - current document number
positions - original term positions in the document (and indirectly term frequency)
curTerm - current term