org.apache.lucene.benchmark.byTask.feeds
Class TrecDocParser

java.lang.Object
  extended by org.apache.lucene.benchmark.byTask.feeds.TrecDocParser
Direct Known Subclasses:
TrecFBISParser, TrecFR94Parser, TrecFTParser, TrecGov2Parser, TrecLATimesParser, TrecParserByPath

public abstract class TrecDocParser
extends Object

Parser for TREC document content, invoked on document text excluding <DOC> and <DOCNO>, which are handled in TrecContentSource. Implementations are required to be stateless and hence thread safe.


Nested Class Summary
static class TrecDocParser.ParsePathType
          Types of trec parse paths.
 
Field Summary
static TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
          trec parser type used for unknown extensions
 
Constructor Summary
TrecDocParser()
           
 
Method Summary
static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
          Extract from buf the text of interest within specified tags
abstract  DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
          Parse the text prepared in docBuf into a result DocData. No synchronization is required.
static TrecDocParser.ParsePathType pathType(File f)
          Compute the path type of a file by inspecting name of file and its parents
static String stripTags(StringBuilder buf, int start)
          Strip tags from buf: each tag is replaced by a single blank.
static String stripTags(String buf, int start)
          Strip tags from input.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_PATH_TYPE

public static final TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
trec parser type used for unknown extensions

Constructor Detail

TrecDocParser

public TrecDocParser()
Method Detail

pathType

public static TrecDocParser.ParsePathType pathType(File f)
Compute the path type of a file by inspecting name of file and its parents
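
As an illustration of inspecting a file's name and its parents, the following is a minimal standalone sketch. It does not use or reproduce the Lucene implementation. The PathTypeSketch class, its PathType enum constants (guessed from the subclass names listed above), the case-insensitive name match, and the choice of default are all assumptions made for this example.

```java
import java.io.File;
import java.util.Locale;

// Hypothetical stand-in, not part of Lucene.
final class PathTypeSketch {

    // Assumed constants, inferred from the parser subclass names.
    enum PathType { GOV2, FBIS, FT, FR94, LATIMES }

    // Assumed default for paths that match nothing.
    static final PathType DEFAULT = PathType.GOV2;

    // Walk from the file up through its parents and return the first
    // path type whose name equals a path element (ignoring case).
    static PathType pathType(File f) {
        for (File cur = f; cur != null; cur = cur.getParentFile()) {
            String name = cur.getName().toUpperCase(Locale.ROOT);
            for (PathType t : PathType.values()) {
                if (t.name().equals(name)) {
                    return t;
                }
            }
        }
        return DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(pathType(new File("trec/fbis/fb396001"))); // FBIS
        System.out.println(pathType(new File("misc/doc.txt")));       // GOV2 (assumed default)
    }
}
```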


parse

public abstract DocData parse(DocData docData,
                              String name,
                              TrecContentSource trecSrc,
                              StringBuilder docBuf,
                              TrecDocParser.ParsePathType pathType)
                       throws IOException,
                              InterruptedException
Parse the text prepared in docBuf into a result DocData. No synchronization is required.

Parameters:
docData - reusable result
name - name that should be set to the result
trecSrc - calling trec content source
docBuf - text to parse
pathType - type of the parsed file, or null if unknown; parsers may use it to alter their behavior according to the file path type.
Throws:
IOException
InterruptedException

stripTags

public static String stripTags(StringBuilder buf,
                               int start)
Strip tags from buf: each tag is replaced by a single blank.

Returns:
text obtained when stripping all tags from buf (Input StringBuilder is unmodified).

stripTags

public static String stripTags(String buf,
                               int start)
Strip tags from input.

See Also:
stripTags(StringBuilder, int)
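
The behavior described for the two stripTags overloads can be sketched in standalone form as follows. This is an illustration only and not the Lucene source. The StripTagsSketch class is hypothetical, a "tag" is assumed to be any run of the form <...>, and start is assumed to be the offset at which the returned text begins.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in, not part of Lucene.
final class StripTagsSketch {

    // Assumption: a tag is '<', any characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Replace each tag in text (from offset start onward) with one blank.
    static String stripTags(String text, int start) {
        String tail = start > 0 ? text.substring(start) : text;
        return TAG.matcher(tail).replaceAll(" ");
    }

    // StringBuilder variant; substring copies, so buf is left unmodified.
    static String stripTags(StringBuilder buf, int start) {
        return stripTags(buf.substring(start), 0);
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder("<p>Hello</p> world");
        System.out.println("[" + stripTags(buf, 0) + "]"); // [ Hello  world]
        System.out.println(buf);                           // <p>Hello</p> world
    }
}
```

Because each tag becomes a blank rather than being deleted, words separated only by markup stay separated in the result.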

extract

public static String extract(StringBuilder buf,
                             String startTag,
                             String endTag,
                             int maxPos,
                             String[] noisePrefixes)
Extract from buf the text of interest within specified tags

Parameters:
buf - entire input text
startTag - tag marking start of text of interest
endTag - tag marking end of text of interest
maxPos - if ≥ 0 sets a limit on start of text of interest
Returns:
text of interest or null if not found
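
A minimal standalone sketch of tag-delimited extraction, written under stated assumptions, is shown below. It is not the Lucene implementation. The ExtractSketch class is hypothetical, the noisePrefixes parameter is omitted because its behavior is not documented here, maxPos is interpreted as the largest allowed index for the start tag, and trimming the result is an added assumption.

```java
// Hypothetical stand-in, not part of Lucene.
final class ExtractSketch {

    // Return the text between startTag and endTag, or null if not found.
    // Assumption: when maxPos >= 0, the start tag must begin at or before maxPos.
    static String extract(StringBuilder buf, String startTag, String endTag, int maxPos) {
        int start = buf.indexOf(startTag);
        if (start < 0 || (maxPos >= 0 && start > maxPos)) {
            return null;
        }
        start += startTag.length();
        int end = buf.indexOf(endTag, start);
        if (end < 0) {
            return null;
        }
        // Assumption: surrounding whitespace is not part of the text of interest.
        return buf.substring(start, end).trim();
    }

    public static void main(String[] args) {
        StringBuilder buf =
            new StringBuilder("<DOC><TITLE> Hello </TITLE><TEXT>body</TEXT></DOC>");
        System.out.println(extract(buf, "<TITLE>", "</TITLE>", -1)); // Hello
        System.out.println(extract(buf, "<TEXT>", "</TEXT>", 10));   // null, start tag is past maxPos
    }
}
```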