org.apache.lucene.benchmark.byTask.feeds
Class TrecDocParser
java.lang.Object
org.apache.lucene.benchmark.byTask.feeds.TrecDocParser
- Direct Known Subclasses:
- TrecFBISParser, TrecFR94Parser, TrecFTParser, TrecGov2Parser, TrecLATimesParser, TrecParserByPath
public abstract class TrecDocParser
- extends Object
Parser for TREC document content, invoked on document text excluding <DOC> and <DOCNO>,
which are handled in TrecContentSource. Implementations are required to be stateless and hence thread safe.
Method Summary
static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
- Extract from buf the text of interest within the specified tags.
abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
- Parse the text prepared in docBuf into a result DocData; no synchronization is required.
static TrecDocParser.ParsePathType pathType(File f)
- Compute the path type of a file by inspecting the name of the file and its parents.
static String stripTags(StringBuilder buf, int start)
- Strip tags from buf: each tag is replaced by a single blank.
static String stripTags(String buf, int start)
- Strip tags from input.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DEFAULT_PATH_TYPE
public static final TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
- trec parser type used for unknown extensions
TrecDocParser
public TrecDocParser()
pathType
public static TrecDocParser.ParsePathType pathType(File f)
- Compute the path type of a file by inspecting name of file and its parents
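A lookup of this kind can be sketched as a walk from the file up through its parent directories. The following is a minimal, self-contained illustration and not the library's implementation: the enum constants are guessed from the subclass names listed above, the default value is hypothetical, and matching a directory name exactly (case-insensitively) is an assumption.

```java
import java.io.File;
import java.util.Locale;

class PathTypeSketch {
    // Hypothetical path types, named after the known parser subclasses.
    enum ParsePathType { GOV2, FBIS, FT, FR94, LATIMES }

    // Hypothetical default; the real DEFAULT_PATH_TYPE value is not stated here.
    static final ParsePathType DEFAULT_PATH_TYPE = ParsePathType.GOV2;

    // Inspect the file name, then each parent directory name, until one
    // matches a known type (assumed rule: exact, case-insensitive match).
    public static ParsePathType pathType(File f) {
        for (File cur = f; cur != null; cur = cur.getParentFile()) {
            String name = cur.getName().toUpperCase(Locale.ROOT);
            for (ParsePathType t : ParsePathType.values()) {
                if (t.name().equals(name)) {
                    return t;
                }
            }
        }
        return DEFAULT_PATH_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(pathType(new File("trec/latimes/la010189.gz")));
        System.out.println(pathType(new File("misc/unknown.txt")));
    }
}
```

Under these assumptions, a file stored below a directory named latimes resolves to LATIMES, while a path with no recognizable component falls back to the default.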
parse
public abstract DocData parse(DocData docData,
String name,
TrecContentSource trecSrc,
StringBuilder docBuf,
TrecDocParser.ParsePathType pathType)
throws IOException,
InterruptedException
- Parse the text prepared in docBuf into a result DocData; no synchronization is required.
- Parameters:
docData - reusable result
name - name that should be set to the result
trecSrc - calling TREC content source
docBuf - text to parse
pathType - type of parsed file, or null if unknown; may be used by parsers to alter their behavior according to the file path type.
- Throws:
IOException
InterruptedException
stripTags
public static String stripTags(StringBuilder buf,
int start)
- Strip tags from buf: each tag is replaced by a single blank.
- Returns:
- text obtained when stripping all tags from buf (the input StringBuilder is unmodified).
stripTags
public static String stripTags(String buf,
int start)
- strip tags from input.
- See Also:
stripTags(StringBuilder, int)
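The documented behavior of both overloads can be illustrated with a small stand-alone sketch. It is not the library's code: treating start as the offset at which processing begins, and treating any <...> run as a tag, are both assumptions made for illustration.

```java
class StripTagsSketch {
    // Assumption: characters before `start` are skipped, then every
    // <...> run is replaced by exactly one blank.
    public static String stripTags(String buf, int start) {
        String tail = start > 0 ? buf.substring(start) : buf;
        return tail.replaceAll("<[^>]*>", " ");
    }

    // toString() copies the content, so the builder itself is never modified.
    public static String stripTags(StringBuilder buf, int start) {
        return stripTags(buf.toString(), start);
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("<TEXT>Hello <b>world</b></TEXT>");
        // Brackets make the blanks left by the removed tags visible.
        System.out.println("[" + stripTags(doc, 0) + "]");
        System.out.println("[" + doc + "]");
    }
}
```

Because each tag becomes a blank rather than being deleted, words on either side of a tag stay separated, at the cost of extra whitespace that callers may want to trim or collapse.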
extract
public static String extract(StringBuilder buf,
String startTag,
String endTag,
int maxPos,
String[] noisePrefixes)
- Extract from buf the text of interest within the specified tags.
- Parameters:
buf - entire input text
startTag - tag marking start of text of interest
endTag - tag marking end of text of interest
maxPos - if ≥ 0, sets a limit on the start of the text of interest
- Returns:
- text of interest or null if not found
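A tag-bounded extraction like the one described can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: maxPos is taken to bound the position of the start tag, and noisePrefixes (which the parameter list above does not describe) is assumed to name leading strings that should be skipped when they occur inside the tagged region.

```java
class ExtractSketch {
    public static String extract(StringBuilder buf, String startTag, String endTag,
                                 int maxPos, String[] noisePrefixes) {
        int from = buf.indexOf(startTag);
        // Assumption: a non-negative maxPos limits where the start tag may appear.
        if (from < 0 || (maxPos >= 0 && from >= maxPos)) {
            return null;
        }
        from += startTag.length();
        int to = buf.indexOf(endTag, from);
        if (to < 0) {
            return null; // no closing tag, so nothing of interest
        }
        if (noisePrefixes != null) {
            for (String noise : noisePrefixes) {
                int n = buf.indexOf(noise, from);
                // Assumed handling: skip past a noise prefix that lies fully
                // inside the tagged region.
                if (n >= 0 && n + noise.length() <= to) {
                    from = n + noise.length();
                }
            }
        }
        return buf.substring(from, to).trim();
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("<DATE>date: 12 May 1994</DATE>");
        // prints "12 May 1994"
        System.out.println(extract(doc, "<DATE>", "</DATE>", -1, new String[] {"date:"}));
    }
}
```

Returning null rather than an empty string when a tag is missing matches the documented contract and lets callers distinguish an absent field from an empty one.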