WordlistLoader (Lucene 4.0.0 API)

org.apache.lucene.analysis.util
Class WordlistLoader

java.lang.Object
  org.apache.lucene.analysis.util.WordlistLoader

public class WordlistLoader
extends Object

Loader for text files that represent a list of stopwords.

See Also:: IOUtils, to obtain Reader instances
NOTE: This API is for internal purposes only and might change in incompatible ways in the next release.

Method Summary
`static List<String>`	`getLines(InputStream stream, Charset charset)` Reads the given stream using the given character encoding and returns its non-blank, non-comment lines.
`static CharArraySet`	`getSnowballWordSet(Reader reader, CharArraySet result)` Reads stopwords from a stopword list in Snowball format.
`static CharArraySet`	`getSnowballWordSet(Reader reader, Version matchVersion)` Reads stopwords from a stopword list in Snowball format.
`static CharArrayMap<String>`	`getStemDict(Reader reader, CharArrayMap<String> result)` Reads a stem dictionary.
`static CharArraySet`	`getWordSet(Reader reader, CharArraySet result)` Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace).
`static CharArraySet`	`getWordSet(Reader reader, String comment, CharArraySet result)` Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace).
`static CharArraySet`	`getWordSet(Reader reader, String comment, Version matchVersion)` Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace).
`static CharArraySet`	`getWordSet(Reader reader, Version matchVersion)` Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace).

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Method Detail

getWordSet

public static CharArraySet getWordSet(Reader reader,
                                      CharArraySet result)
                               throws IOException

Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).

Parameters:: reader - Reader containing the wordlist; result - the CharArraySet to fill with the reader's words
Returns:: the given CharArraySet with the reader's words
Throws:: IOException
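
The fill-and-return contract described above can be illustrated without Lucene on the classpath. The following is a minimal sketch of the documented behaviour, not Lucene's implementation: the class `WordSetSketch` and the method `fill` are hypothetical names, and a plain `java.util.Set<String>` stands in for `CharArraySet`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the documented behaviour; not the Lucene source.
final class WordSetSketch {

    // Adds every line of the reader, trimmed, to the given set
    // and returns that same set instance.
    static Set<String> fill(Reader reader, Set<String> result) throws IOException {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                result.add(line.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Set<String> stopwords = new LinkedHashSet<>();
        fill(new StringReader("  the \nand\n of\n"), stopwords);
        System.out.println(stopwords); // prints [the, and, of]
    }
}
```

Because the caller supplies the target set, several word lists can be merged by passing the same set to successive calls.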

getWordSet

public static CharArraySet getWordSet(Reader reader,
                                      Version matchVersion)
                               throws IOException

Reads lines from a Reader and adds every line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).

Parameters:: reader - Reader containing the wordlist; matchVersion - the Lucene Version
Returns:: A CharArraySet with the reader's words
Throws:: IOException

getWordSet

public static CharArraySet getWordSet(Reader reader,
                                      String comment,
                                      Version matchVersion)
                               throws IOException

Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).

Parameters:: reader - Reader containing the wordlist; comment - The string representing a comment.; matchVersion - the Lucene Version
Returns:: A CharArraySet with the reader's words
Throws:: IOException
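
As a rough illustration of the comment filtering described above, here is a minimal sketch that uses only the JDK. It is not Lucene's implementation: `CommentedWordSetSketch` and `read` are hypothetical names, a `java.util.Set<String>` stands in for `CharArraySet`, and the choice to test for the comment string on the raw line (before trimming) is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the documented behaviour; not the Lucene source.
final class CommentedWordSetSketch {

    // Collects every line that does not start with the comment string,
    // with leading and trailing whitespace removed.
    static Set<String> read(Reader reader, String comment) throws IOException {
        Set<String> result = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Assumption: the comment test is applied before trimming.
                if (!line.startsWith(comment)) {
                    result.add(line.trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String list = "# English stopwords\nthe\n  and  \n";
        System.out.println(read(new StringReader(list), "#")); // prints [the, and]
    }
}
```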

getWordSet

public static CharArraySet getWordSet(Reader reader,
                                      String comment,
                                      CharArraySet result)
                               throws IOException

Reads lines from a Reader and adds every non-comment line as an entry to a CharArraySet (omitting leading and trailing whitespace). Every line of the Reader should contain only one word. The words need to be in lowercase if you make use of an Analyzer which uses LowerCaseFilter (like StandardAnalyzer).

Parameters:: reader - Reader containing the wordlist; comment - The string representing a comment.; result - the CharArraySet to fill with the reader's words
Returns:: the given CharArraySet with the reader's words
Throws:: IOException

getSnowballWordSet

public static CharArraySet getSnowballWordSet(Reader reader,
                                              CharArraySet result)
                                       throws IOException

Reads stopwords from a stopword list in Snowball format.

The snowball format is the following:

Lines may contain multiple words separated by whitespace.
The comment character is the vertical line (|).
Lines may contain trailing comments.

Parameters:: reader - Reader containing a Snowball stopword list; result - the CharArraySet to fill with the reader's words
Returns:: the given CharArraySet with the reader's words
Throws:: IOException
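
The three Snowball format rules listed above can be sketched with the JDK alone. This is an illustration of the format, not Lucene's parser: `SnowballFormatSketch` and `parse` are hypothetical names, and a `java.util.Set<String>` stands in for `CharArraySet`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration of the Snowball stopword format; not the Lucene source.
final class SnowballFormatSketch {

    static Set<String> parse(Reader reader) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Everything from the first '|' to the end of the line is a comment.
                int bar = line.indexOf('|');
                if (bar >= 0) {
                    line = line.substring(0, bar);
                }
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // blank line or comment-only line
                }
                // A line may hold several words separated by whitespace.
                for (String word : line.split("\\s+")) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        String list = "the and | conjunctions\n| full-line comment\n\nof\n";
        System.out.println(parse(new StringReader(list))); // prints [the, and, of]
    }
}
```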

getSnowballWordSet

public static CharArraySet getSnowballWordSet(Reader reader,
                                              Version matchVersion)
                                       throws IOException

Reads stopwords from a stopword list in Snowball format.

The snowball format is the following:

Lines may contain multiple words separated by whitespace.
The comment character is the vertical line (|).
Lines may contain trailing comments.

Parameters:: reader - Reader containing a Snowball stopword list; matchVersion - the Lucene Version
Returns:: A CharArraySet with the reader's words
Throws:: IOException

getStemDict

public static CharArrayMap<String> getStemDict(Reader reader,
                                               CharArrayMap<String> result)
                                        throws IOException

Reads a stem dictionary. Each line contains:

word\tstem

(i.e. two tab separated words)

Returns:: stem dictionary that overrules the stemming algorithm
Throws:: IOException - If there is a low-level I/O error.
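
To make the `word\tstem` line format concrete, here is a minimal JDK-only sketch. It is not Lucene's implementation: `StemDictSketch` and `parse` are hypothetical names, a `java.util.Map<String, String>` stands in for `CharArrayMap<String>`, and rejecting lines without a tab is an assumption, since the documentation above does not say how malformed lines are handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the stem dictionary format; not the Lucene source.
final class StemDictSketch {

    static Map<String, String> parse(Reader reader) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Each line is expected to be "word<TAB>stem".
                String[] parts = line.split("\t", 2);
                if (parts.length != 2) {
                    // Assumption: malformed lines are rejected rather than skipped.
                    throw new IllegalArgumentException("Expected word<TAB>stem: " + line);
                }
                result.put(parts[0], parts[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String dict = "walked\twalk\nmice\tmouse\n";
        System.out.println(parse(new StringReader(dict))); // prints {walked=walk, mice=mouse}
    }
}
```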

getLines

public static List<String> getLines(InputStream stream,
                                    Charset charset)
                             throws IOException

Reads the given stream using the given character encoding and returns its non-comment lines that contain data.

A comment line is any line that starts with the character "#".

Returns:: a list of non-blank non-comment lines with whitespace trimmed
Throws:: IOException - If there is a low-level I/O error.
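
The filtering rules above (skip "#" comment lines, skip blank lines, trim the rest) can be sketched with the JDK alone. This is an illustration rather than Lucene's implementation: `LinesSketch` and `readLines` are hypothetical names, and applying the "#" test before trimming is an assumption.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the documented filtering; not the Lucene source.
final class LinesSketch {

    static List<String> readLines(InputStream stream, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(stream, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Assumption: the comment test is applied before trimming.
                if (line.startsWith("#")) {
                    continue;
                }
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "# header\n\n  alpha  \nbeta\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLines(new ByteArrayInputStream(data), StandardCharsets.UTF_8)); // prints [alpha, beta]
    }
}
```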
