java.lang.Object
  org.apache.lucene.analysis.standard.StandardTokenizerImpl
public final class StandardTokenizerImpl
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
Tokens produced are of the types identified by the *_TYPE constants listed in the Field Summary.
Field Summary
Modifier and Type | Field | Description
---|---|---
static int | HANGUL_TYPE |
static int | HIRAGANA_TYPE |
static int | IDEOGRAPHIC_TYPE |
static int | KATAKANA_TYPE |
static int | NUMERIC_TYPE | Numbers
static int | SOUTH_EAST_ASIAN_TYPE | Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
static int | WORD_TYPE | Alphanumeric sequences
static int | YYEOF | This character denotes the end of file
static int | YYINITIAL | lexical states
Constructor Summary
Constructor | Description
---|---
StandardTokenizerImpl(InputStream in) | Creates a new scanner.
StandardTokenizerImpl(Reader in) | Creates a new scanner. There is also a java.io.InputStream version of this constructor.
Method Summary
Modifier and Type | Method | Description
---|---|---
int | getNextToken() | Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
void | getText(CharTermAttribute t) | Fills CharTermAttribute with the current token text.
void | yybegin(int newState) | Enters a new lexical state.
int | yychar() | Returns the current position.
char | yycharat(int pos) | Returns the character at position pos from the matched text.
void | yyclose() | Closes the input stream.
int | yylength() | Returns the length of the matched text region.
void | yypushback(int number) | Pushes the specified number of characters back into the input stream.
void | yyreset(Reader reader) | Resets the scanner to read from a new input stream.
int | yystate() | Returns the current lexical state.
String | yytext() | Returns the text matched by the current regular expression.
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final int YYEOF
public static final int YYINITIAL
public static final int WORD_TYPE
public static final int NUMERIC_TYPE
public static final int SOUTH_EAST_ASIAN_TYPE
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
public static final int IDEOGRAPHIC_TYPE
public static final int HIRAGANA_TYPE
public static final int KATAKANA_TYPE
public static final int HANGUL_TYPE
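To make the word and number categories above more concrete, the following sketch segments text at word boundaries using only the JDK. It is an approximation, not this class: `java.text.BreakIterator` is a separate implementation of word-boundary analysis, so its results may differ from those of StandardTokenizerImpl. The class name `WordBreakSketch`, the helper methods, and the string labels `"WORD"` and `"NUMERIC"` are hypothetical and are not the integer constants documented here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration; uses the JDK BreakIterator, not StandardTokenizerImpl. */
final class WordBreakSketch {

    /** Returns the boundary-delimited segments that start with a letter or digit. */
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Skip whitespace and punctuation segments.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    /** Hypothetical label: "NUMERIC" for all-digit segments, "WORD" otherwise. */
    static String label(String word) {
        return word.chars().allMatch(Character::isDigit) ? "NUMERIC" : "WORD";
    }

    public static void main(String[] args) {
        for (String w : words("Hello, world 42")) {
            System.out.println(label(w) + " " + w);
        }
    }
}
```

The sketch only separates all-digit segments from everything else; it does not attempt the script-based categories (ideographic, hiragana, katakana, Hangul, South East Asian) that this class reports.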
Constructor Detail |
---|
public StandardTokenizerImpl(Reader in)
Parameters: in - the java.io.Reader to read input from.
public StandardTokenizerImpl(InputStream in)
Parameters: in - the java.io.InputStream to read input from.
Method Detail |
---|
public final int yychar()
Returns the current position.
Specified by: yychar in interface StandardTokenizerInterface
public final void getText(CharTermAttribute t)
Fills CharTermAttribute with the current token text.
Specified by: getText in interface StandardTokenizerInterface
public final void yyclose() throws IOException
Closes the input stream.
Throws: IOException
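public final void yyreset(Reader reader)
Resets the scanner to read from a new input stream.
Specified by: yyreset in interface StandardTokenizerInterface
Parameters: reader - the new input stream
public final int yystate()
Returns the current lexical state.
public final void yybegin(int newState)
Enters a new lexical state.
Parameters: newState - the new lexical state
public final String yytext()
Returns the text matched by the current regular expression.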
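public final char yycharat(int pos)
Returns the character at position pos from the matched text.
Parameters: pos - the position of the character to fetch. A value from 0 to yylength()-1.
public final int yylength()
Returns the length of the matched text region.
Specified by: yylength in interface StandardTokenizerInterface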
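public void yypushback(int number)
Pushes the specified number of characters back into the input stream.
Parameters: number - the number of characters to be read again. This number must not be greater than yylength().
public int getNextToken() throws IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
Specified by: getNextToken in interface StandardTokenizerInterface
Throws: IOException - if any I/O error occurs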
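The methods above are typically driven in a loop: call getNextToken() until it returns YYEOF, and read yytext(), yychar() and yylength() after each match. The sketch below shows that loop, and the yypushback(int) contract, on a toy scanner. `ToyScanner` is a hypothetical stand-in that splits on whitespace only; it borrows the documented method names but is not StandardTokenizerImpl, and the constant values it uses are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a generated scanner; it splits on whitespace only. */
final class ToyScanner {
    static final int YYEOF = -1;     // assumed value, for this toy only
    static final int WORD_TYPE = 0;  // assumed value, for this toy only

    private final String input;
    private int start; // offset of the current match
    private int end;   // offset just past the current match

    ToyScanner(String input) {
        this.input = input;
    }

    /** Scans the next whitespace-delimited token, or returns YYEOF at end of input. */
    int getNextToken() {
        int i = end;
        while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
        if (i >= input.length()) {
            start = end = input.length();
            return YYEOF;
        }
        start = i;
        while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
        end = i;
        return WORD_TYPE;
    }

    int yychar() { return start; }
    int yylength() { return end - start; }
    String yytext() { return input.substring(start, end); }

    /** Gives back the last `number` characters of the match so they are scanned again. */
    void yypushback(int number) {
        if (number < 0 || number > yylength()) {
            throw new IllegalArgumentException("number must be between 0 and yylength()");
        }
        end -= number;
    }

    /** Driver loop: collects "text@offset" for every token until YYEOF. */
    static List<String> scanAll(String text) {
        ToyScanner s = new ToyScanner(text);
        List<String> out = new ArrayList<>();
        while (s.getNextToken() != YYEOF) {
            out.add(s.yytext() + "@" + s.yychar());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scanAll("foo bar"));
    }
}
```

In this toy, pushing back characters shortens the current match, and the next call to getNextToken() starts from the pushed-back text. Rejecting a count larger than yylength() is a choice made for the sketch; the documentation above only states that the number must not exceed yylength().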