org.apache.lucene.analysis
Class Tokenizer

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.Tokenizer
All Implemented Interfaces:
Closeable
Direct Known Subclasses:
CharTokenizer, ChineseTokenizer, CJKTokenizer, ClassicTokenizer, EdgeNGramTokenizer, EmptyTokenizer, ICUTokenizer, JapaneseTokenizer, KeywordTokenizer, MockTokenizer, NGramTokenizer, PathHierarchyTokenizer, ReversePathHierarchyTokenizer, SentenceTokenizer, StandardTokenizer, UAX29URLEmailTokenizer, WikipediaTokenizer

public abstract class Tokenizer
extends TokenStream

A Tokenizer is a TokenStream whose input is a Reader.

This is an abstract class; subclasses must override TokenStream.incrementToken().

NOTE: Subclasses overriding TokenStream.incrementToken() must call AttributeSource.clearAttributes() before setting attributes.
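
The contract in the note above can be illustrated with a minimal sketch. It is a plain-Java stand-in and does not extend any Lucene class: the class name WhitespaceTokenizerSketch, its fields (term, startOffset, endOffset), and its clearAttributes() method are hypothetical substitutes for a real Tokenizer subclass and its attributes. It only shows the order of operations: clear first, then set, and return false once the input is exhausted.

```java
import java.io.IOException;
import java.io.Reader;

// Plain-Java stand-in for a Tokenizer subclass; it does NOT extend any Lucene class.
class WhitespaceTokenizerSketch {
    private final Reader input;   // plays the role of the protected `input` field
    private int position = 0;     // number of chars consumed so far

    // Hypothetical stand-ins for term and offset attributes.
    String term = "";
    int startOffset = 0;
    int endOffset = 0;

    WhitespaceTokenizerSketch(Reader input) {
        this.input = input;
    }

    // Stand-in for AttributeSource.clearAttributes(): reset every attribute to its default.
    void clearAttributes() {
        term = "";
        startOffset = 0;
        endOffset = 0;
    }

    // Follows the incrementToken() contract: clear first, then set attributes.
    boolean incrementToken() throws IOException {
        clearAttributes();
        StringBuilder buffer = new StringBuilder();
        int start = -1;
        int c;
        while ((c = input.read()) != -1) {
            position++;
            if (Character.isWhitespace(c)) {
                if (buffer.length() > 0) {
                    break;        // whitespace ends the current token
                }
                continue;         // skip leading whitespace
            }
            if (buffer.length() == 0) {
                start = position - 1;   // offset of the token's first char
            }
            buffer.append((char) c);
        }
        if (buffer.length() == 0) {
            return false;         // input exhausted; attributes stay cleared
        }
        term = buffer.toString();
        startOffset = start;
        endOffset = start + buffer.length();
        return true;
    }
}
```

Because clearAttributes() runs at the top of every call, no value set for one token can leak into the next, including on the final call that returns false.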


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.AttributeFactory, AttributeSource.State
 
Field Summary
protected  Reader input
          The text source for this Tokenizer.
 
Constructor Summary
protected Tokenizer()
          Deprecated. Use Tokenizer(Reader) instead.
protected Tokenizer(AttributeSource.AttributeFactory factory)
          Deprecated. Use Tokenizer(AttributeSource.AttributeFactory, Reader) instead.
protected Tokenizer(AttributeSource.AttributeFactory factory, Reader input)
          Construct a token stream processing the given input using the given AttributeFactory.
protected Tokenizer(AttributeSource source)
          Deprecated. Use Tokenizer(AttributeSource, Reader) instead.
protected Tokenizer(AttributeSource source, Reader input)
          Construct a token stream processing the given input using the given AttributeSource.
protected Tokenizer(Reader input)
          Construct a token stream processing the given input.
 
Method Summary
 void close()
          By default, closes the input Reader.
protected  int correctOffset(int currentOff)
          Return the corrected offset.
 void reset(Reader input)
          Expert: Reset the tokenizer to a new reader.
 
Methods inherited from class org.apache.lucene.analysis.TokenStream
end, incrementToken, reset
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

input

protected Reader input
The text source for this Tokenizer.

Constructor Detail

Tokenizer

@Deprecated
protected Tokenizer()
Deprecated. Use Tokenizer(Reader) instead.

Construct a tokenizer with null input.


Tokenizer

protected Tokenizer(Reader input)
Construct a token stream processing the given input.


Tokenizer

@Deprecated
protected Tokenizer(AttributeSource.AttributeFactory factory)
Deprecated. Use Tokenizer(AttributeSource.AttributeFactory, Reader) instead.

Construct a tokenizer with null input using the given AttributeFactory.


Tokenizer

protected Tokenizer(AttributeSource.AttributeFactory factory,
                    Reader input)
Construct a token stream processing the given input using the given AttributeFactory.


Tokenizer

@Deprecated
protected Tokenizer(AttributeSource source)
Deprecated. Use Tokenizer(AttributeSource, Reader) instead.

Construct a tokenizer with null input using the given AttributeSource.


Tokenizer

protected Tokenizer(AttributeSource source,
                    Reader input)
Construct a token stream processing the given input using the given AttributeSource.

Method Detail

close

public void close()
           throws IOException
By default, closes the input Reader.

Specified by:
close in interface Closeable
Overrides:
close in class TokenStream
Throws:
IOException
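
A minimal sketch of the default close() behavior described above, written as a plain-Java stand-in and not as the Lucene source. The class name ClosingTokenizerSketch is hypothetical, and the null guard is an assumption added because the deprecated constructors leave the input null.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

// Plain-Java stand-in; it does NOT extend any Lucene class.
class ClosingTokenizerSketch implements Closeable {
    private final Reader input;   // plays the role of the protected `input` field

    ClosingTokenizerSketch(Reader input) {
        this.input = input;
    }

    // Closing the tokenizer closes the underlying Reader.
    @Override
    public void close() throws IOException {
        if (input != null) {      // assumption: tolerate a tokenizer built without input
            input.close();
        }
    }
}
```

In this sketch the tokenizer owns its Reader, so a caller that closes the tokenizer should not expect to read from the same Reader afterwards.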

correctOffset

protected final int correctOffset(int currentOff)
Return the corrected offset. If input is a CharStream subclass, this method calls CharStream.correctOffset(int); otherwise it returns currentOff unchanged.

Parameters:
currentOff - offset as seen in the output
Returns:
corrected offset based on the input
See Also:
CharStream.correctOffset(int)
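
The dispatch described for correctOffset can be sketched in plain Java. Everything here is a hypothetical stand-in: OffsetCorrectingReader plays the role of CharStream, PrefixSkippingReader is an invented reader that drops a fixed-length prefix, and OffsetCorrectionSketch.correctOffset mirrors only the documented "delegate if possible, else pass through" rule.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for CharStream: a Reader that can map offsets back to the original text.
abstract class OffsetCorrectingReader extends Reader {
    abstract int correctOffset(int currentOff);
}

// Hypothetical reader that skips a fixed-length prefix, so output offsets are shifted by that length.
class PrefixSkippingReader extends OffsetCorrectingReader {
    private final Reader delegate;
    private final int prefixLength;

    PrefixSkippingReader(String text, int prefixLength) {
        this.delegate = new StringReader(text.substring(prefixLength));
        this.prefixLength = prefixLength;
    }

    @Override
    int correctOffset(int currentOff) {
        return currentOff + prefixLength;   // map back into the unmodified text
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return delegate.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}

class OffsetCorrectionSketch {
    // Delegate when the input can correct offsets; otherwise return the offset as seen.
    static int correctOffset(Reader input, int currentOff) {
        if (input instanceof OffsetCorrectingReader) {
            return ((OffsetCorrectingReader) input).correctOffset(currentOff);
        }
        return currentOff;
    }
}
```

A tokenizer would typically pass its start and end offsets through this kind of correction before setting them, so that offsets refer to the original text and not to the filtered character stream.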

reset

public void reset(Reader input)
           throws IOException
Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.

Throws:
IOException
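
The reuse pattern mentioned for reset(Reader) can be sketched as follows. Both classes are hypothetical plain-Java stand-ins, not Lucene classes: ReusableTokenizerSketch models a tokenizer whose reset simply swaps the input, and ReusingAnalyzerSketch models an analyzer that caches one tokenizer in a field. Keeping a single cached instance is a simplification assumed for this sketch; it is not thread-safe.

```java
import java.io.IOException;
import java.io.Reader;

// Plain-Java stand-in for a reusable tokenizer; it does NOT extend any Lucene class.
class ReusableTokenizerSketch {
    protected Reader input;

    ReusableTokenizerSketch(Reader input) {
        this.input = input;
    }

    // Mirrors reset(Reader): point the same instance at a new text source.
    void reset(Reader input) throws IOException {
        this.input = input;
    }

    // Hypothetical helper: emit all remaining text as a single token.
    String readAll() throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = input.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}

// Stand-in for an analyzer's reusableTokenStream method.
class ReusingAnalyzerSketch {
    private ReusableTokenizerSketch cached;   // assumption: one cached instance, single-threaded use

    ReusableTokenizerSketch reusableTokenStream(Reader reader) throws IOException {
        if (cached == null) {
            cached = new ReusableTokenizerSketch(reader);   // first call: create
        } else {
            cached.reset(reader);                           // later calls: re-use
        }
        return cached;
    }
}
```

Calling reusableTokenStream twice on the same ReusingAnalyzerSketch returns the same tokenizer object both times, with only its input replaced on the second call.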