org.apache.lucene.analysis
Class MockTokenizer

java.lang.Object
  org.apache.lucene.util.AttributeSource
    org.apache.lucene.analysis.TokenStream
      org.apache.lucene.analysis.Tokenizer
        org.apache.lucene.analysis.MockTokenizer
public class MockTokenizer extends Tokenizer
Tokenizer for testing.

This tokenizer is a replacement for the WHITESPACE, SIMPLE, and KEYWORD tokenizers. If you are writing a component such as a TokenFilter, it's a good idea to test it wrapping this tokenizer instead, for the extra checks it performs. This tokenizer has the following behavior:

  * It verifies that its consumer follows the expected TokenStream workflow; this checking can be toggled with setEnableChecks(boolean).
  * It optionally lowercases the terms it emits (see the lowerCase constructor argument).
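For example, a test for a custom TokenFilter can wrap a MockTokenizer instead of a WhitespaceTokenizer. The sketch below is illustrative only: MyTokenFilter is a hypothetical pass-through stand-in for the filter under test, and the sample text and pattern are arbitrary.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.MockTokenizer;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class MyTokenFilterTest {

      /** Hypothetical stand-in for the filter under test: passes tokens through unchanged. */
      static final class MyTokenFilter extends TokenFilter {
        MyTokenFilter(TokenStream input) {
          super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
          return input.incrementToken();
        }
      }

      public static void main(String[] args) throws IOException {
        // MockTokenizer stands in for a WhitespaceTokenizer and additionally checks
        // that this test drives the stream through the expected workflow.
        MockTokenizer tokenizer =
            new MockTokenizer(new StringReader("Foo Bar"), MockTokenizer.WHITESPACE, true);
        TokenStream stream = new MyTokenFilter(tokenizer);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
          System.out.println(term.toString()); // "foo", "bar" (lowerCase = true)
        }
        stream.end();
        stream.close();
      }
    }

The same pattern applies to any filter: construct the MockTokenizer over a Reader, wrap it with the filter under test, and consume the resulting stream through the usual reset()/incrementToken()/end()/close() sequence.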
Nested Class Summary

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource:
  AttributeSource.AttributeFactory
Field Summary

  static int  DEFAULT_MAX_TOKEN_LENGTH

  static int  KEYWORD
              Acts similarly to KeywordTokenizer.

  static int  SIMPLE
              Acts like LetterTokenizer.

  static int  WHITESPACE
              Acts similarly to WhitespaceTokenizer.

Fields inherited from class org.apache.lucene.analysis.Tokenizer:
  input
Constructor Summary

  MockTokenizer(AttributeSource.AttributeFactory factory, Reader input, int pattern, boolean lowerCase, int maxTokenLength)
  MockTokenizer(Reader input, int pattern, boolean lowerCase)
  MockTokenizer(Reader input, int pattern, boolean lowerCase, int maxTokenLength)
Method Summary

  void               close()
                     By default, closes the input Reader.

  void               end()
                     This method is called by the consumer after the last token has been consumed, after TokenStream.incrementToken() returned false (using the new TokenStream API).

  boolean            incrementToken()
                     Consumers (i.e., IndexWriter) use this method to advance the stream to the next token.

  protected boolean  isTokenChar(int c)

  protected int      normalize(int c)

  protected int      readCodePoint()

  void               reset()
                     Resets this stream to the beginning.

  void               reset(Reader input)
                     Expert: Reset the tokenizer to a new reader.

  void               setEnableChecks(boolean enableChecks)
                     Toggle consumer workflow checking: if your test consumes tokenstreams normally you should leave this enabled.

Methods inherited from class org.apache.lucene.analysis.Tokenizer:
  correctOffset

Methods inherited from class org.apache.lucene.util.AttributeSource:
  addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

Methods inherited from class java.lang.Object:
  clone, finalize, getClass, notify, notifyAll, wait, wait, wait
Field Detail

public static final int WHITESPACE
  Acts similarly to WhitespaceTokenizer.

public static final int KEYWORD
  Acts similarly to KeywordTokenizer.

public static final int SIMPLE
  Acts like LetterTokenizer.

public static final int DEFAULT_MAX_TOKEN_LENGTH
Constructor Detail

public MockTokenizer(AttributeSource.AttributeFactory factory, Reader input, int pattern, boolean lowerCase, int maxTokenLength)

public MockTokenizer(Reader input, int pattern, boolean lowerCase, int maxTokenLength)

public MockTokenizer(Reader input, int pattern, boolean lowerCase)
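A sketch of the three constructor forms, using the pattern constants documented above; the sample text is arbitrary, and the explicit maxTokenLength of 255 is purely illustrative (the shorter constructors presumably fall back to DEFAULT_MAX_TOKEN_LENGTH):

    import java.io.StringReader;

    import org.apache.lucene.analysis.MockTokenizer;

    public class MockTokenizerConstruction {
      public static void main(String[] args) {
        // Whitespace-like splitting, lowercasing the emitted terms.
        MockTokenizer ws =
            new MockTokenizer(new StringReader("Some Text"), MockTokenizer.WHITESPACE, true);

        // Letter-based splitting (SIMPLE acts like LetterTokenizer), no lowercasing,
        // explicit maximum token length (255 is an arbitrary illustrative value).
        MockTokenizer simple =
            new MockTokenizer(new StringReader("Some Text"), MockTokenizer.SIMPLE, false, 255);

        // Keyword-like behavior: the entire input is emitted as a single token.
        MockTokenizer kw =
            new MockTokenizer(new StringReader("Some Text"), MockTokenizer.KEYWORD, false);
      }
    }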
Method Detail

public final boolean incrementToken() throws IOException

  Description copied from class: TokenStream
  Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.

  The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change it. If the producer needs to preserve the state for subsequent calls, it can use AttributeSource.captureState() to create a copy of the current attribute state.

  This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to AttributeSource.addAttribute(Class) and AttributeSource.getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.

  To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in TokenStream.incrementToken().

  Specified by: incrementToken in class TokenStream
  Throws: IOException
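A minimal consumer sketch following the contract above: attribute references are retrieved once up front rather than per token, and the stream is advanced until incrementToken() returns false. The sample text is arbitrary.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.MockTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class ConsumeExample {
      public static void main(String[] args) throws IOException {
        MockTokenizer ts =
            new MockTokenizer(new StringReader("foo bar baz"), MockTokenizer.WHITESPACE, false);

        // Retrieve attribute references once, before consuming, to avoid
        // addAttribute/getAttribute calls inside the per-token loop.
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

        ts.reset();
        while (ts.incrementToken()) {
          // The attributes describe the current token only until the next call.
          System.out.println(term.toString() + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
        }
        ts.end();   // end-of-stream operations, e.g. recording the final offset
        ts.close();
      }
    }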
protected int readCodePoint() throws IOException

  Throws: IOException

protected boolean isTokenChar(int c)

protected int normalize(int c)
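The protected hooks above carry no descriptions on this page; their semantics are assumed here from their names, following the usual CharTokenizer-style contract (isTokenChar decides which code points belong to a token, normalize maps each code point before it is emitted). A hypothetical subclass sketch under that assumption:

    import java.io.Reader;

    import org.apache.lucene.analysis.MockTokenizer;

    /**
     * Hypothetical subclass: emits runs of digits as tokens. The hook semantics
     * (isTokenChar selects token code points, normalize maps each emitted code
     * point) are assumed, not documented on this page.
     */
    public class DigitsOnlyMockTokenizer extends MockTokenizer {

      public DigitsOnlyMockTokenizer(Reader input) {
        // The pattern constant is largely moot once isTokenChar is overridden.
        super(input, MockTokenizer.WHITESPACE, false);
      }

      @Override
      protected boolean isTokenChar(int c) {
        return Character.isDigit(c);
      }

      @Override
      protected int normalize(int c) {
        return c; // no per-code-point normalization
      }
    }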
public void reset() throws IOException

  Description copied from class: TokenStream
  Resets this stream to the beginning. TokenStream.reset() is not needed for the standard indexing process. However, if the tokens of a TokenStream are intended to be consumed more than once, it is necessary to implement TokenStream.reset(). Note that if your TokenStream caches tokens and feeds them back again after a reset, it is imperative that you clone the tokens when you store them away (on the first pass) as well as when you return them (on future passes after TokenStream.reset()).

  Overrides: reset in class TokenStream
  Throws: IOException
public void close() throws IOException

  Description copied from class: Tokenizer
  By default, closes the input Reader.

  Specified by: close in interface Closeable
  Overrides: close in class Tokenizer
  Throws: IOException
public void reset(Reader input) throws IOException

  Description copied from class: Tokenizer
  Expert: Reset the tokenizer to a new reader.

  Overrides: reset in class Tokenizer
  Throws: IOException
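A sketch of reusing one tokenizer instance across several inputs via reset(Reader). The assumption here is that the full reset()/incrementToken()/end()/close() sequence is repeated for each reader, which is what the consumer-workflow checks expect; the sample texts are arbitrary.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.MockTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ReuseTokenizerExample {
      public static void main(String[] args) throws IOException {
        MockTokenizer ts =
            new MockTokenizer(new StringReader("first document"), MockTokenizer.WHITESPACE, false);
        consume(ts);                                   // full workflow for the first reader
        ts.reset(new StringReader("second document")); // point the same instance at a new reader
        consume(ts);                                   // full workflow again for the second reader
      }

      private static void consume(MockTokenizer ts) throws IOException {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term.toString());
        }
        ts.end();
        ts.close();
      }
    }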
public void end() throws IOException

  Description copied from class: TokenStream
  This method is called by the consumer after the last token has been consumed, after TokenStream.incrementToken() returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.

  This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g. when one or more whitespace characters follow the last token but a WhitespaceTokenizer was used.

  Overrides: end in class TokenStream
  Throws: IOException
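A sketch of observing the final offset after end(), assuming (as described above) that end() records it in the OffsetAttribute. With trailing whitespace in the input, the final offset can be larger than the end offset of the last token; the sample text is arbitrary.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.MockTokenizer;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class FinalOffsetExample {
      public static void main(String[] args) throws IOException {
        // Two trailing spaces: the final offset of the stream can exceed the
        // end offset of the last token ("bar").
        MockTokenizer ts =
            new MockTokenizer(new StringReader("foo bar  "), MockTokenizer.WHITESPACE, false);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

        ts.reset();
        int lastTokenEnd = -1;
        while (ts.incrementToken()) {
          lastTokenEnd = offset.endOffset();
        }
        ts.end(); // assumed to record the final offset in the OffsetAttribute

        System.out.println("last token end offset: " + lastTokenEnd);
        System.out.println("final offset:          " + offset.endOffset());
        ts.close();
      }
    }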
public void setEnableChecks(boolean enableChecks)

  Toggle consumer workflow checking: if your test consumes tokenstreams normally you should leave this enabled.
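A minimal sketch of disabling the checks; everything here besides setEnableChecks(boolean) itself is illustrative.

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.MockTokenizer;

    public class DisableChecksExample {
      public static void main(String[] args) throws IOException {
        MockTokenizer ts =
            new MockTokenizer(new StringReader("some text"), MockTokenizer.WHITESPACE, false);
        // Leave checking enabled for tests that consume token streams normally.
        // Disable it only when a test deliberately drives the stream outside the
        // usual reset()/incrementToken()/end()/close() order and the resulting
        // check failures would be noise rather than signal.
        ts.setEnableChecks(false);
        ts.close();
      }
    }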