org.apache.lucene.analysis.ja
Class JapaneseTokenizer
java.lang.Object
  
org.apache.lucene.util.AttributeSource
      
org.apache.lucene.analysis.TokenStream
          
org.apache.lucene.analysis.Tokenizer
              
org.apache.lucene.analysis.ja.JapaneseTokenizer
- All Implemented Interfaces: 
 - Closeable
 
public final class JapaneseTokenizer
- extends Tokenizer
 
Tokenizer for Japanese that uses morphological analysis.
 
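 This tokenizer sets a number of additional attributes:
  - BaseFormAttribute containing base form for inflected adjectives and verbs.
  - PartOfSpeechAttribute containing part-of-speech.
  - ReadingAttribute containing reading and pronunciation.
  - InflectionAttribute containing additional part-of-speech information for inflected forms.
 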
 This tokenizer uses a rolling Viterbi search to find the
 least-cost segmentation (path) of the incoming characters.
 For tokens that appear to be compound (longer than 2 characters
 if all Kanji, or longer than 7 characters otherwise), it checks
 whether a second-best segmentation of that token exists after
 penalties are applied to the long tokens. If one exists and the
 Mode is JapaneseTokenizer.Mode.SEARCH, the alternate segmentation
 is output as well.
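 
 The least-cost search and the long-token penalty described above can be illustrated with a small standalone sketch. It is not the tokenizer's implementation. The class name LeastCostSegmenter, the toy dictionary, the word costs, and the per-character penalty are all hypothetical. The sketch also works on a complete string rather than a rolling window over a Reader, ignores the connection costs a real morphological lattice would carry, and uses Latin text with the non-Kanji length threshold of 7 only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeastCostSegmenter {
    // Hypothetical toy dictionary: surface form -> word cost (lower is better).
    private final Map<String, Integer> wordCosts;
    // Tokens longer than this are treated as possible compounds.
    private final int longTokenLength;
    // Assumed penalty added per character beyond longTokenLength.
    private final int penaltyPerChar;

    public LeastCostSegmenter(Map<String, Integer> wordCosts,
                              int longTokenLength,
                              int penaltyPerChar) {
        this.wordCosts = wordCosts;
        this.longTokenLength = longTokenLength;
        this.penaltyPerChar = penaltyPerChar;
    }

    /** Returns the least-cost segmentation; if penalize is true, long tokens cost extra. */
    public List<String> segment(String text, boolean penalize) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: lowest cost of segmenting text[0, i)
        int[] back = new int[n + 1];   // back[i]: start offset of the last token on that path
        Arrays.fill(best, Long.MAX_VALUE);
        Arrays.fill(back, -1);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Long.MAX_VALUE) {
                    continue; // no path reaches this start offset
                }
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost == null) {
                    continue; // not a dictionary word
                }
                long total = best[start] + cost;
                int length = end - start;
                if (penalize && length > longTokenLength) {
                    total += (long) (length - longTokenLength) * penaltyPerChar;
                }
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }
        if (best[n] == Long.MAX_VALUE) {
            throw new IllegalArgumentException("no segmentation covers the input");
        }

        // Walk the back pointers from the end of the text, then restore reading order.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> costs = new HashMap<>();
        costs.put("kansaiairport", 5);
        costs.put("kansai", 3);
        costs.put("airport", 3);
        LeastCostSegmenter segmenter = new LeastCostSegmenter(costs, 7, 1);

        List<String> best = segmenter.segment("kansaiairport", false);
        List<String> alternate = segmenter.segment("kansaiairport", true);
        System.out.println(best);          // [kansaiairport]
        if (!alternate.equals(best)) {
            System.out.println(alternate); // [kansai, airport]
        }
    }
}
```

 Without the penalty the single compound entry is cheapest (5 against 3 + 3). With the penalty its cost rises to 5 + (13 - 7) * 1 = 11, so the two-token path wins and would be the alternate segmentation reported in a search-style mode.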
 
Nested Class Summary
static class JapaneseTokenizer.Mode
          Tokenization mode: this determines how the tokenizer handles compound and unknown words.
static class JapaneseTokenizer.Type
          Token type reflecting the original source of this token.
 
 
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
 
 
DEFAULT_MODE
public static final JapaneseTokenizer.Mode DEFAULT_MODE
- Default tokenization mode. Currently this is 
JapaneseTokenizer.Mode.SEARCH.
 
JapaneseTokenizer
public JapaneseTokenizer(Reader input,
                         UserDictionary userDictionary,
                         boolean discardPunctuation,
                         JapaneseTokenizer.Mode mode)
- Create a new JapaneseTokenizer.
- Parameters:
 input - Reader containing text.
 userDictionary - Optional: if non-null, user dictionary.
 discardPunctuation - true if punctuation tokens should be dropped from the output.
 mode - tokenization mode.
 
setGraphvizFormatter
public void setGraphvizFormatter(GraphvizFormatter dotOut)
- Expert: set this to produce Graphviz (dot) output of
  the Viterbi lattice.
 
 
reset
public void reset()
           throws IOException
- Overrides:
 reset in class TokenStream
 
- Throws:
 IOException
 
end
public void end()
- Overrides:
 end in class TokenStream
 
 
incrementToken
public boolean incrementToken()
                       throws IOException
- Specified by:
 incrementToken in class TokenStream
 
- Throws:
 IOException
 
          Copyright © 2000-2012 Apache Software Foundation.  All Rights Reserved.