org.apache.lucene.analysis.standard
Class UAX29URLEmailTokenizer
java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.Tokenizer
              org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer
- All Implemented Interfaces: 
 - Closeable
 
public final class UAX29URLEmailTokenizer
- extends Tokenizer
 
This class implements the Word Break rules from the Unicode Text Segmentation 
 algorithm, as specified in 
 Unicode Standard Annex #29.
 URLs and email addresses are also tokenized according to the relevant RFCs.
 
 Tokens produced are of the following types:
 
   - <ALPHANUM>: A sequence of alphabetic and numeric characters
 
   - <NUM>: A number
 
   - <URL>: A URL
 
   - <EMAIL>: An email address
 
   - <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast
       Asian languages, including Thai, Lao, Myanmar, and Khmer
 
   - <IDEOGRAPHIC>: A single CJKV ideographic character
 
   - <HIRAGANA>: A single hiragana character
 
 
 
 You must specify the required Version
 compatibility when creating UAX29URLEmailTokenizer:
 
   -  As of 3.4, Hiragana and Han characters are no longer wrongly split
   from their combining characters. If you pass an earlier version number,
   the tokenizer reproduces the old, incorrect splitting for backwards compatibility.
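 The following is a minimal usage sketch, not part of the original
 documentation. It assumes Lucene 3.4 or later (3.x) on the classpath and uses
 the standard TokenStream consumer workflow (reset, incrementToken, end, close).
 The class name UAX29URLEmailTokenizerUsage, the sample text, and the choice of
 Version.LUCENE_34 are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

// Hypothetical demo class; requires the Lucene 3.x core jar.
class UAX29URLEmailTokenizerUsage {
    public static void main(String[] args) throws IOException {
        // Sample input chosen for illustration only.
        StringReader input = new StringReader(
            "Mail admin@example.com or visit http://example.com/docs");

        // Version.LUCENE_34 requests the corrected Hiragana/Han handling.
        UAX29URLEmailTokenizer tokenizer =
            new UAX29URLEmailTokenizer(Version.LUCENE_34, input);

        // Attributes expose the text and the type of the current token.
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

        try {
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // Prints the token text and its type string, e.g. one of TOKEN_TYPES.
                System.out.println(term.toString() + "\t" + type.type());
            }
            tokenizer.end();
        } finally {
            tokenizer.close();
        }
    }
}
```

 Reading the type through TypeAttribute is one way to separate URL and email
 tokens from ordinary words in a downstream filter.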
 
 
 
 
 
Fields inherited from class org.apache.lucene.analysis.Tokenizer
input
 
 
 
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState
 
 
ALPHANUM
public static final int ALPHANUM
- See Also:
 - Constant Field Values
 
NUM
public static final int NUM
- See Also:
 - Constant Field Values
 
SOUTHEAST_ASIAN
public static final int SOUTHEAST_ASIAN
- See Also:
 - Constant Field Values
 
IDEOGRAPHIC
public static final int IDEOGRAPHIC
- See Also:
 - Constant Field Values
 
HIRAGANA
public static final int HIRAGANA
- See Also:
 - Constant Field Values
 
KATAKANA
public static final int KATAKANA
- See Also:
 - Constant Field Values
 
HANGUL
public static final int HANGUL
- See Also:
 - Constant Field Values
 
URL
public static final int URL
- See Also:
 - Constant Field Values
 
EMAIL
public static final int EMAIL
- See Also:
 - Constant Field Values
 
TOKEN_TYPES
public static final String[] TOKEN_TYPES
- String token types that correspond to token type int constants
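
 As a rough illustration of how int constants can correspond to type strings,
 the sketch below indexes a string array by a constant. It is a stand-alone
 model, not the Lucene source: the class name TokenTypeLookupSketch, the
 numeric values, and the three-entry array are assumptions made for the
 example. The real values are listed under Constant Field Values.

```java
// Illustration only: values and array contents are hypothetical.
class TokenTypeLookupSketch {
    static final int ALPHANUM = 0; // hypothetical value
    static final int URL = 1;      // hypothetical value
    static final int EMAIL = 2;    // hypothetical value

    // Each array position holds the type string for the constant with that value.
    static final String[] TOKEN_TYPES = { "<ALPHANUM>", "<URL>", "<EMAIL>" };

    // Looks up the type string for an int token type.
    static String typeName(int tokenType) {
        return TOKEN_TYPES[tokenType];
    }

    public static void main(String[] args) {
        System.out.println(typeName(EMAIL)); // prints "<EMAIL>"
    }
}
```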
 
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer(Version matchVersion,
                              Reader input)
- Creates a new instance of the UAX29URLEmailTokenizer and attaches
 the input to the newly created JFlex scanner.
- Parameters:
 matchVersion - The Lucene Version whose tokenization behavior to match
 input - The input reader
 
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer(Version matchVersion,
                              AttributeSource source,
                              Reader input)
- Creates a new UAX29URLEmailTokenizer with a given AttributeSource.
 
UAX29URLEmailTokenizer
public UAX29URLEmailTokenizer(Version matchVersion,
                              AttributeSource.AttributeFactory factory,
                              Reader input)
- Creates a new UAX29URLEmailTokenizer with a given AttributeSource.AttributeFactory.
 
setMaxTokenLength
public void setMaxTokenLength(int length)
- Sets the maximum allowed token length.  Any token longer
  than this is skipped.
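
 The skipping rule can be sketched without Lucene. In the stand-alone model
 below, whitespace splitting stands in for the real JFlex scanner, and the
 class name MaxTokenLengthSketch and method tokenize are hypothetical. It only
 shows that an over-long token is dropped whole while its neighbors survive.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: whitespace splitting replaces the UAX#29 scanner.
class MaxTokenLengthSketch {

    // Returns the whitespace-separated tokens of text whose length does not
    // exceed maxTokenLength. Longer tokens are dropped entirely.
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> kept = new ArrayList<String>();
        for (String candidate : text.trim().split("\\s+")) {
            if (candidate.isEmpty()) {
                continue; // blank input splits into a single empty string
            }
            if (candidate.length() > maxTokenLength) {
                continue; // too long: skipped
            }
            kept.add(candidate);
        }
        return kept;
    }

    public static void main(String[] args) {
        // The 20-character middle word exceeds the limit of 10.
        System.out.println(tokenize("short supercalifragilistic word", 10));
        // prints [short, word]
    }
}
```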
 
 
getMaxTokenLength
public int getMaxTokenLength()
- See Also:
 setMaxTokenLength(int)
 
incrementToken
public final boolean incrementToken()
                             throws IOException
- Specified by:
 incrementToken in class TokenStream
 
- Throws:
 IOException
 
end
public final void end()
- Overrides:
 end in class TokenStream
 
 
reset
public void reset()
           throws IOException
- Overrides:
 reset in class TokenStream
 
- Throws:
 IOException
 
          Copyright © 2000-2012 Apache Software Foundation.  All Rights Reserved.