org.apache.lucene.analysis.hi
Class HindiNormalizer
java.lang.Object
org.apache.lucene.analysis.hi.HindiNormalizer
public class HindiNormalizer
- extends Object
Normalizer for Hindi.
Normalizes text to remove some differences in spelling variations.
Implements the Hindi-language specific algorithm specified in:
Word normalization in Indian languages
Prasad Pingali and Vasudeva Varma.
http://web2py.iiit.ac.in/publications/default/download/inproceedings.pdf.3fe5b38c-02ee-41ce-9a8f-3e745670be32.pdf
with the following additions from Hindi CLIR in Thirty Days
Leah S. Larkey, Margaret E. Connell, and Nasreen AbdulJaleel.
http://maroo.cs.umass.edu/pub/web/getpdf.php?id=454:
- Internal zero-width joiners and zero-width non-joiners are removed
- In addition to chandrabindu, NA+halant is normalized to anusvara
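The two additions listed above can be illustrated with a minimal sketch. This is not the Lucene source and covers only those two rules, not the full Pingali and Varma algorithm. The class name HindiNormalizeSketch and the helper delete are hypothetical names introduced for illustration. The sketch assumes the buffer is edited in place and, as a simplification, removes every zero-width joiner or non-joiner it encounters.

```java
// Hypothetical illustration only; not org.apache.lucene.analysis.hi.HindiNormalizer.
class HindiNormalizeSketch {
    static final char ZWNJ = '\u200C';         // zero-width non-joiner
    static final char ZWJ = '\u200D';          // zero-width joiner
    static final char CHANDRABINDU = '\u0901'; // Devanagari sign candrabindu
    static final char ANUSVARA = '\u0902';     // Devanagari sign anusvara
    static final char NA = '\u0928';           // Devanagari letter NA
    static final char HALANT = '\u094D';       // Devanagari sign virama (halant)

    /** Removes s[pos] by shifting the tail left; returns the new length. */
    static int delete(char[] s, int pos, int len) {
        if (pos < len - 1) {
            System.arraycopy(s, pos + 1, s, pos, len - pos - 1);
        }
        return len - 1;
    }

    /** Applies the two rules in place; returns the length after normalization. */
    static int normalize(char[] s, int len) {
        for (int i = 0; i < len; i++) {
            switch (s[i]) {
                case ZWJ:
                case ZWNJ:
                    // Drop the joiner and revisit the same index.
                    len = delete(s, i, len);
                    i--;
                    break;
                case CHANDRABINDU:
                    s[i] = ANUSVARA;
                    break;
                case NA:
                    // NA followed by halant collapses to a single anusvara.
                    if (i + 1 < len && s[i + 1] == HALANT) {
                        s[i] = ANUSVARA;
                        len = delete(s, i + 1, len);
                    }
                    break;
                default:
                    break;
            }
        }
        return len;
    }

    public static void main(String[] args) {
        // HA, vowel sign I, NA, halant, DA, vowel sign II
        char[] buf = "\u0939\u093F\u0928\u094D\u0926\u0940".toCharArray();
        int newLen = normalize(buf, buf.length);
        // Only the first newLen chars of the buffer are meaningful afterwards.
        System.out.println(new String(buf, 0, newLen));
    }
}
```

As in the sketch's main method, a caller would typically keep the returned length and read only that many characters from the buffer, since trailing positions may still hold stale data.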
Method Summary
int | normalize(char[] s, int len) | Normalize an input buffer of Hindi text
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
HindiNormalizer
public HindiNormalizer()
normalize
public int normalize(char[] s, int len)
- Normalize an input buffer of Hindi text
- Parameters:
s - input buffer
len - length of input buffer
- Returns:
- length of input buffer after normalization