org.apache.lucene.analysis.icu.segmentation
Class LaoBreakIterator
java.lang.Object
com.ibm.icu.text.BreakIterator
org.apache.lucene.analysis.icu.segmentation.LaoBreakIterator
- All Implemented Interfaces:
- Cloneable
public class LaoBreakIterator
- extends com.ibm.icu.text.BreakIterator
Syllable iterator for Lao text.
This breaks Lao text into syllables according to:
Syllabification of Lao Script for Line Breaking
Phonpasit Phissamay, Valaxay Dalolay, Chitaphone Chanhsililath, Oulaiphone Silimasak,
Sarmad Hussain, Nadir Durrani, Science Technology and Environment Agency, CRULP.
- http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf
- http://www.panl10n.net/Presentations/Cambodia/Phonpassit/LineBreakingAlgo.pdf
Most of the work is done by RBBI rules; however, some additional special logic
cannot be expressed in a grammar, and that logic is implemented here.
For example, what appears to be a final consonant might instead be part of the next syllable.
Because rules match greedily, they can leave behind an illegal sequence that matches no rule.
Take for instance the text ກວ່າດອກ.
The first rule greedily matches ກວ່າດ, but then ອກ is encountered, which is illegal.
What LaoBreakIterator does, according to the paper:
1. Backtrack and remove the ດ from the last syllable, placing it on the current syllable.
2. Verify the modified previous syllable (ກວ່າ) is still legal.
3. Verify the modified current syllable (ດອກ) is now legal.
4. If step 2 or 3 fails, restore the ດ to the last syllable and skip the current character.
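The backtracking steps above can be sketched in plain Java. This is only an illustration, not the class's actual implementation: the class name LaoBacktrackSketch, the method backtrack, and the isLegal predicate are hypothetical, and a small lookup set stands in for the RBBI rules that decide legality. The Lao strings are written as Unicode escapes for the example text ກວ່າດ and ອກ.

```java
import java.util.Set;
import java.util.function.Predicate;

public class LaoBacktrackSketch {

    /**
     * Tries to move the last character of {@code previous} onto the front of
     * {@code current}. Returns the two adjusted syllables, or null when either
     * adjusted syllable is illegal; the caller would then keep the original
     * split and skip the current character.
     */
    static String[] backtrack(String previous, String current, Predicate<String> isLegal) {
        if (previous.isEmpty()) {
            return null; // nothing to borrow from
        }
        // Lao letters are in the BMP, so one char is one code point here.
        int cut = previous.length() - 1;
        String newPrevious = previous.substring(0, cut);        // step 1: remove the final character
        String newCurrent = previous.substring(cut) + current;  // step 1: place it on the current syllable
        if (isLegal.test(newPrevious) && isLegal.test(newCurrent)) { // steps 2 and 3
            return new String[] { newPrevious, newCurrent };
        }
        return null; // step 4: signal that the original split must be restored
    }

    public static void main(String[] args) {
        // Toy legality table standing in for the RBBI rules (assumption).
        Set<String> legal = Set.of("\u0e81\u0ea7\u0ec8\u0eb2", "\u0e94\u0ead\u0e81");
        // Greedy first match and the illegal remainder from the example text.
        String[] fixed = backtrack("\u0e81\u0ea7\u0ec8\u0eb2\u0e94", "\u0ead\u0e81", legal::contains);
        System.out.println(fixed[0] + " | " + fixed[1]);
    }
}
```

Returning null rather than throwing keeps the failure case cheap, since step 4 is an expected outcome and not an error.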
Finally, LaoBreakIterator also takes care of the second concern mentioned in the paper.
This is the issue of combining marks being in the wrong order (typos).
- WARNING: This API is experimental and might change in incompatible ways in the next release.
Fields inherited from class com.ibm.icu.text.BreakIterator
DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD
Constructor Summary
LaoBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rules)
Methods inherited from class com.ibm.icu.text.BreakIterator
getAvailableLocales, getAvailableULocales, getBreakInstance, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, isBoundary, preceding, registerInstance, registerInstance, unregister
LaoBreakIterator
public LaoBreakIterator(com.ibm.icu.text.RuleBasedBreakIterator rules)
current
public int current()
- Specified by:
current
in class com.ibm.icu.text.BreakIterator
first
public int first()
- Specified by:
first
in class com.ibm.icu.text.BreakIterator
following
public int following(int offset)
- Specified by:
following
in class com.ibm.icu.text.BreakIterator
getText
public CharacterIterator getText()
- Specified by:
getText
in class com.ibm.icu.text.BreakIterator
last
public int last()
- Specified by:
last
in class com.ibm.icu.text.BreakIterator
next
public int next()
- Specified by:
next
in class com.ibm.icu.text.BreakIterator
next
public int next(int n)
- Specified by:
next
in class com.ibm.icu.text.BreakIterator
previous
public int previous()
- Specified by:
previous
in class com.ibm.icu.text.BreakIterator
setText
public void setText(CharacterIterator text)
- Specified by:
setText
in class com.ibm.icu.text.BreakIterator
setText
public void setText(String newText)
- Overrides:
setText
in class com.ibm.icu.text.BreakIterator
clone
public Object clone()
- Clone method. Creates another LaoBreakIterator with the same behavior
and current state as this one.
- Overrides:
clone
in class com.ibm.icu.text.BreakIterator
- Returns:
- The clone.
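The navigation methods listed above (setText, first, next, and the DONE field) are typically used in a boundary loop. The sketch below shows that loop using the JDK's java.text.BreakIterator as a stand-in, on the assumption that a LaoBreakIterator can be driven the same way through the same-named methods; the class name BoundaryLoopSketch and the helper segments are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoundaryLoopSketch {

    /** Collects the text between each pair of consecutive boundaries. */
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns DONE once the last boundary has been passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // JDK word iterator used only as a stand-in for a syllable iterator.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        System.out.println(segments(words, "hello world"));
    }
}
```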