org.apache.lucene.search.spell
Class SpellChecker

java.lang.Object
  extended by org.apache.lucene.search.spell.SpellChecker
All Implemented Interfaces:
Closeable

public class SpellChecker
extends Object
implements Closeable

Spell checker class (main class), initially inspired by David Spencer's code.

Example Usage:

  SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
  // config is an IndexWriterConfig supplied by the caller; false means no full merge.
  // To index a field of a user index:
  spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field), config, false);
  // To index a file containing words:
  spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")), config, false);
  String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
 

Version:
1.0

Field Summary
static float DEFAULT_ACCURACY
          The default minimum score to use, if not specified by calling setAccuracy(float) .
static String F_WORD
          Field name for each word in the ngram index.
 
Constructor Summary
SpellChecker(Directory spellIndex)
          Use the given directory as a spell checker index with a LevensteinDistance as the default StringDistance.
SpellChecker(Directory spellIndex, StringDistance sd)
          Use the given directory as a spell checker index.
SpellChecker(Directory spellIndex, StringDistance sd, Comparator<SuggestWord> comparator)
          Use the given directory as a spell checker index with the given StringDistance measure and the given Comparator for sorting the results.
 
Method Summary
 void clearIndex()
          Removes all terms from the spell check index.
 void close()
          Close the IndexSearcher used by this SpellChecker
 boolean exist(String word)
          Check whether the word exists in the index.
 float getAccuracy()
          The accuracy (minimum score) used to decide whether a suggestion is included, unless overridden by a suggestSimilar overload that takes an accuracy argument, such as suggestSimilar(String, int, org.apache.lucene.index.IndexReader, String, SuggestMode, float).
 Comparator<SuggestWord> getComparator()
          Returns the Comparator used to sort the suggestions.
 StringDistance getStringDistance()
          Returns the StringDistance instance used by this SpellChecker instance.
 void indexDictionary(Dictionary dict, IndexWriterConfig config, boolean fullMerge)
          Indexes the data from the given Dictionary.
 void setAccuracy(float acc)
          Sets the accuracy (minimum score), a value in the range 0 < acc < 1; the default is DEFAULT_ACCURACY.
 void setComparator(Comparator<SuggestWord> comparator)
          Sets the Comparator for the SuggestWordQueue.
 void setSpellIndex(Directory spellIndexDir)
          Use a different index as the spell checker index, or re-open the existing index if spellIndexDir is the same value as given in the constructor.
 void setStringDistance(StringDistance sd)
          Sets the StringDistance implementation for this SpellChecker instance.
 String[] suggestSimilar(String word, int numSug)
          Suggest similar words.
 String[] suggestSimilar(String word, int numSug, float accuracy)
          Suggest similar words.
 String[] suggestSimilar(String word, int numSug, IndexReader ir, String field, boolean morePopular)
          Deprecated. Use suggestSimilar(String, int, IndexReader, String, SuggestMode)
  • SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX instead of morePopular=false
  • SuggestMode.SUGGEST_MORE_POPULAR instead of morePopular=true
 String[] suggestSimilar(String word, int numSug, IndexReader ir, String field, boolean morePopular, float accuracy)
          Deprecated. Use suggestSimilar(String, int, IndexReader, String, SuggestMode, float)
  • SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX instead of morePopular=false
  • SuggestMode.SUGGEST_MORE_POPULAR instead of morePopular=true
 String[] suggestSimilar(String word, int numSug, IndexReader ir, String field, SuggestMode suggestMode)
          Calls suggestSimilar(word, numSug, ir, field, suggestMode, this.accuracy)
 String[] suggestSimilar(String word, int numSug, IndexReader ir, String field, SuggestMode suggestMode, float accuracy)
          Suggest similar words (optionally restricted to a field of an index).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_ACCURACY

public static final float DEFAULT_ACCURACY
The default minimum score to use, if not specified by calling setAccuracy(float) .

See Also:
Constant Field Values

F_WORD

public static final String F_WORD
Field name for each word in the ngram index.

See Also:
Constant Field Values
Constructor Detail

SpellChecker

public SpellChecker(Directory spellIndex,
                    StringDistance sd)
             throws IOException
Use the given directory as a spell checker index. The directory is created if it doesn't exist yet.

Parameters:
spellIndex - the spell index directory
sd - the StringDistance measurement to use
Throws:
IOException - if SpellChecker cannot open the directory

SpellChecker

public SpellChecker(Directory spellIndex)
             throws IOException
Use the given directory as a spell checker index with a LevensteinDistance as the default StringDistance. The directory is created if it doesn't exist yet.

Parameters:
spellIndex - the spell index directory
Throws:
IOException - if SpellChecker cannot open the directory

SpellChecker

public SpellChecker(Directory spellIndex,
                    StringDistance sd,
                    Comparator<SuggestWord> comparator)
             throws IOException
Use the given directory as a spell checker index with the given StringDistance measure and the given Comparator for sorting the results.

Parameters:
spellIndex - the spell index directory
sd - the StringDistance measure to use
comparator - the Comparator used to sort the results
Throws:
IOException - if there is a problem opening the index
Method Detail

setSpellIndex

public void setSpellIndex(Directory spellIndexDir)
                   throws IOException
Use a different index as the spell checker index, or re-open the existing index if spellIndexDir is the same value as given in the constructor.

Parameters:
spellIndexDir - the spell directory to use
Throws:
AlreadyClosedException - if the SpellChecker is already closed
IOException - if SpellChecker cannot open the directory

setComparator

public void setComparator(Comparator<SuggestWord> comparator)
Sets the Comparator for the SuggestWordQueue.

Parameters:
comparator - the comparator

getComparator

public Comparator<SuggestWord> getComparator()
Returns the Comparator used to sort the suggestions.

setStringDistance

public void setStringDistance(StringDistance sd)
Sets the StringDistance implementation for this SpellChecker instance.

Parameters:
sd - the StringDistance implementation for this SpellChecker instance

getStringDistance

public StringDistance getStringDistance()
Returns the StringDistance instance used by this SpellChecker instance.

Returns:
the StringDistance instance used by this SpellChecker instance.

setAccuracy

public void setAccuracy(float acc)
Sets the accuracy (minimum score), a value in the range 0 < acc < 1; the default is DEFAULT_ACCURACY.

Parameters:
acc - The new accuracy
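
To make the role of the minimum score concrete, here is a minimal sketch; it is not the SpellChecker implementation. It assumes a similarity score of 1 - editDistance / maxLength (one common way to normalize Levenshtein distance) and uses 0.5f purely as an illustrative threshold. The names AccuracySketch, similarity and passes are hypothetical.

```java
// Illustrative only: a stand-in for a normalized string similarity and an
// accuracy (minimum score) filter. Not taken from the Lucene source.
final class AccuracySketch {

    // Classic dynamic-programming Levenshtein edit distance (two rolling rows).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical strings, values near 0.0 mean very different.
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f;
        }
        return 1.0f - (float) editDistance(a, b) / maxLen;
    }

    // A candidate is kept only if its score reaches the minimum score.
    static boolean passes(String word, String candidate, float minScore) {
        return similarity(word, candidate) >= minScore;
    }

    public static void main(String[] args) {
        float minScore = 0.5f; // illustrative threshold, an assumption of this sketch
        System.out.println(passes("misspelt", "misspell", minScore));
        System.out.println(passes("misspelt", "lucene", minScore));
    }
}
```

Raising the threshold toward 1 keeps only near-identical candidates; lowering it toward 0 admits more distant ones.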

getAccuracy

public float getAccuracy()
The accuracy (minimum score) used to decide whether a suggestion is included, unless overridden by a suggestSimilar overload that takes an accuracy argument, such as suggestSimilar(String, int, org.apache.lucene.index.IndexReader, String, SuggestMode, float).

Returns:
The current accuracy setting

suggestSimilar

public String[] suggestSimilar(String word,
                               int numSug)
                        throws IOException
Suggest similar words.

Lucene's similarity, which is used to fetch the most relevant n-gram terms, is not the same as the edit distance strategy used to pick the best-matching word from those hits, so you usually have to request several suggestions (a larger numSug) to get the true best match.

For example, if numSug == 1, do not count on that suggestion being the best one. Set this value to at least 5 for a good suggestion.
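
The two-stage behavior described above can be sketched as follows. This is a simplified illustration, not Lucene's code: it assumes candidates are first ranked by the number of shared character bigrams (a stand-in for the n-gram index query) and that only the fetched candidates are then compared by edit distance. All names (NGramSketch, bigrams, sharedBigrams, bestMatch) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: n-gram retrieval followed by edit-distance selection.
final class NGramSketch {

    // All distinct 2-character substrings of s.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    static int sharedBigrams(String a, String b) {
        Set<String> shared = bigrams(a);
        shared.retainAll(bigrams(b));
        return shared.size();
    }

    // Classic dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Stage 1: keep the `fetch` candidates sharing the most bigrams with the word.
    // Stage 2: among those, return the candidate with the smallest edit distance.
    static String bestMatch(String word, List<String> dictionary, int fetch) {
        List<String> hits = new ArrayList<>(dictionary);
        hits.sort(Comparator.comparingInt((String c) -> sharedBigrams(word, c)).reversed());
        List<String> top = hits.subList(0, Math.min(fetch, hits.size()));
        String best = null;
        for (String candidate : top) {
            if (best == null || editDistance(word, candidate) < editDistance(word, best)) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> dictionary = new ArrayList<>();
        dictionary.add("from");
        dictionary.add("farm");
        dictionary.add("format");
        // Fetching a single hit versus fetching several can give different answers.
        System.out.println(bestMatch("form", dictionary, 1));
        System.out.println(bestMatch("form", dictionary, 3));
    }
}
```

In this toy dictionary, "format" shares the most bigrams with "form" while "farm" is closer by edit distance, so the closest word is only found when more than one candidate is fetched.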

Parameters:
word - the word you want a spell check done on
numSug - the number of suggested words
Returns:
String[] the suggested words, best match first
Throws:
IOException - if the underlying index throws an IOException
AlreadyClosedException - if the Spellchecker is already closed
See Also:
suggestSimilar(String, int, org.apache.lucene.index.IndexReader, String, boolean, float)

suggestSimilar

public String[] suggestSimilar(String word,
                               int numSug,
                               float accuracy)
                        throws IOException
Suggest similar words.

Lucene's similarity, which is used to fetch the most relevant n-gram terms, is not the same as the edit distance strategy used to pick the best-matching word from those hits, so you usually have to request several suggestions (a larger numSug) to get the true best match.

For example, if numSug == 1, do not count on that suggestion being the best one. Set this value to at least 5 for a good suggestion.

Parameters:
word - the word you want a spell check done on
numSug - the number of suggested words
accuracy - The minimum score a suggestion must have in order to qualify for inclusion in the results
Returns:
String[] the suggested words, best match first
Throws:
IOException - if the underlying index throws an IOException
AlreadyClosedException - if the Spellchecker is already closed
See Also:
suggestSimilar(String, int, org.apache.lucene.index.IndexReader, String, boolean, float)

suggestSimilar

@Deprecated
public String[] suggestSimilar(String word,
                                          int numSug,
                                          IndexReader ir,
                                          String field,
                                          boolean morePopular)
                        throws IOException
Deprecated. Use suggestSimilar(String, int, IndexReader, String, SuggestMode)
  • SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX instead of morePopular=false
  • SuggestMode.SUGGEST_MORE_POPULAR instead of morePopular=true

Suggest similar words (optionally restricted to a field of an index).

Lucene's similarity, which is used to fetch the most relevant n-gram terms, is not the same as the edit distance strategy used to pick the best-matching word from those hits, so you usually have to request several suggestions (a larger numSug) to get the true best match.

For example, if numSug == 1, do not count on that suggestion being the best one. Set this value to at least 5 for a good suggestion.

Uses the value returned by getAccuracy() as the accuracy.

Parameters:
word - the word you want a spell check done on
numSug - the number of suggested words
ir - the IndexReader of the user index (can be null; see the field parameter)
field - the field of the user index: if field is not null, the suggested words are restricted to the words present in this field.
morePopular - return only the suggested words that are as frequent as or more frequent than the searched word (only in restricted mode, that is, when ir != null and field != null)
Returns:
String[] the list of suggested words, sorted by two criteria: first, the edit distance; second (only in restricted mode), the popularity of the suggested word in the field of the user index
Throws:
IOException - if the underlying index throws an IOException
AlreadyClosedException - if the Spellchecker is already closed
See Also:
suggestSimilar(String, int, IndexReader, String, SuggestMode, float)
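
When migrating callers off this deprecated overload, the flag-to-mode mapping listed in the deprecation note can be captured in a small helper. The sketch below is self-contained for illustration only: the nested SuggestMode enum is a local stand-in that mirrors the constant names of org.apache.lucene.search.spell.SuggestMode, and toSuggestMode is a hypothetical helper, not part of Lucene.

```java
// Illustrative only: maps the deprecated boolean flag to the replacement mode.
final class SuggestModeMigration {

    // Local stand-in mirroring the constant names of Lucene's SuggestMode enum.
    enum SuggestMode {
        SUGGEST_WHEN_NOT_IN_INDEX,
        SUGGEST_MORE_POPULAR,
        SUGGEST_ALWAYS
    }

    // morePopular=true  maps to SUGGEST_MORE_POPULAR,
    // morePopular=false maps to SUGGEST_WHEN_NOT_IN_INDEX.
    static SuggestMode toSuggestMode(boolean morePopular) {
        return morePopular
                ? SuggestMode.SUGGEST_MORE_POPULAR
                : SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
    }

    public static void main(String[] args) {
        System.out.println(toSuggestMode(true));
        System.out.println(toSuggestMode(false));
    }
}
```

With the real enum, a call such as suggestSimilar(word, numSug, ir, field, true) would become suggestSimilar(word, numSug, ir, field, SuggestMode.SUGGEST_MORE_POPULAR).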

suggestSimilar

@Deprecated
public String[] suggestSimilar(String word,
                                          int numSug,
                                          IndexReader ir,
                                          String field,
                                          boolean morePopular,
                                          float accuracy)
                        throws IOException
Deprecated. Use suggestSimilar(String, int, IndexReader, String, SuggestMode, float)
  • SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX instead of morePopular=false
  • SuggestMode.SUGGEST_MORE_POPULAR instead of morePopular=true

Suggest similar words (optionally restricted to a field of an index).

Lucene's similarity, which is used to fetch the most relevant n-gram terms, is not the same as the edit distance strategy used to pick the best-matching word from those hits, so you usually have to request several suggestions (a larger numSug) to get the true best match.

For example, if numSug == 1, do not count on that suggestion being the best one. Set this value to at least 5 for a good suggestion.

Parameters:
word - the word you want a spell check done on
numSug - the number of suggested words
ir - the IndexReader of the user index (can be null; see the field parameter)
field - the field of the user index: if field is not null, the suggested words are restricted to the words present in this field.
morePopular - return only the suggested words that are as frequent as or more frequent than the searched word (only in restricted mode, that is, when ir != null and field != null)
accuracy - The minimum score a suggestion must have in order to qualify for inclusion in the results
Returns:
String[] the list of suggested words, sorted by two criteria: first, the edit distance; second (only in restricted mode), the popularity of the suggested word in the field of the user index
Throws:
IOException - if the underlying index throws an IOException
AlreadyClosedException - if the Spellchecker is already closed
See Also:
suggestSimilar(String, int, IndexReader, String, SuggestMode, float)

suggestSimilar

public String[] suggestSimilar(String word,
                               int numSug,
                               IndexReader ir,
                               String field,
                               SuggestMode suggestMode)
                        throws IOException
Calls suggestSimilar(word, numSug, ir, field, suggestMode, this.accuracy)

Throws:
IOException

suggestSimilar

public String[] suggestSimilar(String word,
                               int numSug,
                               IndexReader ir,
                               String field,
                               SuggestMode suggestMode,
                               float accuracy)
                        throws IOException
Suggest similar words (optionally restricted to a field of an index).

Lucene's similarity, which is used to fetch the most relevant n-gram terms, is not the same as the edit distance strategy used to pick the best-matching word from those hits, so you usually have to request several suggestions (a larger numSug) to get the true best match.

For example, if numSug == 1, do not count on that suggestion being the best one. Set this value to at least 5 for a good suggestion.

Parameters:
word - the word you want a spell check done on
numSug - the number of suggested words
ir - the IndexReader of the user index (can be null; see the field parameter)
field - the field of the user index: if field is not null, the suggested words are restricted to the words present in this field.
suggestMode - the suggest mode to apply (NOTE: if ir == null and/or field == null, this is overridden with SuggestMode.SUGGEST_ALWAYS)
accuracy - The minimum score a suggestion must have in order to qualify for inclusion in the results
Returns:
String[] the list of suggested words, sorted by two criteria: first, the edit distance; second (only in restricted mode), the popularity of the suggested word in the field of the user index
Throws:
IOException - if the underlying index throws an IOException
AlreadyClosedException - if the Spellchecker is already closed
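
The two-criteria ordering described under Returns can be illustrated with a comparator. This is a sketch under stated assumptions, not the library's code: Suggestion is a hypothetical stand-in for a scored suggestion, with a similarity score (higher meaning a smaller edit distance) and a frequency in the user index, and ties on score are broken by frequency.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: orders suggestions by score, then by frequency.
final class SuggestionOrderSketch {

    // Hypothetical stand-in for a scored suggestion.
    static final class Suggestion {
        final String text;
        final float score; // similarity to the query word; higher is closer
        final int freq;    // occurrences in the restricted field of the user index

        Suggestion(String text, float score, int freq) {
            this.text = text;
            this.score = score;
            this.freq = freq;
        }
    }

    // First criterion: higher score first. Second criterion: higher frequency first.
    static final Comparator<Suggestion> BEST_FIRST = new Comparator<Suggestion>() {
        @Override
        public int compare(Suggestion a, Suggestion b) {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : Integer.compare(b.freq, a.freq);
        }
    };

    // Returns the suggestion texts in best-first order without modifying the input.
    static List<String> order(List<Suggestion> suggestions) {
        List<Suggestion> sorted = new ArrayList<>(suggestions);
        sorted.sort(BEST_FIRST);
        List<String> texts = new ArrayList<>();
        for (Suggestion s : sorted) {
            texts.add(s.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Suggestion> suggestions = new ArrayList<>();
        suggestions.add(new Suggestion("a", 0.75f, 3));
        suggestions.add(new Suggestion("b", 0.875f, 1));
        suggestions.add(new Suggestion("c", 0.75f, 10));
        System.out.println(order(suggestions));
    }
}
```

In unrestricted mode there is no user index to supply frequencies, so only the first criterion would apply.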

clearIndex

public void clearIndex()
                throws IOException
Removes all terms from the spell check index.

Throws:
IOException
AlreadyClosedException - if the Spellchecker is already closed

exist

public boolean exist(String word)
              throws IOException
Check whether the word exists in the index.

Parameters:
word - the word to check
Returns:
true if the word exists in the index
Throws:
IOException
AlreadyClosedException - if the Spellchecker is already closed

indexDictionary

public final void indexDictionary(Dictionary dict,
                                  IndexWriterConfig config,
                                  boolean fullMerge)
                           throws IOException
Indexes the data from the given Dictionary.

Parameters:
dict - Dictionary to index
config - IndexWriterConfig to use
fullMerge - whether or not the spellcheck index should be fully merged
Throws:
AlreadyClosedException - if the Spellchecker is already closed
IOException

close

public void close()
           throws IOException
Close the IndexSearcher used by this SpellChecker

Specified by:
close in interface Closeable
Throws:
IOException - if the close operation causes an IOException
AlreadyClosedException - if the SpellChecker is already closed