org.apache.lucene.facet.taxonomy.directory
Class DirectoryTaxonomyWriter

java.lang.Object
  extended by org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter
All Implemented Interfaces:
Closeable, TaxonomyWriter, TwoPhaseCommit

public class DirectoryTaxonomyWriter
extends Object
implements TaxonomyWriter

TaxonomyWriter which uses a Directory to store the taxonomy information on disk, and keeps an additional in-memory cache of some or all categories.

In addition to the permanently-stored information in the Directory, efficiency dictates that we also keep an in-memory cache of recently seen or all categories, so that we do not need to go back to disk for every category addition to see which ordinal this category already has, if any. A TaxonomyWriterCache object determines the specific caching algorithm used.

This class offers some hooks for extending classes to control the IndexWriter instance that is used. See openIndexWriter(org.apache.lucene.store.Directory, org.apache.lucene.index.IndexWriterConfig).

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
static class DirectoryTaxonomyWriter.DiskOrdinalMap
          DirectoryTaxonomyWriter.OrdinalMap maintained on file system
static class DirectoryTaxonomyWriter.MemoryOrdinalMap
          DirectoryTaxonomyWriter.OrdinalMap maintained in memory
static interface DirectoryTaxonomyWriter.OrdinalMap
          Mapping from old ordinals to new ordinals, used when merging indexes with separate taxonomies.
 
Field Summary
static String INDEX_CREATE_TIME
          Property name of user commit data that contains the creation time of a taxonomy index.
 
Constructor Summary
DirectoryTaxonomyWriter(Directory d)
           
DirectoryTaxonomyWriter(Directory directory, IndexWriterConfig.OpenMode openMode)
          Creates a new instance with a default cache as defined by defaultTaxonomyWriterCache().
DirectoryTaxonomyWriter(Directory directory, IndexWriterConfig.OpenMode openMode, TaxonomyWriterCache cache)
          Construct a Taxonomy writer.
 
Method Summary
 int addCategory(CategoryPath categoryPath)
          addCategory() adds a category with a given path name to the taxonomy, and returns its ordinal.
protected  int addCategoryDocument(CategoryPath categoryPath, int length, int parent)
           
 void addTaxonomies(Directory[] taxonomies, DirectoryTaxonomyWriter.OrdinalMap[] ordinalMaps)
          Take all the categories of one or more given taxonomies, and add them to the main taxonomy (this), if they are not already there.
 void close()
          Frees used resources and closes the underlying IndexWriter, which commits whatever changes were made to it to the underlying Directory.
protected  void closeResources()
          A hook for extending classes to close additional resources that were used.
 void commit()
          Calling commit() ensures that all the categories written so far are visible to a reader that is opened (or reopened) after that call.
 void commit(Map<String,String> commitUserData)
          Like commit(), but also store properties with the index.
protected  IndexWriterConfig createIndexWriterConfig(IndexWriterConfig.OpenMode openMode)
          Create the IndexWriterConfig that would be used for opening the internal index writer.
static TaxonomyWriterCache defaultTaxonomyWriterCache()
          Defines the default TaxonomyWriterCache to use in constructors which do not specify one.
protected  void ensureOpen()
          Verifies that this instance has not been closed; throws AlreadyClosedException if it has.
protected  int findCategory(CategoryPath categoryPath)
          Look up the given category in the cache and/or the on-disk storage, returning the category's ordinal, or a negative number in case the category does not yet exist in the taxonomy.
 int getCacheMemoryUsage()
          Returns the number of memory bytes used by the cache.
 int getParent(int ordinal)
          getParent() returns the ordinal of the parent category of the category with the given ordinal.
 int getSize()
          getSize() returns the number of categories in the taxonomy.
protected  IndexWriter openIndexWriter(Directory directory, IndexWriterConfig config)
          Open internal index writer, which contains the taxonomy data.
protected  IndexReader openReader()
          Open an IndexReader from the internal IndexWriter, by calling IndexReader.open(IndexWriter, boolean).
 void prepareCommit()
          Prepares most of the work needed for a two-phase commit.
 void prepareCommit(Map<String,String> commitUserData)
          Like prepareCommit(), and also prepares to store user data with the index.
 void rollback()
          Rolls back changes to the taxonomy writer and closes the instance.
 void setCacheMissesUntilFill(int i)
          Set the number of cache misses before an attempt is made to read the entire taxonomy into the in-memory cache.
 void setDelimiter(char delimiter)
          setDelimiter changes the character that the taxonomy uses in its internal storage as a delimiter between category components.
static void unlock(Directory directory)
          Forcibly unlocks the taxonomy in the named directory.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INDEX_CREATE_TIME

public static final String INDEX_CREATE_TIME
Property name of user commit data that contains the creation time of a taxonomy index.

Applications should not use this property in their commit data because it will be overridden by this taxonomy writer.

See Also:
Constant Field Values
Constructor Detail

DirectoryTaxonomyWriter

public DirectoryTaxonomyWriter(Directory directory,
                               IndexWriterConfig.OpenMode openMode,
                               TaxonomyWriterCache cache)
                        throws IOException
Construct a Taxonomy writer.

Parameters:
directory - The Directory in which to store the taxonomy. Note that the taxonomy is written directly to that directory (not to a subdirectory of it).
openMode - Specifies how to open a taxonomy for writing: APPEND means open an existing index for append (failing if the index does not yet exist). CREATE means create a new index (first deleting the old one if it already existed). APPEND_OR_CREATE appends to an existing index if there is one, otherwise it creates a new index.
cache - A TaxonomyWriterCache implementation which determines the in-memory caching policy. See for example LruTaxonomyWriterCache and Cl2oTaxonomyWriterCache. If null or missing, defaultTaxonomyWriterCache() is used.
Throws:
CorruptIndexException - if the taxonomy is corrupted.
LockObtainFailedException - if the taxonomy is locked by another writer. If it is known that no other concurrent writer is active, the lock might have been left around by an old dead process, and should be removed using unlock(Directory).
IOException - if another error occurred.

DirectoryTaxonomyWriter

public DirectoryTaxonomyWriter(Directory directory,
                               IndexWriterConfig.OpenMode openMode)
                        throws CorruptIndexException,
                               LockObtainFailedException,
                               IOException
Creates a new instance with a default cache as defined by defaultTaxonomyWriterCache().

Throws:
CorruptIndexException
LockObtainFailedException
IOException

DirectoryTaxonomyWriter

public DirectoryTaxonomyWriter(Directory d)
                        throws CorruptIndexException,
                               LockObtainFailedException,
                               IOException
Throws:
CorruptIndexException
LockObtainFailedException
IOException
Method Detail

setDelimiter

public void setDelimiter(char delimiter)
setDelimiter changes the character that the taxonomy uses in its internal storage as a delimiter between category components. Do not use this method unless you really know what you are doing. It has nothing to do with whatever character the application may be using to represent categories for its own use.

If you do use this method, make sure you call it before any other method that actually queries the taxonomy. Moreover, make sure you always pass the same delimiter for all DirectoryTaxonomyWriter and DirectoryTaxonomyReader objects you create for the same directory.


unlock

public static void unlock(Directory directory)
                   throws IOException
Forcibly unlocks the taxonomy in the named directory.

Caution: this should only be used by failure recovery code, when it is known that no other process or thread is currently accessing this taxonomy.

This method is unnecessary if your Directory uses a NativeFSLockFactory instead of the default SimpleFSLockFactory. When the "native" lock is used, a lock does not stay behind forever when the process using it dies.

Throws:
IOException

openIndexWriter

protected IndexWriter openIndexWriter(Directory directory,
                                      IndexWriterConfig config)
                               throws IOException
Open internal index writer, which contains the taxonomy data.

Extensions may provide their own IndexWriter implementation or instance.
NOTE: the instance this method returns will be closed when close() is called.
NOTE: the merge policy in effect must not merge non-adjacent segments. See the comment in createIndexWriterConfig(IndexWriterConfig.OpenMode) for the logic behind this.

Parameters:
directory - the Directory on top of which an IndexWriter should be opened.
config - configuration for the internal index writer.
Throws:
IOException
See Also:
createIndexWriterConfig(IndexWriterConfig.OpenMode)

createIndexWriterConfig

protected IndexWriterConfig createIndexWriterConfig(IndexWriterConfig.OpenMode openMode)
Create the IndexWriterConfig that would be used for opening the internal index writer.
Extensions can configure the IndexWriter as they see fit, including setting a merge-scheduler, or deletion-policy, different RAM size etc.

NOTE: internal docids of the configured index must not be altered. To that end, categories are never deleted from the taxonomy index. In addition, the merge policy in effect must not merge non-adjacent segments.

Parameters:
openMode - see IndexWriterConfig.OpenMode
See Also:
openIndexWriter(Directory, IndexWriterConfig)

openReader

protected IndexReader openReader()
                          throws IOException
Open an IndexReader from the internal IndexWriter, by calling IndexReader.open(IndexWriter, boolean). Extending classes can override this method to return their own IndexReader.

Throws:
IOException

defaultTaxonomyWriterCache

public static TaxonomyWriterCache defaultTaxonomyWriterCache()
Defines the default TaxonomyWriterCache to use in constructors which do not specify one.

The current default is Cl2oTaxonomyWriterCache constructed with the parameters (1024, 0.15f, 3), i.e., the entire taxonomy is cached in memory while building it.


close

public void close()
           throws CorruptIndexException,
                  IOException
Frees used resources and closes the underlying IndexWriter, which commits whatever changes were made to it to the underlying Directory.

Specified by:
close in interface Closeable
Throws:
CorruptIndexException
IOException

getCacheMemoryUsage

public int getCacheMemoryUsage()
Returns the number of memory bytes used by the cache.

Returns:
Number of cache bytes in memory, for CL2O only; zero otherwise.

closeResources

protected void closeResources()
                       throws IOException
A hook for extending classes to close additional resources that were used. The default implementation closes the IndexReader as well as the TaxonomyWriterCache instances that were used.
NOTE: if you override this method, you should include a super.closeResources() call in your implementation.

Throws:
IOException

findCategory

protected int findCategory(CategoryPath categoryPath)
                    throws IOException
Look up the given category in the cache and/or the on-disk storage, returning the category's ordinal, or a negative number in case the category does not yet exist in the taxonomy.

Throws:
IOException

addCategory

public int addCategory(CategoryPath categoryPath)
                throws IOException
Description copied from interface: TaxonomyWriter
addCategory() adds a category with a given path name to the taxonomy, and returns its ordinal. If the category was already present in the taxonomy, its existing ordinal is returned.

Before adding a category, addCategory() makes sure that all its ancestor categories exist in the taxonomy as well. As a result, the ordinal of a category is guaranteed to be smaller than the ordinal of any of its descendants.

Specified by:
addCategory in interface TaxonomyWriter
Throws:
IOException
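
The ordering guarantee described above can be illustrated with a small stand-alone model. This is only a sketch of the idea, not Lucene's implementation: the class OrdinalAssignmentSketch, its map-based storage, and the '/' separator are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy in-memory model of ordinal assignment (hypothetical, not Lucene code). */
class OrdinalAssignmentSketch {
    // Maps a joined path such as "/Places/Europe" to its ordinal.
    private final Map<String, Integer> ordinals = new LinkedHashMap<>();

    OrdinalAssignmentSketch() {
        ordinals.put("", 0); // the root always gets ordinal 0
    }

    /** Adds the path and any missing ancestors; returns the path's ordinal. */
    int addCategory(String... components) {
        StringBuilder path = new StringBuilder();
        int ordinal = 0; // an empty path refers to the root
        for (String component : components) {
            // '/' is an assumed separator; components containing it would collide.
            path.append('/').append(component);
            String key = path.toString();
            Integer existing = ordinals.get(key);
            if (existing == null) {
                // Ancestors are visited first, so they receive smaller ordinals.
                existing = ordinals.size();
                ordinals.put(key, existing);
            }
            ordinal = existing;
        }
        return ordinal;
    }

    int getSize() {
        return ordinals.size();
    }

    public static void main(String[] args) {
        OrdinalAssignmentSketch taxo = new OrdinalAssignmentSketch();
        int paris = taxo.addCategory("Places", "Europe", "Paris");
        int europe = taxo.addCategory("Places", "Europe");
        // Prints "3 2 4": one explicit insertion created three categories plus the root.
        System.out.println(paris + " " + europe + " " + taxo.getSize());
    }
}
```

In this model a single call for a three-component path yields a size of 4, which is why the size of a taxonomy is usually larger than the number of categories added explicitly.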

ensureOpen

protected final void ensureOpen()
Verifies that this instance has not been closed; throws AlreadyClosedException if it has.


addCategoryDocument

protected int addCategoryDocument(CategoryPath categoryPath,
                                  int length,
                                  int parent)
                           throws CorruptIndexException,
                                  IOException
Throws:
CorruptIndexException
IOException

commit

public void commit()
            throws CorruptIndexException,
                   IOException
Calling commit() ensures that all the categories written so far are visible to a reader that is opened (or reopened) after that call. When the writer is closed with close(), commit() is also done implicitly. See TwoPhaseCommit.commit().

Specified by:
commit in interface TwoPhaseCommit
Throws:
CorruptIndexException
IOException

commit

public void commit(Map<String,String> commitUserData)
            throws CorruptIndexException,
                   IOException
Like commit(), but also store properties with the index. These properties are retrievable by DirectoryTaxonomyReader.getCommitUserData(). See TwoPhaseCommit.commit(Map).

Specified by:
commit in interface TwoPhaseCommit
Throws:
CorruptIndexException
IOException
See Also:
TwoPhaseCommit.commit(), TwoPhaseCommit.prepareCommit(Map)

prepareCommit

public void prepareCommit()
                   throws CorruptIndexException,
                          IOException
Prepares most of the work needed for a two-phase commit. See IndexWriter.prepareCommit().

Specified by:
prepareCommit in interface TwoPhaseCommit
Throws:
CorruptIndexException
IOException

prepareCommit

public void prepareCommit(Map<String,String> commitUserData)
                   throws CorruptIndexException,
                          IOException
Like prepareCommit(), and also prepares to store user data with the index. See IndexWriter.prepareCommit(Map).

Specified by:
prepareCommit in interface TwoPhaseCommit
Throws:
CorruptIndexException
IOException
See Also:
TwoPhaseCommit.prepareCommit()
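
As a rough illustration of why prepareCommit(), commit() and rollback() are separate calls, the sketch below coordinates several participants so that nothing is committed unless every participant prepared successfully. It is a stand-alone model: the TwoPhase interface only mirrors the method names of TwoPhaseCommit, and Recorder and commitAll are hypothetical helpers, not part of Lucene. For simplicity the model uses unchecked exceptions, whereas the real methods declare IOException.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone model of a two-phase commit across participants (not Lucene code). */
class TwoPhaseSketch {
    /** Hypothetical stand-in that mirrors the method names of TwoPhaseCommit. */
    interface TwoPhase {
        void prepareCommit();
        void commit();
        void rollback();
    }

    /** Prepares every participant first; commits only if all prepares succeeded. */
    static boolean commitAll(TwoPhase... participants) {
        try {
            for (TwoPhase p : participants) {
                p.prepareCommit();
            }
        } catch (RuntimeException e) {
            // Phase one failed somewhere: undo everything, commit nothing.
            for (TwoPhase p : participants) {
                try {
                    p.rollback();
                } catch (RuntimeException ignored) {
                    // keep rolling back the remaining participants
                }
            }
            return false;
        }
        for (TwoPhase p : participants) {
            p.commit();
        }
        return true;
    }

    /** Test participant that records calls and can be told to fail in phase one. */
    static class Recorder implements TwoPhase {
        final String name;
        final boolean failOnPrepare;
        final List<String> log;

        Recorder(String name, boolean failOnPrepare, List<String> log) {
            this.name = name;
            this.failOnPrepare = failOnPrepare;
            this.log = log;
        }

        public void prepareCommit() {
            log.add(name + ":prepare");
            if (failOnPrepare) {
                throw new IllegalStateException(name + " cannot prepare");
            }
        }

        public void commit() {
            log.add(name + ":commit");
        }

        public void rollback() {
            log.add(name + ":rollback");
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = commitAll(new Recorder("taxonomy", false, log),
                               new Recorder("index", true, log));
        System.out.println(ok + " " + log);
    }
}
```

In an application that keeps a search index alongside the taxonomy, a pattern like this is one way to keep the two from diverging when one of them fails to prepare.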

getSize

public int getSize()
getSize() returns the number of categories in the taxonomy.

Because categories are numbered consecutively starting with 0, it means the taxonomy contains ordinals 0 through getSize()-1.

Note that the number returned by getSize() is often slightly higher than the number of categories inserted into the taxonomy; this is because when a category is added to the taxonomy, its ancestors are also added automatically (including the root, which always gets ordinal 0).

Specified by:
getSize in interface TaxonomyWriter

setCacheMissesUntilFill

public void setCacheMissesUntilFill(int i)
Set the number of cache misses before an attempt is made to read the entire taxonomy into the in-memory cache.

DirectoryTaxonomyWriter holds an in-memory cache of recently seen categories to speed up operation. On each cache miss, the on-disk index needs to be consulted. When an existing taxonomy is opened, many such slow disk reads are needed until the cache is filled, so it is more efficient to read the entire taxonomy into memory at once. We do this complete read after a certain number (defined by this method) of cache misses.

If the number is set to 0, the entire taxonomy is read into the cache on first use, without fetching individual categories first.

Note that if the memory cache of choice is limited in size, and cannot hold the entire content of the on-disk taxonomy, then it is never read in its entirety into the cache, regardless of the setting of this method.
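
The policy described above can be modelled with a counter and two maps. The following is a sketch under stated assumptions, not the class's actual code: CacheFillSketch, its field names, and the exact threshold comparison are hypothetical, and the model ignores the size-limited cache case mentioned in the last paragraph.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of "fill the whole cache after N misses" (hypothetical, not Lucene code). */
class CacheFillSketch {
    private final Map<String, Integer> disk;          // stands in for the on-disk taxonomy
    private final Map<String, Integer> cache = new HashMap<>();
    private final int cacheMissesUntilFill;
    private int cacheMisses = 0;
    private boolean cacheIsComplete = false;
    int diskLookups = 0;                              // individual slow lookups performed
    int fullReads = 0;                                // complete reads of the taxonomy

    CacheFillSketch(Map<String, Integer> disk, int cacheMissesUntilFill) {
        this.disk = disk;
        this.cacheMissesUntilFill = cacheMissesUntilFill;
    }

    /** Returns the ordinal of the path, or -1 if it is not in the taxonomy. */
    int findCategory(String path) {
        Integer cached = cache.get(path);
        if (cached != null) {
            return cached;
        }
        if (cacheIsComplete) {
            return -1; // a complete cache is authoritative: absent means unknown
        }
        cacheMisses++;
        // Assumed comparison: fill once the miss count reaches the threshold,
        // so a threshold of 0 fills on the very first miss.
        if (cacheMisses >= cacheMissesUntilFill) {
            cache.putAll(disk);
            cacheIsComplete = true;
            fullReads++;
            Integer found = cache.get(path);
            return found == null ? -1 : found;
        }
        // Below the threshold: fetch just this one category from "disk".
        diskLookups++;
        Integer onDisk = disk.get(path);
        if (onDisk == null) {
            return -1;
        }
        cache.put(path, onDisk);
        return onDisk;
    }

    public static void main(String[] args) {
        Map<String, Integer> disk = new HashMap<>();
        disk.put("/a", 1);
        disk.put("/a/b", 2);
        disk.put("/c", 3);
        CacheFillSketch taxo = new CacheFillSketch(disk, 2);
        taxo.findCategory("/a");   // first miss: individual lookup
        taxo.findCategory("/a/b"); // second miss: triggers the complete read
        taxo.findCategory("/c");   // served from the now-complete cache
        // Prints "1 1": one individual lookup, one complete read.
        System.out.println(taxo.diskLookups + " " + taxo.fullReads);
    }
}
```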


getParent

public int getParent(int ordinal)
              throws IOException
Description copied from interface: TaxonomyWriter
getParent() returns the ordinal of the parent category of the category with the given ordinal.

When a category is specified as a path name, finding the path of its parent is as trivial as dropping the last component of the path. getParent() is functionally equivalent to calling getPath() on the given ordinal, dropping the last component of the path, and then calling getOrdinal() to get an ordinal back.

If the given ordinal is the ROOT_ORDINAL, an INVALID_ORDINAL is returned. If the given ordinal is a top-level category, the ROOT_ORDINAL is returned. If an invalid ordinal is given (negative or beyond the last available ordinal), an ArrayIndexOutOfBoundsException is thrown. However, it is expected that getParent will only be called for ordinals which are already known to be in the taxonomy.

TODO (Facet): instead of a getParent(ordinal) method, consider having a getCategory(categorypath, prefixlen) which is similar to addCategory except it doesn't add new categories. This method can be used to get the ordinals of all prefixes of the given category, and it can use exactly the same code and cache used by addCategory(), so it means less code.

Specified by:
getParent in interface TaxonomyWriter
Throws:
IOException
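
The functional equivalence described for getParent() (look up the path, drop its last component, look up the resulting ordinal) can be sketched as follows. This is a stand-alone model, not how the class stores parents: ParentLookupSketch and its list-based storage are hypothetical, and the constant values 0 and -1 are assumed to match ROOT_ORDINAL and INVALID_ORDINAL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of getParent() via path manipulation (hypothetical, not Lucene code). */
class ParentLookupSketch {
    static final int ROOT_ORDINAL = 0;     // assumed value
    static final int INVALID_ORDINAL = -1; // assumed value

    // paths.get(ordinal) holds the components of that category; index 0 is the root.
    private final List<List<String>> paths = new ArrayList<>();

    ParentLookupSketch(List<List<String>> pathsInOrdinalOrder) {
        paths.addAll(pathsInOrdinalOrder);
    }

    /** Returns the ordinal of the given path, or -1 if it is unknown. */
    int getOrdinal(List<String> path) {
        return paths.indexOf(path);
    }

    /** Drops the last path component and looks up what remains. */
    int getParent(int ordinal) {
        // List.get throws an IndexOutOfBoundsException for an invalid ordinal.
        List<String> path = paths.get(ordinal);
        if (path.isEmpty()) {
            return INVALID_ORDINAL; // the root has no parent
        }
        return getOrdinal(path.subList(0, path.size() - 1));
    }

    public static void main(String[] args) {
        ParentLookupSketch taxo = new ParentLookupSketch(Arrays.asList(
                Collections.<String>emptyList(),
                Arrays.asList("Places"),
                Arrays.asList("Places", "Europe"),
                Arrays.asList("Places", "Europe", "Paris")));
        // Prints "2 0 -1": Paris -> Europe, Places -> root, root -> invalid.
        System.out.println(taxo.getParent(3) + " " + taxo.getParent(1) + " " + taxo.getParent(0));
    }
}
```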

addTaxonomies

public void addTaxonomies(Directory[] taxonomies,
                          DirectoryTaxonomyWriter.OrdinalMap[] ordinalMaps)
                   throws IOException
Take all the categories of one or more given taxonomies, and add them to the main taxonomy (this), if they are not already there.

Additionally, fill a mapping for each of the added taxonomies, mapping its ordinals to the ordinals in the enlarged main taxonomy. These mappings are saved into an array of OrdinalMap objects given by the user, one for each of the given taxonomies (not including "this", the main taxonomy). Often the first of these will be a MemoryOrdinalMap and the others will be DiskOrdinalMaps; see the discussion in DirectoryTaxonomyWriter.OrdinalMap.

Note that the taxonomies to be added are given as Directory objects, not opened TaxonomyReader/TaxonomyWriter objects, so if any of them are currently managed by an open TaxonomyWriter, make sure to commit() (or close()) it first. The main taxonomy (this) is an open TaxonomyWriter, and does not need to be commit()ed before this call.

Throws:
IOException
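
To make the role of the ordinal maps concrete, the sketch below merges one taxonomy into another and records, for each old ordinal, the ordinal the same category has in the enlarged main taxonomy. It is a simplified model with hypothetical names (OrdinalMapSketch, addTaxonomy); taxonomies are represented as plain collections of joined paths rather than Directory objects, and the map is a plain int array rather than an OrdinalMap.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of merging a taxonomy and building an ordinal map (not Lucene code). */
class OrdinalMapSketch {
    /**
     * Adds every path of {@code other} to {@code mainTaxonomy} if it is missing.
     * {@code other} lists paths in ordinal order, so index == old ordinal.
     * Returns an array where result[oldOrdinal] is the ordinal in the main taxonomy.
     */
    static int[] addTaxonomy(Map<String, Integer> mainTaxonomy, List<String> other) {
        int[] ordinalMap = new int[other.size()];
        for (int oldOrdinal = 0; oldOrdinal < other.size(); oldOrdinal++) {
            String path = other.get(oldOrdinal);
            Integer newOrdinal = mainTaxonomy.get(path);
            if (newOrdinal == null) {
                // Not there yet: append it, taking the next free ordinal.
                newOrdinal = mainTaxonomy.size();
                mainTaxonomy.put(path, newOrdinal);
            }
            ordinalMap[oldOrdinal] = newOrdinal;
        }
        return ordinalMap;
    }

    public static void main(String[] args) {
        Map<String, Integer> mainTaxonomy = new LinkedHashMap<>();
        mainTaxonomy.put("", 0);
        mainTaxonomy.put("/Author", 1);
        mainTaxonomy.put("/Author/Twain", 2);

        List<String> other = Arrays.asList("", "/Date", "/Author", "/Author/Austen");
        int[] ordinalMap = addTaxonomy(mainTaxonomy, other);
        // Prints "[0, 3, 1, 4]": shared categories keep the main taxonomy's ordinals,
        // new ones are appended after the existing entries.
        System.out.println(Arrays.toString(ordinalMap));
    }
}
```

An index that was built against the other taxonomy still refers to the old ordinals, so a map of this kind is what allows those references to be translated when the indexes are merged.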

rollback

public void rollback()
              throws IOException
Rolls back changes to the taxonomy writer and closes the instance. After this method is called the instance becomes unusable (calling any of its API methods will yield an AlreadyClosedException).

Specified by:
rollback in interface TwoPhaseCommit
Throws:
IOException