org.apache.lucene.facet.taxonomy
Interface TaxonomyReader

All Superinterfaces:
Closeable
All Known Implementing Classes:
DirectoryTaxonomyReader

public interface TaxonomyReader
extends Closeable

TaxonomyReader is the read-only interface through which the faceted-search library accesses the taxonomy at search time.

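A TaxonomyReader holds a list of categories. Each category has a serial number, which we call an "ordinal", and a hierarchical "path" name. Ordinals are assigned consecutively starting at 0 as categories are added, and the path gives the category's position in the hierarchy.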

Notes about concurrent access to the taxonomy:

An implementation must allow multiple readers to be active concurrently with a single writer. Readers follow so-called "point in time" semantics, i.e., a TaxonomyReader object will only see taxonomy entries which were available at the time it was created. What the writer writes is only available to (new) readers after the writer's commit() is called.

In faceted search, two separate indices are used: the main Lucene index, and the taxonomy. Because the main index refers to the categories listed in the taxonomy, it is important to open the taxonomy *after* opening the main index, and it is also necessary to refresh() the taxonomy after reopen()ing the main index.

This order is important; otherwise the main index could refer to a category which is not yet visible in the old snapshot of the taxonomy. Note that it is fine for the taxonomy to be opened after the main index - even a long time after. The reason is that once a category is added to the taxonomy, it can never be changed or deleted, so there is no danger of a "too new" taxonomy being inconsistent with an older index.
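The point-in-time behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the classes PointInTimeSketch, Writer and Snapshot are hypothetical stand-ins written for this example, and they are not the Lucene taxonomy classes. The sketch shows that a snapshot sees only what was committed before it was created, and that ordinals already handed out never change, which is why a newer taxonomy snapshot stays consistent with an older index.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of point-in-time taxonomy snapshots; not the Lucene implementation. */
final class PointInTimeSketch {

    /** Hypothetical stand-in for a taxonomy writer. */
    static final class Writer {
        private final List<String> categories = new ArrayList<>();
        private int committedSize;

        Writer() {
            categories.add(""); // the root category, ordinal 0
            commit();
        }

        /** Appends the path if it is new and returns its ordinal. */
        synchronized int addCategory(String path) {
            int existing = categories.indexOf(path);
            if (existing >= 0) {
                return existing;
            }
            categories.add(path);
            return categories.size() - 1;
        }

        /** Makes everything added so far visible to snapshots taken later. */
        synchronized void commit() {
            committedSize = categories.size();
        }

        synchronized List<String> committedCopy() {
            return new ArrayList<>(categories.subList(0, committedSize));
        }
    }

    /** Hypothetical stand-in for a reader: frozen at construction time. */
    static final class Snapshot {
        private final List<String> visible;

        Snapshot(Writer writer) {
            visible = writer.committedCopy();
        }

        int getSize() {
            return visible.size();
        }

        /** Returns the ordinal, or -1 if the category is not visible here. */
        int getOrdinal(String path) {
            return visible.indexOf(path);
        }
    }

    public static void main(String[] args) {
        Writer writer = new Writer();
        writer.addCategory("Author");
        writer.commit();
        Snapshot older = new Snapshot(writer);

        writer.addCategory("Author/Mark Twain");
        writer.commit();
        Snapshot newer = new Snapshot(writer);

        System.out.println("older size: " + older.getSize());
        System.out.println("newer size: " + newer.getSize());
    }
}
```

In this model, an index opened between the two commits can only refer to ordinals that the newer snapshot also contains, so opening the taxonomy late is harmless, while opening it early can miss categories.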

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
static interface TaxonomyReader.ChildrenArrays
          Equivalent representations of the taxonomy's parent info, used internally for efficient computation of facet results: "youngest child" and "oldest sibling"
 
Field Summary
static int INVALID_ORDINAL
          Ordinals are always non-negative, so a negative ordinal can be used to signify an error.
static int ROOT_ORDINAL
          The root category (the category with the empty path) always has the ordinal 0, to which we give the name ROOT_ORDINAL.
 
Method Summary
 void decRef()
          Expert: decreases the refCount of this TaxonomyReader instance.
 TaxonomyReader.ChildrenArrays getChildrenArrays()
          getChildrenArrays() returns a TaxonomyReader.ChildrenArrays object whose arrays can be used together to efficiently enumerate the children of any category.
 Map<String,String> getCommitUserData()
          Retrieve user committed data.
 int getOrdinal(CategoryPath categoryPath)
          getOrdinal() returns the ordinal of the category given as a path.
 int getParent(int ordinal)
          getParent() returns the ordinal of the parent category of the category with the given ordinal.
 int[] getParentArray()
          getParentArray() returns an int array of size getSize() listing the ordinal of the parent category of each category in the taxonomy.
 CategoryPath getPath(int ordinal)
          getPath() returns the path name of the category with the given ordinal.
 boolean getPath(int ordinal, CategoryPath result)
          getPath() returns the path name of the category with the given ordinal.
 int getRefCount()
          Expert: returns the current refCount for this taxonomy reader.
 int getSize()
          getSize() returns the number of categories in the taxonomy.
 void incRef()
          Expert: increments the refCount of this TaxonomyReader instance.
 boolean refresh()
          refresh() re-reads the taxonomy information if there were any changes to the taxonomy since this instance was opened or last refreshed.
 
Methods inherited from interface java.io.Closeable
close
 

Field Detail

ROOT_ORDINAL

static final int ROOT_ORDINAL
The root category (the category with the empty path) always has the ordinal 0, to which we give the name ROOT_ORDINAL. getOrdinal() of an empty path will always return ROOT_ORDINAL, and getPath(ROOT_ORDINAL) will return the empty path.

See Also:
Constant Field Values

INVALID_ORDINAL

static final int INVALID_ORDINAL
Ordinals are always non-negative, so a negative ordinal can be used to signify an error. Methods here return INVALID_ORDINAL (-1) in this case.

See Also:
Constant Field Values
Method Detail

getOrdinal

int getOrdinal(CategoryPath categoryPath)
               throws IOException
getOrdinal() returns the ordinal of the category given as a path. The ordinal is the category's serial number, an integer which starts with 0 and grows as more categories are added (note that once a category is added, it can never be deleted).

If the given category wasn't found in the taxonomy, INVALID_ORDINAL is returned.

Throws:
IOException

getPath

CategoryPath getPath(int ordinal)
                     throws IOException
getPath() returns the path name of the category with the given ordinal. The path is returned as a new CategoryPath object - to reuse an existing object, use getPath(int, CategoryPath).

A null is returned if a category with the given ordinal does not exist.

Throws:
IOException

getPath

boolean getPath(int ordinal,
                CategoryPath result)
                throws IOException
getPath() returns the path name of the category with the given ordinal. The path is written to the given CategoryPath object (which is cleared first).

If a category with the given ordinal does not exist, the given CategoryPath object is not modified, and the method returns false. Otherwise, the method returns true.

Throws:
IOException

refresh

boolean refresh()
                throws IOException,
                       InconsistentTaxonomyException
refresh() re-reads the taxonomy information if there were any changes to the taxonomy since this instance was opened or last refreshed. Calling refresh() is more efficient than close()ing the old instance and opening a new one.

If there were no changes since this instance was opened or last refreshed, then this call does nothing. Note, however, that this is still a relatively slow method (as it needs to verify whether there have been any changes on disk to the taxonomy), so it should not be called needlessly. In faceted search, the taxonomy reader's refresh() should be called only after a reopen() of the main index.

Refreshing the taxonomy might fail in some cases, for example if the taxonomy was recreated since this instance was opened or last refreshed. In this case an InconsistentTaxonomyException is thrown, indicating that a new TaxonomyReader should be opened in order to obtain up-to-date taxonomy data. Note: This TaxonomyReader instance remains unchanged and usable in this case. The application can continue to use it, and should still call Closeable.close() on it when it is no longer needed.

It should be noted that refresh() is similar in purpose to IndexReader.reopen(), but the two methods behave differently. refresh() refreshes the existing TaxonomyReader object, rather than opening a new one in addition to the old one as reopen() does. The reason is that in a taxonomy, one can only add new categories and cannot modify or delete existing categories; therefore, there is no reason to keep an old snapshot of the taxonomy open - refreshing the taxonomy to the newest data and using this new snapshot in all threads (whether new or old) is fine. This saves us from needing to keep multiple copies of the taxonomy open in memory.
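The refresh contract described above (refresh in place, return whether anything changed, and leave the reader untouched if the taxonomy was recreated) can be sketched with a toy model. Everything in this block is an assumption made for illustration: RefreshSketch, Store, Reader, InconsistentException and refreshOrReopen are hypothetical names, and the "generation" counter is just one simple way to detect a recreated taxonomy, not necessarily how any real implementation does it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of in-place refresh; not the Lucene implementation. */
final class RefreshSketch {

    /** Hypothetical stand-in for the committed, on-disk taxonomy. */
    static final class Store {
        private int generation = 0;
        private List<String> committed = new ArrayList<>();

        Store() {
            committed.add(""); // root category
        }

        synchronized void addAndCommit(String path) {
            committed.add(path);
        }

        /** Simulates the taxonomy being deleted and rebuilt from scratch. */
        synchronized void recreate() {
            generation++;
            committed = new ArrayList<>();
            committed.add("");
        }

        synchronized int generation() {
            return generation;
        }

        synchronized List<String> copy() {
            return new ArrayList<>(committed);
        }
    }

    /** Hypothetical analogue of InconsistentTaxonomyException. */
    static final class InconsistentException extends Exception {
        InconsistentException(String message) {
            super(message);
        }
    }

    static final class Reader {
        private final Store store;
        private final int generation;
        private volatile List<String> visible;

        Reader(Store store) {
            this.store = store;
            this.generation = store.generation();
            this.visible = store.copy();
        }

        /** Returns true if new categories became visible, false if nothing changed. */
        boolean refresh() throws InconsistentException {
            if (store.generation() != generation) {
                // State is left untouched, so this reader stays usable.
                throw new InconsistentException("taxonomy was recreated");
            }
            List<String> latest = store.copy();
            if (latest.size() == visible.size()) {
                return false;
            }
            visible = latest;
            return true;
        }

        int getSize() {
            return visible.size();
        }
    }

    /** Refreshes in place, or falls back to opening a brand-new reader. */
    static Reader refreshOrReopen(Reader current, Store store) {
        try {
            current.refresh();
            return current;
        } catch (InconsistentException e) {
            return new Reader(store); // the caller still has to close the old reader
        }
    }

    public static void main(String[] args) throws InconsistentException {
        Store store = new Store();
        Reader reader = new Reader(store);
        store.addAndCommit("Author");
        System.out.println("changed: " + reader.refresh());
        System.out.println("size: " + reader.getSize());
    }
}
```

The fallback helper reflects the advice above: on an inconsistency the old reader remains valid, so the application can switch to the new one at its own pace and close the old one afterwards.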

Returns:
true if anything has changed, false otherwise.
Throws:
IOException
InconsistentTaxonomyException

getParent

int getParent(int ordinal)
              throws IOException
getParent() returns the ordinal of the parent category of the category with the given ordinal.

When a category is specified as a path name, finding the path of its parent is as trivial as dropping the last component of the path. getParent() is functionally equivalent to calling getPath() on the given ordinal, dropping the last component of the path, and then calling getOrdinal() to get an ordinal back. However, implementations are expected to be much more efficient than that.

getParent() should be a very quick method, as it is used during the facet aggregation process in faceted search. Implementations will most likely want to serve replies to this method from a pre-filled cache.

If the given ordinal is the ROOT_ORDINAL, an INVALID_ORDINAL is returned. If the given ordinal is a top-level category, the ROOT_ORDINAL is returned. If an invalid ordinal is given (negative or beyond the last available ordinal), an ArrayIndexOutOfBoundsException is thrown. However, it is expected that getParent will only be called for ordinals which are already known to be in the taxonomy.
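The equivalence between the path-based definition and a cached lookup can be sketched as follows. This is a toy illustration, not the Lucene implementation: ParentLookupSketch and its methods are hypothetical, paths are modelled as plain strings with '/' as an assumed component separator, and the list position of a path plays the role of its ordinal.

```java
import java.util.Arrays;
import java.util.List;

/** Toy comparison of path-based and cached parent lookup; not the Lucene implementation. */
final class ParentLookupSketch {
    static final int ROOT_ORDINAL = 0;
    static final int INVALID_ORDINAL = -1;

    /** Slow definition: drop the last path component, then look the result up. */
    static int parentViaPath(List<String> paths, int ordinal) {
        String path = paths.get(ordinal);
        if (path.isEmpty()) {
            return INVALID_ORDINAL; // the root has no parent
        }
        int slash = path.lastIndexOf('/');
        String parentPath = slash < 0 ? "" : path.substring(0, slash);
        return paths.indexOf(parentPath);
    }

    /** Fills a cache once so that later lookups are a single array read. */
    static int[] buildParentCache(List<String> paths) {
        int[] parents = new int[paths.size()];
        for (int ordinal = 0; ordinal < parents.length; ordinal++) {
            parents[ordinal] = parentViaPath(paths, ordinal);
        }
        return parents;
    }

    /** Fast lookup; an out-of-range ordinal throws ArrayIndexOutOfBoundsException. */
    static int parentViaCache(int[] parents, int ordinal) {
        return parents[ordinal];
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("", "Author", "Author/Mark Twain", "Date");
        int[] parents = buildParentCache(paths);
        System.out.println(Arrays.toString(parents));
    }
}
```

With this layout the root maps to INVALID_ORDINAL, top-level categories map to ROOT_ORDINAL, and an ordinal outside the array fails with the exception mentioned above, without any explicit range check in the fast path.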

Throws:
IOException

getParentArray

int[] getParentArray()
                     throws IOException
getParentArray() returns an int array of size getSize() listing the ordinal of the parent category of each category in the taxonomy.

The caller can hold on to the array it got indefinitely - it is guaranteed that no-one else will modify it. The other side of the same coin is that the caller must treat the array it got as read-only and not modify it, because other callers might have gotten the same array too (and getParent() calls might be answered from the same array).

If you use getParentArray() instead of getParent(), remember that the array you got is (naturally) not modified after a refresh(), so you should always call getParentArray() again after a refresh().

This method's function is similar to allocating an array of size getSize() and filling it with getParent() calls, but implementations are encouraged to implement it much more efficiently, with O(1) complexity. This can be done, for example, by the implementation already keeping the parents in an array, and just returning this array (without any allocation or copying) when requested.
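As an illustration of how a caller might use such an array, the sketch below walks from a category up to the root by following parent links. It is a minimal example under stated assumptions: ParentArrayWalk and ancestors are hypothetical names, the array is passed in directly instead of being obtained from a reader, and the root's parent is assumed to be stored as a negative value (INVALID_ORDINAL).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy traversal over a parent array; not part of the Lucene API. */
final class ParentArrayWalk {

    /**
     * Returns the ancestors of the given ordinal, nearest first, ending with the root.
     * The array is only read, never modified, as the contract above requires.
     */
    static List<Integer> ancestors(int[] parents, int ordinal) {
        List<Integer> result = new ArrayList<>();
        for (int p = parents[ordinal]; p >= 0; p = parents[p]) {
            result.add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordinals: 0 = root, 1 = Author, 2 = Author/Mark Twain, 3 = Date
        int[] parents = {-1, 0, 1, 0};
        System.out.println(ancestors(parents, 2));
    }
}
```

A caller following this pattern would fetch the array once per snapshot and fetch it again after every refresh(), since an array obtained earlier does not cover categories added later.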

Throws:
IOException

getChildrenArrays

TaxonomyReader.ChildrenArrays getChildrenArrays()
getChildrenArrays() returns a TaxonomyReader.ChildrenArrays object whose arrays can be used together to efficiently enumerate the children of any category.

The caller can hold on to the object it got indefinitely - it is guaranteed that no-one else will modify it. The other side of the same coin is that the caller must treat the object which it got (and the arrays it contains) as read-only and not modify it, because other callers might have gotten the same object too.

Implementations should have O(getSize()) time for the first call or after a refresh(), but O(1) time for further calls. In neither case should there be a need to read new data from disk. These guarantees are most likely achieved by calculating this object (based on the getParentArray()) when first needed, and later (if the taxonomy was not refreshed) returning the same object (without any allocation or copying) when requested.

The reason we have one method returning one object, rather than two methods returning two arrays, is to avoid race conditions in a multi-threaded application: we want to avoid the possibility of returning one new array and one old array, as those could not be used together.
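One way to derive two such arrays from the parent array in a single O(getSize()) pass is sketched below. This is an illustration only, under these assumptions: ChildrenArraysSketch is a hypothetical class, "youngest child" is taken to mean the child with the highest ordinal, each category is linked to its next older sibling (the sibling with the next lower ordinal), and -1 marks "none". Both arrays live in one immutable object, which mirrors the reason given above for returning them together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy child enumeration built from a parent array; not the Lucene implementation. */
final class ChildrenArraysSketch {
    private final int[] youngestChild; // per category: its highest-ordinal child, or -1
    private final int[] olderSibling;  // per category: its next older sibling, or -1

    ChildrenArraysSketch(int[] parents) {
        int size = parents.length;
        youngestChild = new int[size];
        olderSibling = new int[size];
        Arrays.fill(youngestChild, -1);
        Arrays.fill(olderSibling, -1);
        // Ordinal 0 is the root and has no parent, so start at 1.
        for (int ordinal = 1; ordinal < size; ordinal++) {
            int parent = parents[ordinal];
            olderSibling[ordinal] = youngestChild[parent];
            youngestChild[parent] = ordinal;
        }
    }

    /** Lists the children of a category, youngest (highest ordinal) first. */
    List<Integer> children(int ordinal) {
        List<Integer> result = new ArrayList<>();
        for (int c = youngestChild[ordinal]; c >= 0; c = olderSibling[c]) {
            result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordinals: 0 = root, 1 = A, 2 = A/x, 3 = B, 4 = A/y
        int[] parents = {-1, 0, 1, 0, 1};
        ChildrenArraysSketch arrays = new ChildrenArraysSketch(parents);
        System.out.println(arrays.children(0));
        System.out.println(arrays.children(1));
    }
}
```

Enumerating the children of one category then costs time proportional to the number of children, with no scan over the whole taxonomy.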


getCommitUserData

Map<String,String> getCommitUserData()
                                     throws IOException
Retrieve user committed data.

Throws:
IOException
See Also:
TwoPhaseCommit.commit(Map)

incRef

void incRef()
Expert: increments the refCount of this TaxonomyReader instance. RefCounts can be used to determine when a taxonomy reader can be closed safely, i.e. as soon as there are no more references. Be sure to always call a corresponding decRef(), in a finally clause; otherwise the reader may never be closed.
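The incRef()/decRef() pairing in a finally clause can be sketched with a toy reference-counted object. This block is an assumption-laden illustration, not the Lucene implementation: RefCountSketch and its nested Resource class are hypothetical, and the choice to start the count at 1 (one reference held by the creator) is an assumption of this sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy reference counting; not the Lucene implementation. */
final class RefCountSketch {

    /** Hypothetical stand-in for a reference-counted reader. */
    static final class Resource {
        private final AtomicInteger refCount = new AtomicInteger(1); // creator's reference
        private volatile boolean closed = false;

        void incRef() {
            refCount.incrementAndGet();
        }

        void decRef() {
            if (refCount.decrementAndGet() == 0) {
                closed = true; // a real reader would release its files here
            }
        }

        int getRefCount() {
            return refCount.get();
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Uses the resource while holding a reference, and always gives it back. */
    static int useSafely(Resource resource) {
        resource.incRef();
        try {
            return resource.getRefCount(); // stands in for real work with the reader
        } finally {
            resource.decRef();
        }
    }

    public static void main(String[] args) {
        Resource resource = new Resource();
        System.out.println("during use: " + useSafely(resource));
        System.out.println("after use: " + resource.getRefCount());
        resource.decRef(); // the creator releases its own reference
        System.out.println("closed: " + resource.isClosed());
    }
}
```

Because the decRef() sits in a finally clause, the count returns to its previous value even if the work inside the try block throws.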


decRef

void decRef()
            throws IOException
Expert: decreases the refCount of this TaxonomyReader instance. If the refCount drops to 0, then pending changes (if any) can be committed to the taxonomy index and this reader can be closed.

Throws:
IOException

getRefCount

int getRefCount()
Expert: returns the current refCount for this taxonomy reader.


getSize

int getSize()
getSize() returns the number of categories in the taxonomy.

Because categories are numbered consecutively starting with 0, it means the taxonomy contains ordinals 0 through getSize()-1.

Note that the number returned by getSize() is often slightly higher than the number of categories inserted into the taxonomy. This is because when a category is added to the taxonomy, its ancestors are also added automatically (including the root, which always gets ordinal 0).
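The effect of automatic ancestor insertion on getSize() can be seen in a small toy model. As before, this is a sketch and not the Lucene implementation: AncestorInsertionSketch is a hypothetical class, paths are plain strings with '/' as an assumed separator, and the ordinal of a path is simply its position in a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy taxonomy that adds missing ancestors on insert; not the Lucene implementation. */
final class AncestorInsertionSketch {
    static final int INVALID_ORDINAL = -1;

    private final List<String> paths = new ArrayList<>();

    AncestorInsertionSketch() {
        paths.add(""); // the root always gets ordinal 0
    }

    /** Adds the category and any missing ancestors, and returns the category's ordinal. */
    int addCategory(String path) {
        int existing = paths.indexOf(path);
        if (existing >= 0) {
            return existing;
        }
        int slash = path.lastIndexOf('/');
        if (slash >= 0) {
            addCategory(path.substring(0, slash)); // parents receive lower ordinals
        }
        paths.add(path);
        return paths.size() - 1;
    }

    int getSize() {
        return paths.size();
    }

    /** Returns the ordinal of the path, or INVALID_ORDINAL if it is unknown. */
    int getOrdinal(String path) {
        return paths.indexOf(path);
    }

    /** Returns the path for the ordinal, or null if no such category exists. */
    String getPath(int ordinal) {
        return ordinal >= 0 && ordinal < paths.size() ? paths.get(ordinal) : null;
    }

    public static void main(String[] args) {
        AncestorInsertionSketch taxonomy = new AncestorInsertionSketch();
        int ordinal = taxonomy.addCategory("Author/Mark Twain");
        System.out.println("ordinal: " + ordinal + ", size: " + taxonomy.getSize());
    }
}
```

In this model a single explicit insertion of a two-level category yields three categories in total: the root, the parent, and the category itself, occupying ordinals 0 through getSize()-1.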