Package org.apache.nutch.collection

Subcollection is a subset of an index.

Class Summary
CollectionManager  
Subcollection  Represents a subset of an index. You define URL patterns that indicate whether a particular page (URL) is part of the subcollection.
 

Package org.apache.nutch.collection Description

Subcollection is a subset of an index. Subcollections are defined by URL patterns in the form of a whitelist and a blacklist. For a page to be included in a subcollection, its URL must match the whitelist and must not match the blacklist.
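
The whitelist/blacklist rule can be sketched roughly as follows. This is a minimal illustration, not the Nutch implementation: the class and method names are hypothetical, the sample blacklist entry is invented, and it assumes a pattern "matches" when it occurs as a substring of the URL, which may differ from how Subcollection actually compares patterns.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: illustrates the membership rule only.
class SubcollectionMatchSketch {

    // Assumption: a pattern matches when it occurs as a substring of the URL.
    // A URL is a member if no blacklist pattern matches and at least one
    // whitelist pattern matches.
    static boolean isMember(String url, List<String> whitelist, List<String> blacklist) {
        for (String pattern : blacklist) {
            if (url.contains(pattern)) {
                return false; // a blacklist match always excludes the page
            }
        }
        for (String pattern : whitelist) {
            if (url.contains(pattern)) {
                return true; // whitelisted and not blacklisted
            }
        }
        return false; // not whitelisted at all
    }

    public static void main(String[] args) {
        List<String> whitelist = Arrays.asList(
                "http://lucene.apache.org/nutch",
                "http://wiki.apache.org/nutch/");
        // Invented blacklist entry, used only to show exclusion.
        List<String> blacklist = Arrays.asList("/nutch/private/");

        // prints true: whitelisted, not blacklisted
        System.out.println(isMember("http://lucene.apache.org/nutch/about.html", whitelist, blacklist));
        // prints false: matches the blacklist
        System.out.println(isMember("http://wiki.apache.org/nutch/private/notes", whitelist, blacklist));
        // prints false: matches neither list
        System.out.println(isMember("http://httpd.apache.org/", whitelist, blacklist));
    }
}
```

Checking the blacklist first means an excluded URL is rejected even when it also matches a whitelist pattern, which is the behavior the description above requires.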

Subcollection definitions are read from the file subcollections.xml. The example below assumes that you are crawling all the virtual hosts under apache.org and want to tag pages whose URLs match "http://lucene.apache.org/nutch" or "http://wiki.apache.org/nutch/" as part of the subcollection "nutch", which allows you to later search specifically within this subcollection. The format is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<subcollections>
        <subcollection>
                <name>nutch</name>
                <id>lucene</id>
                <whitelist>http://lucene.apache.org/nutch</whitelist>
                <whitelist>http://wiki.apache.org/nutch/</whitelist>
                <blacklist />
        </subcollection>
</subcollections>

Regardless of this configuration, you can still crawl any URLs as long as they pass your global URL filters. Note that you must also seed your URLs in the normal Nutch way.
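
To show how the subcollections.xml format above maps to data, here is a small sketch that reads it with the JDK's standard DOM parser. It is not how CollectionManager loads the file; the class name, the helper methods, and the embedded SAMPLE string are assumptions made for illustration, and only the element names come from the example above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader, not part of Nutch: parses the documented XML layout.
class SubcollectionsXmlSketch {

    // Same content as the example definition, inlined so the sketch is self-contained.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<subcollections><subcollection>"
            + "<name>nutch</name><id>lucene</id>"
            + "<whitelist>http://lucene.apache.org/nutch</whitelist>"
            + "<whitelist>http://wiki.apache.org/nutch/</whitelist>"
            + "<blacklist />"
            + "</subcollection></subcollections>";

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse subcollections XML", e);
        }
    }

    // Collects the non-empty text of every descendant element with the given tag.
    static List<String> texts(Element parent, String tag) {
        List<String> out = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            String text = nodes.item(i).getTextContent().trim();
            if (!text.isEmpty()) {
                out.add(text); // an empty <blacklist /> contributes nothing
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        NodeList subcollections = doc.getElementsByTagName("subcollection");
        for (int i = 0; i < subcollections.getLength(); i++) {
            Element sc = (Element) subcollections.item(i);
            System.out.println("name:      " + texts(sc, "name"));
            System.out.println("id:        " + texts(sc, "id"));
            System.out.println("whitelist: " + texts(sc, "whitelist"));
            System.out.println("blacklist: " + texts(sc, "blacklist"));
        }
    }
}
```

Each whitelist element here holds one pattern, and the empty blacklist element yields an empty list, so every whitelisted page in the example belongs to the subcollection.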



Copyright © 2012 The Apache Software Foundation