org.apache.nutch.indexer.solr
Class SolrDeleteDuplicates
java.lang.Object
org.apache.hadoop.mapreduce.Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
org.apache.nutch.indexer.solr.SolrDeleteDuplicates
- All Implemented Interfaces:
- Configurable, Tool
public class SolrDeleteDuplicates
- extends Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
- implements Tool
Utility class for deleting duplicate documents from a Solr index.
The algorithm works as follows:
Preparation:
- Query the Solr server for the total number of documents (say, N)
- Partition N among M map tasks. For example, with two map tasks
the first map task will deal with Solr documents 0 to (N / 2 - 1) and
the second will deal with documents (N / 2) to (N - 1).
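The even split described above can be sketched with a small helper (hypothetical, not part of the class; it simply computes the start index of each task's slice):

```java
public class PartitionSketch {
    // Start index (inclusive) of the slice handled by map task `task`,
    // splitting n documents as evenly as possible among m tasks.
    // The end of a slice is sliceStart(n, m, task + 1) - 1.
    static int sliceStart(int n, int m, int task) {
        return (int) ((long) n * task / m);
    }

    public static void main(String[] args) {
        int n = 10, m = 2;
        for (int t = 0; t < m; t++) {
            System.out.println("task " + t + ": docs "
                + sliceStart(n, m, t) + " - " + (sliceStart(n, m, t + 1) - 1));
        }
        // With N = 10 and two tasks this matches the example above:
        // task 0 covers 0 - 4, task 1 covers 5 - 9.
    }
}
```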
MapReduce:
- Map: Identity map where keys are digests and values are
SolrDeleteDuplicates.SolrRecord
instances (which contain the id, boost, and timestamp)
- Reduce: After the map phase,
SolrDeleteDuplicates.SolrRecord
instances with the same digest are
grouped together. Of these documents with the same digest, all are
deleted except the one with the highest score (boost field). If two
(or more) documents have the same score, the document with the latest
timestamp is kept; every other one is deleted from the Solr index.
Note that we assume that two documents in
a Solr index will never have the same URL, so this class only deals with
documents that have different URLs but the same digest.
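The survivor-selection rule in the reduce step can be sketched in plain Java, independent of the Hadoop and Solr APIs. The `Rec` class below is a hypothetical stand-in for `SolrDeleteDuplicates.SolrRecord`, holding only the fields the rule needs:

```java
import java.util.Arrays;
import java.util.List;

public class DedupSketch {
    // Hypothetical stand-in for SolrDeleteDuplicates.SolrRecord:
    // the document id, its boost (score), and its timestamp.
    static final class Rec {
        final String id;
        final float boost;
        final long tstamp;
        Rec(String id, float boost, long tstamp) {
            this.id = id; this.boost = boost; this.tstamp = tstamp;
        }
    }

    // Given all records sharing one digest, return the id of the single
    // document to keep: highest boost wins; ties go to the latest timestamp.
    static String survivor(List<Rec> group) {
        Rec best = null;
        for (Rec r : group) {
            if (best == null
                || r.boost > best.boost
                || (r.boost == best.boost && r.tstamp > best.tstamp)) {
                best = r;
            }
        }
        return best.id;
    }

    public static void main(String[] args) {
        List<Rec> group = Arrays.asList(
            new Rec("http://a.example/", 1.0f, 100L),
            new Rec("http://b.example/", 2.0f, 100L),   // highest boost
            new Rec("http://c.example/", 2.0f,  50L));  // same boost, older
        System.out.println("keep: " + survivor(group));
        // Every other id in the group would be deleted from the Solr index.
    }
}
```

In the real reducer the same comparison is made over the grouped `SolrRecord` values, and the losing ids are issued as deletes against the Solr server.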
Field Summary
static org.slf4j.Logger LOG
Methods inherited from class org.apache.hadoop.mapreduce.Reducer
run
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
SolrDeleteDuplicates
public SolrDeleteDuplicates()
getConf
public Configuration getConf()
- Specified by:
getConf
in interface Configurable
setConf
public void setConf(Configuration conf)
- Specified by:
setConf
in interface Configurable
setup
public void setup(Reducer.Context job)
throws IOException
- Overrides:
setup
in class Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
cleanup
public void cleanup(Reducer.Context context)
throws IOException
- Overrides:
cleanup
in class Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
reduce
public void reduce(Text key,
Iterable<SolrDeleteDuplicates.SolrRecord> values,
Reducer.Context context)
throws IOException
- Overrides:
reduce
in class Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
- Throws:
IOException
dedup
public boolean dedup(String solrUrl)
throws IOException,
InterruptedException,
ClassNotFoundException
- Throws:
IOException
InterruptedException
ClassNotFoundException
run
public int run(String[] args)
throws IOException,
InterruptedException,
ClassNotFoundException
- Specified by:
run
in interface Tool
- Throws:
IOException
InterruptedException
ClassNotFoundException
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
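In Nutch 1.x distributions this tool is typically invoked through the `bin/nutch` script via the `solrdedup` command, which forwards to `main` above (the command name and Solr URL are illustrative; verify both against your Nutch version and deployment):

```shell
# Delete duplicate documents from the Solr index at the given URL
bin/nutch solrdedup http://localhost:8983/solr
```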
Copyright © 2012 The Apache Software Foundation