org.apache.nutch.indexer.solr
Class SolrDeleteDuplicates

java.lang.Object
  extended by org.apache.hadoop.mapreduce.Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
      extended by org.apache.nutch.indexer.solr.SolrDeleteDuplicates
All Implemented Interfaces:
Configurable, Tool

public class SolrDeleteDuplicates
extends Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
implements Tool

Utility class for deleting duplicate documents from a Solr index. The algorithm works as follows:

Preparation:

  1. Query the Solr server for the number of documents (say, N).
  2. Partition N among M map tasks. For example, with two map tasks, the first map task handles Solr documents 0 to (N / 2 - 1) and the second handles documents (N / 2) to (N - 1).
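
The partitioning in step 2 can be sketched in plain Java. This is a minimal illustration, not the class's actual split code: the names SplitSketch, Range, and partition are hypothetical, and giving the remainder of an uneven division to the last range is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    /** Hypothetical value type: a contiguous block of document positions. */
    public static final class Range {
        public final long start;   // position of the first document in the range
        public final long length;  // number of documents in the range

        public Range(long start, long length) {
            this.start = start;
            this.length = length;
        }

        /** Position of the last document in the range (inclusive). */
        public long last() {
            return start + length - 1;
        }
    }

    /**
     * Splits positions 0 .. numDocs - 1 into numSplits contiguous ranges.
     * Assumption: every range gets numDocs / numSplits documents and the
     * last range also takes whatever remainder is left over.
     */
    public static List<Range> partition(long numDocs, int numSplits) {
        if (numDocs < 0 || numSplits < 1) {
            throw new IllegalArgumentException("numDocs >= 0 and numSplits >= 1 required");
        }
        List<Range> ranges = new ArrayList<>();
        long perSplit = numDocs / numSplits;
        long start = 0;
        for (int i = 0; i < numSplits - 1; i++) {
            ranges.add(new Range(start, perSplit));
            start += perSplit;
        }
        // The final range covers everything that has not been assigned yet.
        ranges.add(new Range(start, numDocs - start));
        return ranges;
    }

    public static void main(String[] args) {
        // With N = 10 and M = 2 this mirrors the example in step 2.
        for (Range r : partition(10, 2)) {
            System.out.println(r.start + " to " + r.last());
        }
    }
}
```

With N = 10 and M = 2, the two ranges correspond to 0 to (N / 2 - 1) and (N / 2) to (N - 1) from the example above.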
MapReduce: The job assumes that two documents in a Solr index never have the same URL, so it only deals with documents that have different URLs but the same digest.
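
To illustrate what "different URLs but the same digest" means, the sketch below groups documents by digest in plain Java and keeps only the groups with more than one URL. It is not the job's implementation: DigestGroupingSketch, Doc, and duplicateGroups are hypothetical names, the sample URLs and digests are made up, and the sketch deliberately does not decide which document in a group should survive, since this page does not say.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DigestGroupingSketch {

    /** Hypothetical stand-in for an indexed document: a URL plus a content digest. */
    public static final class Doc {
        public final String url;
        public final String digest;

        public Doc(String url, String digest) {
            this.url = url;
            this.digest = digest;
        }
    }

    /**
     * Groups URLs by digest and returns only the digests shared by two or
     * more documents. URLs are assumed to be unique, as stated in the text.
     */
    public static Map<String, List<String>> duplicateGroups(List<Doc> docs) {
        Map<String, List<String>> byDigest = new LinkedHashMap<>();
        for (Doc d : docs) {
            byDigest.computeIfAbsent(d.digest, k -> new ArrayList<>()).add(d.url);
        }
        // Digests that occur only once have no duplicates; drop them.
        byDigest.values().removeIf(urls -> urls.size() < 2);
        return byDigest;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        List<Doc> docs = Arrays.asList(
                new Doc("http://a.example/", "d1"),
                new Doc("http://b.example/", "d1"),
                new Doc("http://c.example/", "d2"));
        System.out.println(duplicateGroups(docs));
    }
}
```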


Nested Class Summary
static class SolrDeleteDuplicates.SolrInputFormat
           
static class SolrDeleteDuplicates.SolrInputSplit
           
static class SolrDeleteDuplicates.SolrRecord
           
static class SolrDeleteDuplicates.SolrRecordReader
           
 
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Reducer
Reducer.Context
 
Field Summary
static org.slf4j.Logger LOG
           
 
Constructor Summary
SolrDeleteDuplicates()
           
 
Method Summary
 void cleanup(Reducer.Context context)
           
 boolean dedup(String solrUrl)
           
 Configuration getConf()
           
static void main(String[] args)
           
 void reduce(Text key, Iterable<SolrDeleteDuplicates.SolrRecord> values, Reducer.Context context)
           
 int run(String[] args)
           
 void setConf(Configuration conf)
           
 void setup(Reducer.Context job)
           
 
Methods inherited from class org.apache.hadoop.mapreduce.Reducer
run
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

Constructor Detail

SolrDeleteDuplicates

public SolrDeleteDuplicates()

Method Detail

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

setup

public void setup(Reducer.Context job)
           throws IOException
Overrides:
setup in class Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
Throws:
IOException

cleanup

public void cleanup(Reducer.Context context)
             throws IOException
Overrides:
cleanup in class Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
Throws:
IOException

reduce

public void reduce(Text key,
                   Iterable<SolrDeleteDuplicates.SolrRecord> values,
                   Reducer.Context context)
            throws IOException
Overrides:
reduce in class Reducer<Text,SolrDeleteDuplicates.SolrRecord,Text,SolrDeleteDuplicates.SolrRecord>
Throws:
IOException

dedup

public boolean dedup(String solrUrl)
              throws IOException,
                     InterruptedException,
                     ClassNotFoundException
Throws:
IOException
InterruptedException
ClassNotFoundException

run

public int run(String[] args)
        throws IOException,
               InterruptedException,
               ClassNotFoundException
Specified by:
run in interface Tool
Throws:
IOException
InterruptedException
ClassNotFoundException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2012 The Apache Software Foundation