org.apache.nutch.tools.arc
Class ArcRecordReader

java.lang.Object
  extended by org.apache.nutch.tools.arc.ArcRecordReader
All Implemented Interfaces:
RecordReader<Text,BytesWritable>

public class ArcRecordReader
extends Object
implements RecordReader<Text,BytesWritable>

The ArcRecordReader class provides a record reader that reads records from arc files.

Arc files are essentially tars of gzips. Each record in an arc file is individually gzip-compressed, and the compressed records are concatenated to form a complete arc file. For more information on the arc file format see http://www.archive.org/web/researcher/ArcFileFormat.php.
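
The layout described above can be illustrated with the standard java.util.zip classes. This is a minimal sketch, not code from ArcRecordReader: the class name ConcatenatedGzipSketch and its helper methods are hypothetical, and the record contents are plain strings rather than real arc records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; not part of Nutch.
final class ConcatenatedGzipSketch {

  /** Compresses one record into a standalone gzip member. */
  public static byte[] gzipMember(String record) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
      gz.write(record.getBytes(StandardCharsets.UTF_8));
    }
    return buffer.toByteArray();
  }

  /** Places the members back to back, mimicking the arc layout. */
  public static byte[] concatenate(byte[]... members) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] member : members) {
      out.write(member);
    }
    return out.toByteArray();
  }

  /** Decompresses every member in order and returns the joined text. */
  public static String decompressAll(byte[] data) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gz =
        new GZIPInputStream(new ByteArrayInputStream(data))) {
      byte[] chunk = new byte[4096];
      int n;
      while ((n = gz.read(chunk)) != -1) {
        out.write(chunk, 0, n);
      }
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }
}
```

Because each member is a complete gzip stream, every record boundary in the concatenated bytes begins with the gzip magic number, which is what makes it possible to locate a record start inside a file split.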

Arc files are used by the Internet Archive and Grub projects.

See Also:
http://www.archive.org/, http://www.grub.org/

Field Summary
protected  Configuration conf
           
protected  long fileLen
           
protected  FSDataInputStream in
           
static org.slf4j.Logger LOG
           
protected  long pos
           
protected  long splitEnd
           
protected  long splitLen
           
protected  long splitStart
           
 
Constructor Summary
ArcRecordReader(Configuration conf, FileSplit split)
          Constructor that sets the configuration and file split.
 
Method Summary
 void close()
          Closes the record reader resources.
 Text createKey()
          Creates a new instance of the Text object for the key.
 BytesWritable createValue()
          Creates a new instance of the BytesWritable object for the value.
 long getPos()
          Returns the current position in the file.
 float getProgress()
          Returns the progress in processing the file as a fraction from 0 to 1.
static boolean isMagic(byte[] input)
          Returns true if the byte array passed matches the gzip header magic number.
 boolean next(Text key, BytesWritable value)
          Returns true if the next record in the split is read into the key and value pair.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG

conf

protected Configuration conf

splitStart

protected long splitStart

pos

protected long pos

splitEnd

protected long splitEnd

splitLen

protected long splitLen

fileLen

protected long fileLen

in

protected FSDataInputStream in
Constructor Detail

ArcRecordReader

public ArcRecordReader(Configuration conf,
                       FileSplit split)
                throws IOException
Constructor that sets the configuration and file split.

Parameters:
conf - The job configuration.
split - The file split to read from.
Throws:
IOException - If an IO error occurs while initializing file split.
Method Detail

isMagic

public static boolean isMagic(byte[] input)

Returns true if the byte array passed matches the gzip header magic number.

Parameters:
input - The byte array to check.
Returns:
True if the byte array matches the gzip header magic number.
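
For illustration, a gzip magic-number check can be sketched as follows. This is an assumption-laden sketch rather than the actual isMagic implementation: the class and method names are hypothetical, and this page does not state how many bytes isMagic compares or whether it requires an exact array length. The two byte values are the ID1 and ID2 fields defined by the gzip format (RFC 1952).

```java
// Hypothetical illustration class; not part of Nutch.
final class GzipMagicSketch {

  // Every gzip member starts with ID1 = 0x1f and ID2 = 0x8b (RFC 1952).
  private static final byte[] GZIP_MAGIC = {(byte) 0x1f, (byte) 0x8b};

  /** Returns true if the array begins with the two gzip magic bytes. */
  public static boolean startsWithGzipMagic(byte[] input) {
    // Assumption: null or too-short input is treated as "no match".
    if (input == null || input.length < GZIP_MAGIC.length) {
      return false;
    }
    for (int i = 0; i < GZIP_MAGIC.length; i++) {
      if (input[i] != GZIP_MAGIC[i]) {
        return false;
      }
    }
    return true;
  }
}
```

The explicit (byte) casts matter because Java bytes are signed, so the literal 0x8b does not fit in a byte without one.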

close

public void close()
           throws IOException
Closes the record reader resources.

Specified by:
close in interface RecordReader<Text,BytesWritable>
Throws:
IOException

createKey

public Text createKey()
Creates a new instance of the Text object for the key.

Specified by:
createKey in interface RecordReader<Text,BytesWritable>

createValue

public BytesWritable createValue()
Creates a new instance of the BytesWritable object for the value.

Specified by:
createValue in interface RecordReader<Text,BytesWritable>

getPos

public long getPos()
            throws IOException
Returns the current position in the file.

Specified by:
getPos in interface RecordReader<Text,BytesWritable>
Returns:
The current position in the file, as a long.
Throws:
IOException

getProgress

public float getProgress()
                  throws IOException
Returns the progress in processing the file as a fraction from 0 to 1, where 1 means 100% complete.

Specified by:
getProgress in interface RecordReader<Text,BytesWritable>
Returns:
The progress as a float from 0 to 1.
Throws:
IOException
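
One way such a 0-to-1 progress value could be derived from the split offsets is sketched below. This is only an assumption about how splitStart, splitEnd and pos relate; the class and method are hypothetical and the real getProgress may compute the value differently.

```java
// Hypothetical illustration class; not part of Nutch.
final class SplitProgressSketch {

  /**
   * Returns the fraction of the split consumed so far, clamped to [0, 1].
   * Assumption: pos, splitStart and splitEnd are byte offsets in one file.
   */
  public static float progress(long splitStart, long splitEnd, long pos) {
    long splitLen = splitEnd - splitStart;
    if (splitLen <= 0) {
      return 0.0f; // empty split: nothing has been processed
    }
    float fraction = (pos - splitStart) / (float) splitLen;
    // A record may extend past the split end, so clamp the result.
    return Math.max(0.0f, Math.min(1.0f, fraction));
  }
}
```

For example, a split covering bytes 100 to 200 with the reader at byte 150 yields 0.5.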

next

public boolean next(Text key,
                    BytesWritable value)
             throws IOException

Returns true if the next record in the split is read into the key and value pair. The key will be the arc record header and the value will be the raw content bytes of the arc record.

Specified by:
next in interface RecordReader<Text,BytesWritable>
Parameters:
key - The record key
value - The record value
Returns:
True if the next record is read.
Throws:
IOException - If an error occurs while reading the record value.
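
Since the key holds the arc record header, a caller will typically want to split it into fields. The sketch below assumes the header is a single whitespace-separated line in the order URL, IP address, archive date, content type, archive length, as described in the arc file format document linked above; the class name, the method and the sample header are hypothetical, and this page does not confirm the exact header layout that next produces.

```java
// Hypothetical illustration class; not part of Nutch.
final class ArcHeaderSketch {

  /**
   * Splits an arc record header line into its whitespace-separated fields.
   * Assumed order: URL, IP address, archive date, content type, length.
   */
  public static String[] headerFields(String headerLine) {
    return headerLine.trim().split("\\s+");
  }
}
```

With a Text key, the header line would be obtained through key.toString() before being passed to a helper like this one.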


Copyright © 2012 The Apache Software Foundation