Thursday, 8 August 2013

SequenceFiles that map a single key to multiple values

I am trying to preprocess data that will be fed to LucidWorks Big Data
(LWBD) for indexing. LWBD accepts SolrXML in the form of SequenceFiles.
I want to write a Pig script that takes all the SolrXML files in a
directory and outputs them in the format:
filename_1 => <here goes some XML>
...
filename_N => <here goes some more XML>
Pig's native PigStorage() load function can automatically create a column
that includes the name of the file from which the data was extracted,
which ideally would look like this:
{"filename_1", "<here goes some XML>"}
...
{"filename_N", "<here goes some more XML>"}
However, PigStorage() also treats '\n' as the record delimiter, so each
line of a file becomes its own tuple, and what I actually end up with is a
bag that looks like this:
{"filename_1", "<some partial XML from file 1>"}
{"filename_1", "<some more partial XML from file 1>"}
{"filename_1", "<the end of file 1>"}
...
I'm sure you get the picture. My question is: if I were to write this bag
to a SequenceFile, how would it be read by other applications? Would the
application I feed it to combine the values for each key by default, like
this?
"filename_1" => "<some partial XML from file 1>
<some more partial XML from file 1>
<the end of file 1>"
Or is there some post-processing that I can do to get it into this format?
Thank you for your help.
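
To make the desired post-processing concrete, here is a minimal sketch of the merge step in Python rather than Pig. It assumes the bag is available as an ordered list of (filename, line) pairs and that all lines from one file are adjacent and in their original order. Pig does not guarantee that ordering on its own, so it is an assumption of this sketch. The function name combine_by_filename and the sample records are hypothetical. A SequenceFile itself is a flat list of key/value records with no uniqueness constraint on keys, so a reader would normally see the three "filename_1" records separately unless something like this is done first.

```python
from itertools import groupby
from operator import itemgetter


def combine_by_filename(records):
    """Merge consecutive (filename, line) pairs into one (filename, text) pair per file.

    Assumes records for the same file are adjacent and already in line order.
    """
    combined = []
    for filename, group in groupby(records, key=itemgetter(0)):
        # Rejoin the lines with the '\n' that the line-based loader split on.
        text = "\n".join(line for _, line in group)
        combined.append((filename, text))
    return combined


# Hypothetical sample mirroring the bag shown in the question.
records = [
    ("filename_1", "<some partial XML from file 1>"),
    ("filename_1", "<some more partial XML from file 1>"),
    ("filename_1", "<the end of file 1>"),
    ("filename_2", "<all of file 2>"),
]

for filename, xml in combine_by_filename(records):
    print(filename, "=>", repr(xml))
```

The same idea in Pig would be a GROUP on the filename column followed by a concatenation of the grouped lines, but the line order inside each group would have to be preserved explicitly, for example by carrying a line number along with each tuple.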
