Merging multiple files into one within Hadoop

To keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (essentially a no-op), and add compression using MR flags:

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

If you want compression …
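As a rough local illustration of what such an identity streaming job does (not a replacement for it), the same data flow can be mimicked with ordinary shell tools: every input part passes through cat as the mapper, the shuffle sorts the lines, and a single cat reducer writes one output file. The directory and file names below are hypothetical, and sort only approximates Hadoop's shuffle ordering.

```shell
#!/bin/sh
# Local sketch of an identity streaming job with one reducer.
# All paths and sample records are made up for illustration.
workdir=$(mktemp -d)
printf 'b\t2\na\t1\n' > "$workdir/part-00000"
printf 'c\t3\n'       > "$workdir/part-00001"

# mapper (cat) -> shuffle (sort, approximating Hadoop's key ordering)
# -> single reducer (cat) writing one merged file
cat "$workdir"/part-* | cat | LC_ALL=C sort | cat > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```

Because the shuffle sorts records by key, the merged output is not guaranteed to preserve the original line order of the input files.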

Pivot table with Apache Pig

You can do it in two ways:
1. Write a UDF that returns a bag of tuples. This is the most flexible solution, but it requires Java code.
2. Write a rigid script like this:

inpt = load '/pig_fun/input/pivot.txt' as (Id, Column1, Column2, Column3);
bagged = foreach inpt generate Id, TOBAG(TOTUPLE('Column1', Column1), TOTUPLE('Column2', Column2), TOTUPLE('Column3', …
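The excerpt is cut off, so the remaining Pig steps are not shown here. As a sketch of the reshaping that pairing each column name with its value points toward, assuming the bag is later flattened into one row per (Id, column name, value), a comparable transformation can be written with awk on a tab-separated file. The sample data and temporary file are hypothetical, and this is an illustration of the idea rather than the post's own solution.

```shell
#!/bin/sh
# Hypothetical sample input: Id, Column1, Column2, Column3 (tab-separated).
input=$(mktemp)
printf '1\tx1\ty1\tz1\n2\tx2\ty2\tz2\n' > "$input"

# Emit one row per (Id, column name, value), mirroring the
# TOTUPLE('ColumnN', ColumnN) pairs built in the Pig script.
unpivoted=$(awk -F '\t' -v OFS='\t' '{
    for (i = 2; i <= NF; i++) print $1, "Column" (i - 1), $i
}' "$input")

printf '%s\n' "$unpivoted"
```

Like the rigid Pig script, this hard-codes the assumption that every field after the Id is a data column; a UDF would be the place for anything more flexible.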

ERROR 1066: Unable to open iterator for alias in Pig, Generic solution

The message “ERROR 1066: Unable to open iterator for alias myAlias” suggests that something is going wrong on the line where you use myAlias. However, you will usually see this error when something went wrong before the point where you try to use this alias. So the first thing to do is look up further along …