How does Hadoop process records split across block boundaries?
Interesting question. I spent some time looking at the code for the details, and here are my thoughts. The splits are computed on the client side by InputFormat.getSplits, so a look at FileInputFormat gives the following: for each input file, get the file length and the block size, and calculate the split size as max(minSize, min(maxSize, …
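To make the split-size step concrete, here is a minimal sketch in Python (not Hadoop's actual Java code). The excerpt above is cut off before the innermost term of the formula; the sketch assumes that term is the block size, which is how I remember FileInputFormat.computeSplitSize, but treat that as an assumption. The function names compute_split_size and get_splits are hypothetical, and the sketch cuts the file purely by byte offset, leaving out any refinements the real implementation applies.

```python
from typing import List, Tuple


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Clamp the block size into the range [min_size, max_size].

    Assumption: the truncated inner term of the formula is the block size.
    """
    return max(min_size, min(max_size, block_size))


def get_splits(file_length: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover the file.

    Splits are cut purely by byte offset, so a record can straddle
    two adjacent splits.
    """
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


if __name__ == "__main__":
    MB = 1024 * 1024
    # Hypothetical settings: 128 MB blocks, no effective min/max limits.
    size = compute_split_size(128 * MB, 1, 2**63 - 1)
    for offset, length in get_splits(300 * MB, size):
        print(offset // MB, length // MB)
```

With these hypothetical numbers, a 300 MB file yields three splits starting at 0, 128 and 256 MB. Nothing in this step looks at record boundaries, which is why the question of records that straddle two splits comes up at all.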