Hadoop: …be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

This error comes from the HDFS block replication system: the NameNode could not place even a single copy of a block of the file being written. Common reasons for that:
- Only the NameNode instance is running, and it is not in safe mode.
- No DataNode instances are up and running, or some are dead.
… Read more
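Before digging further, both conditions can be confirmed programmatically. Below is a minimal Scala sketch, assuming a Hadoop 2.x client on the classpath and fs.defaultFS pointing at HDFS; the class and method names come from the stock org.apache.hadoop.hdfs API, not from the answer above:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.hdfs.DistributedFileSystem
    import org.apache.hadoop.hdfs.protocol.HdfsConstants.{DatanodeReportType, SafeModeAction}

    object HdfsHealthCheck {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration() // picks up core-site.xml / hdfs-site.xml from the classpath
        val dfs = FileSystem.get(conf).asInstanceOf[DistributedFileSystem] // assumes fs.defaultFS is hdfs://

        // Writes are rejected while the NameNode is in safe mode.
        val inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)
        println(s"NameNode in safe mode: $inSafeMode")

        // If this prints 0, there is no DataNode that can accept a block replica.
        val datanodes = dfs.getDataNodeStats(DatanodeReportType.LIVE)
        println(s"Live DataNodes: ${datanodes.length}")
        datanodes.foreach(dn => println(s"  ${dn.getHostName} remaining=${dn.getRemaining} bytes"))
      }
    }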

The default NameNode port of HDFS is 50070, but I have come across 8020 or 9000 in some places [closed]

The default Hadoop ports are as follows (these are the HTTP ports; they serve the web UIs):

Daemon                   Default Port   Configuration Parameter
-----------------------  ------------   --------------------------------
Namenode                 50070          dfs.http.address
Datanodes                50075          dfs.datanode.http.address
Secondarynamenode        50090          dfs.secondary.http.address
Backup/Checkpoint node?  50105          dfs.backup.http.address
Jobtracker               50030          mapred.job.tracker.http.address
Tasktrackers             50060          mapred.task.tracker.http.address

The 8020 and 9000 you have come across are a different thing: the NameNode's IPC/RPC port, configured via fs.default.name (fs.defaultFS in newer releases), e.g. hdfs://namenode:8020. Internally, Hadoop mostly uses Hadoop IPC (Inter-Process Communication) to communicate amongst … Read more
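To see which of these values a client actually resolves, a small sketch like the following can help; the property names are the classic pre-Hadoop-2 ones from the table above (newer releases rename them, e.g. dfs.namenode.http-address):

    import org.apache.hadoop.conf.Configuration

    object ShowHadoopPorts {
      def main(args: Array[String]): Unit = {
        // Reads core-site.xml / hdfs-site.xml found on the classpath.
        val conf = new Configuration()
        // IPC/RPC address clients connect to, e.g. hdfs://namenode:8020 or :9000.
        println("fs.default.name  = " + conf.get("fs.default.name"))
        // NameNode web UI address; 50070 in the table above is its default port.
        println("dfs.http.address = " + conf.get("dfs.http.address", "0.0.0.0:50070"))
      }
    }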

Spark iterate HDFS directory

You can use org.apache.hadoop.fs.FileSystem, specifically FileSystem.listFiles([path], true). And with Spark:

    FileSystem.get(sc.hadoopConfiguration).listFiles(…, true)

Edit: it is worth noting that good practice is to get the FileSystem associated with the Path's scheme:

    path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
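Put together, a self-contained sketch might look like the following (the hdfs:///tmp path and the app name are placeholders, not from the answer; listFiles returns a Hadoop RemoteIterator, which has to be drained manually):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    object ListHdfsFiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ListHdfsFiles").getOrCreate()
        val sc = spark.sparkContext

        val path = new Path("hdfs:///tmp") // placeholder directory; point it at your own
        // Resolve the FileSystem from the Path's scheme, as the edit above recommends.
        val fs = path.getFileSystem(sc.hadoopConfiguration)

        val it = fs.listFiles(path, true) // true = recurse into subdirectories
        while (it.hasNext) {              // RemoteIterator is not a Scala collection
          val status = it.next()
          println(s"${status.getPath} (${status.getLen} bytes)")
        }

        spark.stop()
      }
    }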

Why is the final reduce step extremely slow in this MapReduce? (HiveQL, HDFS MapReduce)

If the final reducer is a join, then it looks like skew in the join key. First of all, check two things:

1. Check that the b.f1 join key has no duplicates:

    select b.f1, count(*) cnt from B b group by b.f1 having count(*) > 1 order by cnt desc;

2. Check the distribution of a.f1:

    select a.f1, count(*) cnt from A … Read more
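The same two diagnostics can also be run from Spark with Hive support enabled. A sketch under the assumption that tables A and B live in the Hive metastore (Spark SQL here stands in for the Hive CLI; it is not what the answer itself uses):

    import org.apache.spark.sql.SparkSession

    object JoinKeySkewCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("JoinKeySkewCheck")
          .enableHiveSupport() // requires a reachable Hive metastore
          .getOrCreate()

        // Duplicate join keys on the b side multiply rows in the join.
        spark.sql(
          """SELECT b.f1, COUNT(*) AS cnt
            |FROM B b
            |GROUP BY b.f1
            |HAVING COUNT(*) > 1
            |ORDER BY cnt DESC""".stripMargin).show(20)

        // A handful of very frequent a.f1 values means one reducer gets most of the rows.
        spark.sql(
          """SELECT a.f1, COUNT(*) AS cnt
            |FROM A a
            |GROUP BY a.f1
            |ORDER BY cnt DESC""".stripMargin).show(20)

        spark.stop()
      }
    }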
