Hadoop: …be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

This error comes from the HDFS block replication system: the NameNode could not place even a single copy of a block of the file being written. Common reasons for that:
- Only the NameNode instance is running, and it is not in safe mode.
- No DataNode instances are up and running, or some are dead.
… Read more
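Before digging further, both conditions can be confirmed programmatically. Below is a minimal Scala sketch, assuming a Hadoop 2.x client on the classpath and fs.defaultFS pointing at HDFS; the class and method names come from the stock org.apache.hadoop.hdfs API, not from the answer above:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.hdfs.DistributedFileSystem
    import org.apache.hadoop.hdfs.protocol.HdfsConstants.{DatanodeReportType, SafeModeAction}

    object HdfsHealthCheck {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration() // picks up core-site.xml / hdfs-site.xml from the classpath
        val dfs = FileSystem.get(conf).asInstanceOf[DistributedFileSystem] // assumes fs.defaultFS is hdfs://

        // Writes are rejected while the NameNode is in safe mode.
        val inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET)
        println(s"NameNode in safe mode: $inSafeMode")

        // If this prints 0, there is no DataNode that can accept a block replica.
        val datanodes = dfs.getDataNodeStats(DatanodeReportType.LIVE)
        println(s"Live DataNodes: ${datanodes.length}")
        datanodes.foreach(dn => println(s"  ${dn.getHostName} remaining=${dn.getRemaining} bytes"))
      }
    }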

The default NameNode port of HDFS is 50070, but I have come across 8020 or 9000 in some places [closed]

The default Hadoop ports are as follows (these are the HTTP ports; they serve the web UIs):

Daemon                   Default Port   Configuration Parameter
-----------------------  ------------   --------------------------------
Namenode                 50070          dfs.http.address
Datanodes                50075          dfs.datanode.http.address
Secondarynamenode        50090          dfs.secondary.http.address
Backup/Checkpoint node?  50105          dfs.backup.http.address
Jobtracker               50030          mapred.job.tracker.http.address
Tasktrackers             50060          mapred.task.tracker.http.address

The 8020 and 9000 you have come across are a different thing: the NameNode's IPC/RPC port, configured via fs.default.name (fs.defaultFS in newer releases), e.g. hdfs://namenode:8020. Internally, Hadoop mostly uses Hadoop IPC (Inter-Process Communication) to communicate amongst … Read more
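To see which of these values a client actually resolves, a small sketch like the following can help; the property names are the classic pre-Hadoop-2 ones from the table above (newer releases rename them, e.g. dfs.namenode.http-address):

    import org.apache.hadoop.conf.Configuration

    object ShowHadoopPorts {
      def main(args: Array[String]): Unit = {
        // Reads core-site.xml / hdfs-site.xml found on the classpath.
        val conf = new Configuration()
        // IPC/RPC address clients connect to, e.g. hdfs://namenode:8020 or :9000.
        println("fs.default.name  = " + conf.get("fs.default.name"))
        // NameNode web UI address; 50070 in the table above is its default port.
        println("dfs.http.address = " + conf.get("dfs.http.address", "0.0.0.0:50070"))
      }
    }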

Spark iterate HDFS directory

You can use org.apache.hadoop.fs.FileSystem, specifically FileSystem.listFiles([path], true). And with Spark:

    FileSystem.get(sc.hadoopConfiguration).listFiles(…, true)

Edit: it is worth noting that good practice is to get the FileSystem associated with the Path's scheme:

    path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
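Put together, a self-contained sketch might look like the following (the hdfs:///tmp path and the app name are placeholders, not from the answer; listFiles returns a Hadoop RemoteIterator, which has to be drained manually):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    object ListHdfsFiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ListHdfsFiles").getOrCreate()
        val sc = spark.sparkContext

        val path = new Path("hdfs:///tmp") // placeholder directory; point it at your own
        // Resolve the FileSystem from the Path's scheme, as the edit above recommends.
        val fs = path.getFileSystem(sc.hadoopConfiguration)

        val it = fs.listFiles(path, true) // true = recurse into subdirectories
        while (it.hasNext) {              // RemoteIterator is not a Scala collection
          val status = it.next()
          println(s"${status.getPath} (${status.getLen} bytes)")
        }

        spark.stop()
      }
    }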

Why is the final reduce step extremely slow in this MapReduce? (HiveQL, HDFS MapReduce)

If the final reducer is a join, then it looks like skew in the join key. First of all, check two things:

1. Check that the b.f1 join key has no duplicates:

    select b.f1, count(*) cnt from B b group by b.f1 having count(*) > 1 order by cnt desc;

2. Check the distribution of a.f1:

    select a.f1, count(*) cnt from A … Read more
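The same two diagnostics can also be run from Spark with Hive support enabled. A sketch under the assumption that tables A and B live in the Hive metastore (Spark SQL here stands in for the Hive CLI; it is not what the answer itself uses):

    import org.apache.spark.sql.SparkSession

    object JoinKeySkewCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("JoinKeySkewCheck")
          .enableHiveSupport() // requires a reachable Hive metastore
          .getOrCreate()

        // Duplicate join keys on the b side multiply rows in the join.
        spark.sql(
          """SELECT b.f1, COUNT(*) AS cnt
            |FROM B b
            |GROUP BY b.f1
            |HAVING COUNT(*) > 1
            |ORDER BY cnt DESC""".stripMargin).show(20)

        // A handful of very frequent a.f1 values means one reducer gets most of the rows.
        spark.sql(
          """SELECT a.f1, COUNT(*) AS cnt
            |FROM A a
            |GROUP BY a.f1
            |ORDER BY cnt DESC""".stripMargin).show(20)

        spark.stop()
      }
    }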
