Problem Description
This problem came up while loading data into HDFS; the data totals roughly 100 GB across about 100 files.
Over the past two days, uploads to HDFS became extremely slow whenever HDFS emitted the warning below. The uploads still succeed and the resulting file sizes are correct; only the transfer itself is very slow.
Sensitive information has been removed from the error output.
19/12/10 11:27:31 INFO hdfs.DataStreamer: Slow ReadProcessor read fields for block BP-15555804:blk_1128962062_655986 took 358746ms (threshold=30000ms); ack: seqno: 66635 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 358748548591 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[,S-c7e3c2db-8e7a-4632-8bf1-0cb4e205cd30,DISK], DatanodeInfoWithStorage[,DS-eaa7d5b5d02-4e7d-9698-0358b7ae88,DISK], DatanodeInfoWithStorage[,DS-10ed3e-b2ad-47ce-9052-6654c689ed,DISK]]
19/12/10 11:27:33 WARN hdfs.DataStreamer: Exception for BP-1763739111--1555014954:blk_112896062_6550298
java.io.IOException: Bad response ERROR for BP-1763739111--15555054:blk_1128962062_6502986 from datanode DatanodeInfoWithStorage[,D-aa70d5b-5d02-4e7d-9698-0c58b7aDISK]
at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1126)
19/12/10 11:27:33 WARN hdfs.DataStreamer: Error Recovery for BP-1763739111--1580014954:blk_112962062_602986 in pipeline [DatanodeInfoWithStorage[,DS-c7e3cb-8e7a-4632-8bf1-0cb4e20530,DISK], DatanodeInfoWithStorage[,DS-eaa70d5b-5d02-4e7d-9698-03c58b7ae288,DISK], DatanodeInfoWithStorage[,DS-10e8d3ce-b2ad-47ce-9052-665b4c6589ed,DISK]]: datanode 1(DatanodeInfoWithStorage[,DS-eaa70d5b-5d02-4e7d-9698-03c58b7ae288,DISK]) is bad.
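When the warnings pile up, it helps to pull the block IDs and delays out of the client-side output so you know which blocks (and, from the `targets` list, which datanodes) are affected. A minimal sketch; the regex is an assumption based on the message format shown above, not a stable API:

```python
import re

# Matches the client-side DataStreamer "Slow ReadProcessor" warning shown above.
# The exact wording can differ between Hadoop versions -- adjust as needed.
SLOW_RE = re.compile(
    r"Slow ReadProcessor read fields for block (?P<block>\S+) "
    r"took (?P<ms>\d+)ms \(threshold=(?P<th>\d+)ms\)"
)

def parse_slow_warnings(lines):
    """Yield (block_id, elapsed_ms, threshold_ms) for each slow-read warning."""
    for line in lines:
        m = SLOW_RE.search(line)
        if m:
            yield m.group("block"), int(m.group("ms")), int(m.group("th"))

# Sample line taken from the log output above.
log = [
    "19/12/10 11:27:31 INFO hdfs.DataStreamer: Slow ReadProcessor read fields "
    "for block BP-15555804:blk_1128962062_655986 took 358746ms (threshold=30000ms); ack: ..."
]
for block, ms, th in parse_slow_warnings(log):
    print(block, ms, th)
```

Sorting the extracted delays by block makes it easy to see whether the slowness is concentrated on blocks hosted by one particular datanode.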
Solution
An answer found on the Cloudera community forum:
Right now I can think of 3 possible scenarios which might create this issue.
1. Check the datanode logs corresponding to that block ID. If there is any issue with the disks, you will probably see messages like the ones below. If you can't find these messages, you still need to take a look at disk IO:
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:200ms (threshold=15ms)
OR
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 986ms (threshold=300ms)
2. Check the datanode logs and its GC pattern, because a long GC pause can also cause this issue.
3. If the datanode doesn't show any disk or GC issues, then the network latency between the client and the datanode could be the culprit; you can use iperf or another network tool to check latency.
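For step 1, a small helper can scan a datanode log for the slow-IO warnings quoted above and surface the worst delays. The patterns are an assumption based on the two sample messages; the wording varies slightly across Hadoop versions:

```python
import re

# Matches the two DataNode slow-write warnings quoted above; these patterns
# are based on the sample messages and are not exhaustive.
PATTERNS = [
    re.compile(r"Slow BlockReceiver write data to disk cost:(\d+)ms"),
    re.compile(r"Slow BlockReceiver write packet to mirror took (\d+)ms"),
]

def slowest_writes(lines, top=5):
    """Return the `top` largest write delays (in ms) found in the log lines."""
    delays = []
    for line in lines:
        for pat in PATTERNS:
            m = pat.search(line)
            if m:
                delays.append(int(m.group(1)))
    return sorted(delays, reverse=True)[:top]

# Sample lines modeled on the warnings quoted in the answer above.
sample = [
    "WARN ...DataNode: Slow BlockReceiver write data to disk cost:200ms (threshold=15ms)",
    "WARN ...DataNode: Slow BlockReceiver write packet to mirror took 986ms (threshold=300ms)",
]
print(slowest_writes(sample))  # prints [986, 200]
```

If this turns up frequent multi-second delays on one datanode, that node's disks (or, for the "mirror" variant, its downstream network link) are the place to look next.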
In our case a DataNode had indeed gone bad. Since I don't have the permissions needed to handle this issue myself, I can only report it to the administrators. I'm recording the problem here on the blog.
Copyright notice: this is an original article by weixin_44421339, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.