解决Namenode异常停止后无法正常启动
hadoop namenode无法启动问题我在网上找了一下有很多问题,这个集群服务器碰到此问题就是比较大的问题了,下面我们一起来看看此问题解决办法吧.
背景:公司在线上使用了CDH5 HA模式,有两个Namenode节点,结果其中的Standby节点因为一些关于edits文件的报错异常停止了,并且在启动的过程中一直报告找不到各种文件.
刚开始怀疑问题可能只发生在Standby本身,因此尝试了bootstrapStandby来重新初始化Standby节点,但问题依旧.
而后来因为我尝试重启ZKFC(Zookeeper Failover)服务器,导致了Active节点进行自动切换,在切换失败后打算切换回去时,也无法启动服务了,报错跟Standby节点一模一样,于是整个Hadoop集群就挂了.
问题严重,在搜遍了整个Google都找不到任何有用的信息之后,只能求助于老大,最后,老大想到一个思路,就是将fsimage(元数据)文件与edits(编辑日志)文件都反编译成文本,查看里面具体有什么内容,为什么加载edits文件时会报错.
结果,这个思路给我们带来了曙光,并最终修复了整个集群.
环境介绍:
- idc2-server1:namenode,journalnode,zkfc
- idc2-server2:namenode,journalnode,zkfc
- idc2-server3:journalnode,resourcemanager
具体过程:
首先,是Standby Namenode上出现以下错误,然后自动异常关闭了进程:
- 2014-11-1102:12:54,057FATALorg.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer:Unknownerrorencounteredwhiletailingedits.ShuttingdownstandbyNN.
- java.io.FileNotFoundException:Filedoesnotexist:/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015
- atorg.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
- atorg.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
- ...
其中提到了"Unknown error encountered while tailing edits. Shutting down standby NN."
于是,我们尝试启动Standby Namenode服务,结果报告以下错误:
- 2014-11-1204:26:28,860INFOorg.apache.hadoop.hdfs.server.namenode.EditLogInputStream:Fast-forwardingstream'http://idc2-server10.heylinux.com:8480/getJournal?jid=idc2&segmentTxId=240823073&storageInfo=-55%3A1838233660%3A0%3ACID-d77ea84b-1b24-4bc2-ad27-7d2384d222d6'totransactionID240741256
- 2014-11-1204:26:28,874ERRORorg.apache.hadoop.hdfs.server.namenode.FSEditLogLoader:EncounteredexceptiononoperationCloseOp[length=0,inodeId=0,path=/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015,replication=3,mtime=1415671845582,atime=1415670522749,blockSize=134217728,blocks=[],permissions=oozie:hdfs:rw-r--r--,aclEntries=null,clientName=,clientMachine=,opCode=OP_CLOSE,txid=240823292]
- java.io.FileNotFoundException:Filedoesnotexist:/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015
- atorg.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
- atorg.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)//phpfensi.com
- ...
- 2014-11-1204:26:32,641WARNorg.apache.hadoop.hdfs.server.namenode.FSNamesystem:Encounteredexceptionloadingfsimage
- java.io.FileNotFoundException:Filedoesnotexist:/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015
- atorg.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
说找不到"/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015"这个文件。
而事实上,这个文件是临时文件,不重要并且已经被删除了,但在上面,却报告"ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader:Encountered exception on operation CloseOp",可以看出是在加载edits文件,执行OP_CLOSE操作时提示找不到文件.
刚开始我们怀疑可能只是Standby上的fsimage文件或edits文件有问题,于是我们在Standby上执行了bootstrapStandby,改过程会自动从Active Namenode上获取最新的fsimage文件,并从Journalnode日志服务器上下载并执行新的edits文件.
sudo -u hdfs hadoop namenode -bootstrapStandby
但是,在初始化之后,加载edits时仍然遇到上面相同的报错,而接下来,由于我尝试将ZKFC(Zookeeper Failover)服务重启,导致了Active Namenode自动切换到Standby,但由于Standby无法take over,所以Active Namenode切换回去的时候,也无法正常重启了,错误跟Standby启动时一样.
这样一来,整个Hadoop集群就挂了,在搜遍了整个Google都找不到任何有用的信息之后,我打电话给了老大,老大也通过上面的错误Google不到任何一条有用的信息.
于是老大尝试在edits文件中grep上面的路径,找到了一些相关的edits文件:
- #cd/data1/dfs/nn/
- #cp-rpacurrentcurrent.backup.orig
- #cd/data2/dfs/nn/
- #cp-rpacurrentcurrent.backup.orig
- #cd/data1/dfs/nn/current
- #grepattempt_1415171013961_37060_m_000015_0*
- Binaryfileedits_0000000000240687057-0000000000240698453matches
- Binaryfileedits_0000000000240823073-0000000000240838096matches
- Binaryfileedits_inprogress_0000000000244853266matches
于是,我们思考能否从这些edits文件或fsimage文件中找到一些线索,而下面的两篇文章中,提到了Hadoop自带的针对fsimage和edits文件的反编译工具:
- http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html
- http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html
其中,关于edits文件的一些描述给我们带来了极大的希望:
- Incasethereissomeproblemwithhadoopclusterandtheeditsfileiscorrupteditispossibletosaveatleastpartoftheeditsfilethatiscorrect.ThiscanbedonebyconvertingthebinaryeditstoXML,edititmanuallyandthenconvertitbacktobinary.
通过以上描述,我们了解到edits文件可能会corrupted,而反编译之后手动修改,在编译回二进制格式进行替换,可以作为一种解决办法,于是我们将上面找到的两个关联edits文件,将其复制出来并进行了反编译:
- #mkdir/tmp2/
- #cd/data1/dfs/nn
- #cpedits_0000000000240687057-0000000000240698453/tmp2/
- #cpedits_0000000000240823073-0000000000240838096/tmp2/
- #cd/tmp2
- #hdfsoev-iedits_0000000000240687057-0000000000240698453-oedits_0000000000240687057-0000000000240698453.xml
- #hdfsoev-iedits_0000000000240823073-0000000000240838096-oedits_0000000000240823073-0000000000240838096.xml
反编译之后,生成了两个XML文件,我们在XML文件中搜索"/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015",找到了OP_CLOSE与OP_DELETE相关记录:
- <RECORD>
- <OPCODE>OP_DELETE</OPCODE>
- <DATA>
- <TXID>240818498</TXID>
- <LENGTH>0</LENGTH>
- <PATH>/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015</PATH>
- <TIMESTAMP>1415671972595</TIMESTAMP>
- <RPC_CLIENTID>4a38861d-3bee-40e6-abb6-d2b58f313781</RPC_CLIENTID>
- <RPC_CALLID>676</RPC_CALLID>
- </DATA>
- </RECORD>
- <RECORD>
- <OPCODE>OP_CLOSE</OPCODE>
- <DATA>
- <TXID>240823292</TXID>
- <LENGTH>0</LENGTH>
- <INODEID>0</INODEID>
- <PATH>/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015</PATH>
- <REPLICATION>3</REPLICATION>
- <MTIME>1415671845582</MTIME>
- <ATIME>1415670522749</ATIME>
- <BLOCKSIZE>134217728</BLOCKSIZE>
- <CLIENT_NAME></CLIENT_NAME>
- <CLIENT_MACHINE></CLIENT_MACHINE>
- <PERMISSION_STATUS>
- <USERNAME>oozie</USERNAME>
- <GROUPNAME>hdfs</GROUPNAME>
- <MODE>420</MODE>
- </PERMISSION_STATUS>
- </DATA>
- </RECORD>
可以看到,对于"/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000015_0/part-00015",OP_DELETE发生在了OP_CLOSE之前,因此执行OP_CLOSE时会提示"File does not exist".
于是,我们尝试将OP_CLOSE这部分代码,替换成其它的内容,比如无关痛痒的修改一个现有文件的权限,并保留TXID 240823292以确保edits文件的完整性.
- <RECORD>
- <OPCODE>OP_SET_PERMISSIONS</OPCODE>
- <DATA>
- <TXID>240823292</TXID>
- <SRC>/user/oozie-heylinux/.staging/job_1415171013961_37194</SRC>
- <MODE>504</MODE>
- </DATA>
- </RECORD>
修改完成之后,再将XML文件反编译回binary格式.
- #cd/tmp2/
- #cpedits_0000000000240823073-0000000000240838096.xmledits_0000000000240823073-0000000000240838096.xml.orig
- #vimedits_0000000000240823073-0000000000240838096.xml
- #hdfsoev-iedits_0000000000240823073-0000000000240838096.xml-oedits_0000000000240823073-0000000000240838096-pbinary
然后将binary文件同步到journalnode日志服务器中:
- #cd/var/hadoop/data/dfs/jn/idc2prod/
- #cp-rpacurrentcurrent.backup.orig
- #cd/tmp2/
- #cpedits_0000000000240823073-0000000000240838096/data1/dfs/nn/current/
- #cpedits_0000000000240823073-0000000000240838096/data2/dfs/nn/current/
- #cpedits_0000000000240823073-0000000000240838096/var/hadoop/data/dfs/jn/idc2prod/current/
- #scpedits_0000000000240823073-0000000000240838096root@idc2-server2:/var/hadoop/data/dfs/jn/idc2prod/current/
- #scpedits_0000000000240823073-0000000000240838096root@idc2-server3:/var/hadoop/data/dfs/jn/idc2prod/current/
然后启动namenode服务,可以发现,之间的错误已经不存在了,取而代之的已经变成了其它文件.
- 2014-11-1208:57:13,053INFOorg.apache.hadoop.hdfs.server.namenode.EditLogInputStream:Fast-forwardingstream'http://idc2-server1.heylinux.com:8480/getJournal?jid=idc2prod&segmentTxId=240823073&storageInfo=-55%3A1838233660%3A0%3ACID-d77ea84b-1b24-4bc2-ad27-7d2384d222d6'totransactionID240299210
- 2014-11-1208:57:13,063ERRORorg.apache.hadoop.hdfs.server.namenode.FSEditLogLoader:EncounteredexceptiononoperationCloseOp[length=0,inodeId=0,path=/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000018_0/part-00018,replication=3,mtime=1415671845675,atime=1415670519537,blockSize=134217728,blocks=[],permissions=oozie:hdfs:rw-r--r--,aclEntries=null,clientName=,clientMachine=,opCode=OP_CLOSE,txid=240823337]
- java.io.FileNotFoundException:Filedoesnotexist:/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000018_0/part-00018
- atorg.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
- atorg.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
- ...
- 2014-11-1208:57:16,847WARNorg.apache.hadoop.hdfs.server.namenode.FSNamesystem:Encounteredexceptionloadingfsimage
- java.io.FileNotFoundException:Filedoesnotexist:/user/dong/data/dpp/classification/gender/vw-output-train/2014-10-30-research-with-confict-fix-bug-rerun/_temporary/1/_temporary/attempt_1415171013961_37060_m_000018_0/part-00018
- atorg.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
- ...
那么,接下来,就是重复以上动作,其中有时候能找到一部分规律,可以批量将同一个目录下反复报错的文件的OP_CLOSE都替换掉,但更多的时候则是随机的文件,需要一次次修改XML文件,然后编译成binary,再启动namenode,进行针对性的修改,一直反复的进行下去,直到Namenode能够成功启动为止.
我们在具体的操作过程中,还遇到过关于OP_ADD_BLOCK的错误,造成问题的原因是由于最后一个edits文件在反编译回binary文件时出现一些关于OP_UPDATE_BLOCK的错误.
我将报错的部分通过以上方式进行了替换,才成功的将edits文件反编译回binary文件.
具体的解决办法:就是根据"Encountered exception on operation AddBlockOp"定位到OP_ADD_BLOCK相关配置并替换即可.
- 2014-11-1218:07:39,070ERRORorg.apache.hadoop.hdfs.server.namenode.FSEditLogLoader:EncounteredexceptiononoperationAddBlockOp[path=/user/dong/data/dpp/classification/gender/vw-input/2014-10-30-research-with-no-confict-fix-bug-rerun/all_labelled/_temporary/1/_temporary/attempt_1415171013961_42350_m_001474_0/part-m-01474,penultimateBlock=NULL,lastBlock=blk_1109647729_35920089,RpcClientId=,RpcCallId=-2]
- java.lang.IllegalStateException
最后,在Namenode启动成功后,会报告很多Block丢失,解决办法是通过fsck删除这些错误的Block.
# hadoop fsck / -files -blocks -locations | tee -a fsck.out
然后在fsck.out中获取所有Block的信息,执行"hadoop fsck / -move"加Block路径进行删除,最后,退出safemode,生活再次变得美好起来.
# hadoop dfsadmin -safemode leave
PS:在后来,上面的问题又再次出现了,通过分析我们发现是由于Hadoop的一个Bug导致的,从2.10-beta就开始存在了:
https://issues.apache.org/jira/browse/HDFS-6527
https://issues.apache.org/jira/browse/HDFS-6647
想要彻底避免这类问题的再次发生,必须将Hadoop升级到2.5以上版本,而对应的CDH版本则是5.2.0.
热门评论