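When operating HBase, the biggest problem, apart from HBase's own bugs, is network latency. Such latency can cause Hadoop append operations to time out. On its own that would be minor, but it makes HBase exit because it can no longer append to the WAL log.
This time, however, the problem was with ZooKeeper.
Our cluster has three ZooKeeper servers. First, the connection between the leader (A) and one of the followers, B (xx.xx.xx.85), hit an exception, and follower B then exited.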
2011-08-01 03:28:30,013 [LearnerHandler-/xx.xx.xx.85:48270] ERROR org.apache.zookeeper.server.quorum.LearnerHandler: Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:375)
2011-08-01 03:28:30,013 [LearnerHandler-/xx.xx.xx.85:48270] WARN org.apache.zookeeper.server.quorum.LearnerHandler: ******* GOODBYE /xx.xx.xx.85:48270 ********
B tried to exit, but the exit failed. A large number of session connections were closed.
After that, follower C also hit an exception.
2011-08-01 03:29:38,562 [CommitProcessor:0] ERROR org.apache.zookeeper.server.NIOServerCnxn: Unexpected Exception:
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:148)
at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1043)
at org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1080)
at org.apache.zookeeper.server.DataTree.setWatches(DataTree.java:1154)
at org.apache.zookeeper.server.ZKDatabase.setWatches(ZKDatabase.java:383)
at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:297)
at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
Throughout this, the sessions between ZooKeeper and HBase were broken, which caused the master to hit a fatal error and exit.
2011-08-01 03:29:38,953 [main-EventThread] FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected zk exception getting RS nodes
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /SPN-hbase/rs
at org.apache.zookeeper.KeeperException.create(KeeperException.java:104)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:307)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:418)
at org.apache.hadoop.hbase.zookeeper.RegionServerTracker.nodeChildrenChanged(RegionServerTracker.java:86)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:315)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:560)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:536)
2011-08-01 03:29:38,953 [main-EventThread] INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-08-01 03:29:38,954 [main-EventThread] WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:8100-0x230684e82d6738d-0x230684e82d6738d Unable to list children of znode /SPN-hbase/tokenauth/keys
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /SPN-hbase/tokenauth/keys
at org.apache.zookeeper.KeeperException.create(KeeperException.java:104)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:307)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:418)
at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeChildrenChanged(ZKSecretWatcher.java:116)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:315)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:560)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:536)
2011-08-01 03:29:38,954 [main-EventThread] ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:8100-0x230684e82d6738d-0x230684e82d6738d Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /SPN-hbase/tokenauth/keys
at org.apache.zookeeper.KeeperException.create(KeeperException.java:104)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:307)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:418)
at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeChildrenChanged(ZKSecretWatcher.java:116)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:315)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:560)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:536)
2011-08-01 03:29:38,954 [main-EventThread] ERROR org.apache.hadoop.hbase.security.token.ZKSecretWatcher: Error reading data from zookeeper
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /SPN-hbase/tokenauth/keys
at org.apache.zookeeper.KeeperException.create(KeeperException.java:104)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:307)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndGetNewChildren(ZKUtil.java:418)
at org.apache.hadoop.hbase.security.token.ZKSecretWatcher.nodeChildrenChanged(ZKSecretWatcher.java:116)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:315)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:560)
at
We do have a backup master, but because of the ZooKeeper problem it could not work properly either.
After that, a large number of regionservers went down.
2011-08-01 03:29:38,565 [ZKSecretWatcher-leaderElector] INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Unexpected error from ZK, stopping candidate
2011-08-01 03:29:38,565 [ZKSecretWatcher-leaderElector] INFO org.apache.hadoop.hbase.security.token.AuthenticationTokenSecretManager: Stopping leader election, because: Unexpected error from ZK: KeeperErrorCode = InvalidACL for /SPN-hbase/tokenauth/keymaster
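Throughout the incident, we saw how a single ZooKeeper anomaly can deal a fatal blow to HBase.
For now, all we can do is put watchdogs on the regionservers and the ZooKeeper servers and quickly restart any server that goes down, to limit the damage from this kind of problem.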
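As an illustration of the watchdog idea, here is a minimal sketch in Python. It is not the script we run in production. It probes a ZooKeeper server with the `ruok` four-letter command (a healthy server answers `imok`) and calls a restart action after several consecutive failed probes. The function names, the failure threshold, the polling interval and the restart command passed in are all assumptions made for this example. On newer ZooKeeper releases the `ruok` command may need to be enabled through `4lw.commands.whitelist`.

```python
import socket
import subprocess
import time


def zk_is_ok(host, port=2181, timeout=2.0):
    """Send ZooKeeper's 'ruok' command; return True only if the reply is 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"ruok")
            reply = b""
            # Read until we have 4 bytes or the server closes the connection.
            while len(reply) < 4:
                chunk = sock.recv(4 - len(reply))
                if not chunk:
                    break
                reply += chunk
            return reply == b"imok"
    except OSError:
        # Connection refused, timeout, unreachable host: all count as unhealthy.
        return False


def watchdog_step(is_healthy, restart, failures, max_failures=3):
    """Run one probe and return the updated count of consecutive failures.

    restart() is called once the count reaches max_failures (a hypothetical
    threshold), after which the count is reset to 0.
    """
    if is_healthy():
        return 0
    failures += 1
    if failures >= max_failures:
        restart()
        return 0
    return failures


def run_watchdog(host, restart_cmd, interval=5.0, max_failures=3):
    """Poll forever. restart_cmd is a hypothetical command list supplied by the caller."""
    failures = 0
    while True:
        failures = watchdog_step(
            lambda: zk_is_ok(host),
            lambda: subprocess.call(restart_cmd),
            failures,
            max_failures,
        )
        time.sleep(interval)
```

A call such as `run_watchdog("127.0.0.1", ["/path/to/zkServer.sh", "restart"])` would start the loop. The path is a placeholder. Requiring several consecutive failures before restarting keeps a single slow reply from triggering a restart that would itself drop sessions.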
The HBase project is also aware of this problem.
https://issues.apache.org/jira/browse/HBASE-3065
It tries to keep HBase running as far as possible while ZooKeeper is disturbed, by adding more retries.
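To make the retry idea concrete, here is a small sketch of retrying an operation with exponential backoff when it fails with a transient error. It only illustrates the general technique and is not the code from the HBase issue. `TransientZkError`, `retry_zk` and the default retry counts and sleep times are hypothetical names and values chosen for this example. Non-transient errors are deliberately not retried.

```python
import time


class TransientZkError(Exception):
    """Hypothetical stand-in for a retryable ZooKeeper error, e.g. a lost connection."""


def retry_zk(op, max_retries=3, base_sleep=0.1, sleep=time.sleep):
    """Call op(); on TransientZkError retry up to max_retries times.

    The wait doubles after every failed attempt (base_sleep, 2*base_sleep, ...).
    Any other exception, and the last transient error, propagates to the caller.
    """
    attempt = 0
    while True:
        try:
            return op()
        except TransientZkError:
            if attempt >= max_retries:
                raise
            sleep(base_sleep * (2 ** attempt))
            attempt += 1
```

The `sleep` parameter exists only so the waiting can be replaced in tests. A caller would wrap a single ZooKeeper call, for example `retry_zk(lambda: list_children(path))`, where `list_children` is whatever client function is in use.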