版本信息
- cdh-6.0.1
- hadoop-3.0
- hbase-2.0.0
问题
想在空闲时候重启一下hbase释放一下内存,顺便修改一下yarn的一些配置,结果停掉后,hbase起不来了,错误信息就是hbase:namespace
表is not online,master一直初始化,具体错误信息:
15:41:59.313 [ProcExecTimeout] WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node4,16020,1589648302672, table=real_time_data, region=74cac15d22e99800ad0ace14c9ed74d6
15:41:59.313 [ProcExecTimeout] WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=8e68891d5826c09974d81ad5d705c3b6
15:41:59.313 [ProcExecTimeout] WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=75c42d75e2556bf70ff527f2425e8509
15:41:59.313 [ProcExecTimeout] WARN org.apache.hadoop.hbase.master.assignment.AssignmentManager - STUCK Region-In-Transition rit=OPENING, location=node3,16020,1596598630022, table=real_time_data, region=2eee04869ac2c35984d4d22e6e9f2f31
15:42:08.264 [master/node3:16000] INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl - Call exception, tries=15, retries=15, started=128887 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1558205786137.40562c48c9210c06813adce48773cb6a. is not online on node1,16020,1596957741742
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3273)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3250)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1414)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2446)
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:409)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:131)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
, details=row 'default' on table 'hbase:namespace' at region=hbase:namespace,,1558205786137.40562c48c9210c06813adce48773cb6a., hostname=node1,16020,1589648239142, seqNum=55
... ...
15:44:58.229 [qtp1792826268-435] WARN org.eclipse.jetty.servlet.ServletHandler - /master-status
org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
at org.apache.hadoop.hbase.master.HMaster.isInMaintenanceMode(HMaster.java:2827) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.renderNoFlush(MasterStatusTmplImpl.java:271) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.renderNoFlush(MasterStatusTmpl.java:389) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.render(MasterStatusTmpl.java:380) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:81) ~[hbase-server-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) ~[javax.servlet-api-3.1.0.jar:3.1.0]
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) ~[javax.servlet-api-3.1.0.jar:3.1.0]
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:112) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.apache.hadoop.hbase.http.ClickjackingPreventionFilter.doFilter(ClickjackingPreventionFilter.java:48) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1374) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.apache.hadoop.hbase.http.NoCacheFilter.doFilter(NoCacheFilter.java:49) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.apache.hadoop.hbase.http.NoCacheFilter.doFilter(NoCacheFilter.java:49) ~[hbase-http-2.0.0.3.0.0.0-1634.jar:2.0.0.3.0.0.0-1634]
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759) ~[jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) [jetty-security-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) [jetty-servlet-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.Server.handle(Server.java:534) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) [jetty-server-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) [jetty-io-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) [jetty-util-9.3.19.v20170502.jar:9.3.19.v20170502]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
常规操作
到这里,我尝试使用hbck命令查看详情并修复,发现hbase2.0.0
版本hbck
已经废弃了修复的命令。
-----------------------------------------------------------------------
NOTE: As of HBase version 2.0, the hbck tool is significantly changed.
In general, all Read-Only options are supported and can be be used
safely. Most -fix/ -repair options are NOT supported. Please see usage
below for details on which options are not supported.
-----------------------------------------------------------------------
省略若干...
省略若干...
省略若干...
NOTE: Following options are NOT supported as of HBase version 2.0+.
UNSUPPORTED Metadata Repair options: (expert features, use with caution!)
-fix Try to fix region assignments. This is for backwards compatiblity
-fixAssignments Try to fix region assignments. Replaces the old -fix
-fixMeta Try to fix meta problems. This assumes HDFS region info is good.
-fixHdfsHoles Try to fix region holes in hdfs.
-fixHdfsOrphans Try to fix region dirs with no .regioninfo file in hdfs
-fixTableOrphans Try to fix table dirs with no .tableinfo file in hdfs (online mode only)
-fixHdfsOverlaps Try to fix region overlaps in hdfs.
-maxMerge When fixing region overlaps, allow at most regions to merge. (n=5 by default)
-sidelineBigOverlaps When fixing region overlaps, allow to sideline big overlaps
-maxOverlapsToSideline When fixing region overlaps, allow at most regions to sideline per group. (n=2 by default)
-fixSplitParents Try to force offline split parents to be online.
-removeParents Try to offline and sideline lingering parents and keep daughter regions.
-fixEmptyMetaCells Try to fix hbase:meta entries not referencing any region (empty REGIONINFO_QUALIFIER rows)
UNSUPPORTED Metadata Repair shortcuts
-repair Shortcut for -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps -fixReferenceFiles-fixHFileLinks
-repairHoles Shortcut for -fixAssignments -fixMeta -fixHdfsHoles
然后,查阅资料看到了hbck2,官方地址:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2, 这个工具,本来以为抓住了救命的稻草,结果:
===================================================================
HBCK2 Overview
HBCK2 is currently a simple tool that does one thing at a time only.
In hbase-2.x, the Master is the final arbiter of all state, so a general principal for most HBCK2 commands is that it asks the Master to effect all repair. This means a Master must be up before you can run HBCK2 commands.
The HBCK2 implementation approach is to make use of an HbckService hosted on the Master. The Service publishes a few methods for the HBCK2 tool to pull on. Therefore, for HBCK2 commands relying on Master's HbckService facade, first thing HBCK2 does is poke the cluster to ensure the service is available. This will fail if the remote Server does not publish the Service or if the HbckService is lacking the requested method. For the latter case, if you can, update your cluster to obtain more fix facility.
HBCK2 versions should be able to work across multiple hbase-2 releases. It will fail with a complaint if it is unable to run. There is no HbckService in versions of hbase before 2.0.3 and 2.1.1. HBCK2 will not work against these versions.
Next we look first at how you 'find' issues in your running cluster followed by a section on how you 'fix' found problems.
===================================================================
wtm,服了。hbase2.0.0 ~ 2.0.2
以及hbase2.1.0 ~ 2.1.0
是不适用的,既不能使用hbck
,也不能使用hbck2
,这里出现了断层。
解决办法
1. 修复master,让集群正常启动
由于目前master无法初始化,集群无法启动,因为元数据表hbase:meta
信息有损坏,hbase:namespace
表is not online,首先需要让hbase:namespace
表上线,启动hbase
集群再说,否则后续的修复工作都进行不了;然后修复那些表(此时内心是崩溃的,都准备重搭建集群了)。
查看hbase源码,发现hbase元数据表hbase:namespace
表如果没有会重建,TableNamespaceManager.java
:
思路:备份hbase:namespace
表hdfs数据,删除hbase:namespace
表,启动时让其重建,然后将备份的数据bulkload
进新建的hbase:namespace
表中去。
删除hbase:meta
中hbase:namespace
那一行数据,并且mv走hbase:namespace
表对应的hdfs目录到临时目录备份,这样相当于把hbase:namespace
这个表删除了。
然后,重启hbase集群,namespace表会被重建,集群终于起来了。此时,hbase:namespace这张表里面保存的namespace只有default这个默认的namespace,我们通过bulkload,把临时目录里面的hfile文件移到hbase:namespace这张表里面,这样就还原了命名空间表。
2. 修复hbase表
很不容易,hbase集群已经起来了,通过webui发现,此时里面的表都是空的,无法找到每个region对应的hdfs数据文件。
由于hbase中的hbase:meta
表保存所有表的region分配等信息,现在由于集群异常停止,破坏了hbase:meta
表,应该是hbase:meta
表有损坏,导致hbase:namespace
表无法找到对应分配的region。
思路:通过.regioninfo
来修复hbase:meta
表,参考博客:https://blog.csdn.net/xyzkenan/article/details/103476160
工具地址:https://github.com/DarkPhoenixs/hbase-meta-repair
总结
这次异常,hbase集群无法启动,
两个表现:
- namespace region is not online;
- Master is initializing;
解决思路是:
- 首先删除hbase:namespace表,让hbase启动服务时自动创建,解决了hbase无法启动问题:
- 然后,通过.regioninfo文件修复hbase:meta表。
文章评论
:sad: :oops: 你这个表情挺有意思啊 :oops:
@极客分享 哈哈
我们hbase集群是cdh6.2.0,hbase是2.1.0的我也是tm醉了
@浪味的小绵羊 2.x版本只有2.1.0不能用hbck/hbck2工具修复,你们刚刚好。