Fixing shards that fail to start because of a TranslogCorruptedException in Elasticsearch

While testing Elasticsearch I ran into a corrupted-translog exception. This post records how it was fixed.

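1. Problem

A single node held more than 800 million documents in one index with 20+ fields. Data was being written continuously through bulk requests (bulk.size=50,000) when the machine lost power unexpectedly and went down.

After the machine was repaired and ES was restarted, a TranslogCorruptedException appeared: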

[2015-01-06 16:12:34,061][WARN ][indices.cluster          ] [node_141] [ips][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [ips][4] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
    ... 4 more
Caused by: java.io.EOFException
    at org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
    at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
    at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
    at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
    at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:353)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more
[2015-01-06 16:12:34,062][DEBUG][index.service            ] [node_141] [ips] [4] closing... (reason: [recovery failure [IndexShardGatewayRecoveryException[[ips][4] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]])
[2015-01-06 16:12:34,062][DEBUG][index.shard.service      ] [node_141] [ips][4] state: [RECOVERING]->[CLOSED], reason [recovery failure [IndexShardGatewayRecoveryException[[ips][4] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]

Four shards were reported as failing to start, and bulk writes to the index failed:

Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@372c22f5]]

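2. Solution

I looked at several ways to repair this, including Lucene's CheckIndex repair tool.

The official description of CheckIndex: Basic tool and API to check the health of an index and write a new segments file that removes reference to problematic segments.

This means the data in the damaged segments is lost.

I wanted a fix that loses as little data as possible, and found a similar problem on the Google group: "ES failed to recover after crash".

The solution given by Motov: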

- shut down elasticsearch cluster

- find all shards that cannot recover by searching log file

- for each shard move its non-zero length translog file into a temporary directory (see explanation below)

- start elasticsearch cluster

- if you see messages for other shards - repeat

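In other words:

Shut down the cluster --> find the shards that cannot start --> remove the translog of those shards (make a backup first) --> restart the ES cluster

If that still does not work, repeat the process.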
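The "find the shards that cannot start" step can be scripted. The following is a minimal sketch, assuming the log lines have the same shape as the "failed to start shard" warnings quoted in this post; the function name find_failed_shards and the regular expression are illustrative, not part of Elasticsearch.

```python
import re

# Matches warnings such as:
# [2015-01-06 16:12:34,061][WARN ][indices.cluster   ] [node_141] [ips][4] failed to start shard
# The pattern is an assumption based on the log lines quoted in this post.
FAILED_SHARD = re.compile(
    r"\[(?P<node>[^\]]+)\]\s*"        # node name, e.g. node_141
    r"\[\s*(?P<index>[^\]]+?)\s*\]"   # index name, e.g. ips
    r"\[\s*(?P<shard>\d+)\s*\]\s*"    # shard number, e.g. 4
    r"failed to start shard"
)


def find_failed_shards(log_lines):
    """Return the sorted, de-duplicated (node, index, shard) tuples that failed to start."""
    failed = set()
    for line in log_lines:
        match = FAILED_SHARD.search(line)
        if match:
            failed.add(
                (match.group("node"), match.group("index"), int(match.group("shard")))
            )
    return sorted(failed)
```

Passing an open log file works as well as a list of strings, for example find_failed_shards(open(path_to_log)) where path_to_log is wherever your node writes its log.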

I tried removing the translog of the affected shards, and all ES shards then started successfully.

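3. Analysis and summary

The ES translog records every change made to ES and is a key component of data backup and recovery.

If the machine goes down while the translog is being written, the write does not finish normally and the translog file is not properly terminated: the last record is cut short. Reading it during recovery hits the end of the file early, which causes the EOFException.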
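To make the failure mode concrete, here is a small sketch of how a truncated final record produces an end-of-file error during replay. The record layout (a 4-byte length prefix followed by the payload) is a deliberately simplified assumption for illustration and is not the real ES translog format.

```python
import io
import struct


def write_record(stream, payload):
    """Append one record: 4-byte big-endian length, then the payload bytes."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_exact(stream, size):
    """Read exactly `size` bytes or raise EOFError if the stream ends early."""
    data = stream.read(size)
    if len(data) != size:
        raise EOFError(f"wanted {size} bytes, got {len(data)}")
    return data


def read_records(stream):
    """Yield payloads until a clean end of stream; raise EOFError on a torn record."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end: the previous record was complete
        if len(header) < 4:
            raise EOFError("truncated length header")
        (length,) = struct.unpack(">I", header)
        yield read_exact(stream, length)


buf = io.BytesIO()
write_record(buf, b"doc-1")
write_record(buf, b"doc-2")
torn = buf.getvalue()[:-3]  # simulate a power loss in the middle of the last write

recovered = []
try:
    for payload in read_records(io.BytesIO(torn)):
        recovered.append(payload)
except EOFError as exc:
    print("recovery stopped:", exc)
print(recovered)
```

Only the first record is replayed; the reader fails on the second because its header promises more bytes than the file contains, which mirrors the EOFException inside Translog$Create.readFrom in the stack trace.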

Appendix: Motov's full answer:

In nel's case it was corrupted transaction log. When you run out of disk space sometimes the last transaction cannot be fully written into transaction log and then it fails on recovery. If you see exactly the same error messages, you can try the following:

- shut down elasticsearch cluster

- find all shards that cannot recover by searching log file

- for each shard move its non-zero length translog file into a temporary directory (see explanation below)

- start elasticsearch cluster

- if you see messages for other shards - repeat

If you see message like this:

[2012-06-22 17:36:17,165][WARN ][indices.cluster ] [Cat-Man] [myindex][1] failed to start shard

It means that it cannot recover shard 1 of the index myindex on the node Cat-Man. If you take a look at data/elasticsearch/nodes/0/indices/myindex/1/translog directory, you will find files like this: translog-123456677899 or translog-123456677899.recovering. One of them will have non-zero length. Move it to a temporary directory and try starting the server.

The transaction log files that you will be moving out contain your most recently updated and indexed documents. So, these updates will be lost as a result of this operations, but you should be able to recover the rest of your data.
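The "move the non-zero length translog file" step can be sketched as below. This assumes the directory layout described in the answer (a translog directory inside each shard directory, with files named translog-*); the function name move_translogs and the backup location are hypothetical. Run it only while the node is shut down, and keep the moved files, since they hold the most recent operations.

```python
import shutil
from pathlib import Path


def move_translogs(shard_dir, backup_dir):
    """Move the non-empty translog-* files of one shard into backup_dir.

    shard_dir is a shard directory such as .../indices/myindex/1 (layout assumed
    from the forum answer). Returns the list of destination paths.
    """
    translog_dir = Path(shard_dir) / "translog"
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(translog_dir.glob("translog-*")):
        # Zero-length files are left in place, as the answer suggests.
        if path.is_file() and path.stat().st_size > 0:
            target = backup_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

For the example in the answer, the call would look like move_translogs("data/elasticsearch/nodes/0/indices/myindex/1", "translog-backup/myindex-1"), where the second path is any temporary directory of your choosing.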