Elasticsearch debugging

Our Elasticsearch cluster is a mess. The cluster health status is permanently red, and I've decided to investigate and salvage it if possible, but I don't know where to start. Here is some information about our cluster:

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 91,
  "active_shards" : 91,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 201,
  "number_of_pending_tasks" : 0
}
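
This is the output of the cluster health endpoint; assuming the REST API listens on the default localhost:9200, it can be fetched with:

    curl -s 'localhost:9200/_cluster/health?pretty'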

The 6 nodes:

host               ip         heap.percent ram.percent load node.role master name
es04e.p.comp.net 10.0.22.63            30          22 0.00 d         m      es04e-es
es06e.p.comp.net 10.0.21.98            20          15 0.37 d         m      es06e-es
es08e.p.comp.net 10.0.23.198            9          44 0.07 d         *      es08e-es
es09e.p.comp.net 10.0.32.233           62          45 0.00 d         m      es09e-es
es05e.p.comp.net 10.0.65.140           18          14 0.00 d         m      es05e-es
es07e.p.comp.net 10.0.11.69            52          45 0.13 d         m      es07e-es

You can see right away that I have a huge number of unassigned shards (201). I came across this answer and tried it, and got 'acknowledged: true', but neither set of information posted above changed.
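
For reference, the kind of request that returns 'acknowledged: true' is a cluster settings update. A typical one for unassigned shards re-enables allocation; this is only a sketch of such a request, not necessarily what the linked answer suggested, and it assumes the default localhost:9200:

    # hypothetical example, not the literal command from the linked answer:
    # transiently re-enable shard allocation cluster-wide
    curl -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "all" }
    }'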

Next, I logged into one of the nodes, es04, and looked through the log files. A few lines in the first log file caught my attention:

[2015-05-21 19:44:51,561][WARN ][transport.netty          ] [es04e-es] exception caught on transport layer [[id: 0xbceea4eb]], closing connection

[2015-05-26 15:14:43,157][INFO ][cluster.service          ] [es04e-es] removed {[es03e-es][R8sz5RWNSoiJ2zm7oZV_xg][es03e.p.sojern.net][inet[/10.0.2.16:9300]],}, reason: zen-disco-receive(from master [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]])
[2015-05-26 15:22:28,721][INFO ][cluster.service          ] [es04e-es] removed {[es02e-es][XZ5TErowQfqP40PbR-qTDg][es02e.p.sojern.net][inet[/10.0.2.229:9300]],}, reason: zen-disco-receive(from master [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]])
[2015-05-26 15:32:00,448][INFO ][discovery.ec2            ] [es04e-es] master_left [[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]]], reason [shut_down]
[2015-05-26 15:32:00,449][WARN ][discovery.ec2            ] [es04e-es] master left (reason = shut_down), current nodes: {[es07e-es][etJN3eOySAydsIi15sqkSQ][es07e.p.sojern.net][inet[/10.0.2.69:9300]],[es04e-es][3KFMUFvzR_CzWRddIMdpBg][es04e.p.sojern.net][inet[/10.0.1.63:9300]],[es05e-es][ZoLnYvAdTcGIhbcFRI3H_A][es05e.p.sojern.net][inet[/10.0.1.140:9300]],[es08e-es][FPa4q07qRg-YA7hAztUj2w][es08e.p.sojern.net][inet[/10.0.2.198:9300]],[es09e-es][4q6eACbOQv-TgEG0-Bye6w][es09e.p.sojern.net][inet[/10.0.2.233:9300]],[es06e-es][zJ17K040Rmiyjf2F8kjIiQ][es06e.p.sojern.net][inet[/10.0.1.98:9300]],}
[2015-05-26 15:32:00,450][INFO ][cluster.service          ] [es04e-es] removed {[es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]],}, reason: zen-disco-master_failed ([es01e-es][JzkWq9qwQSGdrWpkOYvbqQ][es01e.p.sojern.net][inet[/10.0.2.237:9300]])
[2015-05-26 15:32:36,741][INFO ][cluster.service          ] [es04e-es] new_master [es04e-es][3KFMUFvzR_CzWRddIMdpBg][es04e.p.sojern.net][inet[/10.0.1.63:9300]], reason: zen-disco-join (elected_as_master)

From this section of the logs, I realized that several nodes (es01, es02, and es03) had been removed from the cluster.

After this point, every log file (about 30 of them) contains only this one line:

[2015-05-26 15:43:49,971][DEBUG][action.bulk              ] [es04e-es] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

I checked all the nodes and confirmed they are running the same ES and Logstash versions. I realize this is a very involved problem, but it would be a huge help if anyone could figure out what is wrong and point me in the right direction.

EDIT:

Indices:
health status index               pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-2015.07.20   5   1   95217146            0     30.8gb         30.8gb
yellow open   logstash-2015.07.12   5   1      66254            0     10.5mb         10.5mb
yellow open   logstash-2015.07.21   5   1   51979343            0     17.8gb         17.8gb
yellow open   logstash-2015.07.17   5   1     184206            0     27.9mb         27.9mb
red    open   logstash-2015.08.18   5   1
red    open   logstash-2015.08.19   5   1
yellow open   logstash-2015.07.25   5   1  116490654            0       55gb           55gb
red    open   logstash-2015.08.11   5   1
red    open   logstash-2015.08.20   5   1
yellow open   logstash-2015.07.28   5   1  171527842            0     79.5gb         79.5gb
red    open   logstash-2015.08.03   5   1
yellow open   logstash-2015.07.26   5   1  130029870            0     61.1gb         61.1gb
red    open   logstash-2015.08.01   5   1
yellow open   logstash-2015.07.18   5   1     143834            0     21.5mb         21.5mb
red    open   logstash-2015.08.05   5   1
yellow open   logstash-2015.07.19   5   1      94908            0       15mb           15mb
yellow open   logstash-2015.07.22   5   1   52295727            0     18.2gb         18.2gb
red    open   logstash-2015.07.29   5   1
red    open   logstash-2015.08.02   5   1
yellow open   logstash-2015.07.16   5   1     185120            0     25.8mb         25.8mb
red    open   logstash-2015.08.04   5   1
yellow open   logstash-2015.07.24   5   1  144885713            0     68.3gb         68.3gb
red    open   logstash-2015.07.30   5   1
yellow open   logstash-2015.07.14   5   1   65650867            0     22.1gb         22.1gb
yellow open   logstash-2015.07.27   5   1  170717799            0     79.3gb         79.3gb
red    open   logstash-2015.07.31   5   1
yellow open   .kibana               1   1          7            0     30.3kb         30.3kb
yellow open   logstash-2015.07.13   5   1      87420            0     13.5mb         13.5mb
yellow open   logstash-2015.07.23   5   1  161453183            0     75.7gb         75.7gb
yellow open   logstash-2015.07.15   5   1     189168            0     34.9mb         34.9mb
yellow open   logstash-2015.07.11   5   1      58411            0      8.9mb          8.9mb

elasticsearch.yml

cloud:
  aws:
    protocol: http
    region: us-east
discovery:
  type: ec2
node:
  name: es04e-es
path:
  data: /var/lib/es-data-elasticsearch

Shards - http://pastie.org/10364963

_cluster/settings?pretty=true

{
  "persistent" : { },
  "transient" : { }
}

Answer 1

The main cause of the red status is obvious: unassigned shards. OK, let's try to fix it (complete curl versions of each step are sketched after the list):

  1. Let's check why the shards are in the unassigned state:

    _cat/shards?h=index,shard,prirep,state,unassigned.reason
    
  2. Check whether the nodes have enough free disk space:

    _cat/allocation?v
    
  3. You can try manually allocating a shard to a data node and check the Elasticsearch response for the full error explanation:

    {
        "commands": [
            {
                "allocate": {
                    "index": "index_name",
                    "shard": shard_num,
                    "node": "datanode_name",
                    "allow_primary": true
                }
            }
        ]
    }
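
Putting the three steps together as concrete curl calls (assuming the REST API on localhost:9200; the index, shard number, and node name in step 3 are filled in with one of the red indices and a node name from the question, and should be adjusted based on the output of steps 1 and 2):

    # step 1: each shard's state plus the reason it is unassigned
    curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason'

    # step 2: disk usage and shard counts per node
    curl -s 'localhost:9200/_cat/allocation?v'

    # step 3: the allocate command is POSTed to the reroute endpoint
    curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
      "commands": [
        { "allocate": {
            "index": "logstash-2015.08.18",
            "shard": 0,
            "node": "es04e-es",
            "allow_primary": true
        } }
      ]
    }'

Be careful with "allow_primary": true: if no copy of the shard's data exists on any node, it brings the shard up as an empty primary, so whatever data was in that shard is lost. Use it only on indices you can afford to rebuild.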
    
