
Troubleshooting and Fixing a Maxed-Out CPU on an Elasticsearch Cluster Node

Problem Description

One day an alert came in: the CPU on one node of our production Elasticsearch cluster had hit 99%. The other nodes in the cluster all looked normal; only this one server's CPU stayed pegged and would not come down.

As shown below:

[Figure: CPU alert]

Elasticsearch Cluster Overview

Production Elasticsearch cluster configuration:

  • Six nodes in total: 3 data nodes, 2 master nodes, 1 client node.
  • Version: 5.6.4

The problem node is node-3.

Due to limited resources, two Elasticsearch nodes are co-located on each machine.

role         ip           host  api port  transport port  hostname  cpu  mem  disk
es master-1  172.x.x.38   da02  9221      9444            da02      4C   16G  500G
es master-2  172.x.x.221  da03  9221      9444            da03      4C   16G  500G
es client    172.x.x.37   da01  9220      9333            da01      4C   16G  500G
es node-1    172.x.x.37   da01  9221      9334            da01      4C   16G  500G
es node-2    172.x.x.38   da02  9220      9333            da02      4C   16G  500G
es node-3    172.x.x.221  da03  9222      9555            da03      4C   16G  500G
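As a quick cluster-side way to confirm which node is hot, the _cat/nodes API can list per-node role, CPU and load. This is a sketch rather than part of the original investigation; the column names follow the 5.x _cat API, and xxx/yyy are placeholder credentials, as elsewhere in this post:

# Per-node role, CPU and load averages (5.x _cat column names)
$ curl -XGET 'localhost:9220/_cat/nodes?v&h=name,ip,node.role,cpu,load_1m,heap.percent' -uxxx:yyy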

Troubleshooting: Approach 1

My first thought was: with the CPU this high, could the node be running out of memory and stuck in constant Full GC?

Commands used to check:

# Use top to find the PID of the Elasticsearch process; press capital P to sort by CPU usage
$ top

# Use jstat to check GC activity; 26299 is the PID of the high-CPU Elasticsearch process
# 2000 means refresh every 2 seconds, 20 means print 20 samples in total
$ jstat -gcutil 26299 2000 20
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 0.00 3.96 57.41 93.08 84.84 139244 4365.998 1010 49.443 4415.441
100.00 0.00 71.47 57.41 93.08 84.84 139244 4365.998 1010 49.443 4415.441
0.00 100.00 32.45 57.51 93.08 84.84 139245 4366.029 1010 49.443 4415.472
0.00 100.00 98.86 57.51 93.08 84.84 139245 4366.029 1010 49.443 4415.472
99.34 0.00 54.29 57.67 93.08 84.84 139246 4366.062 1010 49.443 4415.505
0.00 100.00 7.90 57.82 93.08 84.84 139247 4366.102 1010 49.443 4415.545
0.00 100.00 67.13 57.82 93.08 84.84 139247 4366.102 1010 49.443 4415.545
100.00 0.00 19.60 57.94 93.08 84.84 139248 4366.133 1010 49.443 4415.577
100.00 0.00 79.07 57.94 93.08 84.84 139248 4366.133 1010 49.443 4415.577

Full GC did not increase at all across those 20 samples, and Young GC looked fine as well.
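With GC more or less ruled out, a generic next step for any hot JVM (not something I relied on here, but worth noting) is to map the busiest OS-level threads back to Java stacks with top -Hp and jstack. A sketch, where 26299 is the same PID as above and the thread id is hypothetical:

# Per-thread CPU usage inside the Elasticsearch process (capital P sorts by CPU)
$ top -Hp 26299

# Convert the hottest thread id to hex and find it in a thread dump
$ printf '%x\n' 26350        # 26350 is a hypothetical TID taken from top -Hp output
$ jstack 26299 | grep -A 20 'nid=0x66ee'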

Since the machine still had about 3 GB of free memory, I went ahead and increased this node's heap anyway:

The configuration change:

$ vim config/jvm.options
-Xms8g
-Xmx8g

Next I disabled shard allocation on the cluster, so that after the data node restarted its shards would not be reallocated elsewhere and the cluster could recover quickly:

# Disable shard allocation; xxx is the username, yyy the password
$ curl -XPUT localhost:9220/_cluster/settings -d '{"transient" : {"cluster.routing.allocation.enable" : "none"}}' -uxxx:yyy

# Check the settings
$ curl -XGET localhost:9220/_cluster/settings -uxxx:yyy
{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "none"
        }
      }
    }
  }
}

After the restart, once the cluster was Green again, I checked the node's CPU usage: the problem was still there.

# Re-enable shard allocation
$ curl -XPUT localhost:9220/_cluster/settings -d '{"transient" : {"cluster.routing.allocation.enable" : "all"}}' -uxxx:yyy
# Check the settings
$ curl -XGET localhost:9220/_cluster/settings -uxxx:yyy
{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    }
  }
}

Troubleshooting: Approach 2

Next I checked the node's log output and found the following errors:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$2@3f72e57a on EsThreadPoolExecutor[bulk, queue capac
ity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4c0b0cb6[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 45095]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50) ~[elasticsearch-5.6.4.jar:5.6.4]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_111]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:94) ~[elasticsearch-5.6.4.jar:5.6.4]
... 49 more
[2020-07-24T00:08:23,969][ERROR][o.e.a.b.TransportBulkAction] [node-3] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$2@41dd524d on EsThreadPoolExecutor[bulk, queue capacity = 200,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4c0b0cb6[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 45095]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50) ~[elasticsearch-5.6.4.jar:5.6.4]
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_111]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_111]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:94) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:89) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.ingest.PipelineExecutionService.executeBulkRequest(PipelineExecutionService.java:74) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:508) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:136) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:85) ~[elasticsearch-5.6.4.jar:5.6.4]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.6.4.jar:5.6.4]

The key part:

on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4c0b0cb6[Running, pool size = 4, active threads = 4, queued tasks = 200, completed tasks = 45095]]

The bulk pool has 4 threads in total, all 4 are active, and the queue already holds 200 tasks; on top of that, some tasks are being rejected.

Next, check the thread pools through the Elasticsearch API.

I used Kibana's Dev Tools console for this, which is a bit more convenient:

# List the thread pools
$ GET _cat/thread_pool?v
node_name name active queue rejected
node-3 bulk 4 365 120
node-3 fetch_shard_started 0 0 0
node-3 fetch_shard_store 0 0 0
node-3 flush 0 0 0
node-3 force_merge 0 0 0
node-3 generic 0 0 0
node-3 get 0 0 0
node-3 index 0 0 0
node-3 listener 0 0 0
node-3 management 2 0 0
node-3 refresh 4 425 0
node-3 search 0 0 0
node-3 security-token-key 0 0 0
node-3 snapshot 0 0 0
node-3 warmer 0 0 0
node-3 watcher 0 0 0


# Inspect the threads using the most CPU:
$ GET _nodes/hot_threads

# This lists the top CPU-consuming threads on every node; honestly, I couldn't make much sense of the raw output.
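If the full dump is too noisy, the request can be narrowed to the problem node and a handful of threads using the standard node filter plus the threads and ignore_idle_threads parameters of the hot_threads API; a sketch:

# Only the problem node, top 3 busiest threads, skipping idle ones
$ GET _nodes/node-3/hot_threads?threads=3&ignore_idle_threads=true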

Both the bulk and refresh pools had every active thread busy, their queues were already in the hundreds, and rejections were far from negligible.

Fine then: if the thread pool is full, make it bigger; if the queue is full, make it longer. The Elasticsearch configuration:

$ vim config/elasticsearch.yml
thread_pool:
  bulk:
    size: 5
    queue_size: 500

Reference (the official 5.6 documentation):

https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-threadpool.html

The docs describe the bulk thread pool as follows:

For bulk operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

Warning

thread_pool.bulk.size cannot be set to just anything. According to the official documentation above, its maximum is the number of CPU cores + 1, and the default queue_size is 200.

As before, disable cluster shard allocation (see the settings above), restart the node, and remember to switch shard allocation back on afterwards.
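Once the node is back, the effective pool settings can be double-checked; something like the following should work, where size and queue_size are standard 5.x _cat/thread_pool column headers:

# Confirm the new bulk pool size and queue length took effect
$ GET _cat/thread_pool/bulk?v&h=node_name,name,size,queue_size,active,queue,rejected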

Well, the problem persisted: the CPU was still maxed out.

Troubleshooting: Approach 3

While restarting, I watched the logs of the currently active master node and noticed some warnings:

[2020-07-31T09:20:29,277][WARN ][o.e.g.DanglingIndicesState] [client] [[.monitoring-alerts-6/Jyne0wClRsWjd6g34mbHlQ]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
[2020-07-31T09:20:29,278][WARN ][o.e.g.DanglingIndicesState] [client] [[.monitoring-es-6-2017.11.21/fufshsZgQ-SA8wy3lc1qfg]] can not be imported as a dangling index, as an index with the same name and UUID exist in the index tombstones. This situation is likely caused by copying over the data directory for an index that was previously deleted.
[2020-07-31T09:20:29,278][WARN ][o.e.g.DanglingIndicesState] [client] [[.monitoring-kibana-6-2017.11.21/I_3_Yq1ERy6SW_Q7P3JCKQ]] can not be imported as a dangling index, as an index with the same name and UUID exist in the index tombstones. This situation is likely caused by copying over the data directory for an index that was previously deleted.

Explanation

There are dangling indices on the client node. Querying the API shows that these indices no longer exist, and they are all monitoring indices from back in 2017.
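To confirm these indices really are gone from cluster metadata, a wildcard lookup against _cat/indices works; the pattern below is just an example matching the monitoring index names from the warnings:

# Should return nothing if no matching index exists
$ curl -XGET 'localhost:9220/_cat/indices/.monitoring-*?v' -uxxx:yyy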

Reference:

https://elasticsearch.cn/question/7895

Fix:

Since this is a client node and stores no data of its own, I simply deleted its data directory.

If your node is a data node, make a backup first and delete only the directories of the specific dangling indices under data, identified by index UUID:

.monitoring-alerts-6/Jyne0wClRsWjd6g34mbHlQ

# In the log line above, the string after the / is the index UUID

# Location of the index data on disk
$ cd data/nodes/0/indices/
$ ls
Jyne0wClRsWjd6g34mbHlQ I_3_Yq1ERy6SW_Q7P3JCKQ

After restarting the client node, the master node no longer logged these dangling index warnings. But node-3's CPU was still at 100%.

Troubleshooting: Approach 4 (the Fix)

Looking again at the thread pool output from earlier:

# List the thread pools
$ GET _cat/thread_pool?v
node_name name active queue rejected
node-3 bulk 4 365 120
node-3 fetch_shard_started 0 0 0
node-3 fetch_shard_store 0 0 0
node-3 flush 0 0 0
node-3 force_merge 0 0 0
node-3 generic 0 0 0
node-3 get 0 0 0
node-3 index 0 0 0
node-3 listener 0 0 0
node-3 management 2 0 0
node-3 refresh 4 425 0
node-3 search 0 0 0
node-3 security-token-key 0 0 0
node-3 snapshot 0 0 0
node-3 warmer 0 0 0
node-3 watcher 0 0 0

Both bulk and refresh are under heavy pressure.

Adjustment 1: set index.refresh_interval to 60s on all indices

Reference:

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/indices-update-settings.html

The REST endpoint is /_settings (to update all indices) or {index}/_settings to update one (or more) indices settings.

// Update the setting on all indices:
$ PUT /_settings
{
  "index" : {
    "refresh_interval" : "60s"
  }
}

// Get the settings of a single index
$ GET logstash-2020.07.03/_settings

// Get the settings of all indices
$ GET /_settings

Note

Raising index.refresh_interval (to 60s here) means the newest data in those indices only becomes searchable after a delay of up to one minute.

Adjust the value to suit your own situation.
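If only the write-heavy indices need the longer interval rather than every index, the same setting also accepts an index pattern; logstash-* below is an assumption based on the index names shown earlier:

// Hypothetical: apply the longer refresh interval only to logstash-* indices
$ PUT logstash-*/_settings
{
  "index" : {
    "refresh_interval" : "60s"
  }
}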

Adjustment 2: tune the interval and batch size with which the two Logstash instances send events to Elasticsearch:

Reference:

https://www.leiyawu.com/2018/04/03/elk/

$ vim logstash-5.0.0/config/logstash.yml
pipeline.output.workers: 4   # number of output workers, usually the number of CPU cores
pipeline.batch.size: 500     # number of events sent per batch
pipeline.batch.delay: 10     # how long to wait before dispatching an undersized batch

# The idea: send more events per batch, less often. Adjust to your own situation.

Explanation

Lengthening the interval at which Logstash sends events to Elasticsearch, while increasing the number of events per batch, eases the pressure on the saturated bulk thread pool.

That resolved the maxed-out CPU on this single Elasticsearch node.
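After both adjustments, the earlier thread-pool check is a quick way to verify that the bulk queue has drained and rejections have stopped climbing:

# active should stay below the pool size and queue should hover near zero
$ GET _cat/thread_pool/bulk,refresh?v&h=node_name,name,active,queue,rejected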

Checking the CPU load afterwards:

[Figure: cpu-load]

The problem was finally solved.

Other Optimizations

A few other small optimizations:

  • Lock the heap in memory: bootstrap.memory_lock: true

  • Disable the machine learning module to reduce memory usage: xpack.ml.enabled: false

  • Enable compression on the transport layer for better efficiency: transport.tcp.compress: true

  • Bind the transport layer to a specific IP for better security: transport.host: 172.x.x.221. Likewise, the HTTP API can be bound to a specific IP: network.host: x.x.x.x

  • CORS configuration: http.cors

The resulting configuration of this Elasticsearch node:

network.host: 0.0.0.0
http.port: 9222
cluster.name: microoak5.6
node.name: node-3
discovery.zen.ping.unicast.hosts: ["172.x.x.38:9333","172.x.x.221:9333","172.x.x.37:9333","172.x.x.221:9444","172.x.x.38:9444","172.x.x.37:9334"]
discovery.zen.minimum_master_nodes: 2
node.data: true
node.master: false
http.cors.enabled : true
http.cors.allow-origin : "*.example.com"
http.cors.allow-methods : OPTIONS, HEAD, GET, POST, PUT, DELETE
http.cors.allow-headers : WWW-Authenticate,X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization
bootstrap.memory_lock: true
transport.tcp.port: 9555
transport.host: 172.x.x.221
transport.tcp.compress: true
path.repo: /data/esbackup
thread_pool:
  bulk:
    size: 5
    queue_size: 500
xpack.ml.enabled: false

That wraps up this article; stay tuned for more. If you spot any oversights or mistakes, feel free to point them out in the comments.