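One of our applications uses Elasticsearch (1.4.4) as an in-memory cache. The application is a Java webapp deployed on Tomcat 7 and running on Oracle Java 1.7. The Elasticsearch instance is a single-node setup deployed on the same server.
Since Elasticsearch 1.3.3, we have been seeing roughly 40 Mbit/s of inbound and outbound traffic on the loopback interface between the application and the Elasticsearch node, even while the application is idle.
That is not a lot, but it puts a noticeable load on an otherwise smoothly running system. I do not have a production system with this application installed at hand, so I cannot say exactly how it behaves in production.
Capturing the traffic with tcpdump and analyzing it in Wireshark shows that the Elasticsearch client in the application is constantly asking the node for cluster/node/info, and each request produces a response of about 10k.
It may be completely unrelated, but enabling logging on the server and the client gives us the following:
Elasticsearch server log: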
[2015-05-12 14:45:01,600][INFO ][node ] [Illyana Rasputin] initializing ...
[2015-05-12 14:45:01,608][INFO ][plugins ] [Illyana Rasputin] loaded [], sites []
[2015-05-12 14:45:06,666][INFO ][node ] [Illyana Rasputin] initialized
[2015-05-12 14:45:06,667][INFO ][node ] [Illyana Rasputin] starting ...
[2015-05-12 14:45:06,828][INFO ][transport ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.24.1.128:9300]}
[2015-05-12 14:45:06,851][INFO ][discovery ] [Illyana Rasputin] bkbo_index/TITPDFdtR6SXX5EeOXaidg
[2015-05-12 14:45:09,892][INFO ][cluster.service ] [Illyana Rasputin] new_master [Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[/10.24.1.128:9300]], reason: zen-disco-join (elected_as_master)
[2015-05-12 14:45:09,943][INFO ][http ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.24.1.128:9200]}
[2015-05-12 14:45:09,944][INFO ][node ] [Illyana Rasputin] started
[2015-05-12 14:45:11,283][INFO ][gateway ] [Illyana Rasputin] recovered [2] indices into cluster_state
Elasticsearch client log:
2015-05-12 14:46:40,683 INFO [localhost-startStop-1] PluginsService:<init>:151 [Antiphon the Overseer] loaded [], sites []
2015-05-12 14:46:41,548 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Antiphon the Overseer] node_sampler_interval[5ms]
2015-05-12 14:46:41,594 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Antiphon the Overseer] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,625 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,655 INFO [localhost-startStop-1] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [0] timed out after [6ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-12 14:46:41,658 DEBUG [localhost-startStop-1] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call
2015-05-12 14:46:41,661 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,669 INFO [elasticsearch[Antiphon the Overseer][generic][T#1]] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [1] timed out after [5ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-12 14:46:41,670 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call
2015-05-12 14:46:41,676 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,677 WARN [elasticsearch[Antiphon the Overseer][transport_client_worker][T#2]{New I/O worker #2}] TransportService$Adapter:remove:280 [Antiphon the Overseer] Received response for a request that has timed out, sent [14ms] ago, timed out [9ms] ago, action [cluster:monitor/nodes/info], node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]], id [1]
2015-05-12 14:46:41,682 INFO [localhost-startStop-1] PluginsService:<init>:151 [Ricochet] loaded [], sites []
2015-05-12 14:46:41,722 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Ricochet] node_sampler_interval[5ms]
2015-05-12 14:46:41,733 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Ricochet] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,734 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,759 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,760 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
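Yes, this application has two client connections, which should be fine (according to the developers). These disconnect/reconnect cycles occur about once a minute.
Any clue as to what is going on here? I have already disabled multicast via discovery.zen.ping.multicast.enabled: false.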
Answer 1
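Your clients appear to have joined the cluster. That is fine, although if you use Kibana 4 you may get complaints from Kibana (I'm not sure whether those complaints were specific to the 4 beta).
From your client log: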
2015-05-12 14:46:41,548 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Antiphon the Overseer] node_sampler_interval[5ms]
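5ms seems rather aggressive for sampling the nodes in a cluster. I haven't checked what the value is by default, but my guess is that something has been configured as a number of milliseconds where seconds were expected?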
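As a rough plausibility check, a 5ms interval is in the right range to explain the traffic described in the question. The sketch below assumes one nodes-info round trip per sampling interval per client, that "10k" means about 10,000 bytes per response, and that the two clients sample independently. The class and method names are made up for illustration, and the result is an upper bound, since each sample also takes some time to complete.

```java
public class SamplerTraffic {
    /**
     * Estimated response traffic in Mbit/s for periodic node sampling.
     * Assumes exactly one response of responseBytes per interval per client.
     */
    static double mbitPerSecond(double intervalMs, int responseBytes, int clients) {
        double samplesPerSecond = 1000.0 / intervalMs;
        double bitsPerSecond = samplesPerSecond * responseBytes * 8 * clients;
        return bitsPerSecond / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Interval seen in the client log: node_sampler_interval[5ms]
        System.out.println(mbitPerSecond(5, 10_000, 2));
        // Same setup with a 5 second interval
        System.out.println(mbitPerSecond(5_000, 10_000, 2));
    }
}
```

Under these assumptions the 5ms case comes out at about 32 Mbit/s, the same order of magnitude as the roughly 40 Mbit/s observed, while a 5s interval would be around 0.03 Mbit/s.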
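At this point you need to look at the settings for the client API, although the client may be picking this setting up from the cluster (since it is becoming part of the cluster).
Presumably you are using the Java API provided by elastic.co?
Have you configured client.transport.nodes_sampler_interval somewhere?
Are you using compatible client/server versions? The documentation for the Java client API says:
Please note that you are encouraged to use the same version on client and cluster sides. You may hit some incompatibility issues when mixing major versions.
I wouldn't be surprised if the unit of measure changed between versions, although the documentation does say the default is 5s.
Check your elasticsearch.yaml and your code for instances of node_sampler_interval. You may need to replace a naked 5 with 5s?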
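To illustrate why a missing unit matters, here is a small self-contained sketch of a duration parser that falls back to milliseconds when no suffix is given. It is a hypothetical helper, not Elasticsearch's own parsing code, and the assumption that a bare number is read as milliseconds is the suspicion raised above, not something confirmed for your version.

```java
import java.util.concurrent.TimeUnit;

public class IntervalUnits {
    /**
     * Hypothetical parser for values such as "5s", "5ms", "2m" or a bare "5".
     * A value without a unit suffix is ASSUMED to mean milliseconds.
     */
    static long parseMillis(String value) {
        String v = value.trim();
        if (v.endsWith("ms")) { // must be checked before the "s" and "m" suffixes
            return Long.parseLong(v.substring(0, v.length() - 2));
        }
        if (v.endsWith("s")) {
            return TimeUnit.SECONDS.toMillis(Long.parseLong(v.substring(0, v.length() - 1)));
        }
        if (v.endsWith("m")) {
            return TimeUnit.MINUTES.toMillis(Long.parseLong(v.substring(0, v.length() - 1)));
        }
        return Long.parseLong(v); // no unit given: treated as milliseconds
    }

    public static void main(String[] args) {
        System.out.println("\"5\"  -> " + parseMillis("5") + " ms");
        System.out.println("\"5s\" -> " + parseMillis("5s") + " ms");
    }
}
```

With a parser like this, a naked 5 and 5s differ by a factor of 1000, which would match both the node_sampler_interval[5ms] line and the "timed out after [5ms]" messages in the client log.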