CMS 垃圾收集运行后,WebLogic 服务器似乎冻结

CMS 垃圾收集运行后,WebLogic 服务器似乎冻结

我们正在使用 Java Hotspot 1.6 和 10.3.6 WebLogic 集群,其中 2 个节点在物理负载均衡器后面运行 OSB 11.1.1.7:

java version "1.6.0_91"
Java(TM) SE Runtime Environment (build 1.6.0_91-b13)
Java HotSpot(TM) 64-Bit Server VM (build 20.91-b07, mixed mode)

我们已经调整了 JVM 以使用 CMS 并避免长时间运行垃圾收集:

-XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:NewRatio=4 -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60 -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark

大多数 GC 的暂停阶段(初始标记和重新标记)花费的时间都不到一秒钟,我们没有看到任何失败的 CMS 运行,也没有在 gc 日志中看到完整的 GC。

然而,在 CMS GC 执行完成后的几秒钟内,当客户端尝试连接 OSB 时,我们会收到超时错误(特别是在具有大量调用 OSB 的复合材料的 Oracle SOA 安装中):

OSB 1 垃圾收集.日志:

> 2016-07-28T08:31:52.656-0300: 492247.689: [GC
> 2016-07-28T08:31:52.656-0300: 492247.690: [ParNew:
> 1027265K->22014K(1132416K), 0.0412613 secs]
> 4046262K->3043322K(6165632K), 0.0420791 secs] [Times: user=0.25
> sys=0.00, real=0.04 secs]  2016-07-28T08:31:52.701-0300: 492247.735:
> [GC [1 CMS-initial-mark: 3021307K(5033216K)] 3043844K(6165632K),
> 0.0683781 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]  2016-07-28T08:31:52.770-0300: 492247.804: [CMS-concurrent-mark-start]
> 2016-07-28T08:31:56.971-0300: 492252.005: [CMS-concurrent-mark:
> 4.201/4.201 secs] [Times: user=10.55 sys=0.16, real=4.20 secs]  2016-07-28T08:31:56.971-0300: 492252.005:
> [CMS-concurrent-preclean-start] 2016-07-28T08:31:57.064-0300:
> 492252.098: [CMS-concurrent-preclean: 0.090/0.093 secs] [Times: user=0.12 sys=0.00, real=0.09 secs]  2016-07-28T08:31:57.064-0300:
> 492252.098: [CMS-concurrent-abortable-preclean-start] 2016-07-28T08:32:01.198-0300: 492256.231: [GC
> 2016-07-28T08:32:01.198-0300: 492256.232: [ParNew:
> 1028128K->30711K(1132416K), 0.0502326 secs]
> 4049435K->3053383K(6165632K), 0.0508313 secs] [Times: user=0.34
> sys=0.00, real=0.05 secs] CMS: abort preclean due to time
> 2016-07-28T08:32:02.128-0300: 492257.161:
> [CMS-concurrent-abortable-preclean: 4.027/5.063 secs] [Times:
> user=7.62 sys=0.24, real=5.06 secs]  2016-07-28T08:32:02.134-0300:
> 492257.167: [GC[YG occupancy: 186361 K (1132416 K)]2016-07-28T08:32:02.134-0300: 492257.167: [GC
> 2016-07-28T08:32:02.134-0300: 492257.167: [ParNew:
> 186361K->25052K(1132416K), 0.0231613 secs]
> 3209034K->3049538K(6165632K), 0.0236069 secs] [Times: user=0.17
> sys=0.00, real=0.02 secs]  2016-07-28T08:32:02.158-0300: 492257.191:
> [Rescan (parallel) , 0.0289939 secs]2016-07-28T08:32:02.187-0300:
> 492257.220: [weak refs processing, 0.3412320 secs] [1 CMS-remark: 3024485K(5033216K)] 3049538K(6165632K), 0.4179878 secs] [Times:
> user=0.67 sys=0.00, real=0.42 secs]  2016-07-28T08:32:02.552-0300:
> 492257.586: [CMS-concurrent-sweep-start] 2016-07-28T08:32:07.144-0300: 492262.177: [CMS-concurrent-sweep: 4.591/4.591 secs] [Times: user=14.05 sys=0.22, real=4.59 secs]  2016-07-28T08:32:07.146-0300:
> 492262.180: [CMS-concurrent-reset-start] 2016-07-28T08:32:07.283-0300: 492262.316: [CMS-concurrent-reset: 0.137/0.137 secs] [Times: user=0.36 sys=0.01, real=0.14 secs]  2016-07-28T08:32:36.201-0300: 492291.234:
> [GC 2016-07-28T08:32:36.202-0300: 492291.234: [ParNew:
> 1030955K->45449K(1132416K), 0.0671313 secs]
> 2269248K->1285059K(6165632K), 0.0676485 secs] [Times: user=0.41
> sys=0.00, real=0.07 secs]

我们在 OSB WebLogic 访问日志中注意到了这种奇怪的行为,通常在 CMS 启动后 5 或 10 秒,访问日志中的请求之间会有 20 多秒的间隔,而通常我们每秒会看到数十个调用。

OSB1 访问日志:

2016-07-28 08:32:03 POST /service1/ 200 0.043  
2016-07-28 08:32:03 POST /service1/ 200 0.074   
2016-07-28 08:32:03 POST /service1/ 200 0.045   
2016-07-28 08:32:03 POST /service1/ 200 0.053   
2016-07-28 08:32:35 POST /service1/ 200 0.076

同时,从 SOA 服务器发出的一些调用失败:

SOA 日志错误:https://esburl.mycompany.com/service1“成功,原因是:无法调用端点 URI”https://esburl.mycompany.com/service1“成功,原因是:javax.xml.soap.SOAPException:javax.xml.soap.SOAPException:消息发送失败:读取超时>

OSB weblogic 输出日志中没有报告任何错误。我们在 GC 运行时进行了一些线程转储,但无法查明问题所在。我们认为,如果初始标记和重新标记足够快,我们就不会看到这种问题,但我们仍然怀疑这是 JVM、WebLogic 还是应用程序问题(OSB)。有人在 WebLogic 或其他应用程序服务器上看到过类似的事情吗?

谢谢!

编辑:

该问题似乎与 SSL 有关:

~$ curl -v -k https://mycompany.com:9001/service1/ -o /dev/null 
* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying x.x.x.x...
* Connected to mycompany.com (x.x.x.x) port 9001 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, **Client hello** (1):
} [data not shown]
  0     0    0     0    0     0      0      0 --:--:--  **0:00:16** --:--:--     0* 
  SSLv3, TLS handshake, **Server hello** (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using EDH-RSA-DES-CBC3-SHA
* Server certificate:
*    subject: C=xx; ST=xx; L=xx; O=xx; OU=Tech; CN=mycompany.com
*    start date: 2016-02-25 13:17:10 GMT
*    expire date: 2021-02-23 13:17:10 GMT
*    issuer: DC=xx; DC=xx; DC=xx; CN=xx
*    SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
> GET /service1 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: mycompany.com:9001
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 03 Aug 2016 15:46:41 GMT
< Transfer-Encoding: chunked
< Content-Type: text/xml; charset=UTF-8
< X-Powered-By: Servlet/2.5 JSP/2.1

似乎在执行 GC 时,服务器需要花费大量时间才能完成 SSL 握手(在本例中为 16 秒)但我不知道为什么。有人知道与 GC 和 SSL 协议相关的任何 Java 问题吗?

问候。

答案1

经过长时间的调查,我们终于找到了问题的原因,在 CMS GC 运行时,我们进行了几次线程转储,并在 5 或 6 个线程中发现了这个显眼的堆栈:

"[ACTIVE] ExecuteThread: '35' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=3 tid=0x000000001333c000 nid=0xa1 runnable [0xfffffd7e23dd1000] 
   java.lang.Thread.State: RUNNABLE 
at sun.security.pkcs11.wrapper.PKCS11.C_DestroyObject(Native Method) 
at sun.security.pkcs11.SessionKeyRef.dispose(P11Key.java:1043) 
at sun.security.pkcs11.SessionKeyRef.drainRefQueueBounded(P11Key.java:1019) 
at sun.security.pkcs11.SessionKeyRef. (P11Key.java:1034) 
at sun.security.pkcs11.P11Key. (P11Key.java:98) 
at sun.security.pkcs11.P11Key$P11SecretKey. (P11Key.java:379) 
at sun.security.pkcs11.P11Key.secretKey(P11Key.java:271) 
at sun.security.pkcs11.P11TlsRsaPremasterSecretGenerator.engineGenerateKey(P11TlsRsaPremasterSecretGenerator.java:84) 
at javax.crypto.KeyGenerator.generateKey(DashoA13*..) 
at com.sun.net.ssl.internal.ssl.RSAClientKeyExchange. (RSAClientKeyExchange.java:68) 
at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverHelloDone(ClientHandshaker.java:807) 
at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:286) 
at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:608) 
at com.sun.net.ssl.internal.ssl.Handshaker$1.run(Handshaker.java:548) 
at java.security.AccessController.doPrivileged(Native Method) 
at com.sun.net.ssl.internal.ssl.Handshaker$DelegatedTask.run(Handshaker.java:941) 
- locked <0xfffffd7e402b5ca0> (a com.sun.net.ssl.internal.ssl.SSLEngineImpl)

该代码的执行出现在相隔 20-25 秒的转储中的同一个线程中,因此它看起来是一个很有可能的嫌疑。

查看 Oracle 支持文档和以下链接后:

https://bugs.openjdk.java.net/browse/JDK-8059337 https://stackoverflow.com/questions/31188663/disabling-pkcs11-solaris-implementation

我们决定使用“-Dsun.security.pkcs11.enable-solaris=false”禁用 Solaris pkcs11 实现。一旦我们这样做,所有超时都消失了,GC 之后访问日志中的间隙不再存在。

我不确定使用 sun pcks11 相对于其他提供商是否有任何优势,但禁用它似乎可以解决我们的问题,并且到目前为止还没有发现由更改产生的新问题,所以我们将坚持使用它。

相关内容