Cloud SQL MySQL 8.0.26 在 App engine flex + Node.js 应用中出现间歇性“读取通信数据包超时”错误

Cloud SQL MySQL 8.0.26 在 App engine flex + Node.js 应用中出现间歇性“读取通信数据包超时”错误

我们在 App engine Flexible 上运行 node.js 服务器,使用共享专用网络连接到 mysql。(app engine 实例部署到 mysql 所在的同一个 vpc 子网中)。

因此在 app.yaml 中的应用引擎端:

network:
  name: projects/<projectId>/global/networks/<subnetwork>
  subnetwork_name: <subnetwork>

mysql 实例是 MYSQL_8_0、db-n1-standard-1 实例,附加了 10GB 存储空间,可用区域,启用了自动备份。mysql 数据库公开私有 IP,仅允许 SSL 连接。通常,该实例几乎未使用,最大使用率为 6%,并且剩余大量可用存储空间(即使启用了自动存储增加)。

我们的节点服务器使用 Sequelize 连接到 mysql,并设置了所有正确的 SSL 配置:

    sequelizeClient = new Sequelize(MYSQL_CONF.database, MYSQL_CONF.user, dbPass, {
      host: MYSQL_CONF.host,
      dialect: "mysql",
      dialectOptions: {
        ssl: {
          key: cKey,
          cert: cCert,
          ca: cCA,
        },
      },
    });

MYSQL_CONF.host 是用于连接的私有 IP,例如:10.75.0.3。

通常,此设置运行良好,但间歇性地(发生过两次)我们开始收到 mysql 连接超时,日志中出现以下消息:

在mysql端:

"2022-08-01T08:33:48.239519Z 1899 [Note] [MY-010914] [Server] Got timeout reading communication packets"

在节点端:

"SequelizeConnectionError: connect ETIMEDOUT
    at ConnectionManager.connect (/app/node_modules/sequelize/lib/dialects/mysql/connection-manager.js:102:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ConnectionManager._connect (/app/node_modules/sequelize/lib/dialects/abstract/connection-manager.js:220:24)

但是也:

"SequelizeConnectionError: connect EPIPE 10.75.0.3:3306
    at ConnectionManager.connect (/app/node_modules/sequelize/lib/dialects/mysql/connection-manager.js:102:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ConnectionManager._connect (/app/node_modules/sequelize/lib/dialects/abstract/connection-manager.js:220:24)"

和:

"SequelizeConnectionAcquireTimeoutError: Operation timeout
    at ConnectionManager.getConnection (/app/node_modules/sequelize/lib/dialects/abstract/connection-manager.js:204:15)
    at runMicrotasks (<anonymous>)
    at runNextTicks (node:internal/process/task_queues:61:5)
    at processTimers (node:internal/timers:497:9)
    at async /app/node_modules/sequelize/lib/sequelize.js:301:26
    at async MySQLQueryInterface.select (/app/node_modules/sequelize/lib/dialects/abstract/query-interface.js:407:12)
    at async Function.findAll (/app/node_modules/sequelize/lib/model.js:1134:21)
    at async Function.findOne (/app/node_modules/sequelize/lib/model.js:1228:12)
    at async readUser (/app/dist/web/services/user.js:35:18)
    at async getCurrentUser (/app/dist/web/controllers/user.js:49:20)"

还:

"SequelizeConnectionError: 139662677354432:error:1408F10B:SSL routines:ssl3_get_record:wrong version number:../deps/openssl/openssl/ssl/record/ssl3_record.c:332:

    at ConnectionManager.connect (/app/node_modules/sequelize/lib/dialects/mysql/connection-manager.js:102:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ConnectionManager._connect (/app/node_modules/sequelize/lib/dialects/abstract/connection-manager.js:220:24)"

和:

"Error: Client network socket disconnected before secure TLS connection was established
    at connResetException (node:internal/errors:691:14)
    at TLSSocket.onConnectEnd (node:_tls_wrap:1585:19)
    at TLSSocket.emit (node:events:402:35)
    at endReadableNT (node:internal/streams/readable:1343:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)"

和:

"SequelizeConnectionRefusedError: connect ECONNREFUSED 10.75.0.3:3306
    at ConnectionManager.connect (/app/node_modules/sequelize/lib/dialects/mysql/connection-manager.js:92:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ConnectionManager._connect (/app/node_modules/sequelize/lib/dialects/abstract/connection-manager.js:220:24)"

这种情况在生产环境中发生过一次,持续了大约 8 个小时,在我们的开发环境中又发生过一次,持续了大约相同的时间。这两个环境具有相同的基础结构,但位于不同的 gcp 项目上。

只是为了确保我们的服务和数据库之间的连接通常是稳定的并且一切运行良好,只是性能会在某个时间下降几个小时。

没有未提交的事务,配置最大连接数甚至没有接近实现 - 发生此问题时最大打开连接数为 21,最大数量为 4030。没有更改 MySQL 配置/数据库标志,一切都处于 Cloud SQL 留下的默认状态。通常,SQL 命令非常基础,并且从未达到最大通信负载。

这只是 GCP 性能下降还是用户在使用其服务时出现错误?还有其他人遇到过这些问题吗?

变量:

SHOW GLOBAL VARIABLES LIKE '%time%';
Variable_name                  |Value    |
-------------------------------+---------+
binlog_max_flush_queue_time    |0        |
connect_timeout                |10       |
default_password_lifetime      |0        |
delayed_insert_timeout         |300      |
explicit_defaults_for_timestamp|ON       |
flush_time                     |0        |
have_statement_timeout         |YES      |
innodb_flush_log_at_timeout    |1        |
innodb_lock_wait_timeout       |50       |
innodb_old_blocks_time         |1000     |
innodb_rollback_on_timeout     |OFF      |
interactive_timeout            |28800    |
lc_time_names                  |en_US    |
lock_wait_timeout              |31536000 |
log_timestamps                 |UTC      |
long_query_time                |10.000000|
max_execution_time             |0        |
net_read_timeout               |30       |
net_write_timeout              |60       |
regexp_time_limit              |32       |
replica_net_timeout            |30       |
rpl_stop_replica_timeout       |31536000 |
rpl_stop_slave_timeout         |31536000 |
slave_net_timeout              |30       |
slow_launch_time               |2        |
system_time_zone               |UTC      |
time_zone                      |SYSTEM   |
wait_timeout                   |28800    |
SHOW GLOBAL STATUS LIKE 'aborted%';
Variable_name   |Value|
----------------+-----+
Aborted_clients |4    |
Aborted_connects|49   |
SHOW GLOBAL STATUS LIKE '%flush%';
Variable_name                   |Value |
--------------------------------+------+
Com_flush                       |196   |
Flush_commands                  |3     |
Innodb_buffer_pool_pages_flushed|313855|
Key_blocks_not_flushed          |0     |
Not_flushed_delayed_rows        |0     |
Uptime_since_flush_status       |176244|
SHOW GLOBAL STATUS LIKE 'uptime%';
Variable_name            |Value |
-------------------------+------+
Uptime                   |176264|
Uptime_since_flush_status|176264|
SHOW GLOBAL VARIABLES LIKE '%flush%';
Variable_name                     |Value   |
----------------------------------+--------+
binlog_max_flush_queue_time       |0       |
flush                             |OFF     |
flush_time                        |0       |
innodb_adaptive_flushing          |ON      |
innodb_adaptive_flushing_lwm      |10      |
innodb_flush_log_at_timeout       |1       |
innodb_flush_log_at_trx_commit    |1       |
innodb_flush_method               |O_DIRECT|
innodb_flush_neighbors            |2       |
innodb_flush_sync                 |ON      |
innodb_flushing_avg_loops         |5       |
innodb_idle_flush_pct             |100     |
innodb_log_wait_for_flush_spin_hwm|400     |
SHOW GLOBAL STATUS LIKE '%rollback%';
Variable_name             |Value|
--------------------------+-----+
Com_rollback              |2    |
Com_rollback_to_savepoint |0    |
Com_xa_rollback           |0    |
Handler_rollback          |4    |
Handler_savepoint_rollback|0    |
SHOW GLOBAL STATUS LIKE '%savepoint%'; 
Variable_name             |Value|
--------------------------+-----+
Com_release_savepoint     |0    |
Com_rollback_to_savepoint |0    |
Com_savepoint             |0    |
Handler_savepoint         |0    |
Handler_savepoint_rollback|0    |

答案1

每秒速率 = RPS 每小时速率 = RPHr

根据今天发布的信息,

针对您的 Google Cloud Database 标志需要考虑的建议,将应用于所涉及的每个服务器。

connect_timeout=60  # from 10 to minimize aborted_connects of 49 RPHr 1 
innodb_old_blocks_time=60000  # from 1000 for 60 sec age out to reduce select_scan RPS
net_read_timeout=60  # from 30 to tolerate delays longer
regexp_time_limit=10000  # 32 milliseconds to ~ 10 second limit
replica_net_timeout=60  # from 30 to tolerate delays longer
slave_net_timeout=60  # from 30 to tolerate delays longer
slow_launch_time=10  # from 2 to tolerate longer launch time.

查看个人资料以获取联系信息和免费实用脚本,以协助性能调整。

相关内容