Google Cloud external load balancer backend services randomly fail with 502 Server Error

2024-6-1

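I have the following external Google Cloud load balancer configuration:

GlobalNetworkEndpointGroupToClusterByIp is an Internet NEG of type INTERNET_IP_PORT that points to the IP of the Kubernetes cluster.
GlobalNetworkEndpointGroupToManagedS3 is an Internet NEG of type INTERNET_FQDN_PORT that points to the Yandex managed S3 service.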

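For some reason, some of the backend services do not work. When I try to connect to them, they respond with an HTML page showing a 502 Server Error:

Error: Server Error

The server encountered a temporary error and could not complete your request.

Please try again in 30 seconds.

The logs of the failing backend services always contain the following error: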

jsonPayload: {
  cacheId: "GRU-c0ee45d8"
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
  statusDetails: "failed_to_pick_backend"
}

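Requests to these backend services fail within 1 ms (according to the logs), so it looks like the load balancer does not even try to connect to the IP of my Kubernetes cluster or to the managed S3 and fails immediately.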

At the time of posting this question, the S3 and Imgproxy backend services are healthy, but the other services are not working.

If I redeploy everything, a different set of services may fail, for example:

API and Docs work, the others fail
API, Docs, FPS, and Imgproxy work, but S3 fails
S3 works, all the others fail

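So it is completely random, and I do not understand why it happens. If I am lucky, all backend services work after a redeploy. It is also possible that none of them work.

The Kubernetes cluster is running fine and accepts connections, and the managed S3 works as well. This looks like a bug, but I could not find anything about it by searching.

My Terraform configuration is as follows: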

resource "google_compute_global_network_endpoint_group" "kubernetes-cluster" {
  name                  = "kubernetes-cluster-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_IP_PORT"

  depends_on = [
    module.kubernetes-resources
  ]
}

resource "google_compute_global_network_endpoint" "kubernetes-cluster" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.kubernetes-cluster.name
  port                          = 80
  ip_address                    = yandex_vpc_address.kubernetes.external_ipv4_address.0.address
}

resource "google_compute_global_network_endpoint_group" "s3" {
  name                  = "s3-${var.ENVIRONMENT_NAME}"
  network_endpoint_type = "INTERNET_FQDN_PORT"
}

resource "google_compute_global_network_endpoint" "s3" {
  global_network_endpoint_group = google_compute_global_network_endpoint_group.s3.name
  port                          = 443
  fqdn                          = trimprefix(local.s3.endpoint, "https://")
}

resource "google_compute_backend_service" "s3" {
  name = "s3-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.s3.self_link
  }

  custom_request_headers = [
    "Host:${google_compute_global_network_endpoint.s3.fqdn}"
  ]

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = false
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "https"
  protocol    = "HTTPS"
  timeout_sec = 60
}

resource "google_compute_backend_service" "imgproxy" {
  name = "imgproxy-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = false
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_backend_service" "api" {
  name = "api-${var.ENVIRONMENT_NAME}"

  custom_request_headers = [
    "Access-Control-Allow-Origin:${var.ALLOWED_CORS_ORIGIN}"
  ]

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_backend_service" "front" {
  name = "front-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  cdn_policy {
    cache_key_policy {
      include_host         = true
      include_protocol     = false
      include_query_string = true
    }
  }

  enable_cdn            = true
  load_balancing_scheme = "EXTERNAL"

  log_config {
    enable      = true
    sample_rate = 1.0
  }

  port_name   = "http"
  protocol    = "HTTP"
  timeout_sec = 60
}

resource "google_compute_url_map" "default" {
  name            = "default-${var.ENVIRONMENT_NAME}"
  default_service = google_compute_backend_service.front.self_link

  host_rule {
    hosts = [
      local.hosts.api,
      local.hosts.fps
    ]
    path_matcher = "api"

  }

  host_rule {
    hosts = [
      local.hosts.s3
    ]
    path_matcher = "s3"
  }

  host_rule {
    hosts = [
      local.hosts.imgproxy
    ]
    path_matcher = "imgproxy"
  }

  path_matcher {
    default_service = google_compute_backend_service.api.self_link
    name            = "api"
  }

  path_matcher {
    default_service = google_compute_backend_service.s3.self_link
    name            = "s3"
  }

  path_matcher {
    default_service = google_compute_backend_service.imgproxy.self_link
    name            = "imgproxy"
  }

  test {
    host    = local.hosts.docs
    path    = "/"
    service = google_compute_backend_service.front.self_link
  }

  test {
    host    = local.hosts.api
    path    = "/"
    service = google_compute_backend_service.api.self_link
  }

  test {
    host    = local.hosts.fps
    path    = "/"
    service = google_compute_backend_service.api.self_link
  }

  test {
    host    = local.hosts.s3
    path    = "/"
    service = google_compute_backend_service.s3.self_link
  }

  test {
    host    = local.hosts.imgproxy
    path    = "/"
    service = google_compute_backend_service.imgproxy.self_link
  }
}

# See: https://github.com/hashicorp/terraform-provider-google/issues/5356
resource "random_id" "managed-certificate-name" {
  byte_length = 4
  prefix      = "default-${var.ENVIRONMENT_NAME}-"

  keepers = {
    domains = join(",", values(local.hosts))
  }
}

resource "google_compute_managed_ssl_certificate" "default" {
  name = random_id.managed-certificate-name.hex

  lifecycle {
    create_before_destroy = true
  }

  managed {
    domains = values(local.hosts)
  }
}

resource "google_compute_ssl_policy" "default" {
  name    = "default-${var.ENVIRONMENT_NAME}"
  profile = "MODERN"
}

resource "google_compute_target_https_proxy" "default" {
  name       = "default-${var.ENVIRONMENT_NAME}"
  url_map    = google_compute_url_map.default.self_link
  ssl_policy = google_compute_ssl_policy.default.self_link
  ssl_certificates = [
    google_compute_managed_ssl_certificate.default.self_link
  ]
}

resource "google_compute_global_forwarding_rule" "default" {
  name                  = "default-${var.ENVIRONMENT_NAME}"
  load_balancing_scheme = "EXTERNAL"
  port_range            = "443-443"
  target                = google_compute_target_https_proxy.default.self_link
}

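Update: I found that recreating the NEGs fixes the issue:

Wait for Terraform to finish the deployment.
Create a NEG with the same configuration through the Google Cloud Platform Console.
Edit the backend service to use the newly created NEG.
It works!

But this is definitely a hack, and it does not seem possible to automate it with Terraform. I will keep investigating the issue.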

Answer 1

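Glad to hear that your issue has been resolved. I understand you got there by creating the NEG manually through the GCP Console and then editing the backend service, rather than by using Terraform.

The most probable cause of this issue seems to be a race condition. In Terraform, resources are usually defined as a chain in which each resource depends on another. In a configuration like this one, both the backend service creation and the network endpoint (NE) attachment depend on the NEG creation, but not on each other, so the two operations tend to run in parallel. In that case the backend service may not pick up the NE attachment correctly, because the state of the Internet NEG is read when the backend service is created or updated. The NE attachment therefore has to happen before the backend service is created.

So, when creating the backend service in Terraform, we have to declare it as dependent on the NE attachment with the depends_on meta-argument [1], so that the backend service is created only after the NE attachment.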

[1] https://www.terraform.io/docs/language/meta-arguments/depends_on.html
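
The suggestion above could look roughly like the following for the "api" backend service from the question. This is a minimal sketch, not a confirmed fix: it reuses the resource names from the question's configuration, omits custom_request_headers and log_config for brevity, and assumes that an explicit depends_on on the google_compute_global_network_endpoint resource is enough to make Terraform attach the endpoint before it creates the backend service.

```hcl
# Sketch only: the question's "api" backend service with an explicit
# dependency on the network endpoint attachment added.
resource "google_compute_backend_service" "api" {
  name = "api-${var.ENVIRONMENT_NAME}"

  backend {
    group = google_compute_global_network_endpoint_group.kubernetes-cluster.self_link
  }

  load_balancing_scheme = "EXTERNAL"
  port_name             = "http"
  protocol              = "HTTP"
  timeout_sec           = 60

  # Assumption: without this, the backend service depends only on the NEG,
  # so Terraform is free to create it in parallel with the endpoint.
  # With it, the endpoint is attached to the NEG first.
  depends_on = [
    google_compute_global_network_endpoint.kubernetes-cluster
  ]
}
```

The same depends_on entry would presumably be added to the other backend services that use the kubernetes-cluster NEG ("imgproxy" and "front").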

Hope this answers your question.
