Auto scaling for an ECS service

Asked by: x89 · Asked: 11/17/2023 · Last edited by: x89 · Updated: 11/20/2023 · Views: 99

Q:

The bounty expires in 2 days. Answers to this question are eligible for a +50 reputation bounty. x89 wants to draw more attention to this question.

I have set up server-side GTM following this guide: https://aws-solutions-library-samples.github.io/advertising-marketing/using-google-tag-manager-for-server-side-website-analytics-on-aws.html

I'm using an AWS ECS task definition and service. On top of that, I use Snowbridge (the Snowplow client) to send data from AWS Kinesis to GTM via HTTP POST requests.

When the data volume is high, I occasionally get 502 errors from GTM. If I filter the data and reduce the amount forwarded to GTM, the errors stop. What can I change on the GTM side to make sure it can handle a large volume of data? Can I use auto scaling for ECS?

I have already set parameters like these:

deployment_maximum_percent = 200

deployment_minimum_healthy_percent = 50

but the problem persists.

This is roughly what my GTM configuration looks like:

resource "aws_ecs_cluster" "gtm" {
  name = "gtm"
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_ecs_task_definition" "PrimaryServerSideContainer" {
  family                   = "PrimaryServerSideContainer"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 2048
  memory                   = 4096
  execution_role_arn       = aws_iam_role.gtm_container_exec_role.arn
  task_role_arn            = aws_iam_role.gtm_container_role.arn
  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "X86_64"
  }
  container_definitions = <<TASK_DEFINITION
  [
  {
    "name": "primary",
    "image": "gcr.io/cloud-tagging-10302018/gtm-cloud-image",
    "environment": [
      {
        "name": "PORT",
        "value": "80"
      },
      {
        "name": "PREVIEW_SERVER_URL",
        "value": "${var.PREVIEW_SERVER_URL}"
      },
      {
        "name": "CONTAINER_CONFIG",
        "value": "${var.CONTAINER_CONFIG}"
      }
    ],
    "cpu": 1024,
    "memory": 2048,
    "essential": true,
    "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-group": "gtm-primary",
            "awslogs-create-group": "true",
            "awslogs-region": "eu-central-1",
            "awslogs-stream-prefix": "ecs"
          }
        },
    "portMappings" : [
        {
          "containerPort" : 80,
          "hostPort"      : 80
        }
      ]
  }
]
TASK_DEFINITION
}


resource "aws_ecs_service" "PrimaryServerSideService" {
  name             = var.primary_service_name
  cluster          = aws_ecs_cluster.gtm.id
  task_definition  = aws_ecs_task_definition.PrimaryServerSideContainer.id
  desired_count    = var.primary_service_desired_count
  launch_type      = "FARGATE"
  platform_version = "LATEST"

  scheduling_strategy = "REPLICA"

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 50

  network_configuration {
    assign_public_ip = true
    security_groups  = [aws_security_group.gtm-security-group.id]
    subnets          = data.aws_subnets.private.ids
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.PrimaryServerSideTarget.arn
    container_name   = "primary"
    container_port   = 80
  }

  lifecycle {
    ignore_changes = [task_definition]
  }
}

resource "aws_lb" "PrimaryServerSideLoadBalancer" {
  name               = "PrimaryServerSideLoadBalancer"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.gtm-security-group.id]
  subnets            = data.aws_subnets.public.ids

  enable_deletion_protection = false
}
....


I also tried adding these:

resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 4
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.gtm.name}/${aws_ecs_service.PrimaryServerSideService.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_policy" {
  name               = "scale-down"
  policy_type        = "StepScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = 60
    metric_aggregation_type = "Maximum"

    step_adjustment {
      metric_interval_upper_bound = 0
      scaling_adjustment          = -1
    }
  }
}

But the 502 errors persist.

amazon-web-services terraform amazon-ecs google-tag-manager autoscaling

Answers:

2 votes · Dmytro Sirant · 11/20/2023 · #1

You are looking in the right direction; there are just two things left to do:

  1. You need to decide on a metric that tells you whether the service needs to scale out (most likely CPU utilization)
  2. Update your resource "aws_appautoscaling_policy" "ecs_policy" to scale based on the metric from point 1

At the moment, your ecs_policy has no metric to scale on.

Here is an example:

resource "aws_appautoscaling_policy" "ecs_target_cpu" {
  name               = "application-scaling-policy-cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service_target.service_namespace
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 80
  }
  depends_on = [aws_appautoscaling_target.ecs_service_target]
}
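
Note that this example uses a scaling target named aws_appautoscaling_target.ecs_service_target; if you keep the aws_appautoscaling_target.ecs_target resource from the question, the resource_id, scalable_dimension, service_namespace and depends_on references should point at that resource instead.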

Comments

1 vote · x89 · 11/20/2023
I have tried those settings, with a target value of 300, but I still occasionally get 502 errors. Perhaps not as often as before, but they are still there. For example: MsgSent: 2421, MsgFailed: 3. What other parameters could I use besides target_value and min-max capacity (which is 15 in my setup atm)?
1 vote · Dmytro Sirant · 11/20/2023
What is the reason for a target value of 300? That is the utilization percentage at which the ECS service has to scale out; in that case it would wait for the load to reach 300% (which is impossible) before creating additional tasks. docs.aws.amazon.com/autoscaling/application/APIReference/...... Another thing I just noticed: why is the container's CPU = 1024 while the task's CPU = 2048? If the container only uses 1024 CPU, the task will only ever be 50% loaded.
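
To illustrate these two points, here is a minimal sketch that assumes the question's ecs_target target and uses purely illustrative values (the 70% target and the cooldowns are not from the thread); the target_tracking_scaling_policy_configuration block also accepts scale_out_cooldown and scale_in_cooldown, which may be relevant to the earlier question about additional parameters:

resource "aws_appautoscaling_policy" "ecs_target_cpu" {
  name               = "application-scaling-policy-cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    # target_value is a percentage of the service's average CPU utilization;
    # 70 is an illustrative value - 300 can never be reached, so the service
    # would never scale out.
    target_value       = 70
    # Optional cooldowns (seconds) that control how quickly consecutive
    # scale-out / scale-in activities can follow each other.
    scale_out_cooldown = 60
    scale_in_cooldown  = 120
  }

  depends_on = [aws_appautoscaling_target.ecs_target]
}

Likewise, raising the container-level "cpu" to 2048 and "memory" to 4096 in container_definitions would let the single GTM container use the full Fargate task allocation, as suggested in the last comment.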