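Part 8 · about 10 min

Autoscaling (HPA/VPA)

Why autoscaling

With a fixed replica count, you cannot avoid one of two costs. Size for peak traffic and you waste money the rest of the time; size for average traffic and the service fails at peak. Kubernetes addresses this dilemma with three kinds of autoscalers.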

Scaler	What it adjusts	Scope
HPA (Horizontal Pod Autoscaler)	Number of Pods	Deployment / StatefulSet replicas
VPA (Vertical Pod Autoscaler)	Pod size (CPU and memory requests)	Per-container resources
CA (Cluster Autoscaler)	Number of nodes	Cluster infrastructure

This article focuses on HPA and VPA.

HPA: adding and removing replicas

How it works

The HPA controller reads metrics every 15 seconds by default and computes the target replica count.

desiredReplicas = ceil( currentReplicas × currentMetricValue / targetMetricValue )

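Example: with 4 Pods currently averaging 300m CPU each against a target average of 150m, ceil(4 × 300/150) = 8, so the workload scales out to 8 Pods. The metric value in the formula is the per-Pod average, not the sum across Pods.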
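
The same formula also covers the scale-in direction. As a worked illustration with made-up numbers (8 Pods averaging 100m against the same 150m target; these values are assumptions for illustration, not measurements):

```latex
\[
\text{desiredReplicas}
  = \left\lceil 8 \times \frac{100\,\text{m}}{150\,\text{m}} \right\rceil
  = \lceil 5.33\ldots \rceil
  = 6
\]
```

Because of the ceiling, the result rounds toward keeping more capacity: 5.33 becomes 6, not 5.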

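Available metrics

Resource metrics: cpu, memory (requires metrics-server)
Custom metrics: signals such as HTTP RPS or queue depth, exposed through an adapter such as Prometheus Adapter
External metrics: metrics from systems outside the cluster (SQS, Pub/Sub, etc.)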

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # target: average CPU usage across Pods at 60% of the CPU request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # observe for 5 minutes before scaling down
    scaleUp:
      stabilizationWindowSeconds: 0     # scale up immediately
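
For the custom-metric case in the list above, a minimal sketch might look like the following. It assumes a metrics adapter (for example Prometheus Adapter) already exposes a per-Pod metric through the custom metrics API. The metric name http_requests_per_second and the target of 100 are hypothetical and depend on how the adapter is configured.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric name exposed by the adapter
      target:
        type: AverageValue
        averageValue: "100"              # assumed target: about 100 requests/s per Pod
```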

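Scale-down stabilization window (stabilizationWindowSeconds)

If the Pod count swings back and forth (flapping) with every sharp metric change, the service becomes unstable. Before scaling down, HPA looks back over the stabilization window (300 seconds by default) and uses the highest desired replica count computed within that period. This prevents an unnecessary scale-down when CPU dips briefly and then climbs again.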
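
The window can be combined with a rate-limiting policy in the same behavior block. The fragment below is a sketch of that idea. The limit of 50% per 60 seconds is an illustrative assumption, not a recommendation from this article.

```yaml
# Fragment of spec: in a HorizontalPodAutoscaler (autoscaling/v2)
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # use the highest recommendation from the last 5 minutes
    policies:
    - type: Percent
      value: 50                       # remove at most 50% of the current replicas...
      periodSeconds: 60               # ...within any 60-second period
```

The window decides whether to scale down at all; the policy then caps how fast the replica count may drop.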

Memory is a poor fit as an HPA metric. Memory is released slowly and GC timing differs between language runtimes, so it is an unreliable scaling signal. Leave memory sizing to VPA.

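VPA: right-sizing containers

Three internal components

Recommender: analyzes historical usage and computes recommended resource requests (limits are scaled to keep the original limit-to-request ratio).
Admission Controller: injects the VPA recommendation automatically when a Pod is created.
Updater: evicts Pods whose current values differ too much from the recommendation so that they restart with new values.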

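Choosing an updateMode

Mode	Behavior
`Off`	Computes recommendations only and changes nothing. A good fit for initial adoption
`Initial`	Injects the recommendation only when a Pod is first created; no changes afterward
`Recreate`	Evicts and recreates a Pod when it differs significantly from the recommendation (respects PDBs)
`Auto`	Currently the same as `Recreate`; in-place resize support is planned for the future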

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # start with Off and observe the recommendations first
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi

The safe approach is to run in Off mode for at least 24–48 hours, check the recommendations with kubectl describe vpa api-server, and only then decide whether to apply them.

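Using HPA and VPA together

If HPA and VPA in Auto mode both act on the same CPU or memory metric, they conflict. When VPA raises a request, the utilization ratio that HPA computes (usage divided by request) changes, and the two controllers end up in a loop where they scale in different directions.

Safe combinations:

HPA → handles CPU/memory metrics
VPA → Off or Initial mode, supplying recommendations without evicting running Pods
Or: HPA → custom metrics (RPS, queue depth, etc.), VPA → Auto for CPU/memory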
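
As a sketch of the second combination, the manifests below pair an HPA driven by a queue metric with a VPA in Auto mode, so the two controllers never react to the same signal. This assumes an external metrics adapter is installed. The metric name queue_messages_ready and the target of 30 are hypothetical.

```yaml
# HPA: replica count follows an external queue metric, not CPU/memory
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready   # hypothetical metric name from the adapter
      target:
        type: AverageValue
        averageValue: "30"           # assumed target: about 30 messages per Pod
---
# VPA: CPU/memory requests are adjusted automatically
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"
```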

The whole architecture at a glance

Figure: HPA and VPA control loop architecture

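KEDA: event-driven scaling

To scale on signals beyond HPA's CPU and memory, such as queue message count, database row count, or a cron expression, use KEDA (Kubernetes Event-Driven Autoscaling). KEDA creates an HPA internally, but differs in that it connects directly to external event sources. It also supports scaling from 0 to N (scale-to-zero).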
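
A minimal sketch of a KEDA ScaledObject using the cron trigger is shown below. It assumes KEDA is installed in the cluster. The schedule, time zone, and replica counts are illustrative values, not taken from a real workload.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server
spec:
  scaleTargetRef:
    name: api-server            # Deployment in the same namespace
  minReplicaCount: 0            # allow scale-to-zero outside the schedule
  maxReplicaCount: 20
  triggers:
  - type: cron
    metadata:
      timezone: Asia/Seoul      # assumed time zone
      start: "0 9 * * 1-5"      # scale up at 09:00 on weekdays
      end: "0 19 * * 1-5"       # scale back down at 19:00
      desiredReplicas: "5"
```

Since KEDA manages the underlying HPA itself, a separately defined HPA for the same Deployment would normally be left out.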

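Practical guidelines

Set resource requests/limits accurately first. HPA computes utilization as a ratio of the request, so a badly set request makes scaling behave erratically.
Introduce VPA in Off mode first, observe the recommendations for a week or more, and then adjust requests.
Set minReplicas to 2 or more: with 0 or 1, scale-up is delayed and availability drops.
Avoid overlapping HPA and VPA metrics: do not run a CPU/memory HPA and VPA Auto at the same time.
Use a PodDisruptionBudget (PDB) alongside: it guarantees a minimum number of available Pods during VPA Updater evictions and node drains.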
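
For the last guideline, a minimal PDB sketch could look like this. The label app: api-server is an assumption and must match the labels on the Deployment's Pod template.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server
spec:
  minAvailable: 1          # with minReplicas: 2, at most one Pod is evicted at a time
  selector:
    matchLabels:
      app: api-server      # assumed label; adjust to your Pod template
```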
