
YOLO Serving Cookbook based on Triton Inference Server


📸 YOLO Serving Cookbook 📸

1. Docker

Architecture
[Figure: Docker architecture]

2. Docker-Compose

Architecture
[Figure: Docker-Compose architecture]

3. Kubernetes

Architecture (without Ensemble)

Number of Replicas = 1 / Number of Replicas = 5
[Figures: Kubernetes-Rep=1, Kubernetes-Rep=5]

Architecture (with Ensemble)

Number of Replicas = 1 / Number of Replicas = 5
[Figures: Kubernetes-Ensemble-Rep=1, Kubernetes-Ensemble-Rep=5]

Experimental Setup

  • Server
    • Sync: synchronous request handling in FastAPI
    • Async: asynchronous request handling in FastAPI
    • Rep: number of replicas of fastapi and triton-inference-server
    • Ensemble: pre-processing, post-processing, and visualization are performed via an ensemble inside triton-inference-server (fastapi runs asynchronously)
  • Client (FastAPI is called 100 times per run, repeated over 10 runs; a sketch of the three call patterns follows this list)
    • Serial: sequential calls in a for loop
    • Concurrency: simultaneous calls via ThreadPoolExecutor
    • Random: calls via ThreadPoolExecutor, each after a random 0-20 second delay
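
A minimal sketch of the three client call patterns; the endpoint URL, payload, and helper bodies are illustrative assumptions, not taken from the repo's curl.py:

import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/predict"  # hypothetical endpoint
N_CALLS = 100

def send_request(i: int) -> float:
    """Send one request and return its latency in seconds."""
    start = time.time()
    with open("sample.jpg", "rb") as f:  # illustrative payload
        requests.post(URL, files={"file": f})
    return time.time() - start

def send_after_random_delay(i: int) -> float:
    time.sleep(random.uniform(0, 20))  # Random condition: 0-20 s jitter
    return send_request(i)

# Serial: sequential calls in a for loop.
serial = [send_request(i) for i in range(N_CALLS)]

# Concurrency: all 100 calls dispatched at once.
with ThreadPoolExecutor(max_workers=N_CALLS) as pool:
    concurrent = list(pool.map(send_request, range(N_CALLS)))

# Random: concurrent calls, each delayed by a random 0-20 s.
with ThreadPoolExecutor(max_workers=N_CALLS) as pool:
    randomized = list(pool.map(send_after_random_delay, range(N_CALLS)))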

Results

Unit: [Sec]

Server Arch.     Mean(Serial)  End(Serial)  Mean(Concurrency)  End(Concurrency)  Mean(Random)  End(Random)
Sync&Rep=1       0.69          78.01        41.93              129.61            40.05         128.63
Sync&Rep=5       0.60          68.99        25.57              61.38             26.88         81.69
Async&Rep=1      0.68          77.02        0.80               82.22             0.78          80.34
Async&Rep=1-5    0.61          69.07        0.60               62.11             -             -
Async&Rep=5      0.62          69.77        1.84               39.77             1.91          41.84
Ensemble&Rep=1   0.70          78.02        0.77               78.50             -             -
Ensemble&Rep=5   0.66          74.52        1.90               42.03             -             -
Figures

[Figures: EACH-SERIAL, TOTAL-SERIAL]
[Figures: EACH-CONCURRENCY, EACH-CONCURRENCY-ASYNC, TOTAL-CONCURRENCY]
[Figures: EACH-RANDOM, TOTAL-RANDOM]

Discussion

Sync, Async, Ensemble

Unit: [Sec]

Server Arch.   Mean(Serial)  End(Serial)  Mean(Concurrency)  End(Concurrency)  Mean(Random)  End(Random)
Sync           0.647         73.499       33.752             95.496            33.460        105.160
Async          0.652         73.395       1.320              60.991            1.345         61.094
Ensemble       0.680         76.270       1.332              60.269            -             -

With serial calls there is no difference between the synchronous and asynchronous approaches.

However, compared to the synchronous approach, the asynchronous approach responds about 36.51% faster under concurrent calls and about 41.90% faster under random calls.
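
A minimal sketch of where that gap comes from; the endpoint paths, model name, payload, and the use of httpx here are illustrative assumptions, not taken from the repo. A def endpoint occupies one of FastAPI's threadpool workers for the whole Triton round-trip, while an async def endpoint awaits it and lets the event loop accept other requests.

import httpx
from fastapi import FastAPI

app = FastAPI()
INFER_URL = "http://triton-inference-server:8000/v2/models/yolo/infer"  # hypothetical
payload = {"inputs": []}  # placeholder; a real request carries the image tensor

@app.post("/predict-sync")
def predict_sync():
    # Sync: blocks a threadpool worker until Triton answers,
    # so many simultaneous calls queue behind each other.
    return httpx.post(INFER_URL, json=payload).json()

@app.post("/predict-async")
async def predict_async():
    # Async: the await yields control during the round-trip,
    # so other requests are accepted in the meantime.
    async with httpx.AsyncClient() as client:
        response = await client.post(INFER_URL, json=payload)
    return response.json()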

On the other hand, no clear benefit from the ensemble approach was observed, though this may be a limitation of this experiment (resources, data scale, ...).

Errors under the Random condition when the FastAPI endpoint is defined with async def
Traceback (most recent call last):
  File "anaconda3\lib\site-packages\requests\models.py", line 972, in json
    return complexjson.loads(self.text, **kwargs)
  File "anaconda3\lib\site-packages\simplejson\__init__.py", line 514, in loads
    return _default_decoder.decode(s)
  File "anaconda3\lib\site-packages\simplejson\decoder.py", line 386, in decode
    obj, end = self.raw_decode(s)
  File "anaconda3\lib\site-packages\simplejson\decoder.py", line 416, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Downloads\curl.py", line 70, in <module>
    main(i)
  File "Downloads\curl.py", line 53, in main
    responses = list(
  File "anaconda3\lib\concurrent\futures\_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "anaconda3\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()
  File "anaconda3\lib\concurrent\futures\_base.py", line 391, in __get_result
    raise self._exception
  File "anaconda3\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "Downloads\curl.py", line 24, in send_request
    res = response.json()
  File "anaconda3\lib\site-packages\requests\models.py", line 976, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

μ΄λŠ” Random μ‘°κ±΄μ—μ„œ λ°œμƒν•˜λŠ” 였λ₯˜μΈλ°, Concurrency 쑰건에선 λ°œμƒν•˜μ§€ μ•ŠλŠ”κ²Œ μ΄μƒν•˜λ‹€.

I went through the logs of every pod but found nothing relevant.

It looks like an error that arises while calling the triton-inference-server Service asynchronously.
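
One low-cost way to narrow this down on the client side (a sketch; the URL is a placeholder and send_request_safe is a hypothetical variant of the send_request in the traceback) is to inspect the raw response before parsing, so a non-JSON error body gets logged instead of raising JSONDecodeError:

from typing import Optional

import requests

URL = "http://localhost:8000/predict"  # placeholder endpoint

def send_request_safe(i: int) -> Optional[dict]:
    response = requests.post(URL, timeout=120)
    # Surface non-JSON failures (empty bodies, proxy/gateway error pages)
    # instead of letting JSONDecodeError escape the thread pool.
    if response.status_code != 200 or not response.text:
        print(f"request {i}: HTTP {response.status_code}, body={response.text[:200]!r}")
        return None
    try:
        return response.json()
    except requests.exceptions.JSONDecodeError:
        print(f"request {i}: non-JSON body={response.text[:200]!r}")
        return None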

ChatGPT's answer

When errors occur only for a specific request pattern while using FastAPI, there can be a variety of causes. A few possibilities:

  1. Resource limits: The server, or a database it uses, may be hitting a limit on connections or other resources. When 100 requests are sent in one burst, each request may be handled and its resources released quickly enough that nothing goes wrong; but when 100 requests are spread over 20 seconds, resources may not be released in between and problems can occur.
  2. Rate limiting: Some systems use rate limiting to keep too many requests from being handled in a short time, so sending requests steadily for 20 seconds can trip the limit.
  3. Memory problems: As requests accumulate, the server's memory usage can grow. If memory runs short, errors can occur while handling requests.
  4. Duplicate request handling: When requests are sent continuously for 20 seconds, the result or state of an earlier request can affect the handling of a later one. Duplicated requests or state changes can be the cause.
  5. Logging or monitoring tools: If logs are written or monitoring tools are used while handling requests, a higher request rate adds extra load.
  • Remedies
    • Logging: check the logs of the errors that occur to identify the concrete cause.
    • Profiling: monitor the server's resource usage (CPU, memory, network, ...) to spot bottlenecks or resource shortages.
    • Testing: test a variety of request scenarios against the server to isolate the cause.

These checks should identify the cause so appropriate action can be taken.

Replicas

Unit: [Sec]

Server Arch.   Mean(Serial)  End(Serial)  Mean(Concurrency)  End(Concurrency)  Mean(Random)  End(Random)
Rep=1          0.691         77.682       14.501             96.777            20.415        104.487
Rep=5          0.629         71.094       9.767              47.726            14.391        61.767

Increasing the number of replicas was confirmed to speed up the API's responses. (Pods are reached through a Service, and traffic to a Service is automatically load-balanced across the pods that belong to it; see the sketch below.)
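
A minimal sketch of that wiring (the image tag and port are illustrative; the Deployment name matches the HPA example later in this README): the replica count is just the Deployment's replicas field, and the Service's selector spreads traffic over whichever matching pods are ready.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
spec:
  replicas: 5                        # Rep=5: five identical pods
  selector:
    matchLabels:
      app: triton-inference-server
  template:
    metadata:
      labels:
        app: triton-inference-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:23.04-py3   # illustrative tag
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: triton-inference-server
spec:
  selector:
    app: triton-inference-server     # traffic to this Service is load-balanced
  ports:                             # across the matching, ready pods
    - port: 8000
      targetPort: 8000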

The improvement is especially large for concurrent calls.

WORKER TIMEOUT

An error that did not occur with 1 fastapi replica and 5 triton-inference-server replicas appeared, as below, once fastapi was scaled to 5 replicas (triton-inference-server still at 5).

This was resolved by adding "--timeout", "120" to the Dockerfile.

[1] [CRITICAL] WORKER TIMEOUT (pid:8)
[1] [WARNING] Worker with pid 8 was terminated due to signal 6
[379] [INFO] Booting worker with pid: 379
[379] [INFO] Started server process [379]
[379] [INFO] Waiting for application startup.
[379] [INFO] Application startup complete.
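
The log above is gunicorn's worker-timeout message. As a sketch, assuming the image launches the app with gunicorn in exec form (the module path and worker class are illustrative assumptions), the fix amounts to:

# module path "main:app" and worker class are illustrative assumptions
CMD ["gunicorn", "main:app", \
     "--worker-class", "uvicorn.workers.UvicornWorker", \
     "--bind", "0.0.0.0:8000", \
     "--timeout", "120"]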

Autoscaling

With an HPA, when 100 requests arrive in a single burst they all land on the single fastapi pod before any replicas can be created, so no autoscaling benefit is observed.

λ”°λΌμ„œ autoscaling을 μ›ν™œνžˆ ν•˜λ €λ©΄ Resource 기쀀이 μ•„λ‹Œ μƒˆλ‘œμš΄ metricsκ°€ ν•„μš”ν•˜λ‹€.

Example: hpa.yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-inference-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
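
For comparison, a non-Resource metric in the same autoscaling/v2beta2 schema might look like the following, assuming a custom http_requests_per_second Pods metric exposed through something like Prometheus Adapter (the metric name and target value are hypothetical):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "10"               # scale out above ~10 req/s per pod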
Gradio

Architecture
[Figure: Gradio architecture]

License

GNU Affero General Public License v3.0