TheOpenCloudEngine / uEngine-cloud

OCE's main components include a PaaS (self-service) portal, DevOps, and a cloud orchestrator. It also includes microservices-architecture components: Identity & Access Management (IAM) conforming to the OAuth2 and JWT specs, and a Zuul-based API proxy that interacts with IAM and the service registry (Eureka).

Home Page:http://uengine.org/products/pass

Prevent infinite deployment attempts for DCOS apps

SeungpilPark opened this issue · comments

backoffFactor

The multiplicand to apply to the backoffSeconds value. The default value is 1.15. The backoffSeconds and backoffFactor values are multiplied until they reach the maxLaunchDelaySeconds value. After they reach that value, Marathon waits maxLaunchDelaySeconds before repeating this cycle exponentially. For example, if backoffSeconds: 3, backoffFactor: 2, and maxLaunchDelaySeconds: 3600, there will be ten attempts to launch a failed task, each three seconds apart. After these ten attempts, Marathon will wait 3600 seconds before repeating this cycle.

This prevents sandboxes associated with consecutively failing tasks from filling up the hard disk on Mesos slaves. This applies also to tasks that are killed due to failing too many health checks.
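The interaction of the three settings can be sketched in a few lines of Python. This is only an illustration under an assumption: the wait starts at backoffSeconds, is multiplied by backoffFactor after each consecutive failure, and is capped at maxLaunchDelaySeconds. That is one reading of "multiplied until they reach the maxLaunchDelaySeconds value"; it is not taken from Marathon's source, and the quoted example words the retry pattern differently. The function name `launch_delays` is hypothetical.

```python
def launch_delays(backoff_seconds, backoff_factor, max_launch_delay_seconds, failures):
    """Return the assumed wait (in seconds) before each of `failures` relaunch attempts.

    Assumption: plain exponential growth capped at max_launch_delay_seconds.
    """
    cap = float(max_launch_delay_seconds)
    delay = float(backoff_seconds)
    delays = []
    for _ in range(failures):
        delays.append(min(delay, cap))
        # Grow the delay for the next consecutive failure, never past the cap.
        delay = min(delay * backoff_factor, cap)
    return delays


if __name__ == "__main__":
    # Values from the quoted example: backoffSeconds 3, backoffFactor 2,
    # maxLaunchDelaySeconds 3600.
    for attempt, delay in enumerate(launch_delays(3, 2, 3600, 13), start=1):
        print(f"failure {attempt}: wait {delay:g}s")
```

Under this assumption the wait doubles from 3 seconds and hits the 3600-second cap on the twelfth consecutive failure, so a larger backoffSeconds or backoffFactor is what reduces how many failed sandboxes pile up per hour.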

backoffSeconds

The amount of time (in seconds) before Marathon retries launching a failed task. The default is 1.

maxLaunchDelaySeconds

The maximum amount of time (in seconds) to wait, after applying the backoffSeconds and backoffFactor values, before attempting to restart failed tasks.

mesosphere/marathon#3035

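Judging from that issue, the Marathon maintainers seem very unlikely to ever honor the user request to stop infinite deployment attempts in DCOS, on the grounds that it does not fit their philosophy of app persistence. Instead, the three options backoffFactor, backoffSeconds, and maxLaunchDelaySeconds can be used to tune how often a redeployment is attempted.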
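As a rough sketch of what tuning these options looks like, the snippet below adds the three fields to a Marathon app definition and prints the resulting JSON. The app id, command, resource values, and the chosen backoff numbers are hypothetical placeholders, and `with_backoff` is an illustrative helper, not part of any Marathon client.

```python
import json


def with_backoff(app, backoff_seconds, backoff_factor, max_launch_delay_seconds):
    """Return a copy of an app definition with the three backoff fields set."""
    updated = dict(app)  # shallow copy so the caller's dict is left untouched
    updated.update(
        {
            "backoffSeconds": backoff_seconds,
            "backoffFactor": backoff_factor,
            "maxLaunchDelaySeconds": max_launch_delay_seconds,
        }
    )
    return updated


if __name__ == "__main__":
    # Hypothetical app definition used only for illustration.
    app = {"id": "/example-app", "cmd": "sleep 3600", "cpus": 0.1, "mem": 32, "instances": 1}
    # Hypothetical values: start at 60 s, double on each failure, cap at one hour.
    print(json.dumps(with_backoff(app, 60, 2, 3600), indent=2))
```

The printed JSON is the kind of body that would be submitted to Marathon's /v2/apps API when creating or updating the app; the request itself is left out here to keep the sketch free of cluster-specific details.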

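The problems caused by infinite redeployment attempts are:

  • Task failure history keeps accumulating in ZooKeeper.
  • -> The REST API that returns the Mesos state includes that failure history in its response. (No filter is provided.)
  • -> Network transfer volume grows as time goes on.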

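In particular, even without infinite redeployment, task history keeps accumulating in the REST API response, so it looks certain to become a problem after several months of operation. For infinite redeployment attempts, users will apparently have to set up a separate APM, and for the problematic history, the data stored in the ZooKeeper znodes will have to be deleted directly. Neither approach can be completed in the short term, so both are deferred to operational work after the stable package release.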
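To see how much of the state response is failure history, something like the following could be run against the parsed JSON. It is a sketch that assumes the Mesos master state document has a `frameworks` list whose entries carry a `completed_tasks` list; the sample data, task ids, and helper names are made up for illustration.

```python
import json


def completed_task_counts(state):
    """Map each framework name to the number of completed tasks it reports."""
    return {
        fw.get("name", fw.get("id", "unknown")): len(fw.get("completed_tasks", []))
        for fw in state.get("frameworks", [])
    }


def payload_bytes(state):
    """Approximate size of the state document when serialized as JSON."""
    return len(json.dumps(state).encode("utf-8"))


# Hypothetical, heavily trimmed sample of a master state document.
sample_state = {
    "frameworks": [
        {
            "name": "marathon",
            "tasks": [{"id": "app.1", "state": "TASK_RUNNING"}],
            "completed_tasks": [
                {"id": "app.a", "state": "TASK_FAILED"},
                {"id": "app.b", "state": "TASK_FAILED"},
                {"id": "app.c", "state": "TASK_FAILED"},
            ],
        }
    ]
}

if __name__ == "__main__":
    print(completed_task_counts(sample_state))
    print(payload_bytes(sample_state), "bytes")
```

Tracking these two numbers over time would show whether the history, rather than the live tasks, is what drives the growth in transfer volume.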

mesos_max_completed_tasks_per_framework option => controls how much task history is kept in memory.
TODO: change the mesos_max_completed_tasks_per_framework property
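The effect of such a limit can be pictured as a bounded buffer that drops the oldest entries once it is full. The sketch below uses Python's `collections.deque` purely as an analogy; how Mesos stores completed tasks internally is not described in this issue, and the task ids and function name are hypothetical.

```python
from collections import deque


def retained_history(task_ids, max_completed):
    """Return the task ids still kept after applying a per-framework limit."""
    history = deque(maxlen=max_completed)  # the oldest entry is evicted when full
    for task_id in task_ids:
        history.append(task_id)
    return list(history)


if __name__ == "__main__":
    finished = [f"task-{i}" for i in range(1, 9)]  # eight hypothetical finished tasks
    print(retained_history(finished, 5))
```

With a limit of 5, only the five most recent of the eight tasks remain, which is why lowering the value keeps the state response small no matter how many launches fail.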

Changing mesos_max_completed_tasks_per_framework on a cluster that is already running

On each master node,

sudo vi /opt/mesosphere/etc/mesos-master

.
.
MESOS_MAX_COMPLETED_TASKS_PER_FRAMEWORK=5

sudo systemctl restart dcos-mesos-master
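When there are several master nodes, the manual edit could be scripted. The sketch below only shows the text transformation: it sets a KEY=VALUE line in the contents of an environment file, replacing an existing line or appending a new one. The sample file contents and the helper name `set_env_value` are hypothetical; reading and writing /opt/mesosphere/etc/mesos-master and restarting the service are left to the operator.

```python
def set_env_value(text, key, value):
    """Return env-file text with `key` set to `value`, replacing or appending the line."""
    new_line = f"{key}={value}"
    lines = text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if line.startswith(key + "="):
            lines[index] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file contents, not copied from a real cluster.
    current = "MESOS_EXAMPLE_SETTING=1\nMESOS_MAX_COMPLETED_TASKS_PER_FRAMEWORK=1000\n"
    print(set_env_value(current, "MESOS_MAX_COMPLETED_TASKS_PER_FRAMEWORK", 5), end="")
```

Running the function twice gives the same result, so the step is safe to repeat on a node that has already been updated.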