Pipeline stuck in "Scheduling"
juchiast opened this issue · comments
Installed on EKS, Kubernetes 1.26, 4x t3.large instances, arroyo-0.5.
values.yaml:
outputDir: "/tmp/arroyo-test"
volumes:
  - name: checkpoints
    hostPath:
      path: /tmp/arroyo-test
      type: DirectoryOrCreate
volumeMounts:
  - name: checkpoints
    mountPath: /tmp/arroyo-test
Logs:
deployment.apps/arroyo-controller:
2023-08-21T09:49:59.016441Z INFO arroyo_controller::states: starting state machine job_id="job_wHHzn9Ezp7"
2023-08-21T09:49:59.017458Z INFO arroyo_controller::states: state transition job_id="job_wHHzn9Ezp7" from="Created" to="Compiling" duration_ms=0
2023-08-21T09:49:59.048036Z INFO arroyo_controller::states::compiling: Compiling pipeline job_id="job_wHHzn9Ezp7" hash="kl7rbew88rogxeqk"
2023-08-21T09:49:59.048262Z INFO arroyo_controller::compiler: Compiling remotely on http://arroyo-compiler:9000
2023-08-21T09:49:59.048363Z INFO arroyo_controller::compiler: digraph {
0 [ label = "finnhub_0:WebsocketSource<wss://ws.finnhub.io/?token=...>" ]
1 [ label = "watermark_1:Watermark" ]
2 [ label = "sink_web_2:WebSink" ]
3 [ label = "fused_3:expression<sql_fused<value_project,value_project>:Record>" ]
0 -> 1 [ label = "() → finnhub :: ArroyoJsonRoot" ]
3 -> 2 [ label = "() → generated_struct_5412302300650363671" ]
1 -> 3 [ label = "() → finnhub :: ArroyoJsonRoot" ]
}
2023-08-21T09:50:44.708458Z INFO arroyo_controller::states: state transition job_id="job_wHHzn9Ezp7" from="Compiling" to="Scheduling" duration_ms=45690
2023-08-21T09:58:22.867727Z INFO arroyo_controller::states: starting state machine job_id="job_37XrqB8Fd4"
2023-08-21T09:58:22.868857Z INFO arroyo_controller::states: state transition job_id="job_37XrqB8Fd4" from="Created" to="Compiling" duration_ms=0
2023-08-21T09:58:22.891353Z INFO arroyo_controller::states::compiling: Compiling pipeline job_id="job_37XrqB8Fd4" hash="yolstlaxxnxwf88j"
2023-08-21T09:58:22.891399Z INFO arroyo_controller::compiler: Compiling remotely on http://arroyo-compiler:9000
2023-08-21T09:58:22.891520Z INFO arroyo_controller::compiler: digraph {
0 [ label = "finnhub_0:WebsocketSource<wss://ws.finnhub.io/?token=...>" ]
1 [ label = "watermark_1:Watermark" ]
2 [ label = "sink_web_2:WebSink" ]
3 [ label = "fused_3:expression<sql_fused<value_project,value_project>:Record>" ]
0 -> 1 [ label = "() → finnhub :: ArroyoJsonRoot" ]
3 -> 2 [ label = "() → generated_struct_5412302300650363671" ]
1 -> 3 [ label = "() → finnhub :: ArroyoJsonRoot" ]
}
2023-08-21T09:58:37.024947Z INFO arroyo_controller::states: state transition job_id="job_37XrqB8Fd4" from="Compiling" to="Scheduling" duration_ms=14156
2023-08-21T10:00:44.750722Z ERROR arroyo_controller::states: retryable state error job_id="job_wHHzn9Ezp7" state="Scheduling" error_message="timed out while waiting for job to start" error="timed out after 600s while waiting for worker startup" retries=3
2023-08-21T10:08:37.068349Z ERROR arroyo_controller::states: retryable state error job_id="job_37XrqB8Fd4" state="Scheduling" error_message="timed out while waiting for job to start" error="timed out after 600s while waiting for worker startup" retries=3
replicaset.apps/arroyo-worker-job-zgfuyex3ab-1:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:16:73
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:16:73
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(1), ...)', src/main.rs:66:21
Hi @juchiast, thanks for checking out Arroyo!
In Kubernetes, we use a very small worker image (built by this Dockerfile), which is responsible for downloading the pipeline binary from S3 or the filesystem and starting it. That (admittedly not very clear) error message means the worker failed to find the binary.
I believe the issue here is with your Helm configuration. You've configured everything to use local paths (/tmp/arroyo-test), which works when everything runs on a single node (as in minikube) but not on a multi-node cluster, where a worker pod can be scheduled on a different node from the one holding the binary.
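To illustrate the failure mode, here is a toy simulation in Python. It is not Arroyo code: the two temporary directories stand in for the local filesystems of two cluster nodes, and the helper names (`write_artifact`, `try_start_worker`) are hypothetical. It assumes the binary is written to the local path on whichever node produced it, and that the worker may then start on a different node.

```python
import errno
import tempfile
from pathlib import Path

# Stand-in for the configured outputDir, relative to each simulated node's root.
OUTPUT_DIR = Path("tmp/arroyo-test")


def write_artifact(node_root: Path, name: str, data: bytes) -> Path:
    """Write a pipeline binary into one node's local (hostPath-like) directory."""
    target_dir = node_root / OUTPUT_DIR
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(data)
    return target


def try_start_worker(node_root: Path, name: str) -> str:
    """Return 'started' if the binary is visible on this node, else the OS error code."""
    try:
        (node_root / OUTPUT_DIR / name).read_bytes()
        return "started"
    except FileNotFoundError as err:
        # ENOENT corresponds to the `Os { code: 2, kind: NotFound }` in the worker log.
        return f"failed: errno={err.errno}"


with tempfile.TemporaryDirectory() as node_a, tempfile.TemporaryDirectory() as node_b:
    # The binary is written on node A only; node B has its own, separate /tmp.
    write_artifact(Path(node_a), "pipeline", b"fake pipeline binary")
    same_node = try_start_worker(Path(node_a), "pipeline")
    other_node = try_start_worker(Path(node_b), "pipeline")

print("worker on node A:", same_node)
print("worker on node B:", other_node)
```

A worker that lands on node A finds the file, while one that lands on node B gets "No such file or directory". Shared storage such as an S3 bucket avoids this because every node reads from the same location.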
Instead, you will need to configure an S3 bucket to store the pipeline artifacts, as in https://doc.arroyo.dev/deployment/kubernetes#example-eks-configuration.
Happy to help you synchronously on Discord as well if that's easier!
Thanks! I suggest adding a note to the docs so that users know a local-path configuration won't work when running on EKS.
By the way, I had to manually set the K8S_WORKER_SERVICE_ACCOUNT_NAME env var on arroyo-controller to make it work with S3.