ArroyoSystems / arroyo

Distributed stream processing engine in Rust

Home Page: https://arroyo.dev

Validate filesystem URLs at planning time

mwylde opened this issue · comments

For filesystem sources created via SQL, we do not validate them as part of the SQL planning process. This causes panics at runtime when the source is instantiated on the worker:

2023-08-15T03:28:44.132894Z ERROR arroyo_server_common: panicked at 'called `Result::unwrap()` 
on an `Err` value: RelativeUrlWithoutBase', /opt/arroyo/src/arroyo-worker/src/connectors/filesystem/mod.rs:119:65 
panic.file="/opt/arroyo/src/arroyo-worker/src/connectors/filesystem/mod.rs" panic.line=119 panic.column=65
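A minimal sketch of what a planning-time check could look like, so that a path without a scheme is rejected with an error instead of reaching the `unwrap()` on the worker. This is not Arroyo's code: `validate_source_path`, `PathError`, and the list of supported schemes are all hypothetical, and the check uses only the standard library as a stand-in for real URL parsing.

```rust
/// Hypothetical error type for planning-time path validation.
#[derive(Debug, PartialEq)]
pub enum PathError {
    /// No `scheme://` prefix; the kind of input that a URL parser
    /// reports as a relative URL without a base.
    MissingScheme,
    /// A scheme this sketch does not recognize.
    UnsupportedScheme(String),
    /// A scheme with nothing after it, e.g. `s3://`.
    EmptyPath,
}

/// Hypothetical syntactic check run while planning the SQL,
/// before any worker tries to instantiate the source.
/// The supported scheme list is an assumption for illustration.
pub fn validate_source_path(path: &str) -> Result<(), PathError> {
    const SUPPORTED: [&str; 3] = ["file", "s3", "https"];

    // Split "scheme://rest"; a bare relative path has no separator.
    let (scheme, rest) = path.split_once("://").ok_or(PathError::MissingScheme)?;

    if !SUPPORTED.contains(&scheme) {
        return Err(PathError::UnsupportedScheme(scheme.to_string()));
    }
    if rest.is_empty() {
        return Err(PathError::EmptyPath);
    }
    Ok(())
}
```

A planner could call `validate_source_path` on the path of each filesystem table and return the error to the user. A purely syntactic check like this only catches malformed URLs; it does not verify that the path exists or is readable.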

Hi @mwylde
I would also like to give this one a shot. Could you give me some guidance on how to get started?
Could you also tell me where the SQL planning is done?
IIUC, the planning is delegated to datafusion?

@mwylde I'm not able to reproduce this specific panic (RelativeUrlWithoutBase) exactly, but I did notice a few related issues when trying to reproduce it using ghcr.io/arroyosystems/arroyo-single:0.10-dev:

  1. Pipelines/previews succeed even if the path for a filesystem source created via SQL does not exist. I'd expect some sort of failure if the path does not exist.
  2. Path "file:///" for a filesystem source created via SQL panics during query execution with ERROR arroyo_server_common: panicked at crates/arroyo-connectors/src/filesystem/source.rs:69:17: could not get next path: Generic LocalFileSystem error: Unable to walk dir: File system loop found: /sys/class/vtconsole/vtcon0/subsystem points to an ancestor /sys/class/vtconsole panic.file="crates/arroyo-connectors/src/filesystem/source.rs" panic.line=69 panic.column=17
  3. An S3 path without valid S3 creds for a filesystem source created via SQL panics during query execution with: panicked at crates/arroyo-connectors/src/filesystem/source.rs:69:17: could not get next path: Generic s3 error: Couldn't find AWS credentials in environment, credentials file, or IAM role. panic.file="crates/arroyo-connectors/src/filesystem/source.rs" panic.line=69 panic.colum
  4. Creating filesystem sources in the UI always succeeds, even if the path entered is malformed. See:
      tokio::task::spawn(async move {
          let message = TestSourceMessage {
              error: false,
              done: true,
              message: "Successfully validated connection".to_string(),
          };
          tx.send(message).await.unwrap();
      });
  5. Kafka sources created via SQL panic if the topic does not exist: panicked at crates/arroyo-worker/src/lib.rs:622:14: called `Result::unwrap()` on an `Err` value: SendError { .. } panic.file="crates/arroyo-worker/src/lib.rs" panic.line=622 panic.column=14

What do you think about reusing the connection test() logic that runs when connectors are created in the UI, and running it when sources are planned during the scheduling phase? If each connector properly implements test(), that should solve all of the problems above.
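The idea could be sketched roughly as follows. Everything here is hypothetical and simplified: the `Connector` trait, `FileSource`, and `plan_sources` are illustrative names rather than Arroyo's actual interfaces, and the `test()` body is a placeholder for whatever real check a connector would perform (listing a path, contacting a broker, and so on).

```rust
/// Hypothetical connector interface; not Arroyo's actual trait.
pub trait Connector {
    fn name(&self) -> &str;
    /// Returns Ok(()) if the connection looks usable, or a message
    /// describing why it is not.
    fn test(&self) -> Result<(), String>;
}

/// Hypothetical filesystem source carrying only its configured path.
pub struct FileSource {
    pub path: String,
}

impl Connector for FileSource {
    fn name(&self) -> &str {
        "filesystem"
    }

    fn test(&self) -> Result<(), String> {
        // Placeholder check: a real test would also try to reach the path.
        if self.path.contains("://") {
            Ok(())
        } else {
            Err(format!("path '{}' has no URL scheme", self.path))
        }
    }
}

/// Runs test() for every source during planning and collects the
/// failures, so they surface as a planning error rather than a
/// worker panic.
pub fn plan_sources(sources: &[Box<dyn Connector>]) -> Result<(), Vec<String>> {
    let errors: Vec<String> = sources
        .iter()
        .filter_map(|s| s.test().err().map(|e| format!("{}: {}", s.name(), e)))
        .collect();

    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Collecting all failures instead of stopping at the first one lets the user see every misconfigured source in a single planning error.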