
Pass a reference of an object in shared memory to a task

jayhack opened this issue

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar request and didn't find it.
  • I searched the Prefect documentation for this feature.

Prefect Version

2.x

Describe the current behavior

The code below runs a Prefect flow that starts parallel tasks (with Dask), passing either a list of 80-character strings as an argument (read_data) or the path to a file containing the same strings (read_data_file). The idea is to measure how long it takes to pass this amount of data directly to the tasks versus passing only the file path and opening the file inside each task.

Running the code with python <FILE_NAME>.py 1000000 shows that passing the data directly takes much longer than passing the name of a file with the same data and reading the file inside the task: more than 80 seconds for the direct case, versus under 2 seconds for the file-path case.

import sys
import time
import psutil
import tempfile

import pandas as pd
from typing import List
from pathlib import Path

from dask import config as cfg
from prefect import flow, task
from prefect_dask import DaskTaskRunner

# Disable the worker TTL so the scheduler does not drop busy workers.
cfg.set({"distributed.scheduler.worker-ttl": None})

# Local Dask cluster with one worker per CPU core.
DEFAULT_RUNNER = DaskTaskRunner(
    cluster_kwargs={"n_workers": psutil.cpu_count(), "memory_limit": 1.0, "resources": {"process": 1}}
)

@task
def read_data(a: List[str]) -> None:
    # The full list is passed directly as a task argument.
    print(f"Done! List size: {len(a)}.")

@task
def read_data_file(file_path: Path) -> None:
    # Only the path is passed; the list is read inside the task.
    a = pd.read_csv(file_path, header=None)[0].values.tolist()
    print(f"Done! List size: {len(a)}.")

@flow(task_runner=DEFAULT_RUNNER, validate_parameters=False)
def main_flow(data_size: int) -> None:
    a = ["C" * 80 for i in range(data_size)]
    with tempfile.TemporaryDirectory() as tmpdir:
        file_path = Path(tmpdir) / "file.txt"
        pd.Series(a).to_csv(file_path, index=False, header=False)
        
        # Time submitting the list itself to one task per CPU.
        starting_time = time.time()
        tasks_futures = []
        for _ in range(psutil.cpu_count()):
            tasks_futures.append(read_data.submit(a))
        for future in tasks_futures:
            future.wait()
        time_passing_data = time.time() - starting_time

        # Time submitting only the file path to one task per CPU.
        starting_time = time.time()
        tasks_futures = []
        for _ in range(psutil.cpu_count()):
            tasks_futures.append(read_data_file.submit(file_path))
        for future in tasks_futures:
            future.wait()
        time_passing_file_path = time.time() - starting_time

        print(f"Passing data directly: {time_passing_data}.")
        print(f"Passing file path: {time_passing_file_path}.")

if __name__ == "__main__":
    main_flow(data_size=int(sys.argv[1]))

Describe the proposed behavior

It would be very nice to have a way to pass a task just a reference to a large object held in read-only shared memory, for situations like this one where the object is only read.

If this is not possible, is there any other way to pass large amounts of data to tasks more efficiently?
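
For reference, here is a minimal sketch of a manual workaround built on the standard library's multiprocessing.shared_memory, assuming all workers run as local processes on the same machine (as with the local Dask cluster above). The names read_shared and shared_flow are hypothetical, chosen just for this illustration; the payload is copied into a shared segment once, and only the small segment name is submitted to each task.

from multiprocessing import shared_memory

from prefect import flow, task

@task
def read_shared(shm_name: str, size: int) -> None:
    # Attach to the existing segment by name; no payload bytes are copied here.
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        payload = bytes(shm.buf[:size])  # read-only use of the shared block
        print(f"Done! Payload size: {len(payload)}.")
    finally:
        shm.close()  # detach; the flow still owns the segment

@flow
def shared_flow(data: bytes) -> None:
    # Copy the data into shared memory once, visible to all local processes.
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    try:
        shm.buf[:len(data)] = data
        read_shared.submit(shm.name, len(data)).wait()
    finally:
        shm.close()
        shm.unlink()  # free the segment after all readers are done

The flow has to keep the segment alive until every reader has finished, and nothing enforces the read-only contract, which is exactly what a first-class reference type would improve.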

Example Use

We could write something like

@task
def f(input_data_ref):
    # input_data_ref can only be read, not written.
    ...

data = <SOMETHING VERY LARGE>
data_ref = data.share_read_only_memory()
f.submit(data_ref)
...
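
For a concrete sense of what such a handle could look like, here is a minimal sketch for NumPy arrays on top of multiprocessing.shared_memory; share_read_only and open_read_only are invented names, not an existing Prefect API.

import numpy as np
from multiprocessing import shared_memory

def share_read_only(arr: np.ndarray):
    # Copy arr into a shared segment once; the handle is tiny and picklable.
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[:] = arr  # the single copy
    return shm, (shm.name, arr.shape, str(arr.dtype))  # keep shm referenced

def open_read_only(handle):
    # Attach by name and return a zero-copy, read-only view.
    name, shape, dtype = handle
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=np.dtype(dtype), buffer=shm.buf)
    view.flags.writeable = False  # writes inside the task raise an error
    return shm, view  # shm must stay referenced while view is in use

Attaching is cheap regardless of the array size, and the consumer side cannot mutate the data, which is the behavior the proposed share_read_only_memory() would formalize.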

Additional context

No response