ploomber / ploomber

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

Home Page: https://docs.ploomber.io

ploomber.tasks.Link is not usable

marr75 opened this issue

With a source attribute, ploomber.tasks.Link cannot be instantiated. Without a source, the task fails validation.

It is not currently possible to use ploomber.tasks.Link in a pipeline spec.

I think it's because when we added Link, we only had the Python API (not the pipeline.yaml API), and we never worked on ensuring it'd work with pipeline.yaml. Feel free to open a PR!

@edublancas I will. I could use a little guidance from you, though.

Locally, I've got this signature for Link:

class Link(Task):
    ...
    def __init__(self, source, product, dag, name=None):
        kwargs = dict(hot_reload=dag._params.hot_reload)
        self._source = type(self)._init_source(kwargs)
        super().__init__(product, dag, name, None)

And tasks using Link tend to look like:

  # Dummy task to wrap success stories exported from hubspot
  - name: success-stories
    source: ""
    product: "{{PRODUCTS_DIR}}/success-stories.csv"
    class: Link
    product_class: File

Which isn't terrible, but the blank source, the class, and the product_class could all be a little confusing.

I don't think I can get around the source issue without quite a bit of rewiring in the spec task validation (which strictly looks for source, with no OO/protocol-based validation). The product_class issue may be solvable by checking whether product is path-like or URL-like.
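
To make the path-like/URL-like idea concrete, here is a minimal sketch of the kind of check I have in mind. It is not ploomber code: guess_product_kind is a hypothetical helper, it only uses the standard library, and how its result would map to a product_class is left open.

```python
from pathlib import PurePath
from urllib.parse import urlparse


def guess_product_kind(product):
    """Hypothetical helper: classify a spec's product value as 'url', 'path', or None."""
    if not isinstance(product, (str, PurePath)):
        # Dicts, lists, etc. are not handled by this heuristic
        return None
    text = str(product).strip()
    if not text:
        return None
    parsed = urlparse(text)
    # A scheme plus a network location (e.g. https://host/file.csv) reads as a URL.
    # One-letter schemes are ignored so Windows drive paths (C:\data) are not misread.
    if len(parsed.scheme) > 1 and parsed.netloc:
        return "url"
    # Anything else that is a non-empty string is treated as a filesystem path
    return "path"
```

The heuristic is deliberately loose: any non-empty string that does not look like a URL counts as a path, including placeholders such as "{{PRODUCTS_DIR}}/success-stories.csv".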

I suppose I could give any task whose source satisfies source.lower() == "link" a class of Link. Maybe that kills two birds with one stone?
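
Roughly, the shorthand could look like the sketch below. It works on a plain dict standing in for one task entry from pipeline.yaml; apply_link_shorthand is a hypothetical name, and where this would hook into the real spec validation is an open question.

```python
def apply_link_shorthand(task_spec):
    """Hypothetical helper: default class to 'Link' when source is the string 'link'."""
    resolved = dict(task_spec)  # shallow copy so the caller's dict is left untouched
    source = resolved.get("source")
    if isinstance(source, str) and source.lower() == "link":
        # setdefault keeps an explicitly declared class if the user provided one
        resolved.setdefault("class", "Link")
    return resolved


task = {
    "name": "success-stories",
    "source": "Link",
    "product": "{{PRODUCTS_DIR}}/success-stories.csv",
}
resolved_task = apply_link_shorthand(task)
```

With this, the task entry would no longer need a blank source or an explicit class, though product_class would still have to be handled separately.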

Let me know your thoughts.