intake / intake

Intake is a lightweight package for finding, investigating, loading and disseminating data.

Home Page: https://intake.readthedocs.io/

[Enhancement] Proposal for introducing centralized URI handling

kadykov opened this issue

I'm not familiar with the structure of the project, but I have the impression that there are too many places where we do URI string parsing, protocol extraction, checking URI format, converting paths to POSIX format, and so on...

This seems tricky to maintain, and it could cause some weird bugs. I think there should be a single place where we do all operations on URI strings.

For example, it could look something like this very rough draft of a URI class:

from pathlib import Path
from typing import Literal
from dataclasses import dataclass, InitVar, field

Protocols = Literal[
    "file",
    "sqlite",
    "postgresql",
    "http",
    "https",
    "s3",
]

@dataclass
class Credentials:
    username: str | None
    password: str | None

@dataclass
class URI:
    uri: InitVar[str]
    path: Path = field(init=False)
    protocol: Protocols = field(init=False)
    credentials: Credentials = field(init=False)

    def __post_init__(self, uri: str):
        # Extract protocol
        self.protocol = self._get_protocol(uri)
        # Extract authoritative notation
        self.credentials = self._get_credentials(uri)
        # Extract path
        self.path = Path(self._get_path(uri))
        ...

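    def to_str(self, protocol: bool = True, credentials: bool = True, posix: bool = True) -> str:
        # Always start from the stored path; use forward slashes when posix=True
        path = self.path.as_posix() if posix else str(self.path)
        if protocol:
            # Prefix with the stored protocol (self.protocol), not the boolean flag
            path = f"{self.protocol}://{path}"
        if credentials:
            path = ...
        ...
        return str(path)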
    
    def _get_protocol(self, uri: str) -> Protocols:
        ...
        return uri.split(":/")[0]
    
    def _get_credentials(self, uri: str) -> Credentials:
        if self.protocol == "file":
            ...
        elif self.protocol == "http":
            ...
        return Credentials(None, None)
    
    def _get_path(self, uri: str) -> Path:
        ...
        return Path(uri)

uri = URI("https://github.com/intake/intake")
if uri.protocol == "file":
    open(uri.to_str(protocol=False))
elif uri.protocol == "postgresql":
    ...

Probably, intake/utils.py could be the right place for it...

What do you think about creating this missing abstraction layer, where we do all the manipulations, and then passing URI class instances everywhere?
I think this should simplify the code, ease maintenance, and promote code reuse.
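
For comparison, much of what the draft's `_get_protocol`, `_get_credentials` and `_get_path` would have to do can probably be delegated to Python's standard library. The following is only a minimal sketch, assuming `urllib.parse.urlsplit` is good enough for the URLs in question; `ParsedURI`, `parse_uri` and the example URLs are hypothetical and not part of Intake:

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ParsedURI:
    """Hypothetical container for the pieces of a URI (not an Intake class)."""

    protocol: str
    username: Optional[str]
    password: Optional[str]
    host: Optional[str]
    path: str


def parse_uri(uri: str, default_protocol: str = "file") -> ParsedURI:
    """Split a URI string into its parts using only the standard library."""
    parts = urlsplit(uri)
    return ParsedURI(
        # A bare path such as "/tmp/data.csv" has no scheme, so fall back
        # to the (assumed) default protocol.
        protocol=parts.scheme or default_protocol,
        username=parts.username,
        password=parts.password,
        host=parts.hostname,
        path=parts.path,
    )


# Example URLs are made up for illustration.
db = parse_uri("postgresql://user:secret@localhost:5432/mydb")
local = parse_uri("/tmp/data.csv")
```

Note that `urlsplit` would read a Windows path such as `C:\data\file.csv` as having the one-letter scheme `c`, so local paths would still need special care.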

Meta comments

  • 1: Why "Enchantment"? :)
  • 2: Many of the usages are in the v1 codebase and tests, and I am not worried about them
  • 3: Some of the string manipulations thrown up by the searches have nothing to do with paths (e.g., "package.module:class" forms)
  • 4: Some contexts will certainly need custom path handling where paths mean different things to different readers

Having said all of that, we should indeed have more consistency. We need centralised

  • is_fsspec(url, storage_options=None) to determine if a URL is fsspec-like
  • stripping and un-stripping
  • path split/join with consistency around terminal "/"

All of these should live in fsspec, I think, and most of the functionality already exists in some form there.
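
To make the second and third items concrete, here is a rough standalone sketch of what "stripping and un-stripping" and a join with consistent handling of the terminal "/" might look like. The helper names (`split_scheme`, `strip_scheme`, `add_scheme`, `join_url`) are hypothetical, this is not fsspec's API, and the rule "never keep a trailing slash" is just one possible convention:

```python
from typing import Optional, Tuple


def split_scheme(url: str) -> Tuple[Optional[str], str]:
    """Return (protocol, rest); protocol is None when there is no "://"."""
    if "://" in url:
        protocol, rest = url.split("://", 1)
        return protocol, rest
    return None, url


def strip_scheme(url: str) -> str:
    """Drop a leading "protocol://" if present."""
    return split_scheme(url)[1]


def add_scheme(path: str, protocol: str) -> str:
    """Re-attach a protocol, without doubling it if one is already there."""
    if split_scheme(path)[0] is not None:
        return path
    return f"{protocol}://{path}"


def join_url(base: str, *parts: str) -> str:
    """Join with single "/" separators and no trailing "/" (except a bare root)."""
    protocol, rest = split_scheme(base)
    cleaned = [segment.strip("/") for segment in (rest, *parts)]
    joined = "/".join(segment for segment in cleaned if segment)
    if rest.startswith("/"):
        # Preserve the leading "/" of an absolute path.
        joined = "/" + joined
    return joined if protocol is None else f"{protocol}://{joined}"


# The same logical path, however the caller wrote the slashes:
a = join_url("s3://bucket/dir/", "sub/", "file.csv")
b = join_url("s3://bucket/dir", "/sub", "file.csv")
```

Both `a` and `b` come out as the same string, which is the kind of consistency the list above asks for.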

I am -1 on trying to enumerate all possible protocols and credential types here, or having protocol-dependent sets of if blocks rather than filesystem classes with methods.

Thank you for the correction, sometimes our typing tools fool us :)

I agree that dedicated classes with their methods like in fsspec filesystems are a much better idea than if statements.
The URI class that I showed is just an example, to illustrate the idea. Probably, it should be just a thin wrapper around fsspec.
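
As a rough illustration of "classes with methods" versus if blocks, the sketch below registers one small class per protocol and lets each override its own path handling. It is loosely in the spirit of fsspec filesystems but is not fsspec code; every name in it (`Handler`, `LocalHandler`, `BucketHandler`, `handler_for`) is hypothetical, and the trailing-slash rule for buckets is an assumption made only for the example:

```python
from typing import Dict, Type


class Handler:
    """Hypothetical base class: one subclass per protocol."""

    protocol = ""
    registry: Dict[str, Type["Handler"]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Subclasses register themselves under their protocol name.
        if cls.protocol:
            Handler.registry[cls.protocol] = cls

    def strip(self, url: str) -> str:
        """Default behaviour: remove this handler's "protocol://" prefix."""
        prefix = f"{self.protocol}://"
        return url[len(prefix):] if url.startswith(prefix) else url


class LocalHandler(Handler):
    protocol = "file"


class BucketHandler(Handler):
    protocol = "s3"

    def strip(self, url: str) -> str:
        # Assumed for illustration: bucket keys drop any trailing "/".
        return super().strip(url).rstrip("/")


def handler_for(url: str, default: str = "file") -> Handler:
    """Pick a handler by protocol lookup instead of an if/elif chain."""
    protocol = url.split("://", 1)[0] if "://" in url else default
    return Handler.registry[protocol]()


bucket_key = handler_for("s3://bucket/key/").strip("s3://bucket/key/")
local_path = handler_for("file:///tmp/x").strip("file:///tmp/x")
```

Adding a new protocol then means adding a class, not editing every call site that branches on the protocol string.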

My main idea is that we could simplify the code by introducing an abstraction on top of the strings and using these objects as the base type for internal communication between the different parts.

Here are some projects that could be used for managing URI paths:

Whilst intake can afford to be relaxed about requirements and add extra packages if warranted, we'd have to be clear about what we're getting from them. As I said, I think that path string handling should ideally live over in fsspec, including the functions I outlined above, and that would solve everything for those readers that accept fsspec/file-like objects or filesystems. Other readers likely don't need any path handling at all, and Intake just passes the strings along unaltered.