dylan-profiler / visions

Type System for Data Analysis in Python

Home Page: https://dylan-profiler.github.io/visions/visions/getting_started/usage/types.html

Leverage pandas' ExtensionDtype for defining efficient new types

sbrugman opened this issue

Visions currently supports defining custom types, such as Path, File and URL. These types inherit from object and are stored as uniquely defined classes. For instance, this means that a URL is stored as the namedtuple ParseResult that is returned by urlparse.
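To make the current storage concrete, here is a minimal sketch using only pandas and the standard library (it does not call visions itself, and the example URLs are just illustrative). It shows that a series of parsed URLs ends up with the generic object dtype, with each element being a ParseResult:

```python
from urllib.parse import ParseResult, urlparse

import pandas as pd

raw = [
    "https://github.com/dylan-profiler/visions",
    "https://pandas.pydata.org/docs/",
]

# Each element becomes a ParseResult namedtuple; pandas can only store
# these Python objects in a series with the generic object dtype.
parsed = pd.Series([urlparse(u) for u in raw], dtype=object)

print(parsed.dtype)            # the generic object dtype
print(type(parsed.iloc[0]))    # urllib.parse.ParseResult
print(parsed.iloc[0].netloc)   # the host part of the first URL
```

Because the dtype is object, pandas itself carries no information that the series holds URLs; that knowledge lives only in the individual elements.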

This strategy is effective in applications where the series would have been converted to the object dtype anyway, and it poses no problem for small to medium-sized datasets. For larger datasets we should consider an additional strategy, in which a new (d)type is created as an alias for an existing pandas dtype. Allowing this kind of abstraction addresses one of the major shortcomings in pandas at the moment. Custom dtypes generally reduce the memory footprint, and they reduce the computational complexity of membership checks from O(n) to O(1): instead of inspecting every element of an object series, only the dtype of the series needs to be checked. The same functionality could be maintained through an accessor (series.path, just like series.dt).
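A rough sketch of both ideas, assuming pandas >= 1.0 (for StringDtype). The helper names `is_url_series_object` and `is_string_backed`, as well as the `PathAccessor` class, are hypothetical and not part of visions or pandas; only `register_series_accessor` and `StringDtype` are actual pandas APIs:

```python
import pathlib
from urllib.parse import ParseResult, urlparse

import pandas as pd


def is_url_series_object(series):
    """O(n) check: every element of the object series is inspected."""
    return series.dtype == object and all(
        isinstance(value, ParseResult) for value in series
    )


def is_string_backed(series):
    """O(1) check: only the dtype metadata of the series is inspected."""
    return isinstance(series.dtype, pd.StringDtype)


# Hypothetical accessor exposing path functionality on a string series,
# analogous to how series.dt exposes datetime functionality.
@pd.api.extensions.register_series_accessor("path")
class PathAccessor:
    def __init__(self, series):
        self._series = series

    @property
    def suffix(self):
        # Parse lazily, only when path functionality is requested.
        return self._series.map(lambda p: pathlib.PurePosixPath(p).suffix)


urls = pd.Series([urlparse("https://example.com/a")], dtype=object)
paths = pd.Series(["/data/a.csv", "/data/b.parquet"], dtype="string")

print(is_url_series_object(urls))
print(is_string_backed(paths))
print(paths.path.suffix.tolist())
```

In this sketch the data stays in an existing pandas dtype, and the accessor supplies the type-specific behaviour on demand. A real alias dtype would additionally need to distinguish paths from arbitrary strings, which this example does not attempt.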

Two implementation considerations:

  • pandas' StringDtype and ExtensionDtype are experimental and may change. The code for this enhancement should therefore be a minimal layer over the pandas interface.
  • StringDtype was introduced in pandas v1.0.0. ExtensionDtype, however, was introduced earlier. Visions should provide backwards compatibility.
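One possible way to keep the layer thin and backwards compatible is to detect the feature rather than compare version numbers. This is only a sketch: `HAS_STRING_DTYPE` and `to_string_series` are hypothetical names, not taken from visions or the linked PR:

```python
import pandas as pd

# Feature detection: StringDtype only exists in pandas >= 1.0.0.
HAS_STRING_DTYPE = hasattr(pd, "StringDtype")


def to_string_series(values):
    """Use the dedicated string dtype when available, else fall back to object."""
    if HAS_STRING_DTYPE:
        return pd.Series(values, dtype="string")
    return pd.Series(values, dtype=object)


series = to_string_series(["a", "b"])
print(series.dtype)
```

Keeping the pandas-version-specific logic in a single helper like this limits how much code has to change if the experimental interfaces evolve.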

A type-agnostic solution is proposed in the linked PR.

Might be worth looking at cyberpandas (https://github.com/ContinuumIO/cyberpandas), which implements an IPAddress extension array.

@jamesmyatt Thanks for thinking along! cyberpandas is an excellent demonstration of how useful adding new types can be. On the other hand, it also demonstrates how involved adding a type can get with pandas. The pandas devs are (currently) not keen on supporting subclassing of other ExtensionDtypes.