JuliaStats / StatsBase.jl

Basic statistics for Julia

Adopting Transducers.jl as a Dependency

ParadaCarleton opened this issue · comments

Many functions (e.g. mean and variance) could be made parallelizable, faster, shorter, and more general by adopting Transducers.jl as a dependency. It would also substantially simplify the implementation of some features. I often find myself reaching for it, only to fall back on clumsier iterator or broadcasting approaches.

Luckily, Transducers.jl is now being maintained by Mason Protter and the rest of the people working on the JuliaFolds ecosystem.

The package and its dependencies have been pared down substantially over time and should not be a major contributor to StatsBase.jl's loading time. Transducers is now lightweight, with a load time of only about 80 ms for all dependencies (including indirect ones) on Julia v1.10.

julia> @time_imports using Transducers
      0.2 ms  Adapt
      6.1 ms  MacroTools
      0.5 ms  StaticArraysCore
      0.3 ms  ConstructionBase
      6.4 ms  Setfield
      0.3 ms  ArgCheck
      0.1 ms  Compat
      0.1 ms  Compat → CompatLinearAlgebraExt
      6.4 ms  InitialValues
               ┌ 0.0 ms Requires.__init__() 
     32.5 ms  Requires 98.74% compilation time
               ┌ 0.0 ms BangBang.__init__() 
      4.7 ms  BangBang
      9.5 ms  Baselet
      0.2 ms  CompositionsBase
      0.2 ms  DefineSingletons
      2.9 ms  MicroCollections
     30.2 ms  Test
      4.5 ms  SplittablesBase
     14.2 ms  Transducers

The primary advantages would be simplifying the implementation of many features, enabling in-place algorithms that can be substantially faster and more memory-efficient, and using an interface more generic than the iterator interface (transducers can operate on collections that are not themselves iterators).
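As a rough illustration of the kind of implementation meant here, the sketch below writes a mean as a single fold, assuming Transducers.jl is installed. `fold_mean`, `to_pair`, and `merge_pairs` are hypothetical names for this example and are not part of StatsBase.jl or Transducers.jl; only `Map`, `foldxl`, and `foldxt` come from Transducers.jl.

```julia
using Transducers  # assumption: Transducers.jl is available in the environment

# Hypothetical helpers for illustration only (not StatsBase.jl API).
# Each element becomes a (sum, count) pair; pairs are merged elementwise.
to_pair(x) = (float(x), 1)
merge_pairs(a, b) = (a[1] + b[1], a[2] + b[2])

function fold_mean(xs; parallel::Bool = false)
    # merge_pairs is associative and (0.0, 0) is its identity, so the same
    # fold can run sequentially (foldxl) or across threads (foldxt).
    fold = parallel ? foldxt : foldxl
    s, n = fold(merge_pairs, Map(to_pair), xs; init = (0.0, 0))
    return s / n
end

fold_mean([1, 2, 3, 4])             # sequential fold
fold_mean(1:100; parallel = true)   # threaded fold over the same definition
```

The loop body contains no explicit indexing, so there is nothing to annotate with @inbounds. Whether this is actually faster than the current implementations would need benchmarking.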

BTW, @devmotion, the reason I'm interested in Transducers.jl is that I'm working on a PR that fixes all of the loops and uses of @inbounds in StatsBase.jl; Transducers.jl can replace most of these loops with faster (and less bug-prone) constructions. I think finally killing off @inbounds with no performance penalty (and in most cases a speedup) would be worth it.

Any remaining @inbounds issues could be fixed without switching to Transducers, so for me that's not a compelling argument for adopting such a large dependency (and I guess it's completely impossible for code that will be moved to the Statistics stdlib?). Even if StatsBase were to use Transducers at some point, I think it would be good to keep bugfixes separate from a transition to Transducers.
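For comparison, a minimal sketch of a Base-only fix: iterating over `eachindex(x)` instead of `1:length(x)` with @inbounds keeps every access in bounds for any index style, and in many cases lets the compiler elide the bounds checks on its own. `sumsq_dev` is a hypothetical helper written for this example, not an existing StatsBase.jl function.

```julia
# Hypothetical helper for illustration: sum of squared deviations from m.
function sumsq_dev(x::AbstractVector{<:Real}, m::Real)
    # Accumulate in a floating-point type wide enough for both inputs.
    s = zero(float(promote_type(eltype(x), typeof(m))))
    for i in eachindex(x)   # valid indices for x, including offset arrays and views
        d = x[i] - m
        s += d * d
    end
    return s
end

sumsq_dev([1.0, 2.0, 3.0], 2.0)
```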