JuliaGPU / CUDA.jl

CUDA programming in Julia.

Home Page: https://juliagpu.org/cuda/

Dependencies in `profile.jl` constitute a significant fraction of the load time

Sbozzolo opened this issue

While trying to reduce load time for some of our packages, I timed using CUDA and noticed that some dependencies are relatively heavy for what they provide. In particular, DataFrames and PrettyTables directly account for more than 20% of the load time (not counting their own dependencies). This is not a lot of time in absolute terms (~1 s in the example below), but as far as I can tell DataFrames and PrettyTables are used exclusively in profile.jl and are not required for the core operation of CUDA.jl. This might be low-hanging fruit for reducing load times for CUDA.jl and downstream packages, either by removing the dependencies or by moving them to package extensions.

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.9.4 (2023-11-14)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> @time_imports using CUDA
      2.7 ms  CEnum
      9.3 ms  Preferences
      0.3 ms  JLLWrappers
    195.0 ms  LLVMExtra_jll 98.84% compilation time (98% recompilation)
     30.3 ms  LLVM
      0.3 ms  ExprTools
     24.9 ms  TimerOutputs
      0.3 ms  Scratch
    174.0 ms  GPUCompiler 4.37% compilation time
      0.4 ms  Adapt
      0.1 ms  Reexport
      1.4 ms  GPUArraysCore
      0.6 ms  Statistics
     74.7 ms  GPUArrays
      0.2 ms  Requires
      3.9 ms  BFloat16s
      0.2 ms  LLVM → BFloat16sExt
      0.1 ms  LLVMLoopInfo
     75.0 ms  CUDA_Driver_jll 37.98% compilation time
      6.4 ms  CUDA_Runtime_jll
    140.1 ms  CUDA_Runtime_Discovery
     47.3 ms  FixedPointNumbers
     66.2 ms  ColorTypes
     48.7 ms  Colors
      1.2 ms  NVTX_jll
      0.6 ms  JuliaNVTXCallbacks_jll
     21.9 ms  NVTX
      9.7 ms  RandomNumbers
      2.9 ms  Random123
      0.2 ms  DataValueInterfaces
      1.2 ms  DataAPI
      0.2 ms  IteratorInterfaceExtensions
      0.1 ms  TableTraits
     25.7 ms  Tables
      0.2 ms  PrecompileTools
     10.3 ms  StringManipulation
     15.4 ms  Crayons
      0.6 ms  LaTeXStrings
    277.5 ms  PrettyTables
      0.3 ms  Compat
      0.2 ms  Compat → CompatLinearAlgebraExt
     59.5 ms  DataStructures
      1.1 ms  SortingAlgorithms
     17.9 ms  PooledArrays
      8.7 ms  Missings
      2.3 ms  InvertedIndices
     24.8 ms  SentinelArrays
     26.2 ms  Parsers
      6.7 ms  InlineStrings
    696.0 ms  DataFrames
     33.3 ms  AbstractFFTs
      0.4 ms  AbstractFFTs → AbstractFFTsTestExt
      4.5 ms  UnsafeAtomics
     11.3 ms  Atomix
      8.7 ms  MacroTools
      3.5 ms  StaticArraysCore
    407.8 ms  StaticArrays
      0.3 ms  Adapt → AdaptStaticArraysExt
      0.2 ms  StaticArrays → StaticArraysStatisticsExt
      3.4 ms  UnsafeAtomicsLLVM
     21.0 ms  KernelAbstractions
   1390.2 ms  CUDA 1.40% compilation time

With a total load time of:

julia> @time using CUDA
  4.312124 seconds (6.52 M allocations: 441.875 MiB, 4.30% gc time, 5.83% compilation time: 75% of which was recompilation)
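
As a quick sanity check on the "more than 20%" figure, the two self times can be compared against the total directly. This is only arithmetic on the numbers reported above; the variable names are illustrative.

```julia
# Self times reported by @time_imports above, in milliseconds.
prettytables_ms = 277.5
dataframes_ms   = 696.0

# Total wall time reported by `@time using CUDA`, converted to milliseconds.
total_ms = 4.312124 * 1000

# Fraction of the total load time spent loading the two packages themselves.
share = (prettytables_ms + dataframes_ms) / total_ms
println("DataFrames + PrettyTables: ", round(100 * share; digits = 1), "% of load time")
```

This comes out at roughly 22.6%, and it excludes the packages that DataFrames and PrettyTables pull in themselves (Tables, Crayons, PooledArrays, and so on).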

An integrated profiler is a pretty fundamental part of a programming environment, IMO, so I'm not inclined to remove that functionality or put it in a separate package.

If anything, this seems like an issue with DataFrames.jl and PrettyTables.jl to optimize the load time of those packages? Although I would assume that DataFrames.jl has been optimized already. I'm personally not very familiar with load-time optimizations, so any help is welcome here.

Thank you for your quick response!

> If anything, this seems like an issue with DataFrames.jl and PrettyTables.jl to optimize the load time of those packages? Although I would assume that DataFrames.jl has been optimized already. I'm personally not very familiar with load-time optimizations, so any help is welcome here.

Agreed, but I think that it is easier for CUDA.jl to avoid using those packages than to optimize them. I don't know exactly how DataFrames.jl is used, but it might be possible to write a relatively small amount of code that implements the functionality needed. Also, profile.jl could write "ugly tables" by default and switch to pretty_table when PrettyTables.jl is loaded separately. This way, users who are not using the profiler do not pay the cost of loading the packages, and those who do want to use the profiler lose no functionality.
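
To make the "ugly tables by default" idea concrete, here is a minimal sketch using only Base Julia. All names (`plain_table`, `TABLE_RENDERER`, `render_table`) are hypothetical and do not exist in CUDA.jl; the assumption is that a package extension triggered by PrettyTables could replace the renderer stored in the `Ref` with one that calls `pretty_table`.

```julia
# Hypothetical sketch: a dependency-free table printer plus a swappable hook.
# None of these names exist in CUDA.jl; they are illustrative only.

# Print `rows` (an iterable of tuples) under `header` as a plain aligned table.
function plain_table(io::IO, header, rows)
    # Convert every cell to a string, header line first.
    cells = [[string(h) for h in header]]
    for row in rows
        push!(cells, [string(x) for x in row])
    end
    # Each column is as wide as its widest cell.
    ncols = length(header)
    widths = [maximum(textwidth(line[j]) for line in cells) for j in 1:ncols]
    for (i, line) in enumerate(cells)
        println(io, rstrip(join((rpad(line[j], widths[j]) for j in 1:ncols), "  ")))
        # Underline the header with dashes.
        i == 1 && println(io, join(("-"^w for w in widths), "  "))
    end
    return nothing
end

# The renderer the profiler would call. Code that loads PrettyTables could
# overwrite this Ref with a function that forwards to `pretty_table`.
const TABLE_RENDERER = Ref{Function}(plain_table)

render_table(io::IO, header, rows) = TABLE_RENDERER[](io, header, rows)

# Example with made-up profiler rows.
render_table(stdout, ["name", "calls", "time_ms"],
             [("kernel_a", 3, 1.25), ("memcpy", 12, 0.4)])
```

The indirection through a `Ref` keeps the default path free of extra dependencies; whether that trade-off is acceptable for the profiler's output quality is a separate question.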

> An integrated profiler is a pretty fundamental part of a programming environment, IMO, so I'm not inclined to remove that functionality or put it in a separate package.

It is a very useful component, but I think it should add as little overhead as possible when not used. Since the profiler is mostly used in interactive development and in tests, its dependencies add to the import time of downstream packages that never use it.

I opened this issue mostly to highlight that there's an opportunity to reduce loading time, but I don't have the solution.

If you're not developing in CUDA but only using the package for GPU support in, say, Flux, or for GPU inference with ONNXRunTime, the DataFrames and PrettyTables dependencies are quite annoying, not only for load times but also for the general growth of your dependency tree.

For the technical aspects, all uses of DataFrames and PrettyTables are confined to src/profile.jl. Unfortunately, they have open-ended using declarations, so it's hard to say immediately which names are used from them. Luckily, ExplicitImports can help determine that:

using DataFrames: DataFrames, DataFrame, PrettyTables, combine, groupby,
                  leftjoin, nrow, order, rename!
using PrettyTables: Highlighter, pretty_table

I'm not a user of DataFrames so I don't really know what those functions mean, but combine, groupby and leftjoin sound like they might not be entirely trivial to reimplement. I could be wrong though.
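
For a rough sense of scale, here is a sketch of a group-and-aggregate step and a left join over vectors of NamedTuples, using only Base Julia. It does not reflect how profile.jl actually calls combine, groupby, or leftjoin, and it covers only a small fraction of what DataFrames offers; `group_sum`, `left_join`, and the sample data are hypothetical.

```julia
# Hypothetical stand-ins for a groupby/combine step and a left join,
# operating on vectors of NamedTuples. Illustrative only; not CUDA.jl code.

# Group `rows` by the field `key`, summing the field `value` and counting
# rows per group. Groups are returned in first-seen order.
function group_sum(rows, key::Symbol, value::Symbol)
    order = Any[]
    totals = Dict{Any,Float64}()
    counts = Dict{Any,Int}()
    for row in rows
        k = getproperty(row, key)
        if !haskey(totals, k)
            push!(order, k)
            totals[k] = 0.0
            counts[k] = 0
        end
        totals[k] += getproperty(row, value)
        counts[k] += 1
    end
    return [(; group = k, total = totals[k], count = counts[k]) for k in order]
end

# Left join on the field `on`: every row of `left` is kept. Fields from the
# matching row of `right` are merged in; `defaults` supplies the values used
# when there is no match. Assumes `on` is unique within `right`.
function left_join(left, right, on::Symbol, defaults::NamedTuple)
    lookup = Dict(getproperty(r, on) => r for r in right)
    return [merge(l, defaults, get(lookup, getproperty(l, on), (;))) for l in left]
end

# Example with made-up data.
events = [(name = "kernel_a", ms = 1.5),
          (name = "memcpy",   ms = 0.5),
          (name = "kernel_a", ms = 2.5)]
summary = group_sum(events, :name, :ms)

kinds = [(group = "kernel_a", kind = "compute")]
joined = left_join(summary, kinds, :group, (; kind = missing))

foreach(println, joined)
```

The simple cases fit in a few dozen lines, but matching DataFrames' handling of multiple keys, column types, and sorting would take more care, so the concern about reimplementation is reasonable.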

If it were considered acceptable to require a using CUDAProfiler, it would be possible to make a dummy package which only imports DataFrames and PrettyTables (and maybe Crayons for good measure), then move the profiler code to a package extension inside the CUDA package, where the only code changes would be to import DataFrames and PrettyTables via CUDAProfiler. On the downside, it would only work on Julia 1.9+ and would arguably be a misuse of a Pkg feature.