rikhuijzer / SIRUS.jl

Interpretable Machine Learning via Rule Extraction

Home Page:https://sirus.jl.huijzer.xyz/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Various doc issues

sylvaticus opened this issue · comments

Hello, some suggestion concerning the documentation.

  • You could remove the quotes in the table of page https://sirus.jl.huijzer.xyz/dev/binary-classification/

  • I still think (personal and debatable opinion) that the "basic example" still looks more complex than it should be. If an user run PKGDIR = dirname(dirname(@__DIR__)) or DOCS_DIR = dirname(@__DIR__), it would find itself in a different directory than the one in the documentation script. Also, users are not supposed to develop a package in order to use it.

    If, for example, you add the (small) dataset to the repository as [SIRUS_FOLDER]/test/data/haberman.csv you could simply use something like that as minimal example for a user:

    using CSV, DataFrames, CategoricalArrays, MLJ, SIRUS
    data_filename = joinpath(dirname(pathof(SIRUS)),"..","test","data","haberman.csv")
    data = CSV.read(data_filename, DataFrame)
    X    = select(data, Not(:survival))
    y    = categorical(data.survival)
    
    model = StableRulesClassifier(max_depth=2, max_rules=8)
    mach  = machine(model, X, y)
    MLJ.fit!(mach)
    mach.fitresult
    predictions = predict(mach, X)
    resampling = CV(nfolds=10, shuffle=true);
    evaluate(model, X, y; resampling, measure=auc)

    This would also have the advantage that you could use the dataset easily in your package tests (without download from a remote server).
    I have also removed the StableRng as I think replicability is not the focus of a minimal example and the begin and let blocks.

  • I think some content of "Advanced example" could go to "Implementation overview". Also "Benchmarks" could perhaps go on its own section so that "advanced example" could focus on how to use the various options offered by the model (the hyperparameters)

openjournals/joss-reviews#5786

You could remove the quotes in the table of page https://sirus.jl.huijzer.xyz/dev/binary-classification/

Done.

I still think (personal and debatable opinion) that the "basic example" still looks more complex than it should be. If an user run PKGDIR = dirname(dirname(@DIR)) or DOCS_DIR = dirname(@DIR), it would find itself in a different directory than the one in the documentation script. Also, users are not supposed to develop a package in order to use it.

Fixed.

If, for example, you add the (small) dataset to the repository as [SIRUS_FOLDER]/test/data/haberman.csv you could simply use something like that as minimal example for a user:

I have considered this, but am hesitant to add data to the source code. I have also considered a less heavy approach via the Julia (Lazy) Artifact system. However, the main issue I have is with the question: If we add a dataset, shouldn't we add more datasets? I think no dataset is a requirement for the package and therefore they should not be linked.

This would also have the advantage that you could use the dataset easily in your package tests (without download from a remote server).

Datadeps does cache the file, so in many cases the test will not require a download. DataDeps is not yet cached via the GitHub Actions Cache (julia-actions/cache#40), but that might be fixed in the future.

I have also removed the StableRng as I think replicability is not the focus of a minimal example and the begin and let blocks.

Thanks for providing the example code, which made the example easy to understand. I have removed the begin and let blocks and replaced them by functions. Functions are often considered as superior to comments in code as they provide clarity while ensuring that the desciption matches the code.

I think some content of "Advanced example" could go to "Implementation overview". Also "Benchmarks" could perhaps go on its own section so that "advanced example" could focus on how to use the various options offered by the model (the hyperparameters)

I haven't done this yet, but start working on it.

One minor issue with the example in the documentation.
When you try to run the code as described in the example you will have the error on the line:

StableRulesClassifier = @load StableRulesClassifier pkg="SIRUS"
ERROR: cannot assign a value to imported variable SIRUS.StableRulesClassifier from module Main

Because you have already imported StableRulesClassifier in the first chunk of the script. Please remove using SIRUS: StableRulesClassifier from that first chunk.

Also, I had some errors in typing the using.. statements in that order, I suggest to write the using DataFrames BEFORE the line using DataDeps: DataDeps, DataDep, @datadep_str

I think some content of "Advanced example" could go to "Implementation overview". Also "Benchmarks" could perhaps go on its own section so that "advanced example" could focus on how to use the various options offered by the model (the hyperparameters)

Could we consider this out-of-scope for now? Then I can focus on the issues that directly affect the paper's content. I've opened an issue for this at #62.

StableRulesClassifier = @load StableRulesClassifier pkg="SIRUS"

Because you have already imported StableRulesClassifier in the first chunk of the script. Please remove using SIRUS: StableRulesClassifier from that first chunk.

It looks like JuliaAI/MLJModels.jl#526 hasn't fixed the issue that we ran into earlier because

pkg> activate --temp

pkg> add MLJModels
   Resolving package versions...
    Updating `/private/var/folders/nf/xynnllys6nl13xhj67sqgt7c0000gn/T/jl_AG8cXc/Project.toml`
  [d491faf4] + MLJModels v0.16.12
  [...]

pkg> add SIRUS
   Resolving package versions...
    Updating `/private/var/folders/nf/xynnllys6nl13xhj67sqgt7c0000gn/T/jl_AG8cXc/Project.toml`
  [cdeec39e] + SIRUS v1.3.3
  [...]

julia> using MLJModels

julia> RulesClassifier = @load StableRulesClassifier pkg="SIRUS" verbosity=0
SIRUS.MLJImplementation.StableForestClassifier

So I'm using the latest versions of MLJModels and SIRUS but still StableRulesClassifier loads the StableForestClassifier. This is a bug that needs further investigation (I've opened an issue at #61). For now, I've fixed the documentation by removing the @load in ca1c02b.

Also, I had some errors in typing the using.. statements in that order, I suggest to write the using DataFrames BEFORE the line using DataDeps: DataDeps, DataDep, @datadep_str

Did you get warnings during the loading of using DataFrames? Could it be that DataFrames was outdated in the global Julia environment? I just ran the following without warnings after updating my global environment:

pkg> activate --temp

julia> begin
           using CategoricalArrays: categorical
           using CSV: CSV
           using DataDeps: DataDeps, DataDep, @datadep_str
           using DataFrames
           using MLJ
           using StableRNGs: StableRNG
           using SIRUS: StableRulesClassifier
       end
 │ Packages [CategoricalArrays, DataDeps, MLJ, StableRNGs, SIRUS] not found, but packages named [CategoricalArrays, DataDeps, MLJ, StableRNGs, SIRUS] are available from a registry.
 │ Install packages?
 │   (jl_cHDhkF) pkg> add CategoricalArrays DataDeps MLJ StableRNGs SIRUS
 └ (y/n/o) [y]:
 [...]
        Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`

julia>

Sorry, by the way, for not understanding earlier that you expect to be able to run the example from the command line. I personally almost never do data science from the command line, so that's why I didn't see it before.

I just ran the whole basic example on the command line and it should now work. Please let me know if you run into issues.