AtheMathmo / rulinalg

A linear algebra library written in Rust

Home Page:https://crates.io/crates/rulinalg

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Decide File IO / Serialization API

sinhrks opened this issue · comments

Currently we have no API for Matrix to be input / output to external sources. The issue intends to fix a standard IO API related to various sources.

I've prepared a PR #126 for CSV IO for discussion. Current specs are:

  • Uses optional feature gates io to support CSV IO.
  • Matrix now has read_csv (static) and write_csv to read / write CSVs.
  • Internally it uses csv crate, which is optional dependency when io feature is enabled.

For other sources can be added with:

  • Add read_xxx and write_xxx to Matrix appropriately.
  • The actual impl can depends on other crates as optional dependency.

Would like to ask whether the above direction looks OK.

I think that #126 and the proposal here are a good starting point - if we can't reach a consensus then the PR would make a strong contribution on it's own.

Now onto more direct discussion. One concern of this approach would be that this could lead to a very bloated API in the future if we want to support many data formats. Is there a way we can have a single read_from_file function, which takes as argument some abstraction we give over many data readers?

For now I think we should put off this approach until we know a little more about what we're dealing with. My vote is to tackle an approach at least similar to the one given here.

I think the PR is a great start. I have some comments and concerns which I will post soon, most likely during next week. Currently rather busy!

Some concerns:

  • Like @AtheMathmo, I'm a little concerned that the Matrix API will be very bloated if we add IO-related methods directly to its API.
  • Feature gated IO is a great idea, but I think it needs to be even more fine-grained: consider supporting HDF5, which requires access to a native library. If a user only wants to use CSV, it would be very inconvenient to have to install HDF5. Hence, I think each individual format needs to be feature-gated. That is, for the example I just mentioned, you'd have a csv feature, and a hdf5 feature.

Based on these concerns, I'd like to make the following suggestions:

Create an io module within rulinalg (without feature-gating), which contains the generic IO documentation, explaining the overall philosophy and how to enable various IO functionality. Enabling various IO features would then enable submodules within io, i.e. rulinalg::io::csv. Here you might have free functions such as csv::read_matrix(...) or types such as csv::MatrixReader, depending on whatever is appropriate. In any case I think they should be external to the Matrix API.

I think it's hard to come up with generic traits for IO, because they might have very different requirements. For example, a CSV file is very simple and so doesn't need much configuration, whereas more complicated storage formats might require far more configuration options. Although it would be desirable to have a unified API for different formats, it might be more practical to simply let each storage format involve separately within the io module for now, and then we can try to unify them later.

I'm coming back to look at some of these issues and PRs again now - hopefully I can start getting things resolved!

I think @Andlon makes a good point - we don't want our users to have to install a bunch of io-stuff to access small parts of that functionality (HDF5 vs csv as the given example). Though I'm not convinced that feature gating each component is the best approach, as this could potentially get messy. You also pointed out that generic IO traits aren't a silver bullet - even if it were doable (and/or pretty).

My vote for now would be to merge #126 as-is and see how this story evolves. This issue can stay open and be a reference point if/when this discussion receives more attention. Thoughs @Andlon @sinhrks ?

(Also thank you both for your patience)

Yeah, currently we can define io to install all io related sub-features. We may split it when we add others in future, but io should be still usable to install all of them. So no problem from user's point of view.

I'm OK with it either way! As you say, I think #126 is a good start for the IO story. The only thing I'm not a huge fan of is adding yet more methods to the Matrix API, but for now this is bearable.