Quantco / tabmat

Efficient matrix representations for working with tabular data

Home Page:https://tabmat.readthedocs.io/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

A qc.matrix design proposal

tbenthompson opened this issue · comments

A qc.matrix design proposal:

I've been uncomfortable with the qc.matrix direction lately. This morning, I've thinking API and some of the recent feature addition/API consistency changes. I think the API surface area for qc.matrix is way too large. I don't want to create a huge library at the moment. If we do want to do that, I think we're several months away from being able to release the library.

I'd like to propose that we adopt a design goal for qc.matrix: the project is built as a backend for matrix operations in an algorithmic setting and should not be used for constructing or manipulating your data matrix -- similar in spirit to a LightGBM Dataset object. In light of that, I'd like to propose that the public API consist of two subsets:

  1. A "user"-facing API
    1. from_pandas as the main entrypoint. This is probably the most common way for users to touch qc.matrix. Most users will never use anything else.
    2. also add from_sparse and from_dense entrypoints based on the csc_to_split function.
  2. A algorithm developer facing API:
    1. The matvec, transpose_matvec, sandwich and standardize methods of SplitMatrix and StandardizedMatrix. These are the features used by quantcore.glm.
    2. DenseMatrix, SparseMatrix, CategoricalMatrix should not be used directly.
    3. Instead, they will be wrapped in a SplitMatrix by the from_pandas, from_sparse, and from_dense entrypoints and then standardized into a StandardizedMatrix by standardize. This minimizes the API and means that the three base matrix types can be entirely private to the package. It also means that StandardizedMatrix needs to only handle a SplitMatrix as its underlying unstandardized component.

Overall the goal here is to minimize the API surface area to a total of three free functions, two classes and four methods on those classes. Behavior would be consistent and clear. I want to provide a small number of operations that are each very powerful. Deep systems with a small surface area are useful, easy to understand and easy to maintain.

If the above proposal has everyone's support, then a lot of our current issues are actually no longer needed:

  • #101: I'd vote that we explicitly not support getitem.
  • #105 -- see above! "should we define __matmul__ for all the classes? Should we define the more basic __mul__ and __rmul__?" --> I would say "no and no". If we have only SplitMatrix as outward-facing API, then these methods will not exist and will naturally throw errors. That's great.
  • #104 -- No, we should not support those methods.

I think this will be a fairly quick modification/refactoring and will make the release process much less stressful.

My thoughts (mostly the same as when we first talked about this @tbenthompson):

  • Overall, I agree with you that we should limit the API in the short-term.
  • However, I'm a bit sad about this. I got really excited about quantcore.matrix recently. When I was trying to write a function that would take in both dense numpy array and sparse scipy matrix, I ran into a hell of incompatibility. I believe that quantcore.matrix has the potential to become the go-to data storage for modeling. But we are a very long way from there.
  • I'm not convinced that we need to go as far as removing Dense-, Sparse-, and CategoricalMatrix classes from the public-facing API in order to streamline the release. A couple of points regarding this:
    • If we want to implement something with SplitMatrix, it needs to be implemented with the other underlying classes. Currently, do we need to work on something with those underlying classes? I thought most of the work was with SplitMatrix anyway.
    • For a data scientist with limited knowledge of things like AVX instructions or multiprocessing, CategoricalMatrix is a super nice feature that is easy to understand. Having a public-facing API makes it more salient.
    • Would this mean that the underlying type of X in quantcore.glm will always be a SplitMatrix? If so, we should make sure that we are not much slower compared to a direct DenseMatrix or SparseMatrix.
  • Having a broader goal allows us to gradually improve it. I feel like qc.matrix is a good way for interns or new employees to have a 1-week task to learn about git, review process, scientific computing, modeling.

Overall:

  • We should drop the idea to support basic arithmetic operations. Let's keep the classes like the Dataset class of LightGBM (good analogy btw)
  • @tbenthompson, can you explain in more details what would become easier by dropping the "public" support for Dense, Sparse, and CategoricalMatrix? Right now I lean towards keeping them, but maybe it's because I'm not seeing an obvious and large benefit to this.

Thanks Marc!!

First, an alternate proposal: don't change anything but make it clear which methods are "supported" and which are simply accidentally inherited from their parent classes (np.ndarray and scipy.sparse.csr_matrix). I would be pretty happy with this option! In some ways, I prefer it.

This alternate proposal is also the least amount of work.

On to the original proposal.


@tbenthompson, can you explain in more details what would become easier by dropping the "public" support for Dense, Sparse, and CategoricalMatrix? Right now I lean towards keeping them, but maybe it's because I'm not seeing an obvious and large benefit to this.

My basic logic is that this will:

  1. Remove API inconsistencies.
  2. be easier to maintain/understand because the API is smaller.

I'll expand on both these points.

Removing API inconsistencies
Currently, the DenseMatrix inherits from np.ndarray and, as a result, has the entire API of a numpy array. The SparseMatrix inherits from scipy.sparse.csr_matrix and has the entire API of a scipy.sparse matrix. The Categorical, Split and Standardized Matrix classes are written from scratch and have a much smaller API. Having Split and Standardized as the only user-facing classes would make having a consistent API easy.

Smaller API
As a general rule, smaller APIs are easier to understand and maintain. Tons of costs grow in proportion to the number of methods provided: backwards compatibility, testing, error/input checking. The last one is a concrete example here. For any of our "user" facing methods, we need to check things like input types, input shapes, etc. See, for example L124-L134 in categorical_matrix.py. Currently, if we reduce to just supporting matvec/transpose_matvec/sandwich/standardize, we will have 20 methods where we need to do that error and input checking. By reducing that to just one or two classes, we can feel safe about the correctness of inputs in many more parts of the code.


You also asked:

Would this mean that the underlying type of X in quantcore.glm will always be a SplitMatrix?

Yes, that or a StandardizedMatrix. The overhead is very low. I just measured it and it seems to be on the order of 100 microseconds. We could probably make that smaller, but it's very small already.

I would rather ditch inherited methods than preserve an inconsistent API. Nevertheless, I think it's reasonable at this point to just clarify which methods are officially supported (i.e. changes will first prompt a deprecation warning and imply a major update) and which aren't (i.e. changes may occur without warning).

I wouldn't mind if the SplitMatrix became the only user-facing class in the package. I'd also love if we gradually augmented it till it took over NumPy and SciPy. 😬