Why do we limit X to be a list of csr_matrix for training?

Question

yupbank opened this issue 6 years ago · comments

https://github.com/Refefer/fastxml/blob/master/fastxml/trainer.py#L383

Andrew Stanton · Answer 1 · Wed Aug 15 2018 13:15:27 GMT+0800 (China Standard Time)

Great question. Because we end up re-splitting the dataset at each stage of the tree, if you give it a single sparse matrix, Python will keep having to recreate each of the CSR rows individually. This is incredibly slow and uses several times more memory.

I require the data to be a list of sparse matrices so we don't have to do a full memory copy to convert it from a csr_matrix to a list of csr matrices.
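
To illustrate the idea, here is a minimal sketch, not the actual fastxml code. The helper names `to_row_list` and `split_rows` and the toy data are made up for this example. It assumes each list element is a single-row csr_matrix. The conversion copies the data once up front; after that, splitting a node only moves references to the existing row objects.

```python
import numpy as np
from scipy.sparse import csr_matrix

def to_row_list(X):
    """Split a 2-D csr_matrix into a list of 1 x n_features csr_matrix rows.

    Hypothetical helper: this is the one-time copy a caller would do
    before training, rather than having it repeated at every split.
    """
    return [X[i] for i in range(X.shape[0])]

def split_rows(rows, go_left):
    """Partition a list of rows by a boolean mask.

    Only Python references are moved; no sparse row is rebuilt.
    """
    left = [r for r, flag in zip(rows, go_left) if flag]
    right = [r for r, flag in zip(rows, go_left) if not flag]
    return left, right

# Toy data, invented for the example
X = csr_matrix(np.array([[1.0, 0.0, 2.0],
                         [0.0, 0.0, 3.0],
                         [4.0, 5.0, 0.0],
                         [0.0, 6.0, 0.0]]))

rows = to_row_list(X)                # copy happens once, here
mask = [True, False, True, False]    # hypothetical split decision at one tree node
left, right = split_rows(rows, mask)

# The child nodes hold the very same row objects as the parent list
print(left[0] is rows[0])
```

By contrast, slicing a full csr_matrix with a row index array at every node builds a new matrix each time, which is the repeated cost described above.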