Really great code, but do you have it coded in Python?

Question

Really great code, but do you have it coded in Python?

Sandy4321 opened this issue 3 years ago · comments

Also, will it work for big data, such as the
us-used-cars-dataset (9 GB, 3 million rows, 66 features, predicting price)?
https://www.kaggle.com/ananaymital/us-used-cars-dataset

Gertjan van den Burg · Answer 1 · Wed Sep 01 2021 04:51:52 GMT+0800 (China Standard Time)

Hi @Sandy4321,

Thanks for your kind words and your question. I don't have an equivalent package in Python, but the core algorithm is not too complex so perhaps you could consider coding it up yourself.

Regarding the dataset: in terms of features this shouldn't be a problem, but you'll likely run into memory issues due to the large number of rows. Perhaps you could consider separating the data into chunks and creating an ensemble model?
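
A minimal sketch of the chunk-and-ensemble idea, under stated assumptions: it uses NumPy and ordinary least squares as a stand-in model (not SparseStep), synthetic data instead of the Kaggle file, and the hypothetical helper names `fit_chunk_ensemble` and `predict_ensemble`. Each chunk of rows gets its own model, and the ensemble prediction is the average of the per-chunk predictions.

```python
import numpy as np


def fit_chunk_ensemble(X, y, n_chunks):
    """Fit one least-squares model per chunk of rows; return the coefficient vectors."""
    models = []
    for X_part, y_part in zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)):
        # Append a column of ones so every chunk model has its own intercept.
        A = np.column_stack([X_part, np.ones(len(X_part))])
        coef, *_ = np.linalg.lstsq(A, y_part, rcond=None)
        models.append(coef)
    return models


def predict_ensemble(models, X):
    """Average the predictions of all chunk models."""
    A = np.column_stack([X, np.ones(len(X))])
    return np.mean([A @ coef for coef in models], axis=0)


# Synthetic stand-in data: 10,000 rows, 5 features, known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
true_coef = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_coef + 3.0 + rng.normal(scale=0.1, size=10_000)

models = fit_chunk_ensemble(X, y, n_chunks=10)
pred = predict_ensemble(models, X)
mse = float(np.mean((pred - y) ** 2))
```

For a file that does not fit in memory, the chunks would come from disk rather than from `np.array_split`, for example by iterating over `pandas.read_csv(..., chunksize=...)` and fitting one model per chunk.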

Sandy4321 · Answer 2 · Wed Sep 01 2021 05:18:26 GMT+0800 (China Standard Time)

"Perhaps you could consider separating the data into chunks and creating an ensemble model?"

Great idea, thanks.
Can you please share a link to Python code for this, for any ML algorithm, even regression or random forest?
How do I divide the data into chunks and create an ensemble model?

Gertjan van den Burg · Answer 3 · Wed Sep 01 2021 05:25:25 GMT+0800 (China Standard Time)

Have you tried scikit-learn? https://scikit-learn.org/stable/modules/ensemble.html#bagging
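
A minimal sketch of what the linked bagging section describes, assuming scikit-learn is installed. `BaggingRegressor` fits each base model on a random subsample of the rows (controlled by `max_samples`) and averages their predictions. The synthetic data, the choice of `ElasticNet` as base model, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import ElasticNet

# Synthetic stand-in data: only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 10))
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=5_000)

bag = BaggingRegressor(
    ElasticNet(alpha=0.01),  # base model, passed positionally
    n_estimators=20,         # number of models in the ensemble
    max_samples=0.1,         # each model sees a random 10% of the rows
    random_state=0,
)
bag.fit(X, y)
r2 = bag.score(X, y)  # R^2 of the averaged predictions
```

Note that `fit` still receives the full `X`, so bagging by itself does not help when the data cannot be loaded into memory; it only keeps each individual model fit small.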

Sandy4321 · Answer 4 · Wed Sep 01 2021 05:46:50 GMT+0800 (China Standard Time)

I see what you mean.
But imagine that rows belonging to the same logical group are spread across different chunks.
Then we just have many weak classifiers.
It is like music: even a million non-professionals cannot compose like one Mozart.

For example:
https://thenewstack.io/the-big-data-debate-batch-processing-vs-streaming-processing/

I was thinking of something like warm_start:
"reuse the solution of the previous call to fit as initialization"

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet
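
A caveat on `warm_start`: it only changes the starting point of the solver. Calling `fit` on a new chunk still optimizes the objective for that chunk alone, so the earlier chunks are not accumulated into the model. For learning across chunks, scikit-learn offers `partial_fit` on some estimators. The sketch below assumes `SGDRegressor` with an elastic-net penalty as a stand-in (it is not SparseStep and not the coordinate-descent `ElasticNet`), and uses synthetic chunks with illustrative parameter values.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([2.0, 0.0, -1.0, 0.0, 0.5])

# Linear model trained by stochastic gradient descent with an elastic-net penalty.
model = SGDRegressor(penalty="elasticnet", alpha=1e-4, l1_ratio=0.5, random_state=0)

# Feed the data one chunk at a time; only one chunk is in memory at once.
for _ in range(50):
    X_chunk = rng.normal(size=(1_000, 5))
    y_chunk = X_chunk @ true_coef + rng.normal(scale=0.1, size=1_000)
    model.partial_fit(X_chunk, y_chunk)  # updates the existing coefficients

coef = model.coef_
```

Stochastic gradient descent is sensitive to feature scale, so real data would normally be standardized first; the synthetic features here already have unit variance.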

Gertjan van den Burg · Answer 5 · Wed Sep 01 2021 06:00:01 GMT+0800 (China Standard Time)

I'm sorry, but I don't think this is related to SparseStep anymore. For general advice on fitting a machine learning model, please ask on places such as Cross Validated.

Sandy4321 · Answer 6 · Wed Sep 01 2021 06:01:01 GMT+0800 (China Standard Time)

It is about SparseStep.
How do I use SparseStep with big data?

Gertjan van den Burg · Answer 7 · Wed Sep 01 2021 08:35:33 GMT+0800 (China Standard Time)

This works for me, no ensemble necessary:

> X <- matrix(rnorm(3e6 * 66), ncol = 66)  # 3e6 rows x 66 features
> y <- as.vector(rnorm(3e6))
> library(sparsestep)
> fit <- path.sparsestep(X, y)
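
As a rough size check, the arithmetic for a dense matrix of this shape can be sketched as follows. It assumes all 66 features are numeric and stored as 8-byte doubles (as in an R numeric matrix); categorical columns expanded into indicator variables would make the matrix wider, and any working copies made during fitting come on top.

```python
rows = 3_000_000
features = 66
bytes_per_double = 8  # R numeric / NumPy float64

matrix_bytes = rows * features * bytes_per_double
matrix_gb = matrix_bytes / 1e9
print(f"{matrix_gb:.3f} GB")  # prints "1.584 GB"
```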