pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

Additional Random Generator

smldub opened this issue

Problem:
Problem:
The sparse.random function is very fast for tensors below roughly 1E9 elements, but beyond that point the two current index-generation methods run into drawbacks:

  1. random_state.choice requires enough memory to hold an array with as many entries as the tensor has elements, and in my experience this becomes very slow for tensors larger than 1E9 elements (see the sketch after this list).
  2. The set/hashing method used when the density is below 0.3 is limited by being a Python for loop, and it takes quite some time to generate more than about 1E6 random elements (a count well below the 0.3 density threshold for a 1E9-element tensor). I think this is likely the result of a hash-table lookup on every iteration.
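
To make drawback 1 concrete, here is a minimal sketch (the sizes and seed are only examples, and the note about the internal permutation reflects my understanding of the legacy RandomState implementation):

import numpy as np

# Illustration only: with the legacy RandomState API, choice(..., replace=False)
# shuffles a full permutation of the population internally, so memory and time
# scale with the total number of elements rather than with nnz.
N = int(1E9)     # total number of elements in the tensor
nnz = int(1E6)   # number of nonzero entries we actually want
rs = np.random.RandomState(42)
flat_idx = rs.choice(N, size=nnz, replace=False)  # roughly 8 GB of intermediate int64 data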

Potential Solution:
  1. Generate an array of flat indices using np.random.randint that is nnz (the number of nonzero entries desired) long.
  2. Use np.unique to both sort the indices and remove any repeats.
  3. Go back to step 1, but decrease the desired length to the number of entries still needed, and np.hstack the resulting arrays.

np.unique also sorts the results, which makes construction of the COO matrix a little faster.

Example Code:

import numpy as np
import sparse


def rand(shape, nnz=None, density=0.1):
    elements = np.prod(shape)
    if nnz is None:
        nnz = int(elements * density)
    if nnz > elements:  # my sad attempt at preventing errors
        print("fail")
        return None
    # Generate an initial guess for the flat indices, then drop repeats.
    out = np.random.randint(elements, size=nnz)
    out = np.unique(out)  # remove the repeated indices (also sorts them)
    nnztemp = len(out)
    while nnztemp < nnz:  # loop to draw the remaining indices
        out = np.hstack((out, np.random.randint(elements, size=int(nnz - nnztemp))))
        out = np.unique(out)
        nnztemp = len(out)
    # Convert the flat (1-D) indices to N-D coordinates.
    out = np.array(np.unravel_index(out, shape), dtype=np.int64)
    return sparse.COO(out, data=np.random.rand(out.shape[1]), shape=shape)
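
For reference, the sketch above could be called like this (the shape matches the benchmark below; the density is only an example value):

# Hypothetical usage of the rand() sketch above.
s = rand((100, 100, 100, 1000), density=1E-4)
print(s.shape, s.nnz)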

Here is a plot for a tensor with 1E9 elements (shape (100, 100, 100, 1000)), comparing the speed of generating a random sparse array with the default sparse.random function (orange) vs. the function above (blue).

[Figure 1: timing of the default sparse.random (orange) vs. the proposed function (blue) for a 1E9-element tensor]

The new method also performs relatively well in the low-density limit for particularly large tensors, because the first batch of candidate indices usually contains no repeats.

I would be interested to hear any feedback on the idea before I try to implement something.

  1. random_state.choice requires enough memory to hold an array with as many entries as the tensor has elements, and in my experience this becomes very slow for tensors larger than 1E9 elements

I find this quite surprising; I would have imagined it takes O(nnz) time, since it can use this algorithm.

Maybe you could experiment with Numba to implement it? 😉 https://numba.readthedocs.io/en/stable/reference/numpysupported.html#random

Those definitely look interesting, but I wonder if they operate with an unnecessary handicap because the size of the reservoir is unknown, while it is known in our case. I'll test against some other sample-without-replacement algorithms after I do a little research.

What about Fisher-Yates, or a variation of it?
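
One such variation avoids materializing the full index array: a partial Fisher-Yates shuffle over an implicit arange(N) that only remembers the swapped slots. The sketch below is purely illustrative (the function name and the dict-based bookkeeping are my own choices, not existing sparse code):

import numpy as np

def partial_fisher_yates(N, nnz, seed=None):
    # Partial Fisher-Yates shuffle over an implicit arange(N): only slots that
    # have been swapped are stored in a dict, so memory is O(nnz) rather than
    # O(N).  Returns nnz distinct flat indices in [0, N).
    rng = np.random.default_rng(seed)
    swapped = {}
    out = np.empty(nnz, dtype=np.int64)
    for i in range(nnz):
        j = rng.integers(i, N)            # pick a slot from the untouched tail
        out[i] = swapped.get(j, j)        # value currently sitting in slot j
        swapped[j] = swapped.get(i, i)    # move slot i's value into slot j
    return out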

After a little bit of research, I came across these two articles, which lay out some interesting paths.
https://arxiv.org/abs/1610.05141
https://arxiv.org/abs/2104.05091
I've implemented the D algorithm (by Vitter), which is supposedly the slowest of the really fast algorithms according to the first article, but has the nice property that its cost per sample is constant, so the runtime depends on the number of samples rather than on the size of the tensor.

I've plotted its performance building a sparse array vs. the traditional sparse.random, with the size of the tensor listed on top (Algorithm D in blue, sparse.random in orange).

[Figure: timing of Algorithm D (blue) vs. sparse.random (orange) for the tensor sizes listed above each plot]

I think that the parallel algorithms listed in the first paper would be interesting, but the code required just to build a simple random array would probably balloon pretty fast.
The first paper claims the B algorithm,
https://dl.acm.org/doi/pdf/10.1145/214392.214402
could be implemented to run faster than D even without parallelization, so that might be worth checking out too.

Here is the D algorithm if you don't want to type it out yourself.

import numpy as np
import numba
import sparse


@numba.jit(nopython=True, nogil=True)
def algD(n, N):
    # Vitter's Algorithm D: draw n sorted indices without replacement from
    # range(N), using O(n) memory and expected O(n) time.
    n = int(n + 1)  # sample via n skips; the loop below stops once n reaches 1
    N = int(N)
    j = -1
    qu1 = N - n + 1
    negalphainv = -13
    threshold = -negalphainv * n

    nreal = float(n)
    Nreal = float(N)
    nmin1inv = 1.0 / (n - 1)
    Vprime = np.exp(np.log(np.random.rand()) / n)
    qu1real = 1.0 - nreal + Nreal
    i = 0
    arr = np.zeros(n - 1)  # sampled (sorted) flat indices, stored as floats
    while n > 1:
        nmin1inv = 1.0 / (nreal - 1.0)
        while True:
            # Step D2: generate a candidate skip length X (and its floor S).
            while True:
                X = Nreal * (-Vprime + 1.0)
                S = np.floor(X)
                if S < qu1:
                    break
                Vprime = np.exp(np.log(np.random.rand()) / n)
            U = np.random.rand()
            negSreal = -S
            # Step D3: quick acceptance test.
            y1 = np.exp(np.log(U * Nreal / qu1real) * nmin1inv)
            Vprime = y1 * (-X / Nreal + 1.0) * (qu1real / (negSreal + qu1real))
            if Vprime <= 1.0:
                break
            # Step D4: full acceptance test.
            y2 = 1.0
            top = Nreal - 1.0
            if n - 1 > S:
                bottom = Nreal - nreal
                limit = N - S
            else:
                bottom = negSreal + Nreal - 1.0
                limit = qu1

            t = N - 1
            while t >= limit:
                y2 *= top / bottom
                top -= 1.0
                bottom -= 1.0
                t -= 1
            # Accept S if N/(N - X) >= y1 * y2**(1/(n-1)).
            if Nreal / (-X + Nreal) >= y1 * np.exp(np.log(y2) * nmin1inv):
                Vprime = np.exp(np.log(np.random.rand()) * nmin1inv)
                break
            Vprime = np.exp(np.log(np.random.rand()) / n)
        # Step D5: skip S records and select the next one.
        j += S + 1
        arr[i] = j
        i += 1
        N = -S + (N - 1)
        Nreal = negSreal + (-1.0 + Nreal)
        n -= 1
        nreal -= 1.0
        ninv = nmin1inv
        qu1 = -S + qu1
        qu1real = negSreal + qu1real
        threshold += negalphainv
    return arr
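
A rough usage sketch, mirroring the plumbing of the earlier rand() example (the shape and nnz values are only illustrative):

# Hypothetical usage: draw nnz flat indices and build a COO array from them.
shape = (100, 100, 100, 1000)
nnz = 1_000_000
N = int(np.prod(np.array(shape, dtype=np.int64)))  # total number of elements
flat = algD(nnz, N).astype(np.int64)               # sorted flat indices
coords = np.array(np.unravel_index(flat, shape), dtype=np.int64)
s = sparse.COO(coords, data=np.random.rand(len(flat)), shape=shape)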

Please go ahead and add it! 😄 I still think we have some minor kinks to work out, but that can be done better on the PR with comments.

I have uploaded the code I've written so far to my branch, but pytest isn't happy with it yet. Should I push that to the main branch even though it will be full of errors? Sorry, I'm not really familiar with the whole GitHub workflow.

When you open a pull request, it isn't automatically merged. In fact, that's the best way to resolve errors together since we can see the changes and work on fixes.

Also, the tests run on Continuous Integration, so I can see what the failures are.

In addition, you can open a pull request from a branch other than main -- that's usually how it's done. Feel free to ask follow-up questions; I'm happy to help.

Also, feel free to use Gitter for higher-frequency communication.