serban / kmeans

A CUDA implementation of the k-means clustering algorithm

Home Page: http://serban.org/software/kmeans

Wish to run omp_kmeans on a 100 GB dataset

Hi, I am planning to run this program on a dataset of almost 100 GB on my server (which has more than 200 GB of memory).
Could you please tell me how to do this? I keep getting a 'segmentation fault' error message when memory usage exceeds 4 GB.
I have checked that the BLAS and LAPACK libraries are all 64-bit versions.
omp_kmeans is also compiled with a 64-bit gcc compiler.

Thank you for your kindness.

Hi there,

The build does not require any BLAS or LAPACK libraries, so don't worry about those.

Are you trying to use the CUDA version? If so, your dataset must fit in the RAM available to your GPU, which typically maxes out at 4 to 8 GB. A dataset of 100 GB is simply too large. I'd go so far as to say that CUDA won't bring you much benefit if you can't fit your dataset in the GPU memory because the time it takes to copy the data back and forth between the CPU memory and GPU memory would cripple the performance of the application.

Let me know if I can be of any more help.

Serban

Hi,
Thank you for replying.

I ran the following command:

./omp_main -i ~/feature.txt -n 50 -p 12 -o

and got the following result (when total RAM usage exceeded 3.7 GB):

Segmentation fault

The file feature.txt contains 87 GB of data; each vector has about 300,000 features.
I am not using CUDA.
Could you please tell me how to fix this error?

Thank you in advance.

Hi there,

I think I found what caused the error.

In file_io.c, the program mallocs one large contiguous block of memory for all objects; the error occurs when that contiguous allocation is too large.

Instead, I allocated memory for each object separately.
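
For reference, here is a minimal sketch of that change (illustrative only; the variable names, and the exact code in file_io.c, may differ):

#include <stdlib.h>

/* Sketch: allocate each object's coordinate vector separately instead of
 * one contiguous numObjs * numCoords * sizeof(float) block, so no single
 * ~87 GB contiguous allocation is required. Names here are illustrative. */
static float **alloc_objects(size_t numObjs, size_t numCoords)
{
    float **objects = (float **) malloc(numObjs * sizeof(float *));
    if (objects == NULL)
        return NULL;
    for (size_t i = 0; i < numObjs; i++) {
        objects[i] = (float *) malloc(numCoords * sizeof(float));
        if (objects[i] == NULL)
            return NULL;   /* caller should free what was already allocated */
    }
    return objects;
}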

Thank you anyway for replying.

Nvidia is ramping up their deep learning efforts and you can now get up to 96 GB of graphics memory. It would be really cool if you could consider eliminating the 32-bit restriction in the CUDA code. For example, I noticed on a g2.2xlarge machine with the Nvidia CUDA AMI (https://aws.amazon.com/marketplace/pp/B01LZMLK1K) that the read call in cuda_io.cu (for the binary file) was limited to reading 2^31 bytes. It's a bit odd, because the machine supports 64-bit.
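
For what it's worth, a per-call limit like that can be worked around by looping over smaller reads. A rough sketch (not the actual cuda_io.cu code; the function and variable names are illustrative):

#include <stdio.h>

/* Read `total` bytes in chunks so that no single fread() call requests
 * anywhere near 2^31 bytes. Returns the number of bytes actually read. */
static size_t read_in_chunks(FILE *f, char *buf, size_t total)
{
    const size_t CHUNK = (size_t) 1 << 30;   /* 1 GiB per call */
    size_t done = 0;
    while (done < total) {
        size_t want = (total - done < CHUNK) ? (total - done) : CHUNK;
        size_t got  = fread(buf + done, 1, want, f);
        if (got == 0)
            break;                           /* EOF or read error */
        done += got;
    }
    return done;
}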

FWIW g2.2xlarge uses a single GPU with 4GB of RAM:

High-performance NVIDIA GPUs, each with 1,536 CUDA cores and 4GB of video memory

Also, 64-bit performance would suffer heavily because it needs the double-precision unit. Most GPUs (except the newest/upcoming Teslas) have minuscule capabilities there, so a server CPU may easily outperform them. I agree the upcoming Pascals will be better suited for 64-bit, though (currently ~5 TFLOPS: http://www.nvidia.com/object/tesla-p100.html).

EDIT: the GPU can in fact handle 64-bit addressing with multi-instruction sequences, but that again may decrease performance: https://developer.nvidia.com/cuda-faq

EDIT2: the double-precision performance figure refers to the ALU; for integers you'd need to rely on multi-instruction sequences.

EDIT3: well, double precision can be used to manipulate any integer up to 2^53 without loss of precision. It's more of a hack, though, and may not be well suited for memory addressing.

They also recently released p2 instances, although the rollout doesn't seem to have finished in practice: https://aws.amazon.com/ec2/instance-types/p2/

Sorry for the shameless promotion, but anyone stuck with the 4 GB memory limit should try https://github.com/src-d/kmcuda. It supports as much memory as your GPU has, runs on multiple GPUs in parallel, and can handle the data in float16 format with Kahan summation (hence effectively double the data size). Still, 100 GB is too much, of course. I would do the following: pick the "best" X GB from the 100 GB, where X is the amount of memory your GPU has, cluster them, and then use the resulting centroids to assign the rest of the dataset, as sketched below.
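
To illustrate the last step of that approach: labelling the remaining objects is just a nearest-centroid search against the centroids learned on the sampled subset. A rough sketch in C (illustrative only, not kmcuda's API):

#include <float.h>
#include <stddef.h>

/* Given centroids (numClusters x numCoords, row-major) computed from the
 * sampled subset, return the index of the centroid closest to `obj` by
 * squared Euclidean distance. Applying this to every remaining object
 * assigns the rest of the 100 GB dataset without re-clustering it. */
static int nearest_centroid(const float *obj, const float *centroids,
                            int numClusters, int numCoords)
{
    int best = 0;
    float bestDist = FLT_MAX;
    for (int k = 0; k < numClusters; k++) {
        const float *c = centroids + (size_t) k * numCoords;
        float dist = 0.0f;
        for (int d = 0; d < numCoords; d++) {
            float diff = obj[d] - c[d];
            dist += diff * diff;
        }
        if (dist < bestDist) {
            bestDist = dist;
            best = k;
        }
    }
    return best;
}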