Impetus / kundera

A JPA 2.1 compliant Polyglot Object-Datastore Mapping Library for NoSQL Datastores.

Please subscribe to: http://groups.google.com/group/kundera-discuss/subscribe


3rd Try - someone answer please - Not able to fetch Million records from Cassandra

idofmrsandeep opened this issue

Hi Team,

We have a requirement to fetch 1 million to 1 billion records from a Cassandra DB. If we are able to achieve good performance with Kundera, we plan to adopt it in production. Using Kundera I am not able to fetch the records. We have a 3-node Cassandra setup and a 3-node WebLogic server setup. We cannot use Spark for fetching the data; we are following the JPA Java code approach. Without any configuration, if I try to fetch, only 100 records are returned. At the end I have included the code I am using.

Issue 1: If I set maxResults to a value greater than 3000, I get a timeout error from Cassandra. The current read timeout value is 10 seconds. Do I need to increase the read timeout in Cassandra to minutes? Will that help, and is it the right way to do so? If you have sample code for reading millions of records, can you please share it?
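For reference, the server-side read timeouts live in cassandra.yaml. A sketch of the relevant settings, assuming Cassandra 2.x/3.x key names (newer versions rename these, dropping the `_in_ms` suffix); note that range scans, which a BETWEEN query typically becomes, are governed by the range timeout, whose default of 10000 ms matches the 10 seconds mentioned above:

```yaml
# cassandra.yaml -- raise cautiously; very long timeouts mask queries that
# should instead be split into smaller pages or narrower ranges
read_request_timeout_in_ms: 30000    # single-partition reads (default 5000)
range_request_timeout_in_ms: 60000   # range scans / multi-partition reads (default 10000)
```

Raising these is usually a stop-gap; paging the query is the more robust fix.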

Issue 2: If I set maxResults to less than 3000, I am able to read the data, but I have to query in a loop, fetching 100 or 1000 records at a time, and add all the records to a collection. These many iterations consume a lot of time.
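The loop described above can be factored into a single paging helper. A minimal sketch, where `fetchPage` is a hypothetical callback that would wrap `q.setFirstResult(offset).setMaxResults(limit).getResultList()` against the real EntityManager:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class PagedFetch {
    // Repeatedly fetch pages of pageSize records until a short page signals the end.
    public static <T> List<T> fetchAll(BiFunction<Integer, Integer, List<T>> fetchPage,
                                       int pageSize) {
        List<T> all = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<T> page = fetchPage.apply(offset, pageSize);
            all.addAll(page);
            if (page.size() < pageSize) break; // last (possibly partial) page
            offset += pageSize;
        }
        return all;
    }
}
```

Reusing one EntityManager and one Query object across pages, rather than creating a new connection per iteration, avoids most of the per-loop overhead mentioned above.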

Issue 3: Do we need multi-threaded code to read the records? If so, how do we distribute the number of records to read per thread, and will that not also cause timeout errors from Cassandra? Can you please share multi-threading code if you have it?
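One common pattern for the distribution question is to split the overall key or date range into sub-ranges and give each thread its own bounded query. A sketch, assuming a hypothetical `fetchRange(start, end)` callback that would run a bounded JPA query (e.g. `reading_date between start and end`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;

public class ParallelFetch {
    // Split [start, end) into `threads` contiguous sub-ranges and fetch each on its own thread.
    public static <T> List<T> fetchParallel(BiFunction<Long, Long, List<T>> fetchRange,
                                            long start, long end, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long span = (end - start + threads - 1) / threads; // ceiling division
        List<Future<List<T>>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            long s = start + i * span;
            long e = Math.min(s + span, end);
            if (s >= end) break;
            futures.add(pool.submit(() -> fetchRange.apply(s, e)));
        }
        List<T> all = new ArrayList<>();
        for (Future<List<T>> f : futures) {
            try {
                all.addAll(f.get());
            } catch (InterruptedException | ExecutionException ex) {
                throw new RuntimeException(ex);
            }
        }
        pool.shutdown();
        return all;
    }
}
```

Each sub-range should still be paged internally (as in the batching approach above) so that no single query exceeds the Cassandra timeout.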

Clarification: We have a 3-node setup to read data. Is there a relationship between node hardware and read capacity, e.g. can one 16 GB RAM octa-core machine read 1 billion records? If so, can you please let me know.

Sample Code:
String mainQuery = "Select p from MachineAlertAllData p where p.key.reading_date between :startDate and :endDate";
Query q = em.createQuery(mainQuery);
q.setParameter("startDate", startReadingDate);
q.setParameter("endDate", endingDate);
List results = q.getResultList();

I have also applied the settings mentioned here:
https://github.com/Impetus/Kundera/wiki/Using-multiple-node-support-and-load-balancing-policy-in-Kundera

Hi @idofmrsandeep
Cassandra itself limits the number of records returned per read, so fetching in batches is the best approach. How you distribute the queries depends largely on how your data is organized and on your partition keys. If you are using time-based UUIDs as partition keys, each node in the cluster will be responsible for specific token ranges.
You can also try a first query that fetches only the UUIDs and then split the retrieval of the full objects (or rows, as you choose to call them) across threads. Please note that this may not improve performance, since you send the queries to the cluster without knowing which range belongs to which node.
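The keys-first idea above can be sketched as a small helper: fetch only the keys with a lightweight query, then pull the full rows in fixed-size chunks (e.g. a second query of the form `where p.key in :chunk`). Here `fetchByKeys` is a hypothetical callback standing in for that second query:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class KeysFirstFetch {
    // Retrieve full rows for a pre-fetched key list, chunkSize keys per query,
    // so no single request is large enough to hit the server-side timeout.
    public static <K, T> List<T> fetchInChunks(List<K> keys,
                                               Function<List<K>, List<T>> fetchByKeys,
                                               int chunkSize) {
        List<T> all = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += chunkSize) {
            List<K> chunk = keys.subList(i, Math.min(i + chunkSize, keys.size()));
            all.addAll(fetchByKeys.apply(chunk));
        }
        return all;
    }
}
```

Each chunk is independent, so the chunks can also be handed to a thread pool, with the caveat above that the client does not know which node owns which keys.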

Regards