bulbalion / caspanda

Using Pandas easily with Cassandra

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Build Status codecov.io

  ___ __ _ ___ _ __   __ _ _ __   __| | __ _ 
 / __/ _` / __| '_ \ / _` | '_ \ / _` |/ _` |
| (_| (_| \__ \ |_) | (_| | | | | (_| | (_| |
 \___\__,_|___/ .__/ \__,_|_| |_|\__,_|\__,_|

Aaron Benz
Charlie Hack
Spring 2015


Pandas interface for Cassandra.

##What is it? caspanda is a Python module combines Apache Cassandra with Python's Pandas module... aka caspanda. Its overall goal is to give the user the ability to seperate Cassandra's NoSQL backend from the user's front end experience. Ultimately, it hopes to provide Data Scientists who use Pandas the ability to easily use Cassandra.

It is still very early in its developement, but it plans on using the multi-indexing/pivot ability and the time series functionality available in Pandas to automatically sort and organize a data coming from Cassandra according to its schema. Additionally, it hopes to allow the user to easily insert data back into cassandra without ever having to speak CQL.

Main Features

Here are a few of the things caspanda currently does:

- Puts queried data into a Pandas Dataframe
- Stores data into Cassandra using CassandraFrames (uses sync and async methods)
- Describes the structure of Cassandra Tables in a hierarchical way


One of the main objectives of Caspandas is being able to easily understand and use Cassandra. Unfortunately, many can be misled or lack the understanding of how Cassandra actually stores it's data. The attempt below is meant to give you a conceptual understanding of the hierarchy that the data is really stored in.

The example table sold_cars demonstrates a data model that might exist if you wanted to store the information about sold cars. It stores information about a sale according to the make and state of the car, and then stores the information by day and time. So, the query pattern would specify the make and state, and then give you the ability to choose a date range.

Conceptually this might make since, but the way in which it is written down in CQL if often difficult to grasp for anyone not seasoned in Cassandra. So, we have tried to make this much more simple. First, connect to Cassandra and create the table sold_cars

from caspanda.bear import CasPanda

cl = CasPanda()
session = cl.connect()
session.execute("""CREATE KEYSPACE IF NOT EXISTS tests WITH REPLICATION = { 'class' : 'SimpleStrategy',
                    'replication_factor' : 1 };""")
session.execute("""CREATE TABLE IF NOT EXISTS sold_cars (
    make text,
    state text,
    day timestamp,
    event_time timestamp,
    dealership text,
    salesman text,
    year int,
    account_lead text static,
    distributor_lead text static,
    PRIMARY KEY ((make, state), day, event_time));""")

Now that the table has been created, let's visualize it. This breaks down the names of the columns in a hierarchical fashion that demonstrates how it is actually stored. So for example, The make and state columns define a group of data. That group is ordered and stored by day, and then by event_time. Then, for each event_time, there are fields for a dealership, year, and salesman. Additionally, there is a single value column stored on the same level as day, which is distributor and account_lead.

Said differently, for every make and state, there is one distributor_lead and one account_lead. Also, for every make and state, there can be a combination of dealership, year, and salesman defined by (indexed by) a day and then by an event_time

print cl.keyspaces["tests"].tables["albums"]

#	make text partition_key
#	state text partition_key
#		day timestamp clustering_key
#			event_time timestamp clustering_key
#				dealership text 
#				year int 
#				salesman text 
#		distributor_lead text static
#		account_lead text static

The traditional method for viewing this in CQL is this:

print cl.metadata.keyspaces["tests"].tables["sold_cars"].export_as_string()

#CREATE TABLE tests.sold_cars (
#    make text,
#    state text,
#    day timestamp,
#    event_time timestamp,
#    account_lead text static,
#    dealership text,
#    distributor_lead text static,
#    salesman text,
#    year int,
#    PRIMARY KEY ((make, state), day, event_time)

With that being said, please feel free to reach out to us for comments/suggestions/questions.

There are also some more examples for calling data from Cassandra and inserting it back using only a Pandas Dataframe (which we called a CassandraFrame), in bin/example.py

Example of using Caspanda for selecting data

Running a select from a Cassandra table will automatically return a Pandas Dataframe, even for simple selects. Let's say you have a keyspace called tr_data and you create one table tr_minute with the following columns:

cqlsh:tr_data> create table tr_minute (
 ccypair text,
 gmt_timestamp timestamp,
 mid_rate double,
 ric text static,
 PRIMARY KEY (ccypair, gmt_timestamp) );

Connect to the Cassandra database as usual, then switch to the tr_data keyspace. Any keywords controlling the connection such as the port or using compression are added as arguments to the initial CasPanda() call.

from caspanda.bear import CasPanda
cl = CasPanda(contact_points=['',], port=9042, compression=True)
cpsession = cl.connect()
select_ccys_distinct = """select distinct ccypair from tr_minute"""
ccys = cpsession.execute(select_ccys_distinct)

Now select some time-series data from the table:

select_minute_wlimit = """select ccypair,gmt_timestamp,ric,mid_rate from tr_minute
where ccypair = 'EURUSD' and gmt_timestamp >= '2015-05-01 00:00:00+0000'
and gmt_timestamp < '2015-06-01 00:00:00+0000' LIMIT 5"""
ccyA = cpsession.execute(select_minute_wlimit)
0 EURUSD 2015-05-01 00:00:00.001000 EUR= 1.121370
1 EURUSD 2015-05-01 00:01:00.001000 EUR= 1.120950
2 EURUSD 2015-05-01 00:02:00.001000 EUR= 1.121032
3 EURUSD 2015-05-01 00:03:00.001000 EUR= 1.121001
4 EURUSD 2015-05-01 00:04:00.001000 EUR= 1.120950

The dataframe returned is exactly the same layout as the table, though the pandas index is just the row number. If you want the index to be the timestamp, this has to be done explicitly:


Large result sets

By default the underlying python driver will switch to using paged-result sets if the number of returned rows is greater than 5,000 rows. This will not currently work with caspanda, because the results are not automatically returned by cassandra. The db 'waits' until the driver starts to request the results by page. To get around this you can increase the default select size:

cpsession.default_fetch_size = 50000

However note that cassandra also has a default server-side read timeout of 5 seconds. If you cannot retrieve all rows within this limit you will be timed out.

Parallel sessions

If you need to select basic data that does not really make sense in a dataframe (for instance a string of values to be re-used in another select), you can create another 'parallel' cassandra session, at the same time:

from cassandra.cluster import Cluster
cconnection = Cluster()
csession = cconnection.connect()
cccys = csession.execute(select_ccys_distinct)
# This returns a list of cassandra 'row-type'
ccy_string = ''
for row in cccys:
    ccy_string = ccy_string + row.ccypair +','
print ccy_string

and the results can be pulled directly from the response. You can use both in the same session, according to the type of results needed..


$ python setup.py install or $ pip install -e . You'll also need Cassandra:

$ brew install cassandra


There are some unit and integration tests in the caspanda/tests/ directory.

Run from the command line with

$ nosetests


  • grep -r TODO .
ezoic increase your site revenue


Using Pandas easily with Cassandra

License:MIT License


Language:Python 97.6%Language:Shell 2.4%