dumblob / hash-db

distributed keyvalue database imitating dynamodb querying with SQL support, distributed joins and Cypher graph support

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

hash-db

This is an experimental project.

  • You can create a basic database with a hash table and a prefix trie. See hash-db.py It's a very small database.
  • This reflects dynamodb style querying.
  • You can create a basic distributed database with consistent hashing. See client.py and server.py. Data is rebalanced onto nodes as new servers are added.
  • SQL is parsed on the server and work distributed to data nodes. Please see this blog post. For how distributed join works: see this blog post.
  • Cypher is parsed on the server and distributed to a data node for processing. Graphs only live on one data node at a time. I haven't worked out how to distribute their processing yet.

This project demonstrates how simple a database can be. Do not use for serious data, it's only stored in memory and there is no persistence.

See this stackoverflow question

Also see the Java version here

This project uses Google's pygtrie and Michaeln Nielsen's consistent hashing code

Running

Run pip install -r requirements.txt

Run ./start-all.sh to start server with 3 data nodes. See example.py for the tests that I run as part of development.

Distributed joins

Data is distributed across the cluster, with rows being on one server each. I haven't gotten around to load balancing the data.

First, register a join with the server:

create join
    inner join people on people.id = items.people
    inner join products on items.search = products.name

print("create join sql")
statement = """create join
    inner join people on people.id = items.people
    inner join products on items.search = products.name
    """
url = "http://{}/sql".format(args.server)
response = requests.post(url, data=json.dumps({
    "sql": statement
    }))
print(url)
print(response.text)

Insert data. The join is maintained as you insert data. Data is spread out across the cluster.

In join on one server and then ask for missing data from the other data nodes.

curl -H"Content-type: application/json" -X POST http://localhost:1005/sql --data-ascii '{"sql": "select products.price, people.people_name, items.search from items inner join people on items.people = people.id inner join products on items.search = products.name"}'

API standard

Sort key begins with value

http://localhost:1005/query_begins/people-100/messages/asc

Pk begins and Sk begins

http://localhost:1005/query_pk_sk_begins/people/messages/desc

Sort key between hash_values

http://localhost:1005/query_between/people-100/messages-101/messages-105/desc

Partition key and sort key between values

http://localhost:1005/both_between/people-100-2020-05/people-100-2020-07/friends-2019/friends-2020-06-~~/desc

SQL Interface

curl -H"Content-type: application/json" -X POST http://localhost:1005/sql --data-ascii '{"sql": "select * from people"}'
print("4 insert sql")                                                                                                  |            elif last_id == identifier:                                                                               
url = "http://{}/sql".format(args.server)                                                                              |                table_metadata["current_record"][field_name] = data[lookup_key]                                       
response = requests.post(url, data=json.dumps({                                                                        |                                                                                                                      
    "sql": "insert into people (people_name, age) values ('Sam', 29)"                                                  |                                                                                                                      
    }))                                                                                                                |        field_reductions = []                                                                                         
print(url)                                                                                                             |        for index, pair in enumerate(table_datas):                                                                    
print(response.text) 

Cypher interface

For simplicity, we only support Cypher triples. That is, (node)-[:relationship]-(node) separated by commas. But the sum of the triples can produce the same output as if the Cypher was all in one line.

curl -H"Content-type: application/json" -X POST http://localhost:1005/cypher --data-ascii '{"key": "1", "cypher": "match (start:Person)-[:FRIEND]->(end:Person), (start)-[:LIKES]->(post:Post), (end)-[:POSTED]->(post) return start, end, post"}'
query = """match (start:Person)-[:FRIEND]->(end:Person), (start)-[:LIKES]->(post:Post), (end)-[:POSTED]->(post) return start, end, post"""
print(query)
url = "http://{}/cypher".format(args.server)
response = requests.post(url, data=json.dumps({
    "key": "1",
    "cypher": query
    }))
print(url)
print(response.text)

About

distributed keyvalue database imitating dynamodb querying with SQL support, distributed joins and Cypher graph support


Languages

Language:Python 99.3%Language:Shell 0.7%