mfussenegger / cr8

CLI collection of utilities for working with CrateDB or PostgreSQL. Benchmark queries, insert data.

num records 10000 and bulk size 1000 only writes 1000 keys in crate

asicoe opened this issue · comments

Running this:
cr8 insert-fake-data --hosts crate-1:4200 --table t1 --bulk-size 1000 --num-records 10000

results in only 1000 records in a 1 node CrateDB 2.1.5.

Expected behaviour is to find 10000 records.

Could you post the schema of the table and show the output that you get with cr8? I can't reproduce this.

Hi,

Sure, sorry for not doing it already.

I am using Python 3.6.2.

git clone https://github.com/mfussenegger/cr8.git
cd cr8
python3.6 -m pip install -e .
docker run -d -p 4200:4200 -p 4300:4300 -p 5432:5432 crate:2.1.5
docker exec -ti crate crash --command "CREATE TABLE t1 (a STRING, b STRING, c STRING, d STRING, PRIMARY KEY (a, b)) CLUSTERED BY (a) with (number_of_replicas = 0);"

cr8 insert-fake-data --table t1 --bulk-size 100 --num-records 1000 //Expect 1000 rows in t1.
Found schema:
{
"a": "string",
"b": "string",
"c": "string",
"d": "string"
}
Using insert statement:
insert into "doc"."t1" ("a", "b", "c", "d") values (?, ?, ?, ?)
Will make 10 requests with a bulk size of 100
Generating fake data and executing inserts
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:01<00:00, 9.96 requests/s]

docker exec -ti crate crash --command "REFRESH TABLE t1;"
+----------------------------+----------------+---------+-----------+---------+
| server_url | node_name | version | connected | message |
+----------------------------+----------------+---------+-----------+---------+
| http://localhost:4200 | Dômes de Miage | 2.1.5 | TRUE | OK |
+----------------------------+----------------+---------+-----------+---------+
REFRESH OK, 1 row affected (0.006 sec)

docker exec -ti crate crash --hosts $(docker-machine ip) --command "select * from t1;"
+----------------------------+----------------+---------+-----------+---------+
| server_url | node_name | version | connected | message |
+----------------------------+----------------+---------+-----------+---------+
| http://192.168.99.100:4200 | Dômes de Miage | 2.1.5 | TRUE | OK |
+----------------------------+----------------+---------+-----------+---------+
+--------------+-------------+----------------+----------------+
| a | b | c | d |
+--------------+-------------+----------------+----------------+
| dolorem | facilis | commodi | nesciunt |
| debitis | assumenda | aperiam | repellendus |
| repellat | laborum | expedita | similique |
| nesciunt | asperiores | ea | numquam |
| explicabo | harum | cum | autem |
| ipsa | libero | perferendis | porro |
| illum | sequi | nisi | beatae |
| eveniet | aperiam | magnam | libero |
| dolorem | consectetur | saepe | ullam |
| repellat | debitis | placeat | libero |
| dolores | officia | est | doloremque |
| possimus | molestiae | a | velit |
| amet | repellendus | commodi | inventore |
| nobis | eius | enim | nostrum |
| cum | quia | ipsam | reiciendis |
| expedita | blanditiis | nam | unde |
| laudantium | nam | et | exercitationem |
| ut | suscipit | culpa | voluptatibus |
| repellat | repellat | eius | eos |
| nesciunt | reiciendis | fugiat | maxime |
| natus | quam | non | recusandae |
| saepe | odio | iure | eum |
| voluptas | officiis | officiis | reiciendis |
| saepe | commodi | sapiente | ab |
| id | in | id | corrupti |
| voluptate | asperiores | doloremque | perspiciatis |
| quidem | asperiores | incidunt | ea |
| tenetur | qui | earum | harum |
| ut | numquam | harum | enim |
| minima | unde | ratione | sapiente |
| ullam | asperiores | voluptas | ratione |
| dolorum | sunt | modi | nobis |
| maxime | laborum | enim | sed |
| voluptatem | ducimus | explicabo | dolor |
| laboriosam | repellendus | fuga | doloribus |
| veritatis | aspernatur | exercitationem | quo |
| quam | excepturi | culpa | tenetur |
| cumque | perferendis | molestias | culpa |
| maiores | vero | ad | cum |
| consectetur | vero | sed | necessitatibus |
| accusamus | officia | possimus | repudiandae |
| quaerat | maiores | fugiat | deserunt |
| ullam | rerum | maiores | cum |
| autem | in | ab | ea |
| odit | sed | reprehenderit | unde |
| ea | ipsum | doloremque | pariatur |
| maiores | cumque | sunt | illo |
| voluptatem | eligendi | temporibus | quo |
| adipisci | excepturi | saepe | natus |
| eum | non | ipsa | enim |
| deserunt | voluptatem | sit | repellendus |
| quibusdam | doloremque | quia | earum |
| odit | ipsa | nobis | animi |
| cumque | consectetur | possimus | temporibus |
| dolore | quasi | aliquid | quidem |
| repudiandae | inventore | saepe | ipsa |
| cumque | inventore | debitis | cumque |
| ad | ipsa | eligendi | perspiciatis |
| esse | quo | numquam | neque |
| sint | sapiente | alias | consectetur |
| non | sint | suscipit | voluptatem |
| et | minima | ducimus | accusantium |
| inventore | nulla | ducimus | consequatur |
| ex | facilis | dolorum | exercitationem |
| delectus | sequi | minima | laborum |
| dolor | iure | vitae | voluptatem |
| placeat | doloribus | officiis | optio |
| quisquam | aut | nesciunt | necessitatibus |
| ipsum | neque | pariatur | itaque |
| itaque | quae | nobis | consequuntur |
| incidunt | optio | iure | nemo |
| suscipit | atque | magni | explicabo |
| magnam | vero | quos | quas |
| perferendis | sunt | nemo | vitae |
| recusandae | vero | provident | minus |
| quod | deserunt | porro | perspiciatis |
| ad | harum | fugiat | doloremque |
| nostrum | tempore | facilis | nisi |
| pariatur | consequatur | quasi | mollitia |
| consequuntur | iusto | exercitationem | dolore |
| iusto | et | neque | fuga |
| dicta | laudantium | suscipit | animi |
| qui | dicta | consequatur | eligendi |
| modi | quam | placeat | atque |
| sapiente | soluta | fugit | architecto |
| praesentium | illo | occaecati | labore |
| illo | aperiam | optio | porro |
| unde | nulla | repellat | delectus |
| modi | natus | enim | rerum |
| ratione | molestiae | inventore | eos |
| est | error | ipsam | nihil |
| odio | incidunt | eligendi | temporibus |
| magni | quas | ipsa | praesentium |
| illo | amet | nihil | eum |
| doloribus | porro | dolore | atque |
| doloremque | officia | nulla | aliquid |
| magni | reiciendis | molestiae | suscipit |
| ratione | similique | molestias | minima |
| nam | aliquid | maiores | voluptas |
| nulla | voluptates | rerum | eligendi |
+--------------+-------------+----------------+----------------+
SELECT 100 rows in set (0.013 sec)

Thanks

I was able to reproduce it with your schema. The problem is the primary key
definition. The default random value provider for string columns doesn't
deliver enough unique values.

If you remove the primary key definition, you'll get 1000 records inserted. But
if you then run, for example, select distinct a from t1, you'll see that you
get a much lower number of records (74 in my case).

So you can either:

  • Not use a primary key at all
  • Add a column named "id" of type string as primary key. cr8 will then pick a uuid provider.
  • Make use of the --mapping-file option to choose a different provider for the PK columns. (See cr8 insert-fake-data --help)
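
The collision effect described above can be sketched with the standard library alone. This is only an illustration, not cr8's or Faker's code: the 50-word VOCABULARY, the helper names, and the seed are assumptions. It also assumes that an insert whose primary key already exists does not add a row. The sketch counts distinct values of a single word column, distinct (a, b) word pairs, and distinct uuid-style keys over 1000 generated records.

```python
import random
import uuid

# Hypothetical stand-in for a word provider: a small, fixed vocabulary.
VOCABULARY = [f"word{i}" for i in range(50)]


def count_surviving_rows(num_records, make_key, rng):
    """Count distinct primary keys among num_records generated rows."""
    seen = set()
    for _ in range(num_records):
        seen.add(make_key(rng))
    return len(seen)


def word_key(rng):
    # Single-column key drawn from the small vocabulary.
    return rng.choice(VOCABULARY)


def word_pair_key(rng):
    # Composite key (a, b), both parts drawn from the small vocabulary.
    return (rng.choice(VOCABULARY), rng.choice(VOCABULARY))


def uuid_key(rng):
    # 128 random bits per key, so collisions are practically impossible.
    return str(uuid.UUID(int=rng.getrandbits(128)))


rng = random.Random(42)
distinct_a = count_surviving_rows(1000, word_key, rng)
word_rows = count_surviving_rows(1000, word_pair_key, rng)
uuid_rows = count_surviving_rows(1000, uuid_key, rng)
print(distinct_a, word_rows, uuid_rows)
```

With a vocabulary this small, the single column can never exceed 50 distinct values, and the composite key falls short of 1000 as well, while the uuid-style key stays unique for every record.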

Hi,

Thanks for the reply.

OK, I see your point, but if the culprit is the randomness of the keys, why do I always get exactly as many records in CrateDB as the --bulk-size I pass in? For instance, the exact same example above, run with --bulk-size 10 and --num-records 100, gives 10 records in CrateDB.

As for your suggestions:

  1. I have to use a primary key so I cannot change the schema.
  2. I now have id in the schema and indeed uuids are generated but it still generates exactly bulk-size records in crate:

docker exec -ti crate crash -c "CREATE TABLE t1 (id STRING, b STRING, c STRING, d STRING, PRIMARY KEY (id, b)) CLUSTERED BY (id) with (number_of_replicas = 0);"
+----------------------------+----------------+---------+-----------+---------+
| server_url | node_name | version | connected | message |
+----------------------------+----------------+---------+-----------+---------+
| http://localhost:4200 | Dômes de Miage | 2.1.5 | TRUE | OK |
+----------------------------+----------------+---------+-----------+---------+
CREATE OK, 1 row affected (0.112 sec)

cr8 insert-fake-data --table t1 --bulk-size 10 --num-records 100
Found schema:
{
"b": "string",
"c": "string",
"d": "string",
"id": "string"
}
Using insert statement:
insert into "doc"."t1" ("b", "c", "d", "id") values (?, ?, ?, ?)
Will make 10 requests with a bulk size of 10
Generating fake data and executing inserts
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 58.71 requests/s]

docker exec -ti crate crash --command "REFRESH TABLE t1;"
+----------------------------+----------------+---------+-----------+---------+
| server_url | node_name | version | connected | message |
+----------------------------+----------------+---------+-----------+---------+
| http://localhost:4200 | Dômes de Miage | 2.1.5 | TRUE | OK |
+----------------------------+----------------+---------+-----------+---------+
REFRESH OK, 1 row affected (0.001 sec)

docker exec -ti crate crash --command "SELECT * FROM t1;"
+----------------------------+----------------+---------+-----------+---------+
| server_url | node_name | version | connected | message |
+----------------------------+----------------+---------+-----------+---------+
| http://localhost:4200 | Dômes de Miage | 2.1.5 | TRUE | OK |
+----------------------------+----------------+---------+-----------+---------+
+--------------+----------+----------------+--------------------------------------+
| b | c | d | id |
+--------------+----------+----------------+--------------------------------------+
| quae | deleniti | in | aee02efe-fec2-1784-5bd1-bf2ca01bddf3 |
| quae | ratione | voluptas | 8ad6939c-93f4-d5c8-c104-a59becf41ab1 |
| dolorum | illo | sapiente | fc285551-4dff-cd1a-e471-633e62777318 |
| consectetur | vero | laudantium | 52df3934-c558-920d-dc64-a197b9fdc4de |
| minus | expedita | iste | 5ca91c31-8b85-28d2-c8f5-42b9d387fe26 |
| accusantium | maxime | consequatur | b8378220-6a38-4ca3-cd7c-ab8cb1b733a1 |
| consequuntur | illum | expedita | 4bb1618f-3bff-bae1-c9b6-1492df3dd989 |
| cupiditate | soluta | quidem | 3ff136c0-6b23-e1d9-f2b3-38362ca722e0 |
| quaerat | dolorem | exercitationem | 20e32c03-50d5-124d-2023-d2573731c3d4 |
| sapiente | officia | recusandae | 8e5ce352-7f48-81e7-73f4-2f13d11fc90f |
+--------------+----------+----------------+--------------------------------------+

The same thing happens if I remove b from the primary key, and also if I then remove the CLUSTERED BY clause.

  3. I haven't tried option 3 yet, as option 2 did not work.

Thanks

Hi Mathias,

Were you able to reproduce the above?

Thanks

Yep I'm able to reproduce it. Thanks.

It seems that the seed handling in https://github.com/joke2k/faker changed slightly. Now in cr8 each bulk request appears to generate data based on the same starting seed.

As a workaround you could downgrade faker until I've adapted cr8

(to downgrade you can use e.g. pip install Faker==0.7.1)
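
The symptom reported earlier in the thread (exactly bulk-size rows, whatever --num-records is) is what you would expect if every bulk request restarts from the same seed. The following is a minimal simulation of that effect using only Python's random module. It does not reproduce cr8's or Faker's actual code; the function names and the seed value 0 are assumptions made for illustration.

```python
import random


def make_batch(rng, bulk_size):
    # Each key is 128 random bits, so keys within one batch are unique.
    return [format(rng.getrandbits(128), "032x") for _ in range(bulk_size)]


def distinct_keys(num_records, bulk_size, reseed_each_batch):
    """Generate num_records keys in batches and count the distinct ones."""
    shared_rng = random.Random(0)
    keys = set()
    for _ in range(num_records // bulk_size):
        # Re-creating the generator with the same seed makes every batch
        # replay the same sequence of "random" keys.
        rng = random.Random(0) if reseed_each_batch else shared_rng
        keys.update(make_batch(rng, bulk_size))
    return len(keys)


same_seed = distinct_keys(100, 10, reseed_each_batch=True)
shared_state = distinct_keys(100, 10, reseed_each_batch=False)
print(same_seed)     # 10
print(shared_state)  # 100
```

When the generator state carries over between batches, all 100 keys are distinct. When each batch starts from the same seed, batches two through ten repeat the keys of the first, so a table with a primary key ends up with only bulk-size rows.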

Should be fixed via #175 - although the data generation will be a bit slower.

Thanks Mathias.
pip install Faker==0.7.1
worked for me.