cannot load data

usametov opened this issue 2 years ago · comments

Hi,
I have an 87 MB JSON file.
I am not able to load it.
Here is my code:

(ns my.kn
  (:require [asami.core :as d]
            [clojure.java.io :as io]
            [cheshire.core :refer :all]))

(defn import-data
  [json-file db-uri]
  (let [data (parsed-seq (io/reader json-file) true)]
    (d/transact (d/connect db-uri) {:tx-data data})))

(def db-uri "asami:local://my-data")  
(d/create-database db-uri)   
(import-data "my-file.json" db-uri)

Here is the log:

; eval (current-form): (import-data "my-file ...
; (err) Execution error (ExceptionInfo) at asami.core/eval19083$transact$fn (core.cljc:278).
; (err) Transaction timeout

Could you please give me a hint on how to debug this issue?

Paula Gearon · Answer 1 · Wed Nov 02 2022 03:33:05 GMT+0800 (China Standard Time)

Sorry! I did not see this before now.

I usually debug by interacting with the code as it goes, which may not be much help for you.

The way transact works is to start an async operation, which returns a future. It then waits until the operation finishes or the wait times out. You're seeing the timeout, which defaults to 100 seconds. (Making this configurable at runtime is probably better; I can add that.) It's entirely possible that the transaction finished successfully some time after your timeout exception was thrown. Did you try looking in the graph after some time had passed?
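
As a rough illustration of that behavior, here is a toy model in plain Clojure. It is not Asami's code: `slow-work`, `blocking-call`, and `finished?` are hypothetical names, and the sleep stands in for a large import. It only shows that giving up on the wait does not stop the work behind the future.

```clojure
;; Toy model only: `slow-work` stands in for a large import; this is not Asami code.
(def finished? (atom false))

(defn slow-work []
  (Thread/sleep 300)
  (reset! finished? true)
  :done)

(defn blocking-call
  "Starts `work` on another thread and waits at most `timeout-ms` for it.
  Throws if the wait times out, but leaves the future running."
  [work timeout-ms]
  (let [op     (future (work))
        result (deref op timeout-ms ::timeout)]
    (if (= ::timeout result)
      (throw (ex-info "Transaction timeout" {:op op}))
      result)))

(def outcome
  (try
    (blocking-call slow-work 50)
    (catch clojure.lang.ExceptionInfo e
      ;; The caller gave up after 50 ms, yet the future is still running.
      ;; Dereferencing it without a timeout waits for the real result.
      @(:op (ex-data e)))))
```

After the exception, `outcome` still ends up holding the result of the work, which is why checking the graph a while later is worthwhile.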

Assuming that it is still working, the options would be:

Try with a small file to check whether it really runs as expected and your only issue is timing out too soon.
Increase the timeout. That can be configured by setting asami.txTimeoutMsec in the Java System properties. (It also accepts datomic.txTimeoutMsec)
Call transact-async instead of transact. The response is a future. You can either wait indefinitely on it, via: (deref transaction-future), or you can wait with your own timeout: (deref transaction-future 3600000 ::timeout) if waiting for an hour.

If you look at transact, you'll see that it just launches an async operation, and waits for it to finish.
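
The third option might look something like the sketch below. It assumes Asami is on the classpath and that `transact-async` returns a future, as described above; `transact-with-timeout` is a hypothetical helper name, and the one-hour value is only an example. For the second option, the property could presumably be passed as a JVM option such as `-Dasami.txTimeoutMsec=3600000`.

```clojure
(require '[asami.core :as d])

(defn transact-with-timeout
  "Hypothetical helper: starts the transaction asynchronously and waits
  up to `timeout-ms` for it, instead of relying on the default timeout."
  [conn data timeout-ms]
  (let [tx     (d/transact-async conn {:tx-data data})
        result (deref tx timeout-ms ::timeout)]
    (if (= ::timeout result)
      (throw (ex-info "Import did not finish in time" {:timeout-ms timeout-ms}))
      result)))

;; Example use, waiting up to an hour:
;; (transact-with-timeout (d/connect "asami:local://my-data") data 3600000)
```

To wait indefinitely instead, dereference the future directly with `(deref tx)` or `@tx`.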

Ulan Sametov · Answer 2 · Wed Nov 02 2022 14:09:36 GMT+0800 (China Standard Time)

OK, thanks. I will try that.
Do you have an example of multiple inserts?
Or maybe it is a straightforward doseq?
Also, my JSON has a lot of repeated strings.
Should I "normalize" them, that is, insert those strings first and then refer to the nodes created?

Paula Gearon · Answer 3 · Wed Nov 02 2022 23:25:29 GMT+0800 (China Standard Time)

I would usually do multiple inserts by concatenating the data and doing a single insert with a long timeout. The reason for that is so you don't end up with multiple checkpoints in the indexes. That matters for on-disk storage because each checkpoint makes the nodes in the index trees immutable. This means that you'll need to start copying nodes again as the next transaction starts, which is both slower and uses more disk space.
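
A minimal sketch of the single-insert approach, assuming the data arrives as several collections of entity maps. `combine-batches` and the sample data are hypothetical; the Asami call is left as a comment because it needs a live connection.

```clojure
(defn combine-batches
  "Concatenates several collections of entity maps into one vector,
  so they can be sent as a single transaction."
  [batches]
  (into [] cat batches))

;; Hypothetical sample data: two batches that would otherwise be two inserts.
(def batches [[{:name "a"} {:name "b"}]
              [{:name "c"}]])

(def tx-data (combine-batches batches))

;; With Asami loaded as `d` and an open connection `conn`, the combined data
;; could then go in as one transaction, waiting as long as needed:
;; @(d/transact-async conn {:tx-data tx-data})
```

One transaction means one checkpoint, at the cost of holding all the data in memory at once.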

Repeated strings are just fine. As data comes in, every string is converted to a number, and the number is stored in the triple. It is important that the same string will always return the same number. This is done in 2 ways:

short strings (7 characters or fewer) are encoded into a long
all other strings are stored in a tree index that acts as a key/value store

Each time a longer string is used, the index is searched:

if the string is found, then the associated number is returned.
if the string is not found, the location in the tree where it wasn't found (between 2 tree nodes) is used as the insertion point for that string, along with the number it is to be associated with. That number is then returned.

(I can explain what the number is too, but it doesn't really matter in the context of your question)
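
To make the lookup-or-insert idea concrete, here is a toy model in plain Clojure. It is not Asami's implementation: the byte layout in `encode-short`, the use of negative numbers for indexed strings, and the in-memory sorted map standing in for the tree index are all assumptions made for illustration. The toy also counts UTF-8 bytes rather than characters.

```clojure
(defn encode-short
  "Packs up to 7 bytes into one long, with the byte count in the top byte.
  Illustrative layout only."
  [^String s]
  (let [^bytes bs (.getBytes s "UTF-8")]
    (reduce (fn [acc b]
              (bit-or (bit-shift-left acc 8) (bit-and (long b) 0xff)))
            (long (alength bs))
            bs)))

;; Stand-in for the on-disk tree index: a sorted map from string to number.
;; Negative ids keep indexed strings apart from the encoded short strings.
(def index (atom {:tree (sorted-map) :next-id -1}))

(defn lookup-or-insert!
  "Returns the number for `s`, inserting it with a fresh number if absent."
  [s]
  (-> (swap! index
             (fn [{:keys [tree next-id] :as state}]
               (if (contains? tree s)
                 state
                 {:tree (assoc tree s next-id) :next-id (dec next-id)})))
      (get-in [:tree s])))

(defn string->id
  "Short strings are encoded in place; longer ones go through the index."
  [^String s]
  (if (<= (alength (.getBytes s "UTF-8")) 7)
    (encode-short s)
    (lookup-or-insert! s)))

;; A repeated string always maps to the same number:
(= (string->id "a repeated long string")
   (string->id "a repeated long string")) ;; true
```

Because the second call finds the string already present, repeats cost a lookup rather than another stored copy, so there is no need to normalize the data beforehand.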