ahyatt / ekg

The emacs knowledge graph, app for notes and structured data.

Failure Running ekg-embedding-generate-all - Model Max Content Length

jayrajput opened this issue · comments

Running ekg-embedding-generate-all results in the following error:

[error] request--callback: peculiar error: 400
error in process sentinel: let*: Problem calling Open AI: (http 400). type: invalid_request_error message: This model’s maximum context length is 8191 tokens, however you requested 61535 tokens (61535 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

I realize the cause is some note that contains a lot of words, but how do I find that note? Is there an SQL query I can run to find it? I also export my notes to logseq, so I can scan that export as well.

Great error, thanks. I can maybe just use the first n tokens, or provide a function for the user to more intelligently select the right input of the right size. Let me know if you have a preference.

Anything will work for me. Eventually I want large notes to be flagged so that I can fix them. I am sold on the idea of creating smaller notes, so splitting them is not an issue. I also realized that once I enabled embeddings, capture-finalize stops working as well. I worked around it by truncating the text to 4,000 whitespace-separated words (a rough stand-in for tokens). See the code below:

(require 'seq)

(defun truncate-string-to-8191-tokens (string)
  "Return STRING cut to its first 4000 whitespace-separated words.
Words only approximate tokens; 4000 words is a rough margin below the
8191-token limit reported by the API."
  (let ((words (split-string string)))
    (mapconcat #'identity (seq-take words 4000) " ")))

(defun ekg-embedding-openai (text)
  "Get an embedding of TEXT from Open AI API."
  (unless ekg-embedding-api-key
    (error "To call Open AI API, provide the ekg-embedding-api-key"))
  (setq text (truncate-string-to-8191-tokens text)) ; truncate the text before calling openai
  (let ((resp (request "https://api.openai.com/v1/embeddings"
                :type "POST"
                :headers `(("Authorization" . ,(format "Bearer %s" ekg-embedding-api-key))
                           ("Content-Type" . "application/json"))
                :data (json-encode `(("input" . ,text) ("model" . "text-embedding-ada-002")))
                :parser 'json-read
                :error (cl-function (lambda (&key error-thrown data &allow-other-keys)
                                      ;; Pass the format string to `error' directly so that
                                      ;; any % in the API message is not re-interpreted.
                                      (error "Problem calling Open AI: %s, type: %s message: %s"
                                             (cdr error-thrown)
                                             (assoc-default 'type (cdar data))
                                             (assoc-default 'message (cdar data)))))
                :sync t)))
    (cdr (assoc 'embedding (aref (cdr (assoc 'data (request-response-data resp))) 0)))))
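Since the goal is eventually to flag large notes rather than silently truncate them, here is a minimal sketch of that idea. It is not part of ekg: `my/estimate-token-count` and `my/oversized-texts` are hypothetical helpers, the input is a plain alist of (ID . TEXT) pairs standing in for real notes, and the 4-characters-per-token ratio is only a rough heuristic, not the tokenizer the API actually uses.

```elisp
(require 'seq)

(defun my/estimate-token-count (text)
  "Roughly estimate the number of tokens in TEXT.
Assumes about 4 characters per token, which is only a heuristic."
  (ceiling (length text) 4))

(defun my/oversized-texts (entries limit)
  "Return the ids in ENTRIES whose text is estimated to exceed LIMIT tokens.
ENTRIES is an alist of (ID . TEXT) pairs, a stand-in for real notes."
  (mapcar #'car
          (seq-filter (lambda (entry)
                        (> (my/estimate-token-count (cdr entry)) limit))
                      entries)))
```

Calling `(my/oversized-texts entries 8191)` on such an alist would list the ids worth splitting by hand; a lower limit leaves some margin for the estimate being off.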

The other fun bit is that I had never tried ChatGPT or OpenAI before, so now I am using ChatGPT to write some Emacs Lisp code, and it is pretty good. I am sold on OpenAI and excited to get ekg working with embeddings to see some magic.

The challenge I am facing now is that, even after truncating, I still see random failures with ekg-embedding-generate-all.

":PROPERTIES: :ID: 98664234-db46-4ef6-8c91-b800980e4600 :END: #+title: emacs meetup https://emacsconf.org/2022/talks/meetups/ Talk by Bhavin Gandhi on how to find a meetup. He also host the [[id:fcf3c72d-d63d-4429-b970-248df1a573b0][emacs apac]] meetup. The emacsconf link has refernce to other useful links like usergroups."
 [2 times]
323
 [2 times]
[error] request--curl-sync: semaphore never called
assoc: Wrong type argument: arrayp, nil

The last two lines are the error. The lines before them are debug statements that I added to the code.

The request--curl-sync: semaphore never called error is random: if I run ekg-embedding-generate-all again, it continues and then fails at some other place.
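Because the failure is intermittent and a rerun gets further, one way to cope in the meantime is to retry the failing call a few times. The sketch below is a generic retry wrapper, not something ekg or request.el provides; `my/call-with-retries` is a hypothetical name, and it assumes the failure surfaces as a Lisp error (as the `assoc: Wrong type argument` line suggests).

```elisp
(defun my/call-with-retries (fn max-attempts)
  "Call FN with no arguments, retrying on error up to MAX-ATTEMPTS times.
Return FN's value on the first success; re-signal the last error otherwise."
  (let ((attempt 1)
        (done nil)
        result)
    (while (not done)
      (condition-case err
          (setq result (funcall fn)
                done t)
        (error
         (if (>= attempt max-attempts)
             ;; Out of attempts: propagate the original error.
             (signal (car err) (cdr err))
           (message "Attempt %d failed: %s; retrying"
                    attempt (error-message-string err))
           (setq attempt (1+ attempt))))))
    result))
```

For example, `(my/call-with-retries (lambda () (ekg-embedding-openai text)) 3)` would try the embedding call up to three times before giving up. This only papers over the symptom; it does not explain why the semaphore is never called.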

 [2 times]
":PROPERTIES: :ID: bc530418-ea93-40af-9042-6fa422d061a3 :END: #+title: gmi sync Need to run gmi sync from /home/jay/Mail/jayrajput #+begin_src bash cd /home/jay/Mail/jayrajput gmi sync #+end_src"
 [2 times]
193
 [2 times]
[error] request--curl-sync: semaphore never called
assoc: Wrong type argument: arrayp, nil

I ran ekg-embedding-generate-all multiple times and finally saw this message:

Generated 35 embeddings

which I assume means that the function completed successfully. I would suggest printing some messages to indicate progress and final success.
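As an illustration of the kind of feedback meant here, the following sketch uses Emacs's built-in progress reporter around a loop. `my/map-with-progress` is a hypothetical helper, not an ekg function, and the per-item function is whatever work needs doing (for instance, generating one embedding).

```elisp
(defun my/map-with-progress (fn items label)
  "Apply FN to each element of ITEMS, reporting progress in the echo area.
LABEL is the progress message prefix.  Return the list of results."
  (let* ((total (length items))
         (reporter (make-progress-reporter label 0 total))
         (count 0)
         (results nil))
    (dolist (item items)
      (push (funcall fn item) results)
      (setq count (1+ count))
      ;; Emacs throttles the echo-area updates by itself.
      (progress-reporter-update reporter count))
    (progress-reporter-done reporter)
    (message "Processed %d item(s)" count)
    (nreverse results)))
```

Something like `(my/map-with-progress #'ekg-embedding-openai texts "Generating embeddings...")` would then show a percentage while running and an explicit summary at the end.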

Please test out the latest from the develop branch, which should fix the initial problem.

The issues with request--curl-sync are mysterious to me, and seem like a separate problem.

I hit another issue, which occurs on both the main and develop branches. It seems to be related to sorting the notes by time: one of the notes has nil for both its creation time and its modified time, and I am not sure why. I am looking into it further. Any guidance is appreciated.

Debugger entered--Lisp error: (wrong-type-argument number-or-marker-p nil)
  <(1681211683 nil)
  (closure (ekg-notes-mode-abbrev-table ekg-notes-mode-syntax-table cl-struct-ekg-inline-tags cl-struct-ekg-note-tags t) (a b) "Used to pass to `sort', which will supply A and B." (< (progn (or (progn (and (memq ... cl-struct-ekg-note-tags) t)) (signal 'wrong-type-argument (list 'ekg-note a))) (aref a 5)) (progn (or (progn (and (memq ... cl-struct-ekg-note-tags) t)) (signal 'wrong-type-argument (list 'ekg-note b))) (aref b 5))))(#s(ekg-note :id "744cc648-ae83-4eb5-998e-b8972a32dbc2" :text ":PROPERTIES:\n:ID:       744cc648-ae83-4eb5-998e-b8..." :mode org-mode :tags ("health") :creation-time 1681211683 :modified-time 1681211683 :properties (:embedding/embedding [-0.0013930731 -0.01287319 0.023329144 -0.0070584714 0.0029618174 0.023497788 -0.0008265333 -0.018030899 -0.048513375 -0.022148633 0.011552142 0.025999347 -0.019815719 0.027404718 -0.008959235 0.014587742 0.039153613 -0.045843173 0.023019962 -0.017946577 0.0011840243 0.011566196 0.0020694076 0.016414722 0.019829772 0.006419028 0.018283864 -0.022373492 -0.0070584714 -0.0035046418 0.037804455 -0.0116926795 -0.017510911 -0.014840708 0.0020395434 -0.0015634743 -0.018101167 0.0069530685 0.049384706 -0.003973684 0.009092744 0.007996556 -0.009015449 0.009858672 -0.019970309 0.018733583 0.022457814 -0.022584299 -0.018747637 0.019885987 ...]) :inlines nil) #s(ekg-note :id 33619865685 :text nil :mode nil :tags nil :creation-time nil :modified-time nil :properties nil :inlines nil))
  sort((#s(ekg-note :id 33619865685 :text nil :mode nil :tags nil :creation-time nil :modified-time nil :properties nil :inlines nil) #s(ekg-note :id "744cc648-ae83-4eb5-998e-b8972a32dbc2" :text ":PROPERTIES:\n:ID:       744cc648-ae83-4eb5-998e-b8..." :mode org-mode :tags ("health") :creation-time 1681211683 :modified-time 1681211683 :properties (:embedding/embedding [-0.0013930731 -0.01287319 0.023329144 -0.0070584714 0.0029618174 0.023497788 -0.0008265333 -0.018030899 -0.048513375 -0.022148633 0.011552142 0.025999347 -0.019815719 0.027404718 -0.008959235 0.014587742 0.039153613 -0.045843173 0.023019962 -0.017946577 0.0011840243 0.011566196 0.0020694076 0.016414722 0.019829772 0.006419028 0.018283864 -0.022373492 -0.0070584714 -0.0035046418 0.037804455 -0.0116926795 -0.017510911 -0.014840708 0.0020395434 -0.0015634743 -0.018101167 0.0069530685 0.049384706 -0.003973684 0.009092744 0.007996556 -0.009015449 0.009858672 -0.019970309 0.018733583 0.022457814 -0.022584299 -0.018747637 0.019885987 ...]) :inlines nil) #s(ekg-note :id "c:/Users/jayra/AppData/Roaming/org/20230326231957-..." :text ":PROPERTIES:\n:ID:       744cc648-ae83-4eb5-998e-b8..." :mode org-mode :tags ("health") :creation-time 1681211690 :modified-time 1685099827 :properties (:embedding/embedding [-0.0011666946 0.003570555 0.012904111 -0.020100903 0.0025943946 0.023928983 -0.0057212403 0.0018740194 -0.030123513 -0.027255934 0.0129806725 0.023873301 -0.020699475 0.017400365 -0.003191227 0.020699475 0.02662952 -0.031515542 0.018681033 -0.022439511 0.00029798126 0.015061757 -0.0161893 0.0053071114 0.010816067 0.010579422 0.020114822 -0.0116373645 -0.012778829 -0.0089229075 0.034438804 -0.030763846 -0.024109947 -0.018207742 0.002048023 0.0075517586 -0.0142683 -0.003469633 0.04045237 0.0038280804 0.0010979631 0.01140768 0.004812941 0.022091504 -0.010273176 0.011365919 0.018152062 -0.014379662 -0.020351468 0.009674603 ...]) :inlines nil)) ekg-sort-by-creation-time)
  (closure ((tags "health")) nil (sort (ekg-get-notes-with-tags tags) #'ekg-sort-by-creation-time))()
  funcall((closure ((tags "health")) nil (sort (ekg-get-notes-with-tags tags) #'ekg-sort-by-creation-time)))
  (mapc #'(lambda (note) (ewoc-enter-last ewoc note)) (funcall notes-func))
  (let ((ewoc (ewoc-create #'ekg-display-note-insert (propertize name 'face 'ekg-notes-mode-title)))) (mapc #'(lambda (note) (ewoc-enter-last ewoc note)) (funcall notes-func)) (ekg-notes-mode) (progn (set (make-local-variable 'ekg-notes-ewoc) ewoc) (set (make-local-variable 'ekg-notes-fetch-notes-function) notes-func) (set (make-local-variable 'ekg-notes-name) name) (set (make-local-variable 'ekg-notes-hl) (make-overlay 1 1)) (set (make-local-variable 'ekg-notes-tags) tags)) (overlay-put ekg-notes-hl 'face hl-line-face) (forward-line 1) (ekg--note-highlight))
  ekg--show-notes(#("tags (all): health" 12 18 (face ekg-tag)) (closure ((tags "health")) nil (sort (ekg-get-notes-with-tags tags) #'ekg-sort-by-creation-time)) ("health"))
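The backtrace shows `<` being called with a nil time, so a first step is to locate the entries that lack one. Here is a small sketch of that check; `my/entries-missing-time` is a hypothetical helper, and the input is a list of plists standing in for notes rather than real `ekg-note` structs.

```elisp
(require 'seq)

(defun my/entries-missing-time (entries)
  "Return the elements of ENTRIES whose :creation-time is nil.
ENTRIES is a list of plists standing in for notes."
  (seq-filter (lambda (entry)
                (null (plist-get entry :creation-time)))
              entries))
```

Applied to data shaped like the notes in the backtrace, this would single out the entry with the integer id 33619865685, whose fields are all nil.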

I debugged this further and found that the problem is how the subject identifiers are stored in the sqlite database. Any old subject created before I migrated to Emacs 29, plus a bunch of other things that were auto-generated, is stored as an integer value, whereas the code expects subjects to be stored as strings. Here are the two kinds of IDs:

; This is new ID and works with string comparison and can be reached via ekg interface
sqlite> SELECT COUNT(*) FROM triples where subject = '33671338550';
12
sqlite> SELECT COUNT(*) FROM triples where subject = 33671338550;
0

; This is old ID and works with integer comparison and cannot be reached via ekg interface
sqlite> SELECT COUNT(*) FROM triples where subject = 33681008573;
12
sqlite> SELECT COUNT(*) FROM triples where subject = '33681008573';
0
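To see how many subjects are affected, the storage type of each value can be inspected with SQLite's `typeof`. The sketch below does this from Emacs 29's built-in sqlite support; `my/subject-type-counts` is a hypothetical helper, and the demonstration runs on a throwaway in-memory table whose `predicate` and `object` columns are assumptions made only to have something to insert (the real schema may differ).

```elisp
(defun my/subject-type-counts (db)
  "Return rows of (TYPE COUNT) for the subject column of table triples in DB."
  (sqlite-select
   db
   "SELECT typeof(subject), COUNT(*) FROM triples
    GROUP BY typeof(subject) ORDER BY 1"))

;; Demonstration on an in-memory database; requires Emacs built with sqlite.
(when (and (fboundp 'sqlite-available-p) (sqlite-available-p))
  (let ((db (sqlite-open)))
    ;; Columns without a declared type keep each value's own storage class.
    (sqlite-execute db "CREATE TABLE triples (subject, predicate, object)")
    (sqlite-execute db "INSERT INTO triples VALUES (33681008573, 'p', 'o')")
    (sqlite-execute db "INSERT INTO triples VALUES ('33671338550', 'p', 'o')")
    (message "%S" (my/subject-type-counts db))
    (sqlite-close db)))
```

Run against a copy of the real database (opened with `sqlite-open` and its file name), a nonzero `integer` row would indicate subjects stored the old way. This only diagnoses the mismatch; it does not repair it.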

@ahyatt Please suggest how to fix this?

I think this is the same as #57, which I believe is a problem in the triples module (ahyatt/triples#2). I'm still looking into it; I think it is some incompatibility between the emacsqlite and built-in sqlite.

Duplicate of #57

Please see that bug for updates.