e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets (or classifiers)!

All works well until the end -- create_conversation

livanos opened this issue · comments

commented

Program is great until the very end when I get a bunch of errors. I'm running in assistant mode.

-------------- QUESTIONS REVISED ------------- STATS SO FAR:
Nones: 117
Non-nones: 860
Total: 977
---------------- ONTO EXAMPLES GENERATION-------------------
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 243/243 [00:00<00:00, 2958.79it/s]
0%| | 0/676 [00:00<?, ?it/s]No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
Traceback (most recent call last):
File "/home/mike/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 1973, in create_conversation
conv = await make_multiturn_conversation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 1560, in make_multiturn_conversation
"charname": charname.strip(),
^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'strip'
Had an error, retrying... 'NoneType' object has no attribute 'strip'
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
Traceback (most recent call last):
File "/home/mike/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 1973, in create_conversation
conv = await make_multiturn_conversation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 1560, in make_multiturn_conversation
"charname": charname.strip(),
^^^^^^^^^^^^^^

Thanks for using Augmentoolkit and posting an issue!

As for your error... hmm. That output looks like the regex failed to extract a character name. Could you share the outputs in your multiturn_card_generations folder? This might be an issue with the model you're using not following the format well enough, so the regex fails to extract a character name (leaving a None) and the .strip() call on it then raises this error. (I just reran a sample dataset and it worked decently well.)

So basically,

  1. please share the problem multiturn_card_generations
  2. what model are you using?
  3. out of curiosity -- did this error interrupt the whole pipeline, or did it continue generating after that issue?

Thank you!
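
For readers following along, here is a minimal sketch of the failure mode described above and one way to guard against it. The patterns, the function name `extract_charname`, and the default value are hypothetical and are not taken from Augmentoolkit's code; the point is only that a failed match should yield a usable string rather than None.

```python
import re

# Hypothetical patterns; the real ones in Augmentoolkit may differ.
NAME_PATTERNS = [
    re.compile(r"^Name:\s*(.+)$", re.MULTILINE),
    re.compile(r"^Character:\s*(.+)$", re.MULTILINE),
]


def extract_charname(card_text: str, default: str = "AI Assistant") -> str:
    """Return the first name any pattern finds, or a default instead of None."""
    for pattern in NAME_PATTERNS:
        match = pattern.search(card_text)
        if match:
            return match.group(1).strip()
    # No pattern matched: fall back so callers never call .strip() on None.
    return default


print(extract_charname("Name: Ada Lovelace\nTraits: curious"))
print(extract_charname("The model ignored the format entirely."))
```

Whether a default name is appropriate depends on the pipeline; raising a descriptive error would be an equally reasonable choice.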

commented

Thanks for quick response and amazing work here.

Quick question...I have "ASSISTANT_MODE: True" in the config.yaml file.

The note says:
"# If off, the conversations generated are between a user and an AI assistant. If on, the generated convs are between fictional characters in historical or fictional settings, with randomized personalities (some are nsfw by default, because a lot of model creators make models for that purpose. Change this (or amplify it) in ./augmentoolkit/generation_functions/special_instructions.py, it only requires changes to some strings.)"

Does "ASSISTANT_MODE: True" = "on" or "off" in the comment? I presumed "ASSISTANT_MODE: True" meant no characters (ie "off")...although I could have misread this. Either way...

  1. I don't have any multiturn_card_generations folder...I do have multi_turn_convs folder (it is empty) and I have multi_turn_convs_info (there are 676 items in it). Here is an example (multi_turn_convs_info/info_0_0.json). The original text is a public company financial filing:

[
[
[
"What was the settlement Goldy Metals Toronto paid for causing or permitting an oil spill that polluted a nearby creek in 2014?",
"Goldy Metals Toronto paid approximately CAD $94,000 for this settlement.",
"Commitments and Contingencies Leases Goldy Metals Toronto pays rent to one of the shareholders under a month to month lease agreement. Lease expense under this operating lease was approximately $65,000, $70,000, and 72,000 for the years ended December 31, 2014, 2013 and 2012, respectively. Port Hope leases office space to multiple tenants for aggregate annual rent of approximately $60,000. These leases range from terms of one to three years. Litigation and Related Contingencies The Company is subject to a variety of environmental and pollution control laws and regulations incident to the ordinary course of business. As of December 31, 2013 the Company had approximately CAD $94,000 accrued related to an environmental matter for causing or permitting an oil spill which polluted a nearby creek at the Goldy Metals Toronto location. This CAD $94,000 settlement was paid in 2014. The Company currently expects that the resolution of any other contingencies arising from compliance with these laws and regulations will not materially affect its combined financial position, results of operations or cash flows. The Province of Ontario has filed a civil lawsuit against Goldy Metals Toronto and the owner of the land on which the facility is located claiming damages of CAD $10.5 million plus pre- and post-judgment interest and court costs, for alleged historical and spill- related contamination and property encroachment damage.",
"./raw_txt_input/fenix"
],
[
"What is the total annual rent paid by multiple tenants to Port Hope for its office spaces?",
"The total annual rent for the office spaces of Port Hope is approximately $60,000.",
"Commitments and Contingencies Leases Goldy Metals Toronto pays rent to one of the shareholders under a month to month lease agreement. Lease expense under this operating lease was approximately $65,000, $70,000, and 72,000 for the years ended December 31, 2014, 2013 and 2012, respectively. Port Hope leases office space to multiple tenants for aggregate annual rent of approximately $60,000. These leases range from terms of one to three years. Litigation and Related Contingencies The Company is subject to a variety of environmental and pollution control laws and regulations incident to the ordinary course of business. As of December 31, 2013 the Company had approximately CAD $94,000 accrued related to an environmental matter for causing or permitting an oil spill which polluted a nearby creek at the Goldy Metals Toronto location. This CAD $94,000 settlement was paid in 2014. The Company currently expects that the resolution of any other contingencies arising from compliance with these laws and regulations will not materially affect its combined financial position, results of operations or cash flows. The Province of Ontario has filed a civil lawsuit against Goldy Metals Toronto and the owner of the land on which the facility is located claiming damages of CAD $10.5 million plus pre- and post-judgment interest and court costs, for alleged historical and spill- related contamination and property encroachment damage.",
"./raw_txt_input/fenix"
],
[
"What is the civil lawsuit claim against Goldy Metals Toronto and the owner of the land on which the facility is located?",
"The Province of Ontario has filed a civil lawsuit claiming CAD $10.5 million plus pre- and post-judgment interest and court costs for historical and spill-related contamination and property encroachment damage.",
"Commitments and Contingencies Leases Goldy Metals Toronto pays rent to one of the shareholders under a month to month lease agreement. Lease expense under this operating lease was approximately $65,000, $70,000, and 72,000 for the years ended December 31, 2014, 2013 and 2012, respectively. Port Hope leases office space to multiple tenants for aggregate annual rent of approximately $60,000. These leases range from terms of one to three years. Litigation and Related Contingencies The Company is subject to a variety of environmental and pollution control laws and regulations incident to the ordinary course of business. As of December 31, 2013 the Company had approximately CAD $94,000 accrued related to an environmental matter for causing or permitting an oil spill which polluted a nearby creek at the Goldy Metals Toronto location. This CAD $94,000 settlement was paid in 2014. The Company currently expects that the resolution of any other contingencies arising from compliance with these laws and regulations will not materially affect its combined financial position, results of operations or cash flows. The Province of Ontario has filed a civil lawsuit against Goldy Metals Toronto and the owner of the land on which the facility is located claiming damages of CAD $10.5 million plus pre- and post-judgment interest and court costs, for alleged historical and spill- related contamination and property encroachment damage.",
"./raw_txt_input/fenix"
],
[
"How much did Goldy Metals Toronto pay in rent for each of the years ended December 31, 2014, 2013, and 2012?",
"The rent for Goldy Metals Toronto was $65,000, $70,000, and $72,000 for the years ending December 31, 2014, 2013 and 2012, respectively.",
"Commitments and Contingencies Leases Goldy Metals Toronto pays rent to one of the shareholders under a month to month lease agreement. Lease expense under this operating lease was approximately $65,000, $70,000, and 72,000 for the years ended December 31, 2014, 2013 and 2012, respectively. Port Hope leases office space to multiple tenants for aggregate annual rent of approximately $60,000. These leases range from terms of one to three years. Litigation and Related Contingencies The Company is subject to a variety of environmental and pollution control laws and regulations incident to the ordinary course of business. As of December 31, 2013 the Company had approximately CAD $94,000 accrued related to an environmental matter for causing or permitting an oil spill which polluted a nearby creek at the Goldy Metals Toronto location. This CAD $94,000 settlement was paid in 2014. The Company currently expects that the resolution of any other contingencies arising from compliance with these laws and regulations will not materially affect its combined financial position, results of operations or cash flows. The Province of Ontario has filed a civil lawsuit against Goldy Metals Toronto and the owner of the land on which the facility is located claiming damages of CAD $10.5 million plus pre- and post-judgment interest and court costs, for alleged historical and spill- related contamination and property encroachment damage.",
"./raw_txt_input/fenix"
]
],
"will",
"be",
"replaced",
"980a4915-1806-4ead-aeb0-23675eaa313c"
]

  2. mixtral 8x7b instruct 4bit served via oobabooga locally hosted API. It worked on all the other functions of the program without issue (the program ran for 18 hours before getting to these errors).

  3. this error was repeated 100's of times until it got to the end:

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 676/676 [00:00<00:00, 2808.33it/s]
Conversion complete. Master list written to ./output/master_list.jsonl. Simplified data written to ./output/simplified_data.jsonl.
Conversion complete. The processed master list is written to 'processed_master_list.json'.

These files were empty.

Is there any way to "resume" processing from the previous step (understand if this feature isn't implemented)?

Thanks again!

Sounds like a real code issue where it isn't properly using the assistant mode prompts, I'll get onto fixing that sometime today, thanks for the report!

Also yes Augmentoolkit resumes from the previous step by default.
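
As an illustration of what resuming by default can look like, here is a minimal sketch of a skip-if-exists check. The helper name `generate_if_missing` and the payload are hypothetical; this is not Augmentoolkit's actual implementation, only the general pattern suggested by the "Skipped generating ... as it already exists" log lines later in the thread.

```python
import json
import os
import tempfile


def generate_if_missing(path, make_payload):
    """Write make_payload() to path unless a file is already there."""
    if os.path.exists(path):
        print(f"Skipped generating {path} as it already exists")
        return False
    with open(path, "w", encoding="utf-8") as f:
        json.dump(make_payload(), f)
    return True


out_dir = tempfile.mkdtemp()
target = os.path.join(out_dir, "conv_0.json")
first = generate_if_missing(target, lambda: {"turns": []})   # writes the file
second = generate_if_missing(target, lambda: {"turns": []})  # skips it
print(first, second)
```

Note that a check based only on existence treats an empty or half-written file as finished, so a run interrupted mid-write can leave behind a file that later fails to parse.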

Should be fixed now. Tested it on my end with chat mode and assistant mode. Try it now?

commented

Thanks! That seems fixed... now I'm getting this error repeatedly (for every generation at this step of the program):

Response:

Conversation that answers the provided question (be sure that you do not change the questions or answers themselves; AI will answer the questions, not ask them; the questions and answers provided should be copied word for word, and surrounded by compelling conversation):

AI Assistant: Hello! I'm here to answer any questions you might have. Just let me know what you'd like to know.
User: I'm curious about the pro forma financial results of the independent Founding Companies. Why might these not be indicative of the combined entity's future financial or operating results?
AI Assistant: That's a valid point. The pro forma financial results of the independent Founding Companies are based on their operations during different periods when they were not under common control or management. As such, these results may not accurately predict the financial performance of the combined entity. Is there anything else you'd like to know about the combined entity?
User: What could happen to the trading price of the common stock if the newly combined enterprise is not able to manage the operations of the Founding Companies successfully or report results in a timely manner?
AI Assistant: If the newly combined enterprise is not able to manage the operations of the Founding Companies successfully or report results in a timely manner, the trading price of the common stock could decline, leading to a potential loss of investment for the shareholders. Do you have any other concerns related to the common stock trading price?
User: What is the potential risk for investors if the recently assembled management team is not successful in managing the combined entity?
AI Assistant: If the recently assembled management team is not successful in managing the combined entity, the investors could potentially lose all or part of their investment. Is there anything else you'd like to know about the management team?
Traceback (most recent call last):
File "/home/mike/augmentoolkit/augmentoolkit/generation_functions/generation_step_class.py", line 100, in generate
filtered_response = re.search(self.regex, response).group(1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'group'
Traceback (most recent call last):
File "/home/mike/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 1978, in create_conversation
conv = await make_multiturn_conversation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 1559, in make_multiturn_conversation
conv, conv_output = await multi_turn_conv_generator.generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/augmentoolkit/augmentoolkit/generation_functions/generation_step_class.py", line 115, in generate
raise Exception("Generation step failed -- too many retries!")
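
The first traceback above comes from calling .group(1) on the result of re.search when nothing matched. A minimal sketch of the same retry idea with the match checked first is shown below. The function `generate_with_retries`, the fake model, and the pattern are hypothetical stand-ins, not the code in generation_step_class.py.

```python
import re


def generate_with_retries(call_model, pattern, max_retries=3):
    """Call call_model() until its output matches pattern, up to max_retries times."""
    compiled = re.compile(pattern, re.DOTALL)
    for attempt in range(1, max_retries + 1):
        response = call_model()
        match = compiled.search(response)
        if match is not None:
            return match.group(1).strip()
        # re.search returned None, so there is no group to read; try again.
        print(f"Attempt {attempt}: response did not match the expected format")
    raise RuntimeError("Generation step failed -- too many retries!")


# Fake model that gets the format right on the second try.
replies = iter(["no header here", "Conversation:\nAI Assistant: Hello!"])
result = generate_with_retries(lambda: next(replies), r"Conversation:\n(.+)")
print(result)
```

If every response fails the same way, as in the log above, the pattern and the prompt probably disagree about the output format, and retrying alone will not help.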

Alright it should finally be fixed now. Sorry for making you retry so many times 😔

commented

All good. Appreciate your help. Resumed and started processing again...

  1. I'm getting "no name found" after each call (not sure if this actually impacts anything)
  2. I get the below error...maybe because I'm resuming generation now a couple of times and underlying data is bad or something else?

Skipped generating ./output/multi_turn_convs/conv_638.json as it already exists
No name found, retrying with different regex
Skipped generating ./output/multi_turn_convs/conv_163.json as it already exists
No name found, retrying with different regex
Skipped generating ./output/multi_turn_convs/conv_435.json as it already exists
No name found, retrying with different regex
25%|██████████████████████████████████████████▉ | 174/702 [00:00<00:00, 5269.04it/s]
Traceback (most recent call last):
File "/home/mike/augmentoolkit/processing.py", line 440, in <module>
asyncio.run(main())
File "/home/mike/miniconda3/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/mike/augmentoolkit/processing.py", line 403, in main
await future
File "/home/mike/miniconda3/lib/python3.11/asyncio/tasks.py", line 605, in _wait_for_one
return f.result() # May raise f.exception().
^^^^^^^^^^
File "/home/mike/augmentoolkit/processing.py", line 105, in run_task_with_limit
return await task
^^^^^^^^^^
File "/home/mike/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 2009, in create_conversation
data = json.load(f)
^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/json/__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
sys:1: RuntimeWarning: coroutine 'create_conversation' was never awaited

  1. No name found should be fine. It's a quirk of one of the settings you have on. I should probably remove that print...
  2. the underlying data should be fine as it's not really being changed, just added to. It's strange that you're running into another error, I seemingly can't reproduce this. Also by your logs it looks like it partially generated some conversations?

Either way I've added a quick try/except on file loading to handle the case where that fails for some reason. It's possible that something went wrong at some point and proper JSON wasn't written, which would cause errors when reading it back, but... I can't really tell without more information. Either way it should no longer error in a way that interrupts running. Let me know if it works now, and thanks for the continued feedback!
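
A minimal sketch of that kind of guard, assuming the failing files are empty or otherwise not valid JSON (which is what "Expecting value: line 1 column 1 (char 0)" usually indicates). The helper name `load_json_or_none` is hypothetical and this is not the actual patch.

```python
import json
import os
import tempfile


def load_json_or_none(path):
    """Return parsed JSON, or None if the file is empty or not valid JSON."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except (json.JSONDecodeError, OSError) as err:
        print(f"Could not read {path}: {err}")
        return None


tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, "good.json")
empty = os.path.join(tmp_dir, "empty.json")
with open(good, "w", encoding="utf-8") as f:
    json.dump({"ok": True}, f)
open(empty, "w").close()  # zero-byte file, like an interrupted write

loaded = [load_json_or_none(p) for p in (good, empty)]
print(loaded)
```

Callers then need to skip the None entries rather than treat them as data.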

commented

Thanks. So I restarted it and it got going. I got to the end of one step and this happened:

Output written to ./output/multiturn_conversation_generations/79e4f28b-2c5c-42ef-8fcd-6c83cb1d6da7.txt
No name found, retrying with different regex
Skipped generating ./output/multi_turn_convs/conv_661.json as it already exists
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 702/702 [25:47<00:00, 2.20s/it]No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
No name found, retrying with different regex
Traceback (most recent call last):
File "/home/mike/augmentoolkit/processing.py", line 440, in <module>
asyncio.run(main())
File "/home/mike/miniconda3/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/mike/augmentoolkit/processing.py", line 412, in main
control_flow_functions.convert_directory_to_list(
File "/home/mike/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 2026, in convert_directory_to_list
data = json.load(file)
^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/json/__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mike/miniconda3/lib/python3.11/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

That is weird. I have some time over the weekend to drill down into it. Worst case, you can find us on Discord and we can solve it faster there; the channel is in the README.

Yeah I bet that something's not being escaped that should be. This is exactly why I'm switching to YAML for the prompts and inputs of all future augmentoolkit things I make -- that format's much more resilient.

Might backport it.
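
To illustrate the general escaping problem mentioned above (the thread does not say which string was actually unescaped, so this is only a generic sketch with a made-up `answer` value): text containing quotes or newlines breaks hand-assembled JSON, while json.dumps escapes it correctly.

```python
import json

answer = 'The filing says "pre- and post-judgment interest"\nand court costs.'

# Hand-built JSON: the quotes and newline inside `answer` are not escaped.
hand_built = '{"answer": "' + answer + '"}'
try:
    json.loads(hand_built)
    hand_built_ok = True
except json.JSONDecodeError:
    hand_built_ok = False

# json.dumps escapes quotes and newlines, so the round trip is lossless.
round_trip = json.loads(json.dumps({"answer": answer}))

print(hand_built_ok, round_trip["answer"] == answer)
```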

As for this specific issue, it looks like it might be a problem with the data being fed to the pipeline. Strangely enough it looks like some of that might be None. Is the data confidential? It might be easier for me to repro if you could send over your convs_info folder (if not that's totally OK and I understand).

Contacts are in the README.

commented

Public data. Will send.

Received and fixed. Sent your finished data over. 700 convs is really cool and really large! Will close this issue once you acknowledge that your requirements are met and the issue is solved on your end.

Thanks again!

commented

All set thank you!