QUT-Digital-Observatory / coordination-network-toolkit

A small command line tool and set of functions for studying coordination networks in Twitter and other social media data.

ValueError: not enough values to unpack (expected 8, got 6)

bkrdmr opened this issue

Hello, I have been experimenting with data from different social media platforms, following the guidelines in this repo. Lately I've tried processing YouTube comments, so the reply_id and urls columns are empty. I am seeing the following ValueError in the preprocessing phase. Do you have any suggestions to overcome this?

ValueError                                Traceback (most recent call last)
<ipython-input-16-e8dc9a9d85db> in <module>()
      1 db = "comments.db"
      2 file = "comments.csv"
----> 3 coord_net_tk.preprocess.preprocess_csv_files(db, [file])

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in preprocess_csv_files(db_path, input_filenames)
     20             # Skip header
     21             next(reader)
---> 22             preprocess_data(db_path, reader)
     23 
     24         print(f"Done preprocessing {message_file} into {db_path}")

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in preprocess_data(db_path, messages)
     72         )
     73 
---> 74         for row in processed:
     75             db.execute(
     76                 "insert or ignore into edge values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in <genexpr>(.0)
     69                 urls.split(" ") if urls else [],
     70             )
---> 71             for message_id, user_id, username, repost_id, reply_id, message, timestamp, urls in messages
     72         )
     73 

ValueError: not enough values to unpack (expected 8, got 6)

> so reply_id and urls columns are empty

Are the columns present, but empty (i.e. the comma delimiters are there, but the values are empty strings)? The error looks like the columns aren't present in the file, or have been messed up somehow by the import code.
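
To illustrate the difference (a sketch with hypothetical rows; the column order is taken from the unpack in the traceback above):

import csv, io

# Eight fields, with reply_id and urls present but empty - this unpacks fine
present_but_empty = 'c1,u1,alice,,,"nice video",1600000000,\n'
# Fields dropped entirely - the reader only sees six values
missing_fields = 'c1,u1,alice,,"nice video",1600000000\n'

print(len(next(csv.reader(io.StringIO(present_but_empty)))))  # 8
print(len(next(csv.reader(io.StringIO(missing_fields)))))     # 6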

Out of interest - are you converting JSON data collected via the YouTube API into a CSV and using that? If you can share the code doing the JSON -> CSV conversion (just a gist or something) I might be able to add native support for the format, similar to the Twitter format.

Columns are present; I've tried with dummy values but got the same result.
The data is stored in my lab's regular databases. I extracted it as CSV files, re-ordered the columns in pandas per your guidelines, and saved the result to a new CSV for preprocessing.

import pandas as pd

# Re-order to the toolkit's expected columns
df = df[['comment_id', 'commenter_id', 'commenter_name', 'video_id',
         'reply_to', 'comment_displayed', 'published_date']]
df['urls'] = ""       # appended as the eighth (urls) column
df['reply_to'] = ""   # no reply data available
# Convert the timestamp to Unix epoch seconds
df['published_date'] = pd.to_datetime(df['published_date'])
df['published_date'] = (df['published_date'] - pd.Timestamp("1970-01-01 00:00:00+00:00")) // pd.Timedelta('1s')
df.to_csv('comments.csv', index=False, encoding='utf-8')
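
As a sanity check on the output (a sketch of mine; it reads the file back the way the toolkit's preprocess_csv_files does per the traceback - csv.reader with the header skipped - and flags any row that doesn't have eight fields):

import csv

with open('comments.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip header, as the toolkit does
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 8:
            print(f"row at line {line_no} has {len(row)} fields: {row}")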

Thanks for confirming - I'll try and take a look at what's going on today or tomorrow.

Thank you! Will check again.

I had a quick look into this - I wonder if the problem is that the CSV file is being misinterpreted within the toolkit?

I think two things to try are:

  1. Try working with just a small sample of rows - if that works, it's probably a representation problem with specific rows: df.head().to_csv('comments.csv', index=False, encoding='utf-8')
  2. Quote all fields in the CSV output: df.to_csv('comments.csv', index=False, encoding='utf-8', quoting=1) - both steps appear together in the sketch below.
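
Putting both together (a sketch; the one-row frame is a hypothetical stand-in for your data, and its message field deliberately contains a comma and a newline, which is the kind of content quoting protects):

import csv
import pandas as pd

# Hypothetical stand-in with the toolkit's eight columns; the message
# field deliberately contains a comma and a newline
df = pd.DataFrame([{
    'comment_id': 'c1', 'commenter_id': 'u1', 'commenter_name': 'alice',
    'video_id': 'v1', 'reply_to': '', 'comment_displayed': 'nice, video\nthanks',
    'published_date': 1600000000, 'urls': '',
}])

# 1. Write just a small sample to isolate problem rows
df.head().to_csv('comments_sample.csv', index=False, encoding='utf-8')

# 2. Quote every field on output; quoting=1 is csv.QUOTE_ALL
df.to_csv('comments.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)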

If neither of those helps, I might ask you to share an example file with me so I can debug it for you.

Alternatively, since you're already writing Python, you can cut out the CSV middleman and work directly from the dataframe by using the toolkit as a Python library. These functions are safe to use and aren't expected to change; I just haven't had time to write documentation beyond the snippet in the readme.

from coordination_network_toolkit.preprocess import preprocess_data

# Create a generator of pandas rows, since iterrows returns an (index, row) pair.
# The dataframe's columns must already be in the toolkit's order:
# message_id, user_id, username, repost_id, reply_id, message, timestamp, urls
rows = (row for (i, row) in df.iterrows())

preprocess_data('youtube_comments.db', rows)

Yes, it is now working - using preprocess_data() solved the issue. I guess something was wrong in the CSV.
I wonder why you chose directed graphs instead of undirected graphs for co-retweet behavior, though.

Thank you for the prompt response and quick fix. This is definitely helpful.