QUT-Digital-Observatory / coordination-network-toolkit

A small command line tool and set of functions for studying coordination networks in Twitter and other social media data.

ValueError: not enough values to unpack (expected 8, got 6)

bkrdmr opened this issue

Hello, I have been experimenting with data from different social media platforms, following the guidelines in this repo. Lately I've tried processing YouTube comments, so the reply_id and urls columns are empty. I am seeing the following ValueError in the preprocessing phase. Do you have any suggestions to overcome this?

ValueError                                Traceback (most recent call last)
<ipython-input-16-e8dc9a9d85db> in <module>()
      1 db = "comments.db"
      2 file = "comments.csv"
----> 3 coord_net_tk.preprocess.preprocess_csv_files(db, [file])

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in preprocess_csv_files(db_path, input_filenames)
     20             # Skip header
     21             next(reader)
---> 22             preprocess_data(db_path, reader)
     23 
     24         print(f"Done preprocessing {message_file} into {db_path}")

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in preprocess_data(db_path, messages)
     72         )
     73 
---> 74         for row in processed:
     75             db.execute(
     76                 "insert or ignore into edge values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",

/home/bkrdmr/anaconda3/envs/co/lib/python3.6/site-packages/coordination_network_toolkit/preprocess.py in <genexpr>(.0)
     69                 urls.split(" ") if urls else [],
     70             )
---> 71             for message_id, user_id, username, repost_id, reply_id, message, timestamp, urls in messages
     72         )
     73 

ValueError: not enough values to unpack (expected 8, got 6)

> so reply_id and urls columns are empty

Are the columns present, but empty (i.e. the comma delimiters are there, but the values are empty strings)? The error looks like the columns aren't present in the file, or have been messed up somehow by the import code.
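
To illustrate the difference (a sketch with hypothetical rows; the column order is taken from the unpack in the traceback above):

import csv, io

# Eight fields, with reply_id and urls present but empty - this unpacks fine
present_but_empty = 'c1,u1,alice,,,"nice video",1600000000,\n'
# Fields dropped entirely - the reader only sees six values
missing_fields = 'c1,u1,alice,,"nice video",1600000000\n'

print(len(next(csv.reader(io.StringIO(present_but_empty)))))  # 8
print(len(next(csv.reader(io.StringIO(missing_fields)))))     # 6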

Out of interest - are you converting JSON data collected via the YouTube API into a CSV and using that? If you can share the code doing the JSON -> CSV conversion (just a gist or something) I might be able to add native support for the format, similar to the Twitter format.

Columns are present; I've tried with dummy values but got the same result.
The data is stored in my lab's regular databases. I extracted it as CSV files, re-ordered the columns in pandas per your guidelines, and saved the result to a new CSV for preprocessing.

import pandas as pd

# Re-order to the toolkit's expected columns
df = df[['comment_id', 'commenter_id', 'commenter_name', 'video_id',
         'reply_to', 'comment_displayed', 'published_date']]
df['urls'] = ""       # appended as the eighth (urls) column
df['reply_to'] = ""   # no reply data available
# Convert the timestamp to Unix epoch seconds
df['published_date'] = pd.to_datetime(df['published_date'])
df['published_date'] = (df['published_date'] - pd.Timestamp("1970-01-01 00:00:00+00:00")) // pd.Timedelta('1s')
df.to_csv('comments.csv', index=False, encoding='utf-8')
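
As a sanity check on the output (a sketch of mine; it reads the file back the way the toolkit's preprocess_csv_files does per the traceback - csv.reader with the header skipped - and flags any row that doesn't have eight fields):

import csv

with open('comments.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip header, as the toolkit does
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 8:
            print(f"row at line {line_no} has {len(row)} fields: {row}")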

Thanks for confirming - I'll try and take a look at what's going on today or tomorrow.

Thank you! Will check again.

I had a quick look into this - I wonder if the problem is that the CSV file is being misinterpreted within the toolkit?

I think two things to try are:

  1. Try working with just a small sample of rows - if that works, it's probably a representation problem with specific rows: df.head().to_csv('comments.csv', index=False, encoding='utf-8')
  2. Quote all fields in the CSV output: df.to_csv('comments.csv', index=False, encoding='utf-8', quoting=1) - both steps appear together in the sketch below.
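
Putting both together (a sketch; the one-row frame is a hypothetical stand-in for your data, and its message field deliberately contains a comma and a newline, which is the kind of content quoting protects):

import csv
import pandas as pd

# Hypothetical stand-in with the toolkit's eight columns; the message
# field deliberately contains a comma and a newline
df = pd.DataFrame([{
    'comment_id': 'c1', 'commenter_id': 'u1', 'commenter_name': 'alice',
    'video_id': 'v1', 'reply_to': '', 'comment_displayed': 'nice, video\nthanks',
    'published_date': 1600000000, 'urls': '',
}])

# 1. Write just a small sample to isolate problem rows
df.head().to_csv('comments_sample.csv', index=False, encoding='utf-8')

# 2. Quote every field on output; quoting=1 is csv.QUOTE_ALL
df.to_csv('comments.csv', index=False, encoding='utf-8', quoting=csv.QUOTE_ALL)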

If neither of those helps, I might ask you to share an example file with me so I can debug it for you.

Alternatively, since you're already writing Python, you can cut out the CSV middleman and work directly from the dataframe by using the toolkit as a Python library. These functions are safe to use and aren't expected to change; I just haven't had time to write documentation beyond the snippet in the readme.

from coordination_network_toolkit.preprocess import preprocess_data

# Create a generator of pandas rows, since iterrows returns an (index, row) pair.
# The dataframe's columns must already be in the toolkit's order:
# message_id, user_id, username, repost_id, reply_id, message, timestamp, urls
rows = (row for (i, row) in df.iterrows())

preprocess_data('youtube_comments.db', rows)

Yes, it is now working - using preprocess_data() solved the issue. I guess something was wrong in the CSV.
I wonder why you chose directed graphs instead of undirected graphs for co-retweet behavior, though.

Thank you for the prompt response and quick fix. This is definitely helpful.