oliver006 / elasticsearch-gmail

Index your Gmail Inbox with Elasticsearch

info on how to run from cli

spaceman10 opened this issue · comments

I tried moving through this and it appears to be failing on the import... I am running a vagrant-vm and get everything installed just fine.
I don't know how to invoke the script properly...
I've tried so many ways. This seems like it would give results... though it does nothing much.
python index_emails.py test.mbox
Any help or tips are appreciated! This has been a fun project so far. Stumbling at the end. Thanks!

What's the error you're seeing?

it was just a blank terminal. (nothing returned)

I ended up running it like this:

python2.7 index_emails.py -vvvvvv

and it spit out options.

then I ran this... and it worked..

python2.7 index_emails.py --infile=test.mbox

It does not seem to want to spit out options if you give it incomplete directions (no options or blank cli)

is this the expected output?

root@precise32:/vagrant# python2.7 index_emails.py --infile=test.mbox
Errors during upload: False - upload took: 1486ms, total messages uploaded: 500
Errors during upload: False - upload took: 614ms, total messages uploaded: 1000
Errors during upload: False - upload took: 621ms, total messages uploaded: 1500
Errors during upload: False - upload took: 353ms, total messages uploaded: 2000
Errors during upload: False - upload took: 369ms, total messages uploaded: 2500
Errors during upload: False - upload took: 373ms, total messages uploaded: 3000
Errors during upload: False - upload took: 295ms, total messages uploaded: 3500
Errors during upload: False - upload took: 289ms, total messages uploaded: 4000
Errors during upload: False - upload took: 377ms, total messages uploaded: 4500
Errors during upload: False - upload took: 452ms, total messages uploaded: 5000
Errors during upload: False - upload took: 297ms, total messages uploaded: 5500
Errors during upload: False - upload took: 500ms, total messages uploaded: 6000
Errors during upload: False - upload took: 273ms, total messages uploaded: 6500
Errors during upload: False - upload took: 435ms, total messages uploaded: 7000
Errors during upload: False - upload took:

above entry ended with the following

Errors during upload: False - upload took: 365ms, total messages uploaded: 22000
Errors during upload: False - upload took: 306ms, total messages uploaded: 22500
Errors during upload: False - upload took: 242ms, total messages uploaded: 23000
Errors during upload: False - upload took: 236ms, total messages uploaded: 23500
Errors during upload: False - upload took: 202ms, total messages uploaded: 24000
Traceback (most recent call last):
  File "index_emails.py", line 179, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 418, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 399, in run
    result = func()
  File "index_emails.py", line 136, in load_from_file
    item = convert_msg_to_json(msg)
  File "index_emails.py", line 102, in convert_msg_to_json
    tz = tt[9] or 0
TypeError: 'NoneType' object has no attribute '__getitem__'

A bunch of stuff ended up in the Elasticsearch instance, so I'm unsure whether this is all expected.

It added the first ~ 24k emails to the index but then failed with an error.
I updated the src file to do a bit more robust error checking during tz parsing, can you try again?

I also changed it so that it outputs the --help info blurb if no parameters are passed.
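For anyone wondering what "print the help blurb when nothing is passed" can look like, here is a minimal sketch. It uses the standard-library argparse module as a stand-in, which is an assumption on my part; the actual script may parse its options differently. Only the --infile flag comes from this thread, and the function names build_parser and parse_or_show_help are hypothetical.

```python
import argparse


def build_parser():
    # --infile mirrors the flag used in this thread; the description is illustrative.
    parser = argparse.ArgumentParser(description="Index an mbox file (sketch)")
    parser.add_argument("--infile", help="path to the mbox file to index")
    return parser


def parse_or_show_help(argv):
    """Return the parsed options, or print the help text and return None
    when no --infile was supplied."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.infile:
        # Nothing useful to do without an input file, so show the options
        # instead of exiting silently.
        parser.print_help()
        return None
    return args
```

Calling parse_or_show_help(sys.argv[1:]) at startup and exiting when it returns None would avoid the blank-terminal behaviour described earlier.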

Thanks. just did a git pull and running now. I think it takes about 10 minutes on my setup. will report back in a few.

latest run

Upload: OK - upload took: 491ms, total messages uploaded: 25000
Upload: OK - upload took: 435ms, total messages uploaded: 25500
Traceback (most recent call last):
  File "index_emails.py", line 178, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 418, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 399, in run
    result = func()
  File "index_emails.py", line 135, in load_from_file
    item = convert_msg_to_json(msg)
  File "index_emails.py", line 103, in convert_msg_to_json
    result['date_ts'] = int(calendar.timegm(tt) - tz) * 1000
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

running it with -vvvv

redoing... scrollback buffer was ... too small. capturing logfile this go around.

did an strace on the process while it is running. seeing this every so often

read(9, "W0OHn+\r\nKAywzHpJtqzQypD4NLRlcJ3D"..., 4096) = 4096
read(9, "kw5DxMEDejhXUjIpjaa1zfhHPe9TIGKc"..., 4096) = 4096
_llseek(9, 3475939328, [3475939328], SEEK_SET) = 0
read(9, "uu4/\r\npB41IpTitShkGejrkw0/DCr04q"..., 4096) = 4096
read(9, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
brk(0xac8c000) = 0xac8c000
mmap2(NULL, 8269824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb672b000
brk(0xa248000) = 0xa248000
mmap2(NULL, 8269824, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb5f48000
munmap(0xb672b000, 8269824) = 0
_llseek(9, 3475947520, [3475947520], SEEK_SET) = 0
_llseek(9, 3475947520, [3475947520], SEEK_SET) = 0
read(9, "AAAAAAAAAAAAAAAA

i suspect it is the upload taking place

only difference with -vvvv is this bit at the end.

Upload: OK - upload took: 314ms, total messages uploaded: 21500
Upload: OK - upload took: 331ms, total messages uploaded: 22000
Upload: OK - upload took: 290ms, total messages uploaded: 22500
Upload: OK - upload took: 283ms, total messages uploaded: 23000
Upload: OK - upload took: 227ms, total messages uploaded: 23500
Upload: OK - upload took: 255ms, total messages uploaded: 24000
Upload: OK - upload took: 241ms, total messages uploaded: 24500
Upload: OK - upload took: 271ms, total messages uploaded: 25000
Upload: OK - upload took: 417ms, total messages uploaded: 25500
Traceback (most recent call last):
  File "index_emails.py", line 178, in <module>
    IOLoop.instance().run_sync(load_from_file)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 418, in run_sync
    return future_cell[0].result()
  File "/usr/local/lib/python2.7/dist-packages/tornado/concurrent.py", line 109, in result
    raise_exc_info(self._exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 399, in run
    result = func()
  File "index_emails.py", line 135, in load_from_file
    item = convert_msg_to_json(msg)
  File "index_emails.py", line 103, in convert_msg_to_json
    result['date_ts'] = int(calendar.timegm(tt) - tz) * 1000
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
# clear __builtin__._
# clear sys.path
# clear sys.argv
# clear sys.ps1
# clear sys.ps2
# clear sys.exitfunc

latest run command looks like this
python index_emails.py --infile=../../test.mbox --log_file_prefix=./real.log --logging=debug

same errors with above command. real.log is empty.
let me know if you have some ideas.

Interesting. I pushed up another version, this time catching all errors in tz parsing with a try/except - let me know if that fixes the issue.

Cool, this worked. I'm curious which change actually fixed it. At first glance you made two distinct changes, both around return values: one seems to sanitize the bool vs. int handling (the removal of return False), and the other returns None if the value is out of spec and not an int.

I could run again reverting one or the other to see if it fails on one portion or the other if you want.

I'm interested to figure out how to validate the data now that it is inside Elasticsearch.
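One rough way to sanity-check the import is to compare the number of messages in the mbox file with the number of documents in the index. The sketch below assumes Elasticsearch is reachable at http://localhost:9200 and that the index is named "gmail"; both are guesses, so adjust them to your setup. The helper names are hypothetical.

```python
import json
import mailbox
from urllib.request import urlopen


def count_mbox_messages(path):
    """Count the messages stored in a local mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def count_indexed_documents(es_url="http://localhost:9200", index="gmail"):
    """Ask Elasticsearch's _count endpoint how many documents the index holds.

    es_url and index are assumptions, not values confirmed in this thread.
    """
    with urlopen("%s/%s/_count" % (es_url, index)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["count"]
```

If messages that fail date parsing are skipped during import, the index count will come out slightly lower than the mbox count, so a small gap is not necessarily a problem.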

Great work, and thanks much for the help !

You could add a print msg between line 104 and 105 and run the import again. That would output the message that causes the exception. I suspect it's an archived GChat transcript, I've seen them cause trouble in the past due to not having a timestamp.
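An alternative to editing the importer is to scan the mbox separately and list the Date headers that fail to parse. This is only a sketch using the standard-library mailbox and email.utils modules; find_unparsable_dates is a hypothetical helper, not part of the project.

```python
import email.utils
import mailbox


def find_unparsable_dates(path):
    """Return (position, raw Date header) for every message whose Date header
    is missing, rejected by parsedate_tz, or parsed without a UTC offset."""
    problems = []
    box = mailbox.mbox(path, create=False)
    try:
        for position, msg in enumerate(box):
            raw = msg["Date"]
            parsed = None
            if raw is not None:
                try:
                    parsed = email.utils.parsedate_tz(str(raw))
                except (TypeError, ValueError, IndexError):
                    parsed = None
            # A None offset in the last slot is the case that broke the
            # subtraction in the traceback above.
            if parsed is None or parsed[9] is None:
                problems.append((position, raw))
    finally:
        box.close()
    return problems
```

Printing the returned list shows which messages to inspect, for example archived chat transcripts without a timestamp.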

It might be weird or foreign letters.
Date: ������, 26 ��� 2008 08:21:36 -0900

The other failures look like this
Date: Tue, 24 Apr 2007 01:01:10 GMT-07:00

The second one looks alright, not sure why it'd fail on that, weird.

GMT string data / formatting issues?

I edited my local copy to print tt, separating each message with a 'mushroom' marker:

if "date" in result:
    try:
        tt = email.utils.parsedate_tz(result['date'])
        tz = tt[9] if len(tt) == 10 else 0
        result['date_ts'] = int(calendar.timegm(tt) - tz) * 1000
    except:
        print "\n\n\nmushroom \n\n\n"
        #print msg
        print tt
        #print tz
        return None

This is the output:

Upload: OK - upload took: 265ms, total messages uploaded: 21000
Upload: OK - upload took: 420ms, total messages uploaded: 21500
Upload: OK - upload took: 326ms, total messages uploaded: 22000
Upload: OK - upload took: 329ms, total messages uploaded: 22500
Upload: OK - upload took: 330ms, total messages uploaded: 23000
Upload: OK - upload took: 229ms, total messages uploaded: 23500
Upload: OK - upload took: 341ms, total messages uploaded: 24000

mushroom

None
Upload: OK - upload took: 173ms, total messages uploaded: 24500
Upload: OK - upload took: 310ms, total messages uploaded: 25000
Upload: OK - upload took: 283ms, total messages uploaded: 25500

mushroom

(2007, 4, 24, 1, 1, 10, 0, 1, -1, None)

mushroom

(2007, 4, 25, 0, 58, 28, 0, 1, -1, None)

mushroom

(2007, 4, 28, 1, 0, 21, 0, 1, -1, None)
Upload: OK - upload took: 349ms, total messages uploaded: 26000
Upload: OK - upload took: 287ms, total messages uploaded: 26500
Upload: OK - upload took: 200ms, total messages uploaded: 27000
Upload: OK - upload took: 74ms, total messages uploaded: 27222
Done - total count 27245

Thanks for the detailed log, that helps.
We can handle this case (2007, 4, 25, 0, 58, 28, 0, 1, -1, None) - I added a bit of code for that.
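For readers following along, here is a sketch of how a tuple like that could be handled. It is not the code that was pushed to the repository; it assumes that a missing offset (None in the last slot) should be treated as UTC, and the function names are hypothetical.

```python
import calendar
import email.utils


def struct_to_epoch_ms(tt):
    """Convert a parsedate_tz-style 10-tuple to epoch milliseconds.

    Returns None when there is no usable tuple.
    """
    if tt is None or len(tt) < 10:
        return None
    # Assumption: a missing offset (None) is treated as UTC.
    tz_offset = tt[9] or 0
    return int(calendar.timegm(tt[:9]) - tz_offset) * 1000


def date_header_to_epoch_ms(header):
    """Parse a raw Date header; return None instead of raising on bad input."""
    try:
        return struct_to_epoch_ms(email.utils.parsedate_tz(header))
    except (TypeError, ValueError, IndexError, OverflowError):
        return None
```

Note that for a header such as "GMT-07:00" this fallback discards the offset, so the stored timestamp may be off by a few hours. That seems preferable to dropping the message, but it is a trade-off.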

clean run. solid. Thanks!!!

Nice, guess we can close this.