There are three ways to preprocess the question. What is the difference?

gaopeng-eugene opened this issue 7 years ago

    if nlp == 'nltk':
        ex['question_words'] = word_tokenize(str(s).lower())
    elif nlp == 'mcb':
        ex['question_words'] = tokenize_mcb(s)
    else:
        ex['question_words'] = tokenize(s)

Remi · Answer 1 · Sun Oct 08 2017 10:17:27 GMT+0800 (China Standard Time)

Sorry for the late answer @gaopeng-eugene

mcb (what we are using by default)
(https://github.com/Cadene/vqa.pytorch/blob/master/vqa/datasets/vqa_processed.py#L45)

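import re

def tokenize_mcb(s):
    # Lowercase first, then delete punctuation one pattern at a time.
    # The print calls trace every intermediate string (output shown below).
    t_str = s.lower()
    for i in [r'\?',r'\!',r'\'',r'\"',r'\$',r'\:',r'\@',r'\(',r'\)',r'\,',r'\.',r'\;']:
        t_str = re.sub( i, '', t_str)
        print(t_str)
    # Hyphens and slashes become spaces, so compound words are split apart.
    for i in [r'\-',r'\/']:
        t_str = re.sub( i, ' ', t_str)
        print(t_str)
    # '?' was already removed above, so this re.sub changes nothing;
    # the string is then split on single spaces.
    q_list = re.sub(r'\?','',t_str.lower()).split(' ')
    print(q_list)
    # Drop the empty strings left behind by consecutive spaces.
    q_list = list(filter(lambda x: len(x) > 0, q_list))
    print(q_list)
    return q_list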


tokenize_mcb("Hello I'm a co_ol ro-b-ot, wO0O0T ? !")

> hello i'm a co_ol ro-b-ot, wo0o0t  !
hello i'm a co_ol ro-b-ot, wo0o0t  
hello im a co_ol ro-b-ot, wo0o0t  
hello im a co_ol ro-b-ot, wo0o0t  
hello im a co_ol ro-b-ot, wo0o0t  
hello im a co_ol ro-b-ot, wo0o0t  
hello im a co_ol ro-b-ot, wo0o0t  
hello im a co_ol ro-b-ot, wo0o0t  
hello im a co_ol ro-b-ot, wo0o0t  
hello im a co_ol ro-b-ot wo0o0t  
hello im a co_ol ro-b-ot wo0o0t  
hello im a co_ol ro-b-ot wo0o0t  
hello im a co_ol ro b ot wo0o0t  
hello im a co_ol ro b ot wo0o0t  
['hello', 'im', 'a', 'co_ol', 'ro', 'b', 'ot', 'wo0o0t', '', '']
['hello', 'im', 'a', 'co_ol', 'ro', 'b', 'ot', 'wo0o0t']
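For reference, the same steps can be collapsed into two character classes. This is only a sketch, not the code in the repository: the name `tokenize_mcb_compact` is made up here, and it assumes that removing the characters in a single pass is equivalent to removing them one by one, which holds because deleting a character never creates a new one.

```python
import re

def tokenize_mcb_compact(s):
    """Hypothetical compact variant of tokenize_mcb, without the debug prints."""
    s = s.lower()
    # Delete the same twelve punctuation characters that tokenize_mcb removes.
    s = re.sub(r"[?!'\"$:@(),.;]", "", s)
    # Replace hyphens and slashes with spaces.
    s = re.sub(r"[-/]", " ", s)
    # Split on single spaces and drop empty strings, as the original does.
    return [w for w in s.split(" ") if w]

print(tokenize_mcb_compact("Hello I'm a co_ol ro-b-ot, wO0O0T ? !"))
```

On the example sentence it prints the same final list as above: `['hello', 'im', 'a', 'co_ol', 'ro', 'b', 'ot', 'wo0o0t']`.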

Else (https://github.com/Cadene/vqa.pytorch/blob/master/vqa/datasets/vqa_processed.py#L42)

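It splits the question on any of the characters below, keeps punctuation marks as separate tokens, drops spaces, and does not lowercase:
[-."',:? !$#@~()*&^%;[]/\+<>\n=]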
Easy!

I should have documented that, sorry haha... anyway it returns this:

> ['Hello', 'I', "'", 'm', 'a', 'co_ol', 'ro', '-', 'b', '-', 'ot', ',', 'wO0O0T', '?', '!']
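A minimal sketch that reproduces this output, assuming the tokenizer is a `re.split` on the character set quoted above with a capturing group, followed by a filter on empty and whitespace tokens. The name `tokenize_sketch` and the exact escaping are assumptions made for this example; see the linked source line for the real implementation.

```python
import re

# Character set from the answer above, with regex escapes restored (assumed).
# The capturing group makes re.split return the separators as tokens too.
_SPLIT_RE = re.compile(r"([-.\"',:? !$#@~()*&^%;\[\]/\\+<>\n=])")

def tokenize_sketch(sentence):
    """Hypothetical reconstruction of the default tokenizer."""
    # Keep words and punctuation; drop empty strings, spaces and newlines.
    return [t for t in _SPLIT_RE.split(sentence) if t not in ("", " ", "\n")]

print(tokenize_sketch("Hello I'm a co_ol ro-b-ot, wO0O0T ? !"))
```

Unlike mcb, this variant keeps the original case and keeps each punctuation mark as its own token, so `I'm` becomes `'I', "'", 'm'` rather than `'im'`.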

nltk

Converts the question to lowercase, then tokenizes it with NLTK's word_tokenize, which splits off punctuation other than periods.
source: http://www.nltk.org/api/nltk.tokenize.html