kudkudak / common-sense-prediction

Common sense prediction using DL.


Code script to add distances, run for wiki and conceptnet, analyze

kudkudak opened this issue

We have two test sets: the Wiki test set has 1.7M triplets, and the ConceptNet one has 3k. For each test triplet we need to compute the distance (based on a given embedding) to the closest example in the train set (100k examples).

This computation is feasible if we use properties of the metric. Worst case, we can subsample the wiki corpus; that shouldn't be an issue. Start from the script https://github.com/kudkudak/common-sense-prediction/blob/master/scripts/evaluate/augment_with_closest.py. The new script should take similar arguments.

Distance function: +INF if the relations differ, otherwise max(||head_a - head_b||^2, ||tail_a - tail_b||^2). Note that this can be sped up significantly by the following tricks:

  • precomputing the first and second term of the max for each unique head and tail. There are 25k unique heads and 50k unique tails, so this should speed up the computation by roughly 3x (see the sketch after this list).
  • dividing the training set by relation (because the distance is +INF whenever the relations differ)
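For illustration, a minimal sketch of the precomputation, assuming the unique head and tail phrases have already been embedded into matrices (all variable names here are hypothetical):

import numpy as np
from scipy.spatial.distance import cdist

# train_head_emb: (n, d) embeddings of the unique train heads;
# test_head_emb: (m, d) embeddings of the unique test heads (likewise for tails).
head_dist = cdist(train_head_emb, test_head_emb, metric="sqeuclidean")
tail_dist = cdist(train_tail_emb, test_tail_emb, metric="sqeuclidean")

With these matrices in place, the distance between a train triplet and a test triplet sharing the same relation reduces to max(head_dist[i_h, j_h], tail_dist[i_t, j_t]) via index lookups; pairs with different relations stay at +INF.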

Suggestions:

  • Parallelize only if needed (beware of memory problems when parallelizing with GloVe loaded). Probably the easiest parallelization is to use GNU parallel and add to your script some sort of range parameters indicating which rows to compute; a sketch follows below.
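If sharding does become necessary, a hedged sketch of the range-parameter idea (the argument names are hypothetical, not part of the current script):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--row-start", type=int, default=0)
parser.add_argument("--row-end", type=int, default=-1)
args = parser.parse_args()

# ... load the test triplets, then process only
# triplets[args.row_start:args.row_end] and write one output shard per range.

# Shards could then be launched with GNU parallel, e.g.:
#   parallel python scripts/evaluate/augment_with_closest.py \
#       --row-start {1} --row-end {2} ::: 0 850000 :::+ 850000 1700000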

After computing the distances, for some random examples from the wiki corpus, please fetch the 5 closest train examples together with their scores, to get a feel for how well this conforms to the intuition of what is novel and what is trivial; a minimal sketch follows below.
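For the inspection step, a minimal sketch, assuming dist_row is a 1-D array of distances from one sampled wiki triplet to every train triplet, and train_triplets is a parallel list of (head, rel, tail) tuples (both names hypothetical):

import numpy as np

nearest = np.argsort(dist_row)[:5]  # indices of the 5 smallest distances
for i in nearest:
    print(train_triplets[i], dist_row[i])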

This should conclude the test set evaluation.

Subtasks:

  • Code the script (and commit it to master)
  • Run it
  • Create gdoc with some examples
  • how to run (also in the docstring of the code):
    If you want to dump the computed unique head and tail distances, you need at least 80GB of memory. I also used 8 CPU cores.
    python scripts/evaluate/closest_neighbour.py $DATA_DIR/LiACL/conceptnet/train100k.txt $DATA_DIR/LiACL/tuples.cn.txt $DATA_DIR/embeddings/LiACL/embeddings_OMCS.txt ./closest_results/tuplescn_minimal_dist_examples_ver2.txt True 10

  • some useful stuff:

    • I will put the precomputed (unique) head and tail distances in my MILA directory (about 9GB and 37GB). Put them in your save_dir if you want to load them. The path is: /data/milatmp1/hosseise/csp_data/precomputed_distances
    • I will also put the five-closest-examples file there

Awesome possum. Did you end up using something like OMP_NUM_THREADS?

Yeah, I set it to 16 at the beginning of the file, but it doesn't make that much of a difference.

Yes, it's an env variable, and I set it like this:

THREADS_LIMIT_ENV = 'OMP_NUM_THREADS'
os.environ[THREADS_LIMIT_ENV] = '16'

I read somewhere about setting it like this.

I would double-check; there is some magic in whether you set it before or after the numpy import. It might not matter, but it might, and you can check quickly. A sketch of the safe ordering follows below.
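For reference, a minimal sketch of the ordering that should be safe: OMP_NUM_THREADS is typically read by the BLAS backend when it initializes, so set it before numpy is first imported:

import os

# Set the thread limit before numpy (and thus its BLAS backend) is imported.
os.environ['OMP_NUM_THREADS'] = '16'

import numpy as np  # imported only after the variable is set, on purpose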

Note: we will need this to work for wiki in the not too distant future. The worst case is subsampling wiki down to the conceptnet50k size, but that would be non-ideal.

Note for the future: this code can be sped up a lot just by using matrix broadcasting:

import numpy as np
import tqdm

for thead, trel, ttail in tqdm.tqdm(zip(target_heads, target_rels, target_tails)):
    top_five = [('', '', np.inf) for _ in range(5)]
    for shead, srel, stail in zip(source_heads, source_rels, source_tails):
        # +INF for differing relations, otherwise the max of the precomputed
        # head-head and tail-tail distances.
        dist = np.inf if srel.lower() != trel.lower() else max(
            head_distances[source_h_keys[shead], target_h_keys[thead]],
            tail_distances[source_t_keys[stail], target_t_keys[ttail]])
        # Keep the five smallest distances (the snippet was truncated here
        # in the original; this is one plausible completion).
        if dist < top_five[-1][2]:
            top_five[-1] = (shead, stail, dist)
            top_five.sort(key=lambda x: x[2])

Something along the lines of:

import numpy as np

def calculate_distance(ex, S, same_rel=False):
    # Assumes that the featurization is [head, rel, tail], concatenated.
    D = S.shape[1] // 3  # integer division; `/` would break the slicing below
    dist1 = np.linalg.norm(ex.reshape(1, -1)[:, 0:D] - S[:, 0:D], axis=1)
    dist2 = np.linalg.norm(ex.reshape(1, -1)[:, -D:] - S[:, -D:], axis=1)
    if same_rel:
        # Relation embeddings match exactly iff their distance is 0.
        dist3 = np.linalg.norm(ex.reshape(1, -1)[:, D:2 * D] - S[:, D:2 * D], axis=1)
        same_rel_id = (dist3 == 0).astype("int")
        # The large constant stands in for +INF when relations differ.
        return same_rel_id * (dist1 + dist2) + (1 - same_rel_id) * 1000000000
    else:
        return dist1 + dist2

import tqdm

def calculate_distances(df, df_feat, train_feat, calc_isfar=True, same_rel=False):
    # `relations` is assumed to be the list of relation names defined elsewhere
    # in the module; `df` is the test-set DataFrame with a `rel` column.
    scores = []
    for i in tqdm.tqdm(range(len(df)), total=len(df)):
        scores.append(calculate_distance(df_feat[i], train_feat, same_rel=same_rel))
    scores_min = np.array([a.min() for a in scores])

    # Mean minimal distance per relation, used as the "far" threshold.
    mean_per_relation = {}
    for r in relations:
        scores_rel = scores_min[np.where(df.rel.values == r)[0]]
        mean_per_relation[r.lower()] = scores_rel.mean()

    scores_isfar = []
    if calc_isfar:
        # An example is "far" if its minimal train distance exceeds the mean
        # for its relation.
        for i in range(len(df)):
            scores_isfar.append(mean_per_relation[df.rel.iloc[i].lower()] < scores_min[i])
        scores_isfar = np.array(scores_isfar)

    return scores, scores_min, scores_isfar
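A hedged usage sketch, assuming df is a pandas DataFrame of test triplets with a rel column, and df_feat / train_feat are the concatenated [head, rel, tail] embedding matrices described above (names hypothetical):

scores, scores_min, scores_isfar = calculate_distances(
    df, df_feat, train_feat, calc_isfar=True, same_rel=True)
df["min_train_distance"] = scores_min  # distance to the closest train triplet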