kudkudak / common-sense-prediction

Common sense prediction using DL.


Code script to add distances, run for wiki and conceptnet, analyze

kudkudak opened this issue

We have two test sets: the Wiki test set has 1.7M triplets, and the ConceptNet one has 3k. For each test triplet we need to compute the distance (based on a given embedding) to the closest example in the train set (100k examples).

This computation is feasible if we use properties of the metric. Worst case, we can subsample the wiki corpus; that shouldn't be an issue. Start from the script https://github.com/kudkudak/common-sense-prediction/blob/master/scripts/evaluate/augment_with_closest.py. The new script should take similar arguments.

Distance function: +INF if the relations differ, otherwise max(||head_a - head_b||^2, ||tail_a - tail_b||^2). Note that this can be sped up significantly by the following tricks:

  • precomputing the first and second term of the max for each unique head and tail. There are 25k unique heads and 50k unique tails, so this should speed up the computation by roughly 3x (see the sketch after this list).
  • dividing the training set by relation (because the distance is +INF whenever the relations differ)
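For illustration, a minimal sketch of the precomputation, assuming the unique head and tail phrases have already been embedded into matrices (all variable names here are hypothetical):

import numpy as np
from scipy.spatial.distance import cdist

# train_head_emb: (n, d) embeddings of the unique train heads;
# test_head_emb: (m, d) embeddings of the unique test heads (likewise for tails).
head_dist = cdist(train_head_emb, test_head_emb, metric="sqeuclidean")
tail_dist = cdist(train_tail_emb, test_tail_emb, metric="sqeuclidean")

With these matrices in place, the distance between a train triplet and a test triplet sharing the same relation reduces to max(head_dist[i_h, j_h], tail_dist[i_t, j_t]) via index lookups; pairs with different relations stay at +INF.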

Suggestions:

  • Parallelize only if needed (beware of memory problems when parallelizing with GloVe loaded). Probably the easiest parallelization is to use GNU parallel and add to your script some sort of range parameters indicating which rows to compute; a sketch follows below.
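If sharding does become necessary, a hedged sketch of the range-parameter idea (the argument names are hypothetical, not part of the current script):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--row-start", type=int, default=0)
parser.add_argument("--row-end", type=int, default=-1)
args = parser.parse_args()

# ... load the test triplets, then process only
# triplets[args.row_start:args.row_end] and write one output shard per range.

# Shards could then be launched with GNU parallel, e.g.:
#   parallel python scripts/evaluate/augment_with_closest.py \
#       --row-start {1} --row-end {2} ::: 0 850000 :::+ 850000 1700000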

After computing the distances, for some random examples from the wiki corpus, please fetch the 5 closest train examples together with their scores, to get a feel for how well this conforms to the intuition of what is novel and what is trivial; a minimal sketch follows below.
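For the inspection step, a minimal sketch, assuming dist_row is a 1-D array of distances from one sampled wiki triplet to every train triplet, and train_triplets is a parallel list of (head, rel, tail) tuples (both names hypothetical):

import numpy as np

nearest = np.argsort(dist_row)[:5]  # indices of the 5 smallest distances
for i in nearest:
    print(train_triplets[i], dist_row[i])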

This should conclude the test set evaluation.

Subtasks:

  • Code the script (and commit it to master)
  • Run it
  • Create gdoc with some examples
  • how to run (also in the docstring of the code):
    If you want to dump the computed unique head and tail distances, you need at least 80GB of memory. I also used 8 CPU cores.
    python scripts/evaluate/closest_neighbour.py $DATA_DIR/LiACL/conceptnet/train100k.txt $DATA_DIR/LiACL/tuples.cn.txt $DATA_DIR/embeddings/LiACL/embeddings_OMCS.txt ./closest_results/tuplescn_minimal_dist_examples_ver2.txt True 10

  • some useful stuff:

    • I will put the precomputed (unique) head and tail distances in my MILA directory (about 9GB and 37GB). Put them in your save_dir if you want to load them. The path is: /data/milatmp1/hosseise/csp_data/precomputed_distances
    • I will also put the five-closest-examples file there

Awesome possum. Did you end up using something like OMP_NUM_THREADS?

Yeah, I set it to 16 at the beginning of the file, but it doesn't make that much of a difference.

Yes, it's an env variable, and I set it like this:

THREADS_LIMIT_ENV = 'OMP_NUM_THREADS'
os.environ[THREADS_LIMIT_ENV] = '16'

I read somewhere about setting it like this.

I would double-check; there is some magic in whether you set it before or after the numpy import. It might not matter, but it might, and you can check quickly. A sketch of the safe ordering follows below.
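For reference, a minimal sketch of the ordering that should be safe: OMP_NUM_THREADS is typically read by the BLAS backend when it initializes, so set it before numpy is first imported:

import os

# Set the thread limit before numpy (and thus its BLAS backend) is imported.
os.environ['OMP_NUM_THREADS'] = '16'

import numpy as np  # imported only after the variable is set, on purpose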

Note: we will need this to work for wiki in the not too distant future. The worst case is subsampling wiki down to the conceptnet50k size, but that would be non-ideal.

Note for the future: this code can be sped up a lot just by using matrix broadcasting:

import numpy as np
import tqdm

for thead, trel, ttail in tqdm.tqdm(zip(target_heads, target_rels, target_tails)):
    top_five = [('', '', np.inf) for _ in range(5)]
    for shead, srel, stail in zip(source_heads, source_rels, source_tails):
        # +INF for differing relations, otherwise the max of the precomputed
        # head-head and tail-tail distances.
        dist = np.inf if srel.lower() != trel.lower() else max(
            head_distances[source_h_keys[shead], target_h_keys[thead]],
            tail_distances[source_t_keys[stail], target_t_keys[ttail]])
        # Keep the five smallest distances (the snippet was truncated here
        # in the original; this is one plausible completion).
        if dist < top_five[-1][2]:
            top_five[-1] = (shead, stail, dist)
            top_five.sort(key=lambda x: x[2])

Something along the lines of:

import numpy as np

def calculate_distance(ex, S, same_rel=False):
    # Assumes that the featurization is [head, rel, tail], concatenated.
    D = S.shape[1] // 3  # integer division; `/` would break the slicing below
    dist1 = np.linalg.norm(ex.reshape(1, -1)[:, 0:D] - S[:, 0:D], axis=1)
    dist2 = np.linalg.norm(ex.reshape(1, -1)[:, -D:] - S[:, -D:], axis=1)
    if same_rel:
        # Relation embeddings match exactly iff their distance is 0.
        dist3 = np.linalg.norm(ex.reshape(1, -1)[:, D:2 * D] - S[:, D:2 * D], axis=1)
        same_rel_id = (dist3 == 0).astype("int")
        # The large constant stands in for +INF when relations differ.
        return same_rel_id * (dist1 + dist2) + (1 - same_rel_id) * 1000000000
    else:
        return dist1 + dist2

import tqdm

def calculate_distances(df, df_feat, train_feat, calc_isfar=True, same_rel=False):
    # `relations` is assumed to be the list of relation names defined elsewhere
    # in the module; `df` is the test-set DataFrame with a `rel` column.
    scores = []
    for i in tqdm.tqdm(range(len(df)), total=len(df)):
        scores.append(calculate_distance(df_feat[i], train_feat, same_rel=same_rel))
    scores_min = np.array([a.min() for a in scores])

    # Mean minimal distance per relation, used as the "far" threshold.
    mean_per_relation = {}
    for r in relations:
        scores_rel = scores_min[np.where(df.rel.values == r)[0]]
        mean_per_relation[r.lower()] = scores_rel.mean()

    scores_isfar = []
    if calc_isfar:
        # An example is "far" if its minimal train distance exceeds the mean
        # for its relation.
        for i in range(len(df)):
            scores_isfar.append(mean_per_relation[df.rel.iloc[i].lower()] < scores_min[i])
        scores_isfar = np.array(scores_isfar)

    return scores, scores_min, scores_isfar
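A hedged usage sketch, assuming df is a pandas DataFrame of test triplets with a rel column, and df_feat / train_feat are the concatenated [head, rel, tail] embedding matrices described above (names hypothetical):

scores, scores_min, scores_isfar = calculate_distances(
    df, df_feat, train_feat, calc_isfar=True, same_rel=True)
df["min_train_distance"] = scores_min  # distance to the closest train triplet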