guxd / deep-code-search

DeepCS: Deep Code Search

The dataset used for evaluation

skye95git opened this issue

Hi, I have a few questions about the evaluation:

  1. I want to evaluate DeepCS with the CosBench dataset. The CosBench wiki says, "The code is based on guxd/deep-code-search, and we added some evaluation code based on our evaluation needs. The author of the paper has updated the code in their Github project." Which version did you update, Keras or PyTorch?
  2. Can the existing Keras-version evaluation code be used to evaluate the CosBench dataset?
  3. I can't find any evaluation code in the "Code Structures" section of the PyTorch version. Could you provide it?
    Thanks!

I also found another problem: 'SearchEngine' object has no attribute 'eval' in main.py of the Keras version.

Hi, thanks for the questions!

  1. We updated both the Keras and PyTorch versions. The Keras version was updated to adapt to TensorFlow. The PyTorch version is the bleeding-edge version, so it is updated frequently for model refactoring, bug fixes, and hyperparameter tuning.
  2. You can use the existing Keras version to evaluate the CosBench dataset.
  3. You can perform an automatic evaluation using the validate function in train.py [screenshot], or you can run the search.py script and perform a manual evaluation based on the search results.

Please modify engine.eval(..) to engine.valid(..): [screenshot]

Thank you for your answer! I have a couple of other questions:

  1. After modifying engine.eval(..) to engine.valid(..), I evaluated the dummy dataset with the valid function and got "name 'acc' is not defined". Should it return np.mean(accs), np.mean(mrrs), np.mean(maps), np.mean(ndcgs)?
  2. I am using the epo500_desc.h5 and epo500_code.h5 models from the Google Drive share. Are they pre-trained models for the Keras version? Were they trained on the real dataset? Can I evaluate them directly with the CosBench dataset?
  3. When I evaluated with the Keras version, the validation data were test.methName.h5, test.apiseq.h5, test.tokens.h5, and test.desc.h5, but the CosBench dataset doesn't have files of this type; its QA set is a JSON file. Do I need to convert the QA set into those four files to perform the evaluation?
  4. I have another question about the data: the training data is used for training, the valid data for evaluation, the use data for computing the code vectors of the search codebase, and the results data for holding those code vectors. What is the vocabulary used for? Can you tell me?
  1. Yes, you should return np.mean(accs) instead of acc.
  2. Yes, they were trained on a real dataset. To evaluate them, you need to binarize the CosBench data using our vocabulary.
  3. You should preprocess your data before feeding it to the pre-trained model. In particular, use the vocabulary files we provided to binarize your dataset into token indices and create an h5 file (see the sketch after this list).
  4. The vocabulary is used to convert text into numbers (token indices). It is useful, for example, for converting a natural language query into a sequence of token indices.
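
A rough sketch of what that preprocessing can look like. Everything here is illustrative: the vocabulary file name and format (assumed to be a pickled token→index dict), the special-token ids, and the h5 layout (a flat /phrases array of token ids plus an /indices table of (length, pos) records, which is what the PyTorch data_loader.py appears to read) should all be checked against the repository before relying on them.

```python
import pickle

import numpy as np
import tables  # PyTables; adjust if your loader uses a different h5 reader

PAD_ID, UNK_ID = 0, 1  # assumed special-token ids; check the constants in the repo

def load_vocab(path):
    """Load a token -> index vocabulary. Assumes a pickled dict; if your copy of
    the repo ships JSON vocabularies, swap this for json.load."""
    with open(path, 'rb') as f:
        return pickle.load(f)

def binarize(sentences, vocab):
    """Convert tokenized sentences (lists of strings) into lists of token ids,
    mapping out-of-vocabulary tokens to UNK_ID."""
    return [[vocab.get(tok, UNK_ID) for tok in sent] for sent in sentences]

def save_flat_h5(index_lists, path):
    """Write sentences in a flat (phrases + indices) layout: /phrases is one long
    1-D array of token ids and /indices stores one (length, pos) record per
    sentence. Verify the node and field names against data_loader.py."""
    lengths = np.array([len(x) for x in index_lists], dtype=np.int64)
    positions = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    phrases = np.concatenate([np.asarray(x, dtype=np.int64) for x in index_lists])
    indices = np.zeros(len(index_lists), dtype=[('length', np.int64), ('pos', np.int64)])
    indices['length'], indices['pos'] = lengths, positions
    with tables.open_file(path, mode='w') as h5f:
        h5f.create_array('/', 'phrases', phrases)
        h5f.create_table('/', 'indices', obj=indices)

if __name__ == '__main__':
    # Example: binarize CosBench queries with the released description vocabulary.
    # The file names below are hypothetical placeholders.
    desc_vocab = load_vocab('./data/github/vocab.desc.pkl')
    queries = ['convert an inputstream to a string']
    ids = binarize([q.lower().split() for q in queries], desc_vocab)
    save_flat_h5(ids, './data/github/cosbench.desc.h5')
```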

I have tried the evaluation with both versions, following the method you mentioned, and ran into some problems:

1. In the Keras version, when I use the pre-trained model to query the real dataset, the similarity scores are low, only about 0.12:

```
Input Query: convert an inputstream to a string
How many results? 5
('public String getInvocation ( ) { StringBuilder b = new StringBuilder ( String . format ( "%s(" , name ) ) ; int countdown = placeholders . size ( ) ; for ( PlaceholderWriter ph : placeholders ) { b . append ( ph . getValue ( ) ) ; if ( -- countdown > 0 ) { b . append ( "," ) ; } } b . append ( ")" ) ; return b . toString ( ) ; } \n', 0.12840500827015766)

('public void assertIsChild ( Element elem ) { assert elem . getParentElement ( ) . getParentElement ( ) == this . parentElem : "Element-is-not-a-child-of-this-layout" ; } \n', 0.12840500827015766)

("private static Set < String > expandPropertyRefs ( Set < String > refs ) { if ( refs == null ) { return Collections . emptySet ( ) ; } Set < String > toReturn = new TreeSet < String > ( ) ; for ( String raw : refs ) { for ( int idx = raw . length ( ) ; idx >= 0 ; idx = raw . lastIndexOf ( '.' , idx - 1 ) ) { toReturn . add ( raw . substring ( 0 , idx ) ) ; } } return toReturn ; } \n", 0.12840500827015766)

('public List < AbstractEditorDelegate < ? , ? >> getRaw ( Object key ) { return map . get ( key ) ; } \n', 0.12840500827015766)

('public static byte [ ] getBytes ( String s ) { try { return s . getBytes ( DEFAULT_ENCODING ) ; } catch ( UnsupportedEncodingException e ) { throw new RuntimeException ( "The-JVM-does-not-support-the-compiler's-default-encoding." , e ) ; } } \n', 0.12840500827015766)
```

2. In the Keras version, when I evaluated the real dataset with the pre-trained model, there was a big gap between my results and those in the paper: ACC=0.0007, MRR=0.00019595238095238094, MAP=0.00019595238095238094, nDCG=0.0003094175888481896

3. In the PyTorch version, when I evaluated the real dataset with the pre-trained model, the result is NaN. Here is my evaluation code:
```python
import os
import sys
import traceback
import math
import numpy as np
import argparse
import threading
import codecs
from tqdm import tqdm
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")
import torch

from utils import normalize, similarity, sent2indexes
from data_loader import CodeSearchDataset, load_dict, load_vecs
import models, configs

def parse_args():
    parser = argparse.ArgumentParser("Train and Test Code Search(Embedding) Model")
    parser.add_argument('--data_path', type=str, default='./data/', help='location of the data corpus')
    parser.add_argument('--model', type=str, default='JointEmbeder', help='model name')
    parser.add_argument('-d', '--dataset', type=str, default='github', help='name of dataset.java, python')
    parser.add_argument('-t', '--timestamp', type=str, help='time stamp')
    parser.add_argument('--reload_from', type=int, default=-1, help='step to reload from')
    parser.add_argument('--chunk_size', type=int, default=2000000,
                        help='codebase and code vector are stored in many chunks. '
                             'Note: should be consistent with the same argument in the repr_code.py')
    parser.add_argument('-g', '--gpu_id', type=int, default=0, help='GPU ID')
    return parser.parse_args()

##### Evaluation #####
def validate(valid_set, model, pool_size, K, sim_measure):
    """
    simple validation in a code pool.
    @param: poolsize - size of the code pool, if -1, load the whole test set
    """
    def ACC(real, predict):
        sum = 0.0
        for val in real:
            try:
                index = predict.index(val)
            except ValueError:
                index = -1
            if index != -1:
                sum = sum + 1
        return sum / float(len(real))

    def MAP(real, predict):
        sum = 0.0
        for id, val in enumerate(real):
            try:
                index = predict.index(val)
            except ValueError:
                index = -1
            if index != -1:
                sum = sum + (id + 1) / float(index + 1)
        return sum / float(len(real))

    def MRR(real, predict):
        sum = 0.0
        for val in real:
            try:
                index = predict.index(val)
            except ValueError:
                index = -1
            if index != -1:
                sum = sum + 1.0 / float(index + 1)
        return sum / float(len(real))

    def NDCG(real, predict):
        dcg = 0.0
        idcg = IDCG(len(real))
        for i, predictItem in enumerate(predict):
            if predictItem in real:
                itemRelevance = 1
                rank = i + 1
                dcg += (math.pow(2, itemRelevance) - 1.0) * (math.log(2) / math.log(rank + 1))
        return dcg / float(idcg)

    def IDCG(n):
        idcg = 0
        itemRelevance = 1
        for i in range(n):
            idcg += (math.pow(2, itemRelevance) - 1.0) * (math.log(2) / math.log(i + 2))
        return idcg

    model.eval()
    device = next(model.parameters()).device
    data_loader = torch.utils.data.DataLoader(dataset=valid_set, batch_size=10000,
                                              shuffle=True, drop_last=True, num_workers=1)
    accs, mrrs, maps, ndcgs = [], [], [], []
    code_reprs, desc_reprs = [], []
    n_processed = 0
    for batch in tqdm(data_loader):
        if len(batch) == 10:  # names, name_len, apis, api_len, toks, tok_len, descs, desc_len, bad_descs, bad_desc_len
            code_batch = [tensor.to(device) for tensor in batch[:6]]
            desc_batch = [tensor.to(device) for tensor in batch[6:8]]
        else:  # code_ids, type_ids, code_mask, good_ids, good_mask, bad_ids, bad_mask
            code_batch = [tensor.to(device) for tensor in batch[:3]]
            desc_batch = [tensor.to(device) for tensor in batch[3:5]]
        with torch.no_grad():
            code_repr = model.code_encoding(*code_batch).data.cpu().numpy().astype(np.float32)
            desc_repr = model.desc_encoding(*desc_batch).data.cpu().numpy().astype(np.float32)  # [poolsize x hid_size]
            if sim_measure == 'cos':
                code_repr = normalize(code_repr)
                desc_repr = normalize(desc_repr)
        code_reprs.append(code_repr)
        desc_reprs.append(desc_repr)
        n_processed += batch[0].size(0)
    code_reprs, desc_reprs = np.vstack(code_reprs), np.vstack(desc_reprs)

    for k in tqdm(range(0, n_processed, pool_size)):
        code_pool, desc_pool = code_reprs[k:k+pool_size], desc_reprs[k:k+pool_size]
        for i in range(min(10000, pool_size)):  # for i in range(pool_size):
            desc_vec = np.expand_dims(desc_pool[i], axis=0)  # [1 x dim]
            n_results = K
            if sim_measure == 'cos':
                sims = np.dot(code_pool, desc_vec.T)[:, 0]  # [pool_size]
            else:
                sims = similarity(code_pool, desc_vec, sim_measure)  # [pool_size]

            negsims = np.negative(sims)
            predict = np.argpartition(negsims, kth=n_results-1)  # predict = np.argsort(negsims)
            predict = predict[:n_results]
            predict = [int(k) for k in predict]
            real = [i]
            accs.append(ACC(real, predict))
            mrrs.append(MRR(real, predict))
            maps.append(MAP(real, predict))
            ndcgs.append(NDCG(real, predict))
    logger.info(f'accs={accs}')
    logger.info(f'ACC={np.mean(accs)}, MRR={np.mean(mrrs)}, MAP={np.mean(maps)}, nDCG={np.mean(ndcgs)}')
    return {'acc': np.mean(accs), 'mrr': np.mean(mrrs), 'map': np.mean(maps), 'ndcg': np.mean(ndcgs)}

if __name__ == '__main__':
    args = parse_args()
    device = torch.device(f"cuda:{args.gpu_id}" if torch.cuda.is_available() else "cpu")
    config = getattr(configs, 'config_' + args.model)()

    ##### Define model ######
    logger.info('Constructing Model..')
    model = getattr(models, args.model)(config)  # initialize the model
    ckpt = f'./output/{args.model}/{args.dataset}/{args.timestamp}/models/step{args.reload_from}.h5'
    model.load_state_dict(torch.load(ckpt, map_location=device))
    model.eval()
    data_path = args.data_path + args.dataset + '/'

    valid_set = eval(config['dataset_name'])(data_path,
                                             config['valid_name'], config['name_len'],
                                             config['valid_api'], config['api_len'],
                                             config['valid_tokens'], config['tokens_len'],
                                             config['valid_desc'], config['desc_len'])
    logger.info("validating..")
    valid_result = validate(valid_set, model, -1, 1, config['sim_measure'])
    logger.info(valid_result)
```

Result:
RuntimeWarning: Mean of empty slice {'acc': nan, 'mrr': nan, 'map': nan, 'ndcg': nan}
4. How can I generate train.methname.h5, train.desc.h5, train.apiseq.h5, and train.tokens.h5 from the original code snippets? Can you share the code?

I have a few more questions:
5. How do I binarize the CosBench data using your vocabulary? Can you share the code?
6. Is there only one ground truth for each description?
7. In train.methname.h5, train.desc.h5, train.apiseq.h5, and train.tokens.h5, the same line corresponds to the same piece of data, right?

There might be some inconsistency between your data and mine.
Here is a test screenshot of the PyTorch version on my local machine:
[screenshot]
and my pretrained checkpoint is stored in this folder:
[screenshot]
I run it with:

python search.py --reload_from 4000000 -t 202106140524

Thank you. In the PyTorch version, when I use the pre-trained model to query the real dataset, the similarity is about 0.94. But the evaluation still has problems: it shows "RuntimeWarning: Mean of empty slice" and returns {'acc': nan, 'mrr': nan, 'map': nan, 'ndcg': nan}.

```
Input Query: convert an inputstream to a string
How many results? 5
('@ Override public Matrix like ( ) { return new SparseRowMatrix ( rowSize ( ) , columnSize ( ) ) ; } \n', 0.9421381)

('public static File getFile ( String token ) { File file = null ; if ( tokenToFileMap . containsKey ( token ) ) { file = tokenToFileMap . get ( token ) ; } return file ; } \n', 0.9421381)

('public String toString ( ) { CharArrayList theKeys = keys ( ) ; StringBuilder buf = new StringBuilder ( ) ; buf . append ( '[' ) ; int maxIndex = theKeys . size ( ) - 1 ; for ( int i = 0 ; i <= maxIndex ; i ++ ) { char key = theKeys . get ( i ) ; buf . append ( String . valueOf ( key ) ) ; if ( i < maxIndex ) { buf . append ( ",-" ) ; } } buf . append ( ']' ) ; return buf . toString ( ) ; } \n', 0.9421381)

('private static void setText ( AutoCompleteTextView view , CharSequence text , boolean filter ) { try { Method method = AutoCompleteTextView . class . getMethod ( "setText" , CharSequence . class , boolean . class ) ; method . setAccessible ( true ) ; method . invoke ( view , text , filter ) ; } catch ( Exception e ) { view . setText ( text ) ; } } \n', 0.9415401)

('public void addSysproperty ( Environment . Variable sysp ) { sysProperties . addVariable ( sysp ) ; } \n', 0.93729883)
```

For the Keras version, you should set reload in config.py to 500 before running code representation and search.
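
A minimal sketch of that change, assuming the Keras config.py keeps the setting in a training-parameters dict (the exact key nesting in your copy may differ; the point is that reload selects which saved epoch, here the released epo500_*.h5 weights, gets loaded before code representation and search):

```python
# config.py (Keras version) -- illustrative sketch, not the exact file contents.
def get_config():
    conf = {
        # ... data paths, vocabulary sizes, model dimensions, etc. ...
        'training_params': {
            # Epoch of the checkpoint to reload before code representation and
            # search; 500 matches the released epo500_*.h5 weights (check the
            # semantics of other values against your copy of config.py).
            'reload': 500,
        },
    }
    return conf
```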

Thanks for your reply, I will try it out tomorrow and give you feedback.

Thank you for your answer. It's very helpful. After setting the reload in config.py to 500, the similarity is about 0.4. The evaluation results are as follows:
ACC=0.6724, MRR=0.2699894444444445, MAP=0.2699894444444445, nDCG=0.3651292111728052
They are somewhat different from the results in the paper.
One thing puzzles me: there is only one ground-truth snippet for each query in the dataset, but the codebase may contain more than one correct answer to a query. The model may return relevant results that are not the labeled ground truth. Does this affect the evaluation of the model?

@skye95git That is one of the potential threats to validity of automatic evaluation. So we also asked real developers to manually inspect the returned results; the reported MRR is calculated based on their labels.
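
For reference, the MRR reported there is the standard mean reciprocal rank over the labeled query set: for each query, take the reciprocal of the rank at which the first relevant result appears (as judged by the annotators), then average over all queries. Using the paper's FRank notation (the rank of the first hit for query q):

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{FRank}_q}$$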

Thank you! I have another problem. When I evaluated the pre-trained Keras model on your dataset, the result was
ACC=0.6724, MRR=0.2699894444444445, MAP=0.2699894444444445, nDCG=0.3651292111728052
But when I evaluate on the CosBench dataset, the result is
ACC=0.0365, MRR=0.0752, MAP=0.007.
Also, the result for each query in the CosBench QA set is different from that reported in the original paper. Do you know why? I have checked many times but still can't find the reason.

Evaluation result of the original paper:
[screenshot]

My evaluation result:
[screenshot]

The results differ by an order of magnitude.

I do not know. Maybe there is something wrong with data preprocessing?

I used the pre-trained model to evaluate, but the results are different from those in the paper.
The results in the Keras version:
ACC=0.6724, MRR=0.2699894444444445, MAP=0.2699894444444445, nDCG=0.3651292111728052

The results in the PyTorch version:
ACC=0.3727, MRR=0.3727, MAP=0.3727, nDCG=0.3727

The results in the paper:
[screenshot]

Are these results normal? The MRR gap is quite large. What adjustments do I need to make to reproduce the results in the paper?

Please note that the results presented in the paper (Table 2) were manually computed from a test set (top 50 questions from Stack Overflow) that is different from the validation set in the repository (a subset of code-comment pairs from github).

Could you tell me how much memory is needed to run repr_code.py in the PyTorch version? It runs for a while and then I get an error: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. It could be out of memory, or the PyTorch and CUDA versions may not match. I'm not sure.

Sorry, I didn't notice that the paper's Table 2 numbers came from a different test set. I queried the 50 questions in Table 1 and then compared my results with Table 2.

I found the reason for the cuDNN error: the PyTorch and CUDA versions didn't match.