guxd / deep-code-search

DeepCS: Deep Code Search

The dataset used for evaluation

skye95git opened this issue

Hi, I have a few questions about the evaluation:

  1. I want to evaluate DeepCS with the CosBench dataset. The CosBench wiki says, "The code is based on guxd/deep-code-search, and we added some evaluation code based on our evaluation needs. The author of the paper has updated the code in their Github project." Which version did you update, Keras or PyTorch?
  2. Can the existing Keras-version evaluation code be used to evaluate the CosBench dataset?
  3. I can't find any evaluation code in the "Code Structures" section of the PyTorch version. Could you provide it?
    Thanks!

I also found another problem: 'SearchEngine' object has no attribute 'eval' in main.py of the Keras version.

Hi, thanks for the questions!

  1. We updated both the Keras and PyTorch versions. The Keras version was updated to adapt to TensorFlow. The PyTorch version is the bleeding-edge version, so it is updated frequently for model refactoring, bug fixes, and hyperparameter tuning.
  2. You can use the existing Keras version to evaluate the CosBench dataset.
  3. You can perform an automatic evaluation using the validate function in train.py [screenshot], or you can run the search.py script and perform a manual evaluation based on the search results.

Please modify engine.eval(..) to engine.valid(..): [screenshot]

Thank you for your answer! I have a couple of other questions:

  1. After modifying engine.eval(..) to engine.valid(..), I evaluated the dummy dataset with the valid function and got "name 'acc' is not defined". Should it return np.mean(accs), np.mean(mrrs), np.mean(maps), np.mean(ndcgs)?
  2. I am using the epo500_desc.h5 and epo500_code.h5 models from the Google Drive share. Are they pre-trained models for the Keras version? Were they trained on the real dataset? Can I evaluate them directly with the CosBench dataset?
  3. When I evaluated with the Keras version, the validation data were test.methName.h5, test.apiseq.h5, test.tokens.h5, and test.desc.h5, but the CosBench dataset doesn't have files of this type; its QA set is a JSON file. Do I need to convert the QA set into those four files to perform the evaluation?
  4. I have another question about the data: the training data is used for training, the valid data for evaluation, the use data for computing the code vectors of the search codebase, and the results data for holding those code vectors. What is the vocabulary used for? Can you tell me?
  1. Yes, you should return np.mean(accs) instead of acc.
  2. Yes, they were trained on a real dataset. To evaluate them, you need to binarize the CosBench data using our vocabulary.
  3. You should preprocess your data before feeding it to the pre-trained model. In particular, use the vocabulary files we provided to binarize your dataset into token indices and create an h5 file (see the sketch after this list).
  4. The vocabulary is used to convert text into numbers (token indices). It is useful, for example, for converting a natural language query into a sequence of token indices.
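
A rough sketch of what that preprocessing can look like. Everything here is illustrative: the vocabulary file name and format (assumed to be a pickled token→index dict), the special-token ids, and the h5 layout (a flat /phrases array of token ids plus an /indices table of (length, pos) records, which is what the PyTorch data_loader.py appears to read) should all be checked against the repository before relying on them.

```python
import pickle

import numpy as np
import tables  # PyTables; adjust if your loader uses a different h5 reader

PAD_ID, UNK_ID = 0, 1  # assumed special-token ids; check the constants in the repo

def load_vocab(path):
    """Load a token -> index vocabulary. Assumes a pickled dict; if your copy of
    the repo ships JSON vocabularies, swap this for json.load."""
    with open(path, 'rb') as f:
        return pickle.load(f)

def binarize(sentences, vocab):
    """Convert tokenized sentences (lists of strings) into lists of token ids,
    mapping out-of-vocabulary tokens to UNK_ID."""
    return [[vocab.get(tok, UNK_ID) for tok in sent] for sent in sentences]

def save_flat_h5(index_lists, path):
    """Write sentences in a flat (phrases + indices) layout: /phrases is one long
    1-D array of token ids and /indices stores one (length, pos) record per
    sentence. Verify the node and field names against data_loader.py."""
    lengths = np.array([len(x) for x in index_lists], dtype=np.int64)
    positions = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    phrases = np.concatenate([np.asarray(x, dtype=np.int64) for x in index_lists])
    indices = np.zeros(len(index_lists), dtype=[('length', np.int64), ('pos', np.int64)])
    indices['length'], indices['pos'] = lengths, positions
    with tables.open_file(path, mode='w') as h5f:
        h5f.create_array('/', 'phrases', phrases)
        h5f.create_table('/', 'indices', obj=indices)

if __name__ == '__main__':
    # Example: binarize CosBench queries with the released description vocabulary.
    # The file names below are hypothetical placeholders.
    desc_vocab = load_vocab('./data/github/vocab.desc.pkl')
    queries = ['convert an inputstream to a string']
    ids = binarize([q.lower().split() for q in queries], desc_vocab)
    save_flat_h5(ids, './data/github/cosbench.desc.h5')
```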

I have tried the evaluation with both versions, following the method you mentioned, and ran into some problems:

1. In the Keras version, when I use the pre-trained model to query the real dataset, the similarity scores are low, only about 0.12:

```
Input Query: convert an inputstream to a string
How many results? 5
('public String getInvocation ( ) { StringBuilder b = new StringBuilder ( String . format ( "%s(" , name ) ) ; int countdown = placeholders . size ( ) ; for ( PlaceholderWriter ph : placeholders ) { b . append ( ph . getValue ( ) ) ; if ( -- countdown > 0 ) { b . append ( "," ) ; } } b . append ( ")" ) ; return b . toString ( ) ; } \n', 0.12840500827015766)

('public void assertIsChild ( Element elem ) { assert elem . getParentElement ( ) . getParentElement ( ) == this . parentElem : "Element-is-not-a-child-of-this-layout" ; } \n', 0.12840500827015766)

("private static Set < String > expandPropertyRefs ( Set < String > refs ) { if ( refs == null ) { return Collections . emptySet ( ) ; } Set < String > toReturn = new TreeSet < String > ( ) ; for ( String raw : refs ) { for ( int idx = raw . length ( ) ; idx >= 0 ; idx = raw . lastIndexOf ( '.' , idx - 1 ) ) { toReturn . add ( raw . substring ( 0 , idx ) ) ; } } return toReturn ; } \n", 0.12840500827015766)

('public List < AbstractEditorDelegate < ? , ? >> getRaw ( Object key ) { return map . get ( key ) ; } \n', 0.12840500827015766)

('public static byte [ ] getBytes ( String s ) { try { return s . getBytes ( DEFAULT_ENCODING ) ; } catch ( UnsupportedEncodingException e ) { throw new RuntimeException ( "The-JVM-does-not-support-the-compiler's-default-encoding." , e ) ; } } \n', 0.12840500827015766)
```

2. In the Keras version, when I evaluated the real dataset with the pre-trained model, there was a big gap between my results and those in the paper: ACC=0.0007, MRR=0.00019595238095238094, MAP=0.00019595238095238094, nDCG=0.0003094175888481896

3. In the PyTorch version, when I evaluated the real dataset with the pre-trained model, the result is NaN. Here is my evaluation code:
```python
import os
import sys
import traceback
import math
import numpy as np
import argparse
import threading
import codecs
from tqdm import tqdm
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(message)s")
import torch

from utils import normalize, similarity, sent2indexes
from data_loader import CodeSearchDataset, load_dict, load_vecs
import models, configs

def parse_args():
    parser = argparse.ArgumentParser("Train and Test Code Search(Embedding) Model")
    parser.add_argument('--data_path', type=str, default='./data/', help='location of the data corpus')
    parser.add_argument('--model', type=str, default='JointEmbeder', help='model name')
    parser.add_argument('-d', '--dataset', type=str, default='github', help='name of dataset.java, python')
    parser.add_argument('-t', '--timestamp', type=str, help='time stamp')
    parser.add_argument('--reload_from', type=int, default=-1, help='step to reload from')
    parser.add_argument('--chunk_size', type=int, default=2000000,
                        help='codebase and code vector are stored in many chunks. '
                             'Note: should be consistent with the same argument in the repr_code.py')
    parser.add_argument('-g', '--gpu_id', type=int, default=0, help='GPU ID')
    return parser.parse_args()

##### Evaluation #####
def validate(valid_set, model, pool_size, K, sim_measure):
    """
    simple validation in a code pool.
    @param: poolsize - size of the code pool, if -1, load the whole test set
    """
    def ACC(real, predict):
        sum = 0.0
        for val in real:
            try:
                index = predict.index(val)
            except ValueError:
                index = -1
            if index != -1:
                sum = sum + 1
        return sum / float(len(real))

    def MAP(real, predict):
        sum = 0.0
        for id, val in enumerate(real):
            try:
                index = predict.index(val)
            except ValueError:
                index = -1
            if index != -1:
                sum = sum + (id + 1) / float(index + 1)
        return sum / float(len(real))

    def MRR(real, predict):
        sum = 0.0
        for val in real:
            try:
                index = predict.index(val)
            except ValueError:
                index = -1
            if index != -1:
                sum = sum + 1.0 / float(index + 1)
        return sum / float(len(real))

    def NDCG(real, predict):
        dcg = 0.0
        idcg = IDCG(len(real))
        for i, predictItem in enumerate(predict):
            if predictItem in real:
                itemRelevance = 1
                rank = i + 1
                dcg += (math.pow(2, itemRelevance) - 1.0) * (math.log(2) / math.log(rank + 1))
        return dcg / float(idcg)

    def IDCG(n):
        idcg = 0
        itemRelevance = 1
        for i in range(n):
            idcg += (math.pow(2, itemRelevance) - 1.0) * (math.log(2) / math.log(i + 2))
        return idcg

    model.eval()
    device = next(model.parameters()).device
    data_loader = torch.utils.data.DataLoader(dataset=valid_set, batch_size=10000,
                                              shuffle=True, drop_last=True, num_workers=1)
    accs, mrrs, maps, ndcgs = [], [], [], []
    code_reprs, desc_reprs = [], []
    n_processed = 0
    for batch in tqdm(data_loader):
        if len(batch) == 10:  # names, name_len, apis, api_len, toks, tok_len, descs, desc_len, bad_descs, bad_desc_len
            code_batch = [tensor.to(device) for tensor in batch[:6]]
            desc_batch = [tensor.to(device) for tensor in batch[6:8]]
        else:  # code_ids, type_ids, code_mask, good_ids, good_mask, bad_ids, bad_mask
            code_batch = [tensor.to(device) for tensor in batch[:3]]
            desc_batch = [tensor.to(device) for tensor in batch[3:5]]
        with torch.no_grad():
            code_repr = model.code_encoding(*code_batch).data.cpu().numpy().astype(np.float32)
            desc_repr = model.desc_encoding(*desc_batch).data.cpu().numpy().astype(np.float32)  # [poolsize x hid_size]
            if sim_measure == 'cos':
                code_repr = normalize(code_repr)
                desc_repr = normalize(desc_repr)
        code_reprs.append(code_repr)
        desc_reprs.append(desc_repr)
        n_processed += batch[0].size(0)
    code_reprs, desc_reprs = np.vstack(code_reprs), np.vstack(desc_reprs)

    for k in tqdm(range(0, n_processed, pool_size)):
        code_pool, desc_pool = code_reprs[k:k+pool_size], desc_reprs[k:k+pool_size]
        for i in range(min(10000, pool_size)):  # for i in range(pool_size):
            desc_vec = np.expand_dims(desc_pool[i], axis=0)  # [1 x dim]
            n_results = K
            if sim_measure == 'cos':
                sims = np.dot(code_pool, desc_vec.T)[:, 0]  # [pool_size]
            else:
                sims = similarity(code_pool, desc_vec, sim_measure)  # [pool_size]

            negsims = np.negative(sims)
            predict = np.argpartition(negsims, kth=n_results-1)  # predict = np.argsort(negsims)
            predict = predict[:n_results]
            predict = [int(k) for k in predict]
            real = [i]
            accs.append(ACC(real, predict))
            mrrs.append(MRR(real, predict))
            maps.append(MAP(real, predict))
            ndcgs.append(NDCG(real, predict))
    logger.info(f'accs={accs}')
    logger.info(f'ACC={np.mean(accs)}, MRR={np.mean(mrrs)}, MAP={np.mean(maps)}, nDCG={np.mean(ndcgs)}')
    return {'acc': np.mean(accs), 'mrr': np.mean(mrrs), 'map': np.mean(maps), 'ndcg': np.mean(ndcgs)}

if __name__ == '__main__':
    args = parse_args()
    device = torch.device(f"cuda:{args.gpu_id}" if torch.cuda.is_available() else "cpu")
    config = getattr(configs, 'config_' + args.model)()

    ##### Define model ######
    logger.info('Constructing Model..')
    model = getattr(models, args.model)(config)  # initialize the model
    ckpt = f'./output/{args.model}/{args.dataset}/{args.timestamp}/models/step{args.reload_from}.h5'
    model.load_state_dict(torch.load(ckpt, map_location=device))
    model.eval()
    data_path = args.data_path + args.dataset + '/'

    valid_set = eval(config['dataset_name'])(data_path,
                                             config['valid_name'], config['name_len'],
                                             config['valid_api'], config['api_len'],
                                             config['valid_tokens'], config['tokens_len'],
                                             config['valid_desc'], config['desc_len'])
    logger.info("validating..")
    valid_result = validate(valid_set, model, -1, 1, config['sim_measure'])
    logger.info(valid_result)
```

Result:
RuntimeWarning: Mean of empty slice {'acc': nan, 'mrr': nan, 'map': nan, 'ndcg': nan}
4. How can I generate train.methname.h5, train.desc.h5, train.apiseq.h5, and train.tokens.h5 from the original code snippets? Can you share the code?

I have a few more questions:
5. How do I binarize the CosBench data using your vocabulary? Can you share the code?
6. Is there only one ground truth for each description?
7. In train.methname.h5, train.desc.h5, train.apiseq.h5, and train.tokens.h5, the same line corresponds to the same piece of data, right?

There might be some inconsistency between your data and mine.
Here is a test screenshot of the PyTorch version on my local machine:
[screenshot]
and my pretrained checkpoint is stored in this folder:
[screenshot]
I run it with:

python search.py --reload_from 4000000 -t 202106140524

Thank you. In the PyTorch version, when I use the pre-trained model to query the real dataset, the similarity is about 0.94. But the evaluation still has problems: it shows "RuntimeWarning: Mean of empty slice" and returns {'acc': nan, 'mrr': nan, 'map': nan, 'ndcg': nan}.

```
Input Query: convert an inputstream to a string
How many results? 5
('@ Override public Matrix like ( ) { return new SparseRowMatrix ( rowSize ( ) , columnSize ( ) ) ; } \n', 0.9421381)

('public static File getFile ( String token ) { File file = null ; if ( tokenToFileMap . containsKey ( token ) ) { file = tokenToFileMap . get ( token ) ; } return file ; } \n', 0.9421381)

('public String toString ( ) { CharArrayList theKeys = keys ( ) ; StringBuilder buf = new StringBuilder ( ) ; buf . append ( '[' ) ; int maxIndex = theKeys . size ( ) - 1 ; for ( int i = 0 ; i <= maxIndex ; i ++ ) { char key = theKeys . get ( i ) ; buf . append ( String . valueOf ( key ) ) ; if ( i < maxIndex ) { buf . append ( ",-" ) ; } } buf . append ( ']' ) ; return buf . toString ( ) ; } \n', 0.9421381)

('private static void setText ( AutoCompleteTextView view , CharSequence text , boolean filter ) { try { Method method = AutoCompleteTextView . class . getMethod ( "setText" , CharSequence . class , boolean . class ) ; method . setAccessible ( true ) ; method . invoke ( view , text , filter ) ; } catch ( Exception e ) { view . setText ( text ) ; } } \n', 0.9415401)

('public void addSysproperty ( Environment . Variable sysp ) { sysProperties . addVariable ( sysp ) ; } \n', 0.93729883)
```

For the Keras version, you should set reload in config.py to 500 before running code representation and search.
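
A minimal sketch of that change, assuming the Keras config.py keeps the setting in a training-parameters dict (the exact key nesting in your copy may differ; the point is that reload selects which saved epoch, here the released epo500_*.h5 weights, gets loaded before code representation and search):

```python
# config.py (Keras version) -- illustrative sketch, not the exact file contents.
def get_config():
    conf = {
        # ... data paths, vocabulary sizes, model dimensions, etc. ...
        'training_params': {
            # Epoch of the checkpoint to reload before code representation and
            # search; 500 matches the released epo500_*.h5 weights (check the
            # semantics of other values against your copy of config.py).
            'reload': 500,
        },
    }
    return conf
```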

Thanks for your reply, I will try it out tomorrow and give you feedback.

Thank you for your answer. It's very helpful. After setting the reload in config.py to 500, the similarity is about 0.4. The evaluation results are as follows:
ACC=0.6724, MRR=0.2699894444444445, MAP=0.2699894444444445, nDCG=0.3651292111728052
They are somewhat different from the results in the paper.
One thing puzzles me: there is only one ground-truth snippet for each query in the dataset, but the codebase may contain more than one correct answer to a query. The model may return relevant results that are not the labeled ground truth. Does this affect the evaluation of the model?

@skye95git That is one of the potential threats to validity of automatic evaluation. So we also asked real developers to manually inspect the returned results; the reported MRR is calculated based on their labels.
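
For reference, the MRR reported there is the standard mean reciprocal rank over the labeled query set: for each query, take the reciprocal of the rank at which the first relevant result appears (as judged by the annotators), then average over all queries. Using the paper's FRank notation (the rank of the first hit for query q):

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{FRank}_q}$$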

Thank you! I have another problem. When I evaluated the pre-trained Keras model on your dataset, the result was
ACC=0.6724, MRR=0.2699894444444445, MAP=0.2699894444444445, nDCG=0.3651292111728052
But when I evaluate on the CosBench dataset, the result is
ACC=0.0365, MRR=0.0752, MAP=0.007.
Also, the result for each query in the CosBench QA set is different from that reported in the original paper. Do you know why? I have checked many times but still can't find the reason.

Evaluation result of the original paper:
[screenshot]

My evaluation result:
[screenshot]

The results differ by an order of magnitude.

I do not know. Maybe there is something wrong with data preprocessing?

I used the pre-trained model to evaluate, but the results are different from those in the paper.
The results in the Keras version:
ACC=0.6724, MRR=0.2699894444444445, MAP=0.2699894444444445, nDCG=0.3651292111728052

The results in the PyTorch version:
ACC=0.3727, MRR=0.3727, MAP=0.3727, nDCG=0.3727

The results in the paper:
[screenshot]

Are these results normal? The MRR gap is quite large. What adjustments do I need to make to reproduce the results in the paper?

Please note that the results presented in the paper (Table 2) were manually computed from a test set (top 50 questions from Stack Overflow) that is different from the validation set in the repository (a subset of code-comment pairs from github).

Could you tell me how much memory is needed to run repr_code.py in the PyTorch version? It runs for a while and then I get an error: RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED. It could be out of memory, or the PyTorch and CUDA versions may not match. I'm not sure.

Sorry, I didn't notice that the paper's Table 2 numbers came from a different test set. I queried the 50 questions in Table 1 and then compared my results with Table 2.

I found the reason for the cuDNN error: the PyTorch and CUDA versions didn't match.