open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Home Page: https://opencompass.org.cn/

[Bug] about icl_em_evaluator.py

TousenKaname opened this issue

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

'CUDA available': False,                                                                                     
 'GCC': 'gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)',                                                        
 'MMEngine': '0.10.2',                                                                                        
 'OpenCV': '4.9.0',                                                                                           
 'PyTorch': '2.0.0',     
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '                                                                                                                                            
                              '2023.1-Product Build 20230303 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v2.7.3 (Git Hash '                                        
                              '6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n'                                   
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'                                                                                                                                                     
                              '  - LAPACK is enabled (usually provided by '                                                                                                                                                 
                              'MKL)\n'                                                                                                                                                                                      
                              '  - NNPACK is enabled\n'   
                              '  - CPU capability usage: AVX2\n'                                              
                              '  - Build settings: BLAS_INFO=mkl, '                                           
                              'BUILD_TYPE=Release, CUDA_VERSION=11.7, '                                                                                                                                                     
                              'CUDNN_VERSION=8.5.0, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '                          
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -Wno-deprecated '                                      
                              '-fvisibility-inlines-hidden -DUSE_PTHREADPOOL '
                              '-DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER '                                
                              '-DUSE_FBGEMM -DUSE_QNNPACK '                                                                                                                                                                 
                              '-DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK '
                              '-DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC '                                  
                              '-Wall -Wextra -Werror=return-type '                                                                                                                                                          
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '                                           
                              '-Wno-unknown-pragmas -Wunused-local-typedefs '                                 
                              '-Wno-unused-parameter -Wno-unused-function '                                   
                              '-Wno-unused-result -Wno-strict-overflow '                                      
                              '-Wno-strict-aliasing '
                              '-Wno-error=deprecated-declarations '                                           
                              '-Wno-stringop-overflow -Wno-psabi '                                            
                              '-Wno-error=pedantic -Wno-error=redundant-decls '                                                                                                                                             
                              '-Wno-error=old-style-cast '                                                                                                                                                                  
                              '-fdiagnostics-color=always -faligned-new '                                                                                                                                                   
                              '-Wno-unused-but-set-variable '                                                 
                              '-Wno-maybe-uninitialized -fno-math-errno '                                     
                              '-fno-trapping-math -Werror=format '                                            
                              '-Werror=cast-function-type '                                                   
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '                                                                                                                                                         
                              'PERF_WITH_AVX512=1, '
                              'TORCH_DISABLE_GPU_ASSERTS=ON, '                                                
                              'TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, '                              
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '                                                                                                                                                       
                              'USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, '                                     
                              'USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, '                                     
                              'USE_OPENMP=ON, USE_ROCM=OFF, \n',                                              
 'Python': '3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]',                                              
 'TorchVision': '0.15.0',
 'numpy_random_seed': 2147483648,                    
 'opencompass': '0.2.1+40a2441',             
 'sys.platform': 'linux'

Reproduces the problem - code/configuration sample

chatmed_eval_cfg = dict(
    evaluator=dict(type=EMEvaluator))

chatmed_reader_cfg = dict(
    input_columns=['query'], 
    output_column='response'
    )

chatmed_infer_cfg = dict(
    # Prompt generation config
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(role="HUMAN", prompt='{query}'),
                dict(role="BOT", prompt="{response}"),
            ],
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

chatmed_datasets = [
    dict(
        type=ChatMedDataset,
        path='/mnt/petrelfs/wangguoan/git_repo/opencompass/data/ChatMed_Consult_Dataset/test.json',
        reader_cfg=chatmed_reader_cfg,
        infer_cfg=chatmed_infer_cfg,
        eval_cfg=chatmed_eval_cfg
        )
]

# Custom dataset loader (ChatMedDataset)
from datasets import Dataset
from .base import BaseDataset
import jsonlines

def convert_jsonl_to_json(jsonl_file_path):
    json_data = []
    with jsonlines.open(jsonl_file_path) as reader:
        for line in reader:
            json_data.append(line)
    return json_data

class ChatMedDataset(BaseDataset):
    @staticmethod
    def load(path: str) -> Dataset:
        results = convert_jsonl_to_json(path)
        dataset = Dataset.from_list(results)
        return dataset
{"query": "这段时间去上厕所本来想小便的可是每次都会拉大便", "response": "这可能是因为你的饮食习惯或者消化系统的问题导致的。建议你试着调整一下饮食,增加膳食纤维的摄入量,多喝水,避免过度依赖泻药。同时,如果问题持续存在,建议去医院检查一下,排除肠道疾病等可能性。"}
{"query": "医生呀!我刚被查出得了白癜风,不知道初期治疗需要多少钱?", "response": "白癜风的治疗费用因个体差异和治疗方案的不同而有所差异。初期治疗主要以口服药物和外用药物为主,费用相对较低,一般几百元左右。但是,如果采用激光治疗、光疗等高端治疗方法,费用会更高。建议您咨询专业医生,根据自己的情况进行治疗方案的选择,同时了解相关的费用情况。"}

Reproduces the problem - command or script

    def score(self, predictions, references):
        if len(predictions) != len(references):
            return {
                'error': 'predictions and references have different '
                'length'
            }
        predictions = [
            general_postprocess(prediction) for prediction in predictions
        ]
        processed_answers = [[general_postprocess(j) for j in i]
                             for i in references]

        cnt = 0
        details = []
        for pred, ans, origin_ans in zip(predictions, processed_answers,
                                         references):
            answers = list(set(ans + origin_ans))
            detail = {'pred': pred, 'answer': answers}
            if pred in ans or pred in origin_ans:
                cnt += 1
                detail['correct'] = True
            else:
                detail['correct'] = False
            details.append(detail)

        score = cnt / len(predictions) * 100

        return {'score': score, 'details': details}

I don't understand why `answers = list(set(ans + origin_ans))` needs to concatenate `ans` and `origin_ans`. Below is what I printed.

ans = ['这', '可', '能', '是', '因', '为', '你', '的', '饮', '食', '习', '惯', '或', '者', '消', '化', '系', '统', '的', '问', '题', '导', '致', '的', '', '建', '议', '你', '试', '着', '调', '整', '一', '下', '饮', '食', '', '增', '加', '膳', '食', '纤', '维', '的', '摄', '入', '量', '', '多', '喝', '水', '', '避', '免', '过', '度', '依', '赖', '泻', '药', '', '同', '时', '', '如', '果', '问', '题', '持', '续', '存', '在', '', '建', '议', '去', '医', '院', '检', '查', '一', '下', '', '排', '除', '肠', '道', '疾', '病', '等', '可', '能', '性', '']
origin_ans = 这可能是因为你的饮食习惯或者消化系统的问题导致的。建议你试着调整一下饮食,增加膳食纤维的摄入量,多喝水,避免过度依赖泻药。同时,如果问题持续存在,建议去医院检查一下,排除肠道疾病等可能性。
pred = 您好根据您提供的信息您可能存在便秘的情况便秘是指排便次数减少粪便量减少粪便干结排便费力等症状以下是一些可能导致便秘的原因和建议
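
The printed values suggest that each reference reached `score` as a plain string rather than a list of strings, so the nested comprehension iterated over characters. A minimal sketch of that behaviour follows. It assumes a hypothetical `stand_in_postprocess` in place of OpenCompass's `general_postprocess` (whose exact rules are not shown here) and uses made-up English references:

```python
import string


def stand_in_postprocess(text: str) -> str:
    """Hypothetical stand-in for general_postprocess (assumption):
    drops ASCII punctuation and strips surrounding whitespace."""
    return "".join(ch for ch in text if ch not in string.punctuation).strip()


# Each reference is a bare str, as in the printed output above (made-up data).
references = ["see a doctor.", "drink water."]

# Same shape as the comprehension in score(): the inner loop walks over
# the characters of each string, not over a list of candidate answers.
processed_answers = [[stand_in_postprocess(j) for j in i] for i in references]
print(processed_answers[0])  # a list of single characters and empty strings

# list + str is what raises the TypeError.
try:
    processed_answers[0] + references[0]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Under these assumptions, punctuation and spaces turn into empty strings, which matches the `''` entries in the printed `ans`.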

Reproduces the problem - error message

Traceback (most recent call last):
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/tasks/openicl_eval.py", line 361, in <module>
    inferencer.run()
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/tasks/openicl_eval.py", line 107, in run
    self._score()
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/tasks/openicl_eval.py", line 216, in _score
    result = icl_evaluator.score(**preds)
  File "/mnt/petrelfs/wangguoan/git_repo/opencompass/opencompass/openicl/icl_evaluator/icl_em_evaluator.py", line 31, in score
    answers = list(set(ans + origin_ans))
TypeError: can only concatenate list (not "str") to list 
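
One possible workaround, sketched here as an assumption and not as the project's official fix, is to wrap bare-string references in a one-element list before exact-match scoring. The helper names `normalize_references` and `exact_match_score` are hypothetical and do not exist in OpenCompass:

```python
from typing import List, Sequence, Union


def normalize_references(
    references: Sequence[Union[str, Sequence[str]]]
) -> List[List[str]]:
    """Wrap each bare-string reference in a list; copy list references as-is."""
    return [[ref] if isinstance(ref, str) else list(ref) for ref in references]


def exact_match_score(predictions: Sequence[str],
                      references: Sequence[Union[str, Sequence[str]]]) -> float:
    """Percentage of predictions that exactly equal one of their references."""
    refs = normalize_references(references)
    correct = sum(pred in answers for pred, answers in zip(predictions, refs))
    return correct / len(predictions) * 100


# Mixed input: one bare string, one list of acceptable answers (made-up data).
print(exact_match_score(["a", "x"], ["a", ["b", "c"]]))
```

The same wrapping could instead be done in the dataset loader, so that the `response` column already holds a list of strings when it reaches the evaluator.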

Other information

No response