LinWeizheDragon / Retrieval-Augmented-Visual-Question-Answering

This is the official repository for Retrieval-Augmented Visual Question Answering


Question about release of infoseek data split

Maxlinn opened this issue

Hi Lin, sorry to trouble you!

Recently I have taken a special interest in the InfoSeek task. I noticed that you used a different train set (downsampled from the InfoSeek train set) and val/test sets (sampled from the InfoSeek val set), and that the InfoSeek results in PreFLMR Table 7 are reported on this train/val/test split (issue #36).

However, I could not find the InfoSeek dataset split in the M2KR Hugging Face repo. For a fair comparison on the InfoSeek task, could you please share it? And, as a quick check, are you using infoseek_eval to evaluate on your test split?

Many thanks!

Hi,

We have just uploaded the missing InfoSeek set. VinVL object detections and BLIP-2 captions are also provided for answer generation.

For your information, I misremembered a detail when answering your previous issue: the InfoSeek split does not have a validation set. This is due to several considerations. For example, InfoSeek is designed to evaluate performance on unseen entities; if we split the original validation set in two, we would leak its distribution into the validation process. Therefore, we decided to evaluate InfoSeek performance (on the downsampled M2KR test set) without cherry-picking against a validation set.

Since InfoSeek does not provide Wikipedia passages, we prepared the data on our own. We did some preprocessing to remove entries whose answers cannot be found in the associated Wikipedia passages.

We did not use the official script to postprocess the predictions; instead, we followed the description in the original paper to evaluate the answers. The major difference is that the official script applies more postprocessing, which can increase the numbers slightly, whereas ours is stricter on the model (mostly exact match plus checking numerical values).
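
For illustration, here is a minimal sketch of such a stricter scorer (case-insensitive exact match plus a numeric check with an absolute tolerance). The function name, tolerance, and other details are assumptions, not the exact evaluation code used for the paper.

    import re

    def is_correct(prediction: str, answers: list[str], numeric_tolerance: float = 0.01) -> bool:
        """Strict scoring: case-insensitive exact match, or a close numeric match."""
        pred = prediction.strip().lower()
        # Exact match against any acceptable answer string.
        if any(pred == ans.strip().lower() for ans in answers):
            return True
        # Numeric check: compare the first number in the prediction to any numeric answer.
        pred_numbers = re.findall(r"[-+]?\d*\.?\d+", pred)
        if not pred_numbers:
            return False
        pred_value = float(pred_numbers[0])
        for ans in answers:
            try:
                if abs(pred_value - float(ans)) <= numeric_tolerance:
                    return True
            except ValueError:
                continue
        return False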

Thanks for the timely response!

Thanks for sharing the InfoSeek data so quickly! I totally agree with your reasoning for not having a validation set, to avoid entity leakage.

As for Wikipedia passages, InfoSeek does provide them in its data release; the 'entity_id' (also known as 'wikidata_id') for each example in the train/val splits is given in the KB mapping file (also part of the data release).

For evaluation, it might be fairer to compare using the official tool. Could you please provide the predictions file of the model in PreFLMR paper Table 7 (I could do the calculation for you)?

Sorry to cause you trouble!

At the time this project began (June 2023), they had not released the passage texts. We matched each data entry with Wikipedia passages on our own, so there could be minor differences in the passage contents.

Unfortunately, our experiments were run on AWS machines which were shut down after the project finished. We do have a checkpoint, but it would be painful to set everything up again. I may have time to rerun the evaluation later, but right now I am too busy with my current work. Now that you have the PreFLMR model, would it be possible for you to finetune a BLIP-2 with the predictions of PreFLMR? Of course, I can provide the trained checkpoint or the evaluation script used in our project.

Thanks for sharing! Now I have a better understanding!

For evaluation, it would be very helpful if you could share the InfoSeek-finetuned PreFLMR checkpoint and the evaluation script, which would save me a lot of pain. Even better, the retrieval results file and the final answer file for the InfoSeek task would also be welcome. If that is not feasible, I could finetune blip2-t5xl with the M2KR InfoSeek training data on my own. Is the answer generated using only the top-1 retrieved passage, or is there some reranking among a few candidates?

By the way, may I ask how the passages in M2KR InfoSeek were collected? I manually inspected the M2KR InfoSeek train passages; they seem to originate from 34k unique entities chunked into 98k passages. Could you please shed more light on this? For example, where do the Wikipedia texts come from (and in particular, is it possible to recover each passage's wikidata_id, like Q8418), and how were they chunked, matched, filtered, and mixed? (Sorry to be obtuse, but the official InfoSeek knowledge base is much bigger in size and length, so I am trying to keep the comparison fair.)

I sincerely appreciate all the help!

Checkpoint and retrieval results:
https://tmp.link/f/65f87e8318ebd

It is possible that the uploaded checkpoint is the wrong one. Please retrain a model on your own in that case.

The passages are from https://huggingface.co/datasets/wiki_dpr and no further chunking/preprocessing was applied.
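
For reference, the passages can be loaded directly with the datasets library. This is only a sketch: the configuration chosen here is illustrative, and recent datasets versions may require trust_remote_code=True because wiki_dpr ships a loading script.

    from datasets import load_dataset

    # wiki_dpr: ~21M 100-word Wikipedia passages with "id", "title", and "text" fields.
    # The "no_index" configuration skips downloading the dense FAISS index.
    wiki_passages = load_dataset("wiki_dpr", "psgs_w100.nq.no_index",
                                 split="train", trust_remote_code=True)
    print(wiki_passages[0]["title"], wiki_passages[0]["text"][:100])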

We indexed them with Elasticsearch and then matched each passage against the annotated Wikipedia page title.
An example script to match the passages is shown below:

    # Assumed imports for this snippet; `search_for_a_string`, `ELASTIC_PASSWORD`, and
    # `logger` are helpers/constants defined elsewhere in our codebase.
    import os
    import re

    from easydict import EasyDict
    from elasticsearch import Elasticsearch

    def search_wiki_passage_with_entity_text(example):
        es = Elasticsearch(
            "https://localhost:9200",
            ca_certs=os.environ["ELASTIC_CA_CERTS"],
            basic_auth=("elastic", ELASTIC_PASSWORD),
            timeout=60,
        )
        example = EasyDict(example)
        query = example.entity_text
        # Retrieve up to 1000 passages whose title matches the annotated entity text.
        resp = search_for_a_string(es, query, fileds=["title"], size=1000)

        gold_doc_ids = []
        gold_doc_contents = []
        related_doc_ids = []

        all_answers = example.answers + example.answer_eval

        if resp['hits']['total']['value'] > 0:
            doc_title = resp['hits']['hits'][0]['_source']['title']

            for retrieved_doc in resp['hits']['hits']:
                # Only keep passages belonging to the best-matching Wikipedia page.
                if retrieved_doc['_source']['title'] != doc_title:
                    continue
                passage_text = retrieved_doc['_source']['text']

                # A passage is a positive (gold) item if it contains any acceptable answer string.
                found = False
                for answer in all_answers:
                    if answer.lower() in passage_text.lower():
                        found = True
                        gold_doc_ids.append(retrieved_doc['_id'])
                        gold_doc_contents.append(passage_text)
                        break

                if not found and example.wikidata_value is not None:
                    # Numerical answers: extract all float/integer values from passage_text
                    # and accept the passage if any of them is close to the Wikidata value.
                    all_numbers = re.findall(r"[-+]?\d*\.\d+|\d+", passage_text)

                    for number in all_numbers:
                        try:
                            number = float(number)
                            if abs(number - example.wikidata_value) < 0.01:
                                found = True
                                gold_doc_ids.append(retrieved_doc['_id'])
                                gold_doc_contents.append(passage_text)
                                break
                        except ValueError:
                            continue

                related_doc_ids.append(retrieved_doc['_id'])

            example.related_item_ids = related_doc_ids
            example.pos_item_ids = gold_doc_ids
            example.pos_item_contents = gold_doc_contents
        else:
            example.related_item_ids = []
            example.pos_item_ids = []
            example.pos_item_contents = []
            logger.error(f"Cannot find a passage for {example.question_id}: {query}")

        return example
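
If it helps, a function like this can be mapped over the dataset with the datasets library. The sketch below is purely illustrative: the jsonl path is a placeholder, and it assumes each example already carries the fields the function expects (entity_text, answers, answer_eval, wikidata_value).

    from datasets import load_dataset

    # Hypothetical driver code (not the exact script we used).
    infoseek_examples = load_dataset("json", data_files="infoseek_train.jsonl", split="train")
    matched = infoseek_examples.map(search_wiki_passage_with_entity_text, num_proc=8)
    # Drop examples for which no positive passage could be found.
    matched = matched.filter(lambda x: len(x["pos_item_ids"]) > 0)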
                

I sincerely appreciate all the help you have offered!

Now I am much closer, and I have two last tiny questions in mind: one is about the evaluation script; the other is how the final answer is produced. Is it generated using only the top-1 passage, or are k candidates generated from the top-k passages (maybe k=5 or some other value?) and then selected by the average probability of the answer tokens?

It is generated using the top-5 documents, and the most confident answer (by log probability of the prediction) is selected.
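
For illustration, here is a minimal sketch of this selection strategy with a BLIP-2 model from transformers. The prompt template, max_new_tokens, and the use of the mean token log-probability are assumptions, not the exact generation code used in the project.

    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
    model = Blip2ForConditionalGeneration.from_pretrained(
        "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
    ).to("cuda")

    def answer_with_top_k(image: Image.Image, question: str, passages: list[str]) -> str:
        """Generate one answer per retrieved passage and keep the most confident one."""
        best_answer, best_score = "", float("-inf")
        for passage in passages:  # e.g. the top-5 retrieved passages
            # Assumed prompt template; the project may format the context differently.
            prompt = f"Context: {passage}\nQuestion: {question}\nShort answer:"
            inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
            out = model.generate(
                **inputs, max_new_tokens=10, return_dict_in_generate=True, output_scores=True
            )
            # Mean log-probability of the generated tokens as the confidence score.
            token_logprobs = model.compute_transition_scores(
                out.sequences, out.scores, normalize_logits=True
            )
            score = token_logprobs[0].mean().item()
            answer = processor.batch_decode(out.sequences, skip_special_tokens=True)[0].strip()
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer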

Thanks! All my problems have been addressed!

Sorry to reopen this for a bit. I just happened to find that there are 676,441 unique examples when combining all training parquets of M2KR InfoSeek, whereas the paper says 100k. May I ask which examples were used for training?

We used 100k for training since a great number of samples/entities overlap in the full 676k data. We released the full set for your convenience.
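
If you want to rebuild a comparable subset, something along these lines would work. This is purely a hypothetical sketch: the actual subsampling procedure, seed, and column names were not released, and the entity_text column and parquet path here are assumptions.

    import random
    from collections import defaultdict

    from datasets import load_dataset

    # Placeholder path: point this at the released M2KR InfoSeek training parquets.
    full_train = load_dataset("parquet", data_files="m2kr_infoseek_train/*.parquet", split="train")

    # Group example indices by entity (column name assumed) so the 100k subsample
    # is spread across entities rather than dominated by frequent ones.
    by_entity = defaultdict(list)
    for idx, entity in enumerate(full_train["entity_text"]):
        by_entity[entity].append(idx)

    random.seed(0)
    per_entity_budget = max(1, 100_000 // len(by_entity))
    keep = []
    for indices in by_entity.values():
        random.shuffle(indices)
        keep.extend(indices[:per_entity_budget])
    subsample = full_train.select(keep[:100_000])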

Thanks for the clarification!