praveengadiyaram369/Thesis-retrieval-data

Mitera retrieval dataset

This repository holds the retrieval dataset for Mitera. It is still in progress.

Getting started

Please refer to the Excel sheet below for the information needed for retrieval labeling.

Retrieval labels information
The labeling GUI is located at the URI below. Visit the Raat homepage and navigate to the relevance labeling UI by clicking the last option, "Newsfeed Search".

Raat UI

Positive Document characteristics

S No:	`Category Tag`	Category labels
1.	`I`	Innovation or Breakthrough
2.	`F`	Future products
3.	`A`	Applied Research or Implementations
4.	`P`	New way of Procurements
5.	`M`	Misc. (Technological advancements, Big data/AI patterns)
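When category tags are handled in code, the table above can be kept as a small lookup. The following is a minimal sketch; the names `CATEGORY_TAGS` and `describe_tag` are hypothetical and not part of this repository.

```python
# Category tags copied from the table above, keyed by their single-letter code.
CATEGORY_TAGS = {
    "I": "Innovation or Breakthrough",
    "F": "Future products",
    "A": "Applied Research or Implementations",
    "P": "New way of Procurements",
    "M": "Misc. (Technological advancements, Big data/AI patterns)",
}


def describe_tag(tag: str) -> str:
    """Return the category label for a tag, ignoring case and surrounding spaces."""
    key = tag.strip().upper()
    if key not in CATEGORY_TAGS:
        raise ValueError(f"Unknown category tag: {tag!r}")
    return CATEGORY_TAGS[key]
```

For example, `describe_tag("f")` returns `"Future products"`, while an unknown tag raises a `ValueError` instead of being silently accepted.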

Example news articles

Relevant document example

Irrelevant document example

Dataset generation methodology:

Relevance labeling pipeline

Below are some key points to consider before starting the labeling process:

1. First, go through the positive document characteristics above and the search keywords in the "Suchbegriff" sheet of the Excel file above.

2. Every subject selects a keyword from the "Suchbegriff" sheet and marks its Status column as "fertig" ("done") after selection. After marking the column, you can start the labeling process.

3. Keep the search query in mind while labeling: a document's relevance to the search keyword matters more than its relevance in general.

4. Subjects are advised to finish one query in one go.

Relevance labels

Label description

Perfect documents

These documents strongly match one of the positive document characteristics and contain a coherent discussion of the user-given keyword throughout the document.

Partially relevant documents

These documents contain the keyword and appear relevant, but lack innovation or novelty. However, they share information about an efficient or optimal way of doing things, which can be useful for MiTeRa. They must not be clickbait.

Irrelevant documents

These documents contain the user-given keyword, but lack innovation and a coherent discussion of the query. Examples are clickbait, advertisements, and marketing blogs, which contain many relevant keywords at the beginning of the document but are not useful for MiTeRa.

Wrong documents

These documents have nothing to do with the user query. E.g., for the query "Combat cloud", documents about cloud computing are wrong documents.
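The four labels can be represented as an enumeration so that typos in label names fail early. This is only a sketch: the string values are hypothetical (the actual values stored in the `label` column are not documented here), and treating perfect and partially relevant documents as "useful" is an interpretation of the descriptions above.

```python
from enum import Enum


class RelevanceLabel(Enum):
    """The four relevance labels described above (string values are assumed)."""

    PERFECT = "perfect"
    PARTIALLY_RELEVANT = "partially_relevant"
    IRRELEVANT = "irrelevant"
    WRONG = "wrong"


# Assumption: only these two labels mark documents that can be useful for MiTeRa.
USEFUL_LABELS = {RelevanceLabel.PERFECT, RelevanceLabel.PARTIALLY_RELEVANT}


def is_useful(label: RelevanceLabel) -> bool:
    """Return True if the label marks a document as potentially useful."""
    return label in USEFUL_LABELS
```

With this sketch, `is_useful(RelevanceLabel.IRRELEVANT)` is `False`, and `RelevanceLabel("typo")` raises a `ValueError`.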

Dataset details

The final dataset is located at dataset/retrieval_dataset.json. The columns are described below:

S No:	column name	column description
1.	page_id	Unique-id of a news article
2.	query	Original search query
3.	label	Relevance label given by the labeler
4.	text	News article text
5.	text_len	Length of the news article (token count)
6.	noun_chunks	Noun chunks extracted with spaCy
7.	mean_nc_vec	Mean vector of the noun chunks (USE)
8.	title	Title of the news article
9.	published_date	Publication date of the news article
10.	source_url	Source URL of the news article
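To check that a record carries all ten columns listed above, a small helper can compare its keys against the table. This is a minimal sketch; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and a record is assumed to be a plain dictionary.

```python
# Column names copied from the table above, in table order.
EXPECTED_COLUMNS = [
    "page_id", "query", "label", "text", "text_len",
    "noun_chunks", "mean_nc_vec", "title", "published_date", "source_url",
]


def missing_columns(record: dict) -> list:
    """Return the expected column names that are absent from one record."""
    return [name for name in EXPECTED_COLUMNS if name not in record]
```

An empty list means the record has every documented column; any other result names the columns that are missing.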

Load the dataset with pandas using the command below, where json_filepath is the path to dataset/retrieval_dataset.json:

import pandas as pd
df = pd.read_json(json_filepath, lines=True)
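The `lines=True` argument indicates that the file holds one JSON record per line. Under that assumption, the records can also be streamed with the standard library alone, for example to count labels without loading every article into memory. The helper names `iter_records` and `label_counts` are hypothetical.

```python
import json
from collections import Counter


def iter_records(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def label_counts(path):
    """Count how often each value of the `label` column occurs."""
    return Counter(record["label"] for record in iter_records(path))
```

Calling `label_counts("dataset/retrieval_dataset.json")` would then give the number of documents per relevance label.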

Support

Please send an email to sri.sai.praveen.gadiyaram@fkie.fraunhofer.de

Authors and acknowledgment

Apelt, Stefan
Hasso, Hussein
Moog, Manuel 
Aymaz, Iliass
Scheffczyk, Jan
Zehart, Sebastian

Project status

Dataset generation is still in development and is not yet finished.
