buaaliuming / Awesome-Resources-for-Scholarly-Big-Data

Tools, datasets, corpora, and venue challenges for scholarly big data: pick up scattered pearls


Awesome Resources for Scholarly Big Data

Pick up scattered pearls and light up the fire for scholarly big data

News

The death of top-tier totems: ICLR, CVPR, Xuetao Cao

Tools

Bibliographical Extraction

ParsCit:

ParsCit is a utility for extracting citations from research papers based on Conditional Random Fields and heuristic regularization.

GROBID: a machine learning library for extracting, parsing, and restructuring raw documents such as PDF into structured TEI-encoded documents, with a particular focus on technical and scientific publications. http://cloud.science-miner.com/grobid/

CERMINE: Content ExtRactor and MINEr

FreeCite is an open-source application that parses document citations into fielded data. You can use it as a web application or a service

Legend: a single-threaded extractor written in Perl. External libraries such as SVMHeaderParse, ParsCit, and PDFBox are invoked.

Figure Extraction from PDF Articles

Extracting Figures from Research Papers, Mining Figures from Research Papers

Figure extraction using deep neural nets

Source code for methods described in Lu, Xiaonan, et al. "Automated analysis of images in documents for intelligent document search." International Journal on Document Analysis and Recognition (IJDAR) 12.2 (2009): 65-81.

Table Extraction from PDF Articles

pdf2table: extraction of table information from PDF files, based on pdf2html. http://ieg.ifs.tuwien.ac.at/projects/pdf2table/

Many report types – financial reports, analyst reports, scientific reports – exist as PDFs (i.e. unstructured data), and they often contain valuable data in tables. The Zanran ‘Scaffolder’ software helps you get those tables out automatically – into Excel, XML or HTML.

TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet; it contains 417K high-quality labeled tables.

Tabula is a tool for liberating data tables locked inside PDF files.

Equation or Formula Extraction

ChemxSeer Tagger provides a chemical entity extractor that identifies mentions of chemical formulae and names in free text.

A Java jar to extract algorithms from PDFs.

Formula Formation

handcalcs is a library to render Python calculation code automatically in LaTeX, in a manner that mimics how one might format a calculation written with a pencil: write the symbolic formula, followed by numeric substitutions, and then the result.
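The three-line layout it produces can be sketched in plain Python (a toy illustration of the rendering order, not handcalcs' actual API; `render_calc` is a hypothetical helper):

```python
def render_calc(name, symbolic, values):
    """Return the three 'pencil style' lines for a calculation:
    symbolic formula, numeric substitution, then the result.
    (Toy sketch: symbols must not be substrings of one another.)"""
    substituted = symbolic
    for sym, val in values.items():
        substituted = substituted.replace(sym, str(val))
    result = eval(substituted)  # simple arithmetic expressions only
    return [f"{name} = {symbolic}",
            f"{name} = {substituted}",
            f"{name} = {result}"]

# Midspan bending moment of a simply supported beam under uniform load:
lines = render_calc("M", "w * L**2 / 8", {"w": 12, "L": 3})
# lines -> ["M = w * L**2 / 8", "M = 12 * 3**2 / 8", "M = 13.5"]
```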

Export images and PDFs to LaTeX, DOCX, Overleaf, Markdown, Excel, ChemDraw and more, with AI-powered document conversion technology.

Keywords or Keyphrase Extraction

Extracting Keywords or keyphrases. KEA is an algorithm for extracting keyphrases from text documents. It can be either used for free indexing or for indexing with a controlled vocabulary. KEA is implemented in Java and is platform independent.

TextRank: https://github.com/davidadamojr/TextRank
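The graph-based ranking idea behind TextRank can be sketched in a few lines (a minimal illustration with arbitrary window size and iteration count, not the linked implementation):

```python
from collections import defaultdict

def textrank_keywords(words, window=2, d=0.85, iters=50, topn=3):
    """Sketch of TextRank keyword extraction: build a co-occurrence
    graph over a sliding window, then run a PageRank-style power
    iteration and return the highest-scoring words."""
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(graph[u])
                                      for u in graph[w])
                 for w in graph}
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:topn]]

keywords = textrank_keywords("graph based ranking graph ranking model".split())
```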

Auto Summarization

Extreme Summarization of Scientific Documents

Smart Writing

iCite is a tool to access a dashboard of bibliometrics for papers associated with a portfolio.

Technical Checking of manuscripts is a core activity in scholarly publishing — around 5.4m manuscripts are submitted to scholarly journals per year, with just over half of them approved for publication. All these manuscripts are subject to different levels of Technical Checks to ensure that they meet journal guidelines, and many undergo multiple rounds of checks. Correct disclosures (such as conflicts of interest or ethics statements), sections within word limits, correct metadata, correct use of citations and references, acceptable language quality, and many more, are all requirements that occupy editorial teams and authors, and increasingly take much needed time away from other work.

Smart citations for better research

Metadata Extraction

Multi-Entity Extraction Framework for Academic Documents (with default extraction tools)

Science Parse parses scientific papers (in PDF form) and returns them in structured form. It supports these fields: Title; Authors; Abstract; Sections (each with heading and body text); Bibliography (each entry with Title, Authors, Venue, Year); and Mentions, i.e., places in the paper where bibliography entries are mentioned. Output is in JSON format.
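Consuming that JSON output might look like the following sketch; the field names in the toy record are assumptions for illustration, not Science Parse's documented schema:

```python
import json

# Toy record shaped like the description above; field names are assumed.
record = json.loads("""
{
  "title": "Example Paper",
  "authors": ["A. Author", "B. Author"],
  "abstractText": "A short abstract.",
  "sections": [{"heading": "Introduction", "text": "..."},
               {"heading": "Methods", "text": "..."}],
  "references": [{"title": "Cited Work", "authors": ["C. Cited"],
                  "venue": "IJDAR", "year": 2009}]
}
""")

headings = [s["heading"] for s in record["sections"]]
cited_years = [r["year"] for r in record["references"]]
```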

A machine learning (ML) based method to match paper entities between CiteSeerX and other digital libraries, including but not limited to IEEE Xplore (IEEE hereafter), DBLP, and Web of Science (WoS hereafter). Like most ML-based methods, data preprocessing takes substantial effort. The purpose of this codebase is to centralize working programs that accomplish different tasks so they can be reused by others who take over the corresponding roles.

Plain Text Extraction

Extract Unicode text from PDF files.

Apache POI - the Java API for Microsoft Documents

You can read and write MS Excel, PPT and Word files using the Java API. The Apache POI project's mission is to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2).

Research Papers Classification

Classify research papers into their subject areas. By splitting abstracts into words and converting each word into an n-dimensional word embedding, the text data is projected into a vector space with time-steps.
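As a point of reference, a minimal baseline for the same task can skip embeddings entirely; the sketch below uses hypothetical labels and bag-of-words nearest-centroid matching:

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words vector as a Counter over lowercased tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(abstract, labeled):
    """labeled: {subject_area: [abstract, ...]}.
    Sum each area's vectors into a centroid, return the nearest one."""
    centroids = {}
    for area, texts in labeled.items():
        c = Counter()
        for t in texts:
            c.update(vectorize(t))
        centroids[area] = c
    return max(centroids, key=lambda a: cosine(vectorize(abstract), centroids[a]))
```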

Paper Generation

SCIgen is a program that generates random Computer Science research papers, including graphs, figures, and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximize amusement, rather than coherence.

Produce your own math paper, full of research-level, professionally formatted nonsense! Just enter your name and those of up to 3 "co-authors".

Academic Article Evaluation

Rating articles based on LaTeX source files and academic metadata

Automatic PeerReview: Predict the accept/reject decisions in top-tier venues

A classifier to predict whether a paper should be accepted or rejected based on the visual appearance of the paper.

The mission of Papers with Code is to create a free and open resource with Machine Learning papers, code and evaluation tables. We believe this is best done together with the community, supported by NLP and ML. All content on this website is openly licensed under CC-BY-SA (same as Wikipedia) and everyone can contribute - look for the "Edit" buttons!

Papers Without Code: where unreproducible papers come to live. The goal of this is to save the time and effort of researchers who try to reproduce the results of a paper that is unreproducible. It could either be due to the paper not having enough details or the method straight up not working. In either case, authors will be given the opportunity to respond. The hope is this saves people time and disincentivizes unreproducible papers.

Paper-Reviewer Assignments

The goal of this system is to help conference chairs with the task of assigning papers to reviewers.

MyReview(unavailable currently/20181202)

75% of academic journal editors say that finding reviewers and getting them to accept review invitations is the hardest part of their job. We want to change that to help get peer reviewed research to the world.

Scholar Influence Analysis

This ranking of top computer science schools is designed to identify institutions and faculty actively engaged in research across a number of areas of computer science. http://csrankings.org/#/index?all

Scholar Gender Prediction

Research Trends Visualization

Analyse the evolution of research topics in a discipline based on co-word network analysis

Visualizing Patterns and Trends in Scientific Literature

Digital Library

Dataset Search

Making it easier to discover datasets

https://www.blog.google/products/search/making-it-easier-discover-datasets/

Ontology parsing libraries

A Python module for easy access to the main medical terminologies

MeSH ontology

Grammar Checker

Using language models trained on millions of journal articles, Writefull corrects grammar, vocabulary, punctuation, and more - aimed specifically at academic writing.

Text Annotation

Research Articles in Simplified HTML (RASH) Framework includes a markup language defined as a subset of HTML+RDF for writing scientific articles, and related tools to convert it into different formats, to extract data from it, etc.

a web-based tool for text annotation

Annotator for Chinese Text Corpus (中文标注工具, Chinese annotation tool)

GATE includes components for diverse language processing tasks, e.g. parsers, morphology, tagging, Information Retrieval tools, Information Extraction components for various languages, and many others. GATE Developer and Embedded are supplied with an Information Extraction system (ANNIE) which has been adapted and evaluated very widely (numerous industrial systems, research systems evaluated in MUC, TREC, ACE, DUC, Pascal, NTCIR, etc.). ANNIE is often used to create RDF or OWL (metadata) for unstructured content.

Biomedical NLP Packages

This software identifies concepts in medical reports on patients, per the i2b2 shared task: https://www.i2b2.org/NLP/Relations/.

Awesome list of software to do research in medical imaging

The Kaplan Meier plotter is capable of assessing the effect of 54k genes (mRNA, miRNA, protein) on survival in 21 cancer types including breast (n=7,830), ovarian (n=2,190), lung (n=3,452), and gastric (n=1,440) cancer. Sources for the databases include GEO, EGA, and TCGA. The primary purpose of the tool is meta-analysis based discovery and validation of survival biomarkers.

Gene analysis

AlphaFold DB provides open access to protein structure predictions for the human proteome and 20 other key organisms to accelerate scientific research. AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment.

Academic Platforms

peerxiv: a new approach to academic paper review

DBLP The dblp computer science bibliography provides open bibliographic information on major computer science journals and proceedings. Originally created at the University of Trier in 1993, dblp is now operated and further developed by Schloss Dagstuhl.

CiteSeer

AMiner

ResearchGate

onAcademic

SCI-HUB (unavailable currently/20190110)

CiteULike

Semantic Scholar

Google Scholar

Arxiv arXiv is a free distribution service and an open-access archive for 1,737,208 scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Materials on this site are not peer-reviewed by arXiv.

BioRxiv

AfricArxiv

OpenReview

PubPeer The PubPeer database contains all articles. Search results return articles with comments.

Retraction Watch Tracking retractions as a window into the scientific process

CORE

ACADEMIA

FreePatentsonline

SCHOLAT (学者网)

Artigo Our short-term goal is to provide a venue where scientific papers can be searched, read, discussed, and annotated by anyone, be it the world's best expert or an undergraduate student.

Datasets, KGs, and Corpora

The Computer Science Ontology (CSO) is a large-scale ontology of research areas that was automatically generated using the Klink-2 algorithm on the Rexplore dataset, which consists of about 16 million publications, mainly in the field of Computer Science. The Klink-2 algorithm combines semantic technologies, machine learning, and knowledge from external sources to automatically generate a fully populated ontology of research areas. Some relationships were also revised manually by experts during the preparation of two ontology-assisted surveys in the field of Semantic Web and Software Architecture. The main root of CSO is Computer Science; however, the ontology also includes a few secondary roots, such as Linguistics, Geometry, Semantics, and so on.

Academia/Industry DynAmics (AIDA) Knowledge Graph, which describes 14M publications and 8M patents according to the research topics drawn from the Computer Science Ontology. 4M publications and 5M patents are further characterized according to the type of the author's affiliations (academy, industry, or collaborative) and 66 industrial sectors (e.g., automotive, financial, energy, electronics) organized in a two-level taxonomy.

AceKG describes 114.30 million academic entities based on a consistent ontology, including 61,704,089 papers, 52,498,428 authors, 50,233 research fields, 19,843 academic institutes, 22,744 journals, 1,278 conferences and 3 special affiliations.

The Artificial Intelligence Knowledge Graph (AI-KG) is a large-scale automatically generated knowledge graph that describes 857,658 research entities. AI-KG includes 14M RDF triples and 1.2M statements extracted from 333K research publications in the field of AI, and describes 5 types of entities (e.g., tasks, methods, metrics, materials, others) linked by 27 relations. It was designed to support a large variety of intelligent services for analyzing and making sense of research dynamics, supporting researchers in their daily job, and informing decisions of funding bodies and research policy makers.

This is the largest publicly-available contextual citation graph. The full text alone is the largest structured academic text corpus to date. The S2ORC dataset is a citation graph of 81.1M academic publications and 380.5M citation edges. Abstracts are available for 73.4M papers. Full text and citation contexts are available for 8.1M papers. Citation contexts are linked to their corresponding paper in the graph.

PubMed comprises more than 28 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

A Linked Open Data offering that aggregates data sources from Springer Nature and key partners from the scholarly domain. The Linked Open Data platform collates information from across the research landscape, for example funders, research projects, conferences, affiliations and publications.

The dataset is available on http://ai.tencent.com/upload/PapersUploads/article_commenting.tgz

https://arxiv.org/pdf/1805.03668.pdf

PeerRead (code and dataset to predict whether your paper will be accepted by top-tier venues)

PeerRead is a dataset of scientific peer reviews available to help researchers study this important artifact. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.

TableBank is a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet; it contains 417K high-quality labeled tables.

The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between publications, as well as authors, institutions, journal and conference "venues," and fields of study.

This data set is generated by linking two large academic graphs: Microsoft Academic Graph (MAG) and AMiner

CiteSeerx data and metadata are available for others to use. Data available includes CiteSeerx metadata, databases, data sets of pdf files and text of pdf files.

A Digital Archive of Research Papers in Computational Linguistics. The ACL Anthology currently hosts over 44,000 papers on the study of computational linguistics and natural language processing.

ACL Anthology Network. Here we have collected information regarding all of the papers included in the many ACL venues. From those papers, we have created several networks, including paper citation, author citation, and author collaboration. http://tangra.cs.yale.edu/newaan/

These research papers (CVPR, ICCV, WACV, etc.) are the Open Access versions, provided by the Computer Vision Foundation. Except for the watermark, they are identical to the accepted versions; the final published version of the proceedings is available on IEEE Xplore.

Citation networks (for community detection):

nodes represent papers, edges represent citations
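In code, such a network is just a directed graph; a toy sketch with made-up paper ids:

```python
from collections import defaultdict

# Toy citation graph: each paper id is a node, and each (src, dst)
# edge means "src cites dst".
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3")]

cited_by = defaultdict(list)
for src, dst in edges:
    cited_by[dst].append(src)

# In-degree = citation count; p3 is cited by both p1 and p2.
citation_counts = {p: len(srcs) for p, srcs in cited_by.items()}
```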

Collaboration networks from arXiv: Astro Physics, Condensed Matter, General Relativity, High Energy Physics, and High Energy Physics Theory.

This classification dataset contains 380 scientific publications from AAN manually classified into three research areas ("Machine Translation", "Dependency Parsing" and "Summarization").

This classification dataset contains 383 scientific publications from AAN manually classified into 31 research areas using session information. The session information was compiled using the session information from ACL, COLING and EMNLP.

Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. https://www.cc.nih.gov/drd/summers.html

This dataset includes a database of regulation relationships among genes and corresponding textual mentions of pairs of genes in PubMed article abstracts.

Collected proceedings of the Chinese computational linguistics conferences over the years (中文计算机语言学会历年论文集)

Venues and Challenges

Scientific documents such as research papers, patents, books, or technical reports are one of the most valuable resources of human knowledge. At the AAAI-22 Workshop on Scientific Document Understanding (SDU@AAAI-22), we aim to gather insights into the recent advances and remaining challenges on scientific document understanding. Researchers from related fields are invited to submit papers on the recent advances, resources, tools, and upcoming challenges for SDU. In addition to that, we propose a shared task on one of the challenging SDU tasks, i.e., acronym extraction and disambiguation in multiple languages text.

SIGIR 2021- PatentSemTech

PatentSemTech aims to establish a long-term collaboration and a two-way communication channel between the IP industry and academia from relevant fields such as natural-language processing (NLP), text and data mining (TDM) and semantic technologies (ST) in order to explore and transfer new knowledge, methods and technologies for the benefit of industrial applications as well as support research in applied sciences for the IP and neighbouring domains.

‘WhoIsWho’ is the world’s largest manually-labeled name disambiguation dataset (https://www.aminer.cn/whoiswho) and benchmark. It has two subtasks. In the Name Disambiguation from Scratch subtask, participants are given a set of papers whose authors share the same name and are asked to return clusters of papers grouped by author. In the Incremental Name Disambiguation subtask, participants are given a set of new papers and the paper lists of a group of existing authors already in the system, and must assign the new papers to the correct existing authors. (https://biendata.xyz/competition/who-is-who2021/)

In paper citation networks, several types of adversarial attacks may exist. For example, papers on preprint sites such as arXiv are not peer-reviewed, so many low-quality citations exist. Another type is coercive citation: in 2019, Nature reported that the well-known publisher Elsevier found that hundreds of researchers had manipulated the peer-review process to inflate the citation counts of their own papers. Such attacks on citation networks not only erode public trust in science and technology, but also undermine efforts to analyze scholarly data quantitatively. We therefore organized this competition to study how to attack and defend academic graph data.

KDD 2018: Workshop BigScholar 2018

Given any upcoming top conferences such as KDD, SIGIR, and ICML in 2016, rank the importance of institutions based on predicting how many of their papers will be accepted.

Determine whether an author has written a given paper

early detection of breast cancer from X-ray images of the breast

identifying pulmonary embolisms from three-dimensional computed tomography data

The goal of the Particle Physics task is to learn a classification rule that differentiates between two types of particles generated in high-energy collider experiments. The goal of the Protein Homology Prediction task is to predict which proteins are homologous to a native sequence.

The goal of Citation Prediction is to predict changes in the number of citations to individual papers over time. The goal of the Download Estimation task is to estimate the number of downloads that a paper receives in its first two months on the arXiv.

This year the competition included two tasks that involved data mining in molecular biology domains. Task 1: Information Extraction from Biomedical Articles;Task 2: Yeast Gene Regulation Prediction. The first task focused on constructing models that can assist genome annotators by automatically extracting information from scientific articles. The second task focused on learning models that characterize the behavior of individual genes in a hidden experimental setting. Both are described in more detail on the Tasks page.

KDD Cup 2001 was focused on data from genomics and drug design. Sufficient (yet concise) information was provided so that detailed domain knowledge was not a requirement for entry. A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.

Structured information can be extracted at different levels of granularity. Previous and ongoing work has focused on bibliographic information (segmentation and linking of referenced literature, Wick et al., 2013), keyword extraction and categorization (e.g., what are tasks, materials and processes central to a publication, (Augenstein et al., 2017)), and cataloguing research findings. Scientific discoveries can often be represented as pairwise relationships, e.g., protein-protein (Mallory et al., 2016), drug-drug (Segura-Bedmar et al., 2013), and chemical-disease (Li et al., 2016) interactions, or as more complicated networks such as action graphs describing scientific procedures (e.g., synthesis recipes in material sciences, (Mysore et al., 2017)). Information extracted with such methods can be enriched with time-stamps, and other meta-information, such as indicators of uncertainty or limitations of the discovered facts (Zhou et al., 2015). While various workshops have focused separately on several aspects -- extraction of information from scientific articles, building and using knowledge graphs, the analysis of bibliographical information, graph algorithms for text analysis -- the proposed workshop focuses on processing scientific articles and creating structured repositories such as knowledge graphs for finding new information and making scientific discoveries. The aim of this workshop is to identify the necessary representations for facilitating automated reasoning over scientific information, and to bring together experts in natural language processing and information extraction with scientists from other domains (e.g. material sciences, biomedical research) who want to leverage the vast amount of information stored in scientific publications.

Task 8: MeasEval: Counts and Measurements

Counts and measurements are an important part of scientific discourse. It is relatively easy to find measurements in text, but a bare measurement like "17 mg" is not informative. However, relatively little attention has been given to parsing and extracting these important semantic relations. This is challenging because the way scientists write can be ambiguous and inconsistent, and the location of this information relative to the measurement can vary greatly.

MeasEval is a new entity and semantic relation extraction task focused on finding counts and measurements, attributes of these quantities, and additional information including measured entities, properties, and measurement contexts.
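A naive number-plus-unit regex baseline (the unit list below is an arbitrary illustration) shows why the task is harder than it looks: it finds bare spans like "17 mg" but none of the attributes, measured entities, or contexts MeasEval asks for:

```python
import re

# A number followed by a short unit token; units chosen for illustration only.
MEASURE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|kg|g|ml|mm|cm|m|s|h)\b")

def find_measurements(text):
    """Return (value, unit) pairs for every naive match in the text."""
    return [(m.group(1), m.group(2)) for m in MEASURE.finditer(text)]
```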

Task 9: Statement Verification and Evidence Finding with Tables

Task 10: Source-Free Domain Adaptation for Semantic Processing

Data sharing restrictions are common in NLP datasets. For example, Twitter policies do not allow sharing of tweet text, though tweet IDs may be shared. The situation is even more common in clinical NLP, where patient health information must be protected, and annotations over health text, when released at all, often require the signing of complex data use agreements. The SemEval-2021 Task 10 framework asks participants to develop semantic annotation systems in the face of data sharing constraints. A participant’s goal is to develop an accurate system for a target domain when annotations exist for a related domain but cannot be distributed. Instead of annotated training data, participants are given a model trained on the annotations. Then, given unlabeled target domain data, they are asked to make predictions. We apply this framework to two tasks: negation detection and time expression recognition.

Task 11: NLPContributionGraph

The Open Research Knowledge Graph (ORKG) is posited as a solution to the problem of keeping track of research progress without the cognitive overload that reading dozens of full papers imposes. It aims to build a comprehensive knowledge graph that publishes the research contributions of scholarly publications per paper, where the contributions are interconnected via the graph even across papers. With the NLPContributionGraph Shared Task, we have formalized the building of such a scholarly contributions-focused graph over NLP scholarly articles as an automated task. The structured contribution annotations are provided as:

Contribution sentences: a set of sentences about the contribution in the article; Scientific terms and relations: a set of scientific terms and relational cue phrases extracted from the contribution sentences; and Triples: semantic statements that pair scientific terms with a relation, modeled toward subject-predicate-object RDF statements for KG building. The Triples are organized under three (mandatory) or more information units (viz., ResearchProblem, Approach, Model, Code, Dataset, ExperimentalSetup, Hyperparameters, Baselines, Results, Tasks, Experiments, and AblationAnalysis).
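The annotation shape described above might be modeled as follows; every name and value in this record is illustrative, not taken from the shared task data:

```python
# Toy contribution record: (subject, predicate, object) triples grouped
# under information units, as described for NLPContributionGraph.
contribution = {
    "ResearchProblem": [
        ("this paper", "addresses", "named entity recognition"),
    ],
    "Model": [
        ("model", "uses", "BiLSTM-CRF"),
    ],
    "Results": [
        ("F1", "is reported on", "CoNLL-2003"),
    ],
}

# Flatten to plain (s, p, o) triples for loading into a knowledge graph.
triples = [t for unit in contribution.values() for t in unit]
```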

Task 6: Extracting term-definition pairs in free text

Task 11: Normalization of Medical Concepts in Clinical Narrative

Task 12: Toponym Resolution in Scientific Papers

SemEval-2018

SemEval 2017

Timeline Extraction. Clinical TempEval will focus on domain adaptation: systems will be trained on data from colon cancer patients, but will be asked to make predictions on brain cancer patients.

SemEval-2016

Clinical TempEval 2016 follows in the footsteps of Clinical TempEval 2015 and the i2b2 2012 shared task in bringing timeline extraction to the clinical domain.

SemEval-2015

Clinical TempEval brings the temporal information extraction tasks of previous TempEvals to the clinical domain, using clinical notes and pathology reports for cancer patients from the Mayo Clinic.

TS: identifying the spans of time expressions

ES: identifying the spans of event expressions

TA: identifying the attributes of time expressions (type=DATE, TIME, DURATION, QUANTIFIER, PREPOSTEXP or SET; value=TIMEX3 value string as defined by TimeML)

EA: identifying the attributes of event expressions (type=N/A, ASPECTUAL or EVIDENTIAL; polarity=POS or NEG; degree=N/A, MOST or LITTLE; modality=ACTUAL, HEDGED, HYPOTHETICAL or GENERIC)

DR: identifying the relation between an event and the document creation time (docTimeRel=BEFORE, OVERLAP, BEFORE-OVERLAP or AFTER)

CR: identifying narrative container relations (CONTAINS, a.k.a. INCLUDES)

SemEval 2014 Task 7: Analysis of Clinical Text

Task A This includes the recognition of mentions of concepts that belong to the UMLS semantic group disorders

Task B This task involves the mapping of each disorder mention to a unique UMLS CUI. This is referred to as the task of normalization and the mapping is limited to UMLS CUIs of SNOMED codes.

GitHub Resources

Open Academic Data Challenge 2018 (开放学术数据挖掘大赛)

Task: Researcher Name Disambiguation

Open Academic Data Challenge 2017 (开放学术精准画像大赛)

Task 1: Extract a scholar’s profile information

Task 2: Predict the scholar’s research interest labels

Task 3: Predict the scholar’s future influence

Based on the datasets provided by the academic data mining system AMiner.org and the Microsoft Academic Graph, extract scholars’ profile information, analyze scholars’ research interests, and predict scholars’ paper citation counts.

WWW 2016-2018: SAVE-SD: Workshop on Semantics, Analytics and Visualisation: Enhancing Scholarly Data

WWW 2016 : BIG 2016 CUP

Microsoft provides the latest Microsoft Academic Search data set and an online graph query interface for BIG 2016 CUP

BigScholar 2019 workshop-CIKM 2019: BigScholar

BigScholar 2018 workshop-KDD 2018: BigScholar

BigScholar 2017 workshop-WWW 2017: BigScholar

BigScholar 2016 workshop-WWW 2016: BigScholar

BigScholar 2015 workshop-WWW 2015: BigScholar

BigScholar 2014 workshop-WWW 2014: BigScholar

International Workshop on Mining Scientific Publications 2012-2018

The entire body of research literature is currently estimated at 100-150 million publications with an annual increase of around 1.5 million. Research literature constitutes the most complete representation of knowledge we have assembled as a species. It enables us to develop cures for diseases, solve difficult engineering problems and answer many of the challenges the world is facing today. Systematically reading and analysing the full body of knowledge is now beyond the capacities of any human being. Consequently, it is important to better understand how we can leverage Natural Language Processing/Text Mining techniques to aid knowledge creation and improve the process by which research is done.

This workshop aims to bring together people from different backgrounds who:

have experience with analysing and mining databases of scientific publications; develop systems that enable such analysis and mining of scientific databases (especially those who work with publication databases); or develop novel technologies that improve the way research is done.

In response to the COVID-19 pandemic, the Epidemic Question Answering (EPIC-QA) track challenges teams to develop systems capable of automatically answering ad-hoc questions about the disease COVID-19, its causal virus SARS-CoV-2, related corona viruses, and the recommended response to the pandemic. While COVID-19 has been an impetus for a large body of emergent scientific research and inquiry, the response to COVID-19 raises questions for consumers. The rapid increase in coronavirus literature and evolving guidelines on community response creates a challenging burden not only for the scientific and medical communities but also the general public to stay up-to-date on the latest developments. Consequently, the goal of the track is to evaluate systems on their ability to provide timely and accurate expert-level answers as expected by the scientific and medical communities as well as answers in consumer-friendly language for the general public.

The purpose of this TAC track is to test various natural language processing (NLP) approaches for their information extraction (IE) performance on drug-drug interactions in Structured Product Labeling (SPL) documents. SPL is a document markup standard approved by Health Level Seven (HL7) and adopted by the FDA as a mechanism for exchanging product and facility information about drugs.

The purpose of this TAC track is to test various natural language processing (NLP) approaches for their information extraction (IE) performance on drug-drug interactions in SPLs. A set of 20 gold-standard SPLs annotated with drug-drug interactions will be provided to participants. An additional set of 180 SPLs annotated in slightly different format is available for training. Participants will be evaluated by their performance on a held-out set of 50 labeled SPLs.

One of the major aspects of drug information is safety concerns in the form of Adverse Drug Reactions (ADRs). In this evaluation, we focus on extraction of ADRs from prescription drug labels.

Task 1: Extract AdverseReactions and related mentions (Severity, Factor, DrugClass, Negation, Animal). This is similar to many NLP Named Entity Recognition (NER) evaluations.

Task 2: Identify the relations between AdverseReactions and related mentions (i.e., Negated, Hypothetical, and Effect). This is similar to many NLP relation identification evaluations.

Task 3: Identify the positive AdverseReaction mention names in the labels. For the purposes of this task, positive is defined as the caseless strings of all the AdverseReactions that have not been negated and are not related by a Hypothetical relation to a DrugClass or Animal. Note that this means Factors related via a Hypothetical relation are considered positive (e.g., "[unknown risk]Factor of [stroke]AdverseReaction") for the purposes of this task. The result of this task is a list of unique strings corresponding to the positive ADRs as they were written in the label.

Task 4: Provide MedDRA PT(s) and LLT(s) for each positive AdverseReaction (occasionally, two or more PTs are necessary to fully describe the reaction). For participants approaching the tasks sequentially, this can be viewed as normalization of the terms extracted in Task 3 to MedDRA LLTs/PTs. Because MedDRA is not publicly available and contains several versions, a standard version of MedDRA v18.1 will be provided to participants. Other resources such as the UMLS Terminology Services may be used to aid with the normalization process.
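Task 3's "unique caseless strings" requirement can be sketched as a small deduplication pass; the mention dicts below are an assumed intermediate representation, and the flag check is a simplification of the track's relation rules:

```python
def positive_adr_strings(mentions):
    """Keep the first-seen casing of each caseless string among
    mentions that are not flagged negated or hypothetical.
    (The 'negated'/'hypothetical' flags are an assumed intermediate
    representation, not the track's actual annotation format.)"""
    seen, out = set(), []
    for m in mentions:
        if m.get("negated") or m.get("hypothetical"):
            continue
        key = m["text"].casefold()
        if key not in seen:
            seen.add(key)
            out.append(m["text"])
    return out
```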

Given: a set of Citing Papers (CPs) that all contain citations to a Reference Paper (RP). In each CP, the text spans (i.e., citances) have been identified that pertain to a particular citation to the RP.

Task 1a: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance. These are of the granularity of a sentence fragment, a full sentence, or several consecutive sentences (no more than 5).

Task 1b: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.

Task 2: Finally, generate a structured summary of the RP and all of the community discussion of the paper represented in the citances. The length of the summary should not exceed 250 words. Task 2 is tentative.
