peipeilihfut / AssessSparseIE


Employing Semantic Context for Sparse Information Extraction Assessment

 

We address two problems in this paper. First, we want to verify the correctness of hundreds of millions of isA relationships: given a candidate pair <c, e>, we want to evaluate how likely it is that e is an entity of class c. Second, given a candidate pair <e1, e2> and a known relationship R between classes c1 and c2, we want to evaluate whether relationship R holds between e1 and e2.

Introduction

The explosive growth and popularity of the World Wide Web has resulted in a huge amount of text on the Internet, which presents an unprecedented opportunity for Information Extraction (IE). IE is at the core of many emerging applications, such as entity search, text mining, and risk analysis using financial reports. In these applications, we can divide the outcomes of IE into two categories according to their frequency: heads and tails. The heads are those that occur very frequently in the corpus. For instance, we can extract the fact that "google is a company" from numerous distinct sentences; accepting such frequent extractions rests on the assumption that the higher the frequency, the more likely the extraction is correct. Nevertheless, there are results that occur very infrequently. For instance, suppose that from a corpus we extract a statement saying that Rhodesia is a country (Rhodesia was an unrecognised state in southern Africa that existed between 1965 and 1979, following its Unilateral Declaration of Independence from the United Kingdom on 11 November 1965), and that its occurrences in the corpus are few and far between. In Table 1, we show some frequent and rare candidate countries extracted from a web corpus using Hearst patterns. It turns out that all of the frequent entities are correct, while the majority of the infrequent ones are incorrect. The mistakes come either from the extraction algorithm or from erroneous sentences in the corpus.

Table 1: Frequent and infrequent candidate entities of country

Frequent Entities   Rare Entities
India               Northern
China               Sabah
Germany             Yap
Australia           Parts of sudan
Japan               Wealthy
France              Western romania
Canada              American artists
USA                 South korea japan
Brazil              New sjaelland
Italy               Rhodesia

How to verify the correctness of a tail extraction (also known as a sparse extraction) is one of the most important and challenging problems in IE. As we know, the distribution of words and phrases in a corpus of natural language utterances follows Zipf's law, which states that the frequency of any word or phrase is inversely proportional to its rank in the frequency table. Thus, for most entities, the number of occurrences in any particular syntactic pattern used for extraction is very small. Without a good mechanism to separate correct extractions from incorrect ones, sparse information extraction will be plagued by either low precision or low recall.
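
To make the long-tail argument concrete, here is a minimal sketch in C# (with made-up constants, not measurements from our corpus) showing that under Zipf's law with exponent 1, all but a small fraction of entities fall below a 10-occurrence threshold.

```csharp
// A minimal sketch (not from the project) illustrating why tail extractions
// are sparse under Zipf's law: with frequency f(r) proportional to 1/r,
// all but the top-ranked entities fall below a small count threshold.
using System;
using System.Linq;

class ZipfSketch
{
    static void Main()
    {
        const int vocabularySize = 100000;   // hypothetical number of extracted entities
        const double topFrequency = 50000.0; // hypothetical count of the rank-1 entity
        const int sparseThreshold = 10;      // cut-off for "sparse"

        // f(r) = topFrequency / r  (Zipf's law with exponent 1)
        int sparse = Enumerable.Range(1, vocabularySize)
                               .Count(r => topFrequency / r <= sparseThreshold);

        Console.WriteLine($"{100.0 * sparse / vocabularySize:F1}% of entities are sparse");
        // With these made-up numbers, 95.0% of entities occur <= 10 times,
        // matching the long-tail phenomenon described above.
    }
}
```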

Existing efforts in information extraction, and in sparse extraction in particular, can be divided into the following four classes. Heuristic-based approaches start with a set of seed entities for a relation, or with some prior knowledge of the label distribution, and iteratively identify extraction patterns for the relation. Redundancy-based approaches require that extractions appear relatively frequently within a limited set of patterns. Knowledge-based approaches validate extractions against external resources, such as Wikipedia, Freebase, and WordNet. Finally, most popular approaches to handling sparse extractions are context-based model building approaches. They rely on one important hypothesis, known as the distributional hypothesis, which says that different entities of the same semantic relation (such as a unary or binary relation) tend to appear in similar textual contexts. For example, we may not find many occurrences of Rhodesia in the Hearst pattern "countries such as Rhodesia". But if Rhodesia appears in contexts similar to those where terms such as India, USA, and Germany occur, then according to the distributional hypothesis we can be more certain about the claim that Rhodesia is a country. This hypothesis is beneficial for assessing sparse extractions. However, the challenge lies in modeling contexts and measuring the semantic similarity of two contexts.
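
As an illustration of this hypothesis, the following minimal sketch (not the project's implementation; the context words and seed entities are hypothetical) represents each entity by a bag-of-words context vector and scores a candidate by its average cosine similarity to the seeds of a class.

```csharp
// Distributional-hypothesis sketch: a candidate entity ("rhodesia") is
// scored by comparing its context vector to those of seed entities of the
// target class ("country"). All contexts below are hypothetical.
using System;
using System.Collections.Generic;
using System.Linq;

class ContextSimilaritySketch
{
    // Cosine similarity between two sparse word-count vectors.
    static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Where(kv => b.ContainsKey(kv.Key)).Sum(kv => kv.Value * b[kv.Key]);
        double normA = Math.Sqrt(a.Values.Sum(v => v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => v * v));
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    // Build a word-count vector from a list of context words.
    static Dictionary<string, double> Context(params string[] words) =>
        words.GroupBy(w => w).ToDictionary(g => g.Key, g => (double)g.Count());

    static void Main()
    {
        var seedContexts = new[]
        {
            Context("government", "capital", "population", "economy"),   // e.g. india
            Context("government", "borders", "capital", "currency"),     // e.g. usa
        };
        var candidate = Context("government", "capital", "independence"); // e.g. rhodesia

        // Average similarity to the seeds of class "country": a high score
        // supports the sparse extraction <country, rhodesia>.
        double score = seedContexts.Average(s => Cosine(s, candidate));
        Console.WriteLine($"support for <country, rhodesia>: {score:F3}");
    }
}
```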

Our Semantic Context-Based Approach for Assessing Sparse Information Extractions

We now analyze the challenges in these tasks. The first challenge is scale. For example, there are hundreds of millions of isA relationships (formed among 2.7 million categories and 5.5 million entities in Probase [1][2]). Learning a generative model (such as an HMM or a deep learning model) over the contexts of all of these entities is impractical, because it is extremely time-consuming. The second challenge lies in improving the effectiveness of the verifier. As mentioned above, a feature representation based on the contexts of words is very different from one based on the contexts of entities. Meanwhile, neither a bag of words nor a set of hidden states provides good semantics for understanding the relationship within a candidate pair. Motivated by this, we introduce in this paper a semantic, scalable, and effective approach for sparse information extraction assessment.

The main contributions of this paper are as follows.

First, we introduce a semantic approach for solving the two problems. More precisely, we come up with a semantic representation of contexts. This approach is natural because we are dealing with a large semantic network, which provides semantic information along various dimensions. Using this information, we are able to introduce semantic features to describe a context, which leads to a lightweight and effective solution for context learning.

Second, we scan billions of web documents using MapReduce to capture the contexts of millions of entities and pairs of entities in Probase, and then compare the similarity between their contexts and the contexts of seeds. We further use the similarities produced by our three semantic context based approaches as the feature space for a given pair, and then train a binary classifier on a small amount of labeled data, varying the base classifier to select the best one for predicting sparse extractions. Extensive studies show that our approach achieves better performance than state-of-the-art approaches in sparse extraction assessment.
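
The following sketch outlines this pipeline: the three similarities serve as a feature vector for a candidate pair, and a binary classifier decides whether to accept the extraction. The delegate-based models, the constants, and the thresholded decision rule are placeholders; the actual system computes the similarities with the AM, CM, and IM context models and trains Weka base classifiers on the resulting features.

```csharp
// Hedged sketch of the assessment pipeline: three context-model similarities
// form a feature vector, and a (stand-in) binary classifier makes the call.
using System;

class AssessmentPipelineSketch
{
    // Each context model maps a candidate pair to a similarity in [0, 1].
    delegate double ContextModel(string cls, string entity);

    // Build the 3-dimensional feature vector for one candidate pair.
    static double[] Features(string cls, string entity,
                             ContextModel am, ContextModel cm, ContextModel im) =>
        new[] { am(cls, entity), cm(cls, entity), im(cls, entity) };

    static void Main()
    {
        // Placeholder models; in the project these are the attribute-,
        // concept-, and isA-based context similarities to the seeds.
        ContextModel am = (c, e) => 0.72, cm = (c, e) => 0.64, im = (c, e) => 0.81;

        double[] x = Features("country", "rhodesia", am, cm, im);

        // Stand-in for the trained binary classifier: accept the extraction
        // when the average similarity to the seeds is high enough.
        bool accepted = (x[0] + x[1] + x[2]) / 3.0 > 0.5;
        Console.WriteLine($"<country, rhodesia> accepted: {accepted}");
    }
}
```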

Data Sets

For the experimental data sets, we randomly selected about 1800 entities belonging to 12 classes in Probase. Tables 2 and 3 show the statistics and some examples for each class, respectively. Each entity has no more than 10 occurrences in Hearst patterns, and we call such extractions sparse extractions. This is because more than 90% of the entities of the above 12 concepts have no more than 10 occurrences in Probase, i.e., they lie in the long tail of the entity distribution curves. For example, Figure 2 (in the paper) shows the frequency distribution over the entities of country; the long-tail phenomenon is clearly visible below the dotted line marking 10 occurrences. We asked human judges to evaluate the correctness of these extractions. We also looked into three binary relations: isCapitalOf, isCurrencyOf, and headquarteredIn. We randomly picked 315 sparse extractions that have no more than 10 occurrences, and we also picked the 10 most frequent extractions for each relation to serve as seeds. Details of all test relationships are shown in Table 2, and the sketch below illustrates the selection rule.
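
The selection rule is simple enough to sketch; the following C# fragment (with hypothetical frequency counts, not our real data) keeps pairs with at most 10 pattern occurrences as sparse test extractions and takes the 10 most frequent pairs per relation as seeds.

```csharp
// Sketch of the data selection described above (hypothetical counts).
using System;
using System.Collections.Generic;
using System.Linq;

class DataSelectionSketch
{
    record Extraction(string Relation, string Pair, int Frequency);

    static void Main()
    {
        // Hypothetical extraction counts from Hearst / relation patterns.
        var extractions = new List<Extraction>
        {
            new("country", "<country, india>", 9650),
            new("country", "<country, rhodesia>", 3),
            new("country", "<country, wealthy>", 2),
        };

        // Pairs with at most 10 occurrences become the sparse test set.
        var sparse = extractions.Where(x => x.Frequency <= 10).ToList();

        // The 10 most frequent pairs per relation serve as seeds.
        var seeds = extractions.GroupBy(x => x.Relation)
                               .SelectMany(g => g.OrderByDescending(x => x.Frequency).Take(10))
                               .ToList();

        Console.WriteLine($"{sparse.Count} sparse extractions, {seeds.Count} seeds");
    }
}
```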

Table 2: Data sets used in experiments

                                  total pairs   pairs with       pairs in      #bad    #good
                                  in Probase    frequency < 10   experiments   pairs   pairs
isA relationships
country                           5534          92.81%           415           226     189
sport                             2866          92.18%           335           67      268
city                              8815          90.05%           231           33      198
animal                            5562          92.38%           186           37      149
seasoning                         531           92.47%           169           41      128
company                           59734         96.84%           82            9       73
painter                           1097          98.09%           81            5       76
currency                          330           91.82%           78            8       70
disease                           8280          92.60%           69            9       60
film                              10859         96.62%           65            25      40
language                          2703          93.53%           51            6       45
river                             1924          97.77%           40            2       38
total                             108235        92.25%           1802          468     1334
Binary relationships
isCapitalOf(country, city)                                       160           39      121
isCurrencyOf(country, currency)                                  80            19      61
headquarteredIn(company, city)                                   75            22      53
total                                                            315           80      235

Table 3: Examples of isA and binary relations

relation                          example bad pair               example good pair
country                           <country, democratic people>   <country, g77>
city                              <city, santa martha>           <city, amadora>
sport                             <sport, trafalgar park>        <sport, girls golf>
animal                            <animal, cauquenes>            <animal, moon snail>
seasoning                         <seasoning, bacon bit>         <seasoning, five spice>
company                           <company, institute>           <company, hasbro>
painter                           <painter, robert young>        <painter, childe hassam>
film                              <film, forest gump>            <film, breach>
language                          <language, francophone>        <language, micmac>
river                             <river, manda>                 <river, missouri river>
isCapitalOf(country, city)        <dili, east timor>             <andorra, andorra la vella>
isCurrencyOf(country, currency)   <baht, thailand>               <colombia, colombian peso>
headquarteredIn(company, city)    <espoo, general electric>      <michelin, clermont-ferrand>

Used Data Sets: Download

For more details, refer to Used Data Sets (1).

For more details, refer to Used Data Sets (new).

Source codes: Download

Our project is implemented in C# and SQL Server. The base classifiers used in our approach come from Weka-3.8.1.jar. For the source code of this project, refer to Source codes.

Our AM (attribute-based context), CM (concept-based context), and IM (isA-based context) approaches have similar parameters; here we explain the parameter list of CM as an example. The main entry points of these three approaches are AMMain(string[] args), SuperConceptBasedMain(string[] args), and IMBasedMain(string[] args), all in the file "CleaningMain.cs".

Parameter list for our CM approach

Variable          Description
databaseServer    the name of the database server
databaseName      the name of the database
testEntityTable   the table of entities to conceptualize
isSelectedTopK    whether to select only the top tokens (1: yes, 0: no)
classNumThres     the maximum number of concepts used in conceptualization
distEvalType      the type of distance evaluation
seedsNum          the number of seeds
bUseClustering    whether to use clustering (default: false)
pathStr           the directory of input/output files
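
For illustration, a hypothetical invocation of the CM entry point might look as follows. It assumes the arguments are passed in the order of the parameter list above and that the entry point is a static method on a CleaningMain class; the server, table, and distance-type values are placeholders. Consult "CleaningMain.cs" for the actual argument order and types.

```csharp
// Hypothetical invocation of the CM entry point; verify the argument order
// and the class/method signature against "CleaningMain.cs" before use.
string[] args =
{
    "MY_SQL_SERVER",          // databaseServer: hypothetical SQL Server instance
    "Probase",                // databaseName
    "TestEntityPairs",        // testEntityTable: hypothetical table of pairs to assess
    "1",                      // isSelectedTopK: 1 = select only top tokens
    "10",                     // classNumThres: max concepts per conceptualization
    "cosine",                 // distEvalType: hypothetical distance-evaluation type
    "10",                     // seedsNum: number of seeds per class/relation
    "false",                  // bUseClustering: no clustering
    @"C:\AssessSparseIE\data" // pathStr: directory of input/output files
};
CleaningMain.SuperConceptBasedMain(args); // run the CM approach
```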

References

[1] The Probase knowledge base: http://research.microsoft.com/en-us/projects/probase/release.aspx

[2] W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: a probabilistic taxonomy for text understanding. In Proceedings of SIGMOD'12, pages 481-492, 2012.

Please cite the following references if you use this source code

[1] Peipei Li, Haixun Wang, Hongsong Li, and Xindong Wu. Employing Semantic Context for Sparse Information Extraction Assessment. ACM Transactions on Knowledge Discovery from Data, 12(5): 54:1-54:36, July 2018.

[2] Peipei Li, Haixun Wang, Hongsong Li, and Xindong Wu. Assessing Sparse Information Extraction Using Semantic Contexts. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM'13), pp. 1709-1714, San Francisco, CA, USA, October 28 - November 1, 2013.

Contact

Peipei Li (peipeili@hfut.edu.cn): Hefei University of Technology, China
Haixun Wang (haixun@google.com): Google Research, USA
Hongsong Li (hongsong.lhs@alibaba-inc.com): Alibaba Group, China
Xindong Wu (xwu@uvm.edu): University of Louisiana at Lafayette, USA

 
