Web-based Systems Group @ University of Mannheim's repositories
contrastive-product-matching
This repository contains the code to reproduce the experiments of the poster "Supervised Contrastive Learning for Product Matching".
productbert-intermediate
This repository contains code and data download scripts for the paper "Intermediate Training of BERT for Product Matching" by Ralph Peeters, Christian Bizer and Goran Glavaš.
ExtractGPT
Attribute Value Extraction using Large Language Models
productCategorization
This repository contains code and data download instructions for the workshop paper "Improving Hierarchical Product Classification using Domain-specific Language Modelling" by Alexander Brinkmann and Christian Bizer.
wdc-lspc-v2
This repository contains code and data download scripts for the paper "Using schema.org annotations for training and maintaining product matchers" by Ralph Peeters, Anna Primpeli, Benedikt Wichtlhuber and Christian Bizer.
wdcproducts
This repository contains the code and data download links to reproduce building the WDC Products Benchmark.
WDCFramework
Java framework used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation.
EntityMatchingTaskProfiler
Code for profiling entity matching tasks using the dimensions described in the following paper: Primpeli, Anna, and Christian Bizer. "Profiling entity matching benchmark tasks." Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020.
UnsupervisedBootAL
Unsupervised Bootstrapping of Active Learning for Entity Resolution
SubsetCreatorJupyterNBs
Jupyter notebooks used to create the schema.org subsets from the MD and JSON-LD corpus for the WDC 2020 structured data extraction.
TailorMatch
This repository contains code and comprehensive examples to replicate and build upon the experiments presented in our paper "Fine-tuning Large Language Models for Entity Matching". It provides resources for implementing fine-tuning techniques on large language models specifically for entity matching tasks.
ALMSER-GEN
This repository contains the code and data for reproducing the results of the paper "Active Learning for Multi-Source Entity Matching: How do the Characteristics of the Task Impact Performance?".
pie_chatgpt
Product Information Extraction using ChatGPT
schemaorg-tables
This repository contains the code and data download links to reproduce the building process of the 2021 Schema.org Table Corpus.
MannheimSearchJoinsEngine
A Search Join is a join operation which extends a user-provided table with additional attributes based on a large corpus of heterogeneous data originating from the Web or corporate intranets.
DeepAL_for_ER
Code and data to reproduce the results of the Master's thesis of Stephan Waitz on "Combining Deep Learning and Active Learning for Entity Resolution".
StructuredDataProfiler
Java project for profiling the results of the yearly Web Data Commons extraction of structured data with RDFa, Microdata, Microformats, and embedded JSON-LD annotations.