adellegia / watermelon

Document classification of Digital Markets Act public consultations using transfer learning with BERT

Classification of Unstructured Documents: Transfer Learning with BERT

Summary

In this tutorial, we build a text classification pipeline for unstructured documents using a transformer-based deep learning model called Bidirectional Encoder Representations from Transformers (BERT). We demonstrate the application of a document classification algorithm for the European Commission (EC).

Whenever new legislation is proposed, the EC opens public consultations in which various stakeholders (e.g. businesses, academia, law firms, associations, private individuals) submit documents detailing their views on the proposal. The EC receives anywhere from 10,000 to 4 million of these public consultation documents annually. Using machine learning and deep learning methods to process these documents streamlines the Commission's review of stakeholder comments, in turn allowing it to integrate more information into its policymaking process.

By the end of this tutorial you will understand how to:

  1. Extract, clean, and pre-process information from unstructured PDF documents
  2. Use the pre-processed text as input to machine learning/deep learning models
  3. Build a text/document classifier with BERT
  4. Compare BERT with text classifiers built using other models
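As a minimal sketch of steps 1 and 2, assuming the text has already been extracted from the PDFs (e.g. with a PDF library), cleaning and pre-processing could look like the following. The function names and regex rules here are illustrative, not the exact ones used in the notebook:

```python
import re

def clean_text(raw: str) -> str:
    """Normalise raw text extracted from a PDF before feeding it to a model.

    Illustrative cleaning rules only -- the notebook may apply different ones.
    """
    text = raw.replace("\xad", "")             # strip soft hyphens
    text = re.sub(r"-\n(\w)", r"\1", text)     # re-join words hyphenated across line breaks
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\s+", " ", text)           # collapse whitespace and newlines
    return text.strip()

def make_examples(documents, labels):
    """Pair each cleaned document text with its class label for a classifier."""
    return [(clean_text(doc), lab) for doc, lab in zip(documents, labels)]

raw = "The Digital Mar-\nkets Act    proposal\nhttps://example.eu/dma is welcome."
print(clean_text(raw))  # -> "The Digital Markets Act proposal is welcome."
```

The resulting (text, label) pairs can then be tokenized and fed to BERT or to the baseline models it is compared against in step 4.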

Tutorial

This tutorial is implemented in Python >= 3.6; all software requirements and dependencies can be installed from environment.yml. The notebook Text_Classification_BERT.ipynb contains the memo, code, and discussion of the results. The slide deck of our presentation is available in Transfer Learning for Classification of Unstructured Documents.pdf. A recorded video (.mp4) walking through the notebook can be downloaded here: Link to video file.

Contributors
