10xMLaaS project: MeL is a machine learning and natural language processing tool for analyzing open text data.
- Install Docker Desktop
- Go to https://www.docker.com/products/docker-desktop
- Install the Docker Desktop for your platform
- Make sure Docker is running
- Install git
- Go to https://git-scm.com/download/
- Install the git for your platform
- Create a folder on your machine for the MeL tool
- Choose a location for the MeL tool on your machine, for example, in your Documents folder.
- Open a Terminal (Mac/Linux) or Command Prompt (Windows) and navigate to the folder you chose in the previous step:
cd path/to/that/folder
- for example, on a Mac:
cd ~/Documents
- Create a new folder for the MeL tool, for example
mel
:mkdir mel
- Move to the newly created folder:
cd mel
- Obtain the Mel tool from GitHub
- Type the following to copy the MeL tool to your machine:
git clone https://github.com/18F/10x-MeL.git
- Navigate into the tool by typing:
cd 10x-MeL
- Type the following to copy the MeL tool to your machine:
- Connect the MeL tool to Docker by typing:
docker-compose up --build -d
- Open Docker Desktop
- Observe the listing for
10x-mel
and the presence of a play button - Click the play button
- Once the status is
RUNNING
or the tool icon turns green, go to- http://localhost:5000/
- This page can also be opened but double-clicking
start.webloc
in theMeL
folder
To discover latent categories in the text and assign entries to those categories.
There are several components to automatic categorization:
1. The text in each entry is parsed and noun phrases are extracted
2. Count occurrences of words and phrases contained in the noun phrases
- Counts are boosted according to the the recency of the entries in which they appear
3. Use the most commonly occurring words and phrases as the initial category headings
4. Perform an initial pass over the categories, merging those categories that share common terms
5. Create language models for each category
6. Perform another pass over the categories, this time merging categories whose language models are similar, using entropy as a metric
1. Entries are compared to the terms in the most popular categories, according to a power-law "rich get richer" approach. When terms intersect, the entry is linked to one or more category/subcategory pairs.
2. If an entry fails to match any categories, a language model is created for the entry and it is compared with the language model of each category to determine best fit.
To detect user reports of problem they encounter with usa.gov and other government websites
Two categories of data are used to detect problems:
- The text of users' survey responses
- The ratings provided by users, often on a scale from 1 to 5
- These ratings are normalized from a scale of 1 to 5 to a scale from -2 to +2, where complaints are negative.
The task is distilled into two subtasks:
- example: a complaint
Linguistic analysis is combined with the users' ratings to determine the sentiment of the entry and the likelihood that the user is providing context about a problem that was experienced
- example: a government website provides out-of-date information
Surface pattern matching is used here to determine whether the content of the entry is relevant. This matching is list based, so it can be easily adapted for other use cases.
See CONTRIBUTING for additional information.
This project is in the worldwide public domain. As stated in CONTRIBUTING:
This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.
All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.