The main goal of this workshop is to give an introduction to NLP. Hopefully, this will both be interesting and spark ideas for many professional and/or personal projects!
API stands for Application Programming Interface. An API is basically a method for two applications (or just two pieces of software) to communicate with one another.
This communication can take many different forms, but when talking about APIs we usually mean web APIs, where the communication consists of requests. The two types of requests we are going to use are:

- `GET` requests: a request to retrieve some data from the resource
- `POST` requests: a request to send some data to the resource

We are going to use FastAPI, a framework for building such APIs. See this link for some more info.
Machine learning is a way to make a computer learn something. The computer learns by analysing data and statistics, and the resulting program can then predict an outcome for new, unseen data.

For NLP specifically, machine learning enables the computer to identify certain aspects of text, such as parts of speech or named entities. This works by training a model on text data and then applying that model to new text.
For more information, you can use this link. Oh! Don't forget that Google is your friend!
Before we can make a new Python project, we need to make sure that you have a recent Python version installed on your device. Python 3.8+ is required. If you have a recent version of Python installed, the tool `pip` is automatically installed with it. Check your version with:
python --version
Note: When the command is not found or the version is 2.x, try `python3 --version` and use `python3` instead of `python` in the commands below.
Clone this repository to your machine and open the project in your favourite editor. Next, install all the required packages for this project:
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment:
  - See this link for how to activate the venv on your operating system.
- Update `pip` to get its latest version: `python -m pip install -U pip`
- Install wheel: `python -m pip install wheel`
- Install the required packages: `python -m pip install -r requirements.txt`
- Set up the database: `python db.py`
- Run the server: `uvicorn api:app --reload`
Note: This "runner" will occupy a terminal instance while active. If you want to run another command, just open a new terminal instance (and don't kill this one).
Do a GET request that returns a dictionary containing the message "Hello Pythoneer!" when you visit http://127.0.0.1:8000 in your browser.

{"message":"Hello Pythoneer!"}

- The code for this request is already given in `api.py`.
Create a POST request that uploads a file when you execute `python requester.py` in your terminal.

- A part of the code for this request is already given in `api.py`. Just replace the `...` with the right code to get the right output.
The POST request uses the module `requester.py` (just take a look and see what happens there). To do the actual upload of files to the database, run the following command:
python requester.py
Make sure that the API runner is still active in another terminal.
b'{"message":"file successfully uploaded","file_name":"file.txt"}'
200
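For context, an upload client like `requester.py` might look roughly like the hypothetical sketch below, built on the `requests` library. The endpoint path (`/file`) and the form-field name (`file`) are assumptions here; check `api.py` and `requester.py` for the names the project actually uses.

```python
# Hypothetical sketch of an upload client like requester.py. The
# endpoint path ("/file") and form-field name ("file") are assumptions
# -- check api.py and requester.py for the real ones.
import requests

def upload_file(path: str, url: str = "http://127.0.0.1:8000/file") -> int:
    """POST a local file to the API and return the HTTP status code."""
    with open(path, "rb") as fh:
        response = requests.post(url, files={"file": fh})
    print(response.content)      # raw bytes of the JSON reply
    print(response.status_code)  # 200 means success
    return response.status_code
```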
Create a GET request in `api.py` that returns a dict containing the number of files and a list of the files when you visit http://127.0.0.1:8000/file in your browser.
@app.get("/file")
def get_all_files() -> Dict[str, Any]:
...
{"nr_files":<number_of_files>,"files":[{"id":<id>,"file_name":"<file_name>"},...]}
Create a GET request in `api.py` that returns a dict containing the file name and its content when you visit http://127.0.0.1:8000/file/{id} in your browser.
@app.get("/file/{file_id}")
def get_file(file_id: int) -> Dict[str, str]:
...
{"file_name":"<file_name>","contents":"<contents>"}
Create a GET request in `api.py` that returns a dict containing all the tokens from that file when you visit http://127.0.0.1:8000/file/{id}/tokens in your browser. A token can be a set of multiple words that belong together, like 'New York' or 'Harry Potter'.
@app.get("/file/{file_id}/tokens")
def get_tokens(file_id: int) -> Dict[str, Any]:
...
{"token_count":<number_of_tokens>,"unique_tokens":["<token_1>", "<token_2>", ...]}
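The counting logic behind this endpoint can be sketched in plain Python. Note this simplified version treats every word as one token; recognising real multi-word tokens like 'New York' needs an NLP library such as nltk or spaCy.

```python
# Plain-Python sketch of the token-counting logic. Recognising
# multi-word tokens ("New York") needs an NLP library such as nltk or
# spaCy; this simplified version treats every word as one token.
import re
from typing import Any, Dict

def count_tokens(text: str) -> Dict[str, Any]:
    tokens = re.findall(r"[A-Za-z']+", text)
    # Deduplicate while keeping first-seen order.
    unique = list(dict.fromkeys(t.lower() for t in tokens))
    return {"token_count": len(tokens), "unique_tokens": unique}
```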
Create a GET request in `api.py` that returns a dict of sentence tokens containing the sentiment of that file when you visit http://127.0.0.1:8000/file/{id}/sentiment in your browser. A line can hold multiple sentence tokens. For example, the line below holds three sentence tokens: "This is the first sentence. This is the second sentence. This is the third sentence."
@app.get("/file/{file_id}/sentiment")
def get_sentiment(file_id: int) -> Dict[str, Any]:
...
{"sentiment":{<sentence_text>:{"neg":<score>,"neu":<score>,"pos":<score>,"compound":<score>},...}}
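The neg/neu/pos/compound keys in the expected output match the scores produced by nltk's VADER `SentimentIntensityAnalyzer`, which scores one sentence at a time, so the file contents must first be split into sentence tokens. Below is a naive regex splitter to illustrate that step (an assumption; `nltk.sent_tokenize` is more robust and is likely what the workshop intends).

```python
# Naive sketch of the sentence-splitting step. nltk.sent_tokenize is
# more robust; this regex version only illustrates the idea.
import re
from typing import List

def split_sentences(line: str) -> List[str]:
    # Split after '.', '!' or '?' that is followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", line) if s.strip()]
```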
Create a GET request in `api.py` that returns a dict containing all the named entities of the file when you visit http://127.0.0.1:8000/file/{id}/named_entities in your browser.
@app.get("/file/{file_id}/named_entities")
def get_named_entities(file_id: int) -> Dict[str, Any]:
...
{"named_entities":[["<token_1>","<entity_tag>"],["<token_2>","<entity_tag>"],...]}
Now it is time to train your own NLP model! You are going to perform binary sentiment classification, which means classifying the sentiment of a review as either positive or negative (0 or 1).
These steps can be followed as a reference:
- The data to train your model can be found in `data\sentiment_competition_train.csv`
- Pre-process dataset
- Split dataset into a training and validation set
- Vectorize data
- Train model using classification algorithm
- Validate trained model using validation dataset
- Improve model/pre-processing/vectorizer etc.
- Evaluate with the test set and hope for the best!
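The steps above can be sketched with scikit-learn. The choice of TF-IDF vectorizer and logistic regression is an assumption (any vectorizer/classifier combination works); loading and pre-processing `sentiment_competition_train.csv` is left to you.

```python
# Sketch of the split/vectorize/train/validate steps using
# scikit-learn. TF-IDF + logistic regression is one possible choice,
# not the required one; pre-processing the CSV is left to you.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_and_validate(texts, labels):
    # Split into a training and a validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels
    )
    # Vectorize: turn raw text into TF-IDF features.
    vectorizer = TfidfVectorizer(lowercase=True)
    X_train_vec = vectorizer.fit_transform(X_train)
    X_val_vec = vectorizer.transform(X_val)
    # Train a binary classifier: 0 = negative, 1 = positive.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_vec, y_train)
    # Validate the trained model on the held-out set.
    accuracy = accuracy_score(y_val, model.predict(X_val_vec))
    return model, vectorizer, accuracy
```

From here, iterate: tweak the pre-processing, the vectorizer, or the model, re-check the validation accuracy, and only touch the test set once at the very end.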