Emission Factor Matching

This is an interesting MLE exercise: map item descriptions written in natural language to a set of item labels and their corresponding emission factors (dummy data) in order to calculate carbon emissions. The full details can be found in the PDF document.

As an overview, I think this serves as a great self-contained task to showcase an end-to-end process for tackling an ML feature or product: from ideation and exploration, to research and modeling, to evaluation and deployment, and finally to productionization and optimization.

The predicted matches for InterviewDataset.csv can be found in data/EmbeddingsInterviewDataset.csv.

Approach

The ideas that came to mind when I read the task were of a few different "types", and I am sure there are many ways to solve this task with varying levels of difficulty and performance. I have written down below the ones I thought about or researched. The approaches are all similar in that they try to solve the problem of mapping free-form text descriptions, written in natural language, to a predefined set of labels.

A good solution thus requires an understanding of natural language in order to draw out the limited amount of contextual information from the descriptions and provide accurate matches.

The Naive Way

The most naive way would be to match the descriptions against the labels through basic pattern matching, say using regex. This will of course only work for descriptions that contain the exact labels.
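As a rough illustration of this naive approach, here is a minimal sketch using only Python's standard library. The labels and the function name `naive_match` are made up for this example and are not taken from the actual dataset or notebook.

```python
import re

# Hypothetical labels for illustration; the real ones come from the Emissions dataset.
LABELS = ["tofu", "chicken breast", "olive oil"]


def naive_match(description, labels=LABELS):
    """Return the first label found verbatim (as whole words) in the description, else None."""
    text = description.lower()
    for label in labels:
        # \b anchors keep "oil" from matching inside "boiled", for example.
        pattern = r"\b" + re.escape(label.lower()) + r"\b"
        if re.search(pattern, text):
            return label
    return None


print(naive_match("Organic TOFU, firm, 500g"))
print(naive_match("tau kua (pressed bean curd)"))
```

The second call finds nothing even though the item is a kind of tofu, which is exactly the limitation described above.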

The More Traditional NLP Ways

Before neural networks became viable, NLP tasks were usually approached using a mix of domain knowledge and statistical models such as Naive Bayes, SVMs or Decision Trees, or by using word frequency information for clustering and classification. However, these approaches are limited in their ability to capture the semantic meaning behind the words and thus had suboptimal performance.

Later on, neural networks that can model sequential data, such as RNNs and LSTMs, were found to be much more useful, since they can capture the sequences and dependencies that are core to natural language, which allows a better understanding of semantic meaning.

The Present NLP Ways

My understanding of NLP mostly stopped at LSTMs and RNNs, as my modules taught them formally, but I had personally explored more CV than NLP until recently. I did try to keep up with current approaches in various tech and ML fields but had not really heard about embeddings explicitly until I started researching for this task. I am, however, aware that GPT and its variants benefited greatly from extremely large models and the transformer architecture.

Either way, I came to learn that embeddings, essentially numerical vector representations of text, serve as the foundation of these more advanced models like GPT, and are also well-suited for this particular similarity search task. So my idea was to encode the item descriptions and item labels into these dense vectors, then compare them using a similarity or distance metric like cosine similarity, although there are others as well, like L1/L2 distance, or Jaccard similarity on token sets.
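To make the comparison step concrete, here is a minimal sketch of cosine similarity matching in plain Python. The 3-dimensional vectors are toy values invented for illustration (real embeddings come from the model and have far more dimensions), and `best_match` is a hypothetical helper name, not the function used in the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec, label_vecs):
    """Return the (label, score) pair whose vector is most similar to query_vec."""
    return max(
        ((label, cosine_similarity(query_vec, vec)) for label, vec in label_vecs.items()),
        key=lambda pair: pair[1],
    )


# Toy vectors, made up for illustration only.
label_vecs = {
    "tofu": [0.9, 0.1, 0.0],
    "olive oil": [0.0, 0.2, 0.9],
}
label, score = best_match([0.8, 0.2, 0.1], label_vecs)
print(label, round(score, 3))
```

Cosine similarity ignores vector length and only compares direction, which is why it is a common default for comparing embeddings.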

My Solution

Since I happened to have been playing around with OpenAI's models recently and even built a chatbot for a simple demo showcase, I stumbled upon their embeddings model and decided to use their API instead of a local model from a library like HuggingFace, which might take longer to set up. The caveat is of course that there is a very small fee involved in running this task. It is also not as fast and depends on a profit-driven external API, which might not be desirable in production given possible concerns about speed, ability to fine-tune, costs and data privacy.

Setup

Since this project uses OpenAI's embeddings model, we will need a valid API key stored in a .env file. More info about their APIs can be found in their documentation. The following commands install the libraries needed to run the notebook and the FastAPI server, then start the server using uvicorn for local testing.

# install required python libraries in a virtual environment
python -m venv venv
# activate it (Windows path shown; on macOS/Linux use: source venv/bin/activate)
./venv/Scripts/activate
pip install -r requirements.txt

# start fastapi server
uvicorn main:app --reload

After the server has started, it should be accessible on localhost:8000 and you can also visit localhost:8000/docs to try out the API endpoints and view the request/response schemas, all auto-generated by FastAPI.

Observations

Generally I think the data is quite clean since this is just an example task, so only some simple cleaning was needed.

Lowercasing and removing special characters and null values were some easy ways to reduce the number of unique tokens to be matched, which may improve performance and speed, although some information might be lost if those characters are supposed to carry meaning. There were also some invalid values like #DIV/0 for certain emission factors, so I just changed them to 0.

Some of the characters were also broken, such as accented characters, but I think this is better addressed on the data source's side. Some information, like whether an item is a raw item or an ingredient, might not be clear either, because the way it is represented in the descriptions again depends on the data sources. These cases were ignored.
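The cleaning steps described here could look roughly like the sketch below. The function names are hypothetical and the exact rules in the notebook may differ; note in particular that the ASCII-only pattern used here also drops accented letters.

```python
import re


def clean_description(text):
    """Lowercase, replace special characters with spaces, and collapse whitespace.

    Non-string inputs (e.g. None or a NaN read from a CSV) become an empty string.
    """
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # Assumption: keep only ASCII letters, digits and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_emission_factor(value):
    """Parse an emission factor, mapping unparseable cells such as '#DIV/0' to 0.0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


print(clean_description("  Tofu (Firm), 500g!! "))
print(clean_emission_factor("#DIV/0"))
```

Replacing invalid factors with 0 keeps the pipeline running, but it also makes those rows indistinguishable from genuinely zero-emission items, so flagging them separately may be safer in practice.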

After cleaning, we can tokenize the descriptions and pass them into an embeddings model. In my case, I used OpenAI's embeddings model, which already includes their own tokenizer.

Improvements

There are definitely many improvements to be made.

For one, the performance can perhaps be enhanced by fine-tuning the embeddings model on the kinds of data that we expect. These pretrained models are trained on general datasets, whereas specific domains like food and agriculture can contain lesser-known colloquial names, such as "tau kua" for tofu, which may also vary in spelling. The embeddings model can likely map to the item labels more accurately if it understands such alternative descriptions.

Moreover, as mentioned in my observations above, certain descriptions use special formats to denote other kinds of information in a succinct way, which a general model might not understand. This might also require a better cleaning process, informed by communication with domain experts, to understand these subtleties.

Other similar models or NLP solutions might also work better in terms of performance, speed, costs and control.

Productionization

A simple FastAPI program was set up to showcase how the precomputed embeddings of our Emissions dataset can be made accessible through an API POST endpoint. Users or other programs can then fetch a list of predicted matching emission labels for a given list of descriptions in natural language.

However, the current approach is of course not scalable, especially as the Emissions dataset grows immensely, which might also require incremental training or retraining of the embeddings model to make use of the new data coming in. With more time, containerizing the program would likely also be better practice.

Vector Storage

Firstly, in terms of storage, the embeddings are currently just kept in a CSV or Pickle file, which can be read back using Pandas or other similar libraries. However, this is not scalable, as storing these large vectors this way is unoptimized, and searching or querying them is even more so.

One scalable solution for vector data is a vector database. With the rise in popularity of NLP businesses and use cases, many new databases dedicated mainly to storing vectors have popped up, such as Pinecone and Weaviate, which also have integrations with pretrained models for easier setup. Essentially, these vector databases, or extensions like pgvector and RediSearch, allow for efficient storage and retrieval of large numbers of these vectors using nearest neighbour search with metrics such as cosine similarity. They may also offer other advanced features such as indexing, clustering and compression for further performance boosts.
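For context, the sketch below shows the exhaustive nearest neighbour search that a vector database is meant to replace. It is plain Python with made-up toy vectors and hypothetical function names, not the API of any of the databases mentioned. Vectors are normalized once up front so that each query only needs dot products, but every query still scans every stored vector.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(label_vecs):
    """Pre-normalize all label vectors so cosine similarity reduces to a dot product."""
    return [(label, normalize(vec)) for label, vec in label_vecs.items()]


def top_k(index, query_vec, k=3):
    """Return the k most similar (label, score) pairs via a full linear scan."""
    q = normalize(query_vec)
    scored = ((sum(a * b for a, b in zip(q, vec)), label) for label, vec in index)
    return [(label, score) for score, label in heapq.nlargest(k, scored)]


# Toy 2-dimensional vectors, invented for illustration only.
index = build_index({"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]})
print(top_k(index, [1.0, 0.1], k=2))
```

The cost of this scan grows linearly with the number of stored vectors, which is the part that approximate nearest neighbour indexes in vector databases are designed to avoid.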

MLOps Stack

Secondly, in terms of the embeddings model used, one or more locally fine-tuned models can be stored in formats such as ONNX, then deployed as part of an MLOps pipeline on a platform such as Databricks, which can also be used to build data engineering pipelines. For the data and modeling layers, we can use Spark SQL to explore our data and Spark MLlib for model training or continual training, and even handle streaming and graph processing, all in a scalable manner, since Spark automatically splits the work into tasks, among other optimizations.

Interface

Lastly, wrapping all of this in a usable manner would require some kind of interface, the simplest of which would be a set of API endpoints, as showcased in this project. More than likely, a user-friendly platform such as a web or mobile app would be built on top of the ML layer so that businesses and end-users can understand their carbon emissions. The platform can use these emission factors and also link them to related data, such as tagging a certain business cost to these emissions, to provide insights and visualizations and further drive business value. These interfaces would then be hosted on cloud platforms such as Vercel to be delivered to the end-users.

Overall

All of this would require a robust stack of technologies that is scalable and cost-effective. Moreover, if data privacy and IP protection are high priorities, the stack might have to be mostly self-hosted instead, which adds complexity.

Conclusion

It was really insightful learning about these technologies in such a short time. Hopefully I have showcased my high-level approach to solving this complex task. In a small and contained environment this task might seem straightforward, but as with any business, tackling scalability, maintainability and costs while providing a value proposition are all tough challenges, and I would be excited to work on them.

zhermin / emission-factor-matching