vin136 / ML-wild

Synthesizing insights for building Real-world ML systems.

I'm currently doing my master's in computer science with a specialization in machine learning. I will be graduating this May but can also finish early, around January, so I'm open to roles that require joining after January.

Before this, I worked for nearly 3 years at a trading SaaS company, Niveshi Capital. I joined when the company was an early-stage startup and was responsible for building its machine learning pipeline, from gathering data to production-grade models. Technology-wise: Python and Python-based libraries (NumPy, pandas, TensorFlow, and PyTorch).

Also, this summer I worked as a Research Intern at a lab in my department. We worked on building computer vision-based models for solar forecasting. Our work is in the process of being published at a conference; we are drafting the material. So overall, I have experience in both research and industry.

Pitch: why am I a good fit?

I have strong prior experience building machine learning-powered applications. Previously, I was the founding engineer at Niveshi Capital, where I led a team of engineers building a deep reinforcement learning-based trading SaaS product. I thrive in ambiguity. I have a deep love for learning and can quickly adapt to what is needed. Also, I share the visceral experience of building a start-up from scratch and would love to be part of another. As an AI Engineer, I understand the company's mission and how it is the right idea at the right time - commoditizing AI.

I believe this role would offer me the intellectual stimulation and a collaborative environment that would allow me to excel. I am looking forward to learning more about the role.

Meta Skill: General capability of doing quality research (scientific method, thinking in first principles, rigorous testing, and careful analysis).

Specialized Knowledge: Training and debugging neural networks, statistics, and machine learning.

Hello Pratap,

I've just applied for the Machine Learning Scientist Role at Arena. I value your time and promise this will be quick.

Why I'm a great fit for the Role?

Previously I've spent 3 years trying to make Deep Reinforcement learning work in the real world. I was the lead engineer at Niveshi Capital, developing intraday trading & execution strategies using Reinforcement Learning. I'm sure you are well aware of the difficulty in doing so. I've developed the tenacity needed to make the theory work in the real world.

I want to be part of the team. Happy to share anything that's helpful for you to see the fit. Looking forward to hearing from you.

Best,

Vinay

Thanks, Vinay! Great! Here's the link for us to connect: https://rivs.zoom.us/j/4392120096

Tools

  • Preferred: Experience using or deploying MLOps systems/tooling (e.g. MLFlow)
  • Preferred: Experience working with columnar datastores (BigQuery preferred)
  • Preferred: Experience in pipeline orchestration (e.g. Airflow)
  • Preferred: Experience using Google Cloud Platform services
  • Preferred: Experience in DevOps processes/tooling (CI/CD, GitHub Actions)
  • Preferred: Experience using infrastructure as code frameworks (Terraform)

Why are you interested in this position (3 minutes to respond)

  1. I learned that I'll be working with the Ethicon product team that designs and improves surgical instruments, so I know that my work will have a direct impact on improving people's lives. It's always satisfying to know that you will not only learn skills useful in your career, but also have a positive impact along the way.

  2. The preferred qualifications listed, like experience with Python data science libraries such as PySpark, NumPy, and scikit-learn, or strong prior experience working with engineering datasets, perfectly match my background.

  3. Also, this summer my friend worked at J&J as a Software Development intern and spoke very highly of the work culture and the team, which makes me excited to work for the company. In fact, since then I have been looking at the J&J career site for opportunities. Working at the company during the spring will also help me learn more about the company and its work culture. In the long run, I see myself building machine learning-driven products that improve people's lives, and working at J&J would surely be a step in the right direction.

In 90 seconds, describe your relevant professional experience (2 minute timer on the recording, despite the question stating 90 seconds)

  1. First, a quick summary: I'm currently pursuing my master's in computer science with a specialization in machine learning. Prior to that, I worked as a machine learning engineer at a fintech startup for 3 years.

  2. To tell a bit about my experience at the company: it was more of a research-focused role. We were trying to completely automate financial trading using cutting-edge machine learning techniques like deep reinforcement learning. Back in 2018, there were no successful industry-scale applications of the technology. So my role was to read, understand, and modify the latest research in the field: first build research code, then perform and track experiments, and after that take what was working and productionize the system. This covers the complete lifecycle of a machine learning project, from research to production and maintenance of the models. Technology-wise: SQL, Python, NumPy, pandas, TensorFlow.

  3. Currently, my master's is also focused on machine learning research. This summer I worked as a research intern under my professor. We built an accurate solar forecasting model, used for energy management. We designed a novel model that takes in sky images and other sensor data to predict future solar irradiance. Our work is in the process of being published. Here I explored various computer vision techniques along with classical machine learning algorithms.

What are your top three strengths for this position (3 minutes to respond)

  1. From what I understand, this position is a research and development role. In research, we constantly deal with uncertainty. We constantly have to ask the right questions and look for the answers. There is no set work to do, unlike in a purely engineering role. My past experience trained me to deal with ambiguity.

  2. Working with people efficiently. In any R&D role, it's crucial to be able to understand and appreciate people's strengths. This lets us efficiently allocate the work, leveraging the strengths of the team. I was the lead engineer in a small team of people with different strengths. Over time I learned their strengths and was able to give them work they enjoyed and excelled at. This dramatically improved our productivity.

  3. Giving my best. In whatever I choose to do, I give my all. When working on a novel or hard problem, things often don't work the first time you try. Typically, people take failure too personally. I learned to decide carefully what's worth trying and, irrespective of the outcome, give my best.

What skills have you gained that will help in this role?

First, I'll start with the necessary technical skills. My prior experience as a machine learning engineer helped me become proficient in Python and its data science stack of libraries. I've developed sufficient skill to be able to modify the internals of some of these libraries. In my last job, we had to implement custom optimizers in TensorFlow, as there was a bug in the library's implementation. This gave me the confidence to tackle any technical challenges in the work.

Now, to speak of general skills, I learned to write and communicate technical ideas to various stakeholders. From explaining my work to a fellow software engineer to communicating it to a potential investor, I learned to simplify and abstract away the right details. Also, during my summer internship, I learned to cooperate with people of various backgrounds. My research work required me to collaborate with people with varying technical acumen, including my professor, a couple of graduate students, and even a high school student working in the lab. I hope these skills will help me excel at the job.

Tell us about a project or assignment you worked on, focusing on the skills you exercised and how you worked as part of a team

Recently I worked in the Energy lab at Rutgers as a research intern. We were working on improving solar forecasting models. The team included my professor, along with a couple of graduate students and a high school student. First, I learned to break the project into pieces that can be handled by different people of varying technical skills. Also, in such collaborative projects, it's important to be kind. I taught one of our labmates a tutorial on Python and some of its libraries. This not only creates a more positive and encouraging environment but also improves the team's productivity in the long run.

Again, as in any research project, our technical skills are bound to be stretched. From dealing with noisy datasets to implementing non-trivial algorithms, I learned to embrace the unknown and work step by step. We created a roadmap and diligently followed it. With frequent lab meetings to assess the progress, by the end of my short summer stint we were able to improve the state of the art in solar forecasting. We are in the process of drafting our work for publication.

Why this role? Why J&J?

  1. I learned that I'll be working with the Ethicon product team that designs and improves surgical instruments, so I know that my work will have a direct impact on improving people's lives. It's always satisfying to know that you will not only learn skills useful in your career, but also have a positive impact along the way.

  2. The preferred qualifications listed, like experience with Python and its data science libraries such as PySpark, NumPy, and scikit-learn, and strong prior experience working with engineering datasets, perfectly match my background.

  3. Also, this summer my friend worked at J&J as a Software Development intern and spoke very highly of the work culture and the team, which makes me excited to work for the company. In fact, since then I have been looking at the J&J career site for opportunities. Working at the company during the spring will also help me learn more about the company and its work culture. In the long run, I see myself building machine learning-driven products that improve people's lives, and working at J&J would surely be a step in the right direction.

Tell us about a time you worked on a project with stakeholders and how you handled it.

In my first job our stakeholder was Edelweiss Financial Services, India's largest fintech company. We sold our trading strategies to them, so they were our direct stakeholder. They also provided us the infrastructure for developing our strategies. As the lead ML engineer, I was responsible for our models. We had to provide them various metrics that told them the expected performance of these models in live trading. This meant I had to ensure that my team and I were very careful with our models.

I decided to direct more effort towards building a stress testing engine that contains a suite of tests, much like in traditional software but for machine learning models. These tests are designed to improve the robustness of the models. I also made the call to always provide them with the worst-case performance. This helped earn their trust and improved our relationship with them.
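To make the stress-testing idea concrete, here is a minimal sketch of what such a suite could look like. It is not the engine we actually built: the noise scenarios, the `toy_model`, and the names `stress_report`, `worst_case`, and `passes` are all hypothetical, chosen only to show the pattern of running a model under several perturbations and reporting the worst case rather than the average.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Assumption: a "model" is any callable mapping a feature vector to a number.
Model = Callable[[Sequence[float]], float]


def mean_abs_error(model: Model, X: List[List[float]], y: List[float]) -> float:
    """Average absolute difference between predictions and targets."""
    return sum(abs(model(x) - t) for x, t in zip(X, y)) / len(y)


def add_noise(X: List[List[float]], scale: float, rng: random.Random) -> List[List[float]]:
    """Return a copy of X with Gaussian noise of the given scale on every feature."""
    return [[v + rng.gauss(0.0, scale) for v in row] for row in X]


def stress_report(model: Model, X, y, noise_scales=(0.0, 0.1, 0.5), seed=0) -> Dict[str, float]:
    """Evaluate the model once per (hypothetical) noise scenario."""
    rng = random.Random(seed)
    return {
        f"noise={s}": mean_abs_error(model, add_noise(X, s, rng), y)
        for s in noise_scales
    }


def worst_case(report: Dict[str, float]) -> Tuple[str, float]:
    """The scenario with the largest error: the number to show stakeholders."""
    name = max(report, key=report.get)
    return name, report[name]


def passes(report: Dict[str, float], max_error: float) -> bool:
    """Binary release check: every scenario must stay under the error budget."""
    return all(err <= max_error for err in report.values())


# Toy usage: the "model" just sums its inputs, and the targets are noise-free.
def toy_model(x: Sequence[float]) -> float:
    return sum(x)


data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [toy_model(x) for x in X]

report = stress_report(toy_model, X, y)
print(report)
print("worst case:", worst_case(report))
print("release ok:", passes(report, max_error=0.25))
```

In a real setting the scenarios would be domain-specific (for trading, things like volatile periods or missing ticks) and the error budget would be agreed with the stakeholder.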

Tell us about a time you analyzed financial data in support of a project.

My last job involved working with financial data to build quantitative strategies for trading in stock markets. One peculiar aspect of financial data is that it is incredibly noisy and shows frequent regime changes. Thus we need to build models that are sensitive to such changes and can adapt immediately. It's also extremely crucial to assess the model's robustness.

We decided to direct more effort towards building a stress testing engine that contains a suite of tests, much like in traditional software but for machine learning models. These tests are designed to improve the robustness of the models. Also, from a technical standpoint, we always preferred ensembling various models. We found this performs better in live trading, as the resultant model has less bias.

ML-wild

There is a wide gap between machine learning in the real world and academia. This repo is a collection of notes to myself, to spare my future self as much angst as possible. I faced many challenges in my work that don't really seem to bother academic ML researchers. More importantly, the experience I gained is largely visceral and can't be easily transferred. This repo is an attempt at forcing myself to clearly articulate the tricks, tools, and techniques needed to successfully develop ML-centric applications.

Working thoughts (unless stated otherwise, "models" refers to deep neural networks)

  1. When should I retrain my models? How do I arrive at a clear and sound rule for retraining? Model updating: I don't want to retrain from scratch, just update on the recent data. In my last job, we needed to do this with deep RL algorithms, a completely hopeless endeavour. Can I identify dataset shift before model degradation? In my experience, this call is based on historical experience with the model, coming from somebody more experienced than you.

  2. How to quantify improvement? Especially with deep RL, there is no clear best model.

  3. What to monitor? How to split the data? Neither is as simple as it seems. We want testing data to be as reflective of production data as possible. In the case of time series, we also want the training data to be as close as possible to the real-world/testing data. Thus the latest data is contested by both the training split and the testing split. Which split is the most reliable indicator of your model's live performance?

  4. How to test machine learning code? We avoided testing altogether. Well, there were a bunch of assert statements all over. Validation data is often limited. We need to track other metrics/properties of the model that are proxies for generalization. This also comes under testing, if we view tests as a series of binary functions that must all return True before putting the model into production.

  5. Finally, we want models that fail gracefully.
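On the dataset-shift question in the first item, one cheap check that needs no labels is to compare the distribution of a feature in recent data against the training data. Below is a minimal sketch using the Population Stability Index (PSI). This is a technique I'm swapping in for illustration, not something from my previous job. The function name `psi`, the 10 quantile bins, and the 0.1 / 0.25 thresholds (a common rule of thumb) are assumptions, not a tested retraining rule.

```python
import bisect
import math
import random
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` relative to `reference`.

    Bin edges are quantiles of the reference sample, so each bin holds
    roughly the same share of the reference data.
    """
    ref_sorted = sorted(reference)
    n = len(ref_sorted)
    edges = [ref_sorted[(i * n) // bins] for i in range(1, bins)]

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(reference)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Toy usage with synthetic data (an assumption; real features would come from logs).
rng = random.Random(0)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(5000)]
recent_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
recent_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

for name, sample in [("same", recent_same), ("shifted", recent_shifted)]:
    score = psi(train_feature, sample)
    # Rule-of-thumb thresholds (assumed): < 0.1 stable, > 0.25 significant shift.
    status = "stable" if score < 0.1 else "shift" if score > 0.25 else "watch"
    print(f"{name}: PSI={score:.3f} -> {status}")
```

A check like this only flags input drift; whether the drift actually hurts the model still has to be confirmed on labeled data once it arrives.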

Notes On ML Engineering/Infrastructure

Observations/obvious facts

  1. Building models is being commoditized. There has been a continuous trend towards unification of methods in CV, NLP, etc. AutoML can give you a fairly good model.

Over the next decade, there was a Cambrian explosion of web frameworks which started to converge to common infrastructure stacks like LAMP (Linux, Apache, MySQL, PHP/Perl/Python). By 2020, a number of components, such as the operating system, the web server and databases, have become commodities which few people have to worry about, allowing most developers to focus on the user-facing application layer e.g. using ReactJS

Layers

a. Data Warehouse: A key challenge is to lay out the data in a way that makes it easily discoverable and efficiently accessible by data science applications while maintaining a high degree of durability - naturally the data warehouse must not lose data. In addition, many data science applications care a lot about granular historical data, which can be used to train models, which poses an extra challenge for many traditional data warehousing solutions.

b. Compute Resources: A system that can provide computing resources on demand.

c. Job Scheduler: We want to feed fresh data into the algorithm at a regular cadence.

d. Architecture: At the architecture layer, you define what your application is and how it is going to work. This includes defining an algorithm or several algorithms and how the scheduling layer is supposed to connect the algorithms and the data pipelines together.

e. Versioning (experimentation and iteration): You can evaluate results reliably only if the applications are well isolated, that is, they must not interfere with each other. You must isolate the algorithms themselves, the input and output data, and the scheduled pipelines.

f. Model Operations: Help to deploy, monitor, and assure the validity of all models at all times, without hindering the speed of experimentation. To make this possible, the infrastructure needs to track metadata about all executions and models, from prototype to production.

g. Feature Engineering

h. Model Development
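To illustrate how the scheduler and architecture layers fit together, here is a minimal sketch of a pipeline expressed as a dependency graph and run in dependency order. The step names (`load_data`, `build_features`, `train_model`, `publish`) and the shared `ctx` dictionary are hypothetical. A real setup would use an orchestrator such as Airflow rather than this toy runner, which relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple


# Hypothetical steps; each reads from and writes to a shared context dict.
def load_data(ctx: dict) -> None:
    ctx["rows"] = [1.0, 2.0, 3.0, 4.0]


def build_features(ctx: dict) -> None:
    ctx["features"] = [r * 2 for r in ctx["rows"]]


def train_model(ctx: dict) -> None:
    # Stand-in for training: the "model" is just the mean of the features.
    ctx["model"] = sum(ctx["features"]) / len(ctx["features"])


def publish(ctx: dict) -> None:
    ctx["published"] = round(ctx["model"], 2)


STEPS: Dict[str, Callable[[dict], None]] = {
    "load_data": load_data,
    "build_features": build_features,
    "train_model": train_model,
    "publish": publish,
}

# Each step maps to the set of steps that must finish before it starts.
DEPENDS_ON: Dict[str, Set[str]] = {
    "build_features": {"load_data"},
    "train_model": {"build_features"},
    "publish": {"train_model"},
}


def run_pipeline() -> Tuple[List[str], dict]:
    """Run every step once, in an order that respects the dependencies."""
    ctx: dict = {}
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        STEPS[name](ctx)
    return order, ctx


order, ctx = run_pipeline()
print("execution order:", order)
print("published value:", ctx["published"])
```

Keeping the graph as plain data separate from the step functions is what lets a scheduler rerun the same pipeline on fresh data at a regular cadence.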

Netflix -> not just a recommender system. They also produce movies: reading scripts to predict hits, analyzing video clips (CV), etc. So one company can have many ML applications.

  • Volume - we want to support a large number of data science applications.
  • Velocity - we want to make it easy and quick to prototype and productionize data science applications.
  • Validity - we want to make sure that the results are valid and consistent.
  • Variety - we want to support many different kinds of data science models and applications.

Data is king unless:

a. Mission-critical/high-stakes application

b. Massive scale (here modeling starts to be crucial, as improving accuracy from 99% to 99.5% may mean millions of dollars in revenue)

A self-driving car company has one special application, so they should focus on building a single custom application - they don’t have the variety and the volume that would necessitate a generalized infrastructure. A small startup pricing used cars using a predictive model can quickly put together a basic application to get the job done - again, no need to invest in infrastructure initially. In contrast, a large multinational bank has hundreds of data science applications from credit rating to risk analysis and trading, each of which can be solved using well-understood (albeit sophisticated - “common” doesn’t imply simple or unadvanced in this context) models, so a generalized infrastructure is well justified. Over time companies tend to gravitate towards generalized infrastructure, no matter where they start. A self-driving car company that initially had one custom application will eventually need data science applications to support sales, marketing, or customer service.

Example of the complexity: As a concrete example, consider a hypothetical mid-size e-commerce store: They have a custom recommendation engine (“These products are recommended to you!”), a model to measure effectiveness of marketing campaigns (“Facebook ads seem to be performing better than Google ads in Connecticut”), an optimization model for logistics (“It is more efficient to dropship category B vs. keeping them in stock”), and a financial forecasting model estimating churn and the customer lifetime value (“customers buying X seem to churn less”). Each of these four applications is a factory in itself: They may involve multiple models, multiple data pipelines, multiple people, and multiple versions.

Avoid introducing incidental complexity, i.e. complexity that is not necessitated by the problem itself but is just an artifact of our approach. Incidental complexity is a huge problem for real-world data science because we have to deal with such a high level of inherent complexity that distinguishing between real problems and imaginary problems becomes hard.

Public clouds, such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure, have massively changed the infrastructure landscape by allowing anyone to access foundational layers that were previously available only for the largest companies.

Don't reinvent the wheel.

  • Amazon S3, which provides a virtually unlimited amount of storage with close to a perfect level of durability and high availability.
  • Nearly infinite, elastically scaling compute resources like Amazon EC2.

The objective is to merge the first 4 roles into a single individual:

  1. Data Scientist or Machine Learning Researcher develops and prototypes machine learning or other data science models.
  2. Machine Learning Engineer implements the model in a scalable, production-ready way.
  3. Data Engineer sets up data pipelines for input and output data, including data transformations.
  4. DevOps Engineer deploys applications in production and makes sure that all the systems stay up and running flawlessly.
  5. Application Engineer integrates the model with other business components, e.g. web applications, which are the consumers of the model.
  6. Infrastructure or Platform Engineer provides general pieces of infrastructure, such as data warehouses or compute platforms for many applications to use.

Data Scientist -> Looking at data, writing code, evaluating it, and analyzing results.

I used spreadsheets for experiment tracking, but there were no uniform guidelines about how to structure the spreadsheets.
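One lightweight way to get a uniform structure without leaving the spreadsheet world is to fix the columns in code and only ever append rows through a helper. The sketch below is an illustration, not a replacement for a tool like MLflow. The column names in `FIELDS` and the helpers `log_run` and `read_runs` are hypothetical.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone
from typing import Dict, List

# Assumed schema: adjust to whatever every experiment must record.
FIELDS = ["run_id", "timestamp", "git_commit", "dataset",
          "learning_rate", "val_metric", "notes"]


def log_run(path: str, **record) -> None:
    """Append one run to the CSV, rejecting columns outside the schema."""
    unknown = set(record) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    record.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(record)  # columns not supplied are left empty


def read_runs(path: str) -> List[Dict[str, str]]:
    """Load all logged runs; every value comes back as a string."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# Toy usage in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "experiments.csv")
    log_run(log_path, run_id="r1", dataset="v1", learning_rate=0.001, val_metric=0.91)
    log_run(log_path, run_id="r2", dataset="v1", learning_rate=0.01, val_metric=0.88)
    runs = read_runs(log_path)
    best = max(runs, key=lambda r: float(r["val_metric"]))
    print("best run:", best["run_id"])
```

The resulting file still opens in any spreadsheet program, but every row is guaranteed to have the same columns.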

END-END MLOPS Projects

Don't need a bigger boat1 MSYS Course material no-ops ML Another end-end project Robustness-gym and Mandaline DAG CARDS N IT'S CODE

Eugene Yan's blog

Books Good Book on MLFLOW

Research Papers:

Mandoline: Model Evaluation under Distribution Shift

Model Patching: Closing the Subgroup Performance Gap with Data Augmentation

Understanding Dataset Shift and Potential Remedies

Broad Strokes.

Assorted notes from various sources.

Step 1: Store and Collect data

Step 2: Prepare Training data

a. Hand Labeling

  • Linear (more people for more labels)
  • Privacy issues (maybe you can't show the data to a person)
  • Costly (radiology reports vs. comment toxicity)
  • Not adaptive (e.g. you now want to train on 3 classes instead of the 2 you previously hand-labeled)

If you have to follow this route, be careful with ambiguity (specify the task precisely and give instructions on how to label ambiguous cases, to ensure quality labels) and multiplicity (variation in label quality due to different annotators; typically data from various sources/vendors is merged).
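Both issues can be measured before training on the labels. As a rough sketch (the label values and helper names are made up for illustration): Cohen's kappa quantifies how much two annotators agree beyond chance, and a majority vote merges labels from several annotators while flagging ties as ambiguous cases to send back for review.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(labels: Sequence[Hashable]) -> Optional[Hashable]:
    """Most common label for one item, or None when the top count is tied."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == top_count) > 1:
        return None  # ambiguous item: needs clearer instructions or a re-label
    return top_label


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k]
                   for k in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy usage with hypothetical comment-toxicity labels from two annotators.
annotator_1 = ["toxic", "ok", "ok", "toxic", "ok", "ok"]
annotator_2 = ["toxic", "ok", "toxic", "toxic", "ok", "ok"]
print("kappa:", round(cohen_kappa(annotator_1, annotator_2), 3))
print("merged:", [majority_vote(pair) for pair in zip(annotator_1, annotator_2)])
```

A low kappa usually points back at the labeling instructions rather than at the annotators.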

How to Deal with unlabeled Data ?

Screen Shot 2022-01-07 at 10 25 18 PM

Good Quality to study ML from the foundations

Practical Resources

AppliedAI notes

Causal Inference in DataScience

Mathematics

BEST RESOURCES Optimization: these notes or these longer notes

Linear Algebra: this course

Good Resources Linear Algebra Optimization Another Course in Optimization though not as good Robust ML Course

Practice Oriented Courses/Resources

ML Course Learn to build open Source Package Bayesian Data Analysis

Interview Prep Resources This blog

Model evaluation and Selection

License:MIT License


Languages

Language:Jupyter Notebook 100.0%