jarvis

Potential use cases

quick switch to application or a tab
layout application windows in a grid
automate interactions with applications (share)
start/stop recording
take screenshot
quick note taking
copy to different clipboard buffers "copy this as blah", "paste from blah"
set reminders
notify
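
As a rough illustration of the named clipboard idea ("copy this as blah", "paste from blah"), the sketch below parses the two phrases and keeps the buffers in a dictionary. The class name `NamedClipboard` and its `handle` method are hypothetical and not part of the existing codebase; reading the current selection and typing the pasted text are left to the caller.

```python
import re
from typing import Dict, Optional

# Phrases taken from the use case above; everything else here is an assumption.
_COPY = re.compile(r"^copy this as (?P<name>.+)$", re.IGNORECASE)
_PASTE = re.compile(r"^paste from (?P<name>.+)$", re.IGNORECASE)


class NamedClipboard:
    """Hypothetical in-memory store of named clipboard buffers."""

    def __init__(self) -> None:
        self._buffers: Dict[str, str] = {}

    def handle(self, command: str, selection: str = "") -> Optional[str]:
        """Store `selection` on a copy command; return stored text on a paste command."""
        command = command.strip()
        match = _COPY.match(command)
        if match:
            self._buffers[match.group("name").lower()] = selection
            return None
        match = _PASTE.match(command)
        if match:
            # Returns None if nothing was ever copied under that name.
            return self._buffers.get(match.group("name").lower())
        return None
```

Buffer names are lowercased so that "Paste from Address" finds what "copy this as address" stored.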

We try to follow the Google Python Style Guide.

Roadmap

See the Brainstorm doc for the "crazy ideas."

Features / Enhancements

Keyboard shortcut to open App and click record
"Listening" animation after clicking record
Dedicated console on GUI for debug logs (currently logs are truncated)
Record from history (save a sequence of voice commands as a macro)

Bugs / Known Issues

If you forget to call "exit" in stream mode and leave the microphone running for a while before your next command, Google continues to record audio and tries to process an extremely large transcript, which causes the program to time out or drag. We need a way to detect silence in stream mode and clear the audio buffer. One option is a timeout parameter in the Microphone or GoogleTranscriber that clears the buffer if no commands are heard for N seconds.
[Mac] The TaskBar loads slowly and gets stuck sometimes
[Mac] "Switch to X" gets stuck if program is minimized
[Mac] GUI layout isn't formatted properly. There appear to be differences between monitors or operating systems that we need to work out. Ideally, the GUI looks the same across all monitors and operating systems.
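
For the stream-mode silence issue in the first item above, one possible shape for the timeout is sketched below. `SilenceAwareBuffer` is a hypothetical helper, not the existing Microphone or GoogleTranscriber class, and it assumes the caller can already tell whether a chunk contains speech (for example with an energy threshold).

```python
import time
from typing import Callable, List


class SilenceAwareBuffer:
    """Hypothetical audio buffer that drops stale audio after a silent stretch."""

    def __init__(self, timeout_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_sec = timeout_sec
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        self._chunks: List[bytes] = []
        self._last_speech_at = clock()

    def add_chunk(self, chunk: bytes, is_speech: bool) -> None:
        """Store a chunk, or clear the buffer if no speech was heard for timeout_sec."""
        now = self._clock()
        if is_speech:
            self._last_speech_at = now
        elif now - self._last_speech_at >= self.timeout_sec:
            # Long silence: discard everything instead of sending it for transcription.
            self._chunks.clear()
            return
        self._chunks.append(chunk)

    def drain(self) -> bytes:
        """Return the buffered audio and reset the buffer."""
        audio = b"".join(self._chunks)
        self._chunks.clear()
        return audio
```

Short pauses inside a command are kept, so only silence longer than N seconds resets the buffer.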

UI Features/Bugs

Use real-time sound detection from the mic to play the animation (don't wait for Google)?
Show/Hide GUI using the Python API (make keyboard shortcut)
Send Show/Hide event to Python when the user opens the window
UI should always be on top of all windows (pin the window)?

Developer Setup

Mac Setup

(Tested on macOS Big Sur 11.4, M1 and Intel chips)

Install Pyenv and Python 3.8.10

brew install pyenv
pyenv install 3.8.10
pyenv global 3.8.10

# Run this and follow instructions for how to update your PATH, ~/.profile, ~/.zprofile, and ~/.zshrc. Then do a full logout and log back in.
pyenv init

# Verify pyenv is working
>> python -V 
Python 3.8.10

Install homebrew prerequisites

# Microphone support
brew install portaudio

# Sphinx NLP library (Optional, also requires python 3.6)
# https://pypi.org/project/pocketsphinx/
# https://github.com/Uberi/speech_recognition/blob/master/reference/pocketsphinx.rst
brew install swig

# For AppKit
brew install cairo gobject-introspection

# For Kivy
# https://kivy.org/doc/stable/installation/installation-osx.html#install-source-os
brew install pkg-config sdl2 sdl2_image sdl2_ttf sdl2_mixer gstreamer

brew install openssl

Update environment variables to properly configure clang

Either add these to ~/.profile or run them manually in the shell before running pip install -r requirements.txt

export GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1
export GRPC_PYTHON_BUILD_SYSTEM_ZLIB=1
export CPLUS_INCLUDE_PATH="${CPLUS_INCLUDE_PATH:+${CPLUS_INCLUDE_PATH}:}/opt/homebrew/opt/openssl/include"

Install Chrome Web driver (for browser automation)

Follow the ChromeDriver installation instructions and download the version that matches your version of Chrome. On macOS you also have to grant permissions to the web driver.

Set up Google Cloud Project

First, create a GCP project or use the Jarvis one (jarvis-1626279785926). If you create one, you'll need to set up a billing account and enable the Cloud Speech APIs.

Next, install the SDK https://cloud.google.com/sdk/docs/install and configure it.

gcloud init
gcloud config list

# Should see something like
[core]
account = bfortuner@gmail.com
disable_usage_reporting = True
project = jarvis-1626279785926

# Login to get credentials
gcloud auth application-default login

Ubuntu Setup

(Tested on Ubuntu 20.04)

Install Python Virtual Environment

sudo apt install python3-venv

Install library dependencies

sudo apt install python3.8-dev

# Kivy depends on this
sudo apt install python3-tk
sudo apt install libcairo2-dev

# SpeechRecognition package depends on these
sudo apt install libportaudio2 portaudio19-dev

# PyGObject depends on this
sudo apt install libgirepository1.0-dev

# Taskbar icon support requires this
sudo apt install gir1.2-appindicator3-0.1

# If running without a GUI and pyautogui raises KeyError: 'DISPLAY', add this to ~/.bashrc (or your shell's equivalent).
export DISPLAY=:0
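
If editing ~/.bashrc is not convenient, the same default can be applied from Python before pyautogui is imported. This is a minimal sketch: `ensure_display` is a hypothetical helper, and it assumes an X server is actually reachable at `:0`.

```python
import os
from typing import MutableMapping, Optional


def ensure_display(default: str = ":0",
                   environ: Optional[MutableMapping[str, str]] = None) -> str:
    """Set DISPLAY to `default` only if it is missing, and return the value in effect."""
    env = os.environ if environ is None else environ
    return env.setdefault("DISPLAY", default)


# Call ensure_display() before `import pyautogui`, since the variable
# has to be set by the time pyautogui looks it up.
```

An existing DISPLAY value is left untouched, so this does not interfere with desktop sessions.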

Python Setup

Create Virtualenv (Python 3.8)

pip3 install virtualenv
virtualenv .venv --python=python3
source .venv/bin/activate

Install python dependencies

# export ARCHFLAGS="-arch x86_64"  # for pyaudio on older versions of MacOS (not required on Big Sur)
pip install -r requirements.txt

Install Kivy (Mac Only)

# The M1 architecture requires we install Kivy from source
# https://kivy.org/doc/stable/gettingstarted/installation.html#from-source
# GitHub no longer serves the unauthenticated git:// protocol, so clone over https
git clone https://github.com/kivy/kivy.git kivy_repo && cd kivy_repo
python -m pip install -e ".[base]"  && cd ..

Install atomac (Mac Only)

Atomac seems to have a dependency issue that prevents installing it directly with pip install, so we install it from source.

git clone https://github.com/pyatom/pyatom.git pyatom_repo && cd pyatom_repo
python -m pip install future
python -m pip install . && cd ..

Verify things are working

# Say something
python scratch/speech_recognition_examples.py

# Verify GCP auth is working
python scratch/google_speech_recognition_example.py

# A window with "Hello world" should open
python scratch/kivy_example.py

# Verify Selenium is installed correctly
python -m scratch.selenium_example

# Run the Kivy app (then click Record and "Switch to Chrome")
python main.py

# OR, run the electron app
python electron.py

# And in prophet...
npm install
npm run start

App steps

# Load contacts from google
python -m higgins.automation.contacts.google

# Install pytorch
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Testing

Run unit tests

# Optionally prepend SPEED_LIMIT=N_SEC to slow down the automation when debugging
pytest tests/

Distributing the app

Some options for packaging the app into a native executable:

GPT3 Notes

How to avoid making up facts (and large document parsing)
Fine-tune a model to improve truthfulness
- Sample 10+ generations and pass them through a discriminator / classifier to determine truthfulness
Preventing Hallucination Facebook
Longformer - Larger texts
Improving reasoning skills

Ideas to improve truthfulness

Generator samples 10 answers, discriminator evaluates the answers and selects the best
Fine-tune the model on your facts
Lower the temperature
Include "I don't know" as a valid response, with examples (false positives)
Incorporate the model's confidence (log probs?) to evaluate the reply (and determine how many times to sample?)
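
The first and fourth ideas above can be combined in a few lines. In this sketch `generate` and `score` are hypothetical callables standing in for the generator model and the discriminator, and the `min_score` threshold is an assumed parameter, not a tuned value.

```python
from typing import Callable, List, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 10,
              min_score: float = 0.5) -> str:
    """Sample n answers, keep the highest-scoring one, or fall back to "I don't know"."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored: List[Tuple[float, str]] = [(score(prompt, c), c) for c in candidates]
    # Compare on the score only, so ties keep the earliest candidate.
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < min_score:
        # No candidate was convincing enough to return.
        return "I don't know"
    return best
```

The score could come from a classifier or from the model's own log probabilities, whichever turns out to be more reliable.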

Ideas to process large documents

Website for GPT-based projects http://gptcrush.com/
Email-related product from GPT https://www.hypertype.co/

NOTE: You pay money for every document searched
Pre-search the data with a cheap model (Ada) or non-model-based search engine (Gmail API, ElasticSearch, txt AI)
Break large documents into snippets
Pre-process the document into summarizations or salient facts
Run semantic search, then completion (like answers/ endpoint)
I have 6M documents and fast search with SOLR (ElasticSearch)
Steps
- Search relevant documents with cheap local engine (elasticsearch, SOLR)
- Upload chunks of these articles dynamically based on the most relevant chunk from that article
- Pass results to semantic search or answers endpoint
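
The steps above can be sketched without any paid API calls in the pre-search stage. The example below uses a plain keyword count as a stand-in for the cheap local engine (ElasticSearch or SOLR would replace it in practice); the function names are illustrative only.

```python
import re
from collections import Counter
from typing import List


def split_into_chunks(text: str, max_words: int = 50) -> List[str]:
    """Break a document into snippets of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def _tokens(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query: str, chunk: str) -> int:
    """Count how often the query's distinct terms occur in the chunk."""
    counts = Counter(_tokens(chunk))
    return sum(counts[term] for term in set(_tokens(query)))


def top_chunks(query: str, documents: List[str], k: int = 3,
               max_words: int = 50) -> List[str]:
    """Return up to k chunks with a non-zero score, best first."""
    chunks = [c for doc in documents for c in split_into_chunks(doc, max_words)]
    ranked = sorted(chunks, key=lambda c: keyword_score(query, c), reverse=True)
    return [c for c in ranked[:k] if keyword_score(query, c) > 0]
```

Only the chunks returned by `top_chunks` would then be passed to semantic search or the answers endpoint, which keeps the number of paid document searches small.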

Semantic Search / Email Processing

Info Retrieval with ReRank (Could be used for email search. Can operate on paragraphs.)
Topic Modeling - topic modeling / clustering. Guided (manually seed topics), semi, or unsupervised. Could be used to categorize emails by their semantic meaning (recruiter emails, family/personal, flights, verification codes, orders/receipts, promotions, etc.)
[TextRank] https://towardsdatascience.com/textrank-for-keyword-extraction-by-python-c0bae21bcec0 - extract most important keywords from emails and discard the rest
Extractive Summarization Overview - Gensim
Extracting data from HTML tables and websites 1 2
TextPipe Extracting text from HTML pages
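
To make the TextRank item above concrete, here is a simplified TextRank-style keyword ranker in pure Python. It is a sketch, not the implementation from the linked article or from Gensim: the stopword list, window size, and iteration count are all assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "on", "the", "to", "your"}


def textrank_keywords(text: str, top_k: int = 5, window: int = 2,
                      damping: float = 0.85, iterations: int = 30) -> List[str]:
    """Rank words with PageRank over an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for i, word in enumerate(words):
        # Link each word to the next `window` words.
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

For email processing, the returned keywords could be kept as a compact representation of each message while the rest of the body is discarded.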

Data Labeling

https://prodi.gy/features (Scriptable annotation tool for NLP tasks -- and some CV)
