edgeai-computer-pointer-controller

The Computer Pointer Controller app controls the mouse pointer using eye and head position.

Introduction

The Computer Pointer Controller app controls the mouse pointer using eye gaze and head pose estimation. The app takes a video file or webcam stream as input, uses the Intel OpenVINO toolkit to run inference on each frame, and moves the mouse pointer accordingly.
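The last step of that pipeline, turning a gaze vector into pointer motion, can be sketched as below. This is a minimal illustration, not the app's actual code: `gaze_to_mouse_delta` and its `speed` factor are hypothetical, and negating the y component assumes that a positive gaze y means "looking up" while screen y grows downward.

```python
from typing import Tuple


def gaze_to_mouse_delta(gaze_x: float, gaze_y: float, speed: float = 10.0) -> Tuple[int, int]:
    """Map a gaze vector to a relative pointer move in pixels (hypothetical helper)."""
    # Horizontal gaze moves the pointer horizontally, scaled by `speed`.
    dx = round(gaze_x * speed)
    # Assumption: positive gaze y means looking up, but screen y grows downward.
    dy = round(-gaze_y * speed)
    return dx, dy


if __name__ == "__main__":
    dx, dy = gaze_to_mouse_delta(0.5, 0.2)
    print(dx, dy)  # 5 -2
    # With PyAutoGUI installed, the move itself could be applied as:
    # import pyautogui; pyautogui.moveRel(dx, dy)
```

Keeping the mapping a pure function makes it easy to tune `speed` without moving the real pointer.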

Project Set Up and Installation

Setup

Install the OpenVINO toolkit following the instructions here. Here are the instructions for macOS.
Clone the repo.
Run the Jupyter notebook computer-pointer-controller-workflow.ipynb. This downloads the required models into the ./models directory. Model details are provided in a later section.
Now source the OpenVINO environment:

source /opt/intel/openvino/bin/setupvars.sh -pyver 3.7

The -pyver value can be any Python 3 version installed on the system.
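To confirm that the environment was picked up, a quick check like the following can be run from the same shell. It is a sketch that only tests whether an `openvino` package is importable; it does not verify the toolkit version, and the helper name `openvino_available` is hypothetical.

```python
import importlib.util
import sys


def openvino_available() -> bool:
    """Return True if an 'openvino' package can be found on the current Python path."""
    return importlib.util.find_spec("openvino") is not None


if __name__ == "__main__":
    # The interpreter version should match the -pyver value passed to setupvars.sh.
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    if openvino_available():
        print("openvino package found")
    else:
        print("openvino package not found: was setupvars.sh sourced in this shell?")
```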

Install all the Python requirements from ./requirements.txt into your conda or venv environment.

Demo

Now that the setup is done, we are ready to run the workflow.

Run the job either with queue_job.sh or by calling the Python script directly. The shell script is a wrapper around the following call:

python3 ./src/main.py -fm "$FACE_DETECTION_MODEL_PATH" \
                -hm "$HEAD_POSE_ESTIMATION_MODEL_PATH" \
                -lm "$FACIAL_LANDMARKS_DETECTION_MODEL_PATH" \
                -gm "$GAZE_ESTIMATION_MODEL_PATH" \
                -i "$INPUT" \
                -o "$OUTPUT" \
                -d "$DEVICE" \
                -t "$THRESHOLD"
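Because the model flags take paths without an extension, each one is expected to point at an OpenVINO IR pair: a `.xml` topology file and a `.bin` weights file with the same base name. A minimal sketch of resolving and checking that pair follows; `ir_paths` and `check_ir` are hypothetical helpers, not part of `main.py`.

```python
from pathlib import Path
from typing import Tuple


def ir_paths(model_path: str) -> Tuple[Path, Path]:
    """Return the (.xml, .bin) IR file pair for a model path given without extension."""
    base = Path(model_path)
    # Append the suffix instead of using with_suffix(), so base names that
    # already contain dots are not truncated.
    return base.with_name(base.name + ".xml"), base.with_name(base.name + ".bin")


def check_ir(model_path: str) -> None:
    """Raise FileNotFoundError if either IR file is missing."""
    missing = [str(p) for p in ir_paths(model_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError("Missing IR file(s): " + ", ".join(missing))
```

Calling `check_ir` on each of the four model paths before loading gives a clearer error than a failed network read.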

To get detailed help, run:

python3 ./src/main.py -h

Here is the help output:

  -h, --help            show this help message and exit
  -fm FACE_DETECTION_MODEL, --face-detection-model FACE_DETECTION_MODEL
                        Path to Face Detection model without extension
  -hm HEAD_POSE_MODEL, --head-pose-model HEAD_POSE_MODEL
                        Path to Head Pose Estimation model without extension
  -lm FACIAL_LANDMARKS_MODEL, --facial-landmarks-model FACIAL_LANDMARKS_MODEL
                        Path to Facial Landmarks Detection model without
                        extension
  -gm GAZE_ESTIMATION_MODEL, --gaze-estimation-model GAZE_ESTIMATION_MODEL
                        Path to Gaze Estimation model without extension
  -i INPUT, --input INPUT
                        Path to input video. Use 'cam' for capturing video
                        stream from camera
  -l CPU_EXTENSION, --cpu_extension CPU_EXTENSION
                        MKLDNN (CPU)-targeted custom layers. Absolute path to
                        shared lib with the kernels impl.
  -d DEVICE, --device DEVICE
                        Specify the target device to infer on; Can be: CPU,
                        GPU, FPGA or MYRIAD
  -t THRESHOLD, --threshold THRESHOLD
                        Probability threshold for detections
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Path to output directory
  -v SHOW_INTERMEDIATE_VISUALIZATION, --show-intermediate-visualization SHOW_INTERMEDIATE_VISUALIZATION
                        Shows intermediate step visualization
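The options above map naturally onto Python's `argparse`. The sketch below re-creates that interface from the help text only; which options are required and the defaults for `--device`, `--threshold`, and `--output-dir` are assumptions, not values taken from `main.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical re-creation of the CLI described by `main.py -h`."""
    p = argparse.ArgumentParser(description="Computer Pointer Controller")
    # Model paths are given without the .xml/.bin extension.
    p.add_argument("-fm", "--face-detection-model", required=True,
                   help="Path to Face Detection model without extension")
    p.add_argument("-hm", "--head-pose-model", required=True,
                   help="Path to Head Pose Estimation model without extension")
    p.add_argument("-lm", "--facial-landmarks-model", required=True,
                   help="Path to Facial Landmarks Detection model without extension")
    p.add_argument("-gm", "--gaze-estimation-model", required=True,
                   help="Path to Gaze Estimation model without extension")
    p.add_argument("-i", "--input", required=True,
                   help="Path to input video. Use 'cam' for the camera stream")
    p.add_argument("-l", "--cpu_extension", default=None,
                   help="MKLDNN (CPU)-targeted custom layers")
    # Assumed defaults: CPU device, 0.6 threshold, no output directory.
    p.add_argument("-d", "--device", default="CPU",
                   help="Target device: CPU, GPU, FPGA or MYRIAD")
    p.add_argument("-t", "--threshold", type=float, default=0.6,
                   help="Probability threshold for detections")
    p.add_argument("-o", "--output-dir", default=None,
                   help="Path to output directory")
    p.add_argument("-v", "--show-intermediate-visualization", default=None,
                   help="Shows intermediate step visualization")
    return p
```

`argparse` derives attribute names from the long options, so `--face-detection-model` becomes `args.face_detection_model`, which matches the metavars shown in the help output.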

A sample input video, demo.mp4, is provided in the ./original_videos directory. When intermediate visualization is enabled, you may need to move the demo video window, because the intermediate visualization window sometimes opens underneath it.
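The `-i` value is either a video path such as the provided demo.mp4 or the literal `cam`. A minimal sketch of turning it into something `cv2.VideoCapture` accepts (a camera index or a file name) is shown below; mapping `cam` to camera index 0 assumes the default webcam, and `resolve_input` is a hypothetical helper.

```python
from typing import Union


def resolve_input(source: str) -> Union[int, str]:
    """Map the -i argument to a cv2.VideoCapture source (hypothetical helper)."""
    # 'cam' selects the first camera (index 0, assumed to be the default webcam).
    if source == "cam":
        return 0
    # Anything else is treated as a path to a video file.
    return source


# Usage (requires opencv-python):
# import cv2
# cap = cv2.VideoCapture(resolve_input("./original_videos/demo.mp4"))
```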

Benchmarks

Here are the benchmarks for CPU on my local system:

Precision	Load Time	Inference Time	Effective FPS
FP16	497 ms	23.2 s	2.5
FP32	543 ms	23.1 s	2.4
FP16-INT8	693 ms	23.1 s	2.55
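The README does not say how Effective FPS was derived. One common definition is frames processed divided by total inference time; the sketch below measures it that way, and the same `time.perf_counter()` pattern can wrap model loading to get a load time. `benchmark` and its `infer` callback are hypothetical stand-ins for the app's per-frame pipeline.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def benchmark(frames: Iterable[Any], infer: Callable[[Any], Any]) -> Tuple[int, float, float]:
    """Run `infer` on every frame; return (frame_count, total_seconds, effective_fps)."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        infer(frame)  # stand-in for face -> landmarks/head pose -> gaze inference
        count += 1
    total = time.perf_counter() - start
    fps = count / total if total > 0 else 0.0
    return count, total, fps


if __name__ == "__main__":
    # Dummy workload: 20 "frames" that each take about 10 ms.
    n, secs, fps = benchmark(range(20), lambda _: time.sleep(0.01))
    print(f"{n} frames in {secs:.2f} s -> {fps:.1f} FPS")
```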

Results

Here are the results:

Decreasing model precision generally decreases accuracy. It is usually expected to reduce inference time as well, but that does not always hold: on this CPU, total inference time was essentially the same for all three precisions (23.1 to 23.2 s).
FP32 was slightly slower, at 2.4 effective FPS versus 2.5 for FP16, while the accuracy drop from FP32 to FP16 was not significant. This may be because the models were trained and optimized to work well at lower precision.
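To make the comparison concrete, the relative change of each precision against FP32 can be computed from the CPU benchmark table. This is plain arithmetic on the reported numbers; it says nothing about accuracy, which the benchmark table does not include.

```python
# Load time and effective FPS copied from the CPU benchmark table.
BENCHMARKS = {
    "FP32": {"load_ms": 543, "fps": 2.4},
    "FP16": {"load_ms": 497, "fps": 2.5},
    "FP16-INT8": {"load_ms": 693, "fps": 2.55},
}


def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


base = BENCHMARKS["FP32"]
for name, row in BENCHMARKS.items():
    load = pct_change(row["load_ms"], base["load_ms"])
    fps = pct_change(row["fps"], base["fps"])
    print(f"{name:<10} load {load:+6.1f}%   fps {fps:+5.1f}%")
```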

Edge Cases

Face detection currently handles only one face. It is not yet clear how the app behaves when multiple people are in the frame.
Poor lighting conditions may lead to incorrect gaze predictions when the eye regions are not properly recognized.
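One way to make the single-face behaviour explicit is to keep only the most confident detection above the threshold. The sketch assumes the SSD-style output layout used by OpenVINO face-detection models, where each row is `[image_id, label, conf, x_min, y_min, x_max, y_max]` with coordinates normalized to [0, 1]; `pick_face` is hypothetical and may differ from how the app chooses a face.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]


def pick_face(detections: Sequence[Sequence[float]], width: int, height: int,
              threshold: float = 0.6) -> Optional[Box]:
    """Return the pixel box of the most confident face above `threshold`, or None."""
    best = None
    for det in detections:
        conf = det[2]
        # Keep only detections above the threshold, preferring higher confidence.
        if conf >= threshold and (best is None or conf > best[2]):
            best = det
    if best is None:
        return None
    x_min, y_min, x_max, y_max = best[3:7]
    # Scale normalized coordinates to pixel coordinates of the frame.
    return (round(x_min * width), round(y_min * height),
            round(x_max * width), round(y_max * height))
```

Picking the largest box instead is an alternative policy when several people are in frame.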

kinshuk4 / edgeai-computer-pointer-controller