Exploring MediaPipe on Raspberry Pi 4

The first two numbers are the normalized (0.0-1.0) xy position of a fingertip. The third number is an approximate z distance (cm) of the hand from the camera, assuming the fingers are horizontal. The fourth number is the MediaPipe hand-tracking frame rate: ~14 fps with max_num_hands=1 (~8 fps with max_num_hands=2).
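Because the x and y values are normalized, turning them into pixel positions only needs the frame size. A minimal sketch, assuming a 640x480 frame; the helper name `to_pixels` is made up for illustration and is not part of MediaPipe:

```python
def to_pixels(x_norm, y_norm, width, height):
    """Map normalized (0.0-1.0) coordinates to integer pixel coordinates."""
    # Clamp first: a landmark can fall slightly outside the frame.
    x_norm = min(max(x_norm, 0.0), 1.0)
    y_norm = min(max(y_norm, 0.0), 1.0)
    # Keep the result inside the valid index range 0..size-1.
    px = min(int(x_norm * width), width - 1)
    py = min(int(y_norm * height), height - 1)
    return px, py

print(to_pixels(0.5, 0.25, 640, 480))  # (320, 120)
```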


Raspberry Pi 4 Model B (tested)

Raspberry Pi Camera 1.3 (tested)


Raspberry Pi OS with desktop
Release date: May 7th 2021
Kernel version: 5.10
Size: 1,180MB

Make sure the Raspberry Pi is up to date:

sudo apt-get update
sudo apt-get upgrade
sudo reboot

Next four lines are from

sudo apt-get install ffmpeg python3-opencv
sudo apt-get install libxcb-shm0 libcdio-paranoia-dev libsdl2-2.0-0 libxv1 libtheora0 libva-drm2 libva-x11-2 libvdpau1 libharfbuzz0b
sudo apt-get install libbluray2 libatlas-base-dev libhdf5-103 libgtk-3-0 libdc1394-22 libopenexr23

sudo pip3 install mediapipe-rpi4

sudo apt-get install espeak
espeak hello (check if working)

Numpy and Pygame are already in Pi OS.

Experiment 1

Python code to track a hand, compute the length of the middle finger, approximate the z distance (cm) from the Pi camera, and vary an audio sine wave as a function of z:

import mediapipe as mp
import cv2
import pygame
import numpy
import threading
import time
import os

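x1 = 3.0
y1 = 3.0
x2 = 3.0
z = 60.0        # initial distance (cm) from camera; keeps the sound thread from exiting at start
fps = 0.0

sampling = 44100
pygame.mixer.init(sampling, -16, 1)   # 44.1 kHz, signed 16-bit, mono

def sound1():
    # Runs in a thread: once per second, plays a tone whose pitch follows z.
    while True:

        if z < 20.0:   # the main loop signals exit by leaving z below 20
            break

        z_hold = z   # mediapipe hand tracking ~8Hz, python thread generates a sine wave as a function of z every 1 second

        # Sine wave at 10*z_hold Hz, scaled to int16 to match the signed 16-bit mixer
        data = (32767 * numpy.sin(2 * numpy.pi * z_hold * numpy.arange(sampling) * 10 / sampling)).astype(numpy.int16)
        sound = pygame.mixer.Sound(data)
        sound.set_volume(0.1)   # set volume low
        sound.play()
        time.sleep(1.0)         # wait for the 1 second tone before reading z again
        string = "espeak " + str(numpy.round(z_hold,1))
        # os.system(string)     # optionally speak the distance

thread1 = threading.Thread(target=sound1)
thread1.start()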

mp_drawing = mp.solutions.drawing_utils
mp_hands = mp.solutions.hands

cap = cv2.VideoCapture(0)

with mp_hands.Hands(min_detection_confidence=0.8, min_tracking_confidence=0.5, max_num_hands=1) as hands:

    while cap.isOpened():

        time1 = time.time()

        ret, frame = cap.read()
        if not ret:   # no frame from the camera
            break
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image = cv2.flip(image, 1)
        image.flags.writeable = False
        results = hands.process(image)   # mediapipe analyzes a frame
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
        if results.multi_hand_landmarks:
            for num, hand in enumerate(results.multi_hand_landmarks):
                mp_drawing.draw_landmarks(image, hand, mp_hands.HAND_CONNECTIONS,
                                        mp_drawing.DrawingSpec(color=(0, 0, 0), thickness=2, circle_radius=4),
                                        mp_drawing.DrawingSpec(color=(250, 50, 250), thickness=2, circle_radius=2))
            # max_num_hands=1, so "hand" is the only detected hand
            x1 = hand.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_TIP].x
            y1 = hand.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_TIP].y
            x2 = hand.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_MCP].x
            z = int(numpy.round(10/(x2-x1+1e-09)))            # assuming middle finger ~horizontal in front of camera
                                                              # 1e-09 prevents zero division 1/0 for improbable x2 = x1 = 0
        time2 = time.time()
        fps = 1/(time2-time1)

        string2 = str(numpy.round(x1,2)) + " " + str(numpy.round(y1,2)) + " " + str(z) + "cm" + " " + str(numpy.round(fps,1))+" fps"

        image2 = cv2.putText(
                img = image,
                text = string2,
                org = (100, 100),
                fontFace = cv2.FONT_HERSHEY_DUPLEX,
                fontScale = 1.0,
                color = (0, 255, 0),
                thickness = 2)
        cv2.imshow('Hand Tracking (hand z < 20cm or q key to exit)', image2)

        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

        if  z < 20.0:
            break

z = 1.0         # thread exits

cap.release()
cv2.destroyAllWindows()

os.system("espeak 'stopping program'")

Call python script "".


Exploring how Experiment 1 works
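The distance estimate in Experiment 1 relies on the finger looking shorter as the hand moves away: z is taken as inversely proportional to the normalized horizontal length of the middle finger, and the tone pitch is 10 Hz per cm of z. A rough sketch of those two steps in isolation; the function names and the sample values are illustrative assumptions, while the constant 10 and the 1e-9 term come from the script:

```python
def estimate_z_cm(x_tip, x_mcp, scale=10.0):
    """Approximate hand distance (cm) from the normalized finger length."""
    # Same form as the script: a longer finger on screen means a closer hand.
    # The 1e-9 term avoids division by zero when both x values coincide.
    return int(round(scale / (x_mcp - x_tip + 1e-9)))

def tone_frequency_hz(z_cm):
    """Pitch of the sine wave in the sound thread: 10 Hz per cm."""
    return 10.0 * z_cm

z = estimate_z_cm(x_tip=0.30, x_mcp=0.50)
print(z, tone_frequency_hz(z))  # 50 500.0
```

Note that a finger pointing the other way gives a negative difference and therefore a negative z, which the script treats like any other z below 20 cm.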

Experiment 2 - Depth perception

Explores depth perception using MediaPipe with two cameras.

Two identical USB webcams are placed ~6 cm apart to mimic human eyes. In each OpenCV window, MediaPipe hand tracking gives the XY position of the fingertip, and the Z position is computed from the two views. A small terminal window at the bottom prints (x, y, z) of the fingertip from the left view. The Z values are not properly scaled and are dimensionless. The experiment works in the sense that Z increases as the hand moves away from both cameras.

A Pi camera and a USB webcam should also work, but they most likely have different fields of view and so would give different computed Z positions. This is not critical, since the experiment is about learning to compute Z from two different views.

Unlike Experiment 1, it does not depend on a finger being horizontal to compute Z (but it requires two cameras).
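The two-camera estimate comes down to one line: the farther the fingertip, the smaller the difference between its x position in the left and right views, so Z is taken as the inverse of that difference. A sketch under that assumption, with made-up sample values; `relative_z` is a hypothetical name and the result is dimensionless:

```python
def relative_z(x_left, x_right):
    """Dimensionless depth from the fingertip x position in two views."""
    # abs() keeps z positive regardless of which camera is left or right;
    # 1e-9 avoids division by zero when both views report the same x.
    return abs(1.0 / (x_right - x_left + 1e-9))

near = relative_z(0.60, 0.40)   # large difference: hand close to the cameras
far = relative_z(0.52, 0.48)    # small difference: hand farther away
print(near < far)  # True
```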

Here is the Python script for one camera.

import numpy
import mediapipe
import cv2
import time
from sys import stdout

x_index = 0
y_index = 0

mp_drawing = mediapipe.solutions.drawing_utils
mp_hands = mediapipe.solutions.hands

cap = cv2.VideoCapture(0)

with mp_hands.Hands(min_detection_confidence=0.8, min_tracking_confidence=0.5, max_num_hands=1) as hands:

    while cap.isOpened():

        time1 = time.time()

        ret, frame = cap.read()
        if not ret:   # no frame from the camera
            break
        image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image = cv2.flip(image, 1)
        image.flags.writeable = False

        results = hands.process(image)  # mediapipe analyzes image
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
        if results.multi_hand_landmarks:
            for num, hand in enumerate(results.multi_hand_landmarks):
                mp_drawing.draw_landmarks(image, hand, mp_hands.HAND_CONNECTIONS,
                                        mp_drawing.DrawingSpec(color=(0, 0, 0), thickness=2, circle_radius=4),
                                        mp_drawing.DrawingSpec(color=(250, 50, 250), thickness=2, circle_radius=2))

            x_index = hand.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP].x
            y_index = hand.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP].y
            print(x_index, y_index)   # one "x y" line per detection for the script reading the pipe
        stdout.flush()          # keep pipe in next script moving

        time2 = time.time()
        fps = 1/(time2-time1)

        string = str(numpy.round(x_index,2)) + " " + str(numpy.round(y_index,2)) + " " + str(numpy.round(fps,1))+" fps"
        image = cv2.putText(
                img = image,
                text = string,
                org = (100, 100),
                fontFace = cv2.FONT_HERSHEY_DUPLEX,
                fontScale = 1.0,
                color = (0, 255, 0),
                thickness = 2)
        cv2.imshow('Hand Tracking (q key to exit)', image)

        if cv2.waitKey(10) & 0xFF == ord('q'):
            break

cap.release()
cv2.destroyAllWindows()


Name script "". Run "python3" in a terminal. Let that camera be the left eye.

Next, copy the script with "cp" and, in the copy, change the capture line to "cap = cv2.VideoCapture(2)" (I am not sure what happened to index 1). Run the copy with "python3" in another terminal. Let that camera be the right eye.

If both are working, press the "q" key in each OpenCV window to exit the scripts. The next script runs both scripts and computes Z positions from their XY positions.

import subprocess
import threading
import time
import numpy

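x1 = "2.0"
x2 = "1.0"
z = 0.0

proc = subprocess.Popen(["python3",""],stdout=subprocess.PIPE, text=True)

proc2 = subprocess.Popen(["python3",""],stdout=subprocess.PIPE, text=True)

def data1():
    # Left view: keep the latest "x y" line as a list of strings
    global x1
    for line in proc.stdout:
        x1 = line.split()

def data2():
    # Right view: on every new line, recompute z from the two x positions
    global x2, z
    for line in proc2.stdout:
        x2 = line.split()

        # x1 is a list once the left view has reported at least one line
        if isinstance(x1, list) and len(x1) >= 2 and len(x2) >= 1:
            z = numpy.abs(1/(float(x2[0])-float(x1[0])+1e-09))
            print(x1[0], x1[1], numpy.round(z, 2))   # (x, y, z) of finger tip from left view

thread1 = threading.Thread(target=data1)
thread1.start()

thread2 = threading.Thread(target=data2)
thread2.start()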
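The reader threads trust every line on the pipe to start with a number. A slightly more defensive variant could skip anything else that ends up on stdout. This is only a sketch: `parse_xy` is a hypothetical helper, and the two-number line format is an assumption about what the camera script prints.

```python
def parse_xy(line):
    """Return (x, y) floats from a line like '0.52 0.31', or None if malformed."""
    parts = line.split()
    if len(parts) < 2:
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        # Not numeric, for example a log message from another library.
        return None

print(parse_xy("0.52 0.31\n"))   # (0.52, 0.31)
print(parse_xy("WARNING: ..."))  # None
```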



Hand landmarks (eg. coding hand.landmark[mp_hands.HandLandmark.MIDDLE_FINGER_TIP].x)

Nicholas Renotte's excellent introduction to python mediapipe hand tracking

Pygame mixer is used for sound synthesis
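For reference, a tone buffer for a signed 16-bit mono mixer can be built with the standard library alone. This sketch assumes the 44100 Hz sampling rate used in Experiment 1; the function name `sine_samples` and the amplitude are my own choices. A buffer of this kind should be usable as the buffer argument of pygame.mixer.Sound, though that step is not shown or tested here.

```python
import array
import math

def sine_samples(freq_hz, seconds=1.0, rate=44100, amplitude=0.2):
    """Return signed 16-bit mono samples of a sine wave as array('h')."""
    n = int(rate * seconds)
    peak = int(32767 * amplitude)   # well below full scale to keep the volume low
    return array.array(
        "h",
        (int(peak * math.sin(2 * math.pi * freq_hz * i / rate)) for i in range(n)),
    )

samples = sine_samples(500.0)
print(len(samples), samples.itemsize)  # 44100 2
```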

What is Pygame?

Experiment 2 is based on the same concept as the "Basics" section of the OpenCV depth map tutorial, but it uses a single x and x' for one fingertip instead of computing a depth map from the left and right images.
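In that tutorial the relation is disparity = x - x' = Bf/Z, where B is the distance between the two cameras and f is the focal length. With those two numbers, the dimensionless Z of Experiment 2 could be scaled to real units. A sketch with assumed values: B = 6 cm matches the webcam spacing described earlier, while the focal length of 640 pixels and the 640-pixel frame width are placeholders, not measured values.

```python
def depth_cm(x_left_norm, x_right_norm, baseline_cm=6.0, focal_px=640.0, width_px=640):
    """Depth from stereo disparity: Z = B * f / (x - x')."""
    # Convert the normalized x positions to a disparity in pixels.
    disparity_px = abs(x_left_norm - x_right_norm) * width_px
    if disparity_px == 0:
        return float("inf")  # identical x in both views: point at infinity
    return baseline_cm * focal_px / disparity_px

print(round(depth_cm(0.55, 0.45), 1))  # 60.0
```

A real setup would need a calibrated focal length for the cameras actually used before the numbers mean anything in centimeters.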

