OpenAdaptAI / OpenAdapt

AI-First Process Automation with Large Language Models (LLMs), Large Action Models (LAMs), Large Multimodal Models (LMMs), and Visual Language Models (VLMs)

Home Page: https://www.OpenAdapt.AI


Consolidate video.py and capture.py for local hardware acceleration

abrichr opened this issue

Feature request

capture/_macos.py uses AVFoundation, and capture/_windows.py uses screen_recorder_sdk, which uses the Media Foundation API. These are likely to be more performant than the mss library used in record.py and video.py, but capture does not currently support extracting time-aligned screenshots (while video does):

(openadapt-py3.10) abrichr@MacBook-Pro-4 OpenAdapt % ffprobe captures/2024-02-19-10-43-33.mov  
ffprobe version 6.1.1 Copyright (c) 2007-2023 the FFmpeg developers
  built with Apple clang version 15.0.0 (clang-1500.1.0.2.5)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/6.1.1_3 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopenvino --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox
  libavutil      58. 29.100 / 58. 29.100
  libavcodec     60. 31.102 / 60. 31.102
  libavformat    60. 16.100 / 60. 16.100
  libavdevice    60.  3.100 / 60.  3.100
  libavfilter     9. 12.100 /  9. 12.100
  libswscale      7.  5.100 /  7.  5.100
  libswresample   4. 12.100 /  4. 12.100
  libpostproc    57.  3.100 / 57.  3.100
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x7fd88a704b40] moov atom not found
captures/2024-02-19-10-43-33.mov: Invalid data found when processing input

This issue will be complete once these files have been modified to support saving video files, recorded via openadapt.capture, from which time-aligned screenshots can be extracted. I.e. we need to modify openadapt.capture._macos.Capture and openadapt.capture._windows.Capture to supply screenshots in memory instead of writing to a file, e.g. replacing self.session.addOutput_(self.file_output).
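
For illustration, the in-memory interface might look something like the sketch below (hypothetical names, not OpenAdapt's actual API), with each frame paired with a capture timestamp so it can be aligned with recorded events:

# Hypothetical interface sketch; names are illustrative only.
from typing import Callable

# Callback receiving raw frame bytes and the capture timestamp in seconds.
FrameCallback = Callable[[bytes, float], None]

class Capture:
    def start(self, on_frame: FrameCallback) -> None:
        """Begin capturing; invoke on_frame for each frame as it is produced."""
        ...

    def stop(self) -> None:
        """Stop capturing and release platform resources."""
        ...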

Motivation

Local hardware acceleration -> maximum performance

Via ChatGPT:

To replace self.session.addOutput_(self.file_output) with a mechanism that calls a callback with a screenshot in your macOS capture implementation, you would typically use AVCaptureVideoDataOutput instead of AVCaptureMovieFileOutput. AVCaptureVideoDataOutput allows you to receive video frames as they are captured, which you can then process in a callback method.

Here’s a conceptual outline of how to set this up:

  1. Use AVCaptureVideoDataOutput: This class provides a way to capture video frames as they are produced by the capture session.

  2. Set up a Delegate for Frame Capture: Implement a delegate that conforms to the AVCaptureVideoDataOutputSampleBufferDelegate protocol. This delegate will receive callbacks with the video frames.

  3. Implement the Callback Method: The delegate's callback method receives a CMSampleBufferRef that contains the frame data. You can then convert this sample buffer into a format suitable for your needs (e.g., a screenshot).

Step-by-Step Implementation

First, modify your Capture class to include an AVCaptureVideoDataOutput and set up the delegate:

from Foundation import NSObject, NSLog
import AVFoundation as AVF
import libdispatch  # dispatch_queue_create lives in PyObjC's libdispatch wrapper, not AVFoundation
from Quartz import CGMainDisplayID

class SampleBufferDelegate(NSObject):
    def captureOutput_didOutputSampleBuffer_fromConnection_(self, captureOutput, sampleBuffer, connection):
        # This method is called with a CMSampleBufferRef `sampleBuffer`
        # You can convert this to a screenshot here and call your desired callback
        NSLog("Received a frame")
        # Conversion to screenshot and callback call goes here

class Capture:
    def __init__(self):
        # Initialize as before...
        self.videoDataOutput = None
        self.videoDataOutputQueue = None
        self.sampleBufferDelegate = None

    def start(self, audio: bool = False, camera: bool = False):
        # Setup as before...
        
        # Setup video data output
        self.videoDataOutput = AVF.AVCaptureVideoDataOutput.alloc().init()
        self.videoDataOutputQueue = libdispatch.dispatch_queue_create(b"videoDataOutputQueue", None)
        self.sampleBufferDelegate = SampleBufferDelegate.alloc().init()
        
        self.videoDataOutput.setSampleBufferDelegate_queue_(self.sampleBufferDelegate, self.videoDataOutputQueue)
        
        if self.session.canAddOutput_(self.videoDataOutput):
            self.session.addOutput_(self.videoDataOutput)

Notes:

  • Conversion of CMSampleBufferRef: Converting a CMSampleBufferRef to a more usable format (like an image or numpy array) inside the captureOutput_didOutputSampleBuffer_fromConnection_ method requires additional steps: extracting the image buffer (CVImageBufferRef) and then converting it to your desired format. This can be complex and might involve additional libraries (like OpenCV) for image manipulation; a hedged sketch of one such conversion appears after these notes.

  • Performance Considerations: Processing video frames in real-time can be CPU-intensive. Ensure your callback and conversion logic is optimized for performance. Depending on your requirements, you might want to drop frames or reduce the resolution to maintain performance.

  • Threading: The callback method captureOutput_didOutputSampleBuffer_fromConnection_ is called on the dispatch queue specified when setting the sample buffer delegate. Make sure any UI updates or intensive processing tasks are dispatched back to the main thread or handled appropriately to avoid blocking the capture queue.

This approach allows you to intercept video frames as they are captured, enabling you to process and use them as screenshots within your application.
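
For the conversion step, here is a hedged sketch of one way to turn the CMSampleBufferRef into a PIL image plus a presentation timestamp, assuming PyObjC's CoreMedia, Quartz, and AppKit wrappers (exact constants, e.g. NSBitmapImageFileTypePNG, vary by PyObjC/macOS version and should be verified):

import io

import AppKit
import CoreMedia
import Quartz
from PIL import Image

def sample_buffer_to_image(sample_buffer):
    """Return (PIL.Image, presentation time in seconds) for a CMSampleBufferRef."""
    # Presentation timestamp, needed for time-aligned screenshots.
    pts = CoreMedia.CMSampleBufferGetPresentationTimeStamp(sample_buffer)
    seconds = CoreMedia.CMTimeGetSeconds(pts)

    # Extract the pixel buffer and wrap it in a CIImage.
    image_buffer = CoreMedia.CMSampleBufferGetImageBuffer(sample_buffer)
    ci_image = Quartz.CIImage.imageWithCVImageBuffer_(image_buffer)

    # Render to PNG bytes via NSBitmapImageRep, then decode with PIL.
    rep = AppKit.NSBitmapImageRep.alloc().initWithCIImage_(ci_image)
    png_data = rep.representationUsingType_properties_(
        AppKit.NSBitmapImageFileTypePNG, None
    )
    return Image.open(io.BytesIO(bytes(png_data))), seconds

This helper could then be called from captureOutput_didOutputSampleBuffer_fromConnection_ before handing the image off to the rest of the pipeline; note the PNG round-trip is convenient but not the fastest path.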

@0dm thoughts? 🙏 😄

This could work. I will look into implementing this sometime this week.

Regarding this:

Performance Considerations: Processing video frames in real-time can be CPU-intensive. Ensure your callback and conversion logic is optimized for performance. Depending on your requirements, you might want to drop frames or reduce the resolution to maintain performance.

See max_cpu_percent and related settings for an attempt to implement this: https://github.com/OpenAdaptAI/OpenAdapt/pull/569/files#diff-57d8577d1fb5faaf576a6f5663741c83e672378c13c91a1db036fb7a3f05e067R559
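
For illustration, a minimal frame-dropping guard in that spirit might look like this (hypothetical sketch, not the PR's actual code):

import psutil

MAX_CPU_PERCENT = 80.0  # illustrative threshold, analogous in spirit to max_cpu_percent

def should_process_frame() -> bool:
    """Drop frames while system-wide CPU utilization exceeds the threshold."""
    return psutil.cpu_percent(interval=None) <= MAX_CPU_PERCENT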

@Cody-DV for a Windows approach, see:

https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/capture/_windows.py

https://github.com/Andrey1994/screen_recorder_sdk/blob/31417c8af136a7b8b44702e69fa0bb6ebb5c2b13/python/screen_recorder_sdk/screen_recorder.py

https://chat.openai.com/share/19cc37a0-750f-451a-95cf-acad27efb7b6

import cv2
import numpy as np
import time

from screen_recorder_sdk import screen_recorder

def capture_frames_in_memory(duration, fps):
    """
    Captures frames for a given duration and fps, and writes them to a video file.
    
    :param duration: Duration to capture video for in seconds
    :type duration: int
    :param fps: Frames per second
    :type fps: int
    """
    frame_interval = 1.0 / fps
    num_frames = int(duration * fps)

    # Initialize video capture parameters
    params = screen_recorder.RecorderParams()
    screen_recorder.init_resources(params)

    # Prepare the first screenshot to determine resolution
    image = screen_recorder.get_screenshot()
    frame = np.array(image)
    height, width, layers = frame.shape
    size = (width, height)

    # Initialize a video writer using OpenCV.
    # FourCC is a 4-byte code used to specify the video codec; the list of
    # available codes can be found at fourcc.org.
    # 'mp4v' is a codec that is compatible with MP4 files.
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    video_writer = cv2.VideoWriter('video.mp4', fourcc, fps, size)

    start_time = time.time()
    for _ in range(num_frames):
        image = screen_recorder.get_screenshot()
        frame = np.array(image)
        video_writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        time.sleep(frame_interval)  # approximate pacing; ignores capture time, so effective fps runs low

    video_writer.release()
    screen_recorder.free_resources()

    elapsed_time = time.time() - start_time
    print(f"Capturing completed in {elapsed_time:.2f} seconds.")

# Example usage
if __name__ == "__main__":
    duration = 5  # seconds
    fps = 10
    capture_frames_in_memory(duration, fps)

We can replace the cv2 writer with what we have in https://github.com/OpenAdaptAI/OpenAdapt/blob/main/openadapt/video.py
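
For example, if the writer is PyAV-based, the hand-off could look roughly like this (illustrative names, not video.py's actual interface); the key point is setting each frame's pts from its capture time so screenshots can later be extracted time-aligned:

from fractions import Fraction

import av

def write_timestamped_video(path, frames_with_times, width, height, fps=30):
    """frames_with_times: iterable of (rgb_ndarray, seconds_since_start), in time order."""
    container = av.open(path, mode="w")
    stream = container.add_stream("h264", rate=fps)
    stream.width, stream.height = width, height
    stream.pix_fmt = "yuv420p"
    # Fine-grained time base so arbitrary capture times are representable.
    time_base = Fraction(1, 90000)
    stream.codec_context.time_base = time_base

    for rgb, seconds in frames_with_times:
        frame = av.VideoFrame.from_ndarray(rgb, format="rgb24")
        frame.pts = int(seconds / time_base)  # pts in time_base units; must increase monotonically
        for packet in stream.encode(frame):
            container.mux(packet)

    for packet in stream.encode():  # flush any buffered frames
        container.mux(packet)
    container.close()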