
ESP-BOX Developer Guide

ESP-BOX is a new-generation AIoT application development kit released by Espressif Systems. It is powered by the Espressif ESP32-S3 SoC, which supports AI acceleration and Wi-Fi + Bluetooth 5 (LE) wireless connectivity. ESP-BOX has a built-in 2.4-inch LCD with capacitive touch, dual microphones, a speaker, and two Pmod™-compatible expansion interfaces. With out-of-the-box voice wake-up and speech recognition, together with Espressif's open-source development framework and examples, developers can build a wide range of AIoT applications quickly.

Hardware Overview

Specifications:

Ports:

Digilent Pmod™ Interface Specification

Technical Architecture

The ESP-BOX development framework consists of three main layers: the system layer, the framework layer, and the application layer.

System Layer

ESP-IDF is the open-source development framework for Espressif SoCs, including the ESP32, ESP32-S, and ESP32-C series. It integrates a large number of system-level components, a set of development tools for building and debugging code, and detailed development documentation:

System-level components: multi-tasking and multi-core support based on FreeRTOS; a heap memory allocator that supports external PSRAM; uniform interfaces across multiple ESP chips through the LL, HAL, Driver, and VFS layers; and standard network protocol stacks such as TCP/IP, HTTP, MQTT, and WebSocket.
Development tools: a GCC cross-toolchain for building code, a JTAG debugging tool based on OpenOCD, real-time tracing based on Segger SystemView, flash and eFuse programmers, and more.
Development documentation: API references, guides, and notes covering Espressif's chip platforms and software versions. Developers can view these documents online or download them.

Framework Layer

In addition to system-level support, ESP-BOX integrates several solution frameworks that address typical problems in different scenarios: the speech recognition framework ESP-SR, the human-computer interaction framework ESP-HMI, and the end-to-end IoT development framework ESP-RainMaker. The functions and customization options of these frameworks are briefly introduced below.

ESP-SR

ESP-SR is a speech recognition framework developed by Espressif for intelligent voice applications. It uses the ESP32-S3 AI extension instructions to provide high-performance, reliable speech recognition. The framework currently includes the following modules:

Espressif's acoustic algorithms (Audio Front-End, AFE), which are designed to improve speech recognition performance in far-field or noisy environments so that developers obtain high-quality, stable audio data. The AFE mainly includes AEC (Acoustic Echo Cancellation), BSS (Blind Source Separation), and NS (Noise Suppression). ESP-SR wraps these algorithms in simple APIs. Without understanding the details of each algorithm, developers only need to select and combine the algorithms required by their application scenario, feed in the audio data in the format the API requires, and read back the results.
Espressif wake word engine (WakeNet), which monitors a continuous audio stream in real time for a specific wake word and switches the smart device from standby to the voice interaction state. WakeNet uses MFC (Mel-Frequency Cepstrum) to extract features from the input audio clip, then classifies those features with a neural network optimized for ESP32-S3. It triggers the wake-up signal when it detects the wake word, with an effective recognition rate of no less than 80% in noisy environments. ESP-BOX provides offline wake words such as “Hi, ESP”, “Hi, Lexin”, and “Xiao Ai Tong Xue”, and also supports customized wake words.
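
To make the idea of watching a continuous audio stream for a wake word more concrete, here is a minimal sketch in plain C of one common way a streaming detector can turn per-frame classifier scores into a single trigger: it averages the most recent scores and fires once when the average reaches a threshold. This is not WakeNet's algorithm and not part of the ESP-SR API. The names wake_detector_t, wake_detector_init, wake_detector_feed, and the window length WAKE_WINDOW are hypothetical, and the per-frame score is assumed to come from some classifier that outputs a value between 0 and 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: number of consecutive frame scores that are averaged. */
#define WAKE_WINDOW 4

/* Hypothetical detector state: a ring buffer of the most recent frame scores. */
typedef struct {
    float scores[WAKE_WINDOW];
    size_t next;      /* slot that the next score overwrites */
    size_t count;     /* number of valid scores, saturates at WAKE_WINDOW */
    float threshold;  /* trigger when the window average reaches this value */
} wake_detector_t;

/* Clear the score history and set the trigger threshold. */
void wake_detector_init(wake_detector_t *d, float threshold)
{
    assert(d != NULL);
    for (size_t i = 0; i < WAKE_WINDOW; i++) {
        d->scores[i] = 0.0f;
    }
    d->next = 0;
    d->count = 0;
    d->threshold = threshold;
}

/*
 * Feed one per-frame score (assumed to lie in [0, 1]).
 * Returns true when the window is full and its average reaches the
 * threshold. The history is cleared after a trigger so that a single
 * utterance fires only once.
 */
bool wake_detector_feed(wake_detector_t *d, float score)
{
    assert(d != NULL);
    d->scores[d->next] = score;
    d->next = (d->next + 1) % WAKE_WINDOW;
    if (d->count < WAKE_WINDOW) {
        d->count++;
    }
    if (d->count < WAKE_WINDOW) {
        return false;  /* not enough history yet */
    }

    float sum = 0.0f;
    for (size_t i = 0; i < WAKE_WINDOW; i++) {
        sum += d->scores[i];
    }
    if (sum / (float)WAKE_WINDOW >= d->threshold) {
        wake_detector_init(d, d->threshold);
        return true;
    }
    return false;
}
```

Averaging over a short window means a single high-scoring frame does not wake the device, while a sustained run of high scores does. The window length and threshold trade off false wake-ups against missed wake words and would need tuning for a real model.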

ESP-HMI

ESP-HMI is a human-computer interaction solution designed by Espressif on top of the open-source GUI framework LVGL. Using the computing power and peripheral resources of the ESP32-S3, it supports data visualization, touch control, and other functions:

Screen driver library: compatible with a variety of mainstream color screens, monochrome screens, and touch screens that use 8080, SPI, or I2C interfaces, and exposes a uniform display driver interface. Developers can use an already adapted screen controller directly or add their own screen by implementing the display driver interface, so the same UI can be deployed to a variety of screens.
Open-source UI templates: Espressif has designed multiple open-source UI templates for IoT control panels, multimedia players, voice assistants, and other IoT scenarios, which developers can use as a starting point for more polished custom designs.
Hardware design reference: Espressif provides hardware reference designs, circuit schematics, and development guides for screen applications to help developers complete hardware designs quickly.
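
The uniform display driver interface described above can be pictured as a table of function pointers that each screen controller fills in. The following is a minimal sketch in plain C under that assumption. It is not the ESP-HMI or LVGL API: display_driver_t, mock_display, mock_display_pixel, and ui_fill_rect are hypothetical names, and the "screen" is an 8x4 in-memory framebuffer so the example can run on a host machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical uniform interface that every screen controller implements. */
typedef struct {
    const char *name;
    uint16_t width;
    uint16_t height;
    int (*init)(void);
    /* Copy a w*h block of RGB565 pixels to (x, y); returns 0 on success. */
    int (*draw_bitmap)(uint16_t x, uint16_t y, uint16_t w, uint16_t h,
                       const uint16_t *rgb565);
} display_driver_t;

/* Mock controller: an in-memory framebuffer standing in for real hardware. */
#define MOCK_W 8u
#define MOCK_H 4u
static uint16_t mock_fb[MOCK_W * MOCK_H];

static int mock_init(void)
{
    memset(mock_fb, 0, sizeof(mock_fb));
    return 0;
}

static int mock_draw_bitmap(uint16_t x, uint16_t y, uint16_t w, uint16_t h,
                            const uint16_t *rgb565)
{
    /* Reject null buffers and blocks that fall outside the screen. */
    if (rgb565 == NULL || (uint32_t)x + w > MOCK_W || (uint32_t)y + h > MOCK_H) {
        return -1;
    }
    for (uint16_t row = 0; row < h; row++) {
        memcpy(&mock_fb[(size_t)(y + row) * MOCK_W + x],
               &rgb565[(size_t)row * w], (size_t)w * sizeof(uint16_t));
    }
    return 0;
}

/* Read back one pixel of the mock screen (for inspection only). */
uint16_t mock_display_pixel(uint16_t x, uint16_t y)
{
    assert(x < MOCK_W && y < MOCK_H);
    return mock_fb[(size_t)y * MOCK_W + x];
}

const display_driver_t mock_display = {
    .name = "mock",
    .width = MOCK_W,
    .height = MOCK_H,
    .init = mock_init,
    .draw_bitmap = mock_draw_bitmap,
};

/* UI code depends only on the interface, never on a specific controller. */
#define UI_MAX_ROW 64
int ui_fill_rect(const display_driver_t *drv, uint16_t x, uint16_t y,
                 uint16_t w, uint16_t h, uint16_t color)
{
    assert(drv != NULL && drv->draw_bitmap != NULL);
    uint16_t line[UI_MAX_ROW];
    if (w == 0 || w > UI_MAX_ROW) {
        return -1;
    }
    for (uint16_t i = 0; i < w; i++) {
        line[i] = color;
    }
    /* Send one row at a time so only a small line buffer is needed. */
    for (uint16_t row = 0; row < h; row++) {
        int err = drv->draw_bitmap(x, (uint16_t)(y + row), w, 1, line);
        if (err != 0) {
            return err;
        }
    }
    return 0;
}
```

Because ui_fill_rect only sees a display_driver_t pointer, supporting another screen in this sketch would mean supplying another struct with its own init and draw_bitmap functions, while the UI code stays unchanged.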

ESP-RainMaker

ESP-RainMaker is a complete, end-to-end platform for IoT development. It provides developers with one-stop IoT product functions and examples, such as device provisioning, cloud connection, and OTA updates. It includes the following components:

Device-side SDK: developers can use Espressif chips to define their IoT products by designing device parameters and attributes, and complete device-side firmware development quickly.
Secure cloud middleware: developers do not need to manage device certificates or cloud infrastructure. They can use Espressif chips directly to establish a secure communication link with the cloud.
iOS / Android app: integrates the necessary functions such as device discovery, Wi-Fi provisioning, user login, and device association and control, so developers do not need to write additional app code.

AI Instructions and Libraries

With its Xtensa® 32-bit LX7 dual-core CPU and AI/DSP extended instruction set, ESP32-S3 can efficiently perform vector operations, complex-number operations, FFT, and similar operations, which accelerates neural network computation and digital signal processing. Using Espressif's open-source software libraries built on the AI/DSP extended instruction set, AI developers can implement high-performance image recognition, voice wake-up, and speech recognition.
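
For orientation, the vector and complex-number operations mentioned above reduce to small inner loops like the ones below. This is a portable scalar reference in plain C, not code from Espressif's libraries, and the names dot_s16, cplx_s64_t, and cplx_mul_s16 are hypothetical. It only shows the arithmetic that such optimized kernels compute; it does not use the extended instructions itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Dot product of two int16 vectors.
 * Each product fits in 32 bits, but the running sum may not,
 * so the accumulator is 64 bits wide.
 */
int64_t dot_s16(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Hypothetical result type for a complex product of int16 operands. */
typedef struct {
    int64_t re;
    int64_t im;
} cplx_s64_t;

/* (ar + j*ai) * (br + j*bi) = (ar*br - ai*bi) + j*(ar*bi + ai*br) */
cplx_s64_t cplx_mul_s16(int16_t ar, int16_t ai, int16_t br, int16_t bi)
{
    cplx_s64_t out;
    out.re = (int64_t)ar * br - (int64_t)ai * bi;
    out.im = (int64_t)ar * bi + (int64_t)ai * br;
    return out;
}
```

Neural network layers and FFT butterflies are built largely from multiply-accumulate and complex-multiply steps of this kind, which is why instructions that process several elements at once can speed them up.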

Device Drivers

TBD:

Application Layer