Deepware's repositories
Sextans
An FPGA accelerator for general-purpose Sparse-Matrix Dense-Matrix Multiplication (SpMM).
SparseP
SparseP is the first open-source Sparse Matrix Vector Multiplication (SpMV) software package for real-world Processing-In-Memory (PIM) architectures. [https://arxiv.org/abs/2201.05072]
Serpens
An HBM-based FPGA SpMV accelerator
trans-fat
An FPGA Accelerator for Transformer Inference (BERT)
How_to_optimize_in_GPU
A series of articles on GPU kernel optimization, covering several basic kernels in detail: elementwise, reduce, sgemv, sgemm, etc. The optimized kernels perform at or near the theoretical hardware limit.
EdgeBERT
HW/SW co-design of sentence-level energy optimizations for latency-aware multi-task NLP inference
Paddle-Lite
Multi-platform, high-performance deep learning inference engine for PaddlePaddle (『飞桨』)
SpinalHDL_CNN_Accelerator
CNN accelerator implemented with Spinal HDL
dory
A tool to deploy Deep Neural Networks on PULP-based SoCs
nemo
NEural Minimizer for pytOrch
lenet5_hls
FPGA Accelerator for CNN using Vivado HLS
neural-compressor
Intel® Neural Compressor (formerly Intel® Low Precision Optimization Tool) provides unified APIs for network compression techniques such as low precision, sparsity, pruning, and knowledge distillation across different deep learning frameworks, in pursuit of the best inference performance.
openvino_tensorflow
OpenVINO™ integration with TensorFlow
bnna
Binary neural network (BNN) accelerator
FPGA_AcceleratorWrapper
Accelerator wrapper with AXI3 DMA and AXI Lite for control
approximate-spmv-topk
Public repository for the DAC 2021 paper "Scaling up HBM Efficiency of Top-K SpMV for Approximate Embedding Similarity on FPGAs"
MVU
Neural Network accelerator powered by MVUs and RISC-V.
PE-array-for-LeNet-accelerator-based-on-FPGA
A 4×5 processing-element (PE) array for an FPGA-based LeNet accelerator.
Yolo-Fastest
:zap: An ultra-lightweight, general-purpose object detection algorithm based on YOLO: only 250 MFLOPs of compute and a 666 KB ncnn model; runs at 15+ fps on a Raspberry Pi 3B and 178+ fps on mobile devices.
XNNPACK
High-efficiency floating-point neural network inference operators for mobile, server, and Web
ara
The PULP Ara is a 64-bit vector unit, compatible with the RISC-V Vector Extension v0.10, that works as a coprocessor to CORE-V's CVA6 core
hci
Heterogeneous Cluster Interconnect to bind special-purpose HW accelerators with general-purpose cluster cores