- OSI network layers
- Programming language layers
- The underlying implementations matter
- Different implementations can produce the same results at very different speeds
```sql
select count(*) from Deposits
inner join households on households.id = deposits.HouseholdId
where CashierId = 'd89c8029-4808-4cea-b505-efd8279dc66d'
```
The SQL query above can produce different execution plans, and therefore run at very different speeds, depending on indexes and table statistics
- Python is easy to use
```python
a = "hello"
b = "world"
c = a + b
```
Equivalent C code:
```c
#include <stdio.h>
#include <string.h>

int main() {
    // Define the two strings
    char a[] = "hello";
    char b[] = "world";
    // Calculate the combined string length (add 1 for the null terminator)
    int combinedLength = strlen(a) + strlen(b) + 1;
    // Allocate memory for the combined string
    char c[combinedLength];
    // Manually copy each character of the first string
    for (int i = 0; i < strlen(a); i++) {
        c[i] = a[i];
    }
    // Append the second string after the first
    for (int i = strlen(a), j = 0; i < combinedLength - 1; i++, j++) {
        c[i] = b[j];
    }
    // Add the null terminator
    c[combinedLength - 1] = '\0';
    printf("%s\n", c);
    return 0;
}
```
- Python 3 is dynamically typed: type checks and memory allocation happen at runtime
```python
a = 30      # bound to an int
a = "aaa"   # rebound to a str
```
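A quick illustration of this point (a minimal sketch; the exact byte count is a CPython implementation detail): the same name can be rebound to objects of different types, because every object carries its own type metadata that is checked at runtime.

```python
import sys

a = 30
assert type(a) is int   # the type lives on the object, checked at runtime
a = "aaa"
assert type(a) is str   # the same name now refers to a different type

# every CPython object also stores type and reference-count metadata,
# so even a small int occupies tens of bytes
print(sys.getsizeof(30))
```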
- Python lists are slow
- NumPy arrays are significantly faster than Python lists for several reasons:
- Memory layout:
  - NumPy arrays: Stored contiguously in memory, meaning all elements of the same data type are stored together. This allows faster access and manipulation, as the CPU can read large chunks of data efficiently.
  - Python lists: Heterogeneous in memory, meaning elements can be of different data types and scattered across memory. Accessing and manipulating elements is slower because the CPU needs to jump around memory to find the required data.
- Data type:
  - NumPy arrays: Homogeneous, meaning all elements are of the same data type (e.g., all integers or all floats). This allows optimized operations and efficient use of memory.
  - Python lists: Heterogeneous, meaning different elements can have different data types. This requires more overhead for storing and manipulating data.
- Operations:
  - NumPy arrays: Optimized C code for vectorized operations, meaning operations are applied to entire arrays at once. This is significantly faster than looping over individual elements of a list.
  - Python lists: Interpreted Python code for operations, meaning each operation is executed element by element. This is slower and less efficient.
- Built-in functions:
  - NumPy arrays: Have a vast library of optimized functions for mathematical, statistical, and linear algebra operations. These functions take advantage of the underlying C code and contiguously stored data for maximum performance.
  - Python lists: Lack dedicated functions for complex operations; these need to be implemented with loops, leading to slower execution.
- Parallelism:
  - NumPy arrays: Can leverage multi-core processors for parallel execution of operations, further improving performance on large datasets.
  - Python lists: Primarily single-threaded, limiting the potential for performance improvement on modern hardware.
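A minimal timing sketch of the vectorization point (exact timings depend on the machine and Python build, so no numbers are claimed here):

```python
import time
import numpy as np

n = 1_000_000
xs = list(range(n))
arr = np.arange(n, dtype=np.int64)

# pure-Python loop over a list: one interpreted operation per element
t0 = time.perf_counter()
squares_list = [x * x for x in xs]
t_list = time.perf_counter() - t0

# single vectorized NumPy operation: the loop runs in compiled C code
t0 = time.perf_counter()
squares_np = arr * arr
t_np = time.perf_counter() - t0

print(f"list comprehension: {t_list:.4f}s, NumPy vectorized: {t_np:.4f}s")
```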
- Mojo is not the only solution: Numba, Cython, Julia, ... can do similar things
- Mojo combines the usability of Python with the performance of C, unlocking unparalleled programmability of AI hardware and extensibility of AI models.
- Static typing (e.g., `ComplexFloat64` in the Mojo example)
- Parallelism
- Uses instruction sets provided by the specific hardware
- Squared addition is a common operation, and the Mojo standard library provides a special function for it called `squared_add`, which is implemented using FMA instructions for maximum performance
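As a sketch of what the operation computes (assuming `squared_add(z, c)` means z² + c, as in the Mandelbrot-style examples; this plain-Python version has none of the FMA fusion):

```python
def squared_add(z: complex, c: complex) -> complex:
    # z*z + c in one call; Mojo's version fuses the multiply and add
    # into hardware FMA instructions, this sketch does not
    return z * z + c

print(squared_add(1 + 1j, 0.5j))  # (1+1j)**2 = 2j, plus 0.5j -> 2.5j
```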
- Compiler optimizations
- loop unrolling
- Both Cython and Mojo are language extensions for Python, designed to improve performance and efficiency.
- Parallelism in the real world
- It allows us to use Single Instruction, Multiple Data (SIMD) instructions: each instruction performs the same operation on multiple data samples.
- Arm NEON instruction set (e.g., `float32x4_t`, `vld1q_f32`, `vaddq_f32`, ...)
```c
// Load input and weight tensors
float32x4_t in[4];
float32x4_t wt[4];
for (int i = 0; i < 4; ++i) {
    in[i] = vld1q_f32(input);
    wt[i] = vld1q_f32(weight);
}

// Perform the convolution
float32x4_t acc = vdupq_n_f32(0.0f);
for (int k = 0; k < kernel_size; ++k) {
    for (int c = 0; c < input_channels; ++c) {
        float32x4_t in_c = vld1q_f32(input + c * input_channels * kernel_size);
        float32x4_t wt_c = vld1q_f32(weight + c * output_channels * kernel_size);
        acc = vmlaq_f32(acc, in_c, wt_c);
    }
    input += stride;
    weight += stride;
}

// Add bias and store output
acc = vaddq_f32(acc, vld1q_f32(bias));
vst1q_f32(output, acc);
```
- Use the data structures and operations provided by `numpy`, `tensorflow`, `pytorch` if possible; parallelism is already implemented there
- Keep an eye on Mojo, and use C/C++ for our inference server
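To illustrate the last point: the hand-written NEON fragment above is the kind of code these libraries run under the hood. In Python, a similar 1-D convolution is one vectorized call (the array values here are just an illustration):

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.25, 0.5, 0.25])

# the SIMD loop lives inside NumPy's compiled code
out = np.convolve(signal, kernel, mode="valid")
print(out)  # [2. 3. 4.]
```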