- OSI network layers
- Programming language layers
- The underlying implementations matter
- Different implementations can produce the same results at very different speeds
```sql
select count(*) from Deposits
inner join households on households.id = deposits.HouseholdId
where CashierId = 'd89c8029-4808-4cea-b505-efd8279dc66d'
```
The SQL query above can produce different execution plans, and therefore run at very different speeds, depending on indexes and table statistics
- Python is easy to use
```python
a = "hello"
b = "world"
c = a + b
```
Equivalent C code:
```c
#include <stdio.h>
#include <string.h>

int main() {
    // Define the two strings
    char a[] = "hello";
    char b[] = "world";
    // Calculate the combined string length (add 1 for the null terminator)
    int combinedLength = strlen(a) + strlen(b) + 1;
    // Allocate memory for the combined string
    char c[combinedLength];
    // Manually copy each character of the first string
    for (int i = 0; i < strlen(a); i++) {
        c[i] = a[i];
    }
    // Append the second string after the first
    for (int i = strlen(a), j = 0; i < combinedLength - 1; i++, j++) {
        c[i] = b[j];
    }
    // Add the null terminator
    c[combinedLength - 1] = '\0';
    printf("%s\n", c);
    return 0;
}
```
- Python 3 is dynamically typed: type checks and memory allocation happen at runtime
```python
a = 30      # bound to an int
a = "aaa"   # rebound to a str
```
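A quick illustration of this point (a minimal sketch; the exact byte count is a CPython implementation detail): the same name can be rebound to objects of different types, because every object carries its own type metadata that is checked at runtime.

```python
import sys

a = 30
assert type(a) is int   # the type lives on the object, checked at runtime
a = "aaa"
assert type(a) is str   # the same name now refers to a different type

# every CPython object also stores type and reference-count metadata,
# so even a small int occupies tens of bytes
print(sys.getsizeof(30))
```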
- Python lists are slow
- NumPy arrays are significantly faster than Python lists for several reasons:
- Memory layout:
  - NumPy arrays: Stored contiguously in memory, meaning all elements of the same data type are stored together. This allows faster access and manipulation, as the CPU can read large chunks of data efficiently.
  - Python lists: Heterogeneous in memory, meaning elements can be of different data types and scattered across memory. Accessing and manipulating elements is slower because the CPU needs to jump around memory to find the required data.
- Data type:
  - NumPy arrays: Homogeneous, meaning all elements are of the same data type (e.g., all integers or all floats). This allows optimized operations and efficient use of memory.
  - Python lists: Heterogeneous, meaning different elements can have different data types. This requires more overhead for storing and manipulating data.
- Operations:
  - NumPy arrays: Optimized C code for vectorized operations, meaning operations are applied to entire arrays at once. This is significantly faster than looping over individual elements of a list.
  - Python lists: Interpreted Python code for operations, meaning each operation is executed element by element. This is slower and less efficient.
- Built-in functions:
  - NumPy arrays: Have a vast library of optimized functions for mathematical, statistical, and linear algebra operations. These functions take advantage of the underlying C code and contiguously stored data for maximum performance.
  - Python lists: Lack dedicated functions for complex operations; these need to be implemented with loops, leading to slower execution.
- Parallelism:
  - NumPy arrays: Can leverage multi-core processors for parallel execution of operations, further improving performance on large datasets.
  - Python lists: Primarily single-threaded, limiting the potential for performance improvement on modern hardware.
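A minimal timing sketch of the vectorization point (exact timings depend on the machine and Python build, so no numbers are claimed here):

```python
import time
import numpy as np

n = 1_000_000
xs = list(range(n))
arr = np.arange(n, dtype=np.int64)

# pure-Python loop over a list: one interpreted operation per element
t0 = time.perf_counter()
squares_list = [x * x for x in xs]
t_list = time.perf_counter() - t0

# single vectorized NumPy operation: the loop runs in compiled C code
t0 = time.perf_counter()
squares_np = arr * arr
t_np = time.perf_counter() - t0

print(f"list comprehension: {t_list:.4f}s, NumPy vectorized: {t_np:.4f}s")
```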
- Mojo is not the only solution: Numba, Cython, Julia, ... can do similar things
- Mojo combines the usability of Python with the performance of C, unlocking unparalleled programmability of AI hardware and extensibility of AI models.
- Static typing (e.g., `ComplexFloat64` in the Mojo example)
- Parallelism
- Uses instruction sets provided by the specific hardware
- Squared addition is a common operation, and the Mojo standard library provides a special function for it called `squared_add`, which is implemented using FMA instructions for maximum performance
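As a sketch of what the operation computes (assuming `squared_add(z, c)` means z² + c, as in the Mandelbrot-style examples; this plain-Python version has none of the FMA fusion):

```python
def squared_add(z: complex, c: complex) -> complex:
    # z*z + c in one call; Mojo's version fuses the multiply and add
    # into hardware FMA instructions, this sketch does not
    return z * z + c

print(squared_add(1 + 1j, 0.5j))  # (1+1j)**2 = 2j, plus 0.5j -> 2.5j
```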
- Compiler optimizations
- loop unrolling
- Both Cython and Mojo are language extensions for Python, designed to improve performance and efficiency.
- Parallelism in the real world
- It allows us to use Single Instruction, Multiple Data (SIMD) instructions: each instruction performs the same operation on multiple data samples.
- Arm NEON instruction set (e.g., `float32x4_t`, `vld1q_f32`, `vaddq_f32`, ...)
```c
// Load input and weight tensors
float32x4_t in[4];
float32x4_t wt[4];
for (int i = 0; i < 4; ++i) {
    in[i] = vld1q_f32(input);
    wt[i] = vld1q_f32(weight);
}

// Perform the convolution
float32x4_t acc = vdupq_n_f32(0.0f);
for (int k = 0; k < kernel_size; ++k) {
    for (int c = 0; c < input_channels; ++c) {
        float32x4_t in_c = vld1q_f32(input + c * input_channels * kernel_size);
        float32x4_t wt_c = vld1q_f32(weight + c * output_channels * kernel_size);
        acc = vmlaq_f32(acc, in_c, wt_c);
    }
    input += stride;
    weight += stride;
}

// Add bias and store output
acc = vaddq_f32(acc, vld1q_f32(bias));
vst1q_f32(output, acc);
```
- Use the data structures and operations provided by `numpy`, `tensorflow`, `pytorch` if possible; parallelism is already implemented there
- Keep an eye on Mojo, and use C/C++ for our inference server
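To illustrate the last point: the hand-written NEON fragment above is the kind of code these libraries run under the hood. In Python, a similar 1-D convolution is one vectorized call (the array values here are just an illustration):

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.25, 0.5, 0.25])

# the SIMD loop lives inside NumPy's compiled code
out = np.convolve(signal, kernel, mode="valid")
print(out)  # [2. 3. 4.]
```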