tensorForth - eForth does tensor calculus, implemented in CUDA.

  • Forth VM that supports tensor calculus and dynamic parallelism

Status

   version      feature  stage     description                       comparable
   release 1.0  float    beta      extended eForth with F32 float    Python
   release 2.0  matrix   alpha     added array and matrix objects    NumPy
   next         CNN      planning  add tensor NN ops with autograd   PyTorch
   -            RNN      later     -                                 -

Why?

Compiled programs run fast on Linux. On the other hand, the command-line interface and shell scripting tie them together. Productivity grows with this model, especially for researchers.

For AI development today, we mostly use Python. To run on a CUDA device, say with Numba or the like, 'just-in-time' compilation usually happens behind the scenes before the code is loaded and run. In a sense, the Python code behaves like a Makefile that requires the compilers to be on the host box. Only at the tail end can the results be analyzed and visualized. This is usually a long journey. After many coffee breaks, we update the Python and start over. To track progress, scanning intermediate formatted files sometimes becomes necessary, which probably reminds seasoned developers of the line-printer days.

Having a 'shell' that can interactively and incrementally run 'compiled programs' directly on the GPU, without dropping back to the host system, might be useful. Some might argue that branch divergence in the kernel could kill GPU performance, but the performance of the script itself is not the point. So, here we are!

How?

A GPU behaves like a co-processor. It has no OS, no string support, and manages its own memory. Most of the available libraries are built to be called from the CPU rather than from within the GPU. So, to be interactive, a memory manager, I/O, and syncing with the CPU have to be added, much like creating a Forth from scratch in the old days.

Since GPUs have good compiler support nowadays, and I've already reworked the latest eForth to be lambda-based in C++, pretty much all words can be copied straight over, with some attention to those affected by CELL being float32, such as addressing and logic ops; i.e. BRANCH, 0=, MOD, XOR would not work as expected.

Having an interactive Forth on the GPU does not mean a lot by itself. However, by adding matrix ops and tensors with backprop, roughly following the path from NumPy to PyTorch, combining the cleanness of Forth with the massively parallel nature of GPUs can hopefully become useful one day.

Small Example

> ten4 -v 1                          # enter tensorForth, with mmu debug tracing on
tensorForth 2.0
\  GPU 0 initialized at 1800MHz, dict[1024], pmem=48K, tensor=1024M
\  VM[0] dict=0x7f56fe000a00, mem=0x7f56fe004a00, vss=0x7f56fe010a00

2 3 matrix{ 1 2 3 4 5 6 }            \ create matrix
mmu#tensor(2,3) => size=6            \ the optional debug traces
 <0 T2[2,3]> ok                      \ 2-D tensor shown on top of stack (TOS)
dup                                  \ duplicate i.e. create a view
mmu#view 0x7efc18000078 => size=6
 <0 T2[2,3] V2[2,3]> ok              \ view shown on TOS
.                                    \ print the view
matrix[2,3] = {
	{ +1.0000 +2.0000 +3.0000 }
	{ +4.0000 +5.0000 +6.0000 } }
 <0 T2[2,3]> ok
mmu#free(T2) size=6                  \ view released after print
 <0 T2[2,3]> ok
3 2 matrix ones                      \ create a [3,2] matrix and fill with ones
mmu#tensor(3,2) => size=6
 <0 T2[2,3] T2[3,2]> ok
*                                    \ multiply matrices [2,3] x [3,2]
mmu#tensor(2,2) => size=4            \ a [2,2] resultant matrix created
 <0 T2[2,3] T2[3,2] T2[2,2]> ok      \ shown on TOS
.                                    \ print the matrix
matrix[2,2] = {
	{ +6.0000 +6.0000 }
	{ +15.0000 +15.0000 } }
 <0 T2[2,3] T2[3,2]> ok
mmu#free(T2) size=4                  \ matrix released after print
2drop                                \ free both matrices
mmu#free(T2) size=6
mmu#free(T2) size=6
 <0> ok
bye                                  \ exit tensorForth
tensorForth 2.0 done.

Larger Example - benchmark [1024,2048] x [2048,512] 1000 loops

1024 2048 matrix rand                \ create a [1024,2048] matrix with uniform random values
 <0 T2[1024,2048]> ok                
2048 512 matrix ones                 \ create another [2048,512] matrix filled with 1s
 <0 T2[1024,2048] T2[2048,512]> ok
*                                    \ multiply them and resultant matrix on TOS
 <0 T2[1024,2048] T2[2048,512] T2[1024,512]> ok
2048 / .                             \ scale down and print the resultant [1024,512] matrix
matrix[1024,512] = {                 \ in PyTorch style (edgeitem=3)
	{ +0.4873 +0.4873 +0.4873 ... +0.4873 +0.4873 +0.4873 }
	{ +0.4274 +0.4274 +0.4274 ... +0.4274 +0.4274 +0.4274 }
	{ +0.5043 +0.5043 +0.5043 ... +0.5043 +0.5043 +0.5043 }
	...
	{ +0.5041 +0.5041 +0.5041 ... +0.5041 +0.5041 +0.5041 }
	{ +0.5007 +0.5007 +0.5007 ... +0.5007 +0.5007 +0.5007 }
	{ +0.5269 +0.5269 +0.5269 ... +0.5269 +0.5269 +0.5269 } }
 <0 T2[1024,2048] T2[2048,512] T2[1024,512]> ok    \ original T2[1024,512] is still left on TOS
drop                                               \ because tensor ops are by default non-destructive
 <0 T2[1024,2048] T2[2048,512]> ok                 \ so we drop it from TOS
: mx clock >r for * drop next clock r> - ;         \ define a word 'mx' for benchmark loop
9 mx                                               \ run benchmark for 10 loops
 <0 T2[1024,2048] T2[2048,512] 396> ok             \ 396 ms for 10 cycles
drop                                               \ drop the value
 <0 T2[1024,2048] T2[2048,512]> ok
999 mx                                             \ now try 1000 loops
 <0 T2[1024,2048] T2[2048,512] 3.938+04> ok        \ that is 39.38 sec (i.e. ~40ms / loop)

Note:

  • cuRAND uniform distribution averaging 0.5 is doing OK.
  • 39.4 ms per matmul loop on GTX 1660 with the naive implementation. PyTorch averages 0.850 ms, which is ~50x faster. Luckily, CUDA matmul tuning methods are well known. TODO!

To build

  • install CUDA 11.6 on your machine
  • clone repo to your local directory

with Makefile, and test

  • cd to your ten4 repo directory
  • update root Makefile to your desired CUDA_ARCH, CUDA_CODE
  • type 'make all'
  • if all goes well, some warnings aside, cd to tests
  • type 'ten4 < lesson_1.txt' for Forth syntax check,
  • and 'ten4 < lesson_2.txt' for matrix stuff

with Eclipse

  • install Eclipse
  • install CUDA SDK 11.6 for Eclipse (from Nvidia site)
  • create project by importing from your local repo root
  • exclude directories - ~/tests, ~/img
  • set File=>Properties=>C/C++ Build=>Setting=>NVCC compiler
    • Dialect=C++14
    • CUDA=5.2 or above
    • Optimization=O3

tensorForth command line options

  • -h - list all GPU ids and their properties
  • -d device_id - select GPU device id
  • -v verbo_level - set verbosity level 0: off (default), 1: mmu tracing on, 2: detailed trace

Forth Tensor operations (see doc for details and examples)

Tensor creation words

   array     (n       -- T1)     - create a 1-D array and place on top of stack (TOS)
   matrix    (h w     -- T2)     - create 2-D matrix and place on TOS
   tensor    (n h w c -- T4)     - create a 4-D NHWC tensor on TOS
   array{    (n       -- T1)     - create 1-D array from console stream
   matrix{   (h w     -- T2)     - create a 2-D matrix from console stream
   copy      (Ta      -- Ta Ta') - duplicate (deep copy) a tensor on TOS
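
For instance, a minimal sketch of the creation words in action, assuming the stack effects above (debug traces and stack echoes omitted; the 28x28 shape is just an arbitrary illustration):

6 array                              \ create a 1-D array of 6 elements
2 3 matrix{ 1 2 3 4 5 6 }            \ create a [2,3] matrix initialized from the console stream
1 28 28 1 tensor                     \ create a 4-D NHWC tensor, e.g. one 28x28 single-channel image
copy                                 \ deep-copy the tensor on TOS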

View creation words

   dup       (Ta    -- Ta Va)    - create a view of a tensor on TOS
   over      (Ta Tb -- Ta Tb Va) - create a view from 2nd item on stack
   2dup      (Ta Tb -- Ta Tb Va Vb)
   2over     (Ta Tb Tc Td -- Ta Tb Tc Td Va Vb)
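
As a small sketch based on the stack effects above, over makes a view of the second stack item without copying its data:

2 3 matrix ones 3 2 matrix ones      \ two tensors on the stack: Ta[2,3] Tb[3,2]
over                                 \ push a view of Ta: Ta Tb Va
.                                    \ print and release the view; Ta and Tb remain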

Tensor/View print word

   . (dot)   (Ta -- )        - print array
   . (dot)   (Va -- )        - print view

Shape adjusting words (change shape of original tensor)

   flatten   (Ta -- T1a')    - reshape a tensor to a 1-D array
   reshape2  (Ta -- T2a')    - reshape to a 2-D matrix
   reshape4  (Ta -- T4a')    - reshape to a 4-D NHWC tensor
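
A minimal sketch, assuming the stack effects above: flatten reshapes the matrix just created into a 1-D array of the same size, in place:

2 3 matrix ones                      \ a [2,3] matrix on TOS
flatten                              \ the same data, now shaped as a 1-D array of 6 elements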

Fill tensor with initial values (data updated in the original tensor)

   zeros     (Ta   -- Ta')   - fill tensor with zeros
   ones      (Ta   -- Ta')   - fill tensor with ones
   full      (Ta   -- Ta')   - fill tensor with number on TOS
   eye       (Ta   -- Ta')   - fill diag with 1 and other with 0
   rand      (Ta   -- Ta')   - fill tensor with uniform random numbers
   randn     (Ta   -- Ta')   - fill tensor with normal distribution random numbers
   ={        (Ta   -- Ta')   - fill tensor with console input from the first element
   ={        (Ta n -- Ta')   - fill tensor with console input starting at n'th element
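
For example, a short sketch assuming the stack effects above; each word fills the tensor in place and leaves it on TOS:

3 3 matrix zeros                     \ a [3,3] matrix of all zeros
eye                                  \ turn it into an identity matrix
dup .                                \ print a view of it (the view is released after printing)
rand                                 \ refill the same matrix with uniform random numbers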

Tensor slice and dice

   slice     (Ta x0 x1 y0 y1 -- Ta Ta') - numpy-style slice [x0:x1, y0:y1]
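
For example, a sketch assuming the axis order above, taking the upper-left corner of a random matrix:

4 4 matrix rand                      \ a [4,4] matrix of uniform random values
0 2 0 2 slice                        \ a new [2,2] tensor holding the [0:2, 0:2] corner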

Tensor arithmetic words (by default non-destructive)

   +         (Ta Tb -- Ta Tb Tc) - tensor element-wise addition
   +         (Ta n  -- Ta Ta')   - tensor-scalar addition (broadcast)
   -         (Ta Tb -- Ta Tb Tc) - tensor element-wise subtraction
   -         (Ta n  -- Ta Ta')   - tensor-scalar subtraction (broadcast)
   *         (Ta Tb -- Ta Tb Tc) - matrix-matrix multiplication
   *         (Ta Ab -- Ta Ab Ta')- TODO: matrix-array multiplication (broadcast)
   *         (Aa Ab -- Aa Ab n)  - array-array dot product
   *         (Ta n  -- Ta Ta')   - matrix-scalar multiplication (broadcast)
   /         (Ta Tb -- Ta Tb Tc) - TODO: A * inv(B) matrix
   /         (Ta n  -- Ta Ta')   - matrix-scalar division (broadcast)
   sum       (Ta    -- Ta n)     - sum all elements of a tensor
   exp       (Ta    -- Ta Ta')   - element-wise exponential
   inverse   (Ta    -- Ta Ta')   - TODO: matrix inversion
   transpose (Ta    -- Ta Ta')   - matrix transpose
   matmul    (Ta Tb -- Ta Tb Tc) - matrix multiplication
   gemm      (a b Ta Tb Tc -- a b Ta Tb Tc') - GEMM Tc' = a * Ta x Tb + b * Tc
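
Putting a few of these together, a minimal sketch based on the stack effects above (operands stay on the stack since the ops are non-destructive):

2 2 matrix{ 1 2 3 4 }                \ matrix A
2 2 matrix ones                      \ matrix B
+                                    \ element-wise A + B => a new [2,2] tensor on TOS
10 *                                 \ scale the sum by 10 (broadcast) => another new tensor
dup .                                \ print a view of the scaled result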

TODO

  • backprop and autograd
  • add KV pair (associative array) for label, and fully connected lookup
  • add CNN
    • study torch.nn, CUB (for kernel)
    • conv: ~pushing_the_limits_for_2d_conv..., shuffle reduction
    • benchmark (MNIST, CIFAR, Kaggle...)
  • models load/save - (VGG-19, ResNet (i.e. skip-connect), compare to Keras)
  • sampling and distribution
  • refactor - add namespace
  • add RNN
  • add inter-VM communication (CUDA stream, review CUB again)
  • add batch loader (from VM->VM)
  • .petastorm, .csv loader (available on github)
  • add GNN - dynamic graph with VMs

LATER

  • integrate plots (matplotlib, tensorboard)
  • integrate ONNX
  • integrate CUB, CUTLASS (utilities.init, gemm_api) - slow, later
  • preprocessor (DALI) + GPUDirect - heavy, later

History

Release 1.0 features

  • Dr. Ting's eForth words with F32 as data unit, U16 instruction unit
  • Support parallel Forth VMs
  • Lambda-based Forth microcode
  • Memory management unit handles dictionary, stack, and parameter blocks in CUDA
  • Managed memory debug utilities and words: see, ss_dump, mem_dump
  • String handling utilities in CUDA
  • Light-weight vector class, no dependency on STL
  • Output Stream, async from GPU to host

Release 2.0 features

  • array, matrix, tensor objects (modeled to PyTorch)
  • TLSF tensor storage manager (now 4G max)
  • matrix arithmetic (i.e. +, -, *, copy, matmul, transpose)
  • matrix fill (i.e. zeros, ones, full, eye, random)
  • matrix console input (i.e. matrix[..., array[..., and T![)
  • matrix print (i.e. PyTorch-style, adjustable edge elements)
  • tensor view (i.e. dup, over, pick, r@)
  • GEMM (i.e. a * A x B + b * C, use CUDA Dynamic Parallelism)
  • command line option: debug print level control (T4_DEBUG)
  • command line option: list (all) device properties
  • use cuRAND kernel randomizer for uniform and standard normal distribution

Release 2.2 features
