Unsloth Challenges

I am working through the unsloth challenges (https://colab.research.google.com/drive/1JqKqA1XWeLHvnYAc0wzrR4JBCnq43HyH) and note my progress here. Current estimated total points: 25 (21 as submitted).

My personal notes with time per task and more are available upon request.

Quick References:

https://github.com/parnox/unsloth-puzzles-a
https://github.com/parnox/unsloth-puzzles-b (unavailable)
https://github.com/parnox/unsloth-puzzles-c (unavailable)
https://github.com/parnox/unsloth-puzzles-d (unavailable)
https://github.com/parnox/unsloth-puzzles-e

Some notes about me. I am a US citizen and math PhD who has spent the last ten years at various startups, mostly my own. For reasons specific to timing, circumstance, and positioning, I seek gainful employment at unsloth, specifically on location in SF. In truth, I think that it will be quite a successful company, especially if we can somehow figure out how to add functionality for macOS.

Looking at hyperlearn, I am certain that both co-founders exceed me in every aspect of engineering. I am an outsider in ML, but sometimes an outsider can see what others look past. For example, in NF4 quantization only 92% of the embeddings are normally distributed. Perhaps a fit with two curves would have higher accuracy, at a cost of only one extra bit per block. I also have some basic competence in engineering, which I hope to illustrate in the solutions that follow.
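To make the two-curve idea concrete, here is a minimal sketch, not the bitsandbytes implementation. It assumes a simplified NF4-like codebook built from normal quantiles (the real NF4 table differs, for example it contains an exact zero) and a hypothetical second codebook built from Laplace quantiles. All function names are illustrative. Each block is quantized against both codebooks and one extra bit records which one gave the lower squared error.

```python
import math
import random
from statistics import NormalDist

def _rescale(qs):
    """Rescale quantiles so the largest magnitude is 1."""
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

def normal_codebook(levels=16):
    """Quantiles of N(0, 1): a simplified stand-in for the NF4 table (assumption)."""
    nd = NormalDist()
    return _rescale([nd.inv_cdf((i + 0.5) / levels) for i in range(levels)])

def laplace_codebook(levels=16):
    """Quantiles of a Laplace distribution: a hypothetical second curve."""
    def inv_cdf(p):
        return math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p))
    return _rescale([inv_cdf((i + 0.5) / levels) for i in range(levels)])

CODEBOOKS = [normal_codebook(), laplace_codebook()]

def quantize_block(block):
    """Return (absmax, codebook_bit, codes) with the lowest squared error."""
    absmax = max(abs(x) for x in block) or 1.0
    best = None
    for bit, book in enumerate(CODEBOOKS):
        # Nearest codebook entry for each absmax-normalized value.
        codes = [min(range(len(book)), key=lambda k: abs(x / absmax - book[k]))
                 for x in block]
        err = sum((x - absmax * book[c]) ** 2 for x, c in zip(block, codes))
        if best is None or err < best[0]:
            best = (err, bit, codes)
    _, bit, codes = best
    return absmax, bit, codes

def dequantize_block(absmax, bit, codes):
    """Invert quantize_block using the codebook selected by `bit`."""
    book = CODEBOOKS[bit]
    return [absmax * book[c] for c in codes]

# Usage: quantize one block of 64 values and reconstruct it.
random.seed(0)
example_block = [random.gauss(0.0, 1.0) for _ in range(64)]
example_recon = dequantize_block(*quantize_block(example_block))
```

With a block size of 64, the selection bit adds 1/64 of a bit per weight on top of the 4-bit codes and the per-block absmax. By construction the chosen codebook never has higher squared error on a block than the normal-quantile codebook alone; whether the gain is meaningful on real weights is untested here.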

I have always been a private individual and do not have Facebook, LinkedIn, Twitter, etc. Please consider my application in spite of my eccentricity.

Challenge Progress:

Part A (if attempted):
- Single triton kernel (+3)
- Speedup checks:
  - If speedup <= 1.00 (-3)
  - If speedup >= 1.05 (+1)
  - If speedup >= 1.10 (+2)
  - If speedup >= 1.15 (+2)
- Kernel works in torch compile (+1)
  - If not (-1)
- Custom ASM works (+3)
- Uses cache eviction (+1)
- Tested in f16 and bf16 (+1)
  - If not (-1)

State: Submitted. Complete. Speedup 1.13x on Colab.

https://github.com/parnox/unsloth-puzzles-a

10.35% written by AI.

Part B (if attempted):
- FSDP2 works with QLoRA:
  - With torch compile (+5)
  - Without torch compile (+3)
  - Uses part A and single kernel and faster (+3)
  - Uses torchAO:
    - If torchAO slower than BnB (-3)
- TP or PP with QLoRA:
  - With zero bubble (+3)
  - Without zero bubble (+2)
- FSDP1 works with QLoRA (+1)
- Kaggle notebook 2 tesla t4 example (+2)
  - If not (score = 0)
- If not attempted (-2)

State: Not submitted. Tested on a 2-GPU setup. Initial integration of FSDP2 with QLoRA.

https://github.com/parnox/unsloth-puzzles-b

Part C (if attempted):
- Uses flex attention:
  - Dynamic sequence length works (+3)
  - If not (+1)
- No torch compile BnB (-2)
- Use part A (+1)
- Torch compile BnB (+1)
- Attention compiled:
  - With excessive recompilation (-3)
  - Without excessive recompilation (+2)
- MLP compiled:
  - With excessive recompilation (-3)
  - Without excessive recompilation (+1)
- Loss not compiled (-1)
- Layernorms not compiled (-3)
- Max autotune triton matmul:
  - With excessive recompilation (-2)
  - Without excessive recompilation (+2)
- If not attempted (-1)

State: Not submitted. Graph breaks are down by around 50%, but training loss suffers. Currently monkey patching the bnb matmul and adjacent code.

https://github.com/parnox/unsloth-puzzles-c

45.50% written by AI.

Part D:
- Unavailable - VLMs Data Collator
  - #55
  - #56
- Unavailable - VLMs image resizing
  - #1808
- Unavailable - GGUF Vision support
  - #1799
- Unavailable - Support Flex Attention
  - #1803
- Available - Support Sequence Classification
  - Rejected: #1739
- Available - Refactor Attention
- Unavailable - Tool Calling
  - #1764
- Available - VLMs train only on completions
  - Rejected: #1736
- Other issues (+1/+2, max 12)

State: Preliminary reading of issues and PRs.

Part E (if attempted):
- VRAM 50% reduction (+2)
- Remove float32 upcast (score = 0)
- Show CE loss works (+1)
- Show other functions work (+1)
- Hardcoded gradients (score = 0)
- Allows dynamic chunk sizes (+1)
- Llama 1B training loss matches (+1)
  - If not (score = 0)
- GRPO memory efficient linear works (+4)

State: Submitted. Complete.

https://github.com/parnox/unsloth-puzzles-e
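
For readers unfamiliar with the Part E criteria, the sketch below illustrates the general idea behind a chunked cross-entropy loss with a dynamic chunk size. It is not the submitted solution: it is a forward-only toy in pure Python with hypothetical function names, whereas a real version would run in PyTorch and also handle the backward pass. The point it shows is that logits are only ever materialized for one chunk of rows, and the mean loss is the same for any chunk size.

```python
import math

def linear(x, weight):
    """Project one hidden vector to logits: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, x)) for row in weight]

def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

def chunked_ce_loss(hidden, weight, targets, chunk_size):
    """Mean CE loss, holding logits for at most `chunk_size` rows at a time."""
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        rows = hidden[start:start + chunk_size]
        tgts = targets[start:start + chunk_size]
        # Only this chunk's logits exist in memory at this point.
        logits = [linear(x, weight) for x in rows]
        total += sum(cross_entropy(l, t) for l, t in zip(logits, tgts))
    return total / len(hidden)

# Usage: 5 tokens, hidden size 3, vocabulary size 4.
hidden = [[0.1 * (i + j) for j in range(3)] for i in range(5)]
weight = [[0.2 * (v - j) for j in range(3)] for v in range(4)]
targets = [0, 3, 1, 2, 0]
loss_full = chunked_ce_loss(hidden, weight, targets, chunk_size=5)
loss_chunked = chunked_ce_loss(hidden, weight, targets, chunk_size=2)
```

The last chunk may be shorter than `chunk_size`, which is why the loop sums per-token losses and divides by the total token count once at the end, instead of averaging per-chunk means.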
