I am working through the Unsloth challenges: https://colab.research.google.com/drive/1JqKqA1XWeLHvnYAc0wzrR4JBCnq43HyH I note my progress so far here. Current estimated total points: 25 (21 as submitted).
My personal notes with time per task and more are available upon request.
Quick References:
- https://github.com/parnox/unsloth-puzzles-a
- https://github.com/parnox/unsloth-puzzles-b (unavailable)
- https://github.com/parnox/unsloth-puzzles-c (unavailable)
- https://github.com/parnox/unsloth-puzzles-d (unavailable)
- https://github.com/parnox/unsloth-puzzles-e
Some notes about me. I am a US citizen and math PhD who has spent the last ten years at various startups, mostly my own. For reasons specific to timing, circumstance, and positioning, I seek gainful employment at Unsloth, specifically on location in SF. In truth, I think it will be quite a successful company, especially if we can somehow figure out how to add functionality for macOS.
I am certain, looking at Hyperlearn, that both co-founders exceed me in every aspect of engineering. I am an outsider in ML, but sometimes an outsider can see what others look past. For example, in NF4 quantization only 92% of the embeddings are normally distributed. Perhaps a fit with two curves would have higher accuracy, and this at a blockwise cost of only one extra bit. I also have some basic competence in engineering, which I hope to illustrate in the solutions that follow.
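The "92% normally distributed" observation above can be probed empirically. Below is a minimal sketch, entirely my own illustration and not any NF4 implementation: it splits a weight matrix into blocks (block size 64 is an assumption) and counts the fraction of blocks consistent with normality under a numpy-only Jarque-Bera check.

```python
import numpy as np

def jarque_bera(x):
    """Jarque-Bera statistic: near 0 when sample skew/kurtosis match a Gaussian."""
    n = len(x)
    z = (x - x.mean()) / x.std()
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4)
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

def fraction_normal_blocks(weights, block_size=64, crit=5.99):
    """Fraction of blocks passing the JB test (5.99 = chi^2 critical
    value for df=2 at the 5% level)."""
    flat = np.ravel(weights)
    usable = len(flat) // block_size * block_size
    blocks = flat[:usable].reshape(-1, block_size)
    stats = np.array([jarque_bera(b) for b in blocks])
    return float(np.mean(stats < crit))

# Synthetic stand-in for embedding weights; real checkpoints would go here.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 64))
print(fraction_normal_blocks(w))  # Gaussian data: most blocks pass
```

Running the same function on heavier-tailed blocks (e.g. Laplace) drops the pass rate sharply, which is the gap a second fitted curve could target.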
I have always been a private individual and do not have Facebook, LinkedIn, Twitter, etc. accounts. Please consider my application in spite of my eccentricity.
- Part A (if attempted):
- Single triton kernel (+3)
- Speedup checks:
- If speedup <= 1.00 (-3)
- If speedup >= 1.05 (+1)
- If speedup >= 1.10 (+2)
- If speedup >= 1.15 (+2)
- Kernel works in torch compile (+1)
- If not (-1)
- Custom ASM works (+3)
- Uses cache eviction (+1)
- Tested in f16 and bf16 (+1)
- If not (-1)
State: Submitted. Complete. Speedup 1.13x on Colab.
https://github.com/parnox/unsloth-puzzles-a
10.35% written by AI.
- Part B (if attempted):
- FSDP2 works with QLoRA:
- With torch compile (+5)
- Without torch compile (+3)
- Uses part A and single kernel and faster (+3)
- Uses torchAO:
- If torchAO slower than BnB (-3)
- TP or PP with QLoRA:
- With zero bubble (+3)
- Without zero bubble (+2)
- FSDP1 works with QLoRA (+1)
- Kaggle notebook 2 tesla t4 example (+2)
- If not (score = 0)
- If not attempted (-2)
State: Not submitted. Tested on a 2-GPU setup. Initial integration of FSDP2 with QLoRA.
https://github.com/parnox/unsloth-puzzles-b
- Part C (if attempted):
- Uses flex attention:
- Dynamic sequence length works (+3)
- If not (+1)
- No torch compile BnB (-2)
- Use part A (+1)
- Torch compile BnB (+1)
- Attention compiled:
- With excessive recompilation (-3)
- Without excessive recompilation (+2)
- MLP compiled:
- With excessive recompilation (-3)
- Without excessive recompilation (+1)
- Loss not compiled (-1)
- Layernorms not compiled (-3)
- Max autotune triton matmul:
- With excessive recompilation (-2)
- Without excessive recompilation (+2)
- If not attempted (-1)
State: Not submitted. Graph breaks reduced by around 50%, but training loss suffers. Currently: monkey-patching the BnB matmul and adjacent code.
https://github.com/parnox/unsloth-puzzles-c
45.50% written by AI.
- Part D:
  - Unavailable - VLMs Data Collator
  - Unavailable - VLMs image resizing
  - Unavailable - GGUF Vision support
  - Unavailable - Support Flex Attention
  - Available - Support Sequence Classification
    - Rejected: #1739
  - Available - Refactor Attention
  - Unavailable - Tool Calling
  - Available - VLMs train only on completions
    - Rejected: #1736
  - Other issues (+1/+2, max 12)
State: Preliminary reading of issues and PRs.
- Part E (if attempted):
- VRAM 50% reduction (+2)
- Remove float32 upcast (score = 0)
- Show CE loss works (+1)
- Show other functions work (+1)
- Hardcoded gradients (score = 0)
- Allows dynamic chunk sizes (+1)
- Llama 1B training loss matches (+1)
- If not (score = 0)
- GRPO memory efficient linear works (+4)
State: Submitted. Complete.
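The core trick behind Part E's VRAM reduction can be sketched generically (this is my own numpy illustration, not the submitted kernel): compute cross-entropy chunk-by-chunk over the sequence so the full (seq_len x vocab) logit matrix never materializes; `chunk_size=32` and all names are illustrative.

```python
import numpy as np

def chunked_ce_loss(hidden, lm_head, targets, chunk_size=32):
    """Mean cross-entropy computed in sequence chunks: only a
    (chunk_size x vocab) logit slice exists at any one time."""
    total, n = 0.0, hidden.shape[0]
    for start in range(0, n, chunk_size):  # dynamic: the last chunk may be shorter
        h = hidden[start:start + chunk_size]           # (c, d)
        t = targets[start:start + chunk_size]          # (c,)
        logits = h @ lm_head.T                         # (c, vocab) -- the only big buffer
        m = logits.max(axis=1, keepdims=True)          # stabilized log-sum-exp
        lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
        total += (lse - logits[np.arange(len(t)), t]).sum()
    return total / n

# Check against the unchunked reference on synthetic data.
rng = np.random.default_rng(0)
d, vocab, n = 16, 100, 50
hidden = rng.normal(size=(n, d))
lm_head = rng.normal(size=(vocab, d))
targets = rng.integers(0, vocab, size=n)

logits = hidden @ lm_head.T
m = logits.max(axis=1, keepdims=True)
lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
ref = (lse - logits[np.arange(n), targets]).mean()
print(float(chunked_ce_loss(hidden, lm_head, targets)), float(ref))
```

In a real trainer the chunked version also recomputes or fuses the backward pass per chunk; the sketch only shows why the forward memory drops from O(seq_len * vocab) to O(chunk_size * vocab).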