parnox / unsloth-notes

Unsloth Puzzle 2-16. Notes and indications of progress. Currently: 25 points

Repository from Github https://github.com/parnox/unsloth-notes

Unsloth Challenges

I am working through the Unsloth challenges (https://colab.research.google.com/drive/1JqKqA1XWeLHvnYAc0wzrR4JBCnq43HyH) and note my progress so far here. Current estimated total points: 25 (21 as submitted).

My personal notes with time per task and more are available upon request.

Quick References:

Some notes about me. I am a US citizen and math PhD who has spent the last ten years at various startups, mostly my own. For reasons specific to timing, circumstance, and positioning, I seek gainful employment at Unsloth, specifically on location in SF. In truth, I think that it will be quite a successful company, especially if we can somehow figure out how to add functionality for macOS.

Looking at hyperlearn, I am certain that both co-founders exceed me in every aspect of engineering. I am an outsider in ML, but sometimes an outsider can see what others look past. For example, in NF4 quantization only 92% of the embeddings are normally distributed. Perhaps a fit with two curves would have higher accuracy, and this at a blockwise cost of only one bit. I also have some basic competence in engineering, which I hope to illustrate in the solutions that follow.
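One possible reading of the two-curve idea, sketched in plain Python: each block is quantized against two 4-bit code books, and a single selector bit per block records which one gave the lower error. Everything here is an assumption made for illustration. The code books are simplified stand-ins (normal quantiles and a uniform grid, not the exact NF4 table), and the function names are hypothetical.

```python
import random
from statistics import NormalDist


def normal_codebook(levels=16):
    # Quantiles of a standard normal, rescaled to [-1, 1].
    # Simplified stand-in for NF4, not the exact NF4 table.
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]


def uniform_codebook(levels=16):
    # Evenly spaced levels on [-1, 1], as a second candidate "curve".
    return [-1 + 2 * i / (levels - 1) for i in range(levels)]


def quantize_block(block, codebook):
    # Blockwise absmax scaling, then nearest code book entry, then dequantize.
    scale = max(abs(x) for x in block) or 1.0
    return [scale * min(codebook, key=lambda c: abs(x / scale - c)) for x in block]


def sq_error(block, codebook):
    # Squared reconstruction error of one block under one code book.
    deq = quantize_block(block, codebook)
    return sum((a - b) ** 2 for a, b in zip(block, deq))


def pick_codebook(block, codebooks):
    # Returns (selector index, error). With two code books the selector
    # fits in one extra bit per block.
    errors = [sq_error(block, cb) for cb in codebooks]
    best = min(range(len(codebooks)), key=errors.__getitem__)
    return best, errors[best]


if __name__ == "__main__":
    random.seed(0)
    block = [random.gauss(0, 1) for _ in range(64)]
    books = [normal_codebook(), uniform_codebook()]
    selector, err = pick_codebook(block, books)
    print("selector bit:", selector, "squared error:", err)
```

By construction the selected code book is never worse on a block than either code book used alone. Whether the gain justifies the extra bit would have to be measured on real weights.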

I have always been a private individual and do not have Facebook, LinkedIn, Twitter, etc. Please consider my application in spite of my eccentricity.

Challenge Progress:

  • Part A (if attempted):
    • Single triton kernel (+3)
    • Speedup checks:
      • If speedup <= 1.00 (-3)
      • If speedup >= 1.05 (+1)
      • If speedup >= 1.10 (+2)
      • If speedup >= 1.15 (+2)
    • Kernel works in torch compile (+1)
      • If not (-1)
    • Custom ASM works (+3)
    • Uses cache eviction (+1)
    • Tested in f16 and bf16 (+1)
      • If not (-1)

State: Submitted. Complete. Speedup 1.13x on Colab.

https://github.com/parnox/unsloth-puzzles-a

10.35% written by AI.

  • Part B (if attempted):
    • FSDP2 works with QLoRA:
      • With torch compile (+5)
      • Without torch compile (+3)
      • Uses part A and single kernel and faster (+3)
      • Uses torchAO:
        • If torchAO slower than BnB (-3)
    • TP or PP with QLoRA:
      • With zero bubble (+3)
      • Without zero bubble (+2)
    • FSDP1 works with QLoRA (+1)
    • Kaggle notebook 2 tesla t4 example (+2)
      • If not (score = 0)
    • If not attempted (-2)

State: Not submitted. Tested on a 2-GPU setup. Initial integration of FSDP2 with QLoRA.

https://github.com/parnox/unsloth-puzzles-b

  • Part C (if attempted):
    • Uses flex attention:
      • Dynamic sequence length works (+3)
      • If not (+1)
    • No torch compile BnB (-2)
    • Use part A (+1)
    • Torch compile BnB (+1)
    • Attention compiled:
      • With excessive recompilation (-3)
      • Without excessive recompilation (+2)
    • MLP compiled:
      • With excessive recompilation (-3)
      • Without excessive recompilation (+1)
    • Loss not compiled (-1)
    • Layernorms not compiled (-3)
    • Max autotune triton matmul:
      • With excessive recompilation (-2)
      • Without excessive recompilation (+2)
    • If not attempted (-1)

State: Not submitted. Graph breaks are down by around 50%, but training loss suffers. Currently: monkey patching the bnb matmul and adjacent functions.

https://github.com/parnox/unsloth-puzzles-c

45.50% written by AI.

  • Part D:
    • Unavailable - VLMs Data Collator

    • Unavailable - VLMs image resizing

    • Unavailable - GGUF Vision support

    • Unavailable - Support Flex Attention

    • Available - Support Sequence Classification

    • Available - Refactor Attention

    • Unavailable - Tool Calling

    • Available - VLMs train only on completions

    • Other issues (+1/+2, max 12)

State: Preliminary reading of issues and PRs.

  • Part E (if attempted):
    • VRAM 50% reduction (+2)
    • Remove float32 upcast (score = 0)
    • Show CE loss works (+1)
    • Show other functions work (+1)
    • Hardcoded gradients (score = 0)
    • Allows dynamic chunk sizes (+1)
    • Llama 1B training loss matches (+1)
      • If not (score = 0)
    • GRPO memory efficient linear works (+4)

State: Submitted. Complete.

https://github.com/parnox/unsloth-puzzles-e
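To illustrate what the Part E rubric asks for, here is a minimal forward-only sketch in plain Python of a chunked linear plus cross-entropy loss. It is not the submitted solution: the function names are hypothetical, there is no backward pass, and a real implementation would presumably use PyTorch tensors and recompute each chunk during backward. The point is only that the loss can be accumulated chunk by chunk, for any chunk size, without holding all logits at once.

```python
import math


def linear(x, weight):
    # weight holds one row per vocabulary entry; returns logits for one token.
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


def cross_entropy_row(logits, target):
    # Numerically stable log-sum-exp minus the target logit.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]


def full_loss(hidden, weight, targets):
    # Reference version: materializes logits for every token at once.
    logits = [linear(h, weight) for h in hidden]
    return sum(cross_entropy_row(l, t) for l, t in zip(logits, targets)) / len(targets)


def chunked_loss(hidden, weight, targets, chunk_size):
    # Processes chunk_size tokens at a time; only one chunk of logits is
    # alive at any moment. chunk_size may be any positive integer.
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        h_chunk = hidden[start:start + chunk_size]
        t_chunk = targets[start:start + chunk_size]
        chunk_logits = [linear(h, weight) for h in h_chunk]
        total += sum(cross_entropy_row(l, t) for l, t in zip(chunk_logits, t_chunk))
    return total / len(targets)


if __name__ == "__main__":
    hidden = [[0.1 * (i + j) for j in range(4)] for i in range(7)]
    weight = [[0.05 * (v - j) for j in range(4)] for v in range(5)]
    targets = [i % 5 for i in range(7)]
    for size in (1, 2, 3, 7, 10):
        print(size, chunked_loss(hidden, weight, targets, size))
    print("full", full_loss(hidden, weight, targets))
```

The chunked and full versions should agree up to floating-point rounding for every chunk size, which is the kind of check the "training loss matches" item implies.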
