Logic Rl

🎉 Successfully reproduced DeepSeek R1 Zero on 2K Logic Puzzle Dataset.

📢 Our detailed technical report is coming soon! Stay tuned!

See project explanation : here.

Wandb project : here.

Enhanced Features (After Rule-Based RL)

🚩 Uncertainty Marking	📝 Progressive Summarization	✅ Self Verification	🌐 Multilingual Switching
Flag ambiguous steps for verification	Maintain intermediate conclusions	First verify then answer	Chinese reasoning traces with English answers

📸 Results Preview


Test Score Plot	Average Output Length Plot

Model Output Example

Benchmark

Model	2ppl	3ppl	4ppl	5ppl	6ppl	7ppl	8ppl
o1-2024-12-17	0.83	0.51	0.38	0.38	0.35	0.30	0.20
GPT-4o	0.68	0.57	0.49	0.32	0.23	0.21	0.11
Deepseek-Math-7b	0.35	0.21	0.08	0.06	0.02	0.00	0.00
Qwen2.5-7B-Instruct-1M	0.49	0.40	0.25	0.11	0.02	0.06	0.01
Qwen2.5-7B-Logic-RL (ours)	0.68	0.59	0.44	0.34	0.22	0.16	0.15

Our model only used 2K training data with 400 training steps. More model benchmarks will be updated later this week.

🛠️ Installation

conda create -n logic python=3.9
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip3 install vllm==0.6.3 ray
pip3 install flash-attn --no-build-isolation
pip install -e .  # For verl integration
pip install wandb IPython matplotlib

Data Preparation

You can directly use /data.

For your own data generation, here's a demo:

Base Model

python ./examples/data_preprocess/kk.py \
    --local_dir {processed_data_path} \
    --data_path {raw_data_path}

Instruct Model

python ./examples/data_preprocess/kk.py \
    --template_type=qwen-instruct \
    --local_dir {processed_data_path} \
    --data_path {raw_data_path}

Training Execution

conda activate logic
bash main_grpo.sh  # 4×A100 80G

⚙️ Implementation Details

Component	Location
Reward Modeling	`verl/utils/reward_score/kk.py`
Data Preprocessing	`examples/data_preprocess/kk.py`

Citation

@misc{logic-rl,
author       = {Tian Xie and Qingnan Ren and Yuqian Hong and Zitian Gao},
title        = {Logic-RL},
howpublished = {https://github.com/Unakar/Logic-RL},
note         = {Accessed: 2025-02-03},
year         = {2025}
}

Acknowledgements

Verl 🔗
TinyZero 🔗
Knights and Knaves (K&K) puzzles dataset 🔗

Star History

About

Reproduce R1 Zero on Logic Puzzle

Apache License 2.0

Languages

Language:Python 97.3%Language:Shell 2.7%