Arxiv Translation Project

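This repo aims to keep up with the flood of new papers by providing Korean-translated web pages, so that arXiv papers can be skimmed quickly. Because the PDF files come in many different layouts, the text is extracted with the nougat OCR library, so extraction may not always be clean. We first considered translating Ar5iv, but Ar5iv only picks up a paper about a month after publication and converts only the first version to HTML, without reflecting the final version, so we decided to extract the content ourselves. To understand a paper accurately, we recommend reading the original.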

Paper List

Links do not open in a new window automatically. We recommend opening them in a new window manually.

ArXiv ID	Title	ArXiv	Go to
2403.06634	Stealing Part of a Production Language Model	arXiv	page
2403.06563v1	Unraveling the Mystery of Scaling Laws Part I	arXiv	page
2403.04706v1	Common 7B Language Models Already Possess Strong Math Capabilities	arXiv	page
2403.04652v1	Yi Open Foundation Models by 01AI	arXiv	page
2403.03883v2	SaulLM-7B A pioneering Large Language Model for Law	arXiv	page
2403.03507v1	GaLore Memory-Efficient LLM Training by Gradient Low-Rank Projection	arXiv	page
2403.02178v1	Masked Thought Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models	arXiv	page
2402.18815v1	How do Large Language Models Handle Multilingualism?	arXiv	page
2402.18563v1	Approaching Human-Level Forecasting with Language Models	arXiv	page
2402.16837v1	Do Large Language Models Latently Perform Multi-Hop Reasoning?	arXiv	page
2402.16819v2	Nemotron-4 15B Technical Report	arXiv	page
2402.14714v1	Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models	arXiv	page
2402.12847v1	Instruction-tuned Language Models are Better Knowledge Learners	arXiv	page
2402.08939v1	Premise Order Matters in Reasoning with Large Language Models	arXiv	page
2402.07043v1	A Tale of Tails Model Collapse as a Change of Scaling Laws	arXiv	page
2402.06196v2	Large Language Models A Survey	arXiv	page
2402.00838v3	OLMo Accelerating the Science of Language Models	arXiv	page
2401.16380v1	Rephrasing the Web A Recipe for Compute and Data-Efficient Language Modeling	arXiv	page
2401.10225v1	ChatQA Building GPT-4 Level Conversational QA Models	arXiv	page
2401.08417v3	Contrastive Preference Optimization Pushing the Boundaries of LLM Performance in Machine Translation	arXiv	page
2401.05654v1	Towards Conversational Diagnostic AI	arXiv	page
2401.01055v2	LLaMA Beyond English An Empirical Study on Language Capability Transfer	arXiv	page
2312.05934v3	Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs	arXiv	page
2311.13647	Language Model Inversion	arXiv	page
2311.12023v2	LQ-LoRA Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning	arXiv	page
2310.11511	Self-RAG Learning to Retrieve Generate and Critique through Self-Reflection	arXiv	page
2310.01889	Ring Attention with Blockwise Transformers for Near-Infinite Context	arXiv	page
2309.12288	The Reversal Curse LLMs trained on A is B fail to learn B is A	arXiv	page
2308.12284	D4 Improving LLM Pretraining via Document De-Duplication and Diversification	arXiv	page
2304.08177v3	Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca	arXiv	page
2110.03215	Towards Continual Knowledge Learning of Language Models	arXiv	page
2107.06499	Deduplicating Training Data Makes Language Models Better	arXiv	page
2104.13478	Geometric Deep Learning Grids Groups Graphs Geodesics and Gauges	arXiv	page

Procedure

Translating an arXiv paper takes four steps.

ArXiv Paper Download

arXiv does not allow PDF files to be downloaded with commands such as wget, presumably to counter indiscriminate scraping. The arxiv-dl package is therefore used to download the PDF files.
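As a rough illustration of the input to this step, the sketch below normalizes an ID from the paper list (with or without a version suffix such as v1) and builds the matching arXiv URLs. The helper names are hypothetical and not part of this repository. The download itself is left to arxiv-dl, and its invocation is not shown because the source does not specify it.

```python
import re
from typing import NamedTuple, Optional

# New-style arXiv identifiers: YYMM.NNNNN, optionally followed by a version (v1, v2, ...).
ARXIV_ID_RE = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")


class ArxivId(NamedTuple):
    base: str               # e.g. "2403.06563"
    version: Optional[str]  # e.g. "v1", or None when the ID is unversioned


def parse_arxiv_id(raw: str) -> ArxivId:
    """Split an ID such as '2403.06563v1' into its base and version parts."""
    match = ARXIV_ID_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a new-style arXiv ID: {raw!r}")
    return ArxivId(match.group(1), match.group(2))


def abs_url(paper: ArxivId) -> str:
    """URL of the abstract page; an unversioned ID resolves to the latest version."""
    return f"https://arxiv.org/abs/{paper.base}{paper.version or ''}"


def pdf_url(paper: ArxivId) -> str:
    """URL of the PDF for the same paper and version."""
    return f"https://arxiv.org/pdf/{paper.base}{paper.version or ''}"
```

Keeping the version separate makes it easy to tell which revision a translated page was built from, which matters here because the motivation for not using Ar5iv was that it lags behind the final version.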

PDF to Markdown

Nougat OCR is used to convert the PDF into a Mathpix Markdown file.
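A minimal sketch of how this step could be scripted, assuming the nougat command-line tool is installed and on the PATH. The function names and directory layout are hypothetical; the source does not say how the repository actually calls Nougat.

```python
import subprocess
from pathlib import Path
from typing import List


def build_nougat_command(pdf_path: Path, out_dir: Path) -> List[str]:
    """Build the argument list for `nougat <pdf> -o <out_dir>`."""
    return ["nougat", str(pdf_path), "-o", str(out_dir)]


def expected_markdown_path(pdf_path: Path, out_dir: Path) -> Path:
    """Nougat names its output after the PDF, with a .mmd (Mathpix Markdown) extension."""
    return out_dir / (pdf_path.stem + ".mmd")


def convert_pdf(pdf_path: Path, out_dir: Path) -> Path:
    """Run Nougat on one PDF and return the path where the .mmd file should appear."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if Nougat exits with a non-zero status.
    subprocess.run(build_nougat_command(pdf_path, out_dir), check=True)
    return expected_markdown_path(pdf_path, out_dir)
```

Separating command construction from execution keeps the part that needs the OCR model isolated, so the path handling can be checked without running Nougat.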

Translation

Translation is performed with an in-house translation model. In the comparison chart, the translator used for these papers (shown in green) performs roughly midway between DeepL and Google/Naver.

Markdown to HTML

The Mathpix Markdown is converted to HTML. The conversion method is described here. The resulting HTML files are pushed to GitHub and rendered through githack.com.
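For illustration, the sketch below builds the kind of raw.githack.com URL that serves an HTML file pushed to a GitHub repository. The branch name and file path in the usage example are assumptions; the source does not state where the HTML files live in the repository.

```python
from urllib.parse import quote


def githack_url(user: str, repo: str, branch: str, file_path: str) -> str:
    """Return the raw.githack.com URL for a file in a GitHub repository.

    raw.githack.com serves files with proper Content-Type headers, so an
    HTML file is rendered by the browser instead of shown as plain text.
    """
    # Percent-encode unusual characters but keep '/' as the path separator.
    path = quote(file_path.lstrip("/"))
    return f"https://raw.githack.com/{user}/{repo}/{branch}/{path}"


if __name__ == "__main__":
    # "main" and "html/2403.06634.html" are hypothetical placeholders.
    print(githack_url("kh-kim", "arxiv-translator", "main", "html/2403.06634.html"))
```

Serving through githack avoids running a separate web server: pushing the HTML to GitHub is enough to make the translated page viewable.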

Future Work

Images inside the papers are currently missing because Nougat OCR does not extract them. The plan is to include the images in the output as well.

Contact

Kim Ki Hyun pointzz.ki@gmail.com

kh-kim / arxiv-translator