
LATIN-Prompt

Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering


News

  • 2023.09.07: We support the Llama2 and Llama2-chat models.
  • 2023.09.06: We introduce LATIN-Tuning for Alpaca, which enhances its zero-shot performance on DocVQA from 0.3567 to 0.6697.
  • 2023.06.30: We now provide implementations based on Azure OpenAI text-davinci-003 Completion.
  • 2023.06.29: We now provide implementations based on Alpaca-7B and Vicuna-13B.

Roadmap

  • DUE OCR results
  • Alpaca 7B
  • Vicuna 13B
  • Azure OpenAI gpt-3.5-turbo + Completion
  • Azure OpenAI gpt-3.5-turbo + ChatCompletion (in progress)
  • Azure OpenAI text-davinci-003 + Completion
  • MPT-30B-Chat (todo)
  • Orca (todo)
  • GPT-4 and official OpenAI API (todo; we are working to obtain access to the official OpenAI API)
  • LLaMA2-chat

Preparation

Prepare the environment

pip install -r requirements.txt

Set environment variables for Claude and OpenAI

Set ANTHROPIC_API_KEY for Claude. Please refer to the script ./utils/claude.py.

export ANTHROPIC_API_KEY="Your API key"
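
For reference, a minimal sketch of how such a key is typically consumed, assuming the 2023-era anthropic SDK (pre-0.3 Client/completion interface); the repo's actual calls live in ./utils/claude.py:

import os
import anthropic

# Read the key exported above; a KeyError here means the variable is missing.
client = anthropic.Client(api_key=os.environ["ANTHROPIC_API_KEY"])

# Single-turn completion in the HUMAN/AI prompt format this SDK version expects.
response = client.completion(
    prompt=f"{anthropic.HUMAN_PROMPT} What is the date on this document?{anthropic.AI_PROMPT}",
    model="claude-v1",  # assumed model name; the repo may pin a different version
    max_tokens_to_sample=256,
    stop_sequences=[anthropic.HUMAN_PROMPT],
)
print(response["completion"])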

Set the OPENAI_API_KEY and OPENAI_API_BASE for Azure OpenAI. Please refer to the script ./utils/openai_api.py. For the differences between Azure OpenAI and OpenAI, see here.

export OPENAI_API_KEY="Your API key"
export OPENAI_API_BASE="Your base url"
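
These variables map onto the openai Python package's Azure settings (0.x-series interface). A minimal sketch, with the API version and deployment name as assumptions; see ./utils/openai_api.py for the repo's actual setup:

import os
import openai

# Azure OpenAI routes requests by resource endpoint plus deployment name.
openai.api_type = "azure"
openai.api_key = os.environ["OPENAI_API_KEY"]
openai.api_base = os.environ["OPENAI_API_BASE"]  # e.g. https://<resource>.openai.azure.com/
openai.api_version = "2023-05-15"                # assumed; match your Azure resource

# Under Azure, `engine` is the deployment name rather than the model name.
response = openai.Completion.create(
    engine="text-davinci-003",  # assumed deployment name
    prompt="Document:\n...\n\nQuestion: ...\nAnswer:",
    max_tokens=100,
    temperature=0,
)
print(response["choices"][0]["text"])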

Note: Due to resource constraints, our experiments currently all use the Azure OpenAI API. We are working to obtain access to the official OpenAI API; if you can provide relevant resources, please contact the authors, and we will be very grateful.

Set environment variable for data directory

export DATAS_DIR="Your data directory"

Prepare the dataset

  • Download the DocVQA dataset with Azure OCR results from the DUE Benchmark and put it in DATAS_DIR.
  • Download the DocVQA dataset with official OCR results from the Robust Reading Competition and put it in DATAS_DIR.
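
A quick sanity check that the data landed where the scripts can find it; the subdirectory names below mirror the dataset keys used in the example commands and are an assumption, not verified against the loaders:

import os

datas_dir = os.environ["DATAS_DIR"]
for name in ["docvqa_due_azure", "docvqa"]:  # assumed layout, matching the dataset keys below
    path = os.path.join(datas_dir, name)
    print(f"{path}: {'found' if os.path.isdir(path) else 'MISSING'}")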

Set the path of model checkpoints

Refer to the script ./utils/model_path_config.py.
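
That script maps the model keys used in the commands below to local checkpoint paths. A hypothetical sketch of its shape (keys and paths are placeholders, not the file's actual contents):

# Hypothetical shape of ./utils/model_path_config.py: model key -> checkpoint path.
MODEL_PATHS = {
    "alpaca-7b": "/path/to/alpaca-7b",
    "vicuna-13b": "/path/to/vicuna-13b",
    "llama2-13b-chat": "/path/to/Llama-2-13b-chat-hf",
}

def get_model_path(name: str) -> str:
    return MODEL_PATHS[name]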

Examples

Examples with Claude

Example: Claude + LATIN-Prompt on DocVQA (Azure OCR)

bash script/claude_eval.sh 0 claude docvqa_due_azure task_instruction_space

Example: Claude + Plain Prompt on DocVQA (Azure OCR)

bash script/claude_eval.sh 0 claude docvqa_due_azure plain

Example: Claude + Task Description on DocVQA (Azure OCR)

bash script/claude_eval.sh 0 claude docvqa_due_azure task_instruction

Example: Claude + Layout-aware Document on DocVQA (Azure OCR)

bash script/claude_eval.sh 0 claude docvqa_due_azure space
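
The four variants above correspond to the prompt keys plain (raw OCR text), task_instruction (task description only), space (layout-aware document only), and task_instruction_space (LATIN-Prompt, combining both); the positional arguments appear to be a device index, the model key, the dataset, and the prompt type. The layout-aware document rebuilds the page's 2-D layout in plain text by inserting spaces and line breaks derived from OCR word boxes. A minimal sketch of that idea, not the repo's exact algorithm:

# Sketch: reconstruct rough page layout from OCR words and boxes.
# Each word is (text, x0, y0, x1, y1) in pixels; char_width is an assumed average glyph width.
def layout_text(words, char_width=8, line_tol=5):
    lines = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):  # sort by y, then x
        if lines and abs(word[2] - lines[-1][0]) <= line_tol:
            lines[-1][1].append(word)        # close enough in y: same visual line
        else:
            lines.append([word[2], [word]])  # otherwise start a new line
    out = []
    for _, line in lines:
        row, cursor = "", 0
        for text, x0, *_ in sorted(line, key=lambda w: w[1]):
            col = max(int(x0 / char_width), cursor + 1 if row else 0)
            row += " " * (col - cursor) + text
            cursor = col + len(text)
        out.append(row)
    return "\n".join(out)

print(layout_text([("Invoice", 10, 10, 80, 25), ("2023-06-01", 300, 10, 390, 25),
                   ("Total:", 10, 40, 60, 55), ("$42.00", 300, 40, 360, 55)]))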

Examples with Azure OpenAI GPT-3.5-turbo (ChatGPT) Completion

Example: GPT-3.5-turbo + LATIN-Prompt on DocVQA (Azure OCR)

bash script/claude_eval.sh 0 gpt-35 docvqa_due_azure task_instruction_space

Example: GPT-3.5-turbo + Plain Prompt on DocVQA (Azure OCR)

bash script/claude_eval.sh 0 gpt-35 docvqa_due_azure plain

Examples with Azure OpenAI GPT-3.5-turbo (ChatGPT) ChatCompletion

Example: GPT-3.5-turbo + LATIN-Prompt on DocVQA

bash script/claude_eval.sh 0 gpt-35-chat docvqa task_instruction_space

Examples with Azure OpenAI text-davinci-003 Completion

Example: text-davinci-003 + LATIN-Prompt on DocVQA (Azure OCR)

bash script/claude_eval.sh 0 text-davinci-003 docvqa_due_azure task_instruction_space

Examples with Alpaca and Vicuna

Example: Alpaca + LATIN-Prompt on DocVQA (Azure OCR)

bash script/llama_eval.sh 0 alpaca-7b docvqa_due_azure task_instruction_space

Example: Vicuna + LATIN-Prompt on DocVQA (Azure OCR)

bash script/vllm_eval.sh 0 vicuna-13b docvqa_due_azure task_instruction_space

Examples with LLaMA2-chat

Example: LLaMA2-chat + LATIN-Prompt on DocVQA (Azure OCR)

bash script/vllm_eval.sh 0 llama2-13b-chat docvqa_due_azure task_instruction_space
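
The llama_eval.sh and vllm_eval.sh scripts wrap local inference over these open-weight checkpoints. A rough sketch of the equivalent single-example call with Hugging Face transformers; the checkpoint path, instruction wording, and generation settings are assumptions:

from transformers import AutoModelForCausalLM, AutoTokenizer

# The path would come from ./utils/model_path_config.py; placeholder here.
model_path = "/path/to/vicuna-13b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# LATIN-style prompt: task instruction + layout-aware document + question.
prompt = (
    "You are asked to answer a question using the document below.\n\n"
    "<layout-aware document text>\n\n"
    "Question: What is the total?\nAnswer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))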

Performance

DocVQA (Azure OCR, DUE)

By default, the results in this table are based on the Azure OCR results provided by the DUE Benchmark. Entries marked Official OCR are based on the OCR results provided by the Robust Reading Competition.

| Model | Prompt | Test ANLS | Δ | Val ANLS | Δ |
|---|---|---|---|---|---|
| Claude | Plain | 0.2298 | - | 0.2144 | - |
| Claude | LATIN | 0.8366 | +0.6068 | 0.8311 | +0.6167 |
| Azure OpenAI ChatGPT (Completion) | Plain | 0.6866 | - | 0.6795 | - |
| Azure OpenAI ChatGPT (Completion) | LATIN | 0.8255 | +0.1389 | 0.8135 | +0.1340 |
| Azure OpenAI ChatGPT (ChatCompletion) | Plain | TODO | - | TODO | - |
| Azure OpenAI ChatGPT (ChatCompletion) | LATIN | TODO | TODO | 0.5954 (Official OCR) | TODO |
| Azure OpenAI text-davinci-003 (Completion) | LATIN | - | - | 0.8188 | - |
| Alpaca (7B) | Plain | 0.3567 | - | 0.3506 | - |
| Alpaca (7B) | LATIN | 0.4200 | +0.0633 | 0.4304 | +0.0798 |
| Alpaca (7B) | LATIN-Tuning + LATIN-Prompt | 0.6697 | +0.3130 | 0.6668 | +0.3162 |
| Vicuna (13B) | Plain | 0.0710 | - | 0.0688 | - |
| Vicuna (13B) | LATIN | 0.4725 | +0.4015 | 0.4597 | +0.3909 |
| Llama2-13b-chat | Plain | 0.1783 | - | 0.1863 | - |
| Llama2-13b-chat | LATIN | 0.4283 | +0.2500 | 0.4435 | +0.2572 |
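
ANLS (Average Normalized Levenshtein Similarity) is the standard DocVQA metric: for each question it takes the best value of 1 - NL(prediction, reference) over all reference answers, zeroes out scores below 0.5, and averages over questions. A minimal self-contained sketch:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]

def anls(predictions, references, threshold=0.5):
    # predictions: list[str]; references: list[list[str]], several answers per question.
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            best = max(best, 1 - nl)
        total += best if best >= threshold else 0.0
    return total / len(predictions)

print(anls(["$42.00"], [["$42.00", "42.00"]]))  # 1.0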

Citation

@misc{wang2023layout,
      title={Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering}, 
      author={Wenjin Wang and Yunhao Li and Yixin Ou and Yin Zhang},
      year={2023},
      eprint={2306.00526},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

License: MIT


Languages

Python 93.7%, Shell 6.3%