TL;DR
BigDocs is a large-scale multimodal dataset designed to enhance document understanding, with 7.5 million diverse samples spanning 30 tasks. It equips models to tackle complex document challenges through novel tasks such as reasoning over graphical user interfaces (GUIs), websites, and documents, and generating code from images.
- **Bridging the Document AI Gap:** Comprehensive multimodal samples that enable models to move beyond basic OCR.
- **Complete Transparency:** Clear documentation and permissive licensing for broad use.
- **Real-World Innovation:** Novel tasks such as GUI reasoning and multimodal code generation.
- **Performance Gains:** Up to 15.14% improvement on document benchmarks when training with BigDocs.
What is BigDocs?
BigDocs is a multimodal dataset effort for advanced document understanding, consisting of two key components:
- **BigDocs-7.5M:** A high-quality, open-access dataset of 7.5 million multimodal documents spanning 30 tasks.
- **BigDocs-Bench:** A benchmark suite of 10 real-world-inspired tasks, such as reasoning over graphical user interfaces (GUIs), websites, and documents, and generating code from images. A minimal loading sketch follows.
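For readers who want to explore the data, here is a minimal loading sketch using the Hugging Face `datasets` library. The repository ids and config name below are assumptions for illustration; check the official release pages for the exact identifiers.

```python
# Minimal sketch of loading the two BigDocs components from the Hugging Face
# Hub. Both repository ids and the config name are hypothetical placeholders.
from datasets import load_dataset

train_data = load_dataset("ServiceNow/BigDocs-7.5M", split="train")              # assumed id
bench = load_dataset("ServiceNow/BigDocs-Bench", "Chart2Caption", split="test")  # assumed ids
print(bench[0].keys())  # e.g., an image plus a query and reference annotation
```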

BigDocs-Bench Datasets & Tasks
BigDocs-Bench comprises a diverse set of tasks designed to evaluate model performance across different document understanding scenarios. Below is a detailed breakdown of the dataset composition for each task, followed by a sketch of iterating over one split:
Task | Train | Val | Test | Hidden | Tokens (mean ± std) |
---|---|---|---|---|---|
Screenshot2HTML | 9.3K | 1000 | 500 | 500 | 32.7K±53K |
Table2LaTeX | 77.7K | 1000 | 500 | 500 | 438±540 |
Image2SVG | 198K | 2000 | 748 | 500 | 2.9K±1.7K |
Image2Flow (GraphViz) | 8.0K | 1000 | 500 | 500 | 418±124 |
Image2Flow (JSON) | 8.0K | 1000 | 500 | 500 | 1.8K±601 |
Chart2Markdown | 4.5K | 1000 | 500 | 500 | 1.6K±4.4K |
Chart2Caption | 5.4K | 1300 | 650 | 500 | 94±49 |
GUI2UserIntent | 79K | 1000 | 500 | 500 | 28±4 |
GUI2Summary | 79K | 1000 | 500 | 500 | 132±25 |
GUI-VQA | 78.9K | 1000 | 500 | 500 | 35±24 |
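To make the splits above concrete, here is a minimal sketch of collecting predictions on one task's validation split. The repository id, config name, and field names (`image`, `query`, `annotations`) are assumptions for illustration; inspect the released dataset for the actual schema.

```python
# Hypothetical harness over one BigDocs-Bench task split; the repo/config ids
# and field names are assumed, and `generate_answer` stands in for any
# vision-language model's inference call.
from datasets import load_dataset

def generate_answer(image, query):
    return "stub answer"  # replace with your VLM call

ds = load_dataset("ServiceNow/BigDocs-Bench", "GUI-VQA", split="val")  # assumed ids
results = [
    {"prediction": generate_answer(s["image"], s["query"]), "reference": s["annotations"]}
    for s in ds
]
print(f"Collected {len(results)} predictions")  # GUI-VQA val has 1000 samples per the table
```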
BigDocs-Bench Leaderboard
Our evaluation demonstrates the effectiveness of fine-tuning on BigDocs. The leaderboard below compares models across the ten BigDocs-Bench tasks; the Avg. Score column is the unweighted mean of the per-task scores, reproduced in the snippet after the table:
Model | Chart2MD | Chart2Cap. | Image2Flow (GraphViz) | Image2Flow (JSON) | GUI2Sum. | GUI2Intent | Image2SVG | Screenshot2HTML | Table2LaTeX | GUI-VQA | Avg. Score |
---|---|---|---|---|---|---|---|---|---|---|---|
DocOwl-1.5-8B | 0.08 | 18.69 | 0.00 | 0.00 | 11.22 | 13.88 | 3.58 | 3.50 | 75.07 | 27.22 | 15.32 |
Qwen2-VL-2B | 41.17 | 22.88 | 0.00 | 0.00 | 23.98 | 17.70 | 23.18 | 6.46 | 74.83 | 26.40 | 23.66 |
Phi3.5-V-4B | 60.64 | 21.88 | 1.61 | 0.65 | 27.80 | 10.81 | 34.57 | 4.25 | 74.14 | 34.96 | 27.13 |
LLaVA-NeXT-7B | 22.00 | 20.67 | 1.58 | 0.46 | 21.99 | 12.38 | 20.53 | 5.00 | 73.81 | 27.54 | 20.60 |
Idefics2-8B | 25.34 | 20.95 | 1.17 | 0.00 | 8.75 | 5.06 | 37.73 | 3.56 | 74.50 | 27.76 | 20.48 |
Llama-3.2-90B | 45.21 | 20.60 | 0.73 | 0.52 | 22.16 | 12.04 | 45.97 | 7.32 | 74.79 | 27.28 | 25.66 |
Qwen2-VL-72B | 70.47 | 19.42 | 1.07 | 0.23 | 18.80 | 33.94 | 54.43 | 10.03 | 74.51 | 30.67 | 31.36 |
Gemini-1.5-Pro | 66.70 | 25.23 | 22.66 | 27.28 | 27.12 | 17.57 | 60.34 | 10.33 | 74.65 | 36.58 | 36.84 |
DocOwl-1.5-8B + BigDocs | 54.81 | 23.59 | 13.92 | 37.46 | 26.45 | 13.12 | 25.46 | 9.70 | 74.44 | 26.58 | 30.55 |
LLaVA-NeXT-7B + BigDocs | 76.63 | 25.90 | 11.51 | 33.59 | 25.54 | 16.79 | 15.21 | 7.43 | 75.22 | 35.35 | 32.32 |
Idefics2-8B + BigDocs | 74.43 | 33.38 | 42.16 | 48.54 | 45.55 | 89.15 | 33.66 | 3.64 | 81.28 | 43.46 | 49.52 |
Llama-3.2-90B + BigDocs | 72.25 | 33.74 | 41.61 | 52.11 | 42.59 | 71.65 | 33.51 | 9.20 | 78.54 | 33.97 | 46.92 |
Qwen2-VL-2B + BigDocs | 72.78 | 32.88 | 59.66 | 71.49 | 46.14 | 79.55 | 60.63 | 10.40 | 80.79 | 40.67 | 55.50 |
Qwen2-VL-2B (base) + BigDocs (Ours) | 84.01 | 36.78 | 63.07 | 71.86 | 47.32 | 86.91 | 34.65 | 12.05 | 81.94 | 44.81 | 56.34 |
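As a sanity check, the Avg. Score column is consistent with an unweighted mean of the ten per-task scores; the snippet below reproduces the average for the last row.

```python
# Reproduce the Avg. Score for "Qwen2-VL-2B (base) + BigDocs (Ours)" as the
# plain mean of its ten per-task scores from the leaderboard above.
scores = [84.01, 36.78, 63.07, 71.86, 47.32, 86.91, 34.65, 12.05, 81.94, 44.81]
print(round(sum(scores) / len(scores), 2))  # 56.34, matching the table
```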
Current Limitations in the Field
Despite recent advances in document AI, several challenges persist in the field. We identify three key limitations that BigDocs aims to address:
- **Scarcity of Open Datasets:** Many datasets used to train VLMs are not publicly available, and there is little transparency about their contents.
- **Simple Tasks in Open Datasets:** Public datasets often cover only basic tasks, which is insufficient for complex real-world challenges.
- **Restrictive Licensing:** Unclear or restrictive licenses make many datasets difficult to use for business purposes.
BigDocs-7.5M Dataset
The BigDocs-7.5M dataset represents a significant advancement in document understanding, offering comprehensive coverage across multiple domains and tasks. Our dataset is structured around three primary categories:
Task Categories
- **Document Information Extraction:** Enhanced OCR, layout analysis, and table detection
- **Document Understanding:** Document classification, question answering, and diagram analysis
- **Document Creation and Manipulation:** Transforming visual data into HTML, LaTeX, Markdown, and JSON (see the sketch below)
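To illustrate a "Document Creation and Manipulation" task, here is a sketch of rendering a table image into LaTeX with an off-the-shelf VLM. Qwen2-VL-2B-Instruct is used only because it appears in the tables above; the input path and prompt are illustrative, and a BigDocs-tuned checkpoint would be invoked the same way.

```python
# Sketch of an image-to-LaTeX conversion with Qwen2-VL-2B-Instruct via
# Hugging Face transformers. "table.png" and the prompt are hypothetical.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("table.png")  # hypothetical input image
conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this table to LaTeX."},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
latex = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(latex)
```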
Results of Training on BigDocs-7.5M
Our experimental results demonstrate significant improvements across multiple benchmarks, showcasing the effectiveness of training with BigDocs-7.5M:
- **Performance Boost:** Up to 34.5% improvement from fine-tuning on BigDocs, enabling superior document understanding capabilities.
- **Competitive Edge:** Surpasses proprietary models by 25.8% on BigDocs-Bench, demonstrating the dataset's value for real-world tasks.
Model | DocVQA (val) | InfoVQA (val) | DeepForm (test) | KLC (test) | WTQ (test) | TabFact (test) | ChartQA (test) | TextVQA (val) | MMIM (test) | DudeMini (test) | SlideVQA-M (test) | TableVQA (test) | Avg. Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DocOwl1.5-8B (instruct) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 33.67 | 34.64 | 31.62 | 52.60 | 53.84 |
DocOwl1.5-8B (base) | 2.07 | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.44 | 19.07 | 3.30 | 13.63 | 5.36 |
DocOwl1.5-8B (base) + DocStruct4M | 75.99 | 46.88 | 62.77 | 35.21 | 32.86 | 71.56 | 68.36 | 65.08 | 33.67 | 29.00 | 27.03 | 46.27 | 49.56 |
DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 32.33 | 32.55 | 29.60 | 49.03 | 51.05 |
Qwen2-VL-2B (instruct) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 42.00 | 45.23 | 46.50 | 43.07 | 53.03 |
Qwen2-VL-2B (base) | 7.26 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.14 | 34.89 | 28.43 | 14.55 | 0.00 | 7.25 |
Qwen2-VL-2B (base) + DocStruct4M | 59.53 | 32.00 | 53.98 | 36.38 | 28.48 | 64.24 | 54.44 | 55.89 | 34.89 | 28.78 | 22.68 | 46.53 | 43.15 |
Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 35.67 | 27.19 | 17.46 | 47.53 | 43.89 |
Phi3.5-Vision-4B (instruct) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 46.00 | 37.20 | 30.93 | 70.70 | 45.66 |
Phi3.5-Vision-4B + DocStruct4M | 86.76 | 68.90 | 70.12 | 37.83 | 51.30 | 82.12 | 79.76 | 68.60 | 44.11 | 35.52 | 31.90 | 69.17 | 60.51 |
Phi3.5-Vision-4B + BigDocs (Ours) | 87.05 | 70.05 | 70.97 | 37.45 | 51.21 | 81.24 | 81.56 | 68.72 | 45.00 | 36.15 | 32.47 | 67.77 | 60.80 |
LLaVA-NeXT-7B (instruct) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 38.89 | 17.94 | 7.46 | 32.87 | 32.36 |
LLaVA-NeXT-7B + DocStruct4M | 60.95 | 26.14 | 39.78 | 28.34 | 25.90 | 67.72 | 61.20 | 52.25 | 25.78 | 21.70 | 15.33 | 27.03 | 37.68 |
LLaVA-NeXT-7B + BigDocs (Ours) | 57.13 | 24.47 | 46.38 | 31.09 | 27.06 | 72.58 | 54.72 | 49.06 | 17.78 | 22.88 | 16.07 | 33.13 | 37.70 |
Citation
If you find this work useful for your research, please consider citing our paper:
```bibtex
@misc{rodriguez2025bigdocsopendatasettraining,
  title={BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks},
  author={Juan Rodriguez and Xiangru Jian and Siba Smarak Panigrahi and Tianyu Zhang and
          Aarash Feizi and Abhay Puri and Akshay Kalkunte and François Savard and
          Ahmed Masry and Shravan Nayak and Rabiul Awal and Mahsa Massoud and
          Amirhossein Abaskohi and Zichao Li and Suyuchen Wang and Pierre-André Noël and
          Mats Leon Richter and Saverio Vadacchino and Shubham Agarwal and Sanket Biswas and
          Sara Shanian and Ying Zhang and Noah Bolger and Kurt MacDonald and Simon Fauvel and
          Sathwik Tejaswi and Srinivas Sunkara and Joao Monteiro and Krishnamurthy DJ Dvijotham and
          Torsten Scholak and Nicolas Chapados and Sepideh Kharagani and Sean Hughes and
          M. Özsu and Siva Reddy and Marco Pedersoli and Yoshua Bengio and Christopher Pal and
          Issam Laradji and Spandana Gella and Perouz Taslakian and David Vazquez and Sai Rajeswar},
  year={2025},
  eprint={2412.04626},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2412.04626}
}
```