Introduction
Documents remain a vital part of daily life, often combining text with visual elements like charts, tables, and diagrams to provide context and meaning. To enable AI systems to match human-like understanding of such documents, models must move beyond basic OCR to comprehend layouts, diagrams, and even hand-drawn sketches.
What is BigDocs?
BigDocs is a multimodal dataset effort for advanced document understanding, consisting of two key components:
- BigDocs-7.5M: A high-quality, open-access, large-scale dataset of 7.5 million multimodal documents spanning 30 tasks
- BigDocs-Bench: A benchmark suite of 10 real-world-inspired tasks, such as reasoning over graphical user interfaces (GUIs), websites, and documents, and generating code from images
BigDocs-Bench Tasks
Task | Train | Val | Test | Hidden | Tokens |
---|---|---|---|---|---|
Screenshot2HTML | 9.3K | 1.0K | 500 | 500 | 32.7K±53K |
Table2LaTeX | 77.7K | 1.0K | 500 | 500 | 438±540 |
Image2SVG | 198K | 2.0K | 748 | 500 | 2.9K±1.7K |
Image2Flow (GraphViz) | 8.0K | 1.0K | 500 | 500 | 418±124 |
Image2Flow (JSON) | 8.0K | 1.0K | 500 | 500 | 1.8K±601 |
Chart2Markdown | 4.5K | 1.0K | 500 | 500 | 1.6K±4.4K |
Chart2Caption | 5.4K | 1.3K | 650 | 500 | 94±49 |
GUI2UserIntent | 79K | 1.0K | 500 | 500 | 28±4 |
GUI2Summary | 79K | 1.0K | 500 | 500 | 132±25 |
GUI-VQA | 79.0K | 1.0K | 500 | 500 | 35±24 |
BigDocs-Bench Leaderboard
Model | Chart2MD | Chart2Cap. | Image2Flow (GraphViz) | Image2Flow (JSON) | GUI2Sum. | GUI2Intent | Image2SVG | Screenshot2HTML | Table2LaTeX | GUI-VQA | Avg. Score |
---|---|---|---|---|---|---|---|---|---|---|---|
Open Models | | | | | | | | | | | |
DocOwl-1.5-8B | 0.08 | 18.69 | 0.00 | 0.00 | 11.22 | 13.88 | 3.58 | 3.50 | 75.07 | 27.22 | 15.32 |
Qwen2-VL-2B | 41.17 | 22.88 | 0.00 | 0.00 | 23.98 | 17.70 | 23.18 | 6.46 | 74.83 | 26.40 | 23.66 |
Phi3.5-V-4B | 60.64 | 21.88 | 1.61 | 0.65 | 27.80 | 10.81 | 34.57 | 4.25 | 74.14 | 34.96 | 27.13 |
LLAVA-NeXT-7B | 22.00 | 20.67 | 1.58 | 0.46 | 21.99 | 12.38 | 20.53 | 5.00 | 73.81 | 27.54 | 20.60 |
Idefics2-8B | 25.34 | 20.95 | 1.17 | 0.00 | 8.75 | 5.06 | 37.73 | 3.56 | 74.50 | 27.76 | 20.48 |
Llama-3.2-90B | 45.21 | 20.60 | 0.73 | 0.52 | 22.16 | 12.04 | 45.97 | 7.32 | 74.79 | 27.28 | 25.66 |
Qwen2-VL-72B | 70.47 | 19.42 | 1.07 | 0.23 | 18.80 | 33.94 | 54.43 | 10.03 | 74.51 | 30.67 | 31.36 |
Closed Models | | | | | | | | | | | |
GPT-4o 20240806 | 66.70 | 25.23 | 22.66 | 27.28 | 27.12 | 17.57 | 60.34 | 10.33 | 74.65 | 36.58 | 36.84 |
Claude-3.5 Sonnet | 54.81 | 23.59 | 13.92 | 37.46 | 26.45 | 13.12 | 25.46 | 9.70 | 74.44 | 26.58 | 30.55 |
GeminiPro-1.5 | 76.63 | 25.90 | 11.51 | 33.59 | 25.54 | 16.79 | 15.21 | 7.43 | 75.22 | 35.35 | 32.32 |
BigDocs Models (ours) | | | | | | | | | | | |
DocOwl-1.5-8B + BigDocs | 74.43 | 33.38 | 42.16 | 48.54 | 45.55 | 89.15 | 33.66 | 3.64 | 81.28 | 43.46 | 49.52 |
Qwen2-VL-2B + BigDocs | 72.25 | 33.74 | 41.61 | 52.11 | 42.59 | 71.65 | 33.51 | 9.20 | 78.54 | 33.97 | 46.92 |
LLAVA-NeXT-7B + BigDocs | 72.78 | 32.88 | 59.66 | 71.49 | 46.14 | 79.55 | 60.63 | 10.40 | 80.79 | 40.67 | 55.50 |
Phi3.5-V-4B + BigDocs | 84.01 | 36.78 | 63.07 | 71.86 | 47.32 | 86.91 | 34.65 | 12.05 | 81.94 | 44.81 | 56.34 |
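The Avg. Score column appears to be the unweighted mean of the ten task scores in each row. The page does not state this explicitly, so treat it as an assumption; the sketch below checks it against one row, using values copied from the leaderboard. The function and variable names are illustrative only.

```python
# Task scores copied from the leaderboard row "Phi3.5-V-4B + BigDocs".
phi_bigdocs_scores = {
    "Chart2MD": 84.01,
    "Chart2Cap.": 36.78,
    "Image2Flow (GraphViz)": 63.07,
    "Image2Flow (JSON)": 71.86,
    "GUI2Sum.": 47.32,
    "GUI2Intent": 86.91,
    "Image2SVG": 34.65,
    "Screenshot2HTML": 12.05,
    "Table2LaTeX": 81.94,
    "GUI-VQA": 44.81,
}
reported_avg = 56.34  # Avg. Score printed in the same row


def average_score(task_scores):
    """Unweighted mean over all tasks, rounded to two decimals.

    Assumption: every task contributes equally to Avg. Score.
    """
    return round(sum(task_scores.values()) / len(task_scores), 2)


computed = average_score(phi_bigdocs_scores)
print(f"computed={computed:.2f} reported={reported_avg:.2f}")
```

Because the per-task scores are themselves rounded to two decimals, a recomputed average for other rows can differ from the printed one in the last digit.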
Current Limitations in the Field
Scarcity of Open Datasets
Many datasets used to train vision-language models (VLMs) are not publicly available, and there is limited transparency about their content.
Simple Tasks in Open Datasets
Public datasets often address only basic tasks, which is insufficient for complex real-world challenges.
Restrictive Licensing
Unclear or restrictive licenses make many datasets difficult to use for business purposes.
BigDocs-7.5M Dataset
Task Categories
Document Information Extraction
Enhanced OCR, layout analysis, and table detection
Document Understanding
Document classification, question answering, and diagram analysis
Document Creation and Manipulation
Transform visual data into HTML, LaTeX, Markdown and JSON
Results of Training on BigDocs-7.5M
Model | DocVQA (val) | InfoVQA (val) | DeepForm (test) | KLC (test) | WTQ (test) | TabFact (test) | ChartQA (test) | TextVQA (val) | MMIM (test) | DudeMini (test) | SlideVQA-M (test) | TableVQA (test) | Avg. Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DocOwl1.5-8B (instruct) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 33.67 | 34.64 | 31.62 | 52.60 | 53.84 |
DocOwl1.5-8B (base) | 2.07 | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.44 | 19.07 | 3.30 | 13.63 | 5.36 |
DocOwl1.5-8B (base) + DocStruct4M | 75.99 | 46.88 | 62.77 | 35.21 | 32.86 | 71.56 | 68.36 | 65.08 | 33.67 | 29.00 | 27.03 | 46.27 | 49.56 |
DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 32.33 | 32.55 | 29.60 | 49.03 | 51.05 |
Qwen2-VL-2B (instruct) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 42.00 | 45.23 | 46.50 | 43.07 | 53.03 |
Qwen2-VL-2B (base) | 7.26 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.14 | 34.89 | 28.43 | 14.55 | 0.00 | 7.25 |
Qwen2-VL-2B (base) + DocStruct4M | 59.53 | 32.00 | 53.98 | 36.38 | 28.48 | 64.24 | 54.44 | 55.89 | 34.89 | 28.78 | 22.68 | 46.53 | 43.15 |
Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 35.67 | 27.19 | 17.46 | 47.53 | 43.89 |
Phi3.5-Vision-4B (instruct) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 46.00 | 37.20 | 30.93 | 70.70 | 45.66 |
Phi3.5-Vision-4B + DocStruct4M | 86.76 | 68.90 | 70.12 | 37.83 | 51.30 | 82.12 | 79.76 | 68.60 | 44.11 | 35.52 | 31.90 | 69.17 | 60.51 |
Phi3.5-Vision-4B + BigDocs (Ours) | 87.05 | 70.05 | 70.97 | 37.45 | 51.21 | 81.24 | 81.56 | 68.72 | 45.00 | 36.15 | 32.47 | 67.77 | 60.80 |
LLAVA-NeXT-7B (instruct) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 38.89 | 17.94 | 7.46 | 32.87 | 32.36 |
LLAVA-NeXT-7B + DocStruct4M | 60.95 | 26.14 | 39.78 | 28.34 | 25.90 | 67.72 | 61.20 | 52.25 | 25.78 | 21.70 | 15.33 | 27.03 | 37.68 |
LLAVA-NeXT-7B + BigDocs (Ours) | 57.13 | 24.47 | 46.38 | 31.09 | 27.06 | 72.58 | 54.72 | 49.06 | 17.78 | 22.88 | 16.07 | 33.13 | 37.70 |
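To compare the two training sets directly, the Avg. Score of each "+ DocStruct4M" row can be subtracted from that of the matching "+ BigDocs (Ours)" row. The sketch below does this with values copied from the table above; the dictionary layout and function name are illustrative, not part of any BigDocs tooling.

```python
# Avg. Score values copied from the table above:
# model -> (with DocStruct4M, with BigDocs)
avg_scores = {
    "DocOwl1.5-8B (base)": (49.56, 51.05),
    "Qwen2-VL-2B (base)": (43.15, 43.89),
    "Phi3.5-Vision-4B": (60.51, 60.80),
    "LLAVA-NeXT-7B": (37.68, 37.70),
}


def point_deltas(scores):
    """Absolute difference in average score (BigDocs minus DocStruct4M)."""
    return {
        model: round(bigdocs - docstruct, 2)
        for model, (docstruct, bigdocs) in scores.items()
    }


for model, delta in point_deltas(avg_scores).items():
    print(f"{model}: {delta:+.2f} points")
```

On these general benchmarks, all four differences are positive but small, ranging from 0.02 to 1.49 points.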
- Performance gains of up to 34.5% through fine-tuning on BigDocs
- Models fine-tuned on BigDocs surpass proprietary models by 25.8% on BigDocs-Bench
- Enables automation of repetitive document-related tasks
- Transforms unstructured data into structured formats