Introduction

Documents remain a vital part of daily life, often combining text with visual elements like charts, tables, and diagrams to provide context and meaning. To enable AI systems to match human-like understanding of such documents, models must move beyond basic OCR to comprehend layouts, diagrams, and even hand-drawn sketches.

What is BigDocs?

BigDocs is a multimodal dataset effort for advanced document understanding, consisting of two key components:

  • BigDocs-7.5M: A high-quality, open-access, large-scale dataset of 7.5 million multimodal documents spanning 30 tasks
  • BigDocs-Bench: A benchmark suite of 10 real-world-inspired tasks, such as reasoning over graphical user interfaces (GUIs), websites, and documents, and generating code from images

Data Distribution

BigDocs-Bench Tasks

| Task | Train | Val | Test | Hidden | Tokens |
|---|---|---|---|---|---|
| Screenshot2HTML | 9.3K | 1.0K | 500 | 500 | 32.7K±53K |
| Table2LaTeX | 77.7K | 1.0K | 500 | 500 | 438±540 |
| Image2SVG | 198K | 2.0K | 748 | 500 | 2.9K±1.7K |
| Image2Flow (GraphViz) | 8.0K | 1.0K | 500 | 500 | 418±124 |
| Image2Flow (JSON) | 8.0K | 1.0K | 500 | 500 | 1.8K±601 |
| Chart2Markdown | 4.5K | 1.0K | 500 | 500 | 1.6K±4.4K |
| Chart2Caption | 5.4K | 1.3K | 650 | 500 | 94±49 |
| GUI2UserIntent | 79K | 1.0K | 500 | 500 | 28±4 |
| GUI2Summary | 79K | 1.0K | 500 | 500 | 132±25 |
| GUI-VQA | 79.0K | 1.0K | 500 | 500 | 35±24 |

BigDocs-Bench Leaderboard

| Model | Chart2MD | Chart2Cap. | Image2Flow (GraphViz) | Image2Flow (JSON) | GUI2Sum. | GUI2Intent | Image2SVG | Screenshot2HTML | Table2LaTeX | GUI-VQA | Avg. Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Open Models | | | | | | | | | | | |
| DocOwl-1.5-8B | 0.08 | 18.69 | 0.00 | 0.00 | 11.22 | 13.88 | 3.58 | 3.50 | 75.07 | 27.22 | 15.32 |
| Qwen2-VL-2B | 41.17 | 22.88 | 0.00 | 0.00 | 23.98 | 17.70 | 23.18 | 6.46 | 74.83 | 26.40 | 23.66 |
| Phi3.5-V-4B | 60.64 | 21.88 | 1.61 | 0.65 | 27.80 | 10.81 | 34.57 | 4.25 | 74.14 | 34.96 | 27.13 |
| LLAVA-NeXT-7B | 22.00 | 20.67 | 1.58 | 0.46 | 21.99 | 12.38 | 20.53 | 5.00 | 73.81 | 27.54 | 20.60 |
| Idefics2-8B | 25.34 | 20.95 | 1.17 | 0.00 | 8.75 | 5.06 | 37.73 | 3.56 | 74.50 | 27.76 | 20.48 |
| Llama-3.2-90B | 45.21 | 20.60 | 0.73 | 0.52 | 22.16 | 12.04 | 45.97 | 7.32 | 74.79 | 27.28 | 25.66 |
| Qwen2-VL-72B | 70.47 | 19.42 | 1.07 | 0.23 | 18.80 | 33.94 | 54.43 | 10.03 | 74.51 | 30.67 | 31.36 |
| Closed Models | | | | | | | | | | | |
| GPT-4o 20240806 | 66.70 | 25.23 | 22.66 | 27.28 | 27.12 | 17.57 | 60.34 | 10.33 | 74.65 | 36.58 | 36.84 |
| Claude-3.5 Sonnet | 54.81 | 23.59 | 13.92 | 37.46 | 26.45 | 13.12 | 25.46 | 9.70 | 74.44 | 26.58 | 30.55 |
| GeminiPro-1.5 | 76.63 | 25.90 | 11.51 | 33.59 | 25.54 | 16.79 | 15.21 | 7.43 | 75.22 | 35.35 | 32.32 |
| BigDocs Models (ours) | | | | | | | | | | | |
| DocOwl-1.5-8B + BigDocs | 74.43 | 33.38 | 42.16 | 48.54 | 45.55 | 89.15 | 33.66 | 3.64 | 81.28 | 43.46 | 49.52 |
| Qwen2-VL-2B + BigDocs | 72.25 | 33.74 | 41.61 | 52.11 | 42.59 | 71.65 | 33.51 | 9.20 | 78.54 | 33.97 | 46.92 |
| LLAVA-NeXT-7B + BigDocs | 72.78 | 32.88 | 59.66 | 71.49 | 46.14 | 79.55 | 60.63 | 10.40 | 80.79 | 40.67 | 55.50 |
| Phi3.5-V-4B + BigDocs | 84.01 | 36.78 | 63.07 | 71.86 | 47.32 | 86.91 | 34.65 | 12.05 | 81.94 | 44.81 | 56.34 |
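
The Avg. Score column appears to be the unweighted mean of the ten per-task scores. That aggregation is an assumption inferred from the numbers, not something stated on this page. The sketch below checks it for two leaderboard rows; the names TASK_SCORES, REPORTED_AVG, and average_score are illustrative only. Some rows may differ from the reported value by about 0.01, presumably because the published per-task scores are already rounded.

```python
from statistics import mean

# Per-task scores copied from the leaderboard, in column order:
# Chart2MD, Chart2Cap., Image2Flow (GraphViz), Image2Flow (JSON), GUI2Sum.,
# GUI2Intent, Image2SVG, Screenshot2HTML, Table2LaTeX, GUI-VQA.
TASK_SCORES = {
    "DocOwl-1.5-8B": [0.08, 18.69, 0.00, 0.00, 11.22, 13.88, 3.58, 3.50, 75.07, 27.22],
    "Phi3.5-V-4B + BigDocs": [84.01, 36.78, 63.07, 71.86, 47.32, 86.91, 34.65, 12.05, 81.94, 44.81],
}

# Avg. Score values as reported in the last column of the leaderboard.
REPORTED_AVG = {"DocOwl-1.5-8B": 15.32, "Phi3.5-V-4B + BigDocs": 56.34}


def average_score(scores):
    """Unweighted mean over the per-task scores (assumed aggregation)."""
    return mean(scores)


for model, scores in TASK_SCORES.items():
    computed = average_score(scores)
    print(f"{model}: computed {computed:.2f}, reported {REPORTED_AVG[model]:.2f}")
```

Because the mean is unweighted, near-constant columns such as Table2LaTeX (roughly 74 to 82 for every model) raise all averages by a similar amount and contribute little to the ranking.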

Current Limitations in the Field

Scarcity of Open Datasets

Many datasets used to train vision-language models (VLMs) are not publicly available, and there is limited transparency about their content.

Simple Tasks in Open Datasets

Public datasets often address only basic tasks, insufficient for complex real-world challenges.

Restrictive Licensing

Unclear or restrictive licenses make many datasets difficult to use for business purposes.

BigDocs-7.5M Dataset

Task Categories

Document Information Extraction

Enhanced OCR, layout analysis, and table detection

Document Understanding

Document classification, question answering, and diagram analysis

Document Creation and Manipulation

Transform visual data into HTML, LaTeX, Markdown and JSON

Results of Training on BigDocs-7.5M

| Model | DocVQA (val) | InfoVQA (val) | DeepForm (test) | KLC (test) | WTQ (test) | TabFact (test) | ChartQA (test) | TextVQA (val) | MMIM (test) | DudeMini (test) | SlideVQA-M (test) | TableVQA (test) | Avg. Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DocOwl1.5-8B (instruct) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 33.67 | 34.64 | 31.62 | 52.60 | 53.84 |
| DocOwl1.5-8B (base) | 2.07 | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.44 | 19.07 | 3.30 | 13.63 | 5.36 |
| DocOwl1.5-8B (base) + DocStruct4M | 75.99 | 46.88 | 62.77 | 35.21 | 32.86 | 71.56 | 68.36 | 65.08 | 33.67 | 29.00 | 27.03 | 46.27 | 49.56 |
| DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 32.33 | 32.55 | 29.60 | 49.03 | 51.05 |
| Qwen2-VL-2B (instruct) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 42.00 | 45.23 | 46.50 | 43.07 | 53.03 |
| Qwen2-VL-2B (base) | 7.26 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.14 | 34.89 | 28.43 | 14.55 | 0.00 | 7.25 |
| Qwen2-VL-2B (base) + DocStruct4M | 59.53 | 32.00 | 53.98 | 36.38 | 28.48 | 64.24 | 54.44 | 55.89 | 34.89 | 28.78 | 22.68 | 46.53 | 43.15 |
| Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 35.67 | 27.19 | 17.46 | 47.53 | 43.89 |
| Phi3.5-Vision-4B (instruct) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 46.00 | 37.20 | 30.93 | 70.70 | 45.66 |
| Phi3.5-Vision-4B + DocStruct4M | 86.76 | 68.90 | 70.12 | 37.83 | 51.30 | 82.12 | 79.76 | 68.60 | 44.11 | 35.52 | 31.90 | 69.17 | 60.51 |
| Phi3.5-Vision-4B + BigDocs (Ours) | 87.05 | 70.05 | 70.97 | 37.45 | 51.21 | 81.24 | 81.56 | 68.72 | 45.00 | 36.15 | 32.47 | 67.77 | 60.80 |
| LLAVA-NeXT-7B (instruct) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 38.89 | 17.94 | 7.46 | 32.87 | 32.36 |
| LLAVA-NeXT-7B + DocStruct4M | 60.95 | 26.14 | 39.78 | 28.34 | 25.90 | 67.72 | 61.20 | 52.25 | 25.78 | 21.70 | 15.33 | 27.03 | 37.68 |
| LLAVA-NeXT-7B + BigDocs (Ours) | 57.13 | 24.47 | 46.38 | 31.09 | 27.06 | 72.58 | 54.72 | 49.06 | 17.78 | 22.88 | 16.07 | 33.13 | 37.70 |
  • Fine-tuning on BigDocs yields performance gains of up to 34.5%
  • Models trained on BigDocs surpass proprietary models by 25.8% on BigDocs-Bench
  • Enables automation of repetitive document-related tasks
  • Transforms unstructured data into structured formats
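
To make the DocStruct4M versus BigDocs comparison in the results table easier to read, the sketch below computes the difference in Avg. Score for each base model. The values are copied from the table; the names AVG_SCORES and bigdocs_delta are illustrative. The differences are absolute points of Avg. Score, not percentages, and they are not meant to reproduce the headline figures in the bullets above.

```python
# Avg. Score values copied from the results table, per base model and
# training dataset.
AVG_SCORES = {
    "DocOwl1.5-8B (base)": {"DocStruct4M": 49.56, "BigDocs": 51.05},
    "Qwen2-VL-2B (base)": {"DocStruct4M": 43.15, "BigDocs": 43.89},
    "Phi3.5-Vision-4B": {"DocStruct4M": 60.51, "BigDocs": 60.80},
    "LLAVA-NeXT-7B": {"DocStruct4M": 37.68, "BigDocs": 37.70},
}


def bigdocs_delta(scores):
    """Avg. Score with BigDocs minus Avg. Score with DocStruct4M, in points."""
    return round(scores["BigDocs"] - scores["DocStruct4M"], 2)


deltas = {model: bigdocs_delta(scores) for model, scores in AVG_SCORES.items()}

for model, delta in deltas.items():
    print(f"{model}: {delta:+.2f} points")
```

On these numbers, training on BigDocs gives a higher average than DocStruct4M for all four models, with margins ranging from 0.02 to 1.49 points. The per-benchmark columns are more mixed, so the average hides cases where DocStruct4M is ahead.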