LLM Assisted Visualization Analysis Pipeline

Zhiyang Wang

Advisor: Simone Salvo

A scalable multimodal LLM‑powered pipeline that automates the extraction, classification, and interactive 2D/3D exploration of large scholarly figure collections, enabling visualization service providers to accelerate trend analysis, technique discovery, and consulting workflows.

Project Website · Presentation

[Figure: 3D scatterplot of clustered figure thumbnails in a browser window]

Project Description

This project presents a scalable, end-to-end system for extracting, classifying, and interactively exploring large collections of scholarly figures using a multimodal large language model (LLM). Using OpenAlex harvesting and Playwright browser automation, we collected over 11,000 publication PDFs. Figures are isolated with pdffigures2 and a Faster R-CNN-based detector, then assigned chart types via zero-shot GPT-4o prompting. The resulting metadata populates a dual-mode exploration interface: a traditional 2D dashboard and a 3D free-exploration environment built with Observable Framework, Three.js, and D3.js.

Designed to support visualization service providers, the system enables rapid trend discovery, technique identification, and cross-domain consulting. Initial evaluations show a manual-verification accuracy of 91.2%, with the potential to reduce manual annotation efforts by over 98%. Future work will integrate Vision Transformer embeddings with t-SNE/UMAP dimensionality reduction for taxonomy-free exploration.

This work would not have been possible without the support of the ITP community, as well as Carolina Roe-Raymond (Princeton University) and Devin Richard Bayly (University of Arizona).

Technical Details

Data Acquisition & Preprocessing:

1. OpenAlex API harvesting

2. Playwright browser automation

3. pdffigures2 for figure–caption extraction

4. VisImages-Detection (Faster R‑CNN) for subfigure isolation
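The harvesting step above starts with cursor-paginated queries against the OpenAlex works endpoint, then pulls an open-access PDF link from each record. A minimal sketch in Python (the search term, page size, and field fallbacks are illustrative assumptions, not the exact production configuration):

```python
import urllib.parse

OPENALEX_BASE = "https://api.openalex.org/works"

def build_works_query(search_term, cursor="*", per_page=200):
    """Build a cursor-paginated OpenAlex /works query URL.

    Cursor paging ("cursor=*" for the first page, then the
    `meta.next_cursor` value from each response) is how OpenAlex
    exposes result sets larger than its offset-paging limit.
    """
    params = {
        "search": search_term,
        "per-page": per_page,
        "cursor": cursor,
    }
    return OPENALEX_BASE + "?" + urllib.parse.urlencode(params)

def pdf_url_of(work):
    """Best-effort PDF link for one work record (fields may be null)."""
    loc = work.get("best_oa_location") or work.get("primary_location") or {}
    return loc.get("pdf_url")
```

Records whose `pdf_url` is behind a publisher wall are the ones handed off to Playwright browser automation for download.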

Classification:

1. Zero-shot chart-type inference via OpenAI GPT‑4o (version: 2024‑08‑06)

2. Caption-based prompt engineering
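The zero-shot classification step pairs each subfigure image with its extracted caption in a single multimodal prompt and constrains the model to a closed label set. A sketch of the prompt construction and reply normalization, assuming the standard chat-completions message format; the label taxonomy shown here is illustrative, not the project's full taxonomy:

```python
import base64

# Illustrative label set; the production taxonomy is user-defined.
CHART_TYPES = ["bar chart", "line chart", "scatterplot", "heatmap",
               "node-link diagram", "map", "table", "other"]

def build_classification_messages(image_bytes, caption):
    """Chat messages for zero-shot chart-type classification.

    The subfigure goes in as a base64 data URL alongside its caption,
    and the system prompt restricts answers to one label.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    system = ("You are a visualization-type classifier. "
              "Answer with exactly one label from: " + ", ".join(CHART_TYPES))
    user = [
        {"type": "text", "text": f"Figure caption: {caption}"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

def normalize_label(reply):
    """Map a free-text model reply onto the closed label set."""
    text = reply.strip().lower()
    for label in CHART_TYPES:
        if label in text:
            return label
    return "other"
```

The messages would be sent to the `gpt-4o-2024-08-06` model named above; normalizing the reply against the label list guards against the model padding its answer with extra words.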

Interface:

1. 2D dashboards: Observable Framework

2. 3D exploration: Three.js and D3.js

3. Future work: Taxonomy-free embedding using t‑SNE/UMAP on Vision Transformer features
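For the planned taxonomy-free mode, once Vision Transformer features are reduced to 2D/3D coordinates with t-SNE or UMAP, those coordinates still need rescaling into the Three.js scene before thumbnails can be placed. A minimal sketch of that normalization step (the scene half-extent of 500 units is an assumed value, not the project's actual scene size):

```python
def normalize_to_cube(points, half_extent=500.0):
    """Rescale embedding coordinates into a centered cube
    [-half_extent, half_extent]^d so thumbnails fill the 3D scene
    regardless of the reducer's arbitrary output scale."""
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    out = []
    for p in points:
        q = []
        for d in range(dims):
            span = hi[d] - lo[d] or 1.0          # avoid divide-by-zero
            t = (p[d] - lo[d]) / span            # map to 0..1
            q.append((t * 2 - 1) * half_extent)  # map to -half..+half
        out.append(q)
    return out
```

The normalized coordinates can then be serialized to JSON and consumed directly as mesh positions in the Three.js scene.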

Research/Context

This work proposes a quantitative analysis pipeline for visualization datasets, building on bibliometric and document-analysis foundations. Using this pipeline, we built one of the largest institutional collections of visualization figures (>30k images), supporting flexible, user-defined taxonomies. Unlike prior projects such as Beagle’s web extraction [Battle et al. 2018] and VIS30K’s curated dataset [Chen et al. 2021], our approach eliminates manual annotation overhead through LLM-based zero-shot labeling. Drawing interface inspiration from Google’s t-SNE Map [Diagne et al. 2018] and Duhaime’s Three.js guide [Duhaime 2017], we designed a 3D interactive exploration environment. Comparison with VisImages [Deng et al. 2022] highlights the challenges of handling multi-label figures and motivates our subfigure isolation strategy. This work demonstrates how AI4Vis techniques can streamline visualization consulting, trend analysis, and technique discovery at scale across institutions.