Human vs. AI Perceptual Alignment
An investigation into whether Vision-Language Models perceive scientific visualizations with the same nuance as human experts. The research evaluates 13 state-of-the-art models on a curated set of images, measuring their alignment with expert judgments on visual purpose and encoding patterns, providing a quantitative view of the gap between machine and human perception.
The increasing use of AI to interpret visual data rests on a fundamental assumption: that these models perceive charts and figures in a way that aligns with human expertise. But is this assumption valid? This research investigates that question, exploring whether the emergent perceptual abilities of Vision-Language Models are consistent with the nuanced judgments of human experts.
To explore this, we designed a systematic evaluation comparing the classifications of 13 state-of-the-art models against a ground truth of expert annotations on a curated set of scientific visualizations. The focus was on pure visual categorization: assessing a model's ability to identify a visualization's purpose, encoding, and dimensionality without any textual context. The engineering behind the study was designed for rigor and reproducibility, built on a multi-provider setup using the tools listed in the stack below.
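To make the setup concrete, here is a minimal sketch of what a single zero-shot categorization call could look like, assuming the OpenAI Python SDK and a hypothetical label schema. The actual prompt wording, taxonomy, and orchestration layer are the study's own and are not reproduced here.

```python
import base64
import json
from pathlib import Path

from openai import OpenAI  # one of several providers evaluated; shown only as an example

# Hypothetical label sets -- the paper defines its own taxonomy.
PURPOSES = ["schematic representation", "GUI", "visualization"]
ENCODINGS = ["bar", "line", "scatter", "surface/volume", "node-link", "other"]

client = OpenAI()

def categorize(image_path: str, model: str = "gpt-4.1-mini") -> dict:
    """Zero-shot categorization of a single figure, with no accompanying text."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # structured output for downstream scoring
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this scientific figure. Return JSON with keys "
                         f"'purpose' (one of {PURPOSES}), 'encodings' (subset of {ENCODINGS}), "
                         "and 'dimensionality' ('2D' or '3D')."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```

The same request shape can be pointed at each provider's API, which is what makes a like-for-like comparison across the 13 models practical.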
The goal of this work is not to crown a superior model, but to provide a measured, quantitative view of the alignment gap between human and machine perception. The results offer a fine-grained analysis of where current models succeed and, more critically, where they diverge from human consensus, highlighting specific weaknesses in interpreting complex visual encodings. This research, accepted at IEEE VIS 2025, contributes to a more grounded understanding of the capabilities and limitations of AI in the critical domain of visual data analysis.
Stack
While the problem is more important than the tools, the tech stack tells a story about the project's architecture and trade-offs. Here's what this project is built on:
Frontend & Visualization
Generates publication-quality charts showing model performance across difficulty levels and encoding types (see the sketch after this list)
Creates interactive confusion matrices and multi-label classification visualizations for research analysis
Used for designing and assembling the research poster and visual assets for publication and presentation
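The charting library is not named above, so the following is a minimal matplotlib sketch (an assumption) of the kind of per-model performance chart described; the model names, difficulty levels, and accuracy values are placeholders, not results from the paper.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data -- real values come from the evaluation pipeline.
models = ["gpt-4.1", "gemini-2.5-pro", "pixtral-large", "llama-4-maverick"]
difficulty = ["easy", "medium", "hard"]
accuracy = np.array([
    [0.91, 0.78, 0.55],
    [0.89, 0.74, 0.52],
    [0.85, 0.70, 0.47],
    [0.83, 0.66, 0.41],
])

fig, ax = plt.subplots(figsize=(7, 4))
x = np.arange(len(models))
width = 0.25
for i, level in enumerate(difficulty):
    ax.bar(x + i * width, accuracy[:, i], width, label=level)

ax.set_xticks(x + width)
ax.set_xticklabels(models, rotation=15, ha="right")
ax.set_ylabel("Agreement with expert labels")
ax.set_ylim(0, 1)
ax.legend(title="Image difficulty")
fig.tight_layout()
fig.savefig("model_performance.png", dpi=300)  # publication-quality raster export
```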
AI & Machine Learning
Orchestrates multi-provider AI model integration and structured output generation for visualization categorization tasks
Provides GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, and O4-mini models for zero-shot visualization categorization evaluation
Provides Gemini 2.0-flash, Gemini 2.5-flash, and Gemini 2.5-pro models for zero-shot visualization categorization evaluation
Provides Mistral-medium-3, Mistral-small-3.1, and Pixtral-large models for zero-shot visualization categorization evaluation
Provides Llama-4-scout and Llama-4-maverick models for zero-shot visualization categorization evaluation
Provides Qwen2.5-vl-32b-instruct model for zero-shot visualization categorization evaluation
Aggregates access to multiple AI providers (Mistral, Meta, Qwen) through a unified API for model comparison
Used for multi-label classification metrics and confusion matrix generation in model performance evaluation
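The metrics library is not named here either; assuming scikit-learn, a common choice for these metrics, scoring multi-label encoding predictions against expert labels could look like the sketch below. The label space and the two small annotation lists are purely illustrative.

```python
from sklearn.metrics import f1_score, multilabel_confusion_matrix
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative label space and annotations -- the real taxonomy, expert labels,
# and model outputs come from the study's data.
ENCODINGS = ["bar", "line", "scatter", "node-link", "surface/volume"]
expert = [["bar"], ["line", "scatter"], ["node-link"]]
model = [["bar"], ["line"], ["surface/volume"]]

mlb = MultiLabelBinarizer(classes=ENCODINGS)
y_true = mlb.fit_transform(expert)
y_pred = mlb.transform(model)

# One 2x2 confusion matrix per encoding type, plus a macro-averaged F1.
per_label_cm = multilabel_confusion_matrix(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

for label, cm in zip(ENCODINGS, per_label_cm):
    print(label, cm.ravel())  # tn, fp, fn, tp counts for this encoding
print(f"macro F1: {macro_f1:.2f}")
```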
Data Engineering
Processes VIS30K dataset with 6,803 images using lazy evaluation for stratified sampling and performance analysis
Enables efficient columnar data processing and Parquet serialization for large-scale visualization dataset handling
Stores structured analysis results from 13 AI models in partitioned format for efficient querying and caching
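The dataframe tooling is likewise unnamed; given the mention of lazy evaluation and Parquet, a Polars-flavoured sketch of stratified sampling over the image metadata might look like the following. The file paths, column names, and per-stratum count are invented for illustration.

```python
import polars as pl

# Hypothetical metadata file and column names for the VIS30K images.
lf = pl.scan_parquet("data/vis30k_metadata.parquet")  # lazy scan, nothing loaded yet

per_stratum = 50  # illustrative sample size per category

sample = (
    lf.with_columns(
        # Random rank within each expert-labelled category (the stratum).
        pl.int_range(pl.len()).shuffle(seed=42).over("category").alias("rank")
    )
    .filter(pl.col("rank") < per_stratum)
    .select("image_id", "category", "image_path")
    .collect()  # the query plan is optimised and executed only here
)

# Cache the curated sample for the evaluation runs.
sample.write_parquet("cache/stratified_sample.parquet")
```

Keeping the sampled subset and the per-model results in Parquet makes re-running the analysis cheap, since cached model outputs can be joined back to the image metadata without touching the providers again.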
Development Tooling
Used to build interactive notebooks for VLM inference and evaluation with real-time visualization updates
Manages Python dependencies and virtual environment for reproducible research environment setup
Enforces code quality and formatting standards across research codebase for maintainable analysis scripts