Human vs. AI Perceptual Alignment
An investigation into whether Vision-Language Models perceive scientific visualizations with the same nuance as human experts. The research evaluates 13 state-of-the-art models on a curated set of images, measuring their alignment with expert judgments on visual purpose and encoding patterns, providing a quantitative view of the gap between machine and human perception.
The increasing use of AI to interpret visual data rests on a fundamental assumption: that these models perceive charts and figures in a way that aligns with human expertise. But is this assumption valid? This research investigates that question, exploring whether the emergent perceptual abilities of Vision-Language Models are consistent with the nuanced judgments of human experts.
To explore this, we designed a systematic evaluation comparing the classifications of 13 state-of-the-art models against a ground truth of expert annotations on a curated set of scientific visualizations. The focus was on pure visual categorization: assessing a model's ability to identify a visualization's purpose, encoding, and dimensionality without any textual context. The engineering behind the study was designed for rigor and reproducibility, built on a multi-provider setup using the tools listed in the stack below.
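To make the setup concrete, here is a minimal sketch of what a single zero-shot categorization call could look like, assuming the OpenAI Python SDK and a hypothetical label schema. The actual prompt wording, taxonomy, and orchestration layer are the study's own and are not reproduced here.

```python
import base64
import json
from pathlib import Path

from openai import OpenAI  # one of several providers evaluated; shown only as an example

# Hypothetical label sets -- the paper defines its own taxonomy.
PURPOSES = ["schematic representation", "GUI", "visualization"]
ENCODINGS = ["bar", "line", "scatter", "surface/volume", "node-link", "other"]

client = OpenAI()

def categorize(image_path: str, model: str = "gpt-4.1-mini") -> dict:
    """Zero-shot categorization of a single figure, with no accompanying text."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # structured output for downstream scoring
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this scientific figure. Return JSON with keys "
                         f"'purpose' (one of {PURPOSES}), 'encodings' (subset of {ENCODINGS}), "
                         "and 'dimensionality' ('2D' or '3D')."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```

The same request shape can be pointed at each provider's API, which is what makes a like-for-like comparison across the 13 models practical.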
The goal of this work is not to crown a superior model, but to provide a measured, quantitative view of the alignment gap between human and machine perception. The results offer a fine-grained analysis of where current models succeed and, more critically, where they diverge from human consensus, highlighting specific weaknesses in interpreting complex visual encodings. This research, accepted at IEEE VIS 2025, contributes to a more grounded understanding of the capabilities and limitations of AI in the critical domain of visual data analysis.
Stack
While the problem is more important than the tools, the tech stack tells a story about the project's architecture and trade-offs. Here's what this project is built on:
Frontend & Visualization
Generates publication-quality charts showing model performance across difficulty levels and encoding types (see the sketch after this list)
Creates interactive confusion matrices and multi-label classification visualizations for research analysis
Used for designing and assembling the research poster and visual assets for publication and presentation
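The charting library is not named above, so the following is a minimal matplotlib sketch (an assumption) of the kind of per-model performance chart described; the model names, difficulty levels, and accuracy values are placeholders, not results from the paper.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data -- real values come from the evaluation pipeline.
models = ["gpt-4.1", "gemini-2.5-pro", "pixtral-large", "llama-4-maverick"]
difficulty = ["easy", "medium", "hard"]
accuracy = np.array([
    [0.91, 0.78, 0.55],
    [0.89, 0.74, 0.52],
    [0.85, 0.70, 0.47],
    [0.83, 0.66, 0.41],
])

fig, ax = plt.subplots(figsize=(7, 4))
x = np.arange(len(models))
width = 0.25
for i, level in enumerate(difficulty):
    ax.bar(x + i * width, accuracy[:, i], width, label=level)

ax.set_xticks(x + width)
ax.set_xticklabels(models, rotation=15, ha="right")
ax.set_ylabel("Agreement with expert labels")
ax.set_ylim(0, 1)
ax.legend(title="Image difficulty")
fig.tight_layout()
fig.savefig("model_performance.png", dpi=300)  # publication-quality raster export
```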
AI & Machine Learning
Orchestrates multi-provider AI model integration and structured output generation for visualization categorization tasks
Provides GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, and O4-mini models for zero-shot visualization categorization evaluation
Provides Gemini 2.0-flash, Gemini 2.5-flash, and Gemini 2.5-pro models for zero-shot visualization categorization evaluation
Provides Mistral-medium-3, Mistral-small-3.1, and Pixtral-large models for zero-shot visualization categorization evaluation
Provides Llama-4-scout and Llama-4-maverick models for zero-shot visualization categorization evaluation
Provides Qwen2.5-vl-32b-instruct model for zero-shot visualization categorization evaluation
Aggregates access to multiple AI providers (Mistral, Meta, Qwen) through a unified API for model comparison
Used for multi-label classification metrics and confusion matrix generation in model performance evaluation
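The metrics library is not named here either; assuming scikit-learn, a common choice for these metrics, scoring multi-label encoding predictions against expert labels could look like the sketch below. The label space and the two small annotation lists are purely illustrative.

```python
from sklearn.metrics import f1_score, multilabel_confusion_matrix
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative label space and annotations -- the real taxonomy, expert labels,
# and model outputs come from the study's data.
ENCODINGS = ["bar", "line", "scatter", "node-link", "surface/volume"]
expert = [["bar"], ["line", "scatter"], ["node-link"]]
model = [["bar"], ["line"], ["surface/volume"]]

mlb = MultiLabelBinarizer(classes=ENCODINGS)
y_true = mlb.fit_transform(expert)
y_pred = mlb.transform(model)

# One 2x2 confusion matrix per encoding type, plus a macro-averaged F1.
per_label_cm = multilabel_confusion_matrix(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

for label, cm in zip(ENCODINGS, per_label_cm):
    print(label, cm.ravel())  # tn, fp, fn, tp counts for this encoding
print(f"macro F1: {macro_f1:.2f}")
```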
Data Engineering
Processes VIS30K dataset with 6,803 images using lazy evaluation for stratified sampling and performance analysis
Enables efficient columnar data processing and Parquet serialization for large-scale visualization dataset handling
Stores structured analysis results from 13 AI models in partitioned format for efficient querying and caching
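The dataframe tooling is likewise unnamed; given the mention of lazy evaluation and Parquet, a Polars-flavoured sketch of stratified sampling over the image metadata might look like the following. The file paths, column names, and per-stratum count are invented for illustration.

```python
import polars as pl

# Hypothetical metadata file and column names for the VIS30K images.
lf = pl.scan_parquet("data/vis30k_metadata.parquet")  # lazy scan, nothing loaded yet

per_stratum = 50  # illustrative sample size per category

sample = (
    lf.with_columns(
        # Random rank within each expert-labelled category (the stratum).
        pl.int_range(pl.len()).shuffle(seed=42).over("category").alias("rank")
    )
    .filter(pl.col("rank") < per_stratum)
    .select("image_id", "category", "image_path")
    .collect()  # the query plan is optimised and executed only here
)

# Cache the curated sample for the evaluation runs.
sample.write_parquet("cache/stratified_sample.parquet")
```

Keeping the sampled subset and the per-model results in Parquet makes re-running the analysis cheap, since cached model outputs can be joined back to the image metadata without touching the providers again.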
Development Tooling
Used to build interactive notebooks for VLM inference and evaluation with real-time visualization updates
Manages Python dependencies and virtual environment for reproducible research environment setup
Enforces code quality and formatting standards across research codebase for maintainable analysis scripts