KDD 2026

ThermEval
A Structured Benchmark for VLMs on Thermal Imagery

Can vision-language models reason about temperature? We evaluate 25 VLMs across ~55,000 thermal VQA pairs and find critical gaps in thermal understanding.

Ayush Shrivastava* Kirtan Gangani Laksh Jain Mayank Goel Nipun Batra
IIT Gandhinagar Carnegie Mellon University
* Corresponding author    Equal contribution

Thermal imagery enables critical perception tasks where RGB fails, but VLMs trained predominantly on RGB exhibit systematic errors driven by modality mismatch and language priors.

Abstract

What is ThermEval?

Vision-language models (VLMs) achieve strong performance on RGB imagery, but they do not generalize to thermal images. Thermal sensing plays a critical role in settings where visible light fails, including nighttime surveillance, search and rescue, autonomous driving, and medical screening. Unlike RGB imagery, thermal images encode physical temperature rather than color or texture, requiring perceptual and reasoning capabilities that existing RGB-centric benchmarks do not evaluate.


We introduce ThermEval-B, a structured benchmark of approximately 55,000 thermal visual question answering pairs designed to assess the foundational primitives required for thermal vision-language understanding. ThermEval-B integrates public datasets with our newly collected ThermEval-D, the first dataset to provide dense per-pixel temperature maps with semantic body-part annotations across diverse indoor and outdoor environments.


Evaluating 25 open-source and closed-source VLMs, we find that models consistently fail at temperature-grounded reasoning, degrade under colormap transformations, and default to language priors or fixed responses, with only marginal gains from prompting or supervised fine-tuning.

55K+
Thermal VQA Pairs
25
VLMs Evaluated
7
Benchmark Tasks
1000+
Thermal Images with Per-Pixel Temperature
Benchmark

ThermEval-B: Seven Progressive Tasks

The benchmark progresses from simple modality identification to complex temperature estimation, with each task probing complementary aspects of thermal vision-language understanding.

T1

Modality Identification

Can VLMs distinguish thermal images from RGB? We test binary classification using paired thermal-RGB images from FLIR and LLVIP datasets.

T2

Colormap Robustness

Do models rely on color cues or true modality understanding? We test identification under diverse colormap transformations (Magma, Viridis, Summer, Spring); a rendering sketch follows Figure 2.

T3

Human Counting

Can VLMs count people in thermal images? We evaluate a fundamental perceptual capability using road scenes with varying pedestrian counts.

T4

Colorbar Interpretation

A prerequisite for temperature reasoning: can models detect, localize, and extract temperature ranges from embedded colorbars?

T5

Thermal Reasoning

Can models reason about relative temperatures? We test comparative reasoning across individuals and within-body-part ranking tasks.

T6

Temperature Estimation

Absolute temperature estimation at three difficulty levels: coordinate-based, pixel-based (marked locations), and region-based (semantic body parts).

T7

Depth-Varying Estimation

How does imaging distance affect temperature estimation? We evaluate at 2ft, 6ft, and 10ft to assess robustness to depth variation.

Figure 2. ThermEval defines seven evaluation tasks covering modality identification (T1-T2), human counting (T3), colorbar interpretation (T4), thermal reasoning (T5), and temperature estimation (T6-T7).
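The colormap renderings probed in T2 and the embedded colorbars required for T4 can be produced with standard tooling. Below is a minimal sketch, assuming the thermal frame is already available as a NumPy array of per-pixel temperatures in °C; the exact rendering pipeline used to build ThermEval-B may differ.

```python
# Render a per-pixel temperature map under the four colormaps probed in T2
# (Magma, Viridis, Summer, Spring) and embed a colorbar as required for T4.
# The temperature array here is synthetic; in practice it would come from a
# radiometric thermal camera or the ThermEval-D temperature maps.
import numpy as np
import matplotlib.pyplot as plt


def render_with_colorbar(temp_c: np.ndarray, cmap: str, out_path: str) -> None:
    """Save `temp_c` rendered under `cmap` with a temperature colorbar."""
    fig, ax = plt.subplots(figsize=(5, 4))
    im = ax.imshow(temp_c, cmap=cmap)      # color now encodes temperature
    ax.set_axis_off()
    cbar = fig.colorbar(im, ax=ax)         # the colorbar exposes the value range
    cbar.set_label("Temperature (°C)")
    fig.savefig(out_path, bbox_inches="tight", dpi=150)
    plt.close(fig)


frame = 22.0 + 14.0 * np.random.rand(192, 256)   # placeholder frame, roughly 22-36 °C
for cmap in ["magma", "viridis", "summer", "spring"]:
    render_with_colorbar(frame, cmap, f"thermal_{cmap}.png")
```

Because the same normalized values are mapped to very different palettes, the underlying thermal signal is unchanged while the appearance varies, which is exactly the invariance T2 probes.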
Dataset

ThermEval-D: Dense Thermal Annotations

The first thermal dataset combining raw thermal imagery, per-pixel temperature maps, and diverse semantic body-part annotations across indoor and outdoor environments.

Per-Pixel Temperature

Dense temperature annotations from the TOPDON TC001 Plus camera with ±1°C accuracy and sub-40 mK sensitivity; a region-lookup sketch follows this grid.

35 Participants

Diverse demographics (age 18-47, varied body types) captured with informed consent and ethics approval.

Expert Annotations

Three annotators per image with polygonal segmentations for forehead, chest, nose, and full body.

Diverse Environments

Indoor and outdoor scenes: offices, labs, parks, and workspaces, with varied postures and activities.

High Agreement

Inter-annotator BBox IoU 0.77, Segm. IoU 0.72, BBox Dice 0.87, Segm. Dice 0.84; a computation sketch follows Figure 3.

Open Access

Released under CC BY-NC 4.0 on Kaggle with Croissant metadata for reproducible research.
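As an illustration of how the per-pixel temperature maps and body-part annotations can be consumed, for example in the region-based variant of T6, here is a minimal sketch. It assumes the map is a NumPy array of temperatures in °C and that a body-part polygon has already been rasterized to a boolean mask; the on-disk format of the released files may differ.

```python
# Summarize the temperature (in °C) over a body-part region, given a dense
# per-pixel temperature map and a boolean segmentation mask of that region.
import numpy as np


def region_temperature(temp_c: np.ndarray, region_mask: np.ndarray) -> dict:
    """Mean/max/min temperature over the pixels covered by `region_mask`."""
    values = temp_c[region_mask.astype(bool)]
    return {
        "mean_c": float(values.mean()),
        "max_c": float(values.max()),
        "min_c": float(values.min()),
    }


# Example with a synthetic map and a rectangular "forehead" mask.
temp_map = 24.0 + 12.0 * np.random.rand(192, 256)   # placeholder temperatures
forehead = np.zeros(temp_map.shape, dtype=bool)
forehead[40:60, 100:140] = True
print(region_temperature(temp_map, forehead))
```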

Figure 3. Images from the ThermEval-D dataset. Top row: single-person scenes. Bottom row: multi-person scenes. Colorbars were added programmatically during task evaluation.
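The inter-annotator agreement numbers reported above follow the standard IoU and Dice definitions; a minimal sketch over two boolean masks is shown below, with mask loading and polygon rasterization assumed to happen elsewhere.

```python
# Standard agreement metrics between two binary segmentation masks.
import numpy as np


def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter / union) if union else 1.0


def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    total = mask_a.sum() + mask_b.sum()
    return float(2.0 * inter / total) if total else 1.0
```

The same formulas apply to bounding boxes by treating each box as a rectangular mask.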
Results

Comprehensive Evaluation

We evaluate 25 VLMs spanning 0.3B to 235B parameters, including open-source, closed-source, and chart-focused models. Performance is assessed under zero-shot prompting, contextual prompting, and supervised fine-tuning.
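As an illustration of the difference between the zero-shot and contextual settings, the sketch below shows how the two prompt conditions for a T6 temperature query might be phrased. These strings are illustrative assumptions, not the exact prompts used in the evaluation.

```python
# Illustrative prompt templates for a temperature-estimation query (T6).
# The contextual variant adds explicit guidance about the thermal modality
# and the embedded colorbar; the exact wording used in ThermEval-B may differ.
PROMPTS = {
    "zero_shot": (
        "What is the temperature, in degrees Celsius, at the marked location "
        "in this image? Answer with a single number."
    ),
    "contextual": (
        "This is a thermal image in which pixel color encodes surface "
        "temperature. Use the colorbar to map colors to degrees Celsius, "
        "then report the temperature at the marked location as a single number."
    ),
}
```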

Tasks 1-4: Modality, Counting & Colorbar

ACC ↑ = higher is better. MAE ↓ = lower is better.

| Model | Params | T1 FLIR (ACC ↑) | T1 LLVIP (ACC ↑) | T2 FLIR (ACC ↑) | T2 LLVIP (ACC ↑) | T3 FLIR (MAE ↓) | T3 LLVIP (MAE ↓) | T4 Detect | T4 Position | T4 Max | T4 Min |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chart-Focused Models | | | | | | | | | | | |
| ChartGemma | 3B | 0.50 | 0.50 | 0.00 | 0.00 | 3.04 | 1.25 | 0.48 | 0.45 | 0.04 | 0.03 |
| TinyCharts | 3B | 0.50 | 0.50 | 0.00 | 0.00 | 4.72 | 2.99 | 0.50 | 0.14 | 68.44 | 24.75 |
| ChartInstruct | 7B | 0.50 | 0.50 | 0.00 | 0.01 | 4.48 | 2.36 | 0.50 | 0.25 | 162.08 | 74.37 |
| Open-Source Models | | | | | | | | | | | |
| Qwen-VL 2.5 | 8B | 0.99 | 1.00 | 0.99 | 1.00 | 3.55 | 0.89 | 1.00 | 1.00 | 0.00 | 0.00 |
| Intern-VL 3 | 8B | 1.00 | 1.00 | 1.00 | 1.00 | 3.02 | 0.64 | 1.00 | 1.00 | 9.15 | 0.82 |
| LLaMA-3.2 | 11B | 1.00 | 0.92 | 1.00 | 0.90 | 2.84 | 0.73 | 1.00 | 0.91 | 0.00 | 0.00 |
| Intern-VL 3 | 38B | 0.99 | 1.00 | 1.00 | 1.00 | 2.93 | 0.48 | 1.00 | 1.00 | 0.00 | 0.00 |
| BLIP-2 | 8B | 0.43 | 0.53 | 0.93 | 0.98 | 4.69 | 2.99 | 0.50 | 0.25 | -- | -- |
| Fine-Tuned & Baselines | | | | | | | | | | | |
| Qwen-VL 2.5 (SFT) | 8B | 1.00 | 1.00 | 1.00 | 1.00 | 1.85 | 0.55 | 1.00 | 1.00 | 0.00 | 0.00 |
| Human | -- | 0.97 | 0.98 | 0.98 | 0.99 | 1.73 | 0.30 | 1.00 | 1.00 | 0.00 | 0.00 |

Tasks 5-7: Thermal Reasoning & Temperature Estimation

ACC ↑ = higher is better. MAE ↓ = lower is better (in °C).

| Model | Params | T5 Double (ACC ↑) | T5 Single (ACC ↑) | T6 Coords (MAE ↓) | T6 Marker (MAE ↓) | T6 Region (MAE ↓) | T7 2ft (MAE ↓) | T7 6ft (MAE ↓) | T7 10ft (MAE ↓) |
|---|---|---|---|---|---|---|---|---|---|
| Open-Source Models | | | | | | | | | |
| Phi-3 | 4B | 0.56 | 0.22 | 3.49 | 3.92 | 2.26 | 1.19 | 1.14 | 1.25 |
| Qwen-VL 2 | 8B | 0.41 | 0.51 | 2.14 | 3.68 | 2.01 | 1.25 | 1.00 | 0.89 |
| Qwen-VL 2.5 | 8B | 0.44 | 0.32 | 3.21 | 2.88 | 2.14 | 1.26 | 0.93 | 0.87 |
| LLaMA-3.2 | 11B | 0.61 | 0.42 | 3.00 | 3.99 | 3.03 | 2.39 | 1.74 | 1.48 |
| Intern-VL 3 | 38B | 0.48 | 0.41 | 2.97 | 3.63 | 1.51 | 0.89 | 0.97 | 0.98 |
| Qwen A22 | 235B | 0.54 | 0.34 | 3.59 | 3.72 | 3.59 | 1.23 | 1.23 | 1.40 |
| Closed-Source Models | | | | | | | | | |
| Gemini 3 Pro | ? | 0.74 | 0.61 | 1.94 | 1.86 | 1.47 | 1.00 | 0.74 | 0.90 |
| Gemini 3 Flash | ? | 0.55 | 0.51 | 1.96 | 2.04 | 1.65 | 1.19 | 1.00 | 1.21 |
| Fine-Tuned & Baselines | | | | | | | | | |
| Qwen-VL 2.5 (SFT) | 7B | 0.58 | 0.56 | 1.58 | 1.55 | 1.03 | 0.53 | 0.49 | 0.61 |
| Human | -- | 0.84 | 0.54 | -- | 2.73 | 2.04 | 1.23 | 1.20 | 1.22 |
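For reference, the two scoring rules behind the tables above are plain accuracy (ACC, higher is better) and mean absolute error (MAE, lower is better; reported in °C for the estimation tasks). The sketch below assumes the free-form VLM outputs have already been parsed into labels or numbers.

```python
# ACC: fraction of exact matches. MAE: mean absolute error over parsed values.
import numpy as np


def accuracy(preds, labels) -> float:
    return float(np.mean(np.array(preds) == np.array(labels)))


def mae(preds, targets) -> float:
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(np.abs(preds - targets)))


print(accuracy(["thermal", "rgb", "thermal"], ["thermal", "thermal", "thermal"]))  # ~0.67
print(mae([36.8, 35.0], [34.2, 35.5]))                                             # 1.55
```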
Key Findings

What We Discovered

Language Priors Over Thermal Cues

Models default to canonical values like 36.8°C (body temperature) or fixed outputs like 0°C, 273 K, or "11" regardless of the actual thermal scene, indicating reliance on language priors rather than visual grounding.

Colormap Vulnerability

While VLMs handle basic modality identification well, complex colormaps (Summer, Spring) cause significant performance drops, suggesting models rely on low-level color statistics rather than modality-invariant representations.

Scale Doesn't Solve It

Failure modes persist across model scales from 0.3B to 235B parameters. Intern-VL 3 (38B) lags behind its 8B variant on some tasks. The bottleneck is cross-modal grounding, not model capacity.

Cascading Task Failures

Models that fail at colorbar interpretation (T4) consistently underperform on downstream temperature tasks (T6-T7), demonstrating that errors on prerequisite tasks predict failures on complex thermal reasoning.

Fine-Tuning Helps, But Not Enough

Supervised fine-tuning of Qwen-VL 2.5 brings near-human performance on most tasks, but 1-2°C errors remain in temperature estimation -- insufficient for safety-critical applications like fever screening.

Prompting Has Limits

Contextual prompts improve basic modality recognition (+37% on T2) but yield inconsistent or negative effects on thermal reasoning (T5-T7), showing that prompt engineering cannot compensate for missing thermal grounding.

Figure 4. Thermal images rendered under different colormap transformations. While the underlying thermal signal is unchanged, visual appearance varies dramatically, confusing many VLMs.
Citation

BibTeX

If you find ThermEval useful in your research, please cite our paper.

@inproceedings{shrivastava2026thermeval,
  title     = {ThermEval: A Structured Benchmark for Evaluation of
               Vision-Language Models on Thermal Imagery},
  author    = {Shrivastava, Ayush and Gangani, Kirtan and Jain, Laksh
               and Goel, Mayank and Batra, Nipun},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on
               Knowledge Discovery and Data Mining (KDD)},
  year      = {2026}
}
Acknowledgments

We thank Google for the Gemini Academic Program Award, which enabled us to run and evaluate the Gemini models reported in this work. The study was approved by the Institutional Ethics Committee (IEC) at IIT Gandhinagar.