VayuBench
Executable Benchmarking of LLMs for Multi-Dataset Air Quality Analytics
Overview
VayuBench is the first executable benchmark for air quality analytics: 5,000 natural-language queries paired with verified Python code across seven analytical categories, grounded in real Indian environmental data.
Key Features:
- Domain-grounded in real Indian environmental data (CPCB, NCAP, Census)
- Executable Python code with sandboxed evaluation
- Multi-dataset integration (pollution + funding + demographics); see the sketch after this list
- Systematic evaluation of 13 open-source LLMs
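To make the multi-dataset point concrete, here is a minimal sketch of how the three sources might be joined per state. Only the state, Timestamp, and PM2.5 columns are confirmed by the example query further below; treating states_data and ncap_funding_data as state-keyed tables, and the function name state_summary, are our assumptions about the schema, not the benchmark's API.

```python
import pandas as pd

def state_summary(data, states_data, ncap_funding_data):
    """Join mean PM2.5 per state with demographics and NCAP funding.

    Illustrative only: column names beyond "state", "Timestamp", and
    "PM2.5" are assumed, not the benchmark's exact schema.
    """
    # Mean PM2.5 per state, as a regular column for merging
    pm = (data.groupby("state")["PM2.5"].mean()
              .rename("mean_pm25")
              .reset_index())
    return (pm.merge(states_data, on="state", how="left")         # demographics
              .merge(ncap_funding_data, on="state", how="left"))  # funding
```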
Quick Stats
| Metric | Value |
|---|---|
| Benchmark Questions | 5,000 |
| Query Categories | 7 |
| LLMs Evaluated | 13 |
| Real-World Datasets | 3 |
| Time Period | 2017-2024 |
What is VayuBench?
Air pollution causes over 1.6 million premature deaths annually in India. Yet decision-makers face barriers in turning diverse data on air pollution, population, and funding into actionable insights.
VayuBench evaluates whether Large Language Models (LLMs) can translate natural-language questions into correct, multi-dataset Python analyses for air quality data.
Key Contributions
- Executable Benchmark: 5,000 natural language queries paired with verified Python code
- Seven Query Categories: Spatial, Temporal, Spatio-Temporal, Population, Area, Funding, Pattern-based
- Comprehensive Evaluation: exec@1 (syntactic validity) and pass@k (functional correctness) metrics; a pass@k sketch follows this list
- Reproducible Framework: Complete pipeline from question generation to evaluation
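For reference, below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021); whether VayuBench uses this exact formulation is an assumption on our part.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn from n generations of which
    c are functionally correct, passes."""
    if n - c < k:
        return 1.0  # every size-k draw contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per query, 3 pass the functional check
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```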
Example Query
Question: “Which state had the highest average PM2.5 in May 2023?”
Expected Code:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    # Keep only May 2023 readings
    filtered = data[(data['Timestamp'].dt.year == 2023) &
                    (data['Timestamp'].dt.month == 5)]
    # Mean PM2.5 per state, dropping states with no readings
    grouped = filtered.groupby("state")["PM2.5"].mean()
    grouped = grouped.dropna()
    # Highest average is last after an ascending sort
    sorted_data = grouped.sort_values()
    return sorted_data.index[-1]
```

Evaluation: The code is executed in a sandbox and its output is compared to the expected answer.
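The harness internals are not shown here; below is a minimal sketch of such a sandboxed check, assuming each candidate solution is run in a separate process and judged on its stdout. The function name evaluate_candidate, the script path, and the timeout are illustrative, not the repository's API.

```python
import subprocess

def evaluate_candidate(script_path, expected, timeout_s=10):
    """Run one candidate script in a subprocess and compare its stdout
    to the gold answer. Illustrative only: a real sandbox would also
    restrict filesystem and network access."""
    try:
        result = subprocess.run(
            ["python", script_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # runaway code counts as a failure
    if result.returncode != 0:
        return False  # crash or syntax error: no exec@1 credit
    return result.stdout.strip() == str(expected)  # functional match
```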
Top Results
Best-performing models:
| Model | Size | exec@1 | pass@1 |
|---|---|---|---|
| Qwen3-Coder-30B | 30B | 0.99 | 0.79 |
| Qwen3-32B | 32B | 0.98 | 0.78 |
| Qwen2.5-Coder-14B | 14B | 0.90 | 0.69 |
See the Results section for the full evaluation.
Finetune Your Own Model
Want to improve smaller models on VayuBench? Our finetuning guide shows how to:
- Prepare training data from benchmark questions
- Finetune models using LoRA and 4-bit quantization (a configuration sketch follows below)
- Evaluate and compare finetuned vs base models
- Achieve 45-75% improvement in pass@1 for 3B models
Complete scripts are provided for the full workflow: prepare_finetuning_data.py, finetune_model.py, compare_results.py.
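The guide's exact settings live in finetune_model.py; as a hedged illustration, here is what a LoRA + 4-bit setup typically looks like with Hugging Face transformers, peft, and bitsandbytes. The base model name and every hyperparameter below are placeholder assumptions, not the repository's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps a small model within a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder base model; the guide targets small (e.g. 3B) models.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Illustrative LoRA hyperparameters, not the repository's settings.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trained
```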
Citation
```bibtex
@inproceedings{acharya2025vayubench,
  title     = {VayuBench and VayuChat: Executable Benchmarking and Deployment of LLMs for Multi-Dataset Air Quality Analytics},
  author    = {Acharya, Vedant and Pisharodi, Abhay and Pasi, Ratnesh and Mondal, Rishabh and Batra, Nipun},
  booktitle = {Proceedings of CODS 2025},
  year      = {2025}
}
```