VayuBench
Executable Benchmarking of LLMs for Multi-Dataset Air Quality Analytics
Overview
VayuBench is the first executable benchmark for air quality analytics: 5,000 natural-language queries paired with verified Python code across seven analytical categories, grounded in real Indian environmental data.
Key Features:
- Domain-grounded in real Indian environmental data (CPCB, NCAP, Census)
- Executable Python code with sandboxed evaluation
- Multi-dataset integration (pollution + funding + demographics); see the sketch after this list
- Systematic evaluation of 13 open-source LLMs
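To make the multi-dataset point concrete, here is a minimal sketch of how the three sources might be joined per state. Only the state, Timestamp, and PM2.5 columns are confirmed by the example query further below; treating states_data and ncap_funding_data as state-keyed tables, and the function name state_summary, are our assumptions about the schema, not the benchmark's API.

```python
import pandas as pd

def state_summary(data, states_data, ncap_funding_data):
    """Join mean PM2.5 per state with demographics and NCAP funding.

    Illustrative only: column names beyond "state", "Timestamp", and
    "PM2.5" are assumed, not the benchmark's exact schema.
    """
    # Mean PM2.5 per state, as a regular column for merging
    pm = (data.groupby("state")["PM2.5"].mean()
              .rename("mean_pm25")
              .reset_index())
    return (pm.merge(states_data, on="state", how="left")         # demographics
              .merge(ncap_funding_data, on="state", how="left"))  # funding
```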
Quick Stats
| Metric | Value |
|---|---|
| Benchmark Questions | 5,000 |
| Query Categories | 7 |
| LLMs Evaluated | 13 |
| Real-World Datasets | 3 |
| Time Period | 2017-2024 |
What is VayuBench?
Air pollution causes over 1.6 million premature deaths annually in India. Yet decision-makers face barriers in turning diverse data on air pollution, population, and funding into actionable insights.
VayuBench evaluates whether Large Language Models (LLMs) can translate natural-language questions into correct, multi-dataset Python analyses for air quality data.
Key Contributions
- Executable Benchmark: 5,000 natural language queries paired with verified Python code
- Seven Query Categories: Spatial, Temporal, Spatio-Temporal, Population, Area, Funding, Pattern-based
- Comprehensive Evaluation: exec@1 (syntactic validity) and pass@k (functional correctness) metrics; a pass@k sketch follows this list
- Reproducible Framework: Complete pipeline from question generation to evaluation
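For reference, below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021); whether VayuBench uses this exact formulation is an assumption on our part.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples, drawn from n generations of which
    c are functionally correct, passes."""
    if n - c < k:
        return 1.0  # every size-k draw contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per query, 3 pass the functional check
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```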
Example Query
Question: “Which state had the highest average PM2.5 in May 2023?”
Expected Code:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    # Keep only May 2023 readings
    filtered = data[(data['Timestamp'].dt.year == 2023) &
                    (data['Timestamp'].dt.month == 5)]
    # Mean PM2.5 per state, dropping states with no readings
    grouped = filtered.groupby("state")["PM2.5"].mean()
    grouped = grouped.dropna()
    # Highest average is last after an ascending sort
    sorted_data = grouped.sort_values()
    return sorted_data.index[-1]
```

Evaluation: The code is executed in a sandbox and its output is compared to the expected answer.
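The harness internals are not shown here; below is a minimal sketch of such a sandboxed check, assuming each candidate solution is run in a separate process and judged on its stdout. The function name evaluate_candidate, the script path, and the timeout are illustrative, not the repository's API.

```python
import subprocess

def evaluate_candidate(script_path, expected, timeout_s=10):
    """Run one candidate script in a subprocess and compare its stdout
    to the gold answer. Illustrative only: a real sandbox would also
    restrict filesystem and network access."""
    try:
        result = subprocess.run(
            ["python", script_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # runaway code counts as a failure
    if result.returncode != 0:
        return False  # crash or syntax error: no exec@1 credit
    return result.stdout.strip() == str(expected)  # functional match
```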
Top Results
Best-performing models:
| Model | Size | exec@1 | pass@1 |
|---|---|---|---|
| Qwen3-Coder-30B | 30B | 0.99 | 0.79 |
| Qwen3-32B | 32B | 0.98 | 0.78 |
| Qwen2.5-Coder-14B | 14B | 0.90 | 0.69 |
See the Results section for the full evaluation.
Finetune Your Own Model
Want to improve smaller models on VayuBench? Our finetuning guide shows how to:
- Prepare training data from benchmark questions
- Finetune models using LoRA and 4-bit quantization (a configuration sketch follows below)
- Evaluate and compare finetuned vs base models
- Achieve 45-75% improvement in pass@1 for 3B models
Complete scripts are provided for the full workflow: prepare_finetuning_data.py, finetune_model.py, compare_results.py.
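The guide's exact settings live in finetune_model.py; as a hedged illustration, here is what a LoRA + 4-bit setup typically looks like with Hugging Face transformers, peft, and bitsandbytes. The base model name and every hyperparameter below are placeholder assumptions, not the repository's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps a small model within a single consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder base model; the guide targets small (e.g. 3B) models.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# Illustrative LoRA hyperparameters, not the repository's settings.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trained
```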
Citation
```bibtex
@inproceedings{acharya2025vayubench,
  title     = {VayuBench and VayuChat: Executable Benchmarking and Deployment of LLMs for Multi-Dataset Air Quality Analytics},
  author    = {Acharya, Vedant and Pisharodi, Abhay and Pasi, Ratnesh and Mondal, Rishabh and Batra, Nipun},
  booktitle = {Proceedings of CODS 2025},
  year      = {2025}
}
```