Research Paper
VayuBench and VayuChat: Executable Benchmarking and Deployment of LLMs for Multi-Dataset Air Quality Analytics
Publication
Conference: 13th ACM International Conference on Data Science (CODS 2025)
Location: IISER, Pune, India
Date: December 17-20, 2025
Abstract
Air pollution causes over 1.6 million premature deaths annually in India. Yet, decision-makers face persistent barriers in turning diverse tabular data on air pollution, population, and funding into actionable insights. Existing tools demand technical expertise, offer shallow visualizations, or rely on static dashboards, leaving policy questions unresolved.
Large language models (LLMs) offer a potential alternative by translating natural-language questions into structured, multi-dataset analyses; however, their reliability for such domain-specific tasks remains unknown.
We present VayuBench, to our knowledge, the first executable benchmark for air-quality analytics. It comprises 5,000 natural-language queries paired with verified Python code across seven query categories: spatial, temporal, spatio-temporal, population-based, area-based, funding-related, and specific-pattern queries over multiple real-world datasets.
We evaluate 13 open-source LLMs under a unified, schema-aware protocol. While Qwen3-Coder-30B attains the strongest performance, frequent column-name and variable errors highlight risks for smaller models.
To bridge evaluation with practice, we deploy VayuChat, an interactive assistant that delivers real-time, code-backed analysis for Indian policymakers and citizens. Together, VayuBench and VayuChat demonstrate a reproducible pathway from benchmark to verified execution to deployment, establishing the foundations for trustworthy LLM-driven decision support in environmental monitoring.
Key Contributions
Executable Benchmark
First domain-specific benchmark for air quality analytics, pairing 5,000 natural language queries with verified Python code over multiple real-world datasets and systematically defined complex query categories.
Systematic LLM Evaluation
Unified schema-aware prompting protocol with machine-verifiable metrics (exec@1, pass@k), revealing significant capability gaps across 13 models.
System Deployment
VayuChat, an interactive chatbot demonstrating how VayuBench can translate into accessible, trustworthy decision support for air quality policy and analysis.
Research Context
The Problem
Air pollution is a severe public health crisis in India, contributing to more than 1.6 million premature deaths annually. Exposure to fine particulate matter (PM2.5) reduces average life expectancy in India by over five years.
Yet, monitoring is only the first step. Turning raw readings into timely, actionable insights for policy remains an unsolved problem:
- Simple questions are hard: “How did PM2.5 levels change in Delhi last year?” requires technical expertise
- Complex analysis is inaccessible: “Which cities reduced PM2.5 most relative to their NCAP funding?” requires integrating multiple datasets
- Existing tools fall short: They demand significant technical skills (R, SQL, Python) or offer only limited visualizations
The Opportunity
Large Language Models (LLMs) have emerged as powerful tools for translating natural language into executable analyses. In principle, an LLM could:
- Identify the right datasets
- Select appropriate statistical operations
- Generate correct Python code to produce the answer
But their performance on domain-specific, multi-dataset environmental analytics remains unexplored.
Our Approach
VayuBench provides:
- Domain grounding: Built from real Indian environmental data (CPCB, NCAP, Census)
- Executable evaluation: Verified Python code with sandboxed execution
- Multi-dataset integration: Queries require joining pollution, funding, and demographic data (a code sketch follows this list)
- Systematic coverage: Seven query categories reflecting real policy needs
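To make the multi-dataset integration concrete, the sketch below shows the kind of pandas program a model must produce for a funding-normalized question. The file names and columns (date, city, pm25, ncap_funding_cr) are hypothetical stand-ins, not the benchmark's actual schema.

```python
# Hypothetical sketch of a funding-normalized query; file names and
# columns are assumed for illustration, not VayuBench's real schema.
import pandas as pd

pollution = pd.read_csv("cpcb_pm25.csv").sort_values("date")
funding = pd.read_csv("ncap_funding.csv")  # one row per funded city

# "Which cities reduced PM2.5 most relative to their NCAP funding?"
change = (
    pollution.groupby("city")["pm25"]
    .agg(lambda s: s.iloc[-1] - s.iloc[0])  # last minus first reading
    .rename("pm25_change")
    .reset_index()
)
merged = change.merge(funding, on="city")
merged["reduction_per_crore"] = -merged["pm25_change"] / merged["ncap_funding_cr"]
print(merged.nlargest(5, "reduction_per_crore"))
```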
Methodology
Stakeholder-Driven Design
We engaged air-quality scientists, academic researchers, and policy think tanks in semi-structured interviews to identify three essential priorities:
- Transparency: Analytical steps must be verifiable
- Reproducibility: Outputs must be consistent and checkable
- Accessibility: Non-technical stakeholders must be able to use the system
These priorities shaped VayuBench’s design:
- Stakeholder Coverage (G1): Query types reflect genuine decision-making needs
- Executable Reliability (G2): Test both syntactic and functional correctness
- Extensible Accessibility (G3): Framework supports new datasets and questions
Benchmark Construction Pipeline
- Seed Question Collection: 65 base questions grounded in stakeholder needs
- Template Design: Handcrafted templates with placeholders for valid method-column pairs (expansion sketched after this list)
- Systematic Expansion: ~26K candidate queries generated, sampled to 5K for diversity
- Paraphrasing: LLM paraphrasing + human verification for linguistic variety
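A minimal sketch of the expansion step follows; the template text, placeholder slots, and value lists are hypothetical, chosen only to show how a handful of templates fan out into a large candidate pool that is then sampled down.

```python
# Hypothetical template expansion; slots and values are illustrative,
# not the actual seed questions or schema of VayuBench.
from itertools import product

template = "What was the {stat} {pollutant} level in {city} during {year}?"
slots = {
    "stat": ["mean", "maximum", "minimum"],
    "pollutant": ["PM2.5", "PM10", "NO2"],
    "city": ["Delhi", "Mumbai", "Pune"],
    "year": ["2021", "2022", "2023"],
}

# Cartesian expansion yields the candidate pool (later sampled for diversity).
candidates = [
    template.format(**dict(zip(slots, combo)))
    for combo in product(*slots.values())
]
print(len(candidates), candidates[0])  # 81 candidates from one template
```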
Evaluation Protocol
Sampling: For each query, we sample n = 5 completions using temperature sampling (temperature = 0.8, top-p = 0.95).
Metrics:
- exec@1: Proportion of samples that execute without error
- pass@k: Fraction of queries for which at least one of k sampled completions passes all tests (estimator sketched after this list)
- Error categorization: Syntax, Column, Name, Other
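For reference, pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021) over the n = 5 samples drawn per query; the sketch below assumes that standard formulation, though the paper's exact implementation may differ.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021); a sketch of
# the conventional computation, assumed rather than taken from the paper.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples drawn, c: samples passing all tests, k: evaluation budget."""
    if n - c < k:
        return 1.0  # too few failures for all k draws to fail
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=5, c=2, k=1))  # 0.4; averaged over queries to report pass@1
```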
Execution: Sandboxed subprocess execution with a 15-second timeout (sketched below).
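A minimal sketch of such a sandboxed runner, assuming a plain subprocess with a 15-second timeout and illustrative stderr-keyword bucketing into the four error categories; a production harness would also restrict filesystem and network access.

```python
# Sketch of sandboxed execution; the error-bucketing keywords are
# illustrative assumptions, not the paper's exact categorization rules.
import subprocess
import sys

def run_candidate(code: str, timeout: float = 15.0) -> str:
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "ok"                # counts toward exec@1
    err = proc.stderr
    if "SyntaxError" in err:
        return "syntax"
    if "KeyError" in err:          # typically a wrong column name
        return "column"
    if "NameError" in err:         # undefined variable
        return "name"
    return "other"
```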
Results
Model Performance
- Best Performance: Qwen3-Coder-30B achieves 0.99 exec@1 and 0.79 pass@1
- Scaling Effect: Approximately linear relationship between parameter count and performance
- Specialization Advantage: Code-specialized models outperform general LLMs of similar size
Error Analysis
- Column errors dominate: Nearly 50% of failures across models
- Schema alignment: Primary bottleneck, even for top performers
- Syntax errors rare: Only in very small models (< 3B parameters)
Category-Specific Insights
- Easiest: Spatial Aggregation (SA) — Simple groupby operations (sketched after this list)
- Hardest: Funding-Based (FQ) — Multi-dataset joins required
- Most variable: Spatio-Temporal (STA) — High variance across models
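For contrast with the funding example earlier, a Spatial Aggregation query typically reduces to a single groupby over one table, as in this sketch (the columns state and pm25 are again schema assumptions):

```python
# Sketch of the "easy" Spatial Aggregation pattern; columns are assumed.
import pandas as pd

df = pd.read_csv("cpcb_pm25.csv")
# e.g. "Which state has the highest average PM2.5?"
state_means = df.groupby("state")["pm25"].mean()  # one mean per state
print(state_means.idxmax(), round(state_means.max(), 1))
```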
Impact & Applications
VayuBench enables:
- Researchers: Evaluate LLMs on domain-specific analytics tasks
- Model Developers: Identify weaknesses (e.g., schema alignment) to improve training
- Policymakers: Access air quality insights through natural language (via VayuChat)
- Environmental Scientists: Build trustworthy AI systems for decision support
Future Work
Extensions
- Additional Data Sources: OpenAQ, WHO Global AQ Database, meteorological data
- Multimodal Integration: Sentinel-5P satellite imagery, NCAP policy documents
- Causal Analysis: Link interventions to outcomes with domain-adaptive models
Methodological Improvements
- Domain-Adaptive Finetuning: Improve schema alignment and reduce column errors
- Retrieval-Augmented Generation: Curated corpora of environmental analyses
- Self-Consistency Prompting: Multiple samples with majority voting
Evaluation Enhancements
- Stronger NLU: Better understanding of complex policy questions
- Runtime Checks: Automated validation using data profiling tools
- Fairness Metrics: Assess equity in pollution exposure and monitoring coverage
Citation
@inproceedings{acharya2025vayubench,
title={VayuBench and VayuChat: Executable Benchmarking and Deployment of LLMs
for Multi-Dataset Air Quality Analytics},
author={Acharya, Vedant and Pisharodi, Abhay and Pasi, Ratnesh and
Mondal, Rishabh and Batra, Nipun},
booktitle={Proceedings of the 13th ACM International Conference on Data Science (CODS 2025)},
year={2025},
organization={ACM}
}
Contact
For questions, collaborations, or feedback:
Nipun Batra, Associate Professor, Indian Institute of Technology Gandhinagar
Email: nipun.batra@iitgn.ac.in
Sustainability Lab website: https://nipunbatra.github.io/
Acknowledgments
We thank the air quality scientists, policy analysts, and environmental data practitioners who participated in our stakeholder consultations. Their insights were instrumental in shaping the benchmark categories and evaluation criteria.
This work was supported by IIT Gandhinagar and represents a collaborative effort between computer science, environmental science, and public policy communities.