VayuBench and VayuChat: Executable Benchmarking and Deployment of LLMs for Air Quality Analytics

Abstract

Air pollution causes over 1.6 million premature deaths annually in India. Yet, decision-makers face persistent barriers in turning diverse tabular data on air pollution, population, and funding into actionable insights. Existing tools demand technical expertise, offer shallow visualizations, or rely on static dashboards, leaving policy questions unresolved.

Large language models (LLMs) offer a potential alternative by translating natural-language questions into structured, multi-dataset analyses; however, their reliability for such domain-specific tasks remains unknown.

We present VayuBench, to our knowledge the first executable benchmark for air-quality analytics. It comprises 5,000 natural-language queries, each paired with verified Python code, across seven query categories (spatial, temporal, spatio-temporal, population-based, area-based, funding-related, and specific-pattern), all posed over multiple real-world datasets.
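To make the idea of an executable benchmark concrete, the following is a minimal sketch of how one query-code pair might be represented and checked by execution. The `BenchmarkItem` class, the `get_response` entry point, the column names, and the toy rows are all illustrative assumptions, not the actual VayuBench schema or harness.

```python
from dataclasses import dataclass

# Toy stand-in for an air-quality table; rows and column names are hypothetical.
ROWS = [
    {"city": "Delhi", "pm25": 110.0},
    {"city": "Delhi", "pm25": 90.0},
    {"city": "Mumbai", "pm25": 40.0},
]


@dataclass
class BenchmarkItem:
    query: str           # natural-language question
    category: str        # one of the seven query categories
    reference_code: str  # verified Python that defines get_response(rows)


ITEM = BenchmarkItem(
    query="Which city has the highest average PM2.5?",
    category="spatial",
    reference_code='''
def get_response(rows):
    readings = {}
    for r in rows:
        readings.setdefault(r["city"], []).append(r["pm25"])
    return max(readings, key=lambda c: sum(readings[c]) / len(readings[c]))
''',
)


def run_code(code, rows):
    """Execute a code string and call the get_response function it defines.

    exec() on untrusted model output is unsafe; a real harness would sandbox it.
    """
    namespace = {}
    exec(code, namespace)
    return namespace["get_response"](rows)


def is_correct(item, candidate_code, rows):
    """A candidate passes only if it runs and matches the reference answer."""
    expected = run_code(item.reference_code, rows)
    try:
        return run_code(candidate_code, rows) == expected
    except Exception:
        return False
```

Under these assumptions, a candidate is judged by the answer its code produces rather than by textual similarity to the reference, which is what distinguishes an executable benchmark from a string-matching one.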

We evaluate 13 open-source LLMs under a unified, schema-aware protocol. While Qwen3-Coder-30B attains the strongest performance, frequent column-name and variable errors highlight risks for smaller models.
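As a rough illustration of how execution outcomes could be sorted into error types, the sketch below runs a generated snippet and labels the result by the exception it raises. Mapping `KeyError` to a column-name error and `NameError` to a variable error is an assumption made for this example, as are the label names and the toy data; the paper's own error taxonomy may differ.

```python
from collections import Counter

# Hypothetical toy rows; the real datasets and column names may differ.
ROWS = [{"city": "Delhi", "pm25": 110.0}, {"city": "Mumbai", "pm25": 40.0}]


def classify(code, rows):
    """Run generated code and label the outcome by exception type."""
    namespace = {"rows": rows}
    try:
        exec(code, namespace)  # unsafe for untrusted code; sandbox in practice
    except SyntaxError:
        return "syntax_error"
    except KeyError:
        return "column_name_error"  # e.g. a misspelled column
    except NameError:
        return "variable_error"     # e.g. an undefined variable
    except Exception:
        return "other_error"
    return "ok"


def summarize(snippets, rows):
    """Tally outcome labels and compute the share of snippets that executed."""
    counts = Counter(classify(code, rows) for code in snippets)
    exec_rate = counts["ok"] / len(snippets)
    return counts, exec_rate


SNIPPETS = [
    "answer = max(r['pm25'] for r in rows)",    # runs
    "answer = max(r['PM2.5'] for r in rows)",   # wrong column name
    "answer = max(r['pm25'] for r in data)",    # undefined variable
    "answer = max(r['pm25'] for r in rows",     # unbalanced parenthesis
]
```

Calling `summarize(SNIPPETS, ROWS)` returns one count per label and the fraction of snippets that ran without raising.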

To bridge evaluation with practice, we deploy VayuChat, an interactive assistant that delivers real-time, code-backed analysis for Indian policymakers and citizens. Together, VayuBench and VayuChat demonstrate a reproducible pathway from benchmark to verified execution to deployment, establishing the foundations for trustworthy LLM-driven decision support in environmental monitoring.

VayuBench: Query Distribution

Distribution of 5,000 queries across categories: Spatial (48.82%), Spatio-Temporal (24.56%), Temporal (12.15%), Funding (4.42%), Population-Based (3.82%), Area-Based (3.72%), and Specific Pattern (2.52%).
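The percentages above can be converted into approximate per-category counts. The sketch below does this with exact decimal arithmetic; because the published shares are rounded (they add up to slightly more than 100%), the resulting counts should be read as approximations rather than the benchmark's exact tallies.

```python
from decimal import Decimal

TOTAL_QUERIES = 5000

# Shares as listed in the distribution above, in percent.
SHARES = {
    "Spatial": Decimal("48.82"),
    "Spatio-Temporal": Decimal("24.56"),
    "Temporal": Decimal("12.15"),
    "Funding": Decimal("4.42"),
    "Population-Based": Decimal("3.82"),
    "Area-Based": Decimal("3.72"),
    "Specific Pattern": Decimal("2.52"),
}


def implied_counts(shares, total):
    """Convert percentage shares into the query counts they imply."""
    return {name: pct * total / 100 for name, pct in shares.items()}


counts = implied_counts(SHARES, TOTAL_QUERIES)
share_sum = sum(SHARES.values())

for name, count in counts.items():
    print(f"{name}: about {count} queries")
print(f"Shares sum to {share_sum}%")
```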

VayuChat: Interactive Interface

VayuChat provides an intuitive interface with AI model selection, quick prompts, natural language query processing, and automated visualizations for air quality analytics.

Key Results

Best Performance: Qwen3-32B achieves 98% executable code with an error rate of only 1%.
Model Scale Matters: Models under 7B parameters show high error rates (53% to 97%) due to column-naming and syntax issues.
Schema-Aware Prompting Works: Providing data schema information significantly improves code generation accuracy.
Production Deployment: VayuChat serves real users, demonstrating practical viability beyond academic benchmarks.
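Schema-aware prompting, as mentioned in the results above, can be sketched as follows: the table and column names are placed in the prompt so the model does not have to guess them. The prompt wording, function names, table names, and columns here are hypothetical and are not taken from the VayuBench protocol.

```python
def describe_schema(table_name, columns):
    """Render one table's columns and types as plain text."""
    lines = [f"Table `{table_name}` columns:"]
    for name, dtype in columns.items():
        lines.append(f"- {name} ({dtype})")
    return "\n".join(lines)


def build_prompt(question, schemas):
    """Assemble a prompt that lists every table schema before the question."""
    schema_text = "\n\n".join(
        describe_schema(table, cols) for table, cols in schemas.items()
    )
    return (
        "You are given the following tables.\n\n"
        f"{schema_text}\n\n"
        "Use only the column names listed above.\n"
        f"Question: {question}\n"
        "Write Python code that answers the question."
    )


# Hypothetical schemas for illustration only.
SCHEMAS = {
    "air_quality": {"city": "str", "date": "date", "pm25": "float"},
    "population": {"city": "str", "population": "int"},
}

prompt = build_prompt("Which city has the highest average PM2.5?", SCHEMAS)
print(prompt)
```

Listing the exact column names is aimed at the column-naming failures noted for smaller models, since the model can copy identifiers instead of inventing them.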

Impact

VayuBench and VayuChat address a critical gap in environmental analytics by making air quality data accessible to non-technical stakeholders. Our benchmark establishes rigorous evaluation standards for LLMs in domain-specific analytics, while our deployed system demonstrates the practical feasibility of LLM-powered decision support for policy-critical applications.

This work has been featured in Indian Express, Ahmedabad Mirror, and other leading environmental technology publications.

BibTeX

@inproceedings{acharya2025vayubench,
  title={VayuBench and VayuChat: Executable Benchmarking and Deployment of LLMs for Multi-Dataset Air Quality Analytics},
  author={Acharya, Vedant and Pisharodi, Abhay and Pasi, Ratnesh and Mondal, Rishabh and Batra, Nipun},
  booktitle={CODS 2025},
  year={2025}
}