Benchmark Categories
VayuBench comprises seven systematically defined query categories, each reflecting distinct analytical tasks for air quality decision-making. These categories were derived from stakeholder consultations with policymakers, air quality scientists, and environmental data practitioners.
Category Distribution

| Category | Questions | Share |
|---|---|---|
| Spatial Aggregation (SA) | 4,897 | 48.8% |
| Spatio-Temporal Aggregation (STA) | 2,463 | 24.6% |
| Temporal Trends (TT) | 1,219 | 12.2% |
| Funding-Based Queries (FQ) | 443 | 4.4% |
| Population-Based Exposure (PB) | 383 | 3.8% |
| Area-Based Aggregation (AB) | 373 | 3.7% |
| Specific Patterns (SP) | 256 | 2.5% |
1. Spatial Aggregation (SA)
4,897 questions (48.8%)
Location-based summaries at fixed times, requiring geographic grouping across monitoring stations, cities, or states.
Example Questions:
- “Which state had the highest average PM2.5 in May 2023?”
- “Identify the station in Delhi with the lowest PM10 on January 14, 2022”
- “Which city reported the 2nd highest PM2.5 readings in August 2020?”
Code Example:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    # Filter data for specific time period
    filtered = data[(data['Timestamp'].dt.year == 2023) &
                    (data['Timestamp'].dt.month == 5)]
    # Group by state and calculate mean PM2.5
    grouped = filtered.groupby("state")["PM2.5"].mean()
    grouped = grouped.dropna()
    # Sort and get highest
    sorted_data = grouped.sort_values()
    return sorted_data.index[-1]
```

Skills Tested:
- Temporal filtering (`dt.year`, `dt.month`, `dt.day`)
- Geographic grouping (`groupby`)
- Aggregation functions (`mean`, `max`, `min`)
- Sorting and ranking (`sort_values`, `iloc`)
- Handling missing data (`dropna`)
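Every code example in this section shares the same `get_response(data, states_data, ncap_funding_data)` signature. The sketch below shows how such a function could be exercised on toy inputs; the column names mirror the examples in this section, but the values and loading step are invented for illustration and are not part of VayuBench.

```python
import pandas as pd

# Toy frames using the column names seen in the examples in this section;
# the values are invented for illustration only.
data = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2023-05-01", "2023-05-02", "2023-05-01"]),
    "state": ["Delhi", "Delhi", "Maharashtra"],
    "city": ["Delhi", "Delhi", "Mumbai"],
    "station": ["DL001", "DL001", "MH001"],
    "PM2.5": [180.0, 150.0, 60.0],
    "PM10": [250.0, 220.0, 110.0],
})
states_data = pd.DataFrame({"state": ["Delhi", "Maharashtra"],
                            "population": [20_000_000, 120_000_000],
                            "area (km2)": [1484, 307713]})
ncap_funding_data = pd.DataFrame({"city": ["Delhi", "Mumbai"],
                                  "Amount released during FY 2019-20": [10.0, 8.0]})

# Assumes the Spatial Aggregation get_response above is already defined.
print(get_response(data, states_data, ncap_funding_data))  # -> "Delhi"
```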
2. Spatio-Temporal Aggregation (STA)
2,463 questions (24.6%)
Combined space-time analysis requiring summaries across both geographic regions and time periods, such as comparing trends between states or tracking changes over years.
Example Questions:
- “Which state showed the second smallest decrease in PM10 between 2020 and 2023?”
- “Which city had the highest average PM2.5 during summer months (April-June) in 2022?”
- “Identify the state with the largest year-over-year improvement in PM2.5 from 2021 to 2022”
Code Example:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    # Get data for two years
    data_2020 = data[data['Timestamp'].dt.year == 2020]
    data_2023 = data[data['Timestamp'].dt.year == 2023]
    # Calculate averages by state
    avg_2020 = data_2020.groupby('state')['PM10'].mean()
    avg_2023 = data_2023.groupby('state')['PM10'].mean()
    # Calculate decrease
    decrease = avg_2020 - avg_2023
    # Sort and get 2nd smallest
    sorted_decrease = decrease.sort_values()
    return sorted_decrease.index[1]
```

Skills Tested:
- Multi-period temporal filtering
- Cross-time comparisons
- Change/delta calculations
- Multi-dimensional aggregation
- Geographic and temporal grouping
3. Temporal Trends (TT)
1,219 questions (12.2%)
Time-series analysis and longitudinal patterns, focusing on how pollution levels evolve over time without explicit geographic grouping.
Example Questions:
- “During which month is the average PM2.5 level the highest across India?”
- “On which day of the week were PM10 levels highest in 2022?”
- “Which year recorded the lowest annual average PM2.5?”
Code Example:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    # Group by month name and calculate mean
    monthly_avg = data.groupby(
        data["Timestamp"].dt.month_name()
    )["PM2.5"].mean()
    # Sort to find highest
    sorted_data = monthly_avg.sort_values()
    return sorted_data.index[-1]
```

Skills Tested:
- Temporal aggregation (`dt.month_name()`, `dt.dayofweek`, `dt.year`)
- Time-series grouping
- Trend identification
- Statistical summaries over time
- Date-time manipulation
4. Funding-Based Queries (FQ)
443 questions (4.4%)
NCAP policy and funding analysis, critical for financial accountability and outcome assessment. These queries link interventions (funding) to outcomes (air quality improvements).
Example Questions:
- “Which financial year had the highest average fund release across cities?”
- “Which city received the most total NCAP funding between 2019-2022?”
- “What is the average utilization rate across all NCAP-funded cities?”
Code Example:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    # Define fiscal year columns
    fy_columns = [
        'Amount released during FY 2019-20',
        'Amount released during FY 2020-21',
        'Amount released during FY 2021-22'
    ]
    # Calculate mean for each FY
    avg_funding = ncap_funding_data[fy_columns].mean()
    # Find FY with maximum funding
    max_fy = avg_funding.idxmax()
    # Extract fiscal year from column name
    return max_fy.split('FY ')[1]
```

Skills Tested:
- Multi-dataset integration
- Financial data analysis
- Column selection and aggregation
- String manipulation
- Policy-outcome linking
5. Population-Based Exposure (PB)
383 questions (3.8%)
Queries joining air quality data with demographics to assess population-weighted exposure burden and health equity.
Example Questions:
- “Which state has the highest population-weighted average PM2.5 in 2023?”
- “What percentage of India’s population lives in states exceeding WHO PM2.5 limits (>15 µg/m³)?”
- “Which is the 2nd least polluted state when normalized by per-capita PM10 exposure?”
Code Example:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    # Calculate average PM2.5 by state for 2023
    data_2023 = data[data['Timestamp'].dt.year == 2023]
    avg_pm25 = data_2023.groupby('state')['PM2.5'].mean().reset_index()
    # Merge with population data
    merged = avg_pm25.merge(
        states_data[['state', 'population']],
        on='state',
        how='inner'
    )
    # Weight PM2.5 by population
    merged['weighted_pm25'] = merged['PM2.5'] * merged['population']
    # Get state with highest weighted exposure
    return merged.sort_values('weighted_pm25').iloc[-1]['state']
```

Skills Tested:
- Multi-dataset joins (`merge`)
- Demographic analysis
- Weighted calculations
- Equity metrics
- Health impact assessment
6. Area-Based Aggregation (AB)
373 questions (3.7%)
Queries normalizing by geographical area, reflecting fairness in monitoring coverage and spatial density of pollution.
Example Questions:
- “Which state has the fewest monitoring stations per square kilometer?”
- “Which union territory has the highest PM10 per km² in 2022?”
- “Rank states by PM2.5 concentration normalized by land area”
Code Example:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    # Count unique stations per state
    station_counts = data.groupby('state')['station'].nunique().reset_index()
    # Merge with area data
    merged = station_counts.merge(
        states_data[['state', 'area (km2)']],
        on='state',
        how='inner'
    )
    # Calculate stations per km²
    merged['stations_per_km2'] = merged['station'] / merged['area (km2)']
    # Get state with lowest density
    return merged.sort_values('stations_per_km2').iloc[0]['state']
```

Skills Tested:
- Spatial normalization
- Geographic density calculations
- Multi-dataset integration
- Coverage analysis
- Fairness metrics
7. Specific Patterns (SP)
256 questions (2.5%)
Detection of rule-based violations over short time windows, such as counting exceedances of WHO/CPCB limits or locating extreme readings within a period.
Example Questions:
- “Over the past five years, how many days did Mumbai exceed WHO PM2.5 limits (>15 µg/m³)?”
- “How many times did Delhi’s PM10 exceed 200 µg/m³ in 2021?”
- “Which date in the last 5 years had the lowest PM2.5 in Jaipur?”
Code Example:
```python
def get_response(data, states_data, ncap_funding_data):
    import pandas as pd
    from datetime import datetime, timedelta
    # Define time window (last 5 years)
    cutoff_date = datetime.now() - timedelta(days=5*365)
    # Filter for Mumbai within time window
    mumbai = data[
        (data['city'] == 'Mumbai') &
        (data['Timestamp'] >= cutoff_date)
    ]
    # Find exceedances (>15 µg/m³)
    exceedance = mumbai[mumbai['PM2.5'] > 15]
    # Count unique days
    unique_days = exceedance['Timestamp'].dt.date.unique()
    return len(unique_days)
```

Skills Tested:
- Threshold detection
- Pattern matching
- Time window filtering
- Exceedance counting
- Rule-based logic
Template-Based Generation
VayuBench generates queries from a systematic set of templates, balancing diversity with coverage. Each category has multiple templates with parameterized variables, as in the example below.
```json
{
  "category": "spatial_aggregation",
  "question": "Which {location} reported the {stats} {col} during {month} {year}?",
  "location": ["state", "city", "station"],
  "stats": ["highest", "lowest", "2nd highest", "3rd lowest"],
  "col": ["PM2.5", "PM10"],
  "month": ["January", "February", ..., "December"],
  "year": ["2018", "2019", "2020", "2021", "2022", "2023", "2024"]
}
```

Generated Instances:
- “Which state reported the highest PM2.5 during May 2023?”
- “Which city reported the lowest PM10 during January 2020?”
- “Which station reported the 2nd highest PM2.5 during August 2022?”
This systematic expansion from 67 templates produces ~26K candidate queries, from which we sample 5,000 to maximize diversity.
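As a rough sketch of how one such template expands, the parameter lists can be combined with a Cartesian product; the generator below, with truncated parameter lists, is illustrative and not the exact VayuBench implementation.

```python
from itertools import product

# Hypothetical template mirroring the JSON structure shown above
# (month and year lists truncated for brevity).
template = {
    "category": "spatial_aggregation",
    "question": "Which {location} reported the {stats} {col} during {month} {year}?",
    "location": ["state", "city", "station"],
    "stats": ["highest", "lowest", "2nd highest", "3rd lowest"],
    "col": ["PM2.5", "PM10"],
    "month": ["January", "May", "August"],
    "year": ["2020", "2022", "2023"],
}

def expand(template):
    # Every key other than the metadata fields is a slot to fill in the question string
    slots = [k for k in template if k not in ("question", "category")]
    for combo in product(*(template[k] for k in slots)):
        yield template["question"].format(**dict(zip(slots, combo)))

questions = list(expand(template))
print(len(questions))  # 3 * 4 * 2 * 3 * 3 = 216 candidates from this one template
print(questions[0])    # Which state reported the highest PM2.5 during January 2020?
```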
Paraphrasing for Naturalness
Each sampled query is paraphrased with Gemini and then manually verified, adding linguistic variety without changing the semantics:
| Original | Paraphrased |
|---|---|
| “Which state has the highest average PM10 in May 2023?” | “Identify the state with the top average PM10 concentration for May 2023.” |
| “Which city has the lowest PM2.5 in 2022?” | “Determine the city that recorded minimal PM2.5 levels during 2022.” |
This prevents models from overfitting to rigid templates and better reflects real-world query diversity.
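For illustration, the paraphrasing step might look like the sketch below, assuming the google-generativeai SDK; the model name, prompt wording, and key handling are assumptions rather than the documented VayuBench pipeline.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model; the docs only say "Gemini"

def paraphrase(question: str) -> str:
    # Ask the model to reword the query while preserving entities, dates, and thresholds
    prompt = (
        "Paraphrase this air-quality question without changing its meaning, "
        f"locations, dates, pollutants, or thresholds:\n{question}"
    )
    response = model.generate_content(prompt)
    return response.text.strip()

original = "Which state has the highest average PM10 in May 2023?"
print(paraphrase(original))
# Each paraphrase is then manually verified against the original before inclusion.
```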
Next Steps
- Explore the Datasets that power these queries
- Learn how to Get Started with VayuBench
- See Results from evaluating 13 LLMs across all categories