AirTalks: Understanding Air Pollution Discourse

A unified view of two studies—Vartalaap (Twitter discourse, Delhi) and Samachar (print news across India)—aligned with measured PM2.5 to reveal seasonal attention spikes, metro-centric coverage, supportive sentiment for contested fixes, and gaps between media narratives and scientific source evidence.

Why this matters

India faces some of the world's highest PM2.5 exposures, yet the attention of news outlets and the public rises mainly during short-lived winter smog episodes. This mismatch between year-round health risks and episodic visibility has consequences for sustained action. By pairing insights from Vartalaap (1.2M Delhi-focused tweets, 2016-2020) with Samachar (17.4K curated articles from 88 cities, 2010-2021), we show how discourse is shaped, whose voices dominate, and where coverage diverges from scientific evidence—offering lessons for more effective risk communication.


Our datasets

  • PM2.5 ground truth from CPCB/OpenAQ; station data aggregated to city-daily series for comparison.
  • Twitter: 2016-2020, Delhi-focused, ~1.2M tweets; News: 2010-2021, 17.4K articles across two national dailies.
Samachar: number of cities / PM2.5 availability
Number of cities for which PM2.5 data is available each year. After 2015, the number of cities increases steadily as new monitoring stations are installed.
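The station-to-city aggregation mentioned in the list above could look roughly like the following. This is a minimal sketch, not the pipeline used in the studies: the record layout (city, station, ISO timestamp, value), the function name, and the choice to average within each station before averaging across stations are all assumptions made for illustration, and the sample values are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def city_daily_pm25(readings):
    """Average station readings into one PM2.5 value per (city, date).

    `readings` is an iterable of (city, station_id, iso_timestamp, pm25)
    tuples. This layout is hypothetical. Missing values (None) are skipped.
    """
    # Average within each station-day first, so that stations reporting
    # more often do not dominate the city mean.
    station_day = defaultdict(list)
    for city, station, ts, value in readings:
        if value is None:
            continue
        day = datetime.fromisoformat(ts).date().isoformat()
        station_day[(city, day, station)].append(value)

    # Then average the station-day means across stations in the same city.
    city_day = defaultdict(list)
    for (city, day, _station), values in station_day.items():
        city_day[(city, day)].append(mean(values))

    return {key: mean(values) for key, values in city_day.items()}

# Made-up readings for illustration only.
sample = [
    ("Delhi", "S1", "2019-11-03T08:00:00", 300.0),
    ("Delhi", "S1", "2019-11-03T20:00:00", 500.0),
    ("Delhi", "S2", "2019-11-03T08:00:00", 200.0),
    ("Delhi", "S2", "2019-11-04T08:00:00", None),
]
daily = city_daily_pm25(sample)
```

Keying the result by (city, date) keeps it easy to join against daily tweet or article counts later.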
Vartalaap dataset summary (Delhi-centric air pollution Twitter dataset)
Statistic                 Value
Total Tweets Collected    1.25M
Unique Users              ~26K
Queries / Hashtags        34 (Delhi-specific)
Air Quality Stations      PM2.5 monitors across Delhi
Collection Period         Jan 2016 to Dec 2020

How we studied this

Vartalaap (Twitter)

  • Applied natural language processing techniques to categorize tweets by sentiment (supportive, neutral, unsupportive) toward interventions like Odd-Even and Smog Towers, using a fine-tuned BERT model with nested cross-validation for robustness.
  • Analyzed temporal dynamics by comparing sentiment and tweet volumes with air quality patterns (daily PM2.5 levels).
  • Conducted topic modeling (LDA) to identify dominant narratives and concerns expressed on Twitter.
  • Used Granger-causality analysis to test whether pollution levels could predict changes in online discussion volume.
  • Studied concentration of voices by analyzing the distribution of user activity (who tweets, and how often).
Sentiment over time for Odd-Even intervention
Evolution of sentiment around the "Odd-Even" scheme over time. The vertical lines tagged 'A' mark when the scheme was implemented: January 2016, November 2017, and November 2019. 'B' and 'C' mark two events that drove changes in public sentiment.
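To make the Granger-causality step in the list above more concrete, here is a small hand-rolled sketch of the idea: check whether lagged values of one series reduce the prediction error for another series beyond what that series' own lags achieve. The source does not state the lag order, stationarity checks, or software used, so the function names, the default lag of 1, and the synthetic data below are assumptions. It returns only the F statistic; a real analysis would more likely use a statistics package (for example `grangercausalitytests` in statsmodels) to obtain p-values.

```python
import random

def ols_rss(columns, y):
    """Residual sum of squares of y regressed on `columns` plus an intercept."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def granger_f(x, y, lag=1):
    """F statistic for 'x helps predict y' using `lag` lags of each series."""
    target = y[lag:]
    y_lags = [y[lag - j: len(y) - j] for j in range(1, lag + 1)]
    x_lags = [x[lag - j: len(x) - j] for j in range(1, lag + 1)]
    rss_restricted = ols_rss(y_lags, target)          # own lags only
    rss_full = ols_rss(y_lags + x_lags, target)       # own lags + lags of x
    n, k = len(target), 1 + 2 * lag
    return ((rss_restricted - rss_full) / lag) / (rss_full / (n - k))

# Synthetic standardized series: y follows x with a one-step delay plus noise.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.3 * random.gauss(0, 1) for t in range(1, 200)]
f_xy = granger_f(x, y)  # x -> y: expected to be large
f_yx = granger_f(y, x)  # y -> x: expected to be small
```

In the setting described here, `x` would stand for daily PM2.5 and `y` for daily tweet volume, ideally after checking that both series are stationary.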

Samachar (Print Media)

  • Compared article counts across cities and years to map how media attention fluctuated with changing pollution levels.
  • Used topic modeling to reveal how different pollution sources (vehicular, industrial, biomass) were represented in news coverage.
  • Compared media narratives against scientific source-apportionment studies to highlight gaps and biases in coverage.
  • Examined framing differences across national newspapers (Times of India vs. The Hindu) to capture variation in emphasis and agenda.
Samachar: Media coverage vs PM2.5 levels mismatch
Samachar: Comparison of Delhi PM2.5 levels and news article counts (2010-2021), showing that media attention is episodic despite consistently high pollution.
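One simple way to surface the kind of mismatch shown in the comparison above is to flag months where pollution is high but coverage is below its typical level. The sketch below is illustrative only: the function name, the monthly dictionaries, the use of the median as the "typical" article count, and the 60 µg/m³ cut-off are assumptions, and the numbers are made up rather than taken from Samachar.

```python
from statistics import median

def undercovered_months(monthly_pm25, monthly_articles, pm25_limit=60.0):
    """Months with mean PM2.5 above `pm25_limit` but below-median article counts."""
    typical = median(monthly_articles.values())
    return sorted(
        month for month, pm in monthly_pm25.items()
        if pm > pm25_limit and monthly_articles.get(month, 0) < typical
    )

# Made-up illustrative values, not results from the study.
pm25 = {"2019-01": 180.0, "2019-04": 85.0, "2019-08": 35.0, "2019-11": 210.0}
articles = {"2019-01": 40, "2019-04": 5, "2019-08": 2, "2019-11": 95}
flagged = undercovered_months(pm25, articles)
print(flagged)
```

With these toy inputs, only the April entry is flagged: PM2.5 is above the cut-off while the article count sits below the median.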

What we found

  • Seasonal attention gap: Media and online discourse spike sharply in winter, even though hazardous PM2.5 levels persist through much of the year.
  • Geographic skew: Coverage is metro-centric (Delhi, Mumbai), leaving highly polluted but less “visible” Indo-Gangetic Plain cities underrepresented in both news and social media narratives.
  • Narrative distortions: News stories often foreground vehicular emissions, while scientific evidence highlights the greater role of residential biomass burning and regional transport.
  • Public response to fixes: Twitter sentiment shows spikes of support around high-profile interventions like Odd-Even or Smog Towers, but discussion fades quickly after announcements.
  • Concentration of voices: A relatively small set of active Twitter users disproportionately shapes discourse, framing the conversation and amplifying certain narratives.
  • Framing contrasts: Print media tends to frame pollution as a governance and civic issue, while Twitter often emphasizes health risks and immediate lived experiences.
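The concentration-of-voices finding above can be quantified in several ways. The source does not say which measure was used, so the two shown here (share of tweets from the most active users, and a Gini coefficient over per-user tweet counts) are assumptions chosen for illustration, as are the function names and the toy input.

```python
from collections import Counter

def top_share(user_ids, top_fraction=0.01):
    """Fraction of all tweets posted by the most active `top_fraction` of users."""
    counts = sorted(Counter(user_ids).values(), reverse=True)
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)

def gini(user_ids):
    """Gini coefficient of tweets per user (0 means perfectly even activity)."""
    counts = sorted(Counter(user_ids).values())
    n, total = len(counts), sum(counts)
    weighted = sum(rank * c for rank, c in enumerate(counts, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Toy input: one author ID per tweet; user "a" posts 7 of the 10 tweets.
tweets_by_user = ["a"] * 7 + ["b", "c", "d"]
share = top_share(tweets_by_user, top_fraction=0.25)
inequality = gini(tweets_by_user)
```

Here the top quarter of users (one account) accounts for 70% of the tweets, and the Gini coefficient is about 0.45.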
PM2.5 Granger causality (monthly)
Monthly Granger-causality: PM2.5 levels help predict Twitter discussion volume mainly in Oct–Dec (winter salience).
Choropleth: PM2.5 vs article counts across India
Geographic mismatch: Indo-Gangetic Plain hotspots have high PM2.5 but relatively lower media attention vs metros.
Source contributions vs media coverage (Delhi)
Source mismatch: media overrepresent vehicular sources while underreporting residential/biomass contributions compared to scientific apportionment.
Year-round PM2.5 exceedances across Indo-Gangetic Plain cities
Samachar: Year-long PM2.5 exceedances across multiple Indo-Gangetic Plain cities (Apr 2018-Apr 2021), showing that pollution persists through the year (except the monsoon months) and is not only a winter problem.
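Counting exceedance days per month is one straightforward way to check whether high pollution is confined to winter. The sketch below assumes a city-daily series keyed by (city, ISO date) and uses India's 24-hour national standard for PM2.5 (60 µg/m³) as the cut-off. The source does not state which threshold the exceedance figure uses, so treat that choice, the function name, and the made-up sample values as assumptions.

```python
from collections import Counter

# Assumed threshold: India's 24-hour NAAQS for PM2.5, in µg/m³.
NAAQS_24H_PM25 = 60.0

def exceedance_days_by_month(city_daily, limit=NAAQS_24H_PM25):
    """Count days per (city, 'YYYY-MM') whose daily mean PM2.5 exceeds `limit`."""
    counts = Counter()
    for (city, iso_date), pm in city_daily.items():
        if pm > limit:
            counts[(city, iso_date[:7])] += 1
    return dict(counts)

# Made-up daily means for illustration only.
series = {
    ("Kanpur", "2019-05-01"): 92.0,
    ("Kanpur", "2019-05-02"): 58.0,
    ("Kanpur", "2019-05-03"): 75.0,
    ("Kanpur", "2019-08-10"): 31.0,
}
exceedances = exceedance_days_by_month(series)
```

Months with no exceedance simply do not appear in the result, which makes low-pollution periods such as the monsoon easy to spot.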

Most actionable takeaways

  • Communicate beyond winter: Pollution is not a seasonal crisis—maintain continuous public engagement that reflects year-round health risks.
  • Broaden geographic lens: Spotlight smaller Indo-Gangetic Plain cities and peri-urban regions with chronic exceedances, not only high-profile metros.
  • Anchor in science: Frame stories and policies with evidence on dominant sources (e.g., biomass, regional transport) and clearly convey uncertainties.
  • Expand trusted voices: Involve health professionals, teachers, and local community actors (beyond a handful of influencers) to diversify and democratize the conversation.
  • Sustain momentum: Link high-attention events (e.g., policy launches, smog alerts) to follow-up reporting and monitoring, preventing quick fade-out.

Limitations & next steps

  • Scope: Vartalaap focuses on Delhi Twitter; Samachar covers two English dailies—extend to more cities, languages, and platforms.
  • Methods: Add multimodal (text+image) analysis; use active learning/weak supervision to keep sentiment labels scalable and up to date.
  • Operations: Move from batch to streaming dashboards with QC, uncertainty, and reproducible refreshes.