This intensive 10-day program equips interns with essential ML fundamentals, practical skills, and research methods. Each day combines lectures from established courses with hands-on exercises and assessments using real datasets.
Target Audience: Summer interns, 6-month interns, JRFs, and prospective lab members
Prerequisites: Basic Python programming knowledge
Format: Daily 4-5 hour sessions with lectures, practice, and assessment
Day 1: NumPy Fundamentals
Array Computing for Data Science
Learning Objectives
Master NumPy array operations and indexing
Understand broadcasting and vectorization
Apply NumPy to mathematical computations
Daily Exercises
Exercise 1A: Array Fundamentals (30 minutes)
Create arrays using different methods (zeros, ones, arange, linspace)
Practice array indexing and slicing
Reshape and transpose operations
Array concatenation and splitting
Exercise 1B: Mathematical Operations (45 minutes)
Element-wise operations vs matrix operations
Statistical functions (mean, std, min, max)
Sorting and searching in arrays
Random number generation and seeding
Exercise 1C: Broadcasting Practice (30 minutes)
Add scalar to array
Operations between arrays of different shapes
Create meshgrids for plotting
Normalize arrays using broadcasting
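The broadcasting tasks above can be sketched as follows. This is a minimal illustration on made-up data, not a reference solution; the array names and the 5 x 3 shape are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical data: 5 samples (rows) x 3 features (columns).
rng = np.random.default_rng(seed=0)
X = rng.normal(loc=10.0, scale=2.0, size=(5, 3))

# Adding a scalar broadcasts it across every element.
shifted = X + 1.0

# Column statistics have shape (3,), which broadcasts against (5, 3).
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_standardized = (X - col_mean) / col_std

# Row statistics need keepdims=True so that shape (5, 1) lines up with (5, 3).
row_min = X.min(axis=1, keepdims=True)
row_max = X.max(axis=1, keepdims=True)
X_row_scaled = (X - row_min) / (row_max - row_min)

# A (4, 1) column combined with a (1, 3) row broadcasts to a (4, 3) grid,
# the same idea np.meshgrid relies on when building plotting grids.
xs = np.arange(4).reshape(4, 1)
ys = np.arange(3).reshape(1, 3)
grid = xs + ys
```

Dropping keepdims=True in the row-wise case would give shape (5,), which cannot broadcast against (5, 3) and raises an error; that failure is worth triggering once on purpose.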
Daily Assessment: Activity Recognition Analysis (60 minutes)
Dataset: UCI Human Activity Recognition
Tasks:
Load and explore accelerometer/gyroscope data
Calculate basic statistics for each activity type
Find patterns in sensor readings for walking vs sitting
Create visualizations using NumPy operations
Identify which sensors are most informative
Deliverable: Jupyter notebook with clean code and insights
Day 2: Pandas & Matplotlib
Data Manipulation and Visualization
Learning Objectives
Master DataFrame operations and data cleaning
Create professional visualizations
Combine data from multiple sources
Daily Exercises
Exercise 2A: DataFrame Operations (45 minutes)
Load CSV data and inspect structure
Handle missing values (dropna, fillna, interpolate)
Filter and query operations
Group by operations and aggregations
Merge and join different datasets
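A small sketch of the DataFrame operations listed above, using an in-memory table instead of a CSV file. The column names, station labels, and values are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps in the temperature column.
readings = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "temperature": [20.0, np.nan, 24.0, 15.0, 17.0, np.nan],
})
# Hypothetical lookup table to merge against.
stations = pd.DataFrame({"station": ["A", "B"], "city": ["North", "South"]})

# Three ways to handle missing values.
dropped = readings.dropna()                       # discard incomplete rows
filled = readings.fillna({"temperature": 0.0})    # fill with a constant
interpolated = readings.assign(                   # interpolate within each station
    temperature=readings.groupby("station")["temperature"]
    .transform(lambda s: s.interpolate())
)

# Filter, aggregate, and merge.
warm = interpolated.query("temperature > 16")
summary = interpolated.groupby("station")["temperature"].agg(["mean", "max"])
merged = interpolated.merge(stations, on="station", how="left")
```

Interpolating inside each group avoids blending values across stations. Note that linear interpolation fills a trailing gap with the last valid value, which may or may not be appropriate for a given dataset.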
Exercise 2B: Data Cleaning Pipeline (45 minutes)
Remove duplicates and outliers
Convert data types appropriately
Create derived columns
Handle datetime data
Export cleaned data
Exercise 2C: Visualization Mastery (60 minutes)
Line plots with multiple series
Scatter plots with color coding
Histograms and distribution plots
Subplots and figure customization
Save high-quality figures
Daily Assessment: Weather Data Analysis (90 minutes)
Dataset: UCI Weather Dataset or built-in Seaborn Flights Dataset
Tasks:
Load and clean weather data (temperature, humidity, pressure)
Handle missing values using simple methods
Create basic time-based features (month, season)
Calculate monthly averages and trends
Create simple visualizations:
Temperature trends over time
Correlation between weather variables
Monthly distribution plots
Simple seasonal patterns
Write 1-page summary of findings
Deliverable: Clean dataset + basic EDA notebook
Day 3: Python Mastery Assessment
Comprehensive Skills Evaluation
📚 Review Materials
Consolidate learning from Days 1-2
Python Data Science Handbook (Chapters 1-4)
40 Challenge Questions (180 minutes)
NumPy Section (Questions 1-15)
Create a 3D array (5×4×3) filled with random integers from 1-100
Find all elements greater than 50 and replace with their square root
Compute the covariance matrix for a 2D dataset
Implement matrix multiplication without using np.dot()
Create a function to normalize arrays to 0-1 range
Find the indices of the maximum value in each row of a 2D array
Create a moving average function using array slicing
Solve a system of linear equations using NumPy
Create a function to compute pairwise distances between points
Implement k-means centroid update using broadcasting
Create a 2D Gaussian kernel for image filtering
Find connected components in a binary image (0s and 1s)
Implement efficient computation of Euclidean distance matrix
Create a function to remove outliers using z-score
Compute eigenvalues and eigenvectors for PCA implementation
Pandas Section (Questions 16-30)
Load multiple CSV files and combine them efficiently
Create a function to detect and handle different types of missing data
Implement time-based resampling for irregular time series
Create a pivot table with multiple aggregation functions
Implement efficient groupby operations on large datasets
Create a function to standardize column names
Merge datasets with different time zones
Implement outlier detection using IQR method
Create a function to generate summary statistics report
Handle categorical data encoding efficiently
Implement sliding window operations on time series
Create a function to validate data quality
Implement efficient data type optimization
Create custom aggregation functions for groupby
Handle hierarchical/multi-index DataFrames
Matplotlib Section (Questions 31-40)
Create a publication-ready figure with subplots
Implement interactive plots with widgets
Create custom colormaps for scientific data
Design a dashboard-style multi-panel figure
Create animations for time series data
Implement error bars and confidence intervals
Create geographic plots using basemap principles
Design custom plot styles and themes
Create 3D visualizations for scientific data
Implement plot export pipeline for publications
Daily Assessment: Multi-Dataset Analysis (120 minutes)
Dataset: Seaborn Tips + Seaborn Flights
Tasks:
Load both tips and flights datasets
Clean and explore each dataset separately
Create 4 visualizations for tips data (bill vs tip patterns)
Create 4 visualizations for flights data (passenger trends)
Write simple summary comparing patterns in both datasets
Evaluation Criteria:
Code quality and readability (40%)
Visualization clarity (30%)
Data insights (30%)
Day 4: Machine Learning Introduction
Foundations of Supervised Learning
Learning Objectives
Understand supervised vs unsupervised learning
Master the ML workflow and evaluation
Implement basic algorithms from scratch
Daily Exercises
Exercise 4A: ML Workflow (60 minutes)
Implement train-validation-test splits
Create cross-validation from scratch
Implement basic performance metrics
Practice bias-variance tradeoff concepts
Create learning curves
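One possible shape for the from-scratch cross-validation task is sketched below. The helper names (k_fold_indices, fit_linear, predict_linear) and the synthetic data are assumptions made for illustration; the model is plain least squares so that the focus stays on the splitting logic.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def fit_linear(X, y):
    """Ordinary least squares with an added intercept column."""
    X_aug = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Hypothetical synthetic data: y = 3x + 2 plus small Gaussian noise.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=50)

scores = []
for train_idx, val_idx in k_fold_indices(len(X), n_folds=5):
    coef = fit_linear(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[val_idx], predict_linear(coef, X[val_idx])))
cv_mse = float(np.mean(scores))
```

The model is refit inside the loop on the training indices only, so no validation sample influences the coefficients used to score it.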
Daily Assessment: Housing Price Prediction (90 minutes)
Dataset: UCI Boston Housing Dataset
Tasks:
Predict house prices using neighborhood features
Compare linear regression, ridge, and lasso performance
Create simple feature combinations (e.g., rooms per capita)
Create basic model evaluation with scatter plots
Interpret which features matter most for price
Bonus: Implement simple gradient descent from scratch
Day 5: Tree-Based Methods
Non-Linear Models and Ensemble Learning
Learning Objectives
Master decision tree algorithms
Understand ensemble methods
Apply to complex real-world problems
Daily Exercises
Exercise 5A: Decision Tree Implementation (90 minutes)
Build decision tree from scratch using Gini impurity
Implement tree pruning techniques
Visualize decision boundaries
Compare with scikit-learn implementation
Analyze tree depth vs performance
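As a starting point for the from-scratch tree, the split criterion can be sketched as below: Gini impurity, 1 minus the sum of squared class proportions, plus a search for the best threshold on a single feature. The function names and the toy data are illustrative assumptions; a full tree would apply this search recursively over all features.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) for an array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_threshold(feature, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature.

    The threshold is None when no split improves on the unsplit impurity.
    """
    best = (None, gini_impurity(labels))
    values = np.unique(feature)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = labels[feature <= t], labels[feature > t]
        weighted = (
            len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
        ) / len(labels)
        if weighted < best[1]:
            best = (float(t), weighted)
    return best

# Toy data: two well-separated groups on a single feature.
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_threshold(feature, labels)
```

On this toy data the parent impurity is 0.5 and the chosen split at 6.5 separates the classes completely, giving a weighted impurity of 0.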
Exercise 5B: Ensemble Methods with Scikit-learn (75 minutes)
Use StandardScaler and MinMaxScaler for data preprocessing
Implement RandomForestClassifier with different parameters
Compare individual trees vs ensemble performance
Feature importance analysis using .feature_importances_
Cross-validation with cross_val_score
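The pieces named in this exercise fit together roughly as follows. This sketch assumes scikit-learn's bundled copy of the Iris data (load_iris) purely so that it runs without any file on disk; the hyperparameters are arbitrary choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single tree versus an ensemble of 100 trees.
tree = DecisionTreeClassifier(random_state=0)
# Scaling does not change tree-based splits; it is included only to show
# how a preprocessing step is chained in front of a model.
forest = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# 5-fold cross-validated accuracy for each model.
tree_scores = cross_val_score(tree, X, y, cv=5)
forest_scores = cross_val_score(forest, X, y, cv=5)

# Fit on all data to inspect which features the forest relies on.
forest.fit(X, y)
importances = forest.named_steps["randomforestclassifier"].feature_importances_
```

Putting the scaler inside the pipeline means it is refit on each training fold, which keeps validation data out of the preprocessing statistics. MinMaxScaler can be swapped in the same way.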
Exercise 5C: K-Nearest Neighbors (45 minutes)
Implement KNN for classification and regression
Experiment with different distance metrics
Analyze curse of dimensionality
Implement efficient neighbor search
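A brute-force KNN covering both classification and regression might look like the sketch below. It uses Euclidean distance and an exhaustive search, so it is a baseline to compare against, not the efficient neighbor search asked for in the last item; the function name and toy data are assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3, task="classification"):
    """Predict for each query row from its k nearest training points."""
    # Pairwise Euclidean distances via broadcasting:
    # (n_query, 1, d) - (1, n_train, d) -> (n_query, n_train, d).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k closest training points for every query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighbor_targets = y_train[nearest]
    if task == "regression":
        return neighbor_targets.mean(axis=1)
    # Majority vote; np.bincount requires non-negative integer labels.
    return np.array([np.bincount(row).argmax() for row in neighbor_targets])

# Toy data: two clusters in 2D.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
X_query = np.array([[0.5, 0.5], [5.5, 5.5]])

class_pred = knn_predict(X_train, y_class, X_query, k=3)
reg_pred = knn_predict(X_train, y_reg, X_query, k=3, task="regression")
```

Swapping the distance line for another metric (for example a sum of absolute differences) is enough to run the distance-metric experiment. The intermediate array grows with n_query x n_train x d, which is one reason tree-based or approximate search becomes necessary on larger data.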
Daily Assessment: Iris Species Classification (120 minutes)
Dataset: Seaborn Iris Dataset (available via seaborn.load_dataset("iris"); the file is fetched online on first use and then cached)
Tasks:
Build classifier to predict iris species (setosa, versicolor, virginica)
Compare decision tree, random forest, and KNN performance
Use tree-based feature importance to find key measurements
Create simple model interpretation plots
Visualize decision boundaries for 2D projections
Create simple prediction function for new flowers
Evaluation: Model performance + interpretability + code clarity
Day 6: Neural Networks & PyTorch
Introduction to Deep Learning
Learning Objectives
Understand neural network fundamentals
Master PyTorch for deep learning
Implement backpropagation from scratch
Daily Exercises
Exercise 6A: Neural Network from Scratch (90 minutes)
Implement forward pass for multi-layer perceptron
Code backpropagation algorithm step by step
Add different activation functions (sigmoid, ReLU, tanh)
Implement mini-batch gradient descent
Compare with analytical solutions
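One way to check a hand-written backward pass is to compare it with finite-difference gradients, as in the sketch below. The architecture (one tanh hidden layer, linear output, half mean squared error), the layer sizes, and the learning rate are all assumptions picked to keep the example short; full-batch updates are used here, and mini-batching is left to the exercise.

```python
import numpy as np

def forward(params, X):
    """One tanh hidden layer followed by a linear output layer."""
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

def loss_fn(params, X, y):
    """Half mean squared error over the batch."""
    _, y_hat = forward(params, X)
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(params, X, y):
    """Gradients of loss_fn for each parameter, derived with the chain rule."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, X)
    d_out = (y_hat - y) / X.shape[0]      # dL/dy_hat
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_pre = (d_out @ W2.T) * (1.0 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ d_pre, d_pre.sum(axis=0)
    return [dW1, db1, dW2, db2]

def numerical_grad(params, X, y, eps=1e-6):
    """Central finite differences, used only to check backward()."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = loss_fn(params, X, y)
            p[idx] = old - eps
            minus = loss_fn(params, X, y)
            p[idx] = old                   # restore the original value
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 2))
params = [rng.normal(scale=0.5, size=(3, 4)), np.zeros(4),
          rng.normal(scale=0.5, size=(4, 2)), np.zeros(2)]

# Gradient check: the largest gap should be tiny if backward() is right.
analytic = backward(params, X, y)
numeric = numerical_grad(params, X, y)
max_gap = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))

# A few full-batch gradient descent steps.
initial_loss = loss_fn(params, X, y)
for _ in range(100):
    grads = backward(params, X, y)
    params = [p - 0.05 * g for p, g in zip(params, grads)]
final_loss = loss_fn(params, X, y)
```

A large max_gap usually points to a missing factor (such as the 1/n from the mean) or a transposed matrix in the backward pass.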
Exercise 6B: PyTorch Framework Deep Dive (75 minutes)
Build neural networks using torch.nn.Module
Implement CNNs using torch.nn.Conv2d and torch.nn.MaxPool2d
Create data loaders with torch.utils.data.DataLoader
Add regularization (dropout, weight decay)
Implement learning rate scheduling with torch.optim.lr_scheduler
Exercise 6C: Advanced PyTorch Architectures (60 minutes)
PDF: Convolutional Neural Networks
Build CNN using torch.nn.Conv2d for image classification
Implement LSTM using torch.nn.LSTM for time series
Practice composing layers with torch.nn.Sequential
Compare CNN vs LSTM performance on time series data
PDF: 1D CNN
Daily Assessment: Simple Time Series Prediction (120 minutes)
Dataset: Seaborn Flights Dataset (passenger numbers over time)
Tasks:
Build simple neural network to predict passenger numbers
Compare basic MLP vs simpler approaches
Create train/validation/test splits for time series
Plot predictions vs actual values
Calculate and interpret prediction errors
Experiment with different number of hidden layers
Bonus: Try predicting multiple steps ahead
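Before training any network on a time series, it helps to fix the windowing, the chronological split, and a naive baseline to beat. The sketch below does this on a synthetic monthly series that stands in for the passenger counts; the helper names, the 12-month window, and the 70/15/15 split are assumptions.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1D series into (X, y): `window` past values predict the next one."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

def chronological_split(X, y, train_frac=0.7, val_frac=0.15):
    """Split without shuffling so validation and test always lie in the future."""
    n = len(X)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return (
        (X[:train_end], y[:train_end]),
        (X[train_end:val_end], y[train_end:val_end]),
        (X[val_end:], y[val_end:]),
    )

# Hypothetical monthly series with a trend and yearly seasonality
# (replace with the real passenger column).
months = np.arange(144)
series = 100 + 2.0 * months + 20 * np.sin(2 * np.pi * months / 12)

X, y = make_windows(series, window=12)
train, val, test = chronological_split(X, y)

# Naive baseline: predict that next month equals the last observed month.
X_test, y_test = test
naive_pred = X_test[:, -1]
naive_mae = float(np.mean(np.abs(y_test - naive_pred)))
```

Any MLP trained on the train split should be compared against naive_mae on the same test rows. A random shuffle before splitting would leak future values into training and make the errors look better than they are.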
Day 7: Development Workflow: Git, GitHub & Remote Servers
From Local Code to Collaborative & Remote Execution
Learning Objectives
Master Git version control for tracking changes.
Understand collaborative development on GitHub using forks and pull requests.
Apply version control to ML project workflows with proper structure.
Connect securely to remote servers using SSH and key-based authentication.
Monitor and manage server resources like GPUs and disk space.
Run persistent, long-running experiments using tmux.
Daily Exercises
Exercise 7A: Git Fundamentals (60 minutes)
Initialize repository and make commits
Practice branching and merging strategies
Handle merge conflicts effectively
Use git log, diff, and status commands
Create and apply patches
Exercise 7B: GitHub Collaboration (75 minutes)
Fork repository and create pull requests
Practice code review process
Use GitHub Issues for project management
Create project documentation with README
Set up GitHub Pages for project showcase
Exercise 7C: ML Project Structure (60 minutes)
Organize ML projects with proper structure
Use .gitignore for ML artifacts
Version control datasets and models
Create reproducible environments
Document experiments and results
Exercise 7D: Remote Server Connection (45 minutes)
First Connection: Connect to a remote server using the ssh command with the username, IP address, and port provided by your instructor.
SSH Key Authentication: Secure your connection and enable passwordless login.
Generate an SSH key pair on your local machine: ssh-keygen -t rsa -b 4096
Copy your public key to the remote server using ssh-copy-id.
# Adjust the user, host, and port as needed
ssh-copy-id -p 2222 user@remote.server.com
SSH Shortcut: Create a host alias in your local ~/.ssh/config file for quick access to the server.
Exercise 7E: Server Management & Persistent Sessions (45 minutes)
Server Monitoring: Log in and practice these essential monitoring commands.
Check GPU usage: watch nvidia-smi
Check CPU and memory usage: htop
Check your disk usage: du -sh ~
Persistent Sessions with tmux: Run experiments that survive disconnections.
Start a new named tmux session: tmux new -s my_experiment
Inside the session, run a long-running command (e.g., top).
Detach from the session: Press Ctrl+b, then d.
Log out, log back in, and re-attach to your session to see it is still running: tmux attach -t my_experiment
Daily Assessment: Collaborative Remote ML Project (150 minutes)
Tasks:
Work in teams of 2-3 people.
Choose one simple dataset (tips, iris, housing) and create a shared GitHub repository.
Each member implements a different ML algorithm on a separate branch, then merges via a pull request.
Remote Execution: One team member logs into the remote lab server, clones the final repository, and runs the main training script inside a tmux session.
Documentation: Create a basic README.md file that summarizes the results and includes the command to re-attach to the tmux session, so reviewers can confirm the experiment is running persistently.
Present the final comparison of approaches and the running remote session.
Evaluation: Git usage + collaboration quality + successful remote execution + final project.
Day 8: Scientific Writing & LaTeX
Academic Communication and Documentation
📚 Learning Materials
Resource: LaTeX Tutorial on Overleaf
Guide: Lab Handbook - Technical Writing section
Examples: Sustainability Lab publications for reference
Learning Objectives
Master LaTeX for academic writing
Understand scientific paper structure
Create publication-ready documents
Daily Exercises
Exercise 8A: LaTeX Fundamentals (75 minutes)
Set up Overleaf account and create first document
Learn document structure and basic formatting
Create mathematical equations and formulas
Insert figures, tables, and references
Use bibliography management with BibTeX
Exercise 8B: Paper Structure (60 minutes)
Analyze structure of top-tier ML papers
Write effective abstracts and introductions
Create methodology sections with equations
Present experimental results clearly
Write conclusions and future work
Daily Assessment: Research Paper Draft (120 minutes)
Task: Write a 4-page research paper on your Day 6 neural network project
Required Sections:
1. Abstract (150 words)
2. Introduction with literature review
3. Methodology with mathematical formulation
4. Experimental results with tables and figures
5. Conclusion and future work
6. Properly formatted references
Evaluation: Writing clarity + technical accuracy + LaTeX formatting
Day 9: Research Methods Bootcamp
Scientific Thinking and Research Skills
Learning Objectives
Develop scientific thinking skills
Master research communication
Learn systematic problem-solving
Daily Exercises
Exercise 9A: Email Communication (45 minutes)
Based on Bootcamp Session 1
Write professional email to potential research advisor
Compose progress report email to supervisor
Request help from external researcher professionally
Practice concise and clear communication
Learn email etiquette for academia
Exercise 9B: Abstract Analysis (60 minutes)
Based on Bootcamp Session 2
Analyze 5 abstracts from top ML conferences
Identify key components of effective abstracts
Rewrite weak abstracts to improve clarity
Practice identifying scientific flaws in papers
Create abstract for your own research idea
Exercise 9C: Scientific Method Application (75 minutes)
Based on Bootcamp Session 3
Formulate testable hypotheses for ML problems
Design controlled experiments
Identify potential confounding variables
Practice systematic observation and data collection
Apply statistical hypothesis testing
Exercise 9D: Debugging Mastery (45 minutes)
Based on Bootcamp Session 4
Analyze effective StackOverflow questions
Practice systematic debugging approaches
Create minimal reproducible examples
Learn to ask precise technical questions
Develop problem isolation skills
Daily Assessment: Research Proposal (120 minutes)
Tasks:
Problem statement with motivation
Literature review (10+ papers)
Research methodology and experimental design
Timeline and resource requirements
Expected contributions and impact
Potential challenges and mitigation strategies
Presentation: 15-minute presentation + 10-minute Q&A
Evaluation: Scientific rigor + presentation quality + feasibility
Day 10: Integration & Lab Projects
Connecting Skills to Research Impact
📚 Learning Materials
Lab Website: Sustainability Lab Projects
Research Papers: Recent lab publications
Resources: Lab handbook and current project overviews
Learning Objectives
Connect ML skills to sustainability applications
Understand lab research domains
Design internship project proposal
Daily Exercises
Exercise 10A: Lab Project Deep Dive (90 minutes)
JoulesEye Analysis: Understand energy expenditure estimation using thermal imagery
SpiroMask Study: Explore respiratory monitoring through smart face masks
VayuBuddy Exploration: Analyze AI-powered air quality chatbot
Space to Policy: Investigate satellite imagery for environmental compliance
Choose one project and analyze technical approach in detail
Exercise 10B: Research Gap Identification (60 minutes)
Read 3 recent papers from chosen lab research area
Identify current limitations and challenges
Propose novel extensions using learned ML techniques
Consider practical implementation challenges
Assess potential societal impact
Final Assessment: Internship Project Proposal (180 minutes)
Phase 1: Proposal Development (120 minutes)
Create comprehensive project proposal including:
Problem Statement (500 words)
Sustainability challenge being addressed
Current approaches and limitations
Proposed ML solution overview
Technical Approach (800 words)
Detailed methodology with equations
Dataset requirements and collection plan
ML algorithms: specify traditional ML + modern tools integration
Include at least ONE of: Hugging Face, YOLO, or PyTorch LSTM
Evaluation metrics and baselines
Computational resource specifications (GPU, RAM, storage)
Implementation Plan (400 words)
6-month timeline with deliverables
Computational resource requirements
Risk assessment and mitigation strategies
Expected Impact (300 words)
Scientific contributions
Practical applications
Societal benefits
Phase 2: Final Presentation (30 minutes)
Presentation: 20 minutes + 10 minutes Q&A
Audience: Lab members, instructors, and peers
Evaluation Criteria:
Technical soundness (30%)
Innovation and creativity (25%)
Feasibility and planning (25%)
Presentation quality (20%)
Phase 3: Peer Review (30 minutes)
Review and provide feedback on 2 other proposals
Practice constructive scientific criticism
Learn from diverse approaches and ideas
Assessment Framework & Progress Tracking
Daily Assessments (70% of total score)
Knowledge Application: Coding challenges and implementations (40%)
Project Work: Daily mini-projects with real datasets (20%)
Research Skills: Writing, communication, and analysis (10%)
Final Assessment (30% of total score)
Capstone Project: Comprehensive internship proposal (20%)
Presentation: Communication and defense of ideas (10%)
Progress Tracking Methods
Daily Check-ins: 15-minute individual meetings with instructor
Code Reviews: Live coding sessions to verify understanding
Peer Presentations: Explain your solution to fellow participants
Version Control: All work tracked through Git commits with timestamps
Randomized Questions: Different datasets/parameters for each participant
Anti-Cheating Measures
Live Coding Sessions: Random code explanation during daily reviews
Unique Datasets: Each participant gets different subset/parameters
Timed Assessments: Completed under supervision
Oral Defense: Must explain methodology and code decisions
Progressive Complexity: Later exercises build on earlier work
Pair Programming: Rotate partners to verify individual skills
Grading Scale
Outstanding (95-100%): Exceptional preparation, ready for independent research
Excellent (90-94%): Strong foundation, ready for advanced projects
Good (80-89%): Solid preparation, ready for guided research
Satisfactory (70-79%): Adequate foundation, needs continued mentorship
Needs Improvement (<70%): Additional preparation required
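The weights and grade bands above combine into a final result as in the following sketch. It assumes each component is scored on a 0-100 scale and that a band applies from its lower bound upward (so 94.9 falls under Excellent); the dictionary keys and function names are made up for the example.

```python
# Weights in percent, taken from the assessment framework above.
WEIGHTS = {
    "knowledge_application": 40,
    "project_work": 20,
    "research_skills": 10,
    "capstone_project": 20,
    "presentation": 10,
}

# Lower bound of each band, checked from highest to lowest.
GRADE_BANDS = [
    (95, "Outstanding"),
    (90, "Excellent"),
    (80, "Good"),
    (70, "Satisfactory"),
]

def total_score(component_scores):
    """Weighted total on a 0-100 scale; each component score is also 0-100."""
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS) / 100

def grade_band(score):
    """Map a total score to the label of the band it falls in."""
    for lower_bound, label in GRADE_BANDS:
        if score >= lower_bound:
            return label
    return "Needs Improvement"

# Hypothetical participant.
example = {
    "knowledge_application": 90,
    "project_work": 80,
    "research_skills": 70,
    "capstone_project": 85,
    "presentation": 95,
}
example_total = total_score(example)
example_grade = grade_band(example_total)
```

For this hypothetical participant the total is (40*90 + 20*80 + 10*70 + 20*85 + 10*95) / 100 = 85.5, which falls in the Good band.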
Technical Requirements
Software Stack
Python 3.8+ with Anaconda distribution
Core Libraries: NumPy, Pandas, Matplotlib, Seaborn
Traditional ML: Scikit-learn (StandardScaler, MinMaxScaler, RandomForestClassifier)
Deep Learning: PyTorch (NN, CNN, LSTM)
Computer Vision: YOLO (ultralytics package)
NLP/LLMs: Hugging Face Transformers (zero-shot, few-shot, fine-tuning)
Data Annotation: Label Studio (minimal usage)
Development: Jupyter Lab, VS Code, Git
Documentation: LaTeX (Overleaf), Markdown
Collaboration: Git, GitHub, Slack
Hardware Access
Personal Setup: Laptop with minimum 8GB RAM, 256GB storage
Modern ML Requirements:
YOLO/Computer Vision: GPU with at least 8GB VRAM (e.g., RTX 3080 class), 16GB RAM
Hugging Face LLMs: 16GB RAM minimum, 32GB recommended
PyTorch Training: CUDA-capable GPU, 50GB storage for models
Lab Resources: Access to computational servers (Ramanujan, Bhaskar, Sustain)
Ramanujan: 4x A100 GPUs (80GB each) - ideal for large model training
Bhaskar: 2x RTX A5000 (24GB each) - perfect for YOLO + Hugging Face
Accounts: GitHub, Overleaf, Google Colab, Hugging Face Hub
Extended Resources
Primary Textbooks
Python Data Science Handbook - Jake VanderPlas
An Introduction to Statistical Learning - James, Witten, Hastie, Tibshirani
Pattern Recognition and Machine Learning - Christopher Bishop
Online Courses (Optional)
NPTEL Machine Learning - Balaraman Ravindran
Andrew Ng’s ML Course - Coursera
Fast.ai Practical Deep Learning
Lab-Specific Resources
Sustainability Lab Publications: Latest research papers
Computational Infrastructure: Server access and usage guidelines
Lab Culture Guide: Expectations and best practices
Success Outcomes
Technical Competencies
Participants will be able to:
✅ Implement complete ML pipelines from data to deployment
✅ Apply appropriate algorithms for sustainability problems
✅ Conduct rigorous experimental evaluation
✅ Communicate findings through papers and presentations
✅ Collaborate effectively using modern development tools
Research Readiness
Independent Work: Execute research projects with minimal supervision
Critical Thinking: Evaluate and improve existing approaches
Innovation: Propose novel solutions to sustainability challenges
Communication: Present work at conferences and in publications
Lab Integration
Culture Fit: Understand lab values and working style
Technical Skills: Ready to contribute to ongoing projects
Collaboration: Effective team member and mentor to new students
Research Impact: Ability to produce high-quality, publishable research
Key Resources
Infrastructure
Lab Servers: GPU access (A100, RTX A5000) for computational work
Google Colab: Cloud computing for exercises