Links - Final Year Project

Timeline/Tasks

22 Jul 25 - Offline Meeting

  1. Research RL agent based approach
  2. Do literature survey and find what to optimize

20 Aug 25 - Online Meeting

  1. Chemotherapy dosing - scheduling - papers to be given by Ma'am
  2. Need to prepare an objective PPT on the methodology we will follow

17 Oct 25 - Offline Meeting

  1. Prepare PPT on entire project idea starting from ground up to final idea

28 Oct 25 - Offline Meeting

  1. Green light with initial PPT
  2. Proceed with rough literature survey with around 6 papers

19 Nov 25 - Offline Reporting to HOD

5th Dec 2025 - Online Meeting

  1. introduction - overview, current work, then propose next work - 1.5 pages
  2. literature survey - write about specific papers and their work (include references) - 6-7 paragraphs, 2.5-3 pages
  3. proposed work - limitations from literature, proposal, technologies and their descriptions, up to the current planning stage, proposed model - flow diagram, steps, make it look good - 3-4 pages
  4. future work - conclusion
  5. references - 10-12 papers
  6. in total - 20 pages, each section 3-4 pages
  7. presentation - 19/20 Dec - 15-16 pages
  8. Difference between research gaps and future work - what to include and what not???

14th Dec 2025 - Online Meeting

  1. Instructions on how to fix synopsis report + make PPT for final day
  2. General
    1. use AI to get gist and then paraphrase
    2. ETA of full project: January-February
    3. no italics, no bolding
    4. 11pts
    5. Times New Roman
    6. college visit either on the 16th or 18th, first half of the day
    7. only one member can visit to sign
    8. questions will be based on how extensively the study has been done, not on how much work has been done
    9. 4 panel members
    10. TOC shouldn’t include Abstract, Acknowledgement, Approval
    11. 10-15 citations - in proper order, should start with 1 2 3 4 …
  3. Questions in panel
    1. How is RL used? - proposed algorithms
    2. Why is RL used?
    3. How does RL work?
    4. Explanation of MDP - action, state, …
    5. relation to traffic control systems
    6. detail of algorithms in proposed work - how and why
  4. Acknowledgement
    1. fix grammatical mistakes there
  5. Abstract
    1. 150-200 words
    2. intent of work, challenges
    3. don’t include limitations
    4. in 1-2 lines
      1. Traffic Management System - why RL, why MARL
    5. focus should be on traffic signal - to fix problems with traffic congestion
  6. Introduction
    1. no proposed methodology, overview, background section
    2. in normal paras
    3. no bullet points
    4. overall work description
      1. Traffic, RL, MARL implementation - what gaps and proposals
    5. What, Why, How?
    6. 2 pages
  7. Literature survey
    1. info obtained from other people’s work
    2. MDP in detail not needed - maybe just “To apply RL … MDP needed”
    3. don't include self-proposed figures (include them in the proposed methodology section)
    4. don’t include formulas
    5. elaborate descriptions of RL, MDP, MARL, traffic signal/control systems, and the relation of RL to traffic signals - FIX!
    6. What gaps are there in traffic signal optimization by virtue of which we are applying MARL?
    7. limitations in traffic control systems
    8. algorithms used in different papers by other people
    9. don’t make headings - in paras
  8. Proposed work
    1. include proposed figures here (of solutions, methods)
    2. don’t include monetary facts
    3. discuss drawbacks easing into MARL solutions
    4. why RL?, MDP, MARL - extensive description
    5. can use bold for MDP sub fields
    6. road map
    7. must cover extensively
    8. mention all of the research gaps - including future work
      1. mention specifically which gap we are intending to work on
      2. exact algorithm and road map of gaps
    9. proposed algorithms
      1. like CLRS algorithms - 12-14 lines
      2. flowcharts
      3. how exactly to renovate the present conditions
      4. just proposed works - step-wise, not pseudocode
      5. include after implementation roadmap
  9. Conclusion
    1. don’t give separate heading for upcoming research phase
  10. PPT structure
    1. 10-15 pages (max 15)
    2. in bullet points
      1. which problem was highlighted
      2. what is MDP, RL
      3. contribution of RL in the problem
      4. what is MARL
      5. limitations of MARL in traffic congestion system
    3. add pictures
    4. proposed work
      1. planning/road map
      2. which gaps were found out
      3. explain using images
      4. proposed algorithm
      5. proposed methodology
    5. conclusion - MARL

Project/Thesis Topic Suggestion

  1. https://thesis.cs.ut.ee/
  2. https://topics.cs.ut.ee/
  3. https://www.reddit.com/r/cscareerquestionsEU/comments/jstxgd/comment/gc1jp3g/
  4. https://docs.google.com/document/d/11WOBZKXOIwo0JbQCQaqzZCFBB8ho2Hn-BhA2wwnh1N4/edit?tab=t.0
  5. https://cs230.stanford.edu/past-projects/
  6. https://web.stanford.edu/class/cs224n/project.html
  7. https://www.cs.ox.ac.uk/teaching/courses/projects/
  8. https://www.cst.cam.ac.uk/teaching/part-ii/projects/project-suggestions
  9. https://scholarworks.lib.csusb.edu/computersci-engineering-etd/

Miscellaneous

  1. https://github.com/papers-we-love/papers-we-love
  2. https://lmcinnes.github.io/datamapplot_examples/ArXiv_data_map_example.html
  3. https://cs229.stanford.edu/suggestions.html - Dead link

Potential Topics

  1. AI based caching algorithm
  2. continuous intruder detection by learning usage patterns
  3. deep learning for physics discovery
  4. stroke detection
  5. CA AI (financial; investments)
  6. https://rentry.org/finalYearML

Literature Study

Paper: Advances in reinforcement learning for traffic signal control: a review of recent progress
Analysis: Good introductory analysis + review of recent times
Rating: 4

Final 7th Semester PPT Presentation Study Guide

Speech

Quick Reference Summary (Cheat Sheet)

Core Project Idea: Traditional traffic light systems are failing because they can’t adapt to real-time, chaotic traffic. We are using Multi-Agent Reinforcement Learning (MARL) to create “smart” intersections that learn from traffic flow, cooperate with each other, and make better decisions, even with real-world problems like sensor failures and unexpected events.

Slides 1-4 - Intro & The Problem
  • Key Talking Point: Traffic congestion is a massive economic drain. Old systems (Fixed-Time, Actuated) are rigid and fail because traffic is dynamic and non-stationary.
  • Key Terms: Non-Stationary: The rules of the traffic environment are always changing.

Slides 5-7 - The New Paradigm: RL & MDP
  • Key Talking Point: Reinforcement Learning is the solution. An "agent" (the traffic light) learns the best strategy by trial-and-error. We model this using a Markov Decision Process (MDP).
  • Key Terms: Agent: The controller. State: What it sees (queue length). Action: What it does (change phase). Reward: Feedback (-delay).

Slides 8-9 - (Your Section) From Single to Multi-Agent
  • Key Talking Point: A single smart light isn't enough; it creates bottlenecks. We need a team of agents (MARL) that cooperate. State-of-the-art examples are PressLight (uses traffic theory) and CoLight (uses Graph Attention Networks to learn who to listen to).
  • Key Terms: MARL: Multiple agents learning together. PressLight: Reward based on "Max-Pressure" theory. CoLight/GATs: Learns the importance of neighbors' information dynamically.

Slides 10-11 - (Your Section) The Reality Gap & Research Gaps
  • Key Talking Point: Lab success doesn't translate to the real world due to Partial Observability (faulty sensors) and Non-Stationarity (sudden events). Our project addresses five key research gaps no one has fully solved.
  • Key Terms: Partial Observability: Agents are partially blind. Catastrophic Forgetting: Forgetting old knowledge when a new event happens. Our Gaps: Temporal Misalignment, Causal Feedback, Observation Robustness, Dynamic Topology, Memory Management.

Slides 12-14 - Our Proposal & Algorithms
  • Key Talking Point: We propose a resilient learning framework with two core solutions: Uncertainty-Aware Control to handle sensor failures (partial observability) and Regime-Aware Memory to handle sudden traffic changes (non-stationarity).
  • Key Terms: Algorithm 1: If sensor data is unreliable (low confidence), use a safe, default action. Algorithm 2: Detect when traffic patterns change ("regime shift") and update the memory buffer to prevent learning from outdated data.

Slides 15-17 - Roadmap & Future Work
  • Key Talking Point: Our project is in four phases. We are currently "In Progress" on setting up the baseline (Phase 1). Our future work will implement our novel algorithms and push the frontier with concepts like entropy-driven scheduling and causal world models.
  • Key Terms: SUMO: Our traffic simulator. Baseline: A standard model (like CoLight) that we must first implement and prove our new ideas against.

Part 1: The Problem (Presented by Member 1)

Slide 1: Title Slide

  • What to Say: “Good morning, professors. We are Group 31, and today we will be presenting our project: ‘Robust Multi-Agent Reinforcement Learning for Traffic Signal Optimization Under Partial Observability and Non-Stationarity.’ Our work aims to address the critical gap between theoretical traffic control algorithms and their practical, real-world deployment.”

Slide 2: Group Members

  • What to Say: “I’d like to quickly introduce our team members: [Read names]. My colleagues and I will walk you through the problem, our proposed solution, and our plan for implementation.”

Slide 3: Urban Traffic Congestion

  • Opening: “So, let’s start with the core problem. Urban traffic congestion is far more than just an inconvenience; it’s a critical economic drain.”
  • Main Points:
    • “In major Indian cities, the annual cost of congestion is estimated to be around 1.5 lakh crore rupees, or approximately 20 billion US dollars. This inefficiency leads directly to wasted fuel, increased pollution, and significant economic losses for everyone.”
    • “The root cause is that traditional traffic control systems—from simple fixed-time schedules to actuated signals—are fundamentally ill-equipped for the dynamic and non-stationary nature of modern urban traffic.”
    • “This highlights the urgent need for intelligent, adaptive systems that can learn optimal control policies on their own, without relying on complex and often inaccurate analytical models.”
  • Transition: “To understand why these traditional systems are failing, let’s look at the gridlock problem in more detail.”

Slide 4: The Gridlock Problem

  • Opening: “The failure of traditional traffic control can be broken down into three main categories.”
  • Main Points:
    • “First, we have Fixed-Time Control. This is like a clock that runs the same schedule all day, every day. It completely fails to adapt to real-time demand, causing unnecessary delays during off-peak hours and severe congestion during rush hour.”
    • “Next is Adaptive or Actuated Control. While it uses sensors to detect cars, it’s constrained by simplistic, pre-programmed rules. It only sees its own intersection and lacks a network-wide view, which means it can solve a local problem while accidentally creating a bigger one down the road.”
    • “Finally, both of these systems often require Manual Tuning. This is a laborious process where traffic engineers rely on oversimplified assumptions to calibrate the signals. It’s expensive, time-consuming, and simply cannot keep up with the non-linear complexity of modern urban traffic.”
  • Transition: “Because these reactive systems have failed, we need to shift to a new paradigm—one that learns directly from the traffic itself.”

Part 2: The Proposed Paradigm (Presented by Member 2)

Slide 5: A New Paradigm: Learning Directly From Traffic Flow

  • Opening: “This new paradigm is Reinforcement Learning. Instead of following rigid rules, we propose an AI ‘agent’ that learns the optimal strategy through trial-and-error.”
  • Main Points:
    • “The agent, which is the traffic signal controller, directly interacts with the traffic environment. Its goal is to maximize a specific objective, such as minimizing delay.”
    • “Here’s how it works: The agent observes the current State (S) of the intersection—things like queue length, vehicle speed, and wait times.”
    • “Based on this state, it takes an Action (A), like changing the signal phase.”
    • “The environment then gives the agent a Reward (R). If delay is reduced, it gets a positive reward. If congestion worsens, it gets a negative reward. A key metric we use here is minimizing ‘Max-Pressure,’ a concept from transportation theory that helps stabilize the entire network.”
    • “Through thousands of these interactions, the agent learns which actions lead to the best rewards in any given state.”
  • Transition: “This entire learning process is formally structured using a mathematical framework known as the Markov Decision Process.”

Slide 6: A Data-Driven Approach: Teaching Intersections to Learn

  • Opening: “So, how do we teach an intersection to learn? The foundation of our approach is the Markov Decision Process, or MDP.”
  • Main Points:
    • “The core concept of RL is that it provides a model-free framework. This is transformative for Traffic Signal Control (TSC) because we don’t need to create a perfect mathematical model of traffic, which is practically impossible. The agent learns directly from data.”
    • “This allows it to discover complex strategies that a human engineer might never design.”
    • “The MDP provides the mathematical structure for this sequential decision-making. As shown in the diagram, an intersection has multiple phases, or signal plans, it can choose from. The MDP helps the agent learn the best sequence of these phases.”
  • Transition: “Let’s break down the components of this MDP to see how it provides the blueprint for perfect flow.”

Slide 7: The Blueprint for Perfect Flow: The Markov Decision Process

  • Opening: “The MDP gives our intelligent agent the mathematical scaffolding it needs to learn optimal control.”
  • Main Points (Walk through each component):
    • “The State (S) is the agent’s perception of the world. It’s high-dimensional sensor data capturing vehicle counts, speed, and lane occupancy.”
    • “The Action (A) is the agent’s control. In our case, it’s the flexible choice of the next signal phase, known as ‘Phase Selection’.”
    • “The Reward (R) is the optimization goal. We aim to maximize network throughput using metrics like Max Pressure, which is theoretically proven to stabilize traffic flow.”
    • “And finally, the Transition (P) represents the environment’s dynamics. Since our approach is model-free, this is learned implicitly through interaction, not explicitly programmed.”
  • If asked, “How does this relate to traffic control?” say: “This framework is a perfect fit for traffic control. Each intersection is an agent. The ‘state’ is the real-time traffic data from its sensors. The ‘action’ is its decision to switch the light. And the ‘reward’ is the immediate impact on traffic flow, like reduced waiting time. By trying to maximize its cumulative reward, the agent automatically learns to become an efficient traffic manager.”
  • Transition: “However, applying this process to a single intersection is not enough. To manage a city, we need to move from a single agent to a coordinated network.”
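
To make the MDP components above concrete, here is a minimal, self-contained Python sketch of the agent-environment loop for a single toy intersection. The TrafficEnv class, the queue dynamics, and the negative-total-queue reward are illustrative placeholders (not our actual SUMO/CoLight setup); a learned policy would replace the random action choice.

```python
import random

# Toy MDP loop for one intersection. TrafficEnv, the queue dynamics, and the
# reward are illustrative placeholders, not our actual SUMO/CoLight setup.

PHASES = ["NS_GREEN", "EW_GREEN"]  # the action space A


class TrafficEnv:
    """Minimal environment whose state is the queue length on each approach."""

    def __init__(self):
        self.queues = {"N": 0, "S": 0, "E": 0, "W": 0}

    def state(self):
        return tuple(self.queues.values())  # S: what the agent observes

    def step(self, action):
        # Random arrivals on every approach, then departures where the light is green.
        for k in self.queues:
            self.queues[k] += random.randint(0, 2)
        served = ("N", "S") if action == "NS_GREEN" else ("E", "W")
        for k in served:
            self.queues[k] = max(0, self.queues[k] - 3)
        reward = -sum(self.queues.values())  # R: negative total queue as a delay proxy
        return self.state(), reward          # transition P is implicit in the simulator


env = TrafficEnv()
for t in range(5):
    action = random.choice(PHASES)           # a learned policy would replace this
    state, reward = env.step(action)
    print(t, action, state, reward)
```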

Part 3: The Core Challenge & Our Focus (Your Section)

Slide 8: From a Single Intersection to a Coordinated Network

  • Opening: “Thank you, [Previous Speaker’s Name]. So far, we’ve discussed how a single intersection can learn. But as this slide illustrates, local optimization can lead to global gridlock.”
  • Main Points:
    • “If one intersection optimizes its own flow in isolation, it often creates cascading bottlenecks for its neighbors. You can see this on the left—the red cars are piling up because one intersection is sending them traffic without any coordination.”
    • “The solution to this is Multi-Agent Reinforcement Learning, or MARL. Here, each intersection acts as a cooperative, decentralized agent. They work as a team.”
    • “This approach overcomes the ‘curse of dimensionality’ of trying to control every single light from one central brain, which is computationally impossible. Instead, we get scalable, network-wide synchronization.”
    • “The image on the right shows the goal: coordinated action leading to network efficiency, where agents communicate and make decisions that benefit the entire system.”
  • Transition: “Pioneering work in MARL has given us powerful models that we build upon. Let’s look at two state-of-the-art examples.”

Slide 9: From Isolation to Cooperation: The Rise of Multi-Agent RL

  • Opening: “The evolution to MARL has produced two very influential architectures that demonstrate how to achieve this cooperation: PressLight and CoLight.”
  • Main Points:
    • “On the left, we have PressLight. Its key innovation was embedding transportation theory directly into the reward function. It uses the concept of Max Pressure—which is the difference between incoming and outgoing traffic flow. By rewarding agents for minimizing this pressure, it provides a theoretical guarantee for network stability and maximizing throughput.”
    • “On the right is CoLight. CoLight addresses a different problem: who should an agent listen to? It uses Graph Attention Networks (GATs) to allow agents to dynamically weigh the importance of information from their neighbors. For example, during rush hour, information from an upstream highway exit is far more important than a quiet side street. CoLight learns these cooperation patterns directly from the data.”
  • If asked, “What’s the difference between CoLight and PressLight?” say: “They solve cooperation in two different, complementary ways. PressLight focuses on what to optimize by using a theoretically sound reward signal—Max Pressure. CoLight focuses on how to coordinate by learning which neighboring agents have the most relevant information at any given moment. Our work draws inspiration from both.”
  • Transition: “However, even these advanced models, which perform exceptionally well in simulations, face a major challenge. This is what we call the ‘Reality Gap’.”
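
As a concrete illustration of the Max-Pressure idea behind PressLight's reward, here is a small Python sketch that computes a simplified intersection pressure from hand-picked queue counts. The lane names, the numbers, and the exact reward form are assumptions for illustration only; PressLight's published formulation is defined per traffic movement and is more detailed.

```python
# Simplified "pressure" computation in the spirit of Max-Pressure / PressLight.
# Pressure of a movement = queue on the incoming lane minus queue on the lane
# it feeds into; the intersection's pressure is the sum over movements.
# Lane names and counts are invented for illustration.

incoming_queue = {"north_in": 12, "south_in": 8, "east_in": 3, "west_in": 5}
downstream_queue = {"north_in": 4, "south_in": 2, "east_in": 6, "west_in": 1}

movement_pressure = {
    lane: incoming_queue[lane] - downstream_queue[lane] for lane in incoming_queue
}
intersection_pressure = sum(movement_pressure.values())

# A PressLight-style reward penalizes high pressure (imbalance):
reward = -abs(intersection_pressure)
print(movement_pressure, intersection_pressure, reward)
```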

Slide 10: The Reality Gap is Defined by Two Core Failures

  • Opening: “The ‘Reality Gap’ is the reason why algorithms that succeed in the lab often fail on real streets. It’s defined by two core failures of the real world: Non-Stationarity and Partial Observability.”
  • Main Points:
    • “First, Non-Stationarity. This means the rules of the environment are constantly changing. Data from ‘normal’ traffic becomes obsolete during a sudden accident, a special event, or even a weather shift. Standard RL agents suffer from what’s called ‘catastrophic forgetting’—they overwrite previously learned knowledge, and their performance collapses.”
    • “Second, Partial Observability. This means our agents are partially blind. In the real world, sensors fail, they are noisy, and they have a limited range. An agent that acts on deterministic assumptions with faulty data can make dangerously incorrect decisions. For example, if a sensor fails and reports zero cars, a standard agent might turn the light red, causing a massive traffic jam.”
  • If asked, “Why are these two problems linked?” say: “That’s an excellent question. They create a vicious cycle. Partial observability (like a broken sensor) makes the environment seem even more non-stationary and unpredictable to an agent. And a non-stationary event (like an accident) can cause traffic patterns that sensors weren’t designed to capture, leading to worse partial observability. A truly robust system must solve both problems at the same time.”
  • Transition: “These two core failures lead to a series of specific, unsolved research gaps that our project is designed to address.”

Slide 11: Research Gaps: Why Lab Successes Fail on Real Streets

  • Opening: “Our literature review identified five critical research gaps that prevent MARL from being deployed reliably in the real world.”
  • Main Points (Briefly introduce each gap, relating it back to the core failures):
    • “Observation Robustness: Current models assume 100% sensor accuracy, making them brittle. We need policies that can handle real-world sensor ‘blackouts’.” (This addresses Partial Observability)
    • “Memory Management: Standard replay buffers suffer from ‘catastrophic forgetting’ during sudden events like accidents.” (This addresses Non-Stationarity)
    • “Temporal Misalignment: Agents update at fixed intervals, like every 10 seconds. But traffic can be stable one minute and chaotic the next. This static timing leads to lagging in chaotic traffic and overfitting in stable traffic.”
    • “Causal Feedback: Algorithms treat traffic as a ‘black box.’ They don’t understand that their own action—like extending a green light—causes the congestion their neighbor sees two minutes later. They react to problems instead of proactively preventing them.”
    • “Dynamic Topology: Models like CoLight assume neighbors are fixed. But during rush hour, an intersection’s most important ‘neighbor’ might be a highway exit three kilometers away. The communication graph should be dynamic.”
  • Transition to Member 4: “Our project proposes a novel, resilient learning framework that directly targets these gaps. I’ll now hand it over to [Next Speaker’s Name] to explain our proposed solutions.”

Part 4: The Proposed Solution & Implementation (Presented by Member 4)

Slide 12: Our Proposal: A Resilient Learning Framework

  • Opening: “Thank you, [Your Name]. To tackle the challenges just described, we are proposing a resilient learning framework for real-world deployment. Our solution is built on two key innovations.”
  • Main Points:
    • “First, Uncertainty-Aware Control. This directly addresses the problem of partial observability. It enables safe, reliable action even during sensor failure or data noise. This solves the ‘Observation Robustness’ gap.”
    • “Second, Regime-Aware Memory. This directly addresses non-stationarity. It prevents catastrophic forgetting by detecting and adapting to sudden traffic shifts. This solves the ‘Memory Management’ gap.”
  • Transition: “Let’s look at the first algorithm, which makes our control system uncertainty-aware.”

Slide 13: Algorithm 1 - Uncertainty-Aware Control for Sensor Failures

  • Opening: “Algorithm 1 provides a mechanism for robust control when sensor data is unreliable.”
  • Main Points:
    • “The core idea is to quantify the confidence we have in our sensor observations. The agent’s encoder doesn’t just process the raw data; it also outputs a confidence score.”
    • “As the pseudocode shows, if the confidence score is above a certain threshold (if ct > τ), it means the observation is reliable. In this case, the agent uses its learned policy to take the optimal action.”
    • “However, if the confidence score is below the threshold, it signals a potential sensor failure or noise. Instead of acting on bad data, the agent switches to a ‘safe fallback’ controller. This could be a simple, proven method like Max-Pressure or just extending the current phase to avoid risky decisions.”
    • “Crucially, we also use this confidence score to ‘down-weight’ low-confidence experiences in the replay memory, so the agent doesn’t learn from bad data.”
  • Transition: “While this handles observation uncertainty, we also need a way to manage environmental uncertainty. This brings us to Algorithm 2.”
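
A minimal sketch of the confidence-gated control step described on this slide is given below. The encoder, policy, and fallback functions (encode_observation, learned_policy, max_pressure_fallback) and the threshold value are hypothetical stand-ins for our eventual design, not a finished implementation.

```python
# Sketch of uncertainty-aware control: trust the learned policy only when the
# observation confidence c_t exceeds the threshold τ, otherwise fall back to a
# safe rule-based controller, and down-weight low-confidence experiences.

TAU = 0.7  # confidence threshold τ (illustrative value)


def encode_observation(raw_obs):
    """Pretend encoder: returns (features, confidence in [0, 1]) based on missing readings."""
    missing = sum(1 for v in raw_obs if v is None)
    confidence = 1.0 - missing / len(raw_obs)
    features = [0 if v is None else v for v in raw_obs]
    return features, confidence


def learned_policy(features):
    return "BEST_LEARNED_PHASE"          # stand-in for the RL policy


def max_pressure_fallback(features):
    return "HIGHEST_PRESSURE_PHASE"      # stand-in for a safe rule-based controller


replay_buffer = []


def control_step(raw_obs):
    features, c_t = encode_observation(raw_obs)
    if c_t > TAU:
        action = learned_policy(features)          # sensors trusted: act optimally
    else:
        action = max_pressure_fallback(features)   # sensors suspect: act safely
    # Low-confidence experiences get a smaller weight so the agent learns less from bad data.
    replay_buffer.append({"obs": features, "action": action, "weight": c_t})
    return action


print(control_step([4, 7, 2, 5]))        # healthy sensors -> learned policy
print(control_step([4, None, None, 5]))  # two failed detectors -> safe fallback
```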

Slide 14: Algorithm 2 - Regime-Aware Memory for Sudden Traffic Shifts

  • Opening: “Algorithm 2 is designed to prevent ‘catastrophic forgetting’ by making the agent’s memory aware of traffic regimes.”
  • Main Points:
    • “The algorithm constantly monitors a key traffic statistic, like pressure variance, using a sliding window.”
    • “It performs a divergence test to check if the current traffic pattern is statistically different from the long-term baseline. If a significant change is detected—for example, a sudden spike in pressure variance due to an accident—it declares a ‘new regime’.”
    • “When a new regime is detected, two things happen. First, it triggers active forgetting. The algorithm starts evicting the oldest, most outdated data from its memory buffers. Second, it prioritizes learning from the most recent data to enable fast adaptation to the new situation.”
    • “This process allows the agent to quickly adapt to sudden shifts without completely forgetting what it learned about previous, ‘normal’ traffic patterns.”
  • Transition: “Now that we’ve outlined our core algorithms, I’ll pass it to [Next Speaker’s Name] to discuss our project roadmap and how we plan to validate these ideas.”
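
The sketch below shows one simple way the regime check described on this slide could be wired together: a sliding window of pressure readings, a variance-ratio divergence test against a long-term baseline, and eviction of the oldest experiences when a shift is flagged. The window size, the divergence ratio, and the "drop the oldest half" rule are illustrative assumptions, not the project's final algorithm.

```python
from collections import deque
from statistics import pvariance

# Regime-aware memory sketch: detect a shift in pressure variance and actively
# forget the oldest experiences. All constants are illustrative guesses.

WINDOW = 20
DIVERGENCE_RATIO = 3.0   # declare a "new regime" if recent variance >> baseline variance

recent = deque(maxlen=WINDOW)   # sliding window of the monitored statistic
baseline = []                   # long-running history of the same statistic
replay_buffer = deque()         # oldest experiences sit on the left


def observe_pressure(pressure, experience):
    """Record one reading; return True if a regime shift was detected."""
    recent.append(pressure)
    baseline.append(pressure)
    replay_buffer.append(experience)
    if len(recent) < WINDOW or len(baseline) < 5 * WINDOW:
        return False  # not enough history yet for a meaningful comparison
    if pvariance(recent) > DIVERGENCE_RATIO * max(pvariance(baseline), 1e-6):
        # Regime shift: actively forget the oldest half of memory so training
        # concentrates on the new traffic pattern.
        for _ in range(len(replay_buffer) // 2):
            replay_buffer.popleft()
        return True
    return False
```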

Part 5: Project Plan & Conclusion (Presented by Member 5)

Slide 15: Project Roadmap: From Theory to Validation

  • Opening: “Thank you. Our project is structured into a four-phase roadmap to take these ideas from theory to validation.”
  • Main Points:
    • “Phase 1, which is currently In Progress, is ‘Baseline & Environment Setup’. Here, we are configuring the SUMO traffic simulator for Indian traffic conditions, including the ‘Free Left Turn’ policy. We are also implementing the CoLight benchmark, which will serve as the state-of-the-art model we need to outperform.”
    • “Phase 2 is ‘Robustness & Memory Implementation’. This is our next step, where we will implement the two algorithms we just discussed: Uncertainty-Aware Control and Regime-Aware Memory.”
    • “Phase 3 and Phase 4 are future work. In Phase 3, we will tackle the more complex research gaps like Dynamic Topology and Causal Feedback. Finally, in Phase 4, we will integrate all modules and perform full-scale validation on real-world datasets from CityFlow.”
  • Transition: “Looking even further ahead, our work opens up several exciting future research directions.”

Slide 16: Pushing the Frontier: Next Steps in Agent Intelligence

  • Opening: “Our proposed framework lays the groundwork for the next generation of agent intelligence in traffic control.”
  • Main Points:
    • “One future direction is Entropy-Driven Scheduling. Instead of fixed update intervals, we can develop agents that dynamically adjust their learning frequency based on local traffic chaos, or entropy. They would learn faster in chaotic conditions and slower in stable ones.”
    • “Another is developing Fully Autonomous Topology Learning, where agents automatically learn which other intersections to communicate with, creating a dynamic communication network.”
    • “The ultimate goal is to build Causal World Models. This would enable proactive congestion prevention by creating agents that can anticipate the downstream impacts of their decisions, moving from a reactive to a truly predictive system.”
  • Transition: “To conclude…”

Slide 17: Thank You

  • What to Say: “In summary, our project addresses the critical simulation-to-reality gap in traffic control by proposing a robust MARL framework designed for non-stationarity and partial observability. By developing solutions for observation robustness and memory management, we aim to transform reinforcement learning from a powerful simulation tool into a reliable, deployable solution for smart cities. Thank you for your time. We are now open for questions.”

Panel Questions

Category 1: Core RL/MDP Fundamentals

Q1: You Mentioned MDP. Walk Me through why the Markov Property is Essential Here. What Happens if Your Traffic State is NOT Markovian?

A: “Excellent question. The Markov property assumes that the current state contains all information necessary to make optimal decisions—the future is independent of the past given the present. In traffic, this means our state representation (queue lengths, speeds, elapsed times) should fully capture what the agent needs to know.

If the state is NOT Markovian—say we’re missing crucial information like an upstream accident that’s about to send a wave of cars—then the agent’s decisions become suboptimal. It’s making choices without the full picture. This is actually one reason why partial observability is so problematic: when sensors fail, we lose state information, and the Markov property breaks down. The agent thinks it has complete information, but it doesn’t.

That’s why sophisticated architectures like CoLight use spatial attention mechanisms—they try to recover missing spatial dependencies by learning which distant intersections affect local traffic, essentially trying to restore the Markovian property by expanding what ‘state’ means.”


Q2: Explain the Exploration-exploitation Tradeoff in Your Traffic Control Context. How Does it Manifest Differently than In, Say, a Game-playing agent?

A: “That’s a critical distinction. In traffic control, the exploration-exploitation tradeoff has real-world consequences.

Exploration means trying new signal timing strategies to discover potentially better policies. Exploitation means using the current best-known policy to minimize delays right now.

The key difference from game-playing: in traffic, you cannot pause the game or reset the environment. When AlphaGo explores, it can play millions of games. But in traffic, if an agent explores poorly—say, holding a red light too long—real cars are stuck, real fuel is wasted, and you’ve potentially caused a cascading jam.

This is why most traffic RL uses off-policy learning methods like DQN or soft actor-critic. The agent can learn from a replay buffer of past experiences without needing to constantly explore dangerous actions in real-time. It learns from historical ‘what-if’ scenarios.

Additionally, during deployment, we’d likely use a safe exploration strategy—maybe epsilon-greedy with a very low epsilon (like 0.01), or confidence-based exploration where we only explore when we’re certain it won’t cause catastrophic delays. The stakes are just fundamentally different than in simulation.”
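
For reference, the "very low epsilon" deployment idea mentioned above looks roughly like the following sketch. The Q-values and phase names are invented; the point is only that exploration becomes a rare event at deployment time.

```python
import random

# Toy epsilon-greedy selection with a deployment-grade (very small) epsilon,
# so the agent almost always exploits its best-known phase. Q-values are made up.

EPSILON = 0.01
PHASES = ["NS_GREEN", "EW_GREEN", "NS_LEFT", "EW_LEFT"]


def select_action(q_values):
    if random.random() < EPSILON:
        return random.choice(PHASES)          # rare, cautious exploration
    return max(q_values, key=q_values.get)    # usual exploitation of the best action


q = {"NS_GREEN": -12.0, "EW_GREEN": -4.5, "NS_LEFT": -20.1, "EW_LEFT": -9.8}
print(select_action(q))  # almost always "EW_GREEN" (least expected delay)
```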


Q3: You Use a Negative Reward for Delay. Why not Positive Reward for Throughput? Aren’t They equivalent?

A: “Great observation—they seem equivalent, but they’re not, and the difference matters for learning dynamics.

Negative reward for delay (like -Σ queue_length) is dense and immediate. Every timestep with cars waiting produces a negative signal. This gives the agent continuous feedback for learning.

Positive reward for throughput (like +Σ vehicles_passed) is sparse. The agent only gets a reward when cars actually exit the network. If traffic is gridlocked, throughput is zero, so reward is zero—there’s no gradient to learn from. It’s like training in silence.

But here’s the deeper insight: we actually use Max Pressure as the reward, which is conceptually different from both. Max Pressure is the difference between incoming and outgoing flow—it’s a stability metric, not just a delay metric.

Why does this matter? Minimizing delay alone can create local selfishness. An intersection might aggressively clear its queue but push a massive wave to its neighbor. Max Pressure penalizes imbalance, which naturally encourages network-wide coordination. It’s theoretically proven to maximize throughput while preventing the ‘greedy local optimization’ problem. That’s why PressLight embedded it directly into the reward function.”
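
A quick numeric illustration of the dense-versus-sparse point above, with invented queue and throughput values for a gridlocked timestep: the delay-based reward still produces a learning signal while the throughput-based reward is silent.

```python
# Dense vs. sparse reward during gridlock (numbers are invented).

queues = [15, 22, 9, 18]      # vehicles waiting per approach
vehicles_exited = 0           # nothing leaves the network during gridlock

dense_reward = -sum(queues)   # -64: a strong, informative learning signal every step
sparse_reward = vehicles_exited  # 0: no gradient to learn from

print(dense_reward, sparse_reward)
```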


Category 2: Multi-Agent Systems

Q4: You Mentioned the ‘curse of Dimensionality’ as a Reason for MARL. Quantify This. How Does the State-action Space Grow with Centralized vs. Decentralized control?

A: “Let me make this concrete. Suppose we have N intersections, and each intersection has K possible actions (signal phases).

Centralized approach: The joint action space is K^N. For just 10 intersections with 4 phases each, that's 4^10 = 1,048,576 possible joint actions at every timestep. The state space explodes even faster: the number of possible joint states is the product of each intersection's possible local states, which likewise grows exponentially in N.

Decentralized MARL approach: Each agent has only K actions and an S-dimensional local state (plus maybe some information from immediate neighbors). The learning problem is linear in N, not exponential.

So for 10 intersections:

  • Centralized: ~1 million joint actions
  • Decentralized: 10 agents × 4 actions each = 40 actions in total, with each agent only ever choosing among 4

The tradeoff is that decentralized agents must learn to coordinate without a global view, which is why communication and attention mechanisms (like in CoLight) are crucial. But the computational savings make it the only scalable approach for city-scale networks.

This is exactly why we can’t just use a ‘god-like’ centralized controller—it’s mathematically intractable.”
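
The quoted numbers can be sanity-checked in a few lines of Python; the snippet below simply recomputes the centralized joint-action count versus the decentralized total.

```python
# Centralized joint action space vs. total decentralized actions for N intersections
# with K phases each (same numbers as quoted above).

N, K = 10, 4
centralized_joint_actions = K ** N     # 1,048,576
decentralized_total_actions = N * K    # 40 (each agent handles only 4)

print(centralized_joint_actions, decentralized_total_actions)
```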


Q5: In MARL, how Do You Prevent Agents from Converging to a Suboptimal Nash Equilibrium instead of the Global optimum?

A: “This is one of the fundamental challenges in cooperative MARL. Let me explain the problem first, then how modern architectures address it.

The Problem: In game theory, a Nash equilibrium is a state where no agent can improve by changing only its own strategy. But Nash equilibria can be locally optimal but globally suboptimal. For example, two intersections might settle into a rhythm that works for them, but creates gridlock for the rest of the network.

How we address it:

  1. Shared Reward Structures: Models like PressLight use Max Pressure, which is a network-level stability metric. All agents get rewarded based on system-wide pressure reduction, not just local metrics. This aligns individual incentives with global objectives.
  2. Communication/Attention Mechanisms: CoLight’s Graph Attention Networks allow agents to incorporate neighbors’ states into their decision-making. An agent isn’t just optimizing for itself—it’s explicitly considering how its actions affect others it’s paying attention to.
  3. Centralized Training, Decentralized Execution (CTDE): During training, we can use a centralized critic that sees the global state and guides all agents toward globally optimal policies. But during deployment, agents execute independently using only local information. This is the paradigm used in many state-of-the-art MARL algorithms.
  4. Experience Sharing: Agents can share their replay buffers, so they learn from each other’s experiences. This helps prevent local convergence by exposing agents to diverse scenarios.

The theoretical guarantee comes from Max Pressure itself—it’s proven that if all agents minimize their local pressure, the network as a whole achieves maximum throughput. So the local Nash equilibrium IS the global optimum when the reward is structured correctly.”


Q6: Explain the Difference between Cooperative, Competitive, and Mixed MARL. Which One is Traffic Control, and why Does that matter?

A: “Excellent framing question. Let me define each and explain why classification matters.

Cooperative MARL: All agents share a common objective. They’re on the same team. Examples: robot swarms, traffic control.

Competitive MARL: Agents have opposing objectives. It’s zero-sum. Examples: poker, two-player games.

Mixed MARL: Some agents cooperate, some compete. Examples: autonomous vehicle negotiation, market trading.

Traffic control is fundamentally cooperative because all intersections want to maximize network throughput and minimize total delay. There’s no incentive for one intersection to ‘win’ at another’s expense—that would hurt the shared objective.

Why this classification matters:

  1. Reward Design: In cooperative settings, we can use shared or team rewards (like Max Pressure for the whole network). In competitive settings, rewards are zero-sum.
  2. Information Sharing: Cooperative agents can freely share observations and policies without strategic deception. In competitive settings, agents must hide information strategically.
  3. Convergence Guarantees: Cooperative MARL has better convergence properties. We can use techniques like centralized training because agents aren’t adversarial. In competitive settings, we need game-theoretic equilibrium concepts like Nash equilibria.
  4. Credit Assignment: In cooperative MARL, we face the credit assignment problem—if the network improves, which agent deserves credit? This is why attention mechanisms (CoLight) are so valuable—they help identify causal relationships.

If we mistakenly treated traffic as competitive (each intersection trying to maximize only its own throughput), we’d get exactly the ‘local optimization leads to global gridlock’ problem. The cooperative framing with shared rewards is what enables network-wide coordination.”


Category 3: Partial Observability & Non-Stationarity

Q7: You Keep Saying ‘partial observability.’ Formally Define This in the Context of POMDPs. How Does it Differ from just ‘noisy observations’?

A: “This is a crucial distinction. Let me define it formally, then explain why it matters.

POMDP Definition: A Partially Observable Markov Decision Process extends an MDP by adding an observation function. Instead of observing the true state s, the agent receives an observation o drawn from O(s, a), the observation distribution.

Key point: The agent never sees the true state s. It only sees o, which is an incomplete or noisy projection of s.

Partial Observability vs. Noisy Observations:

  • Noisy observations: The agent sees a corrupted version of the full state. If the true queue length is 10, it might observe 9 or 11. But it sees values for all variables—just with noise.
  • Partial observability (strict sense): The agent is fundamentally missing information. For example, a sensor failure means the agent doesn’t see anything about that lane—not a noisy estimate, but a total absence of information. Or it can’t see what’s happening 500 meters upstream beyond its sensor range.

Why this matters in traffic:

In a POMDP, optimal policies must account for uncertainty over the true state. You might need to maintain a belief state—a probability distribution over possible true states given your observations.

For example, if a sensor fails and reports zero cars:

  • Noisy observation approach: “The sensor says 0, but it’s probably around 2-3 cars.”
  • POMDP approach: “The sensor failed. I have no information. The true state could be 0 cars or 50 cars. Given my belief distribution and the risk, I should take a safe action like extending the current phase.”

This is exactly what our Uncertainty-Aware Control algorithm does—it maintains a confidence score (essentially a proxy for belief state uncertainty) and switches to safe fallback policies when uncertainty is high.

Standard RL assumes full observability (MDP). Traffic requires POMDP methods, which are computationally harder but necessary for safety and robustness.”


Q8: Catastrophic forgetting—explain the Mechanism. Why Does it Specifically Occur during Regime Shifts, and how is it Different from Normal Concept drift?

A: “Excellent question that gets at the learning dynamics. Let me break this down.

Catastrophic Forgetting Mechanism:

Neural networks learn by adjusting weights to minimize loss on current data. When the data distribution suddenly shifts, the network adjusts weights to fit the new data. But in doing so, it overwrites the weight configurations that were optimal for the old data. The old knowledge is lost—catastrophically.

Why it occurs during regime shifts:

Imagine the agent learned perfect policies for morning rush hour. Then an accident happens at 2:00 PM. The agent now sees completely different traffic patterns—massive queues where there are usually none. It starts updating its weights to minimize loss on this ‘accident data.’

Because neural networks use gradient descent, every update moves weights in the direction of the current data. If the ‘accident regime’ persists for enough updates, the weights drift far from their ‘rush hour’ configuration. When the accident clears and rush hour patterns return, the agent has forgotten how to handle them. Its performance collapses.

Catastrophic Forgetting vs. Concept Drift:

  • Concept Drift is gradual. The data distribution slowly changes over time (e.g., traffic patterns changing over months as new buildings open). The agent can slowly adapt its weights if the learning rate is tuned correctly.
  • Catastrophic Forgetting is sudden and severe. The distribution changes abruptly (accident, special event, weather). The agent doesn’t have time to gradually adapt—it’s forced to rapidly update, which obliterates old knowledge.

How Regime-Aware Memory helps:

Our algorithm detects sudden shifts using divergence tests. When a new regime is detected:

  1. It stops training on old data from the previous regime (active forgetting of outdated experiences)
  2. It prioritizes recent data for fast adaptation
  3. Crucially, it can also store regime-specific sub-policies—imagine having separate ‘accident mode’ and ‘normal mode’ networks that can be switched between

This prevents the weights from being violently pulled back and forth between incompatible regimes. Instead, the agent maintains multiple ‘memory banks’ and learns which one to use when.”


Q9: Your State Representation Includes ‘elapsed time since Last green.’ Why is that Failure-proof, and what Does it Actually Tell the agent?

A: “Great observation—this is a subtle but brilliant design choice in robust state representations.

Why it’s failure-proof:

Elapsed time depends only on the controller’s internal clock, not on any external sensors. Even if every camera, loop detector, and radar fails, the controller still knows: ‘I turned this lane green 45 seconds ago.’

This makes it completely immune to sensor failures, unlike queue length, speed, or occupancy, which all require working sensors.

What it tells the agent:

  1. Fairness metric: If one direction hasn’t had a green light in 120 seconds while another just got one 10 seconds ago, the agent knows there’s a massive imbalance. This helps prevent starvation—where one direction is completely ignored because its sensors are broken.
  2. Urgency signal: Elapsed time correlates with queue buildup. If a lane hasn’t been served in a long time, there’s likely a long queue, even if sensors can’t see it. It’s an indirect observation of traffic state.
  3. Temporal context: It helps the agent learn temporal patterns. For example, ‘if I’ve held this green for 60 seconds during rush hour, I should probably switch soon to avoid overwhelming downstream intersections.’
  4. Safe fallback logic: In our Uncertainty-Aware Control, when sensor confidence is low, the agent can use a simple rule: ‘Extend green for directions with high elapsed time since last green.’ This ensures basic fairness even when blind.

Real-world example:

Imagine a camera gets blocked by fog. The agent can no longer see queue lengths. Without elapsed time, it might freeze or make random decisions. With elapsed time, it can at least say: ‘North-South hasn’t had a green in 90 seconds, so I should give them service now, even though I can’t see how many cars are there.’

It’s essentially a degraded-mode sensor—not ideal, but far better than nothing. This is exactly the kind of robustness engineering that bridges the reality gap.”
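
A sketch of that degraded-mode rule, with invented elapsed times: when sensors cannot be trusted, serve whichever direction has waited longest since its last green.

```python
# Degraded-mode fallback: pick the phase with the largest elapsed time since its
# last green. Uses only the controller's own clock, so it works with zero sensors.
# Values and direction names are illustrative.

elapsed_since_green = {"NS": 90, "EW": 12}  # seconds


def blind_fallback_phase(elapsed):
    return max(elapsed, key=elapsed.get)    # basic fairness, no sensors required


print(blind_fallback_phase(elapsed_since_green))  # "NS"
```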


Category 1: The “Why RL?” & High-Level Justification

These questions test if you understand the philosophical and practical reasons for choosing this paradigm over others.

Question 1:

“You’ve positioned Reinforcement Learning as the solution to the failures of traditional systems. However, advanced adaptive systems like SCOOT or SCATS have existed for decades and also adjust to real-time traffic. What, fundamentally, makes RL a different paradigm and not just a more complex version of adaptive control? What can an RL agent discover that a well-tuned heuristic system cannot?”

  • How to Answer:
    • “That’s a critical distinction. Traditional adaptive systems operate on a set of pre-defined, human-engineered heuristics. For example, ‘if queue length exceeds X, extend green time by Y seconds.’ They are reactive and limited by the rules we give them. They can optimize within those rules, but they can’t discover entirely new ones.”
    • “Reinforcement Learning, on the other hand, is a true learning paradigm. It is not given explicit rules. Its only goal is to maximize a reward signal, like network throughput. Through trial-and-error, it can discover complex, non-obvious control policies that a human might never design. For example, it might learn that a short green phase now, while seemingly inefficient, can prevent the formation of a much larger congestion wave ten minutes later. It learns the long-term consequences of its actions, something a purely heuristic system struggles with.”

Question 2:

“You emphasize that RL is ‘model-free.’ Define what ‘model-free’ means in this context. Is being model-free always an advantage? Could a ‘model-based’ RL approach offer benefits that you are ignoring?”

  • How to Answer:
    • “‘Model-free’ means the agent learns a control policy directly from experience without trying to build an explicit mathematical model of the traffic environment. It doesn’t need to know the physics of car acceleration or driver behavior. It just learns that ‘in this state, this action produced a good reward.’”
    • “This is a huge advantage because creating an accurate model of a complex, stochastic system like urban traffic is practically impossible.”
    • “However, it’s not always superior. The primary drawback of model-free learning is that it can be very sample-inefficient—it requires a massive amount of trial-and-error experience. A ‘model-based’ approach, where the agent first learns a model of the world and then plans using that model, could potentially learn much faster. The trade-off is that if the learned model is inaccurate, the resulting policy will be suboptimal. Given the immense complexity and randomness of real traffic, the model-free approach is generally considered more robust for this domain.”

Category 2: The Markov Decision Process (MDP) & Technical Core

Question 3:

“The entire foundation of your work rests on the Markov Decision Process. The key assumption of an MDP is the Markov Property. First, define this property. Second, and more importantly, does a real-world, multi-intersection traffic network actually satisfy the Markov Property? If not, why is the MDP still a useful framework?”

  • How to Answer:
    • “The Markov Property states that the future is independent of the past, given the present. In RL terms, it means that all the information needed to make the optimal decision is contained within the current state (S). The agent doesn’t need to know the history of all previous states to act optimally.”
    • “To be frank, a real-world traffic network does not strictly satisfy the Markov Property. For example, the current queue length doesn’t tell you if a massive platoon of vehicles was just released from an upstream intersection and will arrive in 30 seconds. That ‘hidden’ information, which is part of the history, is critical.”
    • “However, the MDP remains an invaluable approximation. We design our state representation to capture as much relevant information as possible—for instance, by including data from neighboring intersections, as seen in models like CoLight. While not perfect, this creates a state that is ‘Markovian enough’ for the agent to learn highly effective policies. The goal is not to perfectly model reality, but to create a sufficiently rich state that enables effective decision-making.”

Question 4:

“You mention Max-Pressure as a key metric for your reward. Intuitively, why is minimizing the difference between incoming and outgoing queues a good proxy for maximizing network throughput and minimizing travel time?”

  • How to Answer:
    • “That’s a great question that gets to the core of why PressLight was so effective. The intuition is about stability and balance. ‘Pressure’ is a measure of the imbalance or ‘potential work’ at an intersection. A high pressure in one direction means there’s a large number of vehicles waiting to move into a lane with available capacity.”
    • “By always choosing the action that serves the direction with the highest pressure, the agent is constantly working to resolve the biggest imbalances in the network. This prevents queues from growing uncontrollably and spilling back to block other intersections. By keeping the network stable and preventing gridlock, you are ensuring vehicles can move through the system efficiently, which in turn maximizes throughput and, as a consequence, minimizes overall travel time.”

Category 3: Multi-Agent Systems & Coordination

These questions test your understanding of the complexities that arise when you have multiple learning agents.

Question 5:

“You correctly state that moving from a single agent to multiple agents introduces the problem of non-stationarity. From the perspective of a single agent, what does ‘non-stationarity’ mean, and why does it fundamentally break the learning process for a standard RL algorithm?”

  • How to Answer:
    • “From the perspective of a single agent, the environment is supposed to be stationary. This means that if it’s in State S and takes Action A, the probability of transitioning to State S’ and receiving Reward R should be consistent. This consistency is what allows it to learn.”
    • “In a multi-agent system, this breaks. As my neighboring agents are also learning and updating their policies, their behavior changes. So, today, my action of sending a platoon of cars to my neighbor might be fine. But tomorrow, my neighbor might have learned a new policy that causes that platoon to hit a red light, creating a backup that affects my own state.”
    • “This means the rules of the world are constantly changing from my perspective. The environment is no longer stationary because the other agents are part of my environment. A standard single-agent algorithm fails because it can’t learn a stable mapping from actions to outcomes when the outcomes themselves keep changing unpredictably.”

Question 6:

“On Slide 9, you introduce CoLight and its use of Graph Attention Networks. Explain intuitively what an ‘attention mechanism’ is doing here. How is it more sophisticated than, say, just taking a weighted average of your neighbors’ queue lengths?”

  • How to Answer:
    • “An attention mechanism allows the agent to learn which information is most important right now. A simple weighted average would be static—for example, always considering the highway 50% important and the side street 10% important.”
    • “CoLight’s attention mechanism is dynamic and context-dependent. The agent learns to calculate ‘attention weights’ based on the current traffic state. It asks, ‘Given the current situation at my intersection and the situations at all my neighbors, which neighbor’s data is most predictive of my future success?’”
    • “So, during morning rush hour, it might learn to place almost 100% of its attention on the upstream intersection that feeds it heavy traffic. But late at night, when traffic is light and random, it might learn to pay more attention to its own local sensors. It learns to dynamically focus its attention, filtering out noise and focusing on the most critical signals for coordination.”
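
To give a feel for what "dynamic attention weights" means, the toy sketch below softmaxes hand-picked relevance scores for two situations. In CoLight these scores come from learned query/key projections of neighboring traffic states; here they are simply assumed numbers.

```python
import math

# Toy attention-style weighting over neighbours (in the spirit of graph attention,
# heavily simplified). The scores are hand-picked, not learned.


def softmax(scores):
    exps = {k: math.exp(v) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: round(v / total, 3) for k, v in exps.items()}


# Rush hour: the upstream highway feed looks far more relevant.
rush_hour_scores = {"upstream_highway": 3.0, "side_street": 0.2, "self": 1.0}
# Late night: local sensors dominate.
late_night_scores = {"upstream_highway": 0.3, "side_street": 0.1, "self": 2.5}

print(softmax(rush_hour_scores))
print(softmax(late_night_scores))
```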

Category 1: Reinforcement Learning (RL) Fundamentals

Question 1: “In one sentence, what is the fundamental goal of Reinforcement Learning?”

  • Crisp Answer: The goal of RL is for an agent to learn an optimal policy—a mapping from states to actions—that maximizes the total cumulative reward it receives from an environment over time.

Question 2: “Contrast Reinforcement Learning with Supervised and Unsupervised Learning.”

  • Crisp Answer:
    • Supervised Learning learns from a labeled dataset with correct answers, like a student with a textbook.
    • Unsupervised Learning finds hidden patterns in unlabeled data, like a detective looking for clues.
    • Reinforcement Learning learns from trial-and-error interaction with an environment, using only a reward signal as guidance, like a player learning a new game.

Question 3: “Explain the ‘Exploration vs. Exploitation’ trade-off. Why is it a dilemma?”

  • Crisp Answer: It’s the fundamental dilemma of choosing between two actions:
    • Exploitation: Taking the action that is currently known to give the best reward.
    • Exploration: Trying a new or random action to discover if there is an even better strategy.
    • The Dilemma: If an agent only exploits, it might get stuck in a suboptimal strategy. If it only explores, it never uses the good strategies it has found. The key is to balance both.

Question 4: “What is the difference between a value-based and a policy-based RL method?”

  • Crisp Answer:
    • Value-Based (like DQN): The agent learns a value function (Q-value) that estimates the long-term reward for each action in a given state. The policy is implicit: always pick the action with the highest value.
    • Policy-Based (like PPO): The agent directly learns the policy itself—a function that explicitly outputs which action to take (or the probability of taking each action) in a given state.
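
A tiny contrast of the two families, with made-up numbers: a value-based agent ranks actions by learned Q-values, while a policy-based agent samples from learned action probabilities.

```python
import random

# Value-based vs. policy-based action selection (numbers are invented).

q_values = {"NS_GREEN": -6.2, "EW_GREEN": -3.1}        # value-based (DQN-style)
value_based_action = max(q_values, key=q_values.get)    # implicit policy: argmax over Q

policy_probs = {"NS_GREEN": 0.25, "EW_GREEN": 0.75}     # policy-based (PPO-style output)
policy_based_action = random.choices(
    list(policy_probs), weights=list(policy_probs.values()))[0]

print(value_based_action, policy_based_action)
```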

Category 2: The Markov Decision Process (MDP) Core

Question 5: “What are the five components of the MDP tuple, and what does each represent?”

  • Crisp Answer: The MDP is defined by a 5-part tuple (S, A, P, R, γ):
    • S (States): All possible situations the agent can be in (e.g., all possible traffic conditions).
    • A (Actions): All possible decisions the agent can make (e.g., all available signal phases).
    • P (Transition Probability): The rules of the environment; the probability of moving from state S to state S' after taking action A.
    • R (Reward Function): The feedback signal; the immediate reward the agent gets for taking action A in state S.
    • γ (Discount Factor): A value between 0 and 1 that determines the importance of future rewards versus immediate rewards.

Question 6: “Define the Markov Property. Why is this assumption so important for RL?”

  • Crisp Answer: The Markov Property states that the future is independent of the past, given the present.
    • This means the current state contains all information necessary to make an optimal decision. The agent doesn’t need to know the entire history of events.
    • This assumption is critical because it simplifies the problem immensely, allowing the agent to learn a policy based only on what it currently observes.

Question 7: “What is a ‘policy’ (π) and how does it differ from a ‘value function’ (Q or V)?”

  • Crisp Answer:
    • A Policy (π) is the agent’s strategy or “brain.” It dictates what action the agent will take in a given state.
    • A Value Function is a prediction of future reward. It tells the agent how “good” a certain state (V) or a state-action pair (Q) is in the long run. The policy uses the value function to make its decisions.

Category 3: Multi-Agent Reinforcement Learning (MARL) Essentials

Question 8: “Why can’t we just use a single, powerful RL agent to control an entire traffic network?”

  • Crisp Answer: Because of the curse of dimensionality. A single agent controlling N intersections would have to choose from an action space that grows exponentially with N. This makes it computationally intractable to find the optimal network-wide action. MARL provides a scalable, decentralized alternative.

Question 9: “What is the single biggest challenge that MARL introduces, which is not present in single-agent RL?”

  • Crisp Answer: Non-stationarity. From any single agent’s perspective, the environment is constantly changing because the other agents are simultaneously learning and changing their policies. This breaks the core “stationary environment” assumption that single-agent RL relies on.

Question 10: “Explain the concept of ‘Centralized Training with Decentralized Execution’ (CTDE).”

  • Crisp Answer: CTDE is the dominant MARL training paradigm that gets the best of both worlds:
    • Centralized Training: During the training phase (in simulation), we use a centralized system that can see everything—all agents’ states and actions. This allows for stable and efficient learning of cooperative strategies.
    • Decentralized Execution: During deployment, each agent acts on its own using only its local observations. It no longer needs the centralized trainer. This makes the system fast, scalable, and robust in the real world.
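
Structurally, CTDE can be sketched as below: a critic that only exists during training and sees the global state, and per-agent actors that act on local observations at deployment. Both functions are stubs with assumed inputs, meant only to show who sees what.

```python
# Structural sketch of Centralized Training with Decentralized Execution (CTDE).
# The networks are stubs; the point is what information each component receives.


def centralized_critic(global_state, joint_actions):
    """Training only: sees every agent's state and action and scores the joint behaviour.
    (joint_actions is unused in this stub; a real critic would condition on it.)"""
    return -sum(global_state)  # stand-in value estimate, e.g. negative total queue


def decentralized_actor(local_obs):
    """Deployment: each intersection decides using only its own observation."""
    return "EW_GREEN" if local_obs[0] < local_obs[1] else "NS_GREEN"


# Training time: the critic gets the full picture to guide all actors.
global_state = [4, 9, 2, 7]                 # concatenated local observations
joint_actions = ["NS_GREEN", "EW_GREEN"]
print(centralized_critic(global_state, joint_actions))

# Execution time: each agent acts alone on its local slice of the state.
print(decentralized_actor([4, 9]))
print(decentralized_actor([2, 7]))
```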

Question: “Why can’t you just use simpler Machine Learning models, like a supervised classifier or a regressor, to solve this problem instead of the complexities of Reinforcement Learning?”

  • Crisp Answer: Simple ML models are unsuitable because traffic signal control is a sequential decision-making problem, not a one-shot prediction task.

    1. Supervised Learning needs a “correct answer” for every situation, which doesn’t exist. There is no perfect dataset of “right” and “wrong” traffic signal actions, because the best action depends on the long-term consequences, not just the current traffic state.
    2. The model’s own actions change the environment. A supervised model predicts an outcome based on static inputs. In traffic control, the agent’s action (changing the light) directly influences the next state (the new traffic pattern). The system needs to learn from this feedback loop, which is exactly what Reinforcement Learning is designed for.