Excursions in Reinforcement Learning

This course (IFT6760C) is intended for advanced graduate students with a strong background in machine learning, mathematics, operations research or statistics. Prior exposure to the topic is expected. Please request permission if in doubt. If you are looking for an introductory-level course on reinforcement learning and dynamic programming, consider COMP-767 at McGill University or IFT6521 at UdeM. You can register for IFT6760C on Synchro if you are affiliated with UdeM, or through CREPUQ if you are from McGill or another institution in Quebec.

Due to the research-oriented nature of this class, you need to be comfortable with a teaching format involving open-ended questions and assignments. You will be required to think critically and adopt an open mindset. My teaching goal with this course is for all the participants to build their own understanding of reinforcement learning in relation to their primary research area while sharing their unique perspective and insights with the entire class. Active class participation is expected.


Origin: from the Latin verb excurrere, which means to run out. This is also the intended meaning behind the title of this course. I want us to deviate from the usual paths and explore the rich connections between reinforcement learning and other disciplines, in particular optimization, control theory and simulation. And of course, I'm also hoping that this will be a fun activity for everyone.

Time and Location

Twice a week, on Tuesday from 9:30 to 11:30 and on Friday from 13:30 to 15:40. The course will be taught at Mila.


The following evaluation structure is subject to change depending on the class size.

There is no mandatory textbook. I will however be referencing content from:


The tentative week-by-week schedule (according to the UdeM calendar) is the following:

Week Topics
January 6 First class. Review of Markov Decision Processes and examples
January 13 Criteria: finite horizon, infinite horizon, average reward
January 20 Methods: value iteration, policy iteration, LP formulation, generalized Bellman operator and matrix splitting methods
January 27 LSTD(lambda), TD(lambda), oblique perspective, variational inequality perspective, stability
February 3 Off-policy learning: importance sampling and the conditional Monte Carlo method
February 10 Fitted value methods: FQI, NFQI, DQN, proximal methods and GTD/TDC
February 17 Policy gradients: occupation measures, discounted objective, implicit differentiation and derivation in the infinite horizon case
February 24 Policy gradients: derivative estimation, likelihood ratio methods (REINFORCE), reparametrization (IPA), baselines (control variates), actor-critic systems
March 2 Spring break
March 9 Policy gradients: application for learning temporal abstractions, the option-critic architecture, hierarchical and goal-conditioned RL
March 16 Policy gradients: Linear-Quadratic Regulator, Lagrangian formulation, MPC, Monte Carlo Tree Search
March 23 Automatic differentiation as discrete-time optimal control
March 30 Formulation of inverse RL and meta-RL as bilevel optimization
April 6 Methods (contd.): KKT "trick", forward, reverse, implicit, competitive. Case studies
April 13 Challenges and opportunities
April 20 Final project presentations