This course (IFT6760C) is intended for advanced graduate students with a strong background in machine learning, mathematics, operations research or statistics. Prior exposure to the topic is expected. Please request permission if in doubt. If you are looking for an introductory-level course on reinforcement learning and dynamic programming, you can take COMP-767 at McGill University and IFT6521 at UdeM. You can register for IFT6760C on Synchro if you are affiliated with UdeM, or via the CREPUQ if you are from McGill or another institution in Quebec.
Due to the research-oriented nature of this class, you need to be comfortable with a teaching format involving open-ended questions and assignments. You will be required to think critically and adopt an open mindset. My teaching goal with this course is for all the participants to build their own understanding of reinforcement learning in relation to their primary research area while sharing their unique perspective and insights with the entire class. Active class participation is expected.
Origin: from the Latin verb excurrere, which means to run out. This is also the intended meaning behind the title of this course. I want us to deviate from the usual paths and explore the rich connections between reinforcement learning and other disciplines, in particular optimization, control theory and simulation. And of course, I'm also hoping that this will be a fun activity for everyone.
Twice a week, on Tuesdays from 9:30 to 11:30 and on Fridays from 13:30 to 15:40. The course will be taught at Mila.
The following evaluation structure is subject to change depending on the class size.
There is no mandatory textbook. I will, however, be referencing content from:
The tentative week-by-week schedule (according to the UdeM calendar) is the following:
| Week | Topics |
|---|---|
| January 6 | First class. Review of Markov Decision Processes and examples |
| January 13 | Criteria: finite horizon, infinite horizon, average reward |
| January 20 | Methods: value iteration, policy iteration, LP formulation, generalized Bellman operator and matrix splitting methods |
| January 27 | LSTD(lambda), TD(lambda), oblique perspective, variational inequality perspective, stability |
| February 3 | Off-policy learning: importance sampling and the conditional Monte Carlo method |
| February 10 | Fitted value methods: FQI, NFQI, DQN, proximal methods and GTD/TDC |
| February 17 | Policy gradients: occupation measures, discounted objective, implicit differentiation and derivation in the infinite horizon case |
| February 24 | Policy gradients: derivative estimation, likelihood ratio methods (REINFORCE), reparametrization (IPA), baselines (control variates), actor-critic systems |
| March 2 | Spring break |
| March 9 | Policy gradients: application for learning temporal abstractions, the option-critic architecture, hierarchical and goal-conditioned RL |
| March 16 | Policy gradients: Linear-Quadratic Regulator, Lagrangian formulation, MPC, Monte Carlo Tree Search |
| March 23 | Automatic differentiation as discrete-time optimal control |
| March 30 | Formulation of inverse RL and meta-RL as bilevel optimization |
| April 6 | Methods (contd.): KKT "trick", forward, reverse, implicit, competitive. Case studies |
| April 13 | Challenges and opportunities |
| April 20 | Final project presentations |