This course (IFT6760C) is intended for advanced graduate students with a good background in machine learning, mathematics, operations research, or statistics. You can register for IFT6760C on Synchro if you are affiliated with UdeM, or via CREPUQ if you are from another institution. Due to the research-oriented nature of this class, you need to be comfortable with a teaching format involving open-ended questions and assignments. You will be required to think critically and to keep an open mind. My goal with this course is for every participant to build their own understanding of reinforcement learning in relation to their primary research area while sharing their unique perspective and insights with the entire class.
Origin: from the Latin verb excurrere, meaning "to run out." This is also the intended meaning behind the title of this course: I want us to deviate from the usual paths and explore the rich connections between reinforcement learning and other disciplines, in particular optimization, control theory, and simulation. And of course, I'm also hoping that this will be a fun activity for everyone.
Twice a week: Tuesdays from 9:30 to 11:30 AM and Fridays from 1:30 to 3:40 PM. The course is taught at Mila, in the Agora at 6650 Saint-Urbain. You don't need badge access to enter the classroom. Here's a video showing how to access the classroom from Saint-Zotique Ouest.
The following evaluation structure is subject to change depending on the class size.
There is no mandatory textbook. I will, however, be referencing content from:
First class. Markov Decision Processes, the induced process.
Examples of MDPs; constrained MLE as sequential allocation; optimality criteria: finite horizon, infinite horizon, average reward, and the random-horizon interpretation of the infinite-horizon discounted setting. Bellman optimality.
@article{Shwartz2001,
doi = {10.1109/9.917668},
url = {https://doi.org/10.1109/9.917668},
year = {2001},
month = apr,
publisher = {Institute of Electrical and Electronics Engineers ({IEEE})},
volume = {46},
number = {4},
pages = {644--647},
author = {A. Shwartz},
title = {Death and discounting},
journal = {{IEEE} Transactions on Automatic Control}
}
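To make the Bellman optimality criterion concrete, here is a minimal NumPy sketch of value iteration on a randomly generated toy MDP; the transition tensor P, reward matrix R, and discount factor are placeholder data, not anything from the lectures.

```python
import numpy as np

# Toy MDP with 2 actions and 3 states (placeholder data).
# P[a, s, t] = probability of moving from state s to state t under action a.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(2, 3))
R = rng.standard_normal((3, 2))  # R[s, a]: expected immediate reward
gamma = 0.9

# Value iteration: repeated application of the Bellman optimality operator
# (Tv)(s) = max_a [ R(s, a) + gamma * sum_t P(t | s, a) v(t) ].
v = np.zeros(3)
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:  # T is a gamma-contraction
        break
    v = v_new

pi = q.argmax(axis=1)  # a greedy (hence optimal) deterministic policy
```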
Vector notation: Section 5.6 in Puterman (1994)
Neumann series and policy evaluation: Theorem 6.1.1 and Corollary C.4 in Puterman (1994)
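As a quick illustration of that result, a sketch comparing the direct solve of (I - gamma P_pi) v = r_pi with a truncated Neumann series; P_pi and r_pi are placeholder data.

```python
import numpy as np

rng = np.random.default_rng(1)
P_pi = rng.dirichlet(np.ones(4), size=4)  # transition matrix of the induced chain
r_pi = rng.standard_normal(4)             # expected one-step reward under the policy
gamma = 0.95

# Direct policy evaluation: solve (I - gamma * P_pi) v = r_pi.
v_direct = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Neumann series: (I - gamma P_pi)^{-1} = sum_{k>=0} (gamma P_pi)^k,
# which converges since the spectral radius of gamma * P_pi is gamma < 1.
v_series, term = np.zeros(4), r_pi.copy()
for _ in range(500):
    v_series += term
    term = gamma * P_pi @ term

assert np.allclose(v_direct, v_series)
```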
Matrix splitting methods:
@book{varga1962,
title = {Matrix iterative analysis},
author = {Richard S. Varga},
lccn = {62021277},
year = {1962},
publisher = {Prentice-Hall},
address = {Englewood Cliffs}
}
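To connect this to policy evaluation, a minimal sketch of one classical splitting, Gauss-Seidel: write A = I - gamma P_pi as A = M - N with M the lower-triangular part and iterate v <- M^{-1}(N v + r_pi). The data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
P_pi = rng.dirichlet(np.ones(4), size=4)
r_pi = rng.standard_normal(4)
gamma = 0.95

# Gauss-Seidel splitting of the policy evaluation system A v = r_pi.
A = np.eye(4) - gamma * P_pi
M = np.tril(A)      # lower-triangular part, including the diagonal
N = M - A           # so that A = M - N

# The iteration converges because A is strictly diagonally dominant:
# each diagonal entry exceeds its off-diagonal row sum by 1 - gamma.
v = np.zeros(4)
for _ in range(500):
    v = np.linalg.solve(M, N @ v + r_pi)

assert np.allclose(A @ v, r_pi)
```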
Primal and Dual LP formulations: section 6.9.1 in Puterman (1994)
@book{Derman1970,
author = {Derman, Cyrus},
title = {Finite State Markovian Decision Processes},
year = {1970},
isbn = {0122092503},
publisher = {Academic Press, Inc.},
address = {Orlando, FL, USA},
}
@book{altman1999,
author = {Altman, E.},
publisher = {Chapman and Hall},
title = {Constrained Markov Decision Processes},
year = {1999},
pages = {256}
}
Occupation measures, induced randomized stationary policy: theorem 6.9.1 in Puterman (1994)
Characterization of basic feasible solutions as Markov Deterministic decision rules: proposition 6.9.3 in Puterman (1994)
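To see the primal formulation of section 6.9.1 in action, here is a sketch that assembles the LP "minimize sum_s v(s) subject to v(s) >= r(s,a) + gamma sum_t p(t|s,a) v(t) for all (s,a)" and hands it to scipy.optimize.linprog; the dual variables of these constraints are exactly the occupation measures of theorem 6.9.1. The toy data are placeholders.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, t]
R = rng.standard_normal((nS, nA))

# One inequality per state-action pair, written as A_ub @ v <= b_ub:
#   gamma * sum_t P(t|s,a) v(t) - v(s) <= -R(s, a).
A_ub = np.zeros((nS * nA, nS))
b_ub = np.zeros(nS * nA)
for s in range(nS):
    for a in range(nA):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub[s * nA + a] = row
        b_ub[s * nA + a] = -R[s, a]

# Minimize sum_s v(s); v is free, so override linprog's default bounds.
res = linprog(c=np.ones(nS), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
v_star = res.x  # the optimal value function of the toy MDP
```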
Bias of the MRP in the average reward formulation: section 8.2.1 in Puterman (1994)
@book{KantorovichAkilov1982,
title={Functional Analysis},
author={Kantorovich, L.V. and Akilov, G.P.},
isbn={9780080230368},
lccn={80041734},
year={1982},
publisher={Pergamon Press}
}
See On the Convergence of Policy Iteration in Stationary Dynamic Programming by Puterman and Brumelle (1979)
@article{PutermanBrumelle1979,
author = {Martin L. Puterman and Shelby L. Brumelle},
title = {On the Convergence of Policy Iteration in Stationary Dynamic Programming},
journal = {Mathematics of Operations Research},
volume = {4},
number = {1},
pages = {60-69},
year = {1979}
}
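Puterman and Brumelle interpret policy iteration as a Newton-type method, which explains its fast convergence. For reference, a minimal NumPy sketch of the exact algorithm itself, on placeholder data:

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, t]
R = rng.standard_normal((nS, nA))

pi = np.zeros(nS, dtype=int)  # arbitrary initial deterministic policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
    P_pi = P[np.arange(nS), pi]
    r_pi = R[np.arange(nS), pi]
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

    # Policy improvement: act greedily with respect to v.
    q = R + gamma * np.einsum("sat,t->sa", P, v)
    pi_new = q.argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break  # the greedy policy is stable, hence optimal
    pi = pi_new
```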
See Numerical dynamic programming in economics by John Rust (1996)
@incollection{Rust1996,
doi = {10.1016/s1574-0021(96)01016-7},
url = {https://doi.org/10.1016/s1574-0021(96)01016-7},
year = {1996},
publisher = {Elsevier},
pages = {619--729},
author = {John Rust},
chapter = {14},
title = {Numerical dynamic programming in economics},
booktitle = {Handbook of Computational Economics}
}
Modified policy iteration: section 6.5 in Puterman (1994)
@techreport{vannunen1976,
author = {van Nunen, J. A. E. E. and Wessels, J.},
title = {Stopping times and Markov programming},
institution = {Technische Hogeschool Eindhoven},
address = {Eindhoven},
number = {76-22},
year = {1976}
}
@techreport{vannunentechreport1976,
author = {van Nunen, J. A. E. E. and Wessels, J.},
year = {1976},
title = {The generation of successive approximation methods for Markov decision processes by using stopping times},
institution = {Technische Hogeschool Eindhoven},
address = {Eindhoven},
number = {Memorandum COSOR; Vol. 7622}
}
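A sketch of the resulting scheme: each iteration performs one greedy backup followed by m partial evaluation sweeps under the greedy policy, so m = 0 recovers value iteration and m -> infinity recovers policy iteration. The data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma, m = 4, 3, 0.9, 5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, t]
R = rng.standard_normal((nS, nA))

v = np.zeros(nS)
for _ in range(1000):
    # Greedy backup.
    q = R + gamma * np.einsum("sat,t->sa", P, v)
    pi = q.argmax(axis=1)
    v_new = q.max(axis=1)

    # m sweeps of partial evaluation under the greedy policy pi.
    P_pi = P[np.arange(nS), pi]
    r_pi = R[np.arange(nS), pi]
    for _ in range(m):
        v_new = r_pi + gamma * P_pi @ v_new

    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
```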
@article{Rust1988,
doi = {10.1137/0326056},
url = {https://doi.org/10.1137/0326056},
year = {1988},
month = sep,
publisher = {Society for Industrial {\&} Applied Mathematics ({SIAM})},
volume = {26},
number = {5},
pages = {1006--1024},
author = {John Rust},
title = {Maximum Likelihood Estimation of Discrete Control Processes},
journal = {{SIAM} Journal on Control and Optimization}
}
Section 8.1 of Nocedal and Wright (2006)
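That section covers finite-difference derivative approximations; a minimal sketch of the forward and central difference formulas, useful as a sanity check for the gradient estimators discussed next:

```python
import numpy as np

def fd_gradient(f, x, h=1e-6, central=True):
    """Approximate the gradient of f at x by finite differences."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        if central:
            g[i] = (f(x + e) - f(x - e)) / (2 * h)  # O(h^2) truncation error
        else:
            g[i] = (f(x + e) - f(x)) / h            # O(h) truncation error
    return g

# Example: the gradient of f(x) = ||x||^2 is 2x.
x = np.array([1.0, -2.0, 3.0])
assert np.allclose(fd_gradient(lambda z: z @ z, x), 2 * x, atol=1e-4)
```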
Intro to the likelihood ratio method
@article{LEcuyer1990,
doi = {10.1287/mnsc.36.11.1364},
url = {https://doi.org/10.1287/mnsc.36.11.1364},
year = {1990},
month = nov,
publisher = {Institute for Operations Research and the Management Sciences ({INFORMS})},
volume = {36},
number = {11},
pages = {1364--1383},
author = {Pierre L{\textquotesingle}Ecuyer},
title = {A Unified View of the {IPA}, {SF}, and {LR} Gradient Estimation Techniques},
journal = {Management Science}
}
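The likelihood ratio (score function) identity reads d/dtheta E_{x ~ p_theta}[f(x)] = E[f(x) d/dtheta log p_theta(x)]. A minimal sketch for a Gaussian location parameter, where the ground truth is available in closed form:

```python
import numpy as np

# For x ~ N(theta, 1), grad_theta log p_theta(x) = x - theta, so
#   d/dtheta E[f(x)] = E[f(x) * (x - theta)].
rng = np.random.default_rng(6)
theta, n = 1.5, 200_000
x = rng.normal(theta, 1.0, size=n)

f = lambda z: z ** 2                      # test function with a known answer
grad_lr = np.mean(f(x) * (x - theta))     # likelihood ratio estimate

# Ground truth: E[x^2] = theta^2 + 1, hence d/dtheta E[x^2] = 2 * theta.
print(grad_lr, 2 * theta)
```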
Spring break
Formulation of inverse RL and meta-RL as bilevel optimization.
Methods (contd.): KKT "trick", forward, reverse, implicit, competitive. Case studies
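As a toy illustration of the implicit route (not one of the case studies), a sketch that differentiates through the inner optimality condition grad_w g(w, lam) = 0 via the implicit function theorem, on a quadratic bilevel problem where everything is available in closed form:

```python
import numpy as np

# Outer: min_lam F(lam) = f(w*(lam)) with f(w) = 0.5 * ||w - target||^2.
# Inner: w*(lam) = argmin_w g(w, lam), g(w, lam) = 0.5*||w - lam||^2 + 0.5*||w||^2,
# so grad_w g = 2w - lam and w*(lam) = lam / 2.
target = np.array([1.0, -1.0])

def hypergradient(lam):
    w = lam / 2.0                 # inner solution (closed form for this toy g)
    grad_f = w - target           # gradient of the outer objective in w
    H = 2.0 * np.eye(2)           # inner Hessian      d^2 g / dw dw
    J = -np.eye(2)                # mixed derivative   d^2 g / dw dlam
    # Implicit function theorem: dw*/dlam = -H^{-1} J, hence
    # dF/dlam = (dw*/dlam)^T grad_f = -J^T H^{-1} grad_f.
    return -J.T @ np.linalg.solve(H, grad_f)

# Gradient descent on the outer objective; lam converges to 2 * target,
# so that the inner solution w*(lam) matches the target.
lam = np.zeros(2)
for _ in range(500):
    lam -= 0.5 * hypergradient(lam)
```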
Challenges and opportunities
Final project presentations
Academic life can sometimes be overwhelming. Don't hesitate to reach out for support: