This course (IFT6760C) is intended for advanced graduate students with a good background in machine learning, mathematics, operations research or statistics. You can register for IFT6760C on Synchro if you are affiliated with UdeM, or via the CREPUQ if you are from another institution. Due to the research-oriented nature of this class, you need to be comfortable with a teaching format involving *open-ended* questions and assignments. You will be required to think critically and adopt an open mindset. My teaching goal with this course is for all participants to build their own understanding of reinforcement learning in relation to their primary research area while sharing their unique perspective and insights with the entire class.

*A short journey or trip, especially one taken as a leisure activity.* *A deviation from a regular activity or course.*

Origin: from the Latin verb *excurrere* which means *to run out*. This is also the intended meaning behind the title of this course. I want us to deviate from the usual paths and explore the rich connections between reinforcement learning and other disciplines, in particular: optimization, control theory and simulation. And of course, I'm also hoping that this will be a fun activity for everyone.

Twice a week: Tuesday from 9:30 to 11:30 AM and Friday from 1:30 to 3:40 PM. The course is taught at Mila in the Agora of the 6650 Saint-Urbain building. You don't need badge access to enter the classroom. Here's a video showing you how to access the classroom from Saint-Zotique Ouest.

The following evaluation structure is subject to change depending on the class size.

- Class participation: 5%
- Assignments: 3 x 15% = 45%
- Final project: 50%

There is no mandatory textbook. I will however be referencing content from:

- Markov Decision Processes: Discrete Stochastic Dynamic Programming by Martin Puterman
- Reinforcement Learning and Optimal Control by Dimitri Bertsekas
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
- Simulation and the Monte Carlo Method, Second Edition by Rubinstein and Kroese (2008)
- Nonlinear Programming: 3rd Edition by Dimitri Bertsekas
- Practical Methods for Optimal Control and Estimation Using Nonlinear Programming by John T. Betts
- Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition by Andreas Griewank and Andrea Walther

First class: Markov Decision Processes and the induced stochastic process.

Examples of MDPs; constrained MLE as sequential allocation; optimality criteria: finite horizon, infinite horizon, average reward; the random-horizon interpretation of the infinite-horizon discounted setting; Bellman optimality.

- Absorbing states and discounting.
- Section 5.3, proposition 5.3.1 and page 127 in Puterman (1994)
- Death and discounting by Adam Shwartz
`@article{Shwartz2001, doi = {10.1109/9.917668}, url = {https://doi.org/10.1109/9.917668}, year = {2001}, month = apr, publisher = {Institute of Electrical and Electronics Engineers ({IEEE})}, volume = {46}, number = {4}, pages = {644--647}, author = {A. Shwartz}, title = {Death and discounting}, journal = {{IEEE} Transactions on Automatic Control} }`

Vector notation: Section 5.6 in Puterman (1994)

Neumann series and policy evaluation: Theorem 6.1.1 and corollary C.4 in Puterman (1994)
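The Neumann series view of policy evaluation can be checked numerically: \(v = (I - \gamma P)^{-1} r = \sum_k (\gamma P)^k r\). A minimal sketch, where the two-state Markov reward process is made up for illustration:

```python
import numpy as np

# Policy evaluation via the (truncated) Neumann series, compared against the
# direct linear solve of (I - gamma P) v = r. The MRP numbers are made up.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])    # transition matrix of the policy-induced chain
r = np.array([1.0, 0.0])      # expected reward per state
gamma = 0.9

v_direct = np.linalg.solve(np.eye(2) - gamma * P, r)

v_series = np.zeros(2)
term = r.copy()
for _ in range(1000):         # truncated series sum_k (gamma P)^k r
    v_series += term
    term = gamma * P @ term
```

Each additional term shrinks the remaining error by a factor of \(\gamma\), which is why the series converges whenever \(\gamma < 1\) (the spectral radius of \(\gamma P\) is at most \(\gamma\)).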

Matrix splitting methods:

- Iterative Methods by Martin H. Gutknecht
- Matrix iterative analysis by Richard Varga
`@book{varga1962, title = {Matrix iterative analysis}, author = {Richard S. Varga}, lccn = {62021277}, year = {1962}, publisher = {Prentice-Hall}, address = {Englewood Cliffs} }`
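The matrix splitting idea can be sketched in a few lines: write \(A = I - \gamma P = M - N\), pick an \(M\) that is easy to invert, and iterate \(M v_{k+1} = N v_k + r\). Below, \(M\) is the lower triangle of \(A\) (Gauss-Seidel); the MRP numbers are made up:

```python
import numpy as np

# Gauss-Seidel matrix splitting for policy evaluation: A = M - N with M the
# lower triangle of A, iterating M v_{k+1} = N v_k + r. Numbers are made up.
P = np.array([[0.5, 0.5],
              [0.3, 0.7]])
r = np.array([1.0, 2.0])
gamma = 0.95
A = np.eye(2) - gamma * P     # we want to solve A v = r

M = np.tril(A)                # Gauss-Seidel choice of splitting
N = M - A
v = np.zeros(2)
for _ in range(500):
    v = np.linalg.solve(M, N @ v + r)
```

Jacobi (\(M\) = diagonal of \(A\)) and Richardson (\(M = I\), which recovers plain iterative policy evaluation) are the other standard choices of splitting.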

Primal and Dual LP formulations: section 6.9.1 in Puterman (1994)

- Finite State Markovian Decision Processes by Cyrus Derman
`@book{Derman1970, author = {Derman, Cyrus}, title = {Finite State Markovian Decision Processes}, year = {1970}, isbn = {0122092503}, publisher = {Academic Press, Inc.}, address = {Orlando, FL, USA}, }`

- Markov Decision Processes by Lodewijk Kallenberg
- Constrained Markov Decision Processes by Eitan Altman
`@book{altman1999, author = {Altman, E.}, publisher = {Chapman and Hall}, title = {Constrained Markov Decision Processes}, year = 1999, pages = 256 }`
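A minimal sketch of the primal LP, assuming `scipy` is available: minimize \(\alpha^\top v\) subject to \(v(s) \ge r(s,a) + \gamma \sum_{s'} P(s'|s,a)\, v(s')\) for every state-action pair. The toy MDP numbers are invented for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Primal LP formulation of a toy 2-state, 2-action MDP (made-up numbers).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s']
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                  # R[s, a]
nS, nA = R.shape

c = np.ones(nS)                             # alpha: (unnormalized) initial distribution
A_ub, b_ub = [], []
for s in range(nS):
    for a in range(nA):
        row = gamma * P[s, a] - np.eye(nS)[s]   # -(v(s) - gamma P v) <= -r(s, a)
        A_ub.append(row)
        b_ub.append(-R[s, a])
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * nS)
v_lp = res.x

v = np.zeros(nS)                            # value-iteration cross-check
for _ in range(2000):
    v = (R + gamma * P @ v).max(axis=1)
```

The dual of this LP is the occupation-measure formulation discussed next.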

Occupation measures, induced randomized stationary policy: theorem 6.9.1 in Puterman (1994)

Characterization of basic feasible solutions as Markovian deterministic decision rules: proposition 6.9.3 in Puterman (1994)

Bias of the MRP in the average reward formulation: section 8.2.1 in Puterman (1994)

- Laurent series expansion in the average reward formulation: section 8.2.2 in Puterman (1994)
- Representation of the total expected discounted reward in terms of the interest rate
- Corollary 8.2.4: decomposition of the total expected discounted reward of a policy into the gain and bias terms
- Corollary 8.2.5: gain as the limit of discount factor going to 1 in the total expected discounted reward

- Bellman optimality equations in the total expected discounted reward criterion
- Equality between taking the maximum over the set of deterministic Markovian decision rules instead of the randomized Markovian decision rules: proposition 6.2.1 in Puterman (1994)
- Contraction mappings: section 6.2.3 in Puterman (1994)
- Banach fixed-point theorem: theorem 6.2.3 in Puterman (1994)

- Contraction mapping theorem: theorem 6.2.3 of Puterman (1994)
- The Bellman optimality operator is a contraction: proposition 6.2.4
- Conserving decision rules: section 6.2.4
- Existence of stationary policies: theorem 6.2.7
- Value iteration: section 6.3.2
- Termination criterion: theorem 6.3.1 c) and d)
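The stopping rule behind the termination criterion can be sketched directly: stop once \(\|v_{k+1} - v_k\| < \epsilon(1-\gamma)/(2\gamma)\), after which the greedy policy is \(\epsilon\)-optimal. The toy MDP numbers are made up:

```python
import numpy as np

# Value iteration with the epsilon-optimal stopping rule.
gamma, eps = 0.9, 1e-6
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])   # P[s, a, s']
R = np.array([[1.0, 0.5],
              [0.0, 1.5]])                   # R[s, a]

v = np.zeros(2)
while True:
    v_new = (R + gamma * P @ v).max(axis=1)  # Bellman optimality operator
    if np.max(np.abs(v_new - v)) < eps * (1 - gamma) / (2 * gamma):
        break
    v = v_new
policy = (R + gamma * P @ v_new).argmax(axis=1)  # greedy policy is eps-optimal
```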

- Policy Iteration: section 6.4
- Monotonicity of the value iterates: proposition 6.4.1
- Convergence of policy iteration: theorem 6.4.2
- PI as a constructive proof for the existence of a solution to the Bellman optimality equations
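Policy iteration itself is short enough to sketch: evaluate the current policy exactly with a linear solve, then improve greedily, and stop when the policy no longer changes. The toy MDP numbers are made up:

```python
import numpy as np

# Policy iteration: exact evaluation followed by greedy improvement.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.5],
              [0.0, 1.5]])
nS = 2

policy = np.zeros(nS, dtype=int)
while True:
    P_pi = P[np.arange(nS), policy]                       # policy's transition matrix
    r_pi = R[np.arange(nS), policy]
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)  # exact evaluation
    new_policy = (R + gamma * P @ v).argmax(axis=1)       # greedy improvement
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```

At termination, the evaluated value function satisfies the Bellman optimality equations, which is the constructive-proof view mentioned above.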

- Newton-Kantorovich Iterations: section 6.4.3
- Gateaux derivatives
- Policy iteration as Newton-Kantorovich: proposition 6.4.4
- See Functional Analysis 2nd edition by Kantorovich and Akilov (1982)
`@book{KantorovichAkilov1982, title={Functional Analysis}, author={Kantorovich, L.V. and Akilov, G.P.}, isbn={9780080230368}, lccn={80041734}, year={1982}, publisher={Pergamon Press} }`

See On the Convergence of Policy Iteration in Stationary Dynamic Programming by Puterman and Brumelle (1979)

`@article{PutermanBrumelle1979, author = {Martin L. Puterman and Shelby L. Brumelle}, title = {On the Convergence of Policy Iteration in Stationary Dynamic Programming}, journal = {Mathematics of Operations Research}, volume = {4}, number = {1}, pages = {60-69}, year = {1979} }`

See Numerical dynamic programming in economics by John Rust (1996)

`@incollection{Rust1996, doi = {10.1016/s1574-0021(96)01016-7}, url = {https://doi.org/10.1016/s1574-0021(96)01016-7}, year = {1996}, publisher = {Elsevier}, pages = {619--729}, author = {John Rust}, title = {Chapter 14 Numerical dynamic programming in economics}, booktitle = {Handbook of Computational Economics} }`

Modified policy iteration: section 6.5

- Generalized Bellman optimality operator
- See Stopping times and Markov programming by Van Nunen and Wessels (1976)
`@techreport{vannunen1976, author = "J.A.E.E. van Nunen and J. Wessels", title = "Stopping times and Markov programming", institution = "Technische Hogeschool Eindhoven", address = "Eindhoven", number = "76-22", year = "1976" }`

- See The generation of successive approximation methods for Markov decision processes by using stopping times by Van Nunen and Wessels (1976)
`@techreport{vannunentechreport1976, author = {van Nunen, J. A. E. E. and Wessels, J.}, year = {1976}, title = {The generation of successive approximation methods for Markov decision processes by using stopping times}, institution = {Technische Hogeschool Eindhoven}, address = {Eindhoven}, number = {Memorandum COSOR; Vol. 7622} }`
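Modified policy iteration replaces the exact linear solve in policy iteration with a few applications of the policy's Bellman operator. A sketch on a toy MDP (made-up numbers):

```python
import numpy as np

# Modified policy iteration: greedy improvement followed by m partial
# evaluation sweeps. m = 1 is essentially value iteration; m -> infinity
# recovers policy iteration.
gamma, m = 0.9, 5
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.5],
              [0.0, 1.5]])
nS = 2

v = np.zeros(nS)
for _ in range(200):
    policy = (R + gamma * P @ v).argmax(axis=1)          # improvement step
    P_pi = P[np.arange(nS), policy]
    r_pi = R[np.arange(nS), policy]
    for _ in range(m):                                   # partial evaluation
        v = r_pi + gamma * P_pi @ v
```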

- Derivation of the smooth Bellman optimality equations from the point of view of a noisy reward function where the noise process comes from a Gumbel distribution
- Social surplus function
- Williams-Daly-Zachary theorem
- See McFadden's Econometric Models of Probabilistic Choice (1981)
- See Rust (1996) and:
`@article{Rust1988, doi = {10.1137/0326056}, url = {https://doi.org/10.1137/0326056}, year = {1988}, month = sep, publisher = {Society for Industrial {\&} Applied Mathematics ({SIAM})}, volume = {26}, number = {5}, pages = {1006--1024}, author = {John Rust}, title = {Maximum Likelihood Estimation of Discrete Control Processes}, journal = {{SIAM} Journal on Control and Optimization} }`
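The Gumbel-noise derivation of the smooth Bellman operator can be checked by simulation: if each action value \(q(a)\) is perturbed by independent Gumbel\((0, \tau)\) noise, the expected maximum equals the log-sum-exp "social surplus" plus \(\tau\) times the Euler-Mascheroni constant. The \(q\) values below are made up:

```python
import numpy as np

# Empirical check of the Gumbel-max / log-sum-exp identity.
rng = np.random.default_rng(0)
q = np.array([1.0, 0.5, -0.2])
tau = 1.0

noise = rng.gumbel(scale=tau, size=(200_000, q.size))
empirical = (q + noise).max(axis=1).mean()
surplus = tau * np.log(np.exp(q / tau).sum())   # log-sum-exp social surplus

print(empirical - surplus)   # close to tau * 0.5772 (Euler-Mascheroni constant)
```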

- Introduction to the projected Bellman equations, Galerkin method, variational formulation

- Lemma 6.3.1 on the property of the weighted Euclidean norm taken under a stationary distribution
- Proposition 6.3.1 using Lemma 6.3.1 to show that the projected policy evaluation operator is a \(\gamma\)-contraction and proof of the bound on the error between the fixed-point solution and the true value function
- Use of the Pythagorean theorem (footnote on page 425, section 6.3.1)
- Matrix form of the projected equations (section 6.3.2)

- Projection by Monte Carlo simulation, section 6.1.4 of Bertsekas
- Derivation of the closed-form solution of the projection under the weighted Euclidean norm
- Monte Carlo estimates of the closed-form solution (see page 406 of Bertsekas)
- Simulation-based version of the direct solution to the projected policy evaluation equations: LSTD(0)
- Section 6.3.3 of Bertsekas
- Section 6.3.4 of Bertsekas
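LSTD(0) can be sketched end to end: accumulate \(A = \sum_t \phi(s_t)(\phi(s_t) - \gamma\phi(s_{t+1}))^\top\) and \(b = \sum_t \phi(s_t) r(s_t)\) along a simulated trajectory, then solve \(A\theta = b\). The two-state chain and tabular features below are made up:

```python
import numpy as np

# LSTD(0) on a toy two-state chain. With tabular features the solution
# recovers the true value function up to sampling noise.
rng = np.random.default_rng(1)
gamma = 0.9
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
Phi = np.eye(2)               # tabular features

A = np.zeros((2, 2))
b = np.zeros(2)
s = 0
for _ in range(200_000):
    s_next = rng.choice(2, p=P[s])
    phi, phi_next = Phi[s], Phi[s_next]
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * r[s]
    s = s_next
theta = np.linalg.solve(A, b)

v_true = np.linalg.solve(np.eye(2) - gamma * P, r)   # for comparison
```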

- Assignment 1 released
- Perron-Frobenius theorem and numerical methods for computing the stationary distribution
- Projected Value Iteration algorithm, see section 6.3.2 of Bertsekas
- Simulation-based counterpart to PVI: LSPE(0); see section 6.3.4 of Bertsekas

- Oblique Projections
- Projected Bellman equations with matrix splitting methods
- Simulation-based counterpart: LSTD(\(\lambda\))

- Valentine's day
- See [Wikipedia](https://en.wikipedia.org/wiki/Valentine%27s_Day)
- ODE Method and Stochastic Approximation
- Fitted value methods:
- Non-linear case
- Counterexample

- Intuition for stochastic approximation. Robbins-Monro algorithm
- Root-finding problems
- Newton's method (live coding)
- Martingale differences
- See sections 1.1.1 and 1.1.2 of Kushner and Yin (2003)
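The Robbins-Monro idea in one sketch: find the root of \(g(x) = \mathbb{E}[G(x, \xi)]\) from noisy evaluations only, with step sizes \(a_k = 1/k\) satisfying \(\sum_k a_k = \infty\) and \(\sum_k a_k^2 < \infty\). The target function below is made up:

```python
import numpy as np

# Robbins-Monro stochastic approximation for root finding.
rng = np.random.default_rng(0)

def g(x):
    return x - 2.0                 # root at x* = 2, "unknown" to the algorithm

x = 0.0
for k in range(1, 100_001):
    noisy = g(x) + rng.normal()    # only a noisy measurement is observed
    x -= (1.0 / k) * noisy         # x_{k+1} = x_k - a_k * G(x_k, xi_k)
```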

- ODE of Q-learning in the tabular setting
- Martingale differences for Q-learning
- See page 44 of Kushner and Yin (2003)
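Tabular Q-learning is the canonical stochastic-approximation scheme from the lecture: noisy Bellman optimality backups along a single trajectory. A sketch on a toy MDP with made-up numbers and a small constant step size:

```python
import numpy as np

# Tabular Q-learning with uniformly random exploration (off-policy).
rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.01
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.5],
              [0.0, 1.5]])

Q = np.zeros((2, 2))
s = 0
for _ in range(300_000):
    a = rng.integers(2)                          # behavior policy: uniform
    s_next = rng.choice(2, p=P[s, a])
    td_error = R[s, a] + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * td_error
    s = s_next

# Fixed point of the Bellman optimality operator, for comparison.
Qstar = np.zeros((2, 2))
for _ in range(2000):
    Qstar = R + gamma * P @ Qstar.max(axis=1)
```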

- Euler's method
- See Nocedal and Wright (2006)

- Martingale differences in stochastic approximation (continued)
- Section 1.1.2 of Kushner and Yin (2003)
- Demonstration with colab notebook

- Derivative approximation with finite differences
- Forward-difference/one-sided-difference
- Central-difference formula
- Taylor's Theorem: see thm. 2.1 in Nocedal and Wright 2006
- Section 8.1 of Nocedal and Wright (2006)
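The two formulas can be compared in a few lines: Taylor's theorem gives \(O(h)\) truncation error for the one-sided formula and \(O(h^2)\) for the central one. A sketch on \(f(x) = \sin(x)\) at \(x = 1\) (chosen arbitrarily):

```python
import numpy as np

# Forward (one-sided) vs central finite differences.
f, df = np.sin, np.cos
x, h = 1.0, 1e-5

forward = (f(x + h) - f(x)) / h
central = (f(x + h) - f(x - h)) / (2 * h)

print(abs(forward - df(x)))   # truncation error on the order of h
print(abs(central - df(x)))   # on the order of h^2, until round-off takes over
```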

- Effect of noise on finite differences
- See section 9.1 of Nocedal and Wright (2006)
- Theorem 9.1 of Nocedal and Wright

Intro to the likelihood ratio method

- The likelihood ratio method
- See L'Ecuyer (1990)

`@article{LEcuyer1990, doi = {10.1287/mnsc.36.11.1364}, url = {https://doi.org/10.1287/mnsc.36.11.1364}, year = {1990}, month = nov, publisher = {Institute for Operations Research and the Management Sciences ({INFORMS})}, volume = {36}, number = {11}, pages = {1364--1383}, author = {Pierre L{\textquotesingle}Ecuyer}, title = {A Unified View of the {IPA}, {SF}, and {LR} Gradient Estimation Techniques}, journal = {Management Science} }`

- Distributional vs structural parameters
- Section 7.1 of Rubinstein and Kroese (2008)

- The score function estimator as a subcase of LR
- See the L'Ecuyer paper and section 7.2 of Rubinstein and Kroese (2008)
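A minimal score-function sketch: for \(x \sim \mathcal{N}(\theta, 1)\), \(\frac{d}{d\theta}\mathbb{E}[x^2] = \mathbb{E}[x^2 \,\partial_\theta \log p(x;\theta)] = \mathbb{E}[x^2 (x - \theta)]\), and the true value is \(2\theta\) since \(\mathbb{E}[x^2] = \theta^2 + 1\). The parameter value is made up:

```python
import numpy as np

# Score-function (likelihood-ratio) estimator for a Gaussian location model.
rng = np.random.default_rng(0)
theta, n = 1.5, 1_000_000

x = rng.normal(theta, 1.0, size=n)
score = x - theta                  # d/dtheta log N(x; theta, 1)
grad_lr = np.mean(x**2 * score)
grad_true = 2 * theta
```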

- Connection between the score function and the Fisher information matrix
- See section 1.14.4 of Rubinstein and Kroese (2008)

- The connection between the score function and maximum likelihood estimation
- See section 1.14.3 of Rubinstein and Kroese (2008)

- Recap of the likelihood ratio method
- Transformations of random variables
- See section 1.7.2 of Rubinstein and Kroese (2008)
- The inverse transform method
- See section 2.3.1 of Rubinstein and Kroese (2008)
- Example with the exponential distribution (live coding)
- See section 2.4.1.1 of Rubinstein and Kroese (2008)
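A sketch of the inverse-transform example: with \(U \sim \text{Uniform}(0,1)\), \(F^{-1}(U) = -\log(1-U)/\lambda\) is Exp\((\lambda)\); since \(1-U\) is also uniform, \(-\log(U)/\lambda\) works too. The rate parameter is made up:

```python
import numpy as np

# Inverse-transform sampling of the exponential distribution.
rng = np.random.default_rng(0)
lam, n = 2.0, 1_000_000

u = rng.uniform(size=n)
samples = -np.log(u) / lam

print(samples.mean())   # close to the true mean 1 / lam = 0.5
```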

- From LR to IPA:
- See L'Ecuyer (1990)

- Derivation and empirical comparison of LR and IPA estimators for the exponential distribution
- See colab notebook (shared on the mailing list)
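The notebook itself is not reproduced here; an independent sketch of the same comparison: both estimators target \(\frac{d}{d\lambda}\mathbb{E}[X] = -1/\lambda^2\) for \(X \sim \text{Exp}(\lambda)\), LR through the score and IPA through the pathwise derivative of the inverse transform. The parameter values are made up:

```python
import numpy as np

# LR vs IPA gradient estimates for the mean of an exponential, sharing the
# same inverse-transform draws (common random numbers).
rng = np.random.default_rng(0)
lam, n = 2.0, 1_000_000

u = rng.uniform(size=n)
x = -np.log(u) / lam                      # inverse transform: X ~ Exp(lam)

grad_lr = np.mean(x * (1.0 / lam - x))    # score of Exp(lam): 1/lam - x
grad_ipa = np.mean(-x / lam)              # pathwise derivative: dX/dlam = -x/lam
```

In this example the IPA estimator has visibly lower per-sample variance than the LR estimator, which is the usual trade-off when a pathwise derivative exists.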

Spring break

- Application of the LR estimator to MDPs
- Likelihood ratio martingale
- Conditional Monte Carlo method
- See Bratley, Fox and Schrage (1987)
- Law of total variance
- Law of total expectation
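The variance-reduction claim behind conditional Monte Carlo follows from the law of total variance. A self-contained sketch (the target probability and threshold are made up): to estimate \(p = P(X + Y > 4)\) with \(X, Y\) i.i.d. Exp(1), replace the indicator by its conditional expectation given \(X\), namely \(\min(1, e^{X-4})\):

```python
import numpy as np

# Conditional Monte Carlo (Rao-Blackwellization) for a tail probability.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.exponential(size=n)
y = rng.exponential(size=n)

crude = (x + y > 4).mean()                        # plain indicator estimator
smoothed = np.minimum(1.0, np.exp(x - 4)).mean()  # E[1{X+Y>4} | X]

p_true = 5 * np.exp(-4)   # X + Y ~ Gamma(2, 1), so P(> 4) = (1 + 4) e^{-4}
```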

- The REINFORCE estimator
- Eligibility traces for the REINFORCE estimator
- Control variates
- Implicit function theorem and the policy gradient theorem
- Connection with the average reward objective
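REINFORCE with a baseline fits in a short sketch; the two-armed Gaussian bandit and all constants below are invented for illustration:

```python
import numpy as np

# REINFORCE on a two-armed bandit: softmax policy over logits, score-function
# gradient, and a running average of rewards as a baseline (control variate).
rng = np.random.default_rng(0)
theta = np.zeros(2)                    # policy logits
means = np.array([1.0, 2.0])           # arm 1 pays more on average
alpha, baseline = 0.05, 0.0

for _ in range(5000):
    p = np.exp(theta - theta.max())
    p /= p.sum()                       # softmax policy
    a = rng.choice(2, p=p)
    reward = means[a] + rng.normal()
    score = -p                         # grad_theta log pi(a | theta) ...
    score[a] += 1.0                    # ... equals e_a - p for a softmax
    theta += alpha * (reward - baseline) * score
    baseline += 0.05 * (reward - baseline)
```

The policy concentrates on the better arm; subtracting the baseline leaves the gradient unbiased while reducing its variance, which is the control-variate view mentioned above.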

- Discrete-time optimal control, Lagrangian formulation, reverse-mode AD

- Continuous-time optimal control, sequential vs simultaneous methods, forward sensitivity analysis and adjoint method

Formulation of inverse RL and meta-RL as bilevel optimization.

Methods (cont'd): the KKT "trick"; forward, reverse, implicit, and competitive approaches. Case studies.

Challenges and opportunities

Final project presentations

Academic life can sometimes be overwhelming. Don't hesitate to find support:

- UdeM: Centre de santé et de consultation psychologique
- McGill: Student wellness hub
- Polytechnique: Soutien à la réussite
- HEC: Psychological support and resources