I've been working for the last month on implementing [[2020SchrittwieserEtAlMasteringAtariGo|MuZero]] in [[2018BradburyEtAlJAXComposableTransformations|JAX]].
How would I have spent my time differently?
# [[computer science|software development]]
I should already have learned this lesson from various [[computer science|software development]] experiences ([[DataStax]], [[2024-05-28 Summer at QuantCo|QuantCo internship]]):
I should have started writing [[software test]]s earlier.
This goes against the typical paradigm of [[research]] code
where people are iterating quickly on ideas and breaking things.
Maybe other people have their own ways of making sure they don't introduce bugs;
for me just having the `tests` folder there reminds me to run them once in a while as a sanity check.
It also comes with the bonus of using [[2004KrekelEtAlPytest|pytest]] as a convenient CLI.
I wish I had known about the `--pdb` flag sooner!
It drops you into [[Python debugger|pdb]] for post-mortem debugging whenever a test fails.
Way faster for tracking down simple [[debugging and errors|software bug]]s.
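For concreteness, here's a minimal sketch of the kind of sanity check I mean; the `discounted_return` helper and the file name are illustrative assumptions, not my actual code:

```python
# tests/test_returns.py -- run with e.g. `pytest --pdb` to drop into the debugger on failure.
import numpy as np


def discounted_return(rewards: np.ndarray, discount: float) -> float:
    """Sum of rewards weighted by discount**t; an easy thing to get off by one."""
    steps = np.arange(len(rewards))
    return float(np.sum(rewards * discount**steps))


def test_constant_reward_matches_geometric_series():
    rewards = np.ones(50)
    gamma = 0.9
    expected = (1 - gamma**50) / (1 - gamma)  # closed form of the finite geometric series
    assert abs(discounted_return(rewards, gamma) - expected) < 1e-9
```

Even a handful of checks like this catch regressions when you're refactoring quickly.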
## [[sequential decision environment|environment]] framework
I also spent some time writing my own "framework" (set of abstractions)
for getting my code to work with different [[sequential decision environment|environment]]s.
I think [[2019MuldalEtAlDm_envPythonInterface|dm_env]]'s interface is basically correct
but it's implemented in a [[functional programming|stateful]] way
and I'm working on a [[functional programming|pure functional programming]] [[2018BradburyEtAlJAXComposableTransformations|JAX]] implementation.
In this paradigm,
we explicitly pass around the state,
so it's confusing to model environments using [[object oriented language|class]]es:
What's an "instance" of an environment? Nonsense.
An [[sequential decision environment|environment]] is just the tuple `(reset, step)`,
where `reset: (EnvParams, Key) -> TimeStep` and `step: (EnvState, Action, EnvParams, Key) -> TimeStep[EnvState]`.
A wrapper is just a mapping `(reset, step) ↦ (reset_wrapped, step_wrapped)`.
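As a sketch of what this looks like in practice (the type names and the toy random-walk environment are illustrative assumptions, not the exact interface in my code):

```python
from typing import Callable, NamedTuple
import jax
import jax.numpy as jnp


class TimeStep(NamedTuple):
    """The environment state is carried explicitly instead of hidden in an object."""
    state: jnp.ndarray        # EnvState
    observation: jnp.ndarray
    reward: jnp.ndarray
    done: jnp.ndarray


def reset(params: dict, key: jax.Array) -> TimeStep:
    """reset: (EnvParams, Key) -> TimeStep, here for a toy 1-D random walk."""
    state = jax.random.uniform(
        key, (), minval=-params["start_range"], maxval=params["start_range"]
    )
    return TimeStep(state=state, observation=state, reward=jnp.array(0.0), done=jnp.array(False))


def step(state: jnp.ndarray, action: jnp.ndarray, params: dict, key: jax.Array) -> TimeStep:
    """step: (EnvState, Action, EnvParams, Key) -> TimeStep[EnvState]."""
    noise = params["noise_scale"] * jax.random.normal(key, ())
    new_state = state + action + noise
    return TimeStep(
        state=new_state,
        observation=new_state,
        reward=-jnp.abs(new_state),               # reward for staying near the origin
        done=jnp.abs(new_state) > params["bound"],
    )


def with_reward_scale(reset_fn: Callable, step_fn: Callable, scale: float):
    """A wrapper maps (reset, step) to (reset_wrapped, step_wrapped)."""
    def step_wrapped(state, action, params, key):
        ts = step_fn(state, action, params, key)
        return ts._replace(reward=scale * ts.reward)
    return reset_fn, step_wrapped
```

Because `reset` and `step` are pure functions, they compose directly with `jax.jit` and `jax.vmap` over batches of environment states.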
# [[reinforcement learning]]
The most annoying part was when my code didn't work.
![[code doesnt work code works meme.jpg]]
[[debugging and errors|Debugging]] in [[machine learning]] feels like a [[episodic reward|sparse reward]] problem.
A lot of the time I was just stuck, not sure what had introduced the issue.
Was it an [[discount factor|off-by-one]] error?
Should I tune [[hyperparameter]]s?
I should have watched / read a lot more advice on [[reinforcement learning debugging tips and implementation advice|rl debugging]] before getting started.
And by that I mean before starting [[reinforcement learning]] at all, since after all that advice I might not have wanted to start in the first place.
I watched [[2016SchulmanNutsBoltsDeep|The Nuts and Bolts of Deep RL Research]] my first week.
[[2021JonesDebuggingReinforcementLearning|Debugging Reinforcement Learning Systems]] is also a gem.
![[2021JonesDebuggingReinforcementLearning]]
In addition,
I found a learning-rate warmup to be helpful,
as well as a slow decay afterwards.
The trajectories collected at the very beginning are basically random
so it doesn't make sense to take large gradient steps on them just yet,
since the data distribution will change rapidly throughout training.
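A minimal sketch of such a schedule, assuming optax and using placeholder numbers rather than the hyperparameters I actually used (cosine decay is just one way to get the slow decay):

```python
import optax

# Warm up from ~0 while the early trajectories are still mostly random,
# then decay slowly over the rest of training.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,      # placeholder peak learning rate
    warmup_steps=1_000,
    decay_steps=100_000,
    end_value=3e-5,
)
optimizer = optax.adam(learning_rate=schedule)
```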