I've been working for the last month on implementing [[2020SchrittwieserEtAlMasteringAtariGo|MuZero]] in [[2018BradburyEtAlJAXComposableTransformations|JAX]].
How would I have spent my time differently?
# [[computer science|software development]]
I should already have learned this lesson from various [[computer science|software development]] experiences ([[DataStax]], [[2024-05-28 Summer at QuantCo|QuantCo internship]]):
I should have started writing [[software test]]s earlier.
This goes against the typical paradigm of [[research]] code
where people are iterating quickly on ideas and breaking things.
Maybe other people have their own ways of making sure they don't introduce bugs;
for me just having the `tests` folder there reminds me to run them once in a while as a sanity check.
It also comes with the bonus of using [[2004KrekelEtAlPytest|pytest]] as a convenient CLI.
I wish I had known about the `--pdb` flag sooner!
It drops you into [[Python debugger|pdb]] for post-mortem debugging whenever a test fails.
Way faster for tracking down simple [[debugging and errors|software bug]]s.
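For concreteness, here's a minimal sketch of the kind of sanity check I mean; the `discounted_return` helper and the file name are illustrative assumptions, not my actual code:

```python
# tests/test_returns.py -- run with e.g. `pytest --pdb` to drop into the debugger on failure.
import numpy as np


def discounted_return(rewards: np.ndarray, discount: float) -> float:
    """Sum of rewards weighted by discount**t; an easy thing to get off by one."""
    steps = np.arange(len(rewards))
    return float(np.sum(rewards * discount**steps))


def test_constant_reward_matches_geometric_series():
    rewards = np.ones(50)
    gamma = 0.9
    expected = (1 - gamma**50) / (1 - gamma)  # closed form of the finite geometric series
    assert abs(discounted_return(rewards, gamma) - expected) < 1e-9
```

Even a handful of checks like this catch regressions when you're refactoring quickly.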
## [[sequential decision environment|environment]] framework
I also spent some time writing my own "framework" (set of abstractions)
for getting my code to work with different [[sequential decision environment|environment]]s.
I think [[2019MuldalEtAlDm_envPythonInterface|dm_env]]'s interface is basically correct
but it's implemented in a [[functional programming|stateful]] way
and I'm working on a [[functional programming|pure functional programming]] [[2018BradburyEtAlJAXComposableTransformations|JAX]] implementation.
In this paradigm,
we explicitly pass around the state,
so it's confusing to model environments using [[object oriented language|class]]es:
What's an "instance" of an environment? Nonsense.
An [[sequential decision environment|environment]] is just the tuple `(reset, step)`,
where `reset: (EnvParams, Key) -> TimeStep` and `step: (EnvState, Action, EnvParams, Key) -> TimeStep[EnvState]`.
A wrapper is just a mapping `(reset, step) ↦ (reset_wrapped, step_wrapped)`.
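As a sketch of what this looks like in practice (the type names and the toy random-walk environment are illustrative assumptions, not the exact interface in my code):

```python
from typing import Callable, NamedTuple
import jax
import jax.numpy as jnp


class TimeStep(NamedTuple):
    """The environment state is carried explicitly instead of hidden in an object."""
    state: jnp.ndarray        # EnvState
    observation: jnp.ndarray
    reward: jnp.ndarray
    done: jnp.ndarray


def reset(params: dict, key: jax.Array) -> TimeStep:
    """reset: (EnvParams, Key) -> TimeStep, here for a toy 1-D random walk."""
    state = jax.random.uniform(
        key, (), minval=-params["start_range"], maxval=params["start_range"]
    )
    return TimeStep(state=state, observation=state, reward=jnp.array(0.0), done=jnp.array(False))


def step(state: jnp.ndarray, action: jnp.ndarray, params: dict, key: jax.Array) -> TimeStep:
    """step: (EnvState, Action, EnvParams, Key) -> TimeStep[EnvState]."""
    noise = params["noise_scale"] * jax.random.normal(key, ())
    new_state = state + action + noise
    return TimeStep(
        state=new_state,
        observation=new_state,
        reward=-jnp.abs(new_state),               # reward for staying near the origin
        done=jnp.abs(new_state) > params["bound"],
    )


def with_reward_scale(reset_fn: Callable, step_fn: Callable, scale: float):
    """A wrapper maps (reset, step) to (reset_wrapped, step_wrapped)."""
    def step_wrapped(state, action, params, key):
        ts = step_fn(state, action, params, key)
        return ts._replace(reward=scale * ts.reward)
    return reset_fn, step_wrapped
```

Because `reset` and `step` are pure functions, they compose directly with `jax.jit` and `jax.vmap` over batches of environment states.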
# [[reinforcement learning]]
The most annoying part was when my code didn't work.
![[code doesnt work code works meme.jpg]]
[[debugging and errors|Debugging]] in [[machine learning]] feels like a [[episodic reward|sparse reward]] problem.
A lot of the time I was just stuck, not sure what had introduced the issue.
Was it an [[discount factor|off-by-one]] error?
Should I tune [[hyperparameter]]s?
I should have watched / read a lot more advice on [[reinforcement learning debugging tips and implementation advice|rl debugging]] before getting started.
And by that I mean before starting [[reinforcement learning]] at all, since after all that advice I might not have wanted to start in the first place.
I watched [[2016SchulmanNutsBoltsDeep|The Nuts and Bolts of Deep RL Research]] my first week.
[[2021JonesDebuggingReinforcementLearning|Debugging Reinforcement Learning Systems]] is also a gem.
![[2021JonesDebuggingReinforcementLearning]]
In addition,
I found a learning-rate warmup to be helpful,
as well as a slow decay afterwards.
The trajectories collected at the very beginning are basically random
so it doesn't make sense to take large gradient steps on them just yet,
since the data distribution will change rapidly throughout training.
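A minimal sketch of such a schedule, assuming optax and using placeholder numbers rather than the hyperparameters I actually used (cosine decay is just one way to get the slow decay):

```python
import optax

# Warm up from ~0 while the early trajectories are still mostly random,
# then decay slowly over the rest of training.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=3e-4,      # placeholder peak learning rate
    warmup_steps=1_000,
    decay_steps=100_000,
    end_value=3e-5,
)
optimizer = optax.adam(learning_rate=schedule)
```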