RL Without TD Learning: Divide-and-Conquer Cracks Long-Horizon Off-Policy Challenges
Reinforcement learning without temporal-difference (TD) learning? A divide-and-conquer approach makes it possible, cutting the depth of error accumulation from linear to logarithmic in the horizon on long-horizon tasks where TD-based methods struggle. This could reshape off-policy RL.
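To make the "logarithmic" claim concrete, here is a rough back-of-the-envelope sketch. It assumes that error compounds once per level of bootstrapping, and that a divide-and-conquer update splits a trajectory segment in half at each level. The function names are hypothetical, and the code only counts recursion depth; it is not the algorithm itself.

```python
def td_bootstrap_depth(horizon: int) -> int:
    """One-step TD (assumed model): the value at the start of a trajectory
    is built from a chain of `horizon` bootstrapped targets."""
    return horizon


def divide_and_conquer_depth(horizon: int) -> int:
    """Divide and conquer (assumed model): halve the segment until it has
    length 1, counting how many levels of recursion that takes."""
    depth = 0
    length = horizon
    while length > 1:
        length = (length + 1) // 2  # the larger half, i.e. ceil(length / 2)
        depth += 1
    return depth


if __name__ == "__main__":
    for horizon in (8, 100, 1000):
        print(horizon, td_bootstrap_depth(horizon), divide_and_conquer_depth(horizon))
```

Under these assumptions, a 1000-step horizon means 1000 chained bootstraps for one-step TD but only 10 levels of recursion for the halving scheme.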