Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States | Steady Practice | SteadyPractice