1. Q-learning

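Q-learning aims to learn the value of taking a particular action in a particular state. It builds a Q-table with states as rows and actions as columns, and updates the table entries using the reward each action brings.
It is an off-policy method: the behavior policy (used to pick actions) and the evaluation (target) policy are not the same policy.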
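The table update described above can be sketched roughly as follows. This is a minimal illustration, not the `RL` object used in this post: the class name `QTableSketch`, the `done` flag, and the default values of `alpha`, `gamma`, and `epsilon` are all assumptions made for the example.

```python
import random
from collections import defaultdict


class QTableSketch:
    """Tabular Q-learning sketch; names and defaults are illustrative assumptions."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate (assumed value)
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        # One row per state, one column per action, initialised to zero.
        self.q = defaultdict(lambda: {a: 0.0 for a in self.actions})

    def choose_action(self, state):
        # Behavior policy: epsilon-greedy, exploring with probability epsilon.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        row = self.q[state]
        best = max(row.values())
        # Break ties between equally valued actions at random.
        return random.choice([a for a, v in row.items() if v == best])

    def learn(self, s, a, r, s_next, done=False):
        # Target policy: greedy, so bootstrap from the max over next actions.
        target = r if done else r + self.gamma * max(self.q[s_next].values())
        self.q[s][a] += self.alpha * (target - self.q[s][a])


agent = QTableSketch(actions=[0, 1], alpha=0.5, gamma=0.9, epsilon=0.0)
agent.learn("s0", 1, 1.0, "s1", done=True)  # terminal step: target is the reward
agent.learn("s_prev", 0, 0.0, "s0")         # bootstraps from max Q in "s0"
print(agent.q["s0"], agent.q["s_prev"])
```

The off-policy nature shows up in the two methods: `choose_action` may explore, while `learn` always bootstraps from the greedy maximum regardless of which action is taken next.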

def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        while True:
            # refresh the environment display
            env.render()
            # RL choose action based on observation
            action = RL.choose_action(str(observation))
            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)
            # RL learn from this transition
            RL.learn(str(observation), action, reward, str(observation_))
            # swap observation
            observation = observation_
            # break while loop when end of this episode
            if done:
                break
    # end of game
    print('game over')
    env.destroy()

The SARSA version of the same training loop (see the SARSA section below):

def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        # RL choose action based on observation
        action = RL.choose_action(str(observation))
        while True:
            # refresh the environment display
            env.render()
            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)
            # RL choose action based on next observation
            action_ = RL.choose_action(str(observation_))
            # RL learn from this transition (s, a, r, s', a') ==> Sarsa
            RL.learn(str(observation), action, reward, str(observation_), action_)
            # swap observation and action
            observation = observation_
            action = action_
            # break while loop when end of this episode
            if done:
                break
    # end of game
    print('game over')
    env.destroy()

2. SARSA

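SARSA is on-policy: its behavior policy and evaluation (target) policy are the same. It chooses the next action first and then updates, and the action it actually takes is the same one used in the update.
Q-learning, by contrast, first assumes the next step will take the action with the highest estimated value and updates the value function on that basis; only afterwards does it pick the actual next action with a stochastic policy (ε-greedy is a common choice), so the action taken may differ from the one assumed in the update.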
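To make the difference concrete, here is a small sketch of the two update targets. The function names, the example Q-values, and the default `gamma` are hypothetical and chosen only for illustration.

```python
def q_learning_target(q_next, reward, gamma=0.9):
    """Off-policy target: bootstrap from the greedy (max-value) next action."""
    return reward + gamma * max(q_next.values())


def sarsa_target(q_next, next_action, reward, gamma=0.9):
    """On-policy target: bootstrap from the action that will actually be taken."""
    return reward + gamma * q_next[next_action]


# Hypothetical Q-values for the next state s'.
q_next = {"left": 1.0, "right": 2.0}

# Suppose the behavior policy explored and picked "left" in s'.
print(q_learning_target(q_next, reward=0.0))     # bootstraps from "right"
print(sarsa_target(q_next, "left", reward=0.0))  # bootstraps from "left"
```

The two targets coincide whenever the chosen next action happens to be the greedy one; they differ only on exploratory steps, which is where on-policy and off-policy learning diverge.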

Reference: https://www.zhihu.com/column/c_1291396094732595200

Original link: https://blog.csdn.net/m0_51607165/article/details/126540683