- Intelligent Projects Using Python
- Santanu Pattanayak
Q-learning
We will now look at a popular reinforcement learning algorithm, called Q-learning. Q-learning is used to determine an optimal action selection policy for a given finite Markov decision process. A Markov decision process is defined by a state space, $S$; an action space, $A$; a set of immediate rewards, $R$; a transition probability, $P(s_{t+1} \mid s_t, a_t)$, of the next state, $s_{t+1}$, given the current state, $s_t$, and the current action, $a_t$; and a discount factor, $\gamma$. The following diagram illustrates a Markov decision process, where the next state depends on the current state and the action taken in the current state:
*(Figure: a Markov decision process, in which the next state depends on the current state and the action taken in that state.)*
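To fix ideas, a finite Markov decision process can be written down explicitly as tables of transition probabilities and rewards. The tiny two-state, two-action example below is purely illustrative; all of the numbers are made up:

```python
# A toy finite MDP with 2 states and 2 actions (illustrative values only).
# P[s][a] is a list of (next_state, probability) pairs; R[s][a] is the immediate reward.
P = {
    0: {0: [(0, 0.8), (1, 0.2)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)],           1: [(0, 0.3), (1, 0.7)]},
}
R = {
    0: {0: 0.0, 1: 1.0},
    1: {0: 0.5, 1: 2.0},
}
gamma = 0.9  # discount factor
```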
Let's suppose that we have a sequence of states, actions, and corresponding rewards, as follows:
$$s_t, a_t, r_t,\; s_{t+1}, a_{t+1}, r_{t+1},\; s_{t+2}, a_{t+2}, r_{t+2},\; \ldots,\; s_T, a_T, r_T$$
If we consider the long-term reward, $R_t$, at step $t$, it is equal to the sum of the immediate rewards at each step, from $t$ until the end of the episode, as follows:
$$R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_T$$
Now, a Markov decision process is a stochastic process, and we cannot expect to reach the same next state, $s_{t+1}$, from $s_t$ and $a_t$ every time; so, we apply a discount factor, $\gamma$, to future rewards. This means that the long-term reward can be better represented as follows:
$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T$$
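To make the discounting concrete, the sketch below computes $R_t$ from a list of immediate rewards by accumulating them backwards from the last step; the reward values and the discount factor of 0.9 are made-up illustrative numbers, not taken from the book:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite episode."""
    total = 0.0
    for r in reversed(rewards):      # work backwards: R_k = r_k + gamma * R_{k+1}
        total = r + gamma * total
    return total

# Example: three small rewards followed by a terminal reward of 10
print(discounted_return([1, 1, 1, 10]))  # 1 + 0.9*1 + 0.81*1 + 0.729*10 = 10.0
```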
Since at time step $t$ the immediate reward, $r_t$, is already realized, maximizing the long-term reward amounts to maximizing the long-term reward from time step $t+1$ onwards (that is, $R_{t+1}$) by choosing an optimal action. The maximum long-term reward expected in a state $s_t$ when taking an action $a_t$ is represented by the following Q-function:
$$Q(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \qquad (1)$$
In each state, $s \in S$, the agent in Q-learning tries to take an action, $a \in A$, that maximizes its long-term reward. The Q-learning algorithm is an iterative process, the update rule of which is as follows:
$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right]$$
As you can see, the algorithm is inspired by the notion of a long-term reward, as expressed in (1).
The overall cumulative reward, $Q(s_t, a_t)$, of taking action $a_t$ in state $s_t$ depends on the immediate reward, $r_t$, and the maximum long-term reward that we can hope for from the new state, $s_{t+1}$. In a Markov decision process, the new state $s_{t+1}$ is stochastically dependent on the current state, $s_t$, and the action taken, $a_t$, through a probability mass/density function of the form $P(s_{t+1} \mid s_t, a_t)$.
The algorithm keeps updating the expected long-term cumulative reward by taking a weighted average of the old estimate and the newly observed long-term reward, with the weight determined by the learning rate, $\alpha$.
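The update rule translates almost directly into code. The following is a minimal tabular sketch of a single Q-learning update; the problem sizes, the hyperparameter values, and the function name `q_update` are assumptions for illustration only:

```python
import numpy as np

n_states, n_actions = 16, 4          # assumed sizes for a small toy problem
Q = np.zeros((n_states, n_actions))  # tabular Q-function, Q[s, a]
alpha, gamma = 0.1, 0.9              # assumed learning rate and discount factor

def q_update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```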
Once we have built the $Q(s, a)$ function through the iterative algorithm, while playing the game in a given state $s$ we can take the best action, $a^{*}$, as the policy that maximizes the Q-function:
$$a^{*} = \arg\max_{a \in A} Q(s, a)$$
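Continuing the assumed tabular setup from the previous snippet, reading this greedy policy off the learned table is a one-liner:

```python
def best_action(s):
    """Greedy policy: the action that maximizes Q(s, a) in state s."""
    return int(np.argmax(Q[s]))

print(best_action(3))  # action the learned policy would take in state 3 (assumed state index)
```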