With Deep Q-Learning we can program AI agents that operate in environments with discrete action spaces.

A discrete action space consists of a finite set of well-defined actions, e.g. the AI agent can move either left or right.

The movement in each direction happens with a certain velocity. If the agent could also determine the velocity, we would have a continuous action space with an infinite number of possible actions (each movement with a different velocity). This case will be considered in the future.

In the last article, I introduced the concept of the action-value function Q(s,a) and the equation that defines it.

Q(s,a) tells the agent the value, or quality, of a possible action a in a particular state s.

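A higher quality means a better action with regard to the given objective. If we execute the expectation operator E in the definition of Q(s,a), we obtain a form of the action-value function that deals with probabilities explicitly. Our goal in Deep Q-Learning is to solve for the action-value function Q(s,a).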

Why do we want this? Knowing Q(s,a) would enable the agent to determine the quality of any possible action in any given state, and thus to behave accordingly.
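To make "behave accordingly" concrete, here is a minimal sketch that assumes the Q-values of one state are already known. The numbers and the helper name `best_action` are hypothetical and only serve as an illustration.

```python
# Hypothetical Q-values for a single state with two discrete actions.
q_values = {"left": 0.2, "right": 0.7}


def best_action(q_for_state):
    """Return the action with the highest estimated quality."""
    return max(q_for_state, key=q_for_state.get)


# With the values above, moving right has the higher quality.
action = best_action(q_values)
```

If Q(s,a) were known for every state, the agent could repeat this lookup in each state it visits.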

But since this equation is recursive and, furthermore, deals with probabilities, using it directly is not practical. Rather, we must use the so-called Temporal Difference (TD) learning algorithm to solve for Q(s,a) iteratively.
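For reference, the recursive definition is commonly written as shown below. The notation is the standard textbook one and is an assumption here, since the equation itself is not reproduced in this text: π is the policy, γ the discount factor, R the reward, and p the transition probabilities of the environment.

```latex
\begin{aligned}
Q^{\pi}(s,a)
  &= \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma\, Q^{\pi}(S_{t+1}, A_{t+1}) \,\middle|\, S_t = s,\ A_t = a \right] \\
  &= \sum_{s',\, r} p(s', r \mid s, a) \left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \right]
\end{aligned}
```

The first line shows the recursion (Q appears on both sides). The second line shows the probabilities that appear once the expectation is written out.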

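The estimated return is also called the TD-Target. The TD-Learning algorithm can be summarized in a few steps, illustrated by the accompanying figure. Assume the AI agent is in state s (blue arrow). In short, the agent takes an action a in state s, observes the reward and the next state, forms the TD-Target from them, and moves Q(s,a) a small step toward that target. If you look at the definition of Q(s,a), the right side of the equation is also what we call the TD-Target.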
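A single TD-Learning step can be sketched as follows. This is a minimal tabular version under stated assumptions: the learning rate `ALPHA`, the discount factor `GAMMA`, the state names and the function name `sarsa_update` are all hypothetical choices, not values taken from the article.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)


def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=ALPHA, gamma=GAMMA):
    """One TD step: move Q[(s, a)] toward the TD-Target."""
    # TD-Target: immediate reward plus the discounted estimate for the
    # action that is actually taken in the next state.
    td_target = reward + gamma * Q[(s_next, a_next)]
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_target


# Q-table with a default value of 0.0 for unseen state-action pairs.
Q = defaultdict(float)
Q[("s1", "right")] = 1.0

# The agent was in s0, moved right, received 0.5 and will move right again in s1.
target = sarsa_update(Q, "s0", "right", reward=0.5, s_next="s1", a_next="right")
```

Here the TD-Target is 0.5 + 0.9 * 1.0 = 1.4, and Q for the pair (s0, right) moves one tenth of the way from 0 toward it.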

SARSA is a good example of the special kind of learning algorithms that are called on-policy algorithms. This means we are following and improving the same policy at the same time.

We finally arrive at the heart of the article, where we will discuss the concept of Q-Learning. But first we must take a look at the second special type of algorithms, which are called off-policy algorithms.

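Off-policy algorithms distinguish between a behavior policy, which the agent follows, and a target policy, which is being learned. In the case of SARSA, the behavior policy would be the policy that we follow and try to optimize at the same time. This concept will become clearer in the next section, where actual calculations are made.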

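In Q-Learning, the target policy is determined by Q(s,a), which means that our strategy is to take the actions that result in the highest values of Q. That yields a target policy that selects, in each state s, the action a that maximizes Q(s,a).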
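Written out in a common notation, this target policy reads as follows. The symbol π for the policy is an assumption, since the original formula is not reproduced in this text.

```latex
\pi(s) = \arg\max_{a} Q(s, a)
```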

In this case, the target policy is called the Greedy-Policy. Greedy-Policy means that we only pick actions that result in the highest Q(s,a) values. With a greedy target policy, the TD-Target in the TD-learning update step for Q(s,a) uses the highest Q-value available in the next state. The TD-Learning algorithm for Q(s,a) with a greedy target policy is easiest to follow with an example.

Consider the previous figure. Following the greedy target policy, the agent would take the action with the highest Q-value (blue path in the figure). If you look at the update rule for Q(s,a), you may recognize that we do not get any updates if the TD-Target and Q(s,a) have the same value. In this case, Q(s,a) has converged to the true action-values and the goal is achieved.
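The update with a greedy target policy can be sketched in tabular form as below. The action names, the default values for `alpha` and `gamma`, and the function name `q_learning_update` are assumptions made for this illustration.

```python
from collections import defaultdict

ACTIONS = ["left", "right"]  # hypothetical discrete action space


def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One Q-Learning step with a greedy target policy; returns the TD error."""
    # Greedy target: use the best Q-value in the next state, regardless of
    # which action the agent will actually take there.
    td_target = reward + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)
Q[("s1", "left")] = 0.2
Q[("s1", "right")] = 1.0

# First update: the TD-Target (1.4) differs from Q[(s0, right)] (0.0).
first_error = q_learning_update(Q, "s0", "right", reward=0.5, s_next="s1")

# If Q[(s0, right)] already equals the TD-Target, the update changes nothing.
Q[("s0", "right")] = 0.5 + 0.9 * 1.0
second_error = q_learning_update(Q, "s0", "right", reward=0.5, s_next="s1")
```

The second call illustrates the convergence argument: when the TD-Target and Q(s,a) agree, the TD error is zero and the value stays where it is.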

This means that our objective is to minimize the distance between the TD-Target and Q(s,a), which can be expressed by the squared error loss function. This loss function can be minimized with the usual gradient descent algorithms.
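A minimal numeric sketch of this loss and one gradient descent step is shown below. It assumes a single scalar estimate in place of a neural network, and it treats the TD-Target as a constant during the step. The function names, `gamma` and the learning rate are hypothetical.

```python
def td_target(reward, next_q_values, gamma=0.9):
    """TD-Target under a greedy target policy."""
    return reward + gamma * max(next_q_values)


def squared_error_loss(target, q_sa):
    """Squared distance between the TD-Target and the current estimate."""
    return (target - q_sa) ** 2


target = td_target(reward=0.5, next_q_values=[0.2, 1.0])  # 0.5 + 0.9 * 1.0
q_sa = 1.0
loss_before = squared_error_loss(target, q_sa)

# Gradient of the loss with respect to q_sa, holding the target fixed.
grad = -2.0 * (target - q_sa)

learning_rate = 0.1  # assumed value
q_sa_new = q_sa - learning_rate * grad
loss_after = squared_error_loss(target, q_sa_new)
```

In a real Deep Q-Learning model the gradient would be taken with respect to the network parameters instead of a single number, but the direction of the step follows the same idea.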

In Deep Q-Learning, the TD-Target is computed by a Target-Network and Q(s,a) by a Q-Network. The parameters of the Target-Network are frozen in time. They get updated after n iterations with the parameters of the Q-Network. Research has shown that using two different neural networks for the TD-Target and Q(s,a) calculations leads to better stability of the model.
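The periodic parameter copy can be sketched without any deep learning library. In the sketch below, `TinyQNetwork` is a hypothetical stand-in that only holds a list of weights, the "training step" is a placeholder, and `SYNC_EVERY` plays the role of n with an assumed value.

```python
import copy


class TinyQNetwork:
    """Hypothetical stand-in for a neural network: it only stores weights."""

    def __init__(self, weights):
        self.weights = list(weights)


SYNC_EVERY = 3  # plays the role of n (assumed value)

q_net = TinyQNetwork([0.0, 0.0])
target_net = copy.deepcopy(q_net)  # starts as an exact copy

sync_steps = []
for step in range(1, 8):
    # Placeholder for one gradient descent step on the Q-Network.
    q_net.weights = [w + 0.1 for w in q_net.weights]

    # The Target-Network stays frozen except every SYNC_EVERY steps.
    if step % SYNC_EVERY == 0:
        target_net.weights = list(q_net.weights)
        sync_steps.append(step)
```

Between two copies, the TD-Target is computed from parameters that do not change, which is the property the stability argument relies on.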

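With a small probability (commonly written ε), the agent picks a random action. Otherwise, the action is chosen greedily according to the learned action-value Q(s,a). Decision-making with regard to which action to take involves a fundamental choice between exploiting what the agent already knows and exploring alternatives.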
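This selection rule is commonly known as an epsilon-greedy policy. Here is a minimal sketch, assuming the Q-values of the current state are given as a dictionary. The function name `epsilon_greedy` and the optional `rng` parameter are hypothetical.

```python
import random


def epsilon_greedy(q_for_state, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all available actions.
        return rng.choice(list(q_for_state))
    # Exploit: choose the action with the highest learned Q-value.
    return max(q_for_state, key=q_for_state.get)


q_values = {"left": 0.2, "right": 0.7}

# epsilon = 0.0 never explores, so the greedy action is always returned.
greedy_choice = epsilon_greedy(q_values, epsilon=0.0)

# epsilon = 1.0 always explores; a seeded generator keeps the run repeatable.
seeded = random.Random(0)
explored = {epsilon_greedy(q_values, epsilon=1.0, rng=seeded) for _ in range(200)}
```

Values of epsilon between these two extremes trade off how often the agent tries something new against how often it uses what it has already learned.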

But purely greedy action selection may result in a problem. Sometimes there is an alternative action that leads, in the long term, to a better path through the sequence of states, but this alternative may never be taken if we only follow the behavior policy.

In this case, we exploit the current policy but do not explore alternative actions. Deliberately trying such alternative actions is called exploration. The probability of taking a random action is usually reduced as a function of the number of iterations n.

It has also been shown that the neural network approach to estimating the TD-Target and Q(s,a) becomes more stable if the Deep Q-Learning model implements experience replay. All the concepts we have discussed previously are incorporated in the full Deep Q-Learning algorithm in the right order, exactly as it would be implemented in code.
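Experience replay is usually implemented as a fixed-size memory of past transitions from which random mini-batches are drawn. The sketch below is one simple way to do this with the Python standard library. The class name `ReplayBuffer`, its methods and the capacity are assumptions for illustration, not the article's implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry once it is full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Draw distinct transitions uniformly at random.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action="right", reward=1.0, next_state=t + 1)

# Only the three most recent transitions remain in memory.
batch = buffer.sample(2, rng=random.Random(0))
```

Sampling at random, instead of training on transitions in the order they were observed, is what distinguishes replayed experience from the plain online updates shown earlier in the article.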

