Eka - 4 months ago 34

R Question

I am learning about q-learning and found a Wikipedia post and this website.

According to the tutorials and pseudo code I wrote this much in R

`#q-learning example`

#http://mnemstudio.org/path-finding-q-learning-tutorial.htm

#https://en.wikipedia.org/wiki/Q-learning

set.seed(2016)

iter=100

dimension=5;

alpha=0.1 #learning rate

gamma=0.8 #exploration/ discount factor

# n x n matrix

Q=matrix( rep( 0, len=dimension*dimension), nrow = dimension)

Q

# R -1 is fire pit,0 safe path and 100 Goal state########

R=matrix( sample( -1:0, dimension*dimension,replace=T,prob=c(1,2)), nrow = dimension)

R[dimension,dimension]=100

R #reward matrix

################

for(i in 1:iter){

row=sample(1:dimension,1)

col=sample(1:dimension,1)

I=Q[row,col] #randomly choosing initial state

Q[row,col]=Q[row,col]+alpha*(R[row,col]+gamma*max(Qdash-Q[row,col])

#equation from wikipedia

}

But I have problem in

`max(Qdash-Q[row,col]`

`Max[Q(next state, all actions)]`

The second problem is this pseudo code

`Do While the goal state hasn't been reached.`

Select one among all possible actions for the current state.

Using this possible action, consider going to the next state.

Get maximum Q value for this next state based on all possible actions.

Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]

Set the next state as the current state.

End Do

Is it this

`while(Q<100){`

Q[row,col]=Q[row,col]+alpha*(R[row,col]+gamma*max(Qdash-Q[row,col])

}

Answer

This post is by no means a complete implementation of Q-learning in R. It is an attempt to answer the OP with regards to the description of the algorithm in the website linked in the post and in Wikipedia.

The assumption here is that the reward matrix `R`

is as described in the website. Namely that it encodes reward values for possible actions as non-negative numbers, and -1's in the matrix represent null values (i.e., where there is no possible action to transition to that state).

With this setup, an R implementation of the `Q`

update is:

```
Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
```

where

`cs`

is the current state at the current point in the path.`ns`

is the new state based on a (randomly) chosen action at the current state. This action is chosen from the collection of possible actions at the current state (i.e., for which`R[cs,] > -1`

). Since the state transition itself is deterministic here, the action is the transition to the new state.For this action resulting in

`ns`

, we want to add its maximum (future) value over all possible actions that can be taken at`ns`

. This is the so-called`Max[Q(next state, all actions)]`

term in the linked website and the "estimate of optimal future value" in Wikipedia. To compute this, we want to maximize over the`ns`

-th row of`Q`

but consider only columns of`Q`

for which columns of`R`

at the corresponding`ns`

-th row are valid actions (i.e., for which`R[ns,] > -1`

). Therefore, this is:`max(Q[ns, which(R[ns,] > -1)])`

An interpretation of this value is a one-step look ahead value or an estimate of the cost-to-go in dynamic programming.

The equation in the linked website is the special case in which

`alpha`

, the learning rate, is`1`

. We can view the equation in Wikipedia as:`Q[cs,ns] <- (1-alpha)*Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]))`

where

`alpha`

"interpolates" between the old value`Q[cs,ns]`

and the learned value`R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)])`

. As noted in Wikipedia,In fully deterministic environments, a learning rate of

`alpha=1`

is optimal

Putting it all together into a function:

```
q.learn <- function(R, N, alpha, gamma, tgt.state) {
## initialize Q to be zero matrix, same size as R
Q <- matrix(rep(0,length(R)), nrow=nrow(R))
## loop over episodes
for (i in 1:N) {
## for each episode, choose an initial state at random
cs <- sample(1:nrow(R), 1)
## iterate until we get to the tgt.state
while (1) {
## choose next state from possible actions at current state
## Note: if only one possible action, then choose it;
## otherwise, choose one at random
next.states <- which(R[cs,] > -1)
if (length(next.states)==1)
ns <- next.states
else
ns <- sample(next.states,1)
## this is the update
Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
## break out of while loop if target state is reached
## otherwise, set next.state as current.state and repeat
if (ns == tgt.state) break
cs <- ns
}
}
## return resulting Q normalized by max value
return(100*Q/max(Q))
}
```

where the input parameters are:

`R`

is the rewards matrix as defined in the blog`N`

is the number of episodes to iterate`alpha`

is the learning rate`gamma`

is the discount factor`tgt.state`

is the target state of the problem.

Using the example in the linked website as a test:

```
N <- 1000
alpha <- 1
gamma <- 0.8
tgt.state <- 6
R <- matrix(c(-1,-1,-1,-1,0,-1,-1,-1,-1,0,-1,0,-1,-1,-1,0,-1,-1,-1,0,0,-1,0,-1,0,-1,-1,0,-1,0,-1,100,-1,-1,100,100),nrow=6)
print(R)
## [,1] [,2] [,3] [,4] [,5] [,6]
##[1,] -1 -1 -1 -1 0 -1
##[2,] -1 -1 -1 0 -1 100
##[3,] -1 -1 -1 0 -1 -1
##[4,] -1 0 0 -1 0 -1
##[5,] 0 -1 -1 0 -1 100
##[6,] -1 0 -1 -1 0 100
Q <- q.learn(R,iter,alpha,gamma,tgt.state)
print(Q)
## [,1] [,2] [,3] [,4] [,5] [,6]
##[1,] 0 0 0.0 0 80 0.00000
##[2,] 0 0 0.0 64 0 100.00000
##[3,] 0 0 0.0 64 0 0.00000
##[4,] 0 80 51.2 0 80 0.00000
##[5,] 64 0 0.0 64 0 100.00000
##[6,] 0 80 0.0 0 80 99.99994
```