In this post, I'm going to go into detail about how the natural gradient works, explained at a level for people with only a small understanding of the linear algebra and statistical methods.

First off, we must understand standard gradient descent.

Let's say we have a neural network, parameterized by some vector of parameters. We want to adjust the parameters of this network, so the output of our network changes in some way. In most cases, we will have a loss function that tells our network how it's output should change.

Using backpropagation, we calculate the derivatives of each parameter with respect to our loss function. These derivatives represent the direction in which we can update our parameters to get the biggest change in our loss function, and is known as the gradient.

We can then adjust the parameters in the direction of the gradient, by a small distance, in order to train our network.

A slight problem appears in how we define "a small distance".

In standard gradient descent, distance means Euclidean distance in the parameter space.

```
#for example, with two parameters (a, b)
distance = sqrt(a^2 + b^2)
```

However, defining "distance" in terms of how much our parameters is adjusted isn't always correct. To visualize this, let's take a look at two pairs of gaussian distributions.

In both distributions, the mean is changed from -1 to 1, a distance of two. However, it's clear that the first distribution changed far more than the second.

This leads to a key insight: our gradient measures how much our output is affected by changing a parameter. However, this affect on the output must be seen in context: a shift of +2 in the first distribution means a lot more than a shift of +2 in the second one.

What the natural gradient does is redefine the "small distance" we update our parameters by. Not all parameters are equal. Rather than treating a change in every parameter equally, we need to scale each parameter's change according to how much it affects our network's entire output distribution.

First off, let's define a new form of distance, that corresponds to distance based on KL divergence, a measure of how much our new distribution differs from our old one.

We do this by defining a metric matrix, which allows us to calculate the distance of a vector according to some custom measurement.

For a network with 5 parameters, our metric matrix is 5x5. To compute the distance of a change in parameters `delta`

using `metric`

, we use the following:

```
totaldistance = 0
for i in xrange(5):
for j in xrange(5):
totaldistance += delta[i] * delta[i] * metric[i][j]
```

If our metric matrix is the identity matrix, the distance is the same as if we just used Euclidean distance.

However, most of the time our metric won't be the identity matrix. Having a metric allows our measurement of distance to account for relationships between the various parameters.

As it turns out, we can use the Fisher Information Matrix as a metric, and it will measure the distance of `delta`

in terms of KL divergence.

The Fisher Information Matrix is the second derivative of the KL divergence of our network with itself. For more information on why, see this article.

The concept of what a Fisher matrix is was confusing for me: the code below might help clear things up.

```
# KL divergence between two paramaterized guassian distributions
def gauss_KL(mu1, logstd1, mu2, logstd2):
var1 = tf.exp(2*logstd1)
var2 = tf.exp(2*logstd2)
kl = tf.reduce_sum(logstd2 - logstd1 + (var1 + tf.square(mu1 - mu2))/(2*var2) - 0.5)
return kl
# KL divergence of a gaussian with itself, holding first argument fixed
def gauss_KL_firstfixed(mu, logstd):
mu1, logstd1 = map(tf.stop_gradient, [mu, logstd])
mu2, logstd2 = mu, logstd
return gauss_KL(mu1, logstd1, mu2, logstd2)
# self.action_mu and self.action_logstd are computed elsewhere in the code: they are the outputs of our network
kl_fixed = gauss_KL_firstfixed(self.action_mu, self.action_logstd)
first_derivative = self.gradients(kl_fixed, parameter_list)
fisher_matrix = self.gradients(first_derivative, parameter_list)
```

Now we've got a metric matrix that measures distance according to KL divergence when given a change in parameters.

With that, we can calculate how our standard gradient should be scaled.

```
natural_grad = inverse(fisher) * standard_grad
```

For an explanation on how to get the above formula, see this article.

Notice that if the Fisher is the identity matrix, then the standard gradient is equivalent to the natural gradient. This makes sense: using the identity as a metric results in Euclidean distance, which is what we were using in standard gradient descent.

Finally, I want to note that computing the natural gradient may be computationally intensive. In an implementation of TRPO, the conjugate gradient algorithm is used to solve for `natural_grad`

in `natural_grad * fisher = standard_grad`

.

Information in this post was taken from these three articles. I don't have any code specifcally for this, but my TRPO code uses the natural gradient and is relatively well-commented.

]]>However, the algorithm does take quite a bit of time to converge on these environments. In this post,

]]>However, the algorithm does take quite a bit of time to converge on these environments. In this post, I'll go over the methods and thinking behind using both parallelization and parameter adaptation for a more efficient algorithm, following this request for research.

The first thing we need to understand: what exactly is TRPO? The paper on the subject goes into more detail, but essentially, it's a policy gradient method with a fancy gradient optimizer.

Rather than updating the policy's parameters directly in accordance to its gradient, we use the natural gradient. This limits the KL divergence between our new and old policy, instead of limiting the change in our parameter values (stepsize in traditional gradient descent). For more details on the natural gradient, check out this post.

Other than the special gradient function, TRPO behaves as a standard policy gradient method. We run through the environment many times and calculate a value function that estimates the total future reward of being in a certain state. Then, we update the our agent's policy as to increase the probability of entering states with greater predicted value. Repeat this many times and (hopefully) the agent will learn the optimal policy!

Here's the big question. How can we make it faster? To get some hints as to where to start, I ran a trial on Reacher-v1 and measured how long the agent spent on each phase.

Clearly, it's taking a long time gathering experience! This makes sense: we're collecting 20,000 timesteps of data before every update, and each timestep requires the computer to not only simulate the environment, but run through the policy as well.

We now know which portion requires improvement, but how do we go about doing that?

Computers these days don't run only on a single thread; multiple things can be happening at the same time. Of course, how many threads we can use is limited by the number of cores our CPU has.

We can take advantage of this by splitting up the work of gathering experience. Instead of having a single actor run the environment 20,000 times, we can have multiple actors run only 6,000 times each.

Single vs Multiple actor setup

This is where the Monte Carlo sampling portion of TRPO comes in handy.

In some reinforcement learning algorithms, such as one-step Q learning, we take one step in the environment, update the policy, and repeat. This often involves having another neural network represent the value function, and training it in addition to the policy.

With Monte Carlo, we record 20,000 timesteps with the exact same policy before every update, and recalculate the value function every time. This is more stable, as a neural network doesn't always represent the value function accurately, especially in earlier iterations.

Running through all the timesteps with the same policy is key in parallelization: in python multiprocessing, separate threads can't communicate with each other until the task has been finished.

If we were using, say, the one-step Q learning method, each actor would have to keep its own independent policy and update it after every timestep. Then they would average their policies every few hundred updates. This might lead to suboptimal results that differ from a single-threaded implementation.

However, in our case, we don't update the policy until all the threads have finished. The data we collect is exactly the same as if we were single-threaded!

I ran both a single-threaded and a 5-threaded version on a 5 core CPU to see how they compared.

In every case, the parallel implementation was faster in terms of wall-clock time by a major factor. In a completely ideal situation, the speedup would have been 5x, plus or minus some time taken to merge the experience and compute a gradient update.

A potential method to improve even further is to simply add more threads. I ran trials using varying amounts of threads on Reacher-v1, an environment proven to be stable.

As it turns out, there are some big diminishing returns on just blindly adding threads. Using two threads reduces the time taken to collect data by 50%, three by 66%, and from four onwards we start to flatline.

The most likely reason is that we reached a hardware limit. My computer doesn't have infinite parallel computing power, it only has 5 cores. If we had used a machine with, for example, 16 cores, I'm sure the speeds would go down even lower before leveling off.

Finally, there will always be a bit of overhead to running the learner, regardless of how fast we can collect the data. The green line in the first graph shows the time taken to actually run gradient computation and create a new policy. The 5 threaded version still takes a majority of the time collecting data, but as the times go lower, we would have to start thinking about optimizing other parts of the learning process.

So we've got a parallel method that speeds up our data collection significantly. However, we're still using this data inefficiently. For every update, we collect 20,000 timesteps worth of experience, use it to make one small update, then discard it. Is there a way to improve this?

One potential option is to decrease the number of timesteps we experience between each update. We're only using this experience to estimate a value function, and we can potentially do just fine with less samples. I ran some tests on Reacher to see how reducing the timesteps/update affected the learning curve.

It works! We can get faster convergence by reducing the steps/update. However, there are downsides as well. Using less experience results in a less accurate value function estimate, which in turn results in a less accurate policy update.

When comparing by the number of updates, rather than number of timesteps, we can see that using 1000 steps/update results in a noisy and sub-optimal update.

However, at least for this environment, we can converge faster using many less-accurate updates, rather than a few accurate but experience-intensive updates.

In a similar vein, we can also adjust the maximum KL parameter.

In TRPO, we update the policy using a natural gradient. Rather than having a stepsize to limit our change, as in traditional gradient updates, we have a maximum KL that the policy is allowed to diverge.

Conceptually, we can think of stepsize and max KL as the same kind of setting: how much we want to change the policy.

Of course, we always want to be taking the biggest update we can, assuming the gradient direction is accurate and not overshooting the optimal policy. Again, I ran a test to see how the agent performed with a bigger max KL parameter.

A greater max KL means we converge faster! Of course, Reacher is a simpler environment where the policy is usually updating in the same direction. The same cannot be said for every environment. Using too big of a max KL would not only have a chance to overshoot the optimal policy, but it would also make it harder to make the small and precise updates required when nearing convergence.

Both the steps/update and max KL parameters have the potential to increase efficiency. However, they both have downsides that can lead to bad convergence.

The problem here is that the "perfect" parameters are different for each environment. For example, in a simple setting such as Reacher, updating only based on 1,000 steps may work fine. In a more complicated environment such as Swimmer, having so few steps wouldn't be enough to update accurately.

The blue lines show how having only a few steps/update leads to unstable results. In Reacher, the effect is not as pronounced: we still perform well, just with spikes of bad performance. In Swimmer, we really get the worst case. There is so little data that the policy can barely improve at all, keeping the average reward down near 40.

In both environments, 5,000 steps/update was the optimal parameter, with both high speeds and very little variation. In other environments, the optimal parameter may very well be something completely different. However, it is impractical to run a trial on every environment that for the best parameters: we want a method that can work across everything.

To find a solution, I tried to think about the problem from the perspective of what I would do if I could change these parameters in real-time.

If the total average reward is steadily rising, that means the policy is updating in the right direction. I would increase the KL, allowing the policy to make bigger updates, as they have proven to be accurate. I would also decrease the steps/update, as we know that the current steps/update leads to an accurate update, and we should squeeze the most value out of our experience.

On the other hand, if the policy was stuck and not improving, I would decrease the KL, as smaller steps would prevent overshooting the optimal policy. I would also increase the steps/update, in an attempt to get a more accurate update.

The adaptive method is simply these two rules, built-in to the agent itself. Every update, we compare the total reward to that of the last update. If we have improved, we can then reduce steps/update, and try for a more efficient but less accurate update. If we're stuck, we can increase steps/update, and sacrifice some efficiency for more accurate updates.

We can do the same with max KL, although we'd have to reverse the change (increase max KL when decreasing steps/update).

One thing to note is that increasing KL and decreasing steps/update essentially accomplish similar goals, and it may be redundant to do both.

I tested three kinds of adaptive methods: adapt the steps/update, adapt the max KL, or adapt both. For comparison, I plotted the reward curves using both the original parameters, and the optimal parameters for that environment using a brute-force search.

While we don't reach the performance of the optimal parameters on every environment, we definitely outperformed the original fixed parameters!

Interesting to note is that there is no definite best parameter to adapt. In Reacher and Swimmer, adapting steps/update performed better, while in HalfCheetah, adapting the max KL performed better.

We can also take a look at the trial where both steps/update and max KL were adapted. In Reacher, this extreme push for efficiency worked well, outperforming every other method.

However, in the other two environments, adapting both parameters fails horribly. We can't even pass the performance of the original parameters, instead we get stuck at a low reward with no signs of improvement.

After running the trials again, I realized the potential problem. The low steps/update and high max KL led the agent to make big updates in completely inaccurate directions. This put the policy in a state where a single update wouldn't make any improvement. The agent would then be stuck, constantly increasing and decreasing the steps/update due to random variation, rather than based on its performance.

Essentially, when the average reward was staying relatively constant, we weren't changing the parameters at all. This is bad! If we're not improving, we want to be increasing the steps/update so we have a more accurate gradient!

How could this be fixed? I implemented an improvement margin required in order for the agent to decrease the steps/update.

For example, we could impose a margin of 5%, meaning the agent would have to improve by 5% in order to decrease its steps/update. If the agent isn't improving by at least 5%, we'll increase the steps/update.

This ensures that when the average reward is neither moving up or down, it won't get stuck adapting its parameters back and forth, as described earlier. Instead, the situation is considered under the 5% margin, so the agent would increase the steps/update for better accuracy.

With pleasant surprise, the margin worked very well! The trials that adapted both KL and steps no longer diverged badly, instead performing on a level consistently near the top.

The performance on Reacher also took no hits: in the simple environment, the adaptive-margin was still able to focus on full efficiency and improve quickly.

Finally, one last thing to note is we don't want to adapt every single update. Rather, we sum up the rewards in the past 10 updates, and compare it to the 10 before that. This reduces variation and gets us a better picture of how the policy is improving.

Utilizing parallelization and hyperparameter adaptation, we were able to take the Trust Region Policy Optimization algorithm and increase its speed to a major extent.

I'm currently running more trials for Humanoid, a more complex environment, to see how the adaptive method performs when the optimal parameters change throughout learning.

I'll probably also upload some of these results to gym, and see how they compare to others.

Big thanks to John Schulman for helping me with many things and providing feedback!

Also, thanks to Danijar Hafner for helping a lot as well; we are currently working to update this preliminary paper to include the hyperparameter adaptation.

The code for this post is available on my Github.

]]>In this post, I'll go over the steps and thinking behind taking a working implementation of DRAW, and adjusting it to work on colored pictures.

The first question to ask is: what changed? In the original DRAW model, the input data consists of many 28x28 arrays of either 1 or 0. There's already a problem: images don't consist of only pure white or pure black pixels. They have a range, usually from 0 - 255.

To account for this, we'll replace the log likelihood loss function with an L2 loss, which takes a sum of the squared difference between each pixel.

```
//log likelihood loss func
self.generation_loss = tf.reduce_mean(-tf.reduce_sum(self.images * tf.log(1e-10 + self.generated_images) + (1-self.images) * tf.log(1e-10 + 1 - self.generated_images),1))
//l2 loss func
self.generation_loss = tf.nn.l2_loss(x - self.generated_images)
```

Although these two functions compute loss differently, they're essentially accomplishing the same task: a measurement of how well our image matches the original.

On a side note, this also brings up a key part in the variational autoencoder: balance between latent and generation loss.

The lower our latent loss is, the closer our latent vector distribution is to a unit gaussian, which forces the model to generalize. However, we also want the generation loss to be low, and have accurate reconstructions.

In some implementations, such as this one by ikostrikov, the total loss is the sum of the latent and generation loss. By changing the generation loss to L2, the "correct" ratio may not be one to one anymore.

I didn't look any farther into using training/test splits to find a perfect ratio, although that may be worth trying in the future. For our current model, a one-to-one ratio between L2 generation loss and the KL latent loss performed well.

In the DRAW model, we calculate an attention patch for every timestep, dictating where the model is observing.

green dots mark attention location

To allow for colored images, we need to decide how to handle attention with multiple color channels. When a human looks at something, it's looking at the same area for all colors. Therefore, the attention location should be exactly the same for all color channels in our model.

How can this be done in practice? It turns out we don't need to mess with any of the code for the attention filter.

Instead, we treat each color channel (red, green, blue) as a separate image. We're already handling multiple images simultaneously in our minibatch, so it all fits together nicely.

Essentially, instead of having a batchsize of 64, we have a "batchsize" of 64*3. The only thing we need to do now is take our 64-length vector of attention parameters and stretch it out into a 64*3 vector, with every three values being the same.

structure of stretched out batch, from a batch size of two images

Once we've passed the color-separated through the attention gate, we then have a 5x5 matrix for each color channel. These can be concatenated to form a 5x5x3 matrix for each image. Finally, we pass this 5x5x3 into the encoder, instead of the 5x5 in single-color DRAW.

How did it do? I ran the model trained on the celebA dataset, resized down to 64x64 each image.

the original images

the recreated images

The results are pretty blurry. However, they do resemble faces and match up to their respective original images.

The interesting part is how attention is shifted to accommodate a dataset of faces. We can plot the attention parameters for each face at every timestep to visualize it.

attention patches at each timestep

attention patches on top of image

non-animated version

These dots help us see where the network is assigning its attention at any given timestep.

We can see how at step 3, the attention starts out spread out and fills in the edges of the images. As the timesteps go on, the attention patch shrinks, with the smallest at the final timestep.

This makes sense when looking at how the images are created, starting from the out and working its way inwards with increasing detail.

An interesting thing to note it that these attention patches are in relatively the same location, regardless of the face being created. The DRAW video has attention move differently when different digits are being recreated.

The results show that the model works, but the results are quite blurry. There are some potential steps we can take to make them better.

**Train Longer:** I only trained the model for one epoch (~2400 iterations). That's a pretty low amout of training given the data size, and more training is probably the next best step.

**Bigger latent variable:** Currently, the latent variable is a vector of 10 real numbers. If we increase this, more information, and therefore more details about the image, can be passed. This has the downside that the model might overfit the current dataset, so we need a balance.

**More timesteps:** If we give the model more timesteps, it will have more operations in which to update the image, ideally allowing it to create more detail. In addition, the way DRAW is setup has an additional latent variable for every timestep, so more information is passed as well (whether this is good or bad is up for debate).

**Better loss function:** In our color DRAW model, we're using an L2 pixel-by-pixel loss. The problem here is the model may get stuck in the middle of two possibilities. For example, if a certain pixel could either be bright red or completely black, the model would converge to the average of a dark red, which is neither of the options. This is why it's often better to be using classification rather than regression, as an average may often be incorrect.

The paper Autoencoding beyond pixels using a learned similarity metric describes a combination of a generative adversial network and a variational autoencoder.

In short, a generative adversial network naturally trains a discriminator network which attempts to tell real and fake images apart. It will take some tweaking, but we can use this discriminator as a measurement of how close we are to an original image, rather than an L2 loss.

Using the GAN usually leads to more crisp images, as we don't suffer from blurriness caused by converging to an average as described above. The downsides are that we introduce more parameters to train, and the GAN has a chance to become unstable and diverge.

As usual, the code for this project can be found on my Github.

]]>In most image generation methods, the actual generation network is a bunch of deconvolution layers. These map from some initial latent matrix of parameters to a bigger matrix, which then maps to an even bigger matrix, and so on.

how images are generated from deconvolutional layers. [source]

However, there's another way to think about image generation. In the real world, artists don't create paintings instantly. It's a sequence of brush strokes that eventually make up something that looks amazing.

DRAW attempt to replicate this behavior. Instead of creating an image instantly, it uses a recurrent neural network as both the encoder and decoder portions of a typical variational autoencoder. Every timestep, a new latent code is passed from the encoder to the decoder.

simple recurrent VAE setup

Here's a model of a simplified version of DRAW. If you saw the diagram in my previous VAE post, you would notice that the first column in the recurrent model is exactly the same as a typical variational autoencoder. The difference here is that instead of generating a final image directly, we break up its generation into many iterations. Every iteration, the model improves upon its generated image until the final image (hopefully) looks like the original.

Above, the horizontal arrows represent recurrent neural networks. These are fully-connected layers that maintain an internal hidden state along with taking in an input. In practice, we use LSTMs. The uppermost horizontal arrow simply represents the iterative construction of our goal image, as each timestep's image is simply elementwise addition.

```
new image = [the last image](shape=28x28) + [some improvements](shape=28x28)
```

By itself, this simple recurrent version of a variational autoencoder performs pretty well. We can successfully generate nice-looking MNIST images by iterative improvements.

MNIST results of recurrent VAE

However, artists in real life don't draw by continuously making small changes to the entire canvas. Brush strokes occur only in one portion of the image. In the cast of MNIST: when writing a 5, the typical person does not start from a blob and gradually erase portions until it looks nice. They just make one smooth motion following the shape of the 5.

The DRAW model acknowledges this by including an *attention* gate. This is the more complicated part of DRAW, so I'll start from a high-level explanation and go into detail on how attention is implemented.

An attention gate allows our encoder and decoder to focus on specific parts of our image.

Let's say we're trying to encode an image of the number 5. Every handwritten number is drawn a little differently: some portions may be thicker or longer than others. Without attention, the encoder would be forced to try and capture all these small nuances at the same time.

However, if the encoder could choose a small crop of the image every frame, it could examine each portion of the number one at a time.

Reading MNIST [video source]

The same goes for generating the number. The attention unit will determine where to draw the next portion of the 5, while the latent vector passed will determine if the decoder generates a thicker area or a thinner area.

Writing MNIST [video source]

In summary, if we think of the latent code in a VAE as a vector that represents the entire image, the latent codes in DRAW can be thought of as vectors that represent a brush stroke. Eventually, a sequence of these vectors creates a recreation of the original image.

In the simple recurrent VAE model, the encoder takes in the entire input image at every timestep. Instead of doing this, we want to stick in an attention gate in between the two, so the encoder only receives the portion of our image that the network deems is important at that timestep. We will refer to this first attention gate as the "read" attention.

There's two parts to this attention gate: choosing the important portion, and cropping the image to get rid of the other parts.

We'll start with the first part. In order to determine which part of the image to focus on, we need some sort of observation to make a decision based on. In DRAW, we use the previous timestep's decoder hidden state. Using a simple fully-connected layer, we can map the hidden state to three parameters that represent our square crop: center x, center y, and the scale.

How the "read" attention gate works. Note the first timestep should have an attention gate but it is omitted for clarity. The attention gate shown is in the next timestep.

Now, instead of encoding the entire image, only a small of the image is encoded. This code is then passed through the system, and decoded back into a small patch.

We have a second attention gate after the decoder, that's job is to determine where to place this small patch. It's the same setup as the "read" attention gate, except the "write" attention gate uses the current decoder instead of the previous timestep's decoder.

Model with both read and write attention gates. Note that the first timestep should also have read/write attention gates but are omitted for clarity.

To review, the read attention gate takes a 9x9 crop from the 28x28 original image. This crop is then passed through the autoencoder, and the write attention gate places the 9x9 crop at its appropriate position in the 28x28 generated image. This process is then repeated for a number of timesteps until the original image is recreated.

Describing the attention mechanism as a crop makes sense intuitively. However, in practice, we use a different method. The model structure described above is still accurate, but we use a matrix of gaussian filters instead of a crop.

What is a gaussian filter? Imagine that our attention gate consisted of taking a 9x9 crop of the original image, and storing the average grayscale value. Then, when reconstructing the image, the a 9x9 patch of that average grayscale value in added on.

A gaussian filter does essentially that, except instead of taking a mean average of the 9x9 area, more influence is placed on the grayscale values near the center.

To gain a better understanding, let's think in one dimension first.

Let's say we had a vector of 10 random values, say [3,5,2,7,4,9,4,6,1,8].

To find the mean average, we would multiply each value by 0.1, and the sum them up.

However, another way to do this is by multiplying each value by its corresponding frequency in a gaussian (otherwise known as normal) distribution, and summing those values up.

Gaussian/normal distribution [source]

This would place more of an emphasis on the center values such as 4 and 9, and less on the outer values such as 3 and 8.

Furthermore, we can choose the center and spread of our gaussian filter. If we place our filter near the side, values near that area will have more influence.

It's important to note that every single value still has some influence, even though it may be tiny. This allows us to pass gradients through the attention gate so we can run backprop.

We can extend this into two dimensions by having two gaussian filters, one along the X axis and one along the Y axis.

To bring it all together: in the previous section we went over a "read" attention gate that chose a center x, center y, and scale for a crop of the original image.

We can use those same parameters to define a gaussian filter. The center x and y will determine at what location in the original image to place the filter, and the scale will determine how much influence is spread out or concentrated.

However, in the end our gaussian filter only leaves us with one scalar average of a certain portion. Ideally, we would want more information than that.

In DRAW, we take an array of gaussian filters, each with their centers spaced apart evenly. For example, we may have a 3x3 array of gaussian filters, with the position of the entire array parameterized by with center x and center y.

green dots = center of a gaussian filter. X,y coordinates of the center dot are determined by the attention gate.

To account for multiple filte, we need a fourth parameter in our attention gate: stride, or how far apart each filter should be from one another.

These multiple filters allow us to capture 3x3 = 9 values of information during each timestep, instead of only one.

Finally, just for fun we can consider an extreme situation. If we had 28x28 gaussian filters with a stride of one, and each filter had a spread so small it was only influenced by its center pixel, our attention gate would essentially capture the entire original image.

I ran some experiments on generating the MNIST dataset using an implementation of DRAW in tensorflow.

First, let's look at the plain recurrent version without any attention. The image starts out like a gray blob, and every frame, it gets a little clearer.

DRAW without attention

Here's the fun part. With attention, the image creation gets a bit crazier.

DRAW with attention

It's kind of hard to tell what the network is doing. While it's not exactly the brushstroke-like drawing behavior we were expecting, at the later timesteps you can see that the network only edits a portion of the image at a time.

A thing to note is at the first timestep, the network pays attention to the entire image. Whether this is to create the black background, or to understand what number to draw, is hard to know.

Interestingly, there's actually signs of the network changing its mind about what number to draw. Take a look at the second to last row, fourth from the left. The network starts out by drawing the outline of a 4. However, it goes back and changes that 4 into a 7 midway.

The code for these results is on my Github, which is a slightly fancier and commented version of ericjang's implementation.

]]>In this post, I'll go over an explanation of the

]]>In this post, I'll go over an explanation of the natural gradient that tries to keep the mathematical terminology down to a minimum.

To start, we first have to consider our standard gradient descent. We have some parameterized model, such as a neural network. In backpropagation, we compute an error term (loss) that compares the network's output to a target. Then, we can derive the gradients for each parameter, which tell us how to change the value of the parameter to make the loss smaller. Finally, we change the parameters in their gradient's negative direction, by a global learning rate.

The goal in standard backpropagation is to keep resampling the gradient of the network's parameters after every update, and update them accordingly until reaching a (hopefully global) minimum.

However, there's another way we can think of optimization. To better understand it, we first need to understand KL divergence.

In simple terms, KL divergence is a measure of how close a distribution is to another distribution. Here, it's helpful to think of a neural network as a distribution over output values, given an input.

For example, let's take a simple classification network that, given an input image, outputs probabilities that the image is either an apple, a banana, or an orange. If we had two of these networks with the same parameters, their KL divergence would be 0.

On the other hand, if the networks had different parameters, they would likely output different probabilities, given the same image.

The higher the difference between these probabilities are, the higher the KL divergence is between the two networks.

This brings us to the natural gradient. If we blindly update our network in the direction of its gradients, there are no guarantees the distribution of the new network will be similar to the old one.

To fix this, we first consider all combinations of parameters that result in a new network a constant KL divergence away from the old network. This constant value can be viewed as the step size or learning rate. Out of all these possible combinations, we choose the one that minimizes our loss function.

Basically, we're adding in a constraint to our update, that the new network will behave relatively similar to our old one. Our step size corresponds directly to the actual distribution of the network, not it's parameter's values.

This comes with the added benefit of a more stable learning process. Especially when we're updating using a randomly sampled batch, some outliers may make drastic changes to a network's parameters. With natural gradient descent, a single update can't make too much of an impact.

Furthermore, the natural gradient updates based on KL divergence, which only considers the output of a network. It doesn't matter how the network is modeled. Replacing a sigmoid activation with a tanh function would change the standard gradient, but not the natural gradient.

Of course, in practice we don't actually loop through every possible combination of parameters within a certain KL divergence. I won't go into the mathematical explanations here, but it's still true that the natural gradient is more computationally expensive compared to straight gradient descent.

For more in-depth details, check out this paper or these notes.

]]>However, there were a couple of downsides to using a plain GAN.

First, the images are generated off some arbitrary noise. If you wanted to generate a

]]>However, there were a couple of downsides to using a plain GAN.

First, the images are generated off some arbitrary noise. If you wanted to generate a picture with specific features, there's no way of determining which initial noise values would produce that picture, other than searching over the entire distribution.

Second, a generative adversial model only discriminates between "real" and "fake" images. There's no constraints that an image of a cat has to look like a cat. This leads to results where there's no actual object in a generated image, but the style just looks like picture.

In this post, I'll go over the variational autoencoder, a type of network that solves these two problems.

To get an understanding of a VAE, we'll first start from a simple network and add parts step by step.

An common way of describing a neural network is an approximation of some function we wish to model. However, they can also be thought of as a data structure that holds information.

Let's say we had a network comprised of a few deconvolution layers. We set the input to always be a vector of ones. Then, we can train the network to reduce the mean squared error between itself and one target image. The "data" for that image is now contained within the network's parameters.

Now, let's try it on multiple images. Instead of a vector of ones, we'll use a one-hot vector for the input. [1, 0, 0, 0] could mean a cat image, while [0, 1, 0, 0] could mean a dog. This works, but we can only store up to 4 images. Using a longer vector means adding in more and more parameters so the network can memorize the different images.

To fix this, we use a vector of real numbers instead of a one-hot vector. We can think of this as a code for an image, which is where the terms encode/decode come from. For example, [3.3, 4.5, 2.1, 9.8] could represent the cat image, while [3.4, 2.1, 6.7, 4.2] could represent the dog. This initial vector is known as our latent variables.

Choosing the latent variables randomly, like I did above, is obviously a bad idea. In an autoencoder, we add in another component that takes in the original images and encodes them into vectors for us. The deconvolutional layers then "decode" the vectors back to the original images.

We've finally reached a stage where our model has some hint of a practical use. We can train our network on as many images as we want. If we save the encoded vector of an image, we can reconstruct it later by passing it into the decoder portion. What we have is the standard autoencoder.

However, we're trying to build a generative model here, not just a fuzzy data structure that can "memorize" images. We can't generate anything yet, since we don't know how to create latent vectors other than encoding them from images.

There's a simple solution here. We add a constraint on the encoding network, that forces it to generate latent vectors that roughly follow a unit gaussian distribution. It is this constraint that separates a variational autoencoder from a standard one.

Generating new images is now easy: all we need to do is sample a latent vector from the unit gaussian and pass it into the decoder.

In practice, there's a tradeoff between how accurate our network can be and how close its latent variables can match the unit gaussian distribution.

We let the network decide this itself. For our loss term, we sum up two separate losses: the generative loss, which is a mean squared error that measures how accurately the network reconstructed the images, and a latent loss, which is the KL divergence that measures how closely the latent variables match a unit gaussian.

```
generation_loss = mean(square(generated_image - real_image))
latent_loss = KL-Divergence(latent_variable, unit_gaussian)
loss = generation_loss + latent_loss
```

In order to optimize the KL divergence, we need to apply a simple reparameterization trick: instead of the encoder generating a vector of real values, it will generate a vector of means and a vector of standard deviations.

This lets us calculate KL divergence as follows:

```
# z_mean and z_stddev are two vectors generated by encoder network
latent_loss = 0.5 * tf.reduce_sum(tf.square(z_mean) + tf.square(z_stddev) - tf.log(tf.square(z_stddev)) - 1,1)
```

When we're calculating loss for the decoder network, we can just sample from the standard deviations and add the mean, and use that as our latent vector:

```
samples = tf.random_normal([batchsize,n_z],0,1,dtype=tf.float32)
sampled_z = z_mean + (z_stddev * samples)
```

In addition to allowing us to generate random latent variables, this constraint also improves the generalization of our network.

To visualize this, we can think of the latent variable as a transfer of data.

Let's say you were given a bunch of pairs of real numbers between [0, 10], along with a name. For example, 5.43 means apple, and 5.44 means banana. When someone gives you the number 5.43, you know for sure they are talking about an apple. We can essentially encode infinite information this way, since there's no limit on how many different real numbers we can have between [0, 10].

However, what if there was a gaussian noise of one added every time someone tried to tell you a number? Now when you receive the number 5.43, the original number could have been anywhere around [4.4 ~ 6.4], so the other person could just as well have meant banana (5.44).

The greater standard deviation on the noise added, the less information we can pass using that one variable.

Now we can apply this same logic to the latent variable passed between the encoder and decoder. The more efficiently we can encode the original image, the higher we can raise the standard deviation on our gaussian until it reaches one.

This constraint forces the encoder to be very efficient, creating information-rich latent variables. This improves generalization, so latent variables that we either randomly generated, or we got from encoding non-training images, will produce a nicer result when decoded.

I ran a few tests to see how well a variational autoencoder would work on the MNIST handwriting dataset.

left: 1st epoch, middle: 9th epoch, right: original

Looking good! After only 15 minutes on my laptop w/o a GPU, it's producing some nice results on MNIST.

Here's something convenient about VAEs. Since they follow an encoding-decoding scheme, we can compare generated images directly to the originals, which is not possible when using a GAN.

A downside to the VAE is that it uses direct mean squared error instead of an adversial network, so the network tends to produce more blurry images.

There's been some work looking into combining the VAE and the GAN: Using the same encoder-decoder setup, but using an adversial network as a metric for training the decoder. Check out this paper or this blog post for more on that.

You can get the code for this post on my Github. It's a cleaned up version of the code from this post.

]]>I decided to try and simulate the typical Twitch viewer. Type a few words

]]>I decided to try and simulate the typical Twitch viewer. Type a few words into the box below and press enter to see what the model predicts a Twitch viewer would say.

In this post, I'll go over a brief explanation of recurrent neural networks, and put a focus on how the data was collected and formatted, as well as some analysis on how well the model did.

Now, you might be asking, why Twitch?

Oftentimes, the hardest part of solving a machine learning problem is collecting the data. Well, Twitch is always running, so I could write a simple node.js scraper to download the messages as they came up.

Second, Twitch chat often contains a lot of memes and copypasta, and I wanted to see if the network could learn to complete the copypasta if given the beginning.

Finally, messages on Twitch aren't as dependent on the previous messages since the chat is moving so fast, so I could treat each message independently.

The model this time is a recurrent neural network. In short, it's a neural network that repeats over timesteps, and passes data into itself every iteration.

Here, X is a vector that contains the words in the beginning of a sentence. The words are encoded as IDs, so passing "league of" would be seen as [545, 23], if "league" is the 545th word, etc.

We want to predict the next word, so passing in "league of" would give an output of "legends". The output of each layer is a vector of [vocab size x 1], with each value being the probability of that specific word coming next in the sentence. We can then either take a sample of these probabilities to choose the next word, or just take the maximum.

The green box is a single hidden layer that takes in two inputs: the current word, and the value of the previous hidden layer. They all share the same weights, so we can extend the recurrent network as long as we need to. In practice, we use an LSTM instead of a fully-connected layer, but their purpose is the same.

To train the network, we use a cross-entropy loss on the prediction of each word in the message. This backpropogates through to the beginning from each word, allowing the network to train multiple times from one sentence.

This setup is pretty typical, so I won't go into much detail. Check out the source code or this post by Andrej Karpathy for more explanations.

Here's what I would consider the trickiest part of the setup. We know we want to use Twitch chat as a source of data, but how do we actually do that in practice?

First comes the actual collection. I used a scraper in node.js to read the IRC channels that Twitch operates on, and save every message to a text file.

```
// tmi is an npm module for reading Twitch chat
var client = new tmi.client(options);
var totalstring = "";
client.on("chat", function (channel, userstate, message, self) {
//remove non-alphanumeric or space characters
message = message.replace(/[^\w\s]/gi, '')
//marker for end of sentence
message = message + "<eos>"
totalstring += message
fs.writeFile("database.txt", totalstring, function(err) {});
}
```

The important thing to take away here is to always clean your data. On the first iteration I didn't remove weird characters, and some ASCII art totally messed up the dataset.

Next, we need to convert the sentence to arrays of word IDs. The first step is to decide on which IDs to should map to which words.

```
data = totalstring.toLowerCase();
var sentences = data.split("<eos>");
for(var s = 0; s < sentences.length; s++)
{
sentences[s].replace(/[^0-9a-zA-Z ]/g, '')
var words = sentences[s].split(" ");
for(var w = 0; w < words.length; w++)
{
if(topwords[words[w]] == null)
{
topwords[words[w]] = 1;
}
else
{
topwords[words[w]] = topwords[words[w]] + 1;
}
}
}
var len = Object.keys(topwords).length;
var thewords = Object.keys(topwords);
for(var w = 0; w < len; w++)
{
if(topwords[thewords[w]] < 5)
{
delete topwords[thewords[w]];
}
else
{
topwords[thewords[w]] = w+1;
}
}
jsonfile.writeFile("words.json", topwords, function (err) {});
```

Here, we're taking counts of how many times each word appears. If it appeared less than 5 times, we discard it. Later on, we'll add a special word that will represent all rare words in sentences.

If the word occurs enough times, we keep it and assign it an ID number that counts up from 1. The number 0 is reserved for when there is no word.

Now, we'll switch to python to construct a numpy matrix containing the word IDs for each message.

```
with open('database.txt') as data_file:
sentences=data_file.read().replace('\n', '')
with open('words.json') as data_file:
wordsdata = json.load(data_file)
count = 0
split_sentences = sentences.split("<eos>")
# 152401 sentences
# 33233 words
nparray = np.zeros((152401,20))
for s in xrange(len(split_sentences)):
sentence = split_sentences[s].lower()
regex = re.compile('[^0-9a-zA-Z ]')
realsent = regex.sub('', sentence)
words = sentence.split(" ")
if len(words) >= 2:
count = count + 1
for w in xrange(min(len(words),20)):
word = words[w]
if word in wordsdata:
nparray[count][w] = wordsdata[word]
else:
# its a rare word
nparray[count][w] = 33234
nparray = nparray[:count,:]
np.random.shuffle(nparray)
np.save("data.npy",nparray)
```

Any sentences that only have two or less words are discarded. If there's a rare word, we replace it with a special id 33234, which is [vocabsize + 1]. We set the maximum message length to 20. If the message ends before that, which most do, the rest of the vector is filled with 0's.

It's important to shuffle the rows in the end, in case there is any correlation between messages scraped at similar times (which there often is). This might cause the neural network to jump back and forth during training.

After all of that, we're finally done with data preparation and sanitization! I'm sure there are many improvements that could be made in my methods, but it worked well enough.

Data handling is a part of the process that normally gets left out when talking about machine learning, but it's an important part of the process as well.

Well, did the network work? You can try out whatever phrases you want in the field at the top of the page.

Since there's no concrete way to measure how well the model performs, here are some examples that might shine some light.

It was able to learn the names of a few popular Twitch games, since they were said a lot.

A copypasta from Kripp's stream. Even though I started the message from the middle of the copypasta, the model was able to recognize it and finish the job.

Got some nice banter here.

Putting in an emote usually just gives an endless loop of emotes. It's imitating chat spamming kappa all the time.

If you put in some generic phrase, the network usually responds with `<rare word>`

. This is probably due to the total number of rare words outnumbering any single word in the dataset. It did learn to put an "a" before the word, though.

Adding better context gives better results, along with some deep thoughts about game design.

Overall, the model did perform pretty well. A nice way it could be tested is to use some form of a Turing test where we make a Twitch bot say phrases from the neural network, and see if anyone catches on.

The code for this project is available on my Github, along with the 5MB dataset I scraped.

Thanks Michael and Kevin for helping build the first trial of this

]]>example of balancing the pole in CartPole

In this post, I will be going over some of the methods described in the CartPole request for research, including implementations and some intuition behind how they work.

In CartPole's environment, there are four observations at any given state, representing information such as the angle of the pole and the position of the cart.

Using these observations, the agent needs to decide on one of two possible actions: move the cart left or right.

A simple way to map these observations to an action choice is a linear combination. We define a vector of weights, each weight corresponding to one of the observations. Start off by initializing them randomly between [-1, 1].

```
parameters = np.random.rand(4) * 2 - 1
```

How is the weight vector used? Each weight is multiplied by its respective observation, and the products are summed up. This is equivalent to performing an inner product (matrix multiplication) of the two vectors. If the total is less than 0, we move left. Otherwise, we move right.

```
action = 0 if np.matmul(parameters,observation) < 0 else 1
```

Now we've got a basic model for choosing actions based on observations. How do we modify these weights to keep the pole standing up?

First, we need some concept of how well we're doing. For every timestep we keep the pole straight, we get +1 reward. Therefore, to estimate how good a given set of weights is, we can just run an episode until the pole drops and see how much reward we got.

```
def run_episode(env, parameters):
observation = env.reset()
totalreward = 0
for _ in xrange(200):
action = 0 if np.matmul(parameters,observation) < 0 else 1
observation, reward, done, info = env.step(action)
totalreward += reward
if done:
break
return totalreward
```

We now have a basic model, and can run episodes to test how well it performs. The problem is now much simpler: how can we select these weights/parameters, to receive the highest amount of average reward?

One fairly straightforward strategy is to keep trying random weights, and pick the one that performs the best.

```
bestparams = None
bestreward = 0
for _ in xrange(10000):
parameters = np.random.rand(4) * 2 - 1
reward = run_episode(env,parameters)
if reward > bestreward:
bestreward = reward
bestparams = parameters
# considered solved if the agent lasts 200 timesteps
if reward == 200:
break
```

Since the CartPole environment is relatively simple, with only 4 observations, this basic method works surprisingly well.

I ran the random search method 1,000 times, keeping track of how many episodes it took until the agent kept the pole up for 200 timesteps. On average, it took 13.53 episodes.

Another method of choosing weights is the hill-climbing algorithm. We start with some randomly chosen initial weights. Every episode, add some noise to the weights, and keep the new weights if the agent improves.

```
noise_scaling = 0.1
parameters = np.random.rand(4) * 2 - 1
bestreward = 0
for _ in xrange(10000):
newparams = parameters + (np.random.rand(4) * 2 - 1)*noise_scaling
reward = 0
run = run_episode(env,newparams)
if reward > bestreward:
bestreward = reward
parameters = newparams
if reward == 200:
break
```

The idea here is to gradually improve the weights, rather than keep jumping around and hopefully finding some combination that works. If `noise_scaling`

is high enough in comparison to the current weights, this algorithm is essentially the same as random search.

As usual, this algorithm has its pros and cons. If the range of weights that successfully solve the problem is small, hill climbing can iteratively move closer and closer while random search may take a long time jumping around until it finds it.

However, if the weights are initialized badly, adding noise may have no effect on how well the agent performs, causing it to get stuck.

To visualize this, let's pretend we only had one observation and one weight. Performing random search might look something like this.

In the image above, the x-axis represents the value of the weight from -1 to 1. The curve represents how much reward the agent gets for using that weight, and the green represents when the reward was high enough to solve the environment (balance for 200 timesteps).

An arrow represents a random guess as to where the optimal weight might be. With enough guesses, the agent will eventually try a weight in the green zone and be successful.

How does this change with hill climbing?

Here, an arrow represents the value that weights are initialized at. The blue region is noise that is added at every iteration.

If the weight starts at arrow B, then hill-climbing can try other weights near arrow B until it finds one that improves the reward. This can continue until the weight is in the green zone.

However, arrow A has a problem. If the weight is initialized at a location where changing it slightly gives no improvement at all, the hill climbing is stuck and won't find a good answer.

It turns out that in CartPole, there are many initializations that get stuck.

More than half of the trials got stuck and couldn't improve (the bar on the right). However, of the trials that reach a solution, it took an average of 7.43 episodes to do so, less than random search.

A possible fix to this problem would be to slightly increase the `noise_factor`

every iteration that the agent does not improve, and eventually reach a set of weights that improves the reward.

Another change that may improve accuracy is to replace

```
reward = run_episode(env,parameters)
```

with

```
reward = 0
for _ in xrange(episodes_per_update):
run = run_episode(env,newparams)
reward += run
```

Instead of only running one episode to measure how good a set of weights is, we run it multiple times and sum up the rewards. Although it might take longer to evaluate, this decreases variance and results in a more accurate measurement of whether one set of weights performs better.

The next method is a little more complicated than needed to solve the CartPole environment. When possible, simple methods such as random search and hill climbing are better to start with. However, it's a good concept to learn and performs better in environments with lots of states or actions.

In order to implement a policy gradient, we need a policy that can change little by little. In practice, this means switching from an absolute limit (move left if the total is < 0, otherwise move right) to probabilities. This changes our agent from a deterministic to a stochastic (random) policy.

Instead of only having one linear combination, we have two: one for each possible action. Passing these two values through a softmax function gives the probabilities of taking the respective actions, given a set of observations. This also generalizes to multiple actions, unlike the threshold we were using before.

Since we're going be computing gradients, it's time to bust out Tensorflow.

```
def policy_gradient():
params = tf.get_variable("policy_parameters",[4,2])
state = tf.placeholder("float",[None,4])
linear = tf.matmul(state,params)
probabilities = tf.nn.softmax(linear)
```

We also need a way to change the policy, increasing the probability of taking a certain action given a certain state.

Here's a crude implementation of an optimizer that allows us to incrementally update our policy. The vector `actions`

is a one-hot vector, with a one at the action we want to increase the probability of.

```
def policy_gradient():
params = tf.get_variable("policy_parameters",[4,2])
state = tf.placeholder("float",[None,4])
actions = tf.placeholder("float",[None,2])
linear = tf.matmul(state,params)
probabilities = tf.nn.softmax(linear)
good_probabilities = tf.reduce_sum(tf.mul(probabilities, actions),reduction_indices=[1])
# maximize the log probability
log_probabilities = tf.log(good_probabilities)
loss = -tf.reduce_sum(log_probabilities)
optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
```

So we know how to update our policy to prefer certain actions. If we knew exactly what the perfect to move in every state was, we could just continously perform supervised learning and the problem would be solved. Unfortunately, we don't have some magic oracle that knows all the right moves.

We want to know how good it is to take some action from some state. Thankfully, we do have some measure of success that we can make decisions based on: the return, or total reward from that state onwards.

We're trying to determine the best action for a state, so the first thing we need is a baseline to compare from. We a define some value for each state, that contains the average return starting from that state. In this example, we'll use a 1 hidden layer neural network.

```
def value_gradient():
# sess.run(calculated) to calculate value of state
state = tf.placeholder("float",[None,4])
w1 = tf.get_variable("w1",[4,10])
b1 = tf.get_variable("b1",[10])
h1 = tf.nn.relu(tf.matmul(state,w1) + b1)
w2 = tf.get_variable("w2",[10,1])
b2 = tf.get_variable("b2",[1])
calculated = tf.matmul(h1,w2) + b2
# sess.run(optimizer) to update the value of a state
newvals = tf.placeholder("float",[None,1])
diffs = calculated - newvals
loss = tf.nn.l2_loss(diffs)
optimizer = tf.train.AdamOptimizer(0.1).minimize(loss)
```

In order to train this network, we first need to run some episodes to gather data. This is pretty similar to the loop in random-search or hill climbing, except we want to record transitions for each step, containing what action we took from what state, and what reward we got for it.

```
# tensorflow operations to compute probabilties for each action, given a state
pl_probabilities, pl_state = policy_gradient()
observation = env.reset()
actions = []
transitions = []
for _ in xrange(200):
# calculate policy
obs_vector = np.expand_dims(observation, axis=0)
probs = sess.run(pl_probabilities,feed_dict={pl_state: obs_vector})
action = 0 if random.uniform(0,1) < probs[0][0] else 1
# record the transition
states.append(observation)
actionblank = np.zeros(2)
actionblank[action] = 1
actions.append(actionblank)
# take the action in the environment
old_observation = observation
observation, reward, done, info = env.step(action)
transitions.append((old_observation, action, reward))
totalreward += reward
if done:
break
```

Next, we compute the return of each transition, and update the neural network to reflect this. We don't care about the specific action we took from each state, only what the average return for the state over all actions is.

```
vl_calculated, vl_state, vl_newvals, vl_optimizer = value_gradient()
update_vals = []
for index, trans in enumerate(transitions):
obs, action, reward = trans
# calculate discounted monte-carlo return
future_reward = 0
future_transitions = len(transitions) - index
decrease = 1
for index2 in xrange(future_transitions):
future_reward += transitions[(index2) + index][2] * decrease
decrease = decrease * 0.97
update_vals.append(future_reward)
update_vals_vector = np.expand_dims(update_vals, axis=1)
sess.run(vl_optimizer, feed_dict={vl_state: states, vl_newvals: update_vals_vector})
```

If we let this run for a couple hundred episodes, the value of each state is represented pretty accurately. The `decrease`

factor puts more of an emphasis on short-term reward rather than long-term reward. This introduces a little bias but can reduce variance by a lot, especially in the case of outliers, such as a lucky episode that lasted for 100 timeframes from a state that normally lasted only 30.

How can we use the newly found values of states in order to update our policy to reflect it? Obviously, we want to favor actions that return a total reward greater than the average of that state. We call this error between the actual return and the average an *advantage*. As it turns out, we can just plug in the advantage as a scale, and update our policy accordingly.

```
for index, trans in enumerate(transitions):
obs, action, reward = trans
# [not shown: the value function update from above]
obs_vector = np.expand_dims(obs, axis=0)
currentval = sess.run(vl_calculated,feed_dict={vl_state: obs_vector})[0][0]
advantages.append(future_reward - currentval)
advantages_vector = np.expand_dims(advantages, axis=1)
sess.run(pl_optimizer, feed_dict={pl_state: states, pl_advantages: advantages_vector, pl_actions: actions})
```

This requires a slight change in our policy update: we add in a scaling factor for a given update. If a certain action results in a return of 150, while the state average is 60, we want to increase its probability more than an action with a return of 70 and a state average of 65. Furthermore, if the return for an action is 30 while the state is average is 40, we have a negative advantage, and instead decrease the probability.

```
def policy_gradient():
params = tf.get_variable("policy_parameters",[4,2])
state = tf.placeholder("float",[None,4])
actions = tf.placeholder("float",[None,2])
advantages = tf.placeholder("float",[None,1])
linear = tf.matmul(state,params)
probabilities = tf.nn.softmax(linear)
good_probabilities = tf.reduce_sum(tf.mul(probabilities, actions),reduction_indices=[1])
# maximize the log probability
log_probabilities = tf.log(good_probabilities)
# insert the elementwise multiplication by advantages
eligibility = log_probabilities * advantages
loss = -tf.reduce_sum(eligibility)
optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
```

How well does our policy gradient perform on CartPole?

Turns out, it's not very fast. Compared to random search and hill climbing, policy gradient takes much longer to solve CartPole.

This is partly due to having to learn a value function first, before the agent can make any solid assumptions as to which actions are better. While the first two methods can start improving right off the bat, the policy gradient needs to collect data on how to improve first.

However, this ensures that the updates are always making some form of improvement. In environments with much higher dimensional state spaces, the previous methods get much less efficient, as the probability of improving when adjusting randomly is small. A policy gradient ensures the agent is always heading toward some kind of optimum.

Since we always follow the gradient, our weights will always be changing and we doesn't get stuck in bad initializations such as in hill climbing. It also can improve on an already good solution even further, which random search cannot do.

In the end, simple methods are almost always better when dealing with simple problems. For more complicated environments, a policy gradient allows our agent to perform more consistent updates and gradually improve.

The code for these methods is available on my Github

]]>It turns out, these same networks can be turned around and applied to image generation as well. If we've got a bunch of images, how can we generate more like them?

A recent method,

]]>It turns out, these same networks can be turned around and applied to image generation as well. If we've got a bunch of images, how can we generate more like them?

A recent method, Generative Adversial Networks, attempts to train an image generator by simultaneously training a discriminator to challenge it to improve.

To gain some intuition, think of a back-and-forth situation between a bank and a money counterfeiter. At the beginning, the fakes are easy to spot. However, as the counterfeiter keeps trying different kinds of techniques, some may get past the check. The counterfeiter then can improve his fakes towards the areas that got past the bank's security checks.

But the bank doesn't give up. It also keeps learning how to tell the fakes apart from real money. After a long period of back-and-forth, the competition has led the money counterfeiter to create perfect replicas.

Now, take that same situation, but let the money forger have a spy in the bank that reports back how the bank is telling fakes apart from real money.

Every time the bank comes up with a new strategy to tell apart fakes, such as using ultraviolet light, the counterfeiter knows exactly what to do to bypass it, such as replacing the material with ultraviolet marked cloth.

The second situation is essentially what a generative adversial network does. The bank is known as a discriminator network, and in the case of images, is a convolutional neural network that assigns a probability that an image is real and not fake.

The counterfeiter is known as the generative network, and is a special kind of convolutional network that uses transpose convolutions, sometimes known as a deconvolutional network. This generative network takes in some 100 parameters of noise (sometimes known as the *code*) , and outputs an image accordingly.

how images are generated from deconvolutional layers. [source]

Which part was the spy? Since the discriminator was just a convolutional neural network, we can backpropogate to find the gradients of the input image. This tells us what parts of the image to change in order to yield a greater probability of being real in the eyes of the discriminator network.

All that's left is to update the weights of our generative network with respect to these gradients, so the generative network outputs images that are more "real" than before.

The two networks are now locked in a competition. The discriminative network is constantly trying to find differences between the fake images and real images, and the generative network keeps getting closer and closer to the real deal. In the end, we've trained the generative network to produce images that can't be differentiated from real images.

How well does this work? I wrote an implementation in Tensorflow, and trained it on various image sets such as CIFAR-10 and 64x64 Imagenet samples.

generated CIFAR images on iteration 300, 900, and 5700.

In these samples, 64 images were generated at different iterations of learning. In the beginning, all the samples are roughly the same brownish color. However, even at iteration 200, some hints of variation can be spotted. By iteration 900, different colors have emerged, although the generated images still do not resemble anything. At iteration 5700, the generated images aren't blurry anymore, but there's no actual objects in the images.

real images on the left, generated at iteration 182,000 on the right

After letting the network run for a couple hours on a GPU, I was glad to see that nothing broke. In fact, the generated images are looking pretty close to the real deal. You can see actual objects now, such as some ducks and cars.

What happens if we scale it up? CIFAR is only 32x32, so let's try Imagenet. I downloaded a 150,000 image set from the Imagenet 2012 Challenge, and rescaled them all to 64x64.

generated Imagenet images on iteration 300, 800, and 5800.

The progression here is basically the same as before. It starts out with some brown blobs, learns to add color and some lighting, and finally learns the look and feel of real images.

real images on the left, generated at iteration 17,800 on the right

Here's the generated Imagenet samples at the last iteration I trained from. These definitely look a lot better than the earlier iterations. However, especially at this higher resolution, some problems become apparent.

When generating from CIFAR or Imagenet, there are no concept of classes in the generative adversial network. The network is not learning to make images of cars and ducks, it is learning to make images that look real in general. The problem is, this results in some image that may have some combination of features from all sorts of objects, like the outline of a table but the coloring of a frog.

OpenAI's recent blog post describes two papers that attempt to combat this: Improving GANs, and InfoGAN.

Both of these involve adding multiple objectives to the discriminator's cost function, which is a good idea. In a simple GAN, the discriminator only has one idea of what an incredibly "real" image looks like. This leads to the generator either collapsing to only produce one image no matter what noise it starts with, or only producing images that have some resemblance of real features but no distinct uniqueness, like our Imagenet generator.

*Improving GANs* adds in minibatch discrimination, which is a fancy way of making sure features within various samples remain varied.

Meanwhile, *InfoGAN* tries to correlate the initial noise with features in the generated image, so you can do things such as adjust one of the initial noise variables to change the angle of an object.

In a plain GAN, the initial noise variables suffer from the same problem of features in a typical neural network. Although they make sense when put together, it's hard to tell what each of them do individually.

left and middle: CIFAR-10. right: Imagenet

I wrote a quick generation script that kept all 200 initial noise values constant, except linearly adjusted one from -1 to 1. The most common result is some color becomes more prominent in a certain region of the image.

Unfortunately, this means the generator network has not learned what it means to represent an object. All it's doing is creating an image that has features that might be present in a photograph, such as distinct color regions and shadows.

Adding some secondary objectives, such as correlating initial noise and features present, could add some more concrete value to this noise, and result in images that look less like a mix between multiple objects.

The problem of generating realistic-looking images is complex. For the simplicity of the simple generative adversial network, it's done a pretty good job in creating images that look real, at least from a distance.

The code for training and producing these images is up on my Github.

]]>Implementing a model into your reinforcement learning agent yields many benefits. The

]]>Implementing a model into your reinforcement learning agent yields many benefits. The agent can perform lookahead and perform simulations in its head to try and predict what the best move is, without taking any real-world actions.

Learning a model can also be simpler than learning value functions, especially in environments with a lot of states. Take the game chess. There are 10^50+ possible states in chess, so a value function would take an unreasonable amount of time to compute. By instead learning a model, the agent can learn how the game itself works, and use any planning algorithm such as value/policy iteration to find a strategy. The downside here is that there is more computation involved in selecting an action.

model-based learning

How is a model learned? Well, the idea is the same as learning a reward function. Every time an action is taken, we record which state we ended up in. We then use the ratios as the probability of transitioning to that state.

For simple tasks, having a table of states and actions will work just fine. However, for larger state spaces, a function approximator can be used instead, with the same techniques discussed earlier.

So, model-free learning learns a policy directly from real-world interactions. In comparison, model-based learning learns a model of how the environment works, and comes up with a strategy based on simulated interactions.

There's actually a third option: Dyna is an idea that combines the best of both worlds. We still learn a model, but create the policy using both real and simulated data.

Dyna: learn policy from real and simulated interactions

A basic Dyna update would proceed as follows

- Take an action in the real environment based on the current policy
- Update value function Q(s,a) from the real experience using SARSA
- Update the model's transition probabilities from the real experience
- Update Q(s,a) from simulated experience n times
- Repeat

Why does this work? By keeping a model, we can make more efficient use of the data collected. If taking an action in the real environment is costly, Dyna allows us to make use of that data more than once.

There's another way we can utilize models to find a policy. Since we can simulate what will happen in the future, we can use forward search to look ahead and determine what action is best.

A key thing to note is we don't need to solve the whole state space, only states that we will encounter from the current state onwards.

Like in the past, Monte-Carlo is the simplest example. Let's say we're at some state S. Using state S as the start state, we simulate a sequence of actions all the way until the terminal state, using the model we learned. If we do this many times for each possible action, and average the total return at the end, all that's left is to act greedily and choose the action from the starting state that had the greatest average total return.

The problem here is that whatever policy we were using to simulate the model is static and doesn't change. Monte-Carlo Tree Search fixes this problem.

In simple Monte-Carlo search, we performed a search and to find the best action at the start state. This is great and all, but it only helped us in one state. With Monte-Carlo Tree Search, we follow the same search as in Monte-Carlo search, but keep averages on state-action pairs for every pair we encounter, not only the starting state. Our simulation policy is then to act greedily towards the action that leads to the greatest average total reward. If it's a new state, we can default to acting randomly in order to explore.

To make this concrete, say we're in some starting state S. At first, we have no information, so we perform 100 iterations acting randomly until the terminal state, recording total rewards.

the model so far

In simple Monte Carlo search, all we would know is that going right from the starting state is better. However, in Monte Carlo Tree Search, we also know that going right from state D is better.

In practice, this is actually just the same as applying Monte-Carlo control to simulated experience that starts from the current state.

TD-learning works as well, especially in environments where the same state can be encountered twice in one episode.

Finally, there's a method called Dyna-2 that combines this lookahead tree with the concept of using both simulated and real experience.

In Dyna-2, the agent holds long-term policy as well as short-term policy, as well as a model of the environment. The long-term policy is learned from real experience, using regular TD learning.

The short-term policy is computed at every state, using forward search to find the optimal action for this specific state.

The intuition here is that the long-term policy will learn general knowledge about the environment, while the short-term policy can search for specific results when starting from the current state.

]]>π = P (a|s)

This brings with it some advantages and

]]>π = P (a|s)

This brings with it some advantages and disadvantages. In the case of very high-dimension action spaces, it can take a lot of computing power to find the maximum value, and learning a policy allows us to bypass that step. We also get the benefit of stochasticity, or a random probability to choose one action compared to another. In some partially observed environments, random decision making is an essential piece of the optimal strategy. However, learning a policy often has higher variance and will take longer to train, and it may reach a local optima.

example of stochastic (random) policy being superior

So, how do we learn a policy? We'll assume some function π[W] (s,a) that takes in state-action pairs and returns probabilities. W represents parameters that we use to learn this function. If we're using a neural network, W is the weights of the network.

I won't go into too much of the mathematical details here, so check out the relevant video in David Silver's RL course for derivations and stuff.

In short, if we know the gradient of parameters to make the probability of taking action A from state S more likely, we adjust the parameters in that direction to the scale of the return for taking the action.

Let's go over a short example using Monte-Carlo learning. To learn a policy, we would first have to wait until the end of the episode. For every timestep, we would update parameter W:

```
W = W + learningRate * [derivative of W to maximize P(a|s)] * [total reward]
```

Note that total reward means from that timestep onwards.

This is great and all, but we run into same problems as using Monte-Carlo learning in the past: since we wait until the end of the episode, variance is high. To solve this, we can use good old policy evaluation to estimate the value function, and bootstrap off of that. This is known as actor-critic, where the actor is the policy and the critic is a value function.

An simple actor-critic follows the steps below:

- Estimate value function V(s) using policy evaluation of π
- Take an action based on policy π
- Find TD error of that action:
- update parameters W to the scale of TD error

```
TD error = reward + discountFactor * V(next state) - V(current state)
```

To understand how this works, think of `reward + discountFactor * V(next state)`

as the value of taking action A from state S, or `Q(s,a)`

.

The TD error can then be thought of as `Q(s,a) - V(s)`

, or "how good is it to take this action from this state, relative to how good the state is?".

Let's say we're in a state with a value of 100, we take an action that rewards us with 10, and the next state has a discounted value of 95. `(95 + 10) - 100 = 5`

, so this was clearly a good move. Therefore, we should update the policy in a direction that would increase the probability to take that action from that state.

This works really well since a bigger TD error would have a greater impact on our policy, and negative TD errors would adjust parameters in the opposite direction, decreasing the probability of taking a bad action.

Note: this is a continuation from a previous post, with information taken from David Silver's RL Course.

]]>I attempted to recreate the techniques described in Visualizing and Understanding Convolutional Networks to project features in the convnet back to pixel

]]>I attempted to recreate the techniques described in Visualizing and Understanding Convolutional Networks to project features in the convnet back to pixel space.

In order to do this, we first need to define and train a convolutional network. Due to lack of training power, I couldn't train on ImageNet and had to use CIFAR-10, a dataset of 32x32 images in 10 classes. The network structure was pretty standard: two convolutional layers, each with 2x2 max pooling and a reLu gate, followed by a fully-connected layer and a softmax classifier.

We're only trying to visualize the features in the convolutional layers, so we can effectively ignore the fully-connected and softmax layers.

Features in a convolutional network are simply numbers that represent how present a certain pattern is. The intuition behind displaying these features is pretty simple: we input one image, and retrieve the matrix of features. We set every feature to 0 except one, and pass it backwards through the network until reaching the pixel layer. The challenge here lies in how to effectively pass data backwards through a convolutional network.

We can approach this problem step-by-step. There are three main portions to a convolutional layer. The actual convolution, some max-pooling, and a nonlinearity (in our case, a rectified linear unit). If we can figure out how to calculate the inputs to these units given their outputs, we can pass any feature back to the pixel input.

image from Visualizing and Understanding Convolutional Networks

Here, the paper introduces a structure called a deconvolutional layer. However, in practice, this is simply a regular convolutional layer with its filters transposed. By applying these transposed filters to the output of a convolutional layer, the input can be retrieved.

A max-pool gate cannot be reversed on its own, as data about the non-maximum features is lost. The paper describes a method in which the positions of each maximum is recorded and saved during forward propagation, and when features are passed backwards, they are placed where the maximums had originated from. In my recreation, I took an even simpler route and just set the whole 2x2 square equal to the maximum activation.

Finally, the rectified linear unit. It's the easiest one to reverse, we just need to pass the data through a reLu again when moving backwards.

To test these techniques out, I trained a standard convolutional network on CIFAR-10. First, I passed one of the training images, a dog, through the network and recorded the various features.

our favorite dog

first convolutional layer (features 1-32)

As you can see, there are quite a variety of patterns the network is looking for. You can see evidence of the original dog picture in these feature activations, most prominently the arms.

Now, let's see how these features change when different images are passed through.

first convolutional layer (feature #7)

This image shows all the different pixel representations of the activations of feature #7, when a variety of images are used. It's clear that this feature activates when green is present. You can really see the original picture in this feature, since it probably just captures the overall color green rather than some specific pattern.

Finally, to gain some intuition of how images activated each feature, I passed in a whole batch of images and saved the maximum activations.

maximum activations for 32 features

Which features were activated by which images? There's some interesting stuff going on here. Some of the features are activated simply by the presence of a certain color. The green frog and red car probably contained the most of their respective colors in the batch of images.

two activations from the above image

However, here are two features which are activated the most by a red frog image. The feature activations show an outline, but one is in red and the other is in blue. Most likely, this feature isn't getting activated by the frog itself, but by the black background. Visualizing the features of a convolutional network allows us to see such details.

So, what happens if we go farther, and look at the second convolutional layer?

second convolutional layer (64 features)

I took the feature activations for the dog again, this time on the second convolutional layer. Already some differences can be spotted. The presence of the original image here is much harder to see.

It's a good sign that all the features are activated in different places. Ideally, we want features to have minimal correlation with one another.

Finally, let's examine how a second layer feature activates when various images are passed in.

second convolutional layer (feature #9)

For the majority of these images, feature #9 activated at dark locations of the original image. However, there are still outliers to this, so there is probably more to this feature than that.

For most features, it's a lot harder to tell what part of the image activated it, since second layer features are made of any linear combination of first layer features. I'm sure that if the network was trained on a higher resolution image set, these features would become more apparent.

The code for reproducing these results are available on my Github

]]>Model-free prediction is predicting the value function of a certain policy without a concrete model.

The simplest method is Monte-Carlo

]]>Model-free prediction is predicting the value function of a certain policy without a concrete model.

The simplest method is Monte-Carlo learning. However, it only works on episodic tasks: where you have a certain set of actions, the the episode ends with some total reward. Monte-Carlo learning states that the return of a state is simply the mean average of the total reward from when a state appeared onwards.

In short, once an episode has concluded, every state that was present can update its value function towards that end return.

However, a problem lies in that every state updates towards the end return, even though that individual state may have had no impact whatsoever on how much reward there was at the end. In order to reduce the variance, we can use a different method of prediction.

Temporal-difference learning, or TD learning, updates the values of each state based on a prediction of the final return. For example, let's say Monte-Carlo learning takes 100 actions and then updates them all based on the final return. TD learning would take an action, and then update the value of the previous action based on the value of the current action. TD learning has the advantage of updating values on more recent trends, in order to capture more of the effect of a certain state.

TD has a lower variance than Monte-Carlo, as each update depends on less factors. However, Monte-Carlo has no bias, as values are updated directly towards the final return, while TD has some bias as values are updated towards a prediction.

TD and Monte-Carlo are actually opposite ends of a spectrum between full lookahead and one-step lookahead. Any number of steps can be taken before passing the return back to an action. One step would be the previously discussed TD learning, and infinite steps is Monte-Carlo.

The question rises: how many steps is optimal to look ahead? Unfortunately, it varies greatly depending on the environment and is often a hyperparameter. Another solution is TD(λ), which takes a geometric-weighted average of all n-step lookaheads and uses that to update the value.

Finally, another way to think about TD(λ) is through eligibility traces. An eligibility trace is some weighting that is assigned to every state, based on how frequently and recently it has shown up. After each action, all states are updated proportionally to their eligibility trace towards the reward.

Okay, that's great. We can find value functions for any given policy through experimentation. But how do we improve on this policy to find the optimal strategy?

To do this, we have to return to the idea of policy iteration. Find the value function, change the policy to act greedily towards the value function, and repeat. However, we don't know how the environment works, so the transition probabilities are unknown. We can't act greedily towards state values if we don't know which actions lead to which states!

The solution here is to learn action-values instead of state-values. For every Q(state,action), there is a value corresponding to the expected return of taking a certain action from a certain state. By learning these instead, we can act greedily by simply taking the action with the highest expected return when at a state.

So we've got a basic iteration going, and it works well for simple policies. However, there's a problem with the lack of exploration. Let's say we're at a state with two doors. One door gives a reward of +1 every time, and the other door gives +0 reward, but has a 10% chance to give a reward of +100. On the first run, the second door doesn't get lucky and returns a reward of 0. Since it's acting greedily, the AI will only ever take the first door since it considers it the better option.

How do we fix this? Random exploration. At some probability, the program will take a random action instead of taking the optimal action. This allows it to eventually figure out that there is some extremely high reward hidden behind the second door if it tries it out. However, we want the AI to eventually converge at some optimal policy, so we lower the probability of taking a random action over time.

Finally, we can make use of the same techniques in model-free prediction for control. Eligibility traces work just as well in action-value iteration as in state-iteration.

There's also a little bit about learning off-policy, or learning by watching another policy make decisions. We can utilize a method named Q-learning, which is an update very similar to our random exploration. The value of Q(state,action) is updated towards the reward of an some arbitrary action that the observed policy decides on, and the expected return from that state onwards according to our greedy policy. Basically, we watch what the other policy did and how it affected the environment, and assess how good the move was depending on out current action-values.

More information about the difference between random-exploration (SARSA) and Q-learning: https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/

The methods we've been using in reinforcement learning often make use of a value function, either based on states or actions. We've been assuming some concrete mapping of states to values, such as a table. However, in some problems, there may be millions of states. In the case of some real-world problems such as controlling a robot arm, there may be almost infinite states for each joint.

Therefore, for most practical applications it is useful to use a function approximator to estimate the value function given a state. This can be anything from a neural network to just some linear combination of features.

Let's take a standard policy evaluation algorithm, using TD. Instead of updating the value of a table on every iteration, we can make a supervised training call to a neural network that takes in Markov states as input and outputs the value of that state. If we're doing control, the action can be encoded as a one-hot vector and included in the network input. For TD(λ), eligibility traces can be stored on individual features of the Markov state to produce the same effect.

However, there are some downsides to using a neural network as a value function. Since we're updating the network on values that are bootstrapped off its own output (in the case of TD), the network might never converge and instead scale up to infinity or down to 0.

There are two tricks that can help combat this. Instead of training the network on the current TD value, we instead save that "experience" into some array. We then train the network on a batch of randomly selected experiences from the past. This not only decreases correlation, but allows us to reuse data to make training more efficient.

We also hold two copies of the neural network. One is frozen and does not change, and we bootstrap off of it when calculating TD error. After ~1000 steps or so, we update the frozen neural network with the one we have been training on. This stabilizes the networks and helps prevent the networks from scaling up to infinity.

Note: this is a continuation from a previous post, with information taken from David Silver's RL Course.

]]>Well, one way to figure it out is by simply iterating the policy over a value function for each state.

In this example, we have a small gridworld. The goal is to make

]]>Well, one way to figure it out is by simply iterating the policy over a value function for each state.

In this example, we have a small gridworld. The goal is to make it to one of the two gray corners, ending the game. To encourage this, every move will have a reward of -1. Our policy will simply be to move in a random direction.

In order to find the value of the policy, we can start from a value function of all 0 and iterate, adding the reward for each state after every iteration.

In each step, the value of a state is the sum of the previous values of all neighbors, plus the reward of -1, scaled by probability (0.25 for up,down,right,left). Since the gray corners mark the end of the game, the reward is just 0.

After only two iterations, it already becomes clear that being near the two gray corners will have a higher return than being near the center.

But that only shows how to evaluate a certain policy. Randomly choosing a direction is certainly not the best strategy.

In policy iteration, instead of sticking to the same policy every time the values are iterated, the policy acts greedily towards the best expected result. In the example above, following the direction of the arrows would be the greedy policy.

Eventually, the policy would reach a point where continuing to iterate would no longer change anything. That final policy would therefore be the optimal policy for that environment.

The loop for policy iteration is as follows:

- perform policy evaluation for some number of iterations
- change the policy to act greedy towards the new value function
- repeat

On the other hand, value iteration only cares about the value function, and not on forming a complete policy for every iteration. It's the same as policy iteration, but only performing policy evaluation once and then changing the policy right away.

While this requires less steps to find the optimal policy, intermediate steps cannot be used as a suboptimal policy as the values do not correspond to a concrete policy.

Note: this is a continuation from a previous post, with information taken from David Silver's RL Course.

]]>A Markov State is a bunch of data that not only contains information about the current state of the environment, but all useful information from the past. In short, when making decisions

]]>A Markov State is a bunch of data that not only contains information about the current state of the environment, but all useful information from the past. In short, when making decisions based on a Markov State, it doesn't matter what happened, say, three turns ago. Anything that might have changed as a result of past actions will be encoded in the Markov State.

But that only tells about one state of the environment. On the other hand, a Markov Process contains a set of all possible states, and transition probabilities for each one. For example, someone in a sleeping state may have a 70% chance to continue sleeping, and a 30% chance to wake up.

Building on this, a Markov Reward Process contains all the same information, but also has rewards associated with each state. Reward is simply an arbitrary number that determines how good it is to be in a certain state, or take an action from a state. For example, being in the sleeping state may have a reward of 3 as you are well rested, while waking up has a reward of 0. However, reading a book once you wake up may give a reward of 10.

This leads into the concept of reward vs return. Rewards are immediate for each state. However, the return of a state is the total possible reward from that state onwards. There is often a discount associated with future rewards, to account for possible variation and inconsistencies, as well as prevent infinite returns.

The value function of a state indicates its return, which is composed of the state's immediate reward, plus the discounted reward of the future state. In the case of multiple possible states, the future rewards are simply weighted by their respective probabilities.

The Bellman Equation is a simple way of calculating the value of each state. For each state,

```
v = R + γPv
```

With γ being the discount factor (0 <= γ <= 1) and P being the transition probabilities from each state to the next. The Bellman Equation builds on itself, and can be computed iteratively by using the last set of values to compute the latest set.

Finally, a Markov Decision Process puts all the pieces together. It contains everything a Markov Reward Process does, along with a set of actions possible from each state.

This allows a policy, a mapping of states to actions, to be developed. A policy is simply a strategy of which action to take from each state.

By following a policy to determine the next state, bellman equations can be calculated for any number of policies. This brings us to the state-value function, or the expected return for each state, following policy π. There's also an action-value function, which contains expected returns for taking a certain action from a state, following policy π.

The end goal is to find the maximum action-value function over all policies. Once this is known, we can simply always take the action with the highest potential reward, and the agent will theoretically maximize on rewards.