kevin frans blog

Alchemist's Notes on Deep Learning

Kevin Frans — Wed, 28 May 2025 16:41:19 GMT

I recently compiled a more cleaned-up set of explanations and implementations on topics relating to deep learning and generative modelling. Check them out at https://notes.kvfrans.com/

Small-Research: Tanh Activations with DDPG

Kevin Frans — Tue, 19 Dec 2023 08:00:24 GMT

When implementing DDPG-style policy extraction, we often use a tanh normalizer to bound the action space. That way the policy does not attempt to output something OOD when the Q-function is only trained on actions in [-1, 1].

dist = agent.actor(batch['observations'])
normalized_actions = jnp.tanh(dist.loc)
q = agent.critic(batch['observations'], normalized_actions)
q_loss = -q.mean()

When I was writing code to execute the policy, I had forgotten to include this tanh shaping. So, it was using the raw actions, but clipped between [-1, 1]. Somehow, the performance was still basically at SOTA.

actions = agent.actor(observations).sample(seed=seed)
actions = jnp.tanh(actions) # We don't actually need this line.
actions = jnp.clip(actions, -1, 1)

Question: Is it better to not apply tanh during evaluation? One hypothesis is that the policy can't actually output -1 or 1 using tanh, because the pre-activation has to approach infinity.

I ran experiments comparing the correct version (tanh applied during eval) with my clipped version (instead of tanh, clip the actions):

Conclusion: It doesn't actually matter. Both perform the same. DMC environments are known to be solvable with bang-bang control, so the policy is likely outputting actions close to -1 or 1, regardless of the tanh activation.

Small-Research: Policy Extraction in IQL

Kevin Frans — Tue, 19 Dec 2023 07:35:17 GMT

Insight: IQL-AWR agents on 'maze2d' converge to a standard deviation of 0.4, which is much higher than expected.
Hypothesis: The KL constraint of AWR prevents it from learning Gaussian policies that are centered around the optimal action.
Conclusion: In data-heavy settings, use DDPG or discretized AWR for policy extraction. For sparse data, use AWR because DDPG will overfit.

When doing offline RL with IQL, we learn an implicit Q-function without learning a policy. The policy must be extracted later using the information from Q(s,a). In the original IQL paper, advantage-weighted regression (AWR) is used for policy extraction. The AWR objective is a weighted behavior cloning loss with the form:

$max \; log \; \pi(a|s) * exp(\lambda * A(s,a))$

The AWR objective can be derived as the solution to constrained optimization problem of

Maximize $E[ \pi(a|s) * A(s,a) ]$
st. $KL(\pi_{data} | \pi) < \epsilon$

The $\lambda$ term is a temperature parameter that controls how tight the constraint is.

A nice property of AWR is that it avoids querying out-of-data actions. We only weight state-action pairs inside the dataset, instead of searching over the $A(s,a)$ space which may have inaccurate values.

But a downside of AWR is that we assign probability mass to each action in the dataset. Since we usually use a Gaussian policy, this leads to bad behavior.

As an example, let's look at maze-2d. Here the actions are (x,y) velocities to move an agent to a goal. It's not complicated; and intuitively the best actions should be along the edge of the unit sphere. But, the policy doesn't behave that way.

Left: Data actions. Right: Actions produced by IQL-AWR policy.

If we look at the standard deviations of the Gaussian heads, they are not at zero. They are at 0.4, which is quite a bit of variance.

Remember that Gaussian heads have a log-probability that is equivalent to L2 loss scaled by variance. They suffer from the issue of mean-matching rather than mode-matching. With AWR, we find a 'compromise' between behavior cloning and advantage-maximization, and this compromise might not make sense.

Gaussians cannot capture multi-modal distributions. The most likely point is not the mode, but the mean.

Let's try some fixed. What if we extract the policy via DDPG? DDPG-style extraction optimizes through the Q-function to find the best action at each state. Importantly, DDPG ignores the shape of the Q-function landscape, and only cares about where the maximum is.

Another path is to discretize the action space. If we bin the action space into tokens, we can handle multimodality easily. So we don't run into the mean-matching problem. The token with highest advantage will also have the highest probability.

Left: Data actions. Middle: DDPG actions. Right: AWR-Discrete actions.

That looks much more reasonable. The actions are all on the unit circle now, which is consistent with the data. DDPG picks only the highest-advantage actions, whereas the AWR policy still assigns some probability mass to all data actions, but does so in a multimodal way that preserves the correct locations.

Question: What about the temperature parameter? As lambda goes to infinity, then the KL constraint becomes zero and AWR approaches the same objective as DDPG. The potential issue is that 1) we get some numerical issues exponentiating a large number, and 2) the percent of actions with non-vanishing gradients becomes lower.

One way to think about it is that AWR acts like a filter. If the action has enough advantage, increase its probability. Otherwise do nothing. DDPG acts like an optimization problem. For every state, find the action that is the best, and memorize this mapping. AWR has the benefit that it only trains on state-action pairs in the data, which is good for offline RL, but it is also less efficient.

Increasing the temperature helps, but not as much as DDPG. We still have high standard deviation.

Conclusion: In maze2d, don't use AWR for policy extraction with a Gaussian head. Use discretization, or use DDPG.

DDPG does the best on maze2d. Highest success rate, and also shortest time until completion.

How about other environments? I tried the same tactics on other envs, and here's what we get. For the full AntMaze, and on D4RL half-cheetah, the AWR style extraction actually performs the best.

In both of these cases, the standard deviation via AWR is much lower. Both of these offline datasets are collected from expert data. So the data is naturally unimodal, which may be why AWR does not have an issue. If we look at the standard deviations, they are much lower than the maze2d experiments. All of these envs use an action space of (-1, 1).

Why does DDPG do so bad? I am not sure. If we look at the average IQL Q-values, along with the average Q-values that during the policy optimization step, they look equal for both envs. Then again, what we really should look at are the advantages, which might be small in comparison to the value at each state. Likely, DDPG extraction is exploiting some errors in the Q-value network. Both AntMaze and half-cheetah do not provide dense coverage of the action space.

If we look at ExoRL cheetah-run, which does have dense coverage of the action space, we get a different story.

DDPG is better again. So, it seems like there are two categories here.

If the dataset has dense coverage, DDPG will solve the multimodality issues of AWR and perform better.
But without dense coverage, DDPG will overfit to the Q-function errors and learn a degenerate policy.

Results: On maze2d and ExoRL, DDPG performs decently better. On AntMaze and D4RL, AWR works and DDPG gets zero reward.

There's some clear room for improvement here – if the DDPG is prevented from overfitting on the sparse-data environments, likely it can outperform the AWR policy performance.

Successor Representations Explained

Kevin Frans — Fri, 08 Sep 2023 18:28:15 GMT

Note: We're going to conflate terminology for value-functions and Q-functions. In general the Q-function $Q(s,a)$ is just an action-conditioned version of the value function $V(s)$ .

Successor Representations

Let's consider a classic temporal difference method -- estimating the value function. The value of a state is the expected discounted sum of future rewards, starting from that state. Notably, temporal difference methods give us a nice recursive form of the value function.

$V(s) = r + \gamma E_{p(s'|s)}[V(s')]$

Learning a value function is helpful because it lets us reason about future behaviors, from a single function call. What if we could do this not over reward, but over states themselves?

Successor representations are an application of temporal difference learning, where we predict expected state occupancy. If I follow my policy, discounted into the future, what states will I be in?

This is easily understandable in a MDP with discrete states. Let's say I start out in state A. After one timestep, I transition into a probability distribution over A,B,C. Each path then leads to a distribution for t=2, then t=3, etc. Now, we'll combine all of these distributions as a weighted average.

$\mu(s^*|s) = (1-y)p(s'|s) + \gamma E_{p(s'|s)} \mu(s^*|s')$

Looks similar to the value function, doesn't it. Successor representations follow the same structure as the value function, but we're accumulating state probabilities rather than expected reward.

Gamma Models

In more recent methods, we use deep neural networks to learn the successor representation. Gamma models attempt to a learn an action-conditioned successor function $\mu(s^*|s,a)$ . Notably, $\mu$ is a generative model and can handle both continuous and discrete settings. That means we can learn $\mu(s^*|s,a)$ using our favorite generative-modelling methods. The paper uses GANs and normalizing flows.

This interpretation gives us some nice properties:

If our reward function is $\delta(s = s^*)$ (i.e. sparse reward at some goal state), then the value of any state is equivalent to evaluating the gamma-model at $\mu(s^*|s,a)$ . The value function of a sparse reward is equal to the probability that the policy occupies that state.
We can extract the value function for arbitrary reward functions as well. It involves a sum over trajectories sampled from the policy:
- $Q(s,a) = \sum \mu(s^*|s,a) \; r(s^*)$
If we set $\gamma=0$ , then the gamma model is actually a one-step world model. In this case, $\mu(s^*|s,a) = p(s'|s,a)$ and we're simply modelling the environment's transition dynamics.

Forward-Backward Models

This next work describes FB models, an extension of successor representations that lets us extract optimal policies for any reward function.

Above, we showed we can extract any Q(s,a), given a reward function r(s) and $\mu(s^*|s,a)$ . In continuous space, we need to sum over trajectories to compute $Q_r(s,a) = \sum \mu(s^*|s,a) \; r(s^*)$ . In discrete space, $\mu(s^*|s,a)$ can directly output a vector over $s$ , and we can compute Q with a matrix multiplication $Q(s,a) = \mu(s,a)^Tr$ . Moving forward, we'll assume we are in the continuous setting.

A key insight about value-functions and successor-functions is that they are policy-dependent. Depending on the policy, the distribution of future (reward/states) will change. Assuming we had some representation of a policy $z$ , we could train a policy-conditioned successor function $\mu(s^*|s,a,z)$ . We can further decompose $\mu$ into the policy-dependent and -independent parts, F and B. $\mu(s^*|s,a,z) = F(s,a,z)^TB(s^*)$ .

Let's take a look at that decomposition in detail. The original $\mu$ is a neural-network with inputs $(s^*,s,a,z)$ . It returns a scalar of the probability of being in state $s^*$ . In the decomposition, F is a network with inputs (s,a,z) and output (d). B is a network with inputs ( $s^*$ ), outputting a vector (d,1). A final dot product reduces these vectors into a scalar. One way to think about these representations is: F(s,a,z) calculates a set of expected future features when following policy $z$ . B(d,s*) then calculates how those features map back to real states.

Now, how do we find $\pi_z$ that is optimal for every reward function? Well, we need a way to learn a function z(r). The trick here is we can choose this function, so let's be smart about it. Let's define z(r) = $\sum B(s^*) \; r(s^*)$ . Some derivations:
- Remember that the Q-function for any $r$ was $Q_r(s,a) = \sum \mu(s^*|s,a) \; r(s^*)$ .
- Also, we decomposed $\mu(s^*|s,a,z) = F(s,a,z)^TB(s^*)$ .
- The optimal Q is thus $\sum F(s,a,z)^TB(s^*) \; r(s^*)$ = $F(s,a,z)^T \sum B(s^*) \; r(s^*)$ .
- Based on $z(r) = \sum B(s^*) \; r(s^*)$ , we get $Q(s,a,z) = F(s,a,z)^Tz$

Notably, the reward term has disappeared! Let's interpret. The $z$ vector here is playing two roles. First, it serves a policy representation that is given to F -- it contains information about the behavior of $\pi_z$ . Second, it serves as a feature-reward mapping -- it describes how much reward to assign to each successor feature.

The F and B networks are trained without any reward-function knowledge. We'll do the same thing as in previous works. We update $\mu = FB$ through a Bellman update. Remember that $\pi(s',z) = \text{argmax} \; F(s,a,z)^Tz$ , and $\mu = F(s,a,z)^TB(s^*)$ .

$\mu(s^*|s,a,z) = \delta(s=s^*) + \gamma E_{p(s'|s,a)} \mu(s^*|s',\pi(s',z),z)$
$F(s,a,z)^TB(s^*) = \delta(s=s^*) + \gamma E_{p(s'|s,a)} F(s,\text{argmax} \; F(s',a,z)^Tz,z)^TB(s^*)$

You'll notice that in addition to trajectories, this update is conditioned on $z$ . Since we don't know $z$ ahead of time, we just need to guess -- we'll generate random $z$ vectors from a Gaussian distribution.

Test Time. Once we've learned F and B, extracting the optimal policy is simple. We'll calculate $z = \sum B(s^*) \; r(s^*)$ . That gives us our Q-function $Q(s,a,z) = F(s,a,z)^Tz$ , which we can act greedily over.

Variational Information Bottleneck Explained

Kevin Frans — Sun, 27 Aug 2023 22:39:44 GMT

Let's take a look at neural networks from the perspective of information theory. We'll be following along with the paper Deep Variational Information Bottleneck (Alemi et al. 2016).

Given a dataset of inputs $X$ and outputs $Y$ , let's define some intermediate representation $Z$ . A good analogy here is that $Z$ is the features of a neural network layer. The standard objective is to run gradient descent so to predict $Y$ from $X$ . But that setup doesn't naturally give us good representations. What if we could write an objective over $Z$ ?

One thing we can say is that $Z$ should tell us about the output $Y$ . We want to maximize the mutual information $I(Z,Y)$ . This objective makes natural sense, we don't want to lose any important information.

Another thing we could say is that we should minimize the information passing through $Z$ . The mutual info $I(Z,X)$ should be minimized. We call this the information bottleneck. Basically, $Z$ should keep relevant information about $Y$ , but discard the rest of $X$ .

The tricky thing is that mutual information is hard to measure. The definition for mutual information involves optimal encoding and decoding. But we don't have those optimal functions. Instead, we need to approximate them using variational inference.

Let's start by stating a few assumptions. First we'll assume that X,Y,Z adhere to the Markov chain (Y - X - Z). This setup guarantees that P(Z|X) = P(Z|X,Y) and P(Y|X) = P(Y|X,Z). In other words, Y and Z are independent given X. We'll also assume that we're training a neural network to model p(z|x).

Information Bottleneck Objective

Here's our main information bottleneck objective ( $\beta$ determines the "strength" of the bottleneck):

$I(Z,Y) - \beta I(Z,X)$ .

We'll break it down piece by piece. Looking at the first term:

$I(Z,Y) = H(Y) - H(Y|Z)$
$I(Z,Y) = H(Y) + \int p(y,z) \; log \; p(y|z)$ (Definition of conditional entropy.)

The H(Y) term is easy, because it's a constant. We can't change the output distribution because it's defined by the data.

The second conditional term is trickier. Unfortunately, p(y|z) isn't tractable. We don't have that model. So we'll need to approximate it by introducing a second neural network, q(y|z). This is the variational approximation trick.

The key to this trick is recognizing the relationship between $\int p(y,z) \; log \; p(y|z)$ and KL divergence. KL divergence is always positive, so $\int p(y,z) \; log \; p(y|z) \geq \int p(y,z) \; log \; q(y|z)$ . That means we can plugin q(y|z) instead of p(y|z) and get a lower bound. Another way to interpret this trick is: H(Y|Z) is the lowest entropy we can achieve using $Z$ in the optimal way. If we use $Z$ in a sub-optimal way, that's an upper bound on the true entropy. The overall term is lower-bounded because we subtract H(Y|Z).

Because of the Markovian assumption above, we can break down p(y,z) into p(y|x)p(z|x). This version is easier to sample from.

Let's look at the second term now. We will use a similar decomposition.

$I(Z,X) = H(Z) - H(Z|X)$
$I(Z,X) = -\int p(z) \; log \; p(z) + \int p(x,z) \; log \; p(z|x)$

This time, the second term is the one that is simple. We have the model p(z|x), so we can calculate the log-probability by running the model forward. The p(z) term is tricky. It's not possible to compute p(z) directly, so we approximate it with r(z). The same KL-divergence trick makes this an upper bound to H(Z).

Putting everything together, the information bottleneck term becomes:

$I(Z,Y) - \beta I(Z,X) \geq \int p(x)p(y|x)p(z|x) \; log \; q(y|z) - \beta \int p(x)p(z|x) log \; \dfrac{p(z|x)}{r(z)}$

It's simpler than it looks. We can compute an unbiased gradient of the above term using sampling. Once we sample $x \sim X$ , we can calculate p(y|x) and p(z|x) by running the model forward. The second term is actually a KL divergence between p(z|x) and r(z), which can be computed analytically if we assume both are Gaussian.

Thus the objective function can be written as:

$\text{min}_\theta \; E[-log \; q_\phi(y|p_\theta(z,x)) + \beta \; KL(p_\theta(z|x) || r(z))]$

Intuitions

Let's go over some intuitions about the above formula.

The first part is equivalent to maximizing mutual information between Y and Z. Why is this equivalent? Mutual information can be written as $H(Y) - H(Y,Z)$ . The first term is fixed. The second term asks "what is the best predictor of Y we can get, using Z"? We're training the network $q_\phi$ to do precisely this objective.
The second part is equivalent to minimizing mutual information between $Z$ and $X$ . Why? Because no matter which $x$ is sampled, we want the posterior distribution $p_\theta(z|x)$ to look like $r(z)$ . In other words, if this term is minimized to zero, then p(z|x) would be the same for every $x$ . This is the "bottleneck" term.
The variational information bottleneck shares a lot of similarities with the variational auto-encoder. We can view the auto-encoder as a special case of the information bottleneck, where Y=X. In fact, the VAE objective $\text{min}_\theta \; E[-log \; q_\phi(x|p_\theta(z,x)) + KL(p_\theta(z|x) || r(z))]$ is equivalent to what we derived above, if y=x and $\beta=1$ .

If we set X=Y, one could ask, isn't it weird to maximize $I(Z,X) - I(Z,X)$ ? It's just zero. The answer is that we're optimizing different approximations to those two terms. The first term is the recreation objective, which optimizes a bound on H(X|Z). The second is the prior-matching objective, which optimizes the true H(Z|X) and a bound on H(Z).
What extra networks are needed to implement the variational information bottleneck? Nothing, really. In the end, we're splitting y=f(x) into two networks, p(z|x) and q(y|z). And we're putting a bottleneck on $z$ via an auxiliary objective that it should match a constant prior r(z).

A Brief Examination of Generative Models

Kevin Frans — Fri, 11 Aug 2023 21:58:26 GMT

Let's explore some generative models. There's a nice variety of them around. How well can the various methods model a given dataset? What are their flaws and strengths?

I'm interested in how well generative models handle mode coverage. Let's say my dataset has images of five cat breeds. How is each breed reflected in the generated images? If one breed is less prominent in the data, does that affect the accuracy of its generated versions?

The goal with this project was to get a deeper sense of how each generative model method works, and in what unique ways they can fail. We will take a look at 1) variational auto-encoders, 2) generative adversarial networks, 3) auto-regressive samplers, and 4) diffusion models.

Code for all these experiments is available at https://github.com/kvfrans/generative-model.

The Dataset

To get a cleaner look at the various models, I opted to create an artificial dataset rather than use natural images. The dataset is based off of MNIST digits. But, I've put some tricks in how the dataset is constructed. We'll use only these five digits, which I will refer to as "0A, 0B, 1, 2, 3, 4".

The main twist is that each digit is present in the dataset at differing frequencies. ["0A, 0B, 1, 2"] are included at the same equal frequency. "3" is half as likely to be sampled, and "4" is a quarter as likely. So, the digits "3" and "4" are less present than the other digits. A strong generative model should be able to correctly model these frequency differences.

You'll notice that 0 is in the list twice. This is intentional. There are two kinds of zeros, but they are quite similar. I'm curious if the models can correctly generate both kinds of digits.

We'll also add a second dimension of variation -- color. Again, we're going to put some twists on the frequencies here so everything isn't just uniform. Each digit has a (1/3) chance to be pure green. With (2/3) chance, the digit will instead be a random linear interpolation between red and blue. In other words, there is a continuous distribution of colors between red and blue, and a discontinuous set of pure green digits.

Real data from our custom dataset.

How will we measure mode coverage? First, we will need a way to categorize generated digits into one of the 30 possible bins. Let's train a classifier network.

Actually, that's a trap and we won't. There's a much simpler way to classify them; nearest neighbors. To classify a digit, we'll simply loop over the six possible MNIST digits and find the one with the lowest error. We'll do the same for the average color. In this way, we can cheaply categorize digits to look at frequencies.

Accuracy is straightforward as well. Once a digit has been categorized, measure the distance between the generated digit and its closest match. That will be the accuracy.

Oracle Network

As a starting point, let's construct a network that can access privileged information. We'll call it the oracle network. This network will take as input a vector containing a one-hot encoding of the MNIST digit, and a three-vector of colors. Given this input, it must recreate the matching colored digit.

The oracle network has a straightforward learning objective. It only has to memorize colored versions of the six digits, and doesn't need to model any sort of probability distribution. We can train the network with a mean-squared error loss.

I'm using JAX to write the code for these experiments. The network structure is a few fully-connected linear layers, interspersed with tanh activations.

activation = nn.tanh
class GeneratorLinear(nn.Module):
    @nn.compact
    def __call__(self, x):
        # Input: N-length vector.
        x = nn.Dense(features=64)(x)
        x = activation(x)
        x = nn.Dense(features=128)(x)
        x = activation(x)
        x = nn.Dense(features=28*28*3)(x)
        x = nn.sigmoid(x)
        # Output: 28x28x3 image.
        x = x.reshape(x.shape[0], 28, 28, 3)
        return x

Throughout this project, we'll do our best to use this same network structure for all the methods.

Images created by the oracle network.

As expected, it's an easy task. Memorizing these MNIST digits is simple. Trickier will be generating them at the right frequencies.

Variational Auto-Encoder

The first generative modelling method we'll examine is the variational auto-encoder (VAE).

Source: https://danijar.com/building-variational-auto-encoders-in-tensorflow/

The VAE works by jointly training an encoder and decoder network. The encoder takes in an image as input, and compresses it down to an N-length latent vector $z$ . This latent vector is then given to the decoder, which recreates the original image.

The catch is that we're not happy just encoding and decoding an existing image. We want to generate images, which means sampling them from a random distribution. What is the distribution? It's $P(z)$ . Unfortunately $P(z)$ is high dimensional and intractable to sample from.

Instead, we shape $P(z)$ so it's in a form that we can work with. We'll have the encoder output not a latent vector, but a Gaussian parametrized by mean and log-variance. We also apply a KL divergence objective between the output and the unit Gaussian ( $\mu = 0; \sigma = 1$ ). This objective encourages the model so that $P(z)$ looks similar to the unit Gaussian.

This shaping means we can approximate $P(z)$ by sampling $z$ from the unit Gaussian. And that's precisely what we do to generate new images -- sample a random latent variable, then pass it through the decoder.

For more on VAEs, check out this post and this newer one.

Images created by the variational auto-encoder.

The VAE is quite stable to train, and it does a good job. It's surprising how well the frequencies match the ground truth. The errors are greater for digits that are less frequently seen during training. Soon we'll see that a well-behaved objective is not to be taken for granted.

Generative Adversarial Networks

Next up is the generative adversarial network (GAN). GANs attack the problem that measuring reconstruction via a mean-squared error loss is not ideal. Such a loss often results in a blurry image -- colors will collapse to the mean, so a digit that is either red or blue will be optimized to be purple. In the GAN setup, reconstruction error is instead measured by a learned discriminator network.

Source: https://developers.google.com/machine-learning/gan/generator

The setup is as follows. The generator takes random noise as input, and produces fake images. The images are then passed to the discriminator, which predicts whether the image is real or fake. The generator's objective is to maximize the probability that the discriminator classifies its images as real.

Meanwhile, the discriminator is trained on both real and fake images, and is optimized to classify them accordingly. Thus, there is an adversarial game here. At the Nash equilibrium, the generator is producing images that exactly match the distribution of real images.

Images creates by the generaive adversarial network. Note the mode collapse; most digits are missing. Green is also missing.

Looking at the frequencies gives a hint to a key flaw in GANs. They often suffer from "mode collapse", where the generator fails to capture the entire spectrum of data, and instead produces a limited subset. In the results above, we see a generator that produces only 2s, and a separate trial that produces only 0s.

The mode collapse issues arises because the discriminator assesses each image independently. Thus there is no picture of the distribution of generated images, which is needed to assess diversity. In theory, mode collapse is solved in the optimal equilibrium -- if a generator is producing only 0s, the discriminator can penalize 0s, forcing the generator to produce something else, and so on. In practice, it is hard to escape this local minima.

One way to address mode collapse is via minibatch discrimination. In this idea, the discriminator receives privileged information about the batch as a whole. For each image, the discriminator can see the average distance of the image to the rest of the batch, in a learned feature space. This setup allows the discriminator to learn what statistics should look like in a batch of real images, thus forcing the generator to match those statistics.

Does it work? Well, we are doing better on the frequencies. But the quality of the results have taken a hit. GANs are quite unstable to train due to their adversarial nature. The discriminator must provide a clean signal for the generator to learn from, and often times this signal will be lost or diminished.

Auto-Regressive Sampling

In this next section, we'll take a look at a method that explicitly models the image probability distribution $P(x)$ .

The issue with modelling $P(x)$ directly is that the image space is high-dimensional. Instead, we need to break it into smaller components. PixelRNN breaks down an image into a sequence of pixels. If we can model $P(x_t|x_0...x_{t-1})$ , then $P(x)$ becomes the product of these individual probabilities. In other words -- generate an image pixel-by-pixel.

To model the probability distribution of a pixel, it helps to break the color into a set of discrete bins. I chose to represent each RGB channel as 8 possible options. Together this results in $8^3 = 512$ bins for each pixel. Our model will be a neural network that outputs a probability distribution over these 512 bins, given the preceding pixels.

To process our image, we'll flatten each image into a $28*28 = 784$ -length vector. The vector has 3 channels for the color, and an additional 2 channels for the normalized XY position of that pixel. 2 more channels are set to $(1-XY)$ coordinates. It's also helpful to include a boolean channel that is 1 if the pixel is on the right-most X boundary -- this information helps in modelling the jump between rows of an image.

Source: https://arxiv.org/abs/1601.06759v3

In the original PixelRNN setup, the neural network model is a recurrent network. It ingests a sequence of pixels recurrently, updating its hidden state, then using the final hidden state to predict the next pixel. I tried this setup but I had quite a hard time getting the model to converge.

Instead, we're going to use a hacky flattened version. The model gets to see the past 64 pixels, each which has 8 channels, for a total of $512$ channels per input. That's not bad at all. A flattened network is often inefficient because there is redundancy in treating each channel as separate, when in reality there is some temporal structure to it. But in our case, it's easier to learn a flattened feedforward network than a recurrent one due to stability in training.

During implementation, I ran into some troubles. It was easy to mess up the sampling procedure. The network needs to iterate through the image, predicting the next pixel and placing that pixel in the correct space. You can see in this example that the rows don't match up correctly so the digit is shifted.

Even with the correct sampling, the network's results aren't great.

A key issue with auto-regressive sampling is that it's easy to leave the training distribution. Since we're generating each image a pixel at a time, if any of those pixels gets generated wrong, then the subsequent pixels will likely also be wrong. So the model needs to generate many pixels without any errors -- a chance that quickly decays to 0 as the sequence length increases.

We can see in the example images that the generated digits start to lose structure. The images all sort of resemble MNIST digits, but they don't have a coherent shape.

The frequencies of the generated digits are also quite wrong. This issue points to another flaw in auto-regressive sampling. P(x) is split into a sequence of probabilities, and important decisions have to be made early on. As an example, the choice for what digit to generate might occur in the 28th pixel, because that pixel is blank for a 2 or 3. It's not great for stability to have big sways in probabilities during certain pixel locations.

Diffusion Models

Our final generative modelling method also samples auto-regressively, but does so over time instead of over pixels.

Source: https://scholar.harvard.edu/binxuw/classes/machine-learning-scratch/materials/foundation-diffusion-generative-models

We'll start by defining a noising process. This process will take an image from the dataset, and progressively add Gaussian noise to it. By the end of the process (say, 300 timesteps), the image is purely Gaussian noise.

The task for the diffusion model is now to reverse that process. Starting from a pure Gaussian image, it should iteratively remove the noise until the original image is recreated. The nice property about recreating an image sequentially is that decisions can be spread out. The first few denoising steps might define the outline of an image, while later steps add in the details.

Gaussian noise gives us a few nice properties.

Applying two steps of Gaussian noise to an image is equivalent to applying one step with greater variance. This property lets us easily simulate an image with $T$ steps of noising, in a single operation.

If the forward process $p(x_{t+1}| x_{t})$ is Gaussian, then the reverse process $q(x_{t}| x_{t+1})$ is also Gaussian given the variance is low. (https://lilianweng.github.io/posts/2021-07-11-diffusion-models/)

Because of the second property, we can approximate the reverse process as a Gaussian. The variances are fixed, but we need to learn the mean of that distribution -- that's what our network does.

In the original DDPM paper, the network predicts the noise that was added, not the original image. I found that this was hard to achieve. Because our model is quite small, it's hard to compress 28*28*3 samples of independent noise into a feedforward bottleneck. In contrast, it's simpler to compress the original image since there are only a few MNIST digits. The noise can be recovered via noise = noisy_image - predicted_original_image.

Note that the denoising process is a Gaussian at every timestep. So while we're predicting the mean with our learned network, we also add back noise. The variance of this noise follows the pre-determined schedule so it decreases over time.

What do the metrics show?

Works nicely. The frequencies aren't as well-matched as the VAE, but it's at least covering all the modes. That's better than the GAN or PixelRNN.

The error rates are interesting. The highest error isn't by shape, but by color. Examining the generated images, it looks like the model has trouble deciding if it wants to make the digits red or blue. Really, those pixels should all be a shade of purple, with all pixels being the same shade.

As always, code for everything here is available at https://github.com/kvfrans/generative-model.

Deriving the KL divergence loss in variational autoencoders

Kevin Frans — Tue, 01 Aug 2023 15:01:54 GMT

Let's derive some things related to variational auto-encoders (VAEs).

Evidence Lower Bound (ELBO)

First, we'll state some assumptions. We have a dataset of images, $x$ . We'll assume that each image is generated from some unseen latent code $z$ , and there's an underlying distribution of latents $p(z$ ). We'd like to discover parameters $\theta$ to maximize the likelihood of $p_\theta(x|z)$ under the data. In practical terms -- train a neural network that will generate images from the dataset.

Let's look at the posterior term closer.

$p_\theta(x|z) = \dfrac{p(z|x)p(x)}{p(z)}$

We can calculate $p_\theta(x|z)$ for any $z$ .
We can assume $p(z)$ . In the VAE case, it's the standard Gaussian.
We don't know $p(x)$ . It is intractable to compute $p(x) = \int p_\theta(x|z)p(z)dz$ , because the space of all $z$ is large.
We also don't know $p(z|x)$ .

To help us, we'll introduce an approximate model: $q_\phi(z|x)$ . The idea is to closely match the true distribution $p(z|x)$ , but in a way that we can sample from it. We can quantify the distance between the two distributions via KL divergence.

$KL(q_\phi(z|x), p(z|x)) = E_q[log \; q_\phi(z|x) - log \; p(z|x)]$

In fact, breaking down this term will give us a hint on how to measure $p(x)$ .

$= E_q[log \; q_\phi(z|x) - log \; p(z|x)]$
$= E_q[log \; q_\phi(z|x) - log \; p(z,x)/p(x)]$
$= E_q[log \; q_\phi(z|x) - log \; p(z,x) + log \; p(x)]$
$= E_q[log \; q_\phi(z|x) - log \; p(z,x)] + log \; p(x)$
$= E_q[log \; q_\phi(z|x) - log \; p_\theta(x|z)p(z)] + log \; p(x)$

We can rearrange terms as:

$log \; p(x) = E_q[log \; p_\theta(x|z)p(z) - log \; q_\phi(z|x)] + KL(q_\phi(z|x), p(z|x))$

So, $log \; p(x)$ breaks down into two terms.

The KL term here is intractable because it involves $p(z|x)$ , so we can't sample it. At least, we know that KL is always greater than zero.
The expectation is tractable! It involves three functions that we can evaluate. Note that $E_q$ means $x$ is sampled from the data, then $z$ computed via $q_\phi(z|x)$ . This expectation is a lower bound on $log \; p(x)$ . We call it the evidence lower bound, or ELBO.

The ELBO can be further broken down into two components.

$E_q[log \; p_\theta(x|z)p(z) - log \; q_\phi(z|x)]$
$= E_q[log \; p_\theta(x|z)+ log \; p(z) - log \; q_\phi(z|x)]$
$= E_q[log \; p_\theta(x|z)] + E_q[log \; p(z) - log \; q_\phi(z|x)]$
$= E_q[log \; p_\theta(x|z)] + KL(p(z), q_\phi(z|x))$

The ELBO is equal to:

A reconstruction objective. For all $x$ in the dataset, encoding it via $q_\phi$ then decoding via $p_\theta$ should give high probability for the original $x$ .
A prior-matching objective. For all $x$ in the dataset, the distribution of $q_\phi(z|x)$ should be similar to the prior $p(z)$ .

Here's a practical way to look at the ELBO objective. We can't maximize $p(x)$ directly because we don't have access to $p(z|x)$ . But if we approximate $p(z|x)$ with $q_\phi(z|x)$ , we can get a lower bound on $p(x)$ and maximize that instead. The lower bound depends on 1) How well $x' = p(q(x))$ recreates the data, and 2) How well $q(x)$ matches the true prior $p(z)$ .

Analytical KL divergence for Gaussians

In the classic VAE setup, p(x) is a standard Gaussian. The prior-matching objective above can be computed analytically without the need to sample q(z|x).

Since the VAE setup defines an independent Gaussian for each data point in the batch, we only need to consider the univariate Gaussian case (instead of a multivariate). The encoder network will output a mean $\mu$ and standard deviation $\sigma$ . We'll refer to this as $P = N(\mu, \sigma^2)$ . The standard Gaussian is $Q = N(0,1)$ .

The KL divergence measures how different two probability distributions are:

$KL(P,Q) = E_{P}[log \;P(x) - log \;Q(x)]$

We'll also need the probability density for a Gaussian:

$p(x|\mu,\sigma) = \dfrac{1}{\sigma \sqrt{2 \pi}} e^{-\dfrac{1}{2}(\dfrac{x-\mu}{\sigma})^2}$

For Q, $\mu=0$ and $\sigma=1$ so the term simplifies.

$q(x) = \dfrac{1}{\sqrt{2 \pi}} e^{-\dfrac{1}{2}x^2}$

Given the above, let's plug in to the KL divergence.

$KL(P,Q) = E_P[log(\dfrac{1}{\sigma \sqrt{2 \pi}}) -\dfrac{1}{2}(\dfrac{x-\mu}{\sigma})^2 -log(\dfrac{1}{\sqrt{2 \pi}}) +\dfrac{1}{2}(x)^2]$

Let's break down this expectation and deal with each term one-by-one.

$KL(P,Q) = E_P[log(\dfrac{1}{\sigma \sqrt{2 \pi}}) -log(\dfrac{1}{\sqrt{2 \pi}})] + E_P[- \dfrac{1}{2}(\dfrac{x-\mu}{\sigma})^2] + E_P[\dfrac{1}{2}(x)^2]$

The first expectation involves two logs that can be combined into one. There is no $x$ term, so the expectation can be dropped.

$= E_P[-log(\dfrac{\sigma \sqrt{2 \pi}}{\sqrt{2 \pi}})]$
$= E_P[-log(\sigma)]$
$= -log(\sigma) = -(1/2)log(\sigma^2)$

The second expectation can be simplified by knowing that $E_P[(x-\mu)^2]$ is the equation for variance ( $\sigma^2$ ).

$E_P[- \dfrac{1}{2}(\dfrac{(x-\mu)^2}{\sigma^2})]$
$= -(1/2) E_P[(x-\mu)^2)] (1/\sigma^2)$
$= -(1/2) \sigma^2 (1/\sigma^2)$
$= -(1/2)$

The third expectation can also be explained in terms of variance. An equivalent equation for variance is $\sigma^2 = E[X^2] - E[X]^2$ . In our case, $E[X] = \mu$ , and $E[X^2]$ is what we want to find. So,

$\sigma^2 = E[X^2] - \mu^2$
$E[X^2] = \sigma^2 + \mu^2$
$(1/2)E[X^2] = (1/2)(\sigma^2 + \mu^2)$

Nice, all of our expectations are now expressed as functions of $\mu$ and $\sigma$ . Put together, we get the final equation for the KL divergence loss:

$KL\_Loss(\mu,\sigma) = (1/2) [-log(\sigma^2) - 1 +\sigma^2 + \mu^2]$

Note that our encoder would give us a batch-sized vector of $\mu$ and $\sigma$ . Because we assume each $(\mu, \sigma)$ pair parametrizes an independent Gaussian, the total loss is the sum of the above equation applied elementwise.

Sanity check: the KL Loss should be minimized when $\mu=0, \sigma=1$ .

$(d/d \mu) \; [-log(\sigma^2) - 1 +\sigma^2 + \mu^2] = 2 \mu \rightarrow$ min. at $\mu=0$ .
$(d/d \sigma) \; [-log(\sigma^2) - 1 +\sigma^2 + \mu^2] = 2 \sigma - 2/\sigma \rightarrow$ min. at $\sigma=1$ .

All good!

An aside on batch-wise KL vs. element-wise KL

A common misconception when examining this KL loss is how it relates to each batch of data. One could imagine that the KL loss is meant to shape a given batch of latent vectors so the approximate distribution within the batch is standard Gaussian. This interpretation would mean that on average, the latents should have mean 0 and variation 1 – but each sample can vary. It is possible to nearly-perfectly match the standard Gaussian while conveying information about the input image.

The above interpretation is incorrect. Instead, the KL loss is applied element-wise. Each image is encoded into a mean+variation pair, and the loss encourages this explicit mean and variation to resemble 0 and 1. Thus to nearly-match the standard Gaussian, each image would encode to a standard Gaussian (and thus no unique information is conveyed). In this way, the KL loss and the recreation objective are conflicting. A balance is located depending on the scaling of the two objectives.

For AGI, we need better tasks. For better tasks, we need open-endedness. (ALOE 2022 Notes)

Kevin Frans — Sun, 15 May 2022 19:00:11 GMT

Open-endedness is the idea of a system that "endlessly creates increasingly interesting designs". That's a simple concept with big implications. Biological life on earth is diverse and beautiful and the result of open-ended evolution. Human culture is open-ended; we've created a lot of art and technology without any real plan.

For those of us interested in AI and the emergence of intelligence, open-endedness is especially important. Because even with powerful learning algorithms, intelligence is one of those tricky things that's hard and complicated to specify. But that's the beauty of open-ended systems: they can generate unexpected designs, designs that go beyond the intention of the designer. And so even if intelligence is hard to create, it might be possible to create a system where intelligence naturally emerges.

This weekend, a bunch of us thinking about how open-endedness can boost AI gathered at the Agent Learning in Open-Endedness (ALOE) workshop at ICLR. I had the pleasure of attending, and since this topic is really fascinating I took some notes, so here they are.

As context, I (Kevin Frans) am a master's student at MIT under Phillip Isola, and I've spent time at places like OpenAI and Cross Labs. The hypothesis I've been studying is: a strong learning algorithm, coupled with a rich enough set of tasks, will result in intelligence. We have strong learning algorithms, but our tasks are quite bad, so what are consistent principles for making task distributions that are rich, learnable, yet increasingly complex?

More tasks = smarter agents.

A common theme in AI is that we take the same basic learning algorithms, and scale them up with harder tasks, and they can discover more complicated behaviors. There's arguments of whether this fact is good or bad for research, but the fact is that it works, and accepting this can lead to cool practical results.

XLand, the new task set from Deepmind, works by accepting this lesson and pushing it to the limit. XLand aims to create a whole bunch of related tasks by 1) defining a distribution of games, and 2) making them all multi-agent.

The key lesson here is that distribution of XLand tasks is just so varied and dense. XLand tasks take place in a 3D physics world, and agents need to do things like move blocks around, catch other agents, or play hide-and-seek. These games are defined in a modular way, so some agents get reward based on Distance(Agent, Ball) while others are rewarded for -Distance(AgentA, AgentB). Each task also takes place on a procedurally-generated terrain. Tasks are inherently multi-agent, so the same task can have a different reward landscape based on which agents are involved.

Now that we've got a rich distribution of games, just train agents with RL and they will be smart. There's some fancy population-based training involved, but it's a side point, the main idea is that agents trained on this rich set of games have no choice but to become generally capable, and they do. Strong XLand agents perform well on a bunch of the tasks, even unseen ones, and seem to generally do the right thing.

To me, the nice part about XLand is that it confirms we can achieve strong generalization just by engineering the task space to be richer, denser, and larger. Now AGI is a game design problem, and game design is fun. The downside is that this all seems awfully expensive to compute.

Procedural content generation + AI.

My favorite video games can be replayed hundreds of times, and these games usually make use of procedural content generation (PCG). Instead of manually building out levels, we can define an algorithm that builds the levels for us, and use it to generate an infinite amount of content.

Sam Earle's work is on how PCG and AI intersect. First, can AI methods be used as a PCG algorithm? You could imagine PCG being a data modelling problem – given a set of 1000 Mario levels, train a GAN or autoencoder to model the distribution. But this is open-endedness, we're not satisfied with mimicry. Sam's early work on PCGRL frames content generation as a reinforcement learning task. If we can define a measurement of how good a level is, we can try and learn a generator that best fits this measurement. A follow-up paper takes this further, and lets the user specify a bunch of parameters like level-length or difficulty, which the AI generator does its best to fulfill.

To me, the simplest open-ended loop is a zero-sum competitive game. Two players compete against each other, and adapt to each other's weakness in a continuous learning loop. So if we want to train a really good Mario-playing agent, it's crucial that we think just as hard about a Mario level-generating agent. That's where AI PCG methods come in – we're only going to get so far with human-made levels, so we're going to need to build generator agents that never stop learning.

The Tao of Agents and Environments

"Instead of a real presentation, here are some slides that I threw together in an hour", started Julius Togelius. "Actually I spent two days on this, but it's funnier to pretend I didn't." He then proceeded to show pictures of Agent Smith from The Matrix, and an environment diagram of the H2O cycle.

The purpose of all this was to point out that our classic framework of "agents interacting with an environment" is limiting. What if our environment has other agents in it? Why does the game-player have to be the agent? Can the environment be the player? If we start to question these terms, maybe there's new ground we can think about when it comes to "agent-based" systems.

From an open-endedness perspective, this idea makes sense, because we want to think of our systems as more than just optimization problems. A traditional reinforcement learning problem is to train a robotic agent to solve a control environment, it's clean and simple and solvable. But in an open-ended system, maybe we want the robot to compete against other robots. Maybe the real goal is not to develop strong robots, but to see the emergence of cooperation or communication. Maybe we care about these robots because we want to discover new challenges for new robots. In an open-endedness view, everything is adapting so everything is an agent, and everything affects everything else so everything is also an environment.

To see emergent complexity, use antagonistic pairs.

A fundamental open-ended loop is:

Solve a task
Generate a new task
Repeat

Solving a task is well-studied, we have whole fields on how to do this. How to generate new tasks, however, is ripe for exploration. We need to be careful. If the tasks are too similar, then we're not learning anything new, but if the tasks are too distinct, an agent might not be able to solve them. Ideally, we want to define some implicit curriculum of tasks that an agent can gradually work its way through.

Past works on open-ended task generation include ideas like POET, where tasks are generated by 1) making 100 variations of a solved task, and 2) trying to solve as many of the variations as we can. This framework does work; but it's limited by the hard-coded method of making task variations. We can do something smarter.

PAIRED, as presented by Natasha Jaques, presents a more adaptive way of generating tasks. If we think about the kinds of tasks we want, its tasks that the agent is just barely capable of solving. PAIRED proposes a heuristic using two agents, a protagonist and an antagonist, and a task generator. The generator tries to propose tasks that the protagonist can't solve, but the antagonist can. This way, we're always generating tasks in the sweet spot where they're solvable, but the protagonist hasn't quite figured it out yet. "Hey protagonist", says the generator, "here's a task that your brother the antagonist can solve, so I know you can solve it too, so start trying".

I think these little loops are quite powerful, and we just need to find better places to apply them. As a researcher, if I want to train a robust AI agent, I should try and parametrize a super-expressive distribution of tasks, then let loose an open-ended process to gradually expand what it can solve.

Panel Discussion

Along with the presentations, here were some fun thoughts from the panel discussions:

Ken: What's the difference between AI and ALIFE? Well in AI we're somewhat focusing on economic value, whereas in ALIFE we really just care about what's cool.
Ken: ALIFE focuses too much on evolution, which we don't really need. But in AI and machine learning, we need to work on weirder and braver ideas.
Danijar: We need more open-ended environments. Atari is dumb, robotic control is too simple.
Ken: Environmental design for OE is a deep field. It needs to be understood. What gives our universe properties that lets it sustain OE for so long?
Julian: People who designed games like Skyrim and Elden Ring have figured out a bunch of heuristics that make good OE experiences.
Danijar: Minecraft kind of has crafting, but it's not OE crafting, it's hardcoded. We want to have emergent crafting, but also be computationally feasable.
Joel: The best part of ALIFE is the belief that it's totally possible to make life or life-like evolution, and it should be easy, and it's just right there.
Moderator: Say you travel 10 years into the future, and there's an open-ended simulation that works. What does it look like?
(everyone looks up and thinks)
Joel: We tackle the meta learning objective. We optimize for human preferences, we tell a system to keep making cool stuff that is cool to us as humans.
Natasha: Inter-group competition. We get something interesting where we have competition, and groups competing against each other, so we have cooperative stuff like language, communities, etc.
Julian: The system is super diverse. There are many things going on. I don't know what the principles are, but I won’t understand what’s going on. The metrics look great, but it's like looking at a coral reef, I don't know what's going on but there’s pretty colors. Like a coral reef, there will be many extremely different kinds of things.
Danijar: We get better and better at simulating some version of the real world, at a level of abstraction that is more and more accurate, maybe some MMO game to collect resources, or a mathematics environment where we discover more theorems.
Ken: There’s so many possibilities its hard to say, at a low level there’s creatures in a world that are reminiscent of nature, there’s theorem proving, agents that are with humans in the web, OE interacting with humans.

For more, check out the ALOE proceedings and workshop.

A Mathematical Definition of Interestingness

Kevin Frans — Mon, 06 Dec 2021 22:28:59 GMT

In open-endedness research, we care about building systems that continously generate interesting designs. So what makes something interesting? Previously, I argued that interestestness depends on a viewer's perspective. But this answer isn't very satisfying.

Ideally, we would want some mathematically-definable way to measure how interesting something is. A naive approach would to be to define interestingness as variation or entropy, but we run into the following problem:

In this example, you get the highest entropy by mixing milk and coffee, since at the molecular level the particles are the most widely distributed. But somehow the middle cup is the most interesting to us as humans, since the half-mixed milk forms all sorts of cool swirls and twists.

An analogy I like to draw here is the 10,000 oatmeal problem, which states:

I can easily generate 10,000 bowls of plain oatmeal, with each oat being in a different position and different orientation, and mathematically speaking they will all be completely unique. But the user will likely just see a lot of oatmeal.

Entropy is not all that great! Sure, we get a lot of unique designs, but at some level they all look like the same design. Something is missing in our interestingness formulation.

The best definition of interestingness I've come across so far is the idea of sophistication, which is mentioned in this blog post by Scott Aaronson:

Sophistication(x) is the length of the shortest computer program that describes, not necessarily x itself, but a set S of which x is a “random” or “generic” member.

Sophistication solves our milk-coffee problem and our oatmeal problem. In both cases, the high-entropy examples can be described by very short programs – just sample random positions for each particle. Something like the swirls in the half-mixed cup has higher sophistication, since you'd need a more complicated program to generate those artifacts.

How can we apply sophistication in the context of open-endedness and machine learning? Let's say we're trying to generate a lot of interesting image. One formulation is to do something like:

An image gets less reward the more times it is generated.
Continously search for high-reward images.

Obviously, this formulation kind of sucks, since the space of images is really big, and we will basically end up with a lot of images of noise. Ideally, we want to eliminate all images that look like 'static noise' in one swoop, and move on to more interesting kinds of variation.

Here's a sophistication-like formulation that makes use of smarter learning:

A generative adversarial network is trained to model all past images that have been generated.
An image gets less reward if the GAN's discriminator thinks it looks in-distribution.
Continuously search for high-reward images and re-train the GAN.

In theory, this GAN-based formulation should explore the image space in a more efficient manner. Once a few static-noise images have been generated, the GAN can model all types of static-noise, so high-reward images need to display other kinds of variation.

The GAN-based formulation kind of resembles sophistication, where we assume that Soph(x) is just -Disc(x), i.e. an image is sophisticated if it the GAN struggles to model it correctly. As long as we're continously re-training our GAN, this approximation shouldn't be too bad.

(To be continued...)

To extract information from language models, optimize for causal response

Kevin Frans — Fri, 12 Nov 2021 20:56:33 GMT

One way to view AI progress is that we can measure increasingly abstract concepts.
A decade ago, it was really hard to take a photo and measure “how much does this image look like a dog?”. Nowadays, just pick your favorite image classification model and run the image through. Similarly, it used to be unfeasable to measure “how realistic is this photo?”, but with GANs, we can do it easily.

With the rise of powerful language models, e.g. GPT-3, we’ve unlocked the ability to work with even higher-level ideas. GPT contains a powerfully rich representation of our world, and it’s been shown to solve logical puzzles, summarize large texts, and utilize factual knowldge about the human world.

Can we somehow extract this knowledge out of GPT? Let’s say an alien wants to learn about the human world, so it goes up to its local GPT engine and asks, “What is coffee?”

“Oh, got it”, the alien says, “so coffee is a kind of orange juice.” This definition is shallow! Coffee is definitely something you drink in the morning, but so are a lot of other things. This answer is vague, and vagueness is a common trend when asking GPT to describe various concepts:

How can we improve these results? Let’s take a step back. We know that GPT has good knowledge of what coffee is, but that doesn’t mean it’s good at telling us what it knows. We want a stronger way of interacting with GPT.

A nice field to borrow from is feature visualization, which focuses on understanding what types of inputs activate various areas of neural networks. Crucially, feature visualization techniques tend to examine causal relationships – results are obtained by optimizing through the neural network many times, and discovering an input which maximizes a certain neuron activation.

So, let’s do the same with language models. Instead of asking GPT what it thinks about certain topics, let’s just search for phrases which directly affect what GPT is holding in memory. We can do this through what I’ll call causal-response optimization, which tries to maximize the probability of GPT replying “Yes” to a specific question.

Does this metric actually mean anything? It's hard to confirm this quantitatively, but we can at least look at how the causal-response score of "Am I thinking of coffee?" responds to some human-written options.

OK, it's not bad! The more details the phrase has, the higher the likeihood that GPT responds with "Yes". That sounds reasonable. Now, we just need to replace the human-written options with automatically generated options.

There’s actually a whole body of work, known as “prompting”, that deals with how to locate phrases that adjust language model behavior. But we’re going to take a simple approach: 1) generate 20 varied answers for “What is coffee?”, and 2) select the answer with the greatest causal-response score for “Am I thinking of coffee?”.

How well does this setup work? Here’s a comparison example on the word-definition task we looked at earlier.

Nice, we're on the right track. Generating text through causal-response optimization results in substantially different results than pure forward-prediction. In general, the causal response definitions are more detailed and specific, which makes sense in this case, since the objective is to maximize GPT's recognition of the word.

What else can we do with causal-response optimization? Well, it turns out that GPT has a slight tendency to lie when doing forward-prediction. Causal-optimization can help deal with this issue. Take a look at the following list-completions:

Note: Suwa is a city, not a prefecture. Note 2: These are cherry-picked examples, not rigorously tested

The forward-prediction answers seem right at first glance, but they're a little wrong ... you can't read a painting. If we use causal-response optimization, we get "art in some form", which is a little more satisfying.

Another fun thing we can do is use causal-response optimization to steer long-form text generation. GPT does a nice job of evaluating various aspects of text through yes-no answers:

Since we can now measure concepts like happiness or old-man-ness, we can also optimize for them.

Let's look at a "story about apples". If we wanted to hear it from the perspective of an old man, one option is to adjust our forward-prediction prompt like so:

Or, we could apply causal-response optimization to each sentence generated. For each line, ten sentence options are generated, and the one that has the best score for "Is this something an old man would say?" is chosen.

Almost every sentence in the causal-optimization story reinforces the idea of an old man speaking: there are references to growing up in the country, apples straight off the tree, and the absence of supermarkets.

So, as researchers, why should we care about causal-response optimization?

I'm not trying to argue that this method constantly provides better quality results. The examples in this post are mostly anecdotal, and are somewhat cherry-picked to showcase key differences. While I'm sure causal-response can be used to improve text-generation if applied correctly, that claim would require more rigor than I can provide at the moment.

What I do want to present, are some hints that causal-response is a useful tool for extracting knowledge out of language models. Causal optimization results in substantially different content than forward-prediction – we're explicitly probing into the language model, instead of relying on the model to say what it is thinking. We should explore this direction more! Strong pre-trained models are going to be around for a while, so understanding how they view the world is an important task to be looking at.

The code for these results is available at this gist, although you need an OpenAI GPT-3 key to run it. I think the code is very readable, so it's probably easy to re-implement this stuff with e.g. GPT-Neo.

Further work can involve:

How can we quantatively measure how good generated sentences are?
Are there good metrics for knowledge-extraction?
What questions do we have about large language-model representations, and can causal-optimization be used to answer those questions?
Are there better ways of optimizing text for causal-response than bootstrapping off forward-prediction? I.e. evolutionary search, gradient optimization, etc. (Somewhat addressed in Appendix)

Appendix

Cite as:

@article{frans2021causalresponse,
  title   = "To extract information from language models, optimize for causal response",
  author  = "Frans, Kevin",
  journal = "kvfrans.com",
  year    = "2021",
  url     = "https://kvfrans.com/causal-language-model/"
}

This work isn't rigorous enough for me to publish it in a formal setting, but I would love to see further research the directions discussed here. Thanks to Phillip Isola and others in the Isolab for the help :)

Source Code

lang.ipynb

GitHub Gist: instantly share code, notes, and snippets.

Gist262588213843476

Things To Read

"Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing", by Liu et al. A growing field in the last years deals with prompting for pre-trained language models. In prompting, instead of fine-tuning the network parameters on a task, the goal is to learn a prompt phrase that is appended to input text in order to influence the behavior of the language model. This paper is a good review of the area.

"AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts", by Shin et al. This paper is one of the works in prompting, I picked it because it cleanly states the prompting problem and presents a framework. In AutoPrompt, the aim is to learn a set of trigger words which influences a language model into predicting the correct output for a set of tasks.

"Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search", by Galatolo et al. CLIP-GlaSS is a way to move from text-to-image and image-to-text via CLIP encodings. There are similarities to prompting work here, especially in the image-to-text task, where the aim is to locate text which best matches a given image.

In CLIP-GlaSS they use an evolutionary search to locate text; I tried that initially in the causal-response optimization, but the method struggled with finding gramatically-correct answers. I played with various combinations of beam search, where language model logits are weighted by the causal-response objective, but the results had a lot of variation. In the end, generating options via forward-prediction and selecting via causal-response led to be most satisfying results.

"Language Models as Knowledge Bases?", by Petroni et al. This work is an analysis on BERT and how it performs well on various question-answering datasets without additional fine-tuning. To be honest, I chose this paper based on its name, which implies a perspective I agree with – language models are rich knowledge bases, and what we need are tools to extract this knowledge from them.

Data digesters, ML^2, Interestingness

Kevin Frans — Tue, 02 Nov 2021 17:16:40 GMT

This post is a bunch of random thoughts about things I've read recently. It's organized into sections, and the sections don't really relate to each other, but here they are.

Neural Networks are Data Digesters

A common philosophy around my head is that "representation learning is everything". Need to label some images? Train a linear classifier over a powerful representation space. Need to navigate a robot to pick up some blocks? Train a simple policy over a strong representation of the robot. If you can map your input data into a general enough representation, everything becomes easy.

Eric Jang's blog post "Just Ask For Generalization" takes this philosophy a step further. If what we really need is strong representations, and neural networks are really strong sponges for absorbing data, then we might as well cram as much relevant data as we can into our models. Eric argues:

1. Large amounts of diverse data are more important to generalization than clever model biases.
2. If you believe (1), then how much your model generalizes is directly proportional to how fast you can push diverse data into a sufficiently high-capacity model.

To me, it feels like we're reaching diminishing returns in designing new models – e.g, transformers are highly effective at modeling everything from text to video to decision-making. Progress lies, instead, in how we can feed these models the data they need to properly generalize in whatever dimensions we care about.

One hypothesis is that we should give our models as much data as we can, regardless of how relevant the data is to the task at hand. Take, for example, language models – the strongest models are pretrained on billions of arbitrary texts from the internet. In reinforcement learning, we're starting to see the same parallels. Traditionally, agents are trained to maximize rewards on the task at hand, but it turns out we get sizable improvements by training agents to reach past states, achieve a range of rewards, or solve many mutations of the task.

If we believe this hypothesis, we should really be researching how to best extract relevant data for the task at hand. Should a strong RL agent be trained to achieve many goals instead of one, or to achieve a spectrum of rewards instead of optimal reward? Should its model be able to predict future states? To contrast augmented data from fake data? Should it be trained over thousands of games, instead of just one?

Machine Learning via Machine Learning

MIT offered a meta-learning class last year, which I really wanted to take, except I was away on gap year so I couldn't. The highlight of the class is to build "a system that passes 6.036 Introduction to Machine Learning as a virtual student". That's so cool! Thankfully the group behind the class released a paper about it, so here it is.

We generate a new training set of questions and answers consisting of course exercises, homework, and quiz questions from MIT’s 6.036 Introduction to Machine Learning course and train a machine learning model to answer these questions. Our system uses Transformer models within an encoder-decoder architecture with graph and tree representations ... Thus, our system automatically generates new questions across topics, answers both open-response questions and multiple-choice questions, classifies problems, and generates problem hints, pushing the envelope of AI for STEM education.

Summary-wise, they've got a Transformer model which reads in machine learning questions, and outputs a graph of mathematical operators which solve the questions. It seems to work quite well, and passes the class with an A (!).

A lot of the benefit seems to come from the data augmentation:

After collecting questions from 6.036 homework assignments, exercises, and quizzes, each question is augmented by paraphrasing the question text. This results in five variants per question. Each variant, in turn, is then augmented by automatically plugging in 100 different sets of values for variables. Overall, we curate a dataset consisting of 14,000 augmented questions.

One thing that's a bit dissapointing is that coding questions are skipped. In my opinion well-formulated coding questions are the best part of classes, since they force you to replicate the content and really understand it. Maybe a tandem team with OpenAI Codex can pass the course in full.

What makes something interesting?

In open-endedness research, we want to build systems that can keep producing unbounded amounts of interesting things. There are many problems with this definition. What qualifies as unbounded? What makes something interesting? What makes something a thing?

One answer is that all these questions depend on the perspective that we look at the world in. Alyssa M Adams' talk at Cross Roads this month gives an example:

The sea was never blue. Homer described the sea as a deep wine color. The greek color experience was made of movement and shimmer, while modern humans care about hue, saturation and brightness.

It's a representation learning problem! Interestingness isn't a fact by itself, an object is only interesting to the agent that is looking at it. Normally, this answer isn't that satisfying, since modeling what is interesting to humans is expensive at best and inaccurate at worst.

With the rise of foundation models, however, we can make a non-trivial argument that AI models like GPT-3 and CLIP are learning strong human-centric representations of reality. Could we just, like, ask these models what they think is interesting?

Maybe GPT-3 thinks "walking" and "running" are too similar, but "flying" is qualitatively different. Maybe CLIP thinks a truck is a truck regardless of what roads it's on, but if the truck is in the ocean, now that is interesting.

Interestingness in Architecture

"Nearly everything being built is boring, joyless, and/or ugly", writes Nathan Robinson. In an April article, Robinson showcases colorful classical temples, mosques, and coastal villages. Then he shows photographs of modern brutalist design – cold and gray.

What is beauty? Beauty is that which gives aesthetic pleasure ... a thing can be beautiful to some people and not to others. Pritzker Prize-winning buildings are beautiful to architects ... They are just not beautiful to me.

Hey, look, it's the problem of interestingness again. What makes a building lively, welcoming, or as Christopher Alexander puts it, contains "the quality without a name"? There's probably no objective answer, but there is a human-centered answer. Could an AI understand architectural aesthetics? If it has the right representations, I don't see why not.

CLIPDraw: Exploring Text-to-Drawing Synthesis

Kevin Frans — Mon, 28 Jun 2021 20:07:30 GMT

CLIPDraw synthesizes novel drawings from text. Play with the codebase yourself.

AI-assissted art has always been intriguing to me. To me, visual art is a very human thing -- I can’t imagine a way that a computer rediscovers our own cultural concepts without some kind of experience living in our world. So when AI methods are able to learn from us, and produce artwork that we as humans might make ourselves, something cool is definitely happening.

This post presents CLIPDraw, a text-to-drawing synthesis method that I’ve been playing with nonstop for the past few weeks. At its core, CLIPDraw is quite simple, yet it is able to produce drawings that display a whole range of interesting behaviors and connections. So for this article, I want to focus on showing off behaviors of CLIPDraw that I personally found intriguing and wanted to learn more about. For more detailed analysis and technical detail, check out the CLIPDraw paper, or play around with the Colab notebook yourself!

Image Synthesis -> Drawing Synthesis

The field of text-to-image synthesis has a broad history, and recent methods have shown stunningly realistic image generation through GAN-like methods. Realism, however, is a double-edged sword – there's a lot of overhead in generating photorealistic renderings, when often all we want are simple drawings. With CLIPDraw, I took inspiration from the web game Skribbl.io, where players only have a few seconds to draw out a word for other players to guess. What if an AI could play Skribbl.io? What would it draw? Are simple shapes enough to represent increasingly complex concepts?

How does CLIPDraw work?

CLIPDraw's knowledge is powered through a pre-trained CLIP model, a recent release which I definitely reccommend reading about. In short, a CLIP model consists of an image encoder and a text encoder, which both map onto the same represenational space. This setup allows us to measure the similarities betweeen images and text. And if we can measure similarities, we can also try to discover images that maximize that similarity, therefore matching a given textual prompt.

The basic CLIPDraw loop follows this principle of synthesis-through-optimization. First, start with a human-given description prompt and a random set of Bezier curves. Then, gradually adjust those curves through gradient descent so that the drawing best matches the given prompt. There are a few tricks that also help, as detailed in the paper, but this loop is basically all there is to it.

Now, let's examine what CLIPDraw comes up with in practice.

What visual techniques does CLIPDraw use?

A reoccuring theme is that CLIPDraw tends to interpret the description prompts in multiple, unexpected ways. A great example for this is "A painting of a starry night sky", which shows a painterly-styled sky with a moon and stars, along with an actual painting canvas and painter in the foreground, which then also features black and blue swirls resembling Van Gogh's "The Starry Night".

CLIPDraw also likes to use abstract symbols, the most prominent being when it writes out the literal word inside the image itself. Sometimes, CLIPDraw will use tangentially related concepts, like the Google Maps screenshot when asked for "自転車 (Bicycle in Japanese)”. A fun result is the prompt for "Fast Food", which shows hamburgers and a McDonald's logo, but also has a bunch of joggers racing in the background.

How does CLIPDraw react to different styles?

An experiment I really enjoyed was to synthesize images of cats, but in different artistic styles. With CLIPDraw, this was as easy as changing the descriptor adjectives of the textual prompt. Surprisingly, CLIPDraw is quite robust at handling different styles, contrary to the initial intent to synthesize scribble-like drawings.

One interesting feature is that CLIPDraw adjusts not only the textures of drawings, ala Style Transfer methods, but also the structure of the underlying content. In the cat experiments, asking for "a drawing" produces a simplified cartoonish cat, while prompts like "a 3D wireframe" produce a cat in perspective, with depth and shadows.

"The Eiffel Tower"

Does stroke count affect the drawings that CLIPDraw produces?

A key aspect of CLIPDraw is that drawings are represented as a set of Bezier curves rather than a matrix of pixels. This feature gives us a nice parameter to tweak in the number of curves a drawing is comprised of.

Emperically, drawings with low stroke counts tend to result in a more cartoonish or abstract representation of the prompt, such as the 16-stroke version of "The Eiffel Tower" being basically made of a few straight lines. As stroke count is increased, CLIPDraw begins to target more structured shapes, shown as our Eiffel Tower begins gaining 3D perspective, lights, and finally a background.

What happens if abstract words are given as a prompt?

A fun thing to push the limits is to give CLIPDraw abstract descriptions, and see what it does with them. As a human artist, even I would have to stop and think about how to convey these concepts through visuals, so it's interesting to see how the AI will approach things.

In most cases, CLIPDraw likes to use symbols to showcase concepts that are culturally related to the given phrase, like the fireworks and smiles in "Happiness" or the Japanese and English-like letters in "Translation".

My favorite here is the drawing for "Self", which features a body holding up multiple heads. The drawing can almost be seen as a metaphor for e.g. the idea that a person's self may contain multiple outward personalities, or that a self is actually a sum of many cognitive processes. This piece is defintely the most "art-like" example I came across; there's a lot of room for individual interpretation, and it almost feels like CLIP knows something that I don't.

Can drawings be fine-tuned via negative prompts?

A final experiment was to see if CLIPDraw behavior could be finely adjusted by introducing additional optimization objectives. In the normal process, drawings are optimized to best match a textual prompt. What if we also tried to optimize unsimilarity with a set of negative prompts?

In a few situations, this works! By penalizing the prompt for "Words and text", the "Hashtag" drawing features less prominent words and instead draws a set of selfie-like faces. Negative prompts can also do things like adjust the color of images, or force drawings to contain only a single subject rather than many.

Practially, the examples above were pretty hard to achieve, and most of the time negative prompts don't change much about the final drawing at all. I think there's still a lot of room for experimentation on how to improve this technique. One thing I'd love to see is a panacea prompt, like "A messy drawing", that consistently improves drawing quality if used as a negative objective regardless of context.

CLIPDraw: Parting Thoughts

The CLIPDraw algorithm isn't particularly novel; people have doing synthesis-through-optimization for a while through activation-maximization methods, and recently through CLIP-matching objectives. However, I do believe biasing towards drawings rather than photorealism gives images more freedom of expression, and optimizing Bezier curves is a nice way to do this efficiently. I also personally love this artstyle and I think the drawings are quite similar to what an artist would produce ;).

That being said, I do think the behaviors showcased here should be pretty generalizable to any CLIP-based optimization method. Already a few extensions come to mind – Can we synthesize videos? 3D models? Can an AI play Skribbl or Broken Picture Phone with itself? I'm sure you as a reader have a bunch of ideas I haven't even considered. So please, feel free to take this method and go wherever you feel is exciting. And then tell me about it!

You can experiment with the Colab notebook here, and play with CLIPDraw in the browser. Results are generally pretty quick (within a minute) unless you crank up stroke count and iterations. You can also check out the full paper, which contains a bit of a deeper analysis and details on the technical implementation.

Thanks for reading!

"Leonardo Dicaprio cheers meme template"

Appendix

Thank you to everyone at Cross Labs for assisstance and feedback.

Things To Read

"CLIPDraw: Exploring Text-to-Drawing Synthesis via Language-Image Encoders", by Frans et al. Of course a nice place to start is the paper for CLIPDraw! The paper mostly contains a similar focus on analyzing interesting behaviors, but also includes some comparisons and a Related Work section of good papers to read up on.

"CLIP: Connecting Text and Images", by Radford et al. This is the blog for the original CLIP release, from OpenAI. CLIP is a language-image dual encoder that maps text and images onto the same feature space. Trained on a huge amount of online data, CLIP ends up being super robust and can solve a bunch of image tasks without any additional training. (Paper)

"Differentiable Vector Graphics Rasterization for Editing and Learning", by Li et al. This work is the differentiable vector graphics methods that we use in CLIPDraw to backprop through Bezier curves. It's really impressive and basically works out of the box, so I highly reccommend giving it a read.

"Generative Art Using Neural Visual Grammars and Dual Encoders", by Fernando et al. This paper is another CLIP-based generative art method, also using stroke-based images, but with an emphasis on AI creativity. Instead of backprop, they use evolutionary methods to evolve a custom LSTM-based grammar which defines the strokes. Also their figures are cool.

"Images Generated By AI Machines", by @samburtonking. Also check out the Colab notebooks by @advadnoun and @RiversHaveWings, respectively. This twitter account basically shows off cool generative art made from CLIP + GAN optimization methods. A lot of interesting pieces have been made and it's well worth looking through.

"VectorAscent: Generate vector graphics from a textual description" by Ajay Jain. Here's a project from earlier this year that actually used the same techniques of combining CLIP and diffvg. The key difference is in the image augmentation, which seems to help in ensuring synthesized drawings look more consistent.

StampCA: Growing Emoji with Conditional Neural Cellular Automata

Kevin Frans — Sat, 03 Apr 2021 08:54:15 GMT

StampCA growing emoji. Play with the codebase yourself.

When a baby is born, it doesn’t just appear out of nowhere -- it starts as a single cell. This seed cell contains all the information needed to replicate and grow into a full adult. In biology, we call this process morphogenesis: the development of a seed into a structured design.

Morphogenesis builds up an embyro. https://www.nature.com/articles/s41467-018-04155-2

Of course, if there’s a cool biological phenomenon, someone has tried to replicate it in artificial-life land. One family of work examines Neural Cellular Automata (NCA), where a bunch of cells interact with their neighbors to produce cool emergent designs. By adjusting the behavior of the cells, we can influence them to grow whatever we want. NCAs have been trained to produce a whole bunch of things, such as self-repairing images, patternscapes, soft robots, and Minecraft designs.

Growing a Minecraft tree through Neural Cellular Automata. https://arxiv.org/pdf/2103.08737.pdf

A common thread in these works is that a single NCA represents a single design. Instead, I wanted to explore conditional NCAs, which represent design-producing functions rather than single designs. The analogy here is DNA translation -- our biologies don’t just define our bodies, they define a system which translates DNA into bodies. Change the DNA, and the body changes accordingly.

This post introduces StampCA, a conditional NCA which acts like a translation network. In StampCA, the starting cell receives an encoding vector, which dictates what design it must grow into. StampCA worlds can mimic the generative capabilities of traditional image networks, such as an MNIST-digit GAN and an Emoji-producing autoencoder. Since design-specific information is stored in cell state, StampCA worlds can 1) support morphogenetic growth for many designs, and 2) grow them all in the same world.

At the end, I’ll go over some thoughts on how NCAs relate to work in convnets, their potential in evo-devo and artificial life, and why I’m pretty excited about them looking ahead.

Neural Cellular Automata: How do they work?

First, let’s go over how an NCA works in general. The ideas here are taken from this excellent Distill article by Mordvintsev et. al. Go check it out for an nice in-depth explanation.

The basic idea of an NCA is to simulate a world of cells which locally adjust based on their neighbors. In an NCA, the world is a grid of cells. Let’s say 32 by 32. Each cell has a state, which is a 64-length vector. The first four components correspond to RGBA, while the others are only used internally.

Every timestep, all cells will adjust their states locally. Crucially, cells can only see the states of themselves and their direct neighbors. This information is given to a neural network, which then updates the cell's state accordingly. The NCA will iterate for 32 timesteps.

One timestep of an NCA. Every cell updates their state based on their neighbors' states, as defined by a convolutional neural network.

The key idea here is to learn a set of local rules which combine to form a global structure. Cells can only see their neighbors, so they need to rely on their local surroundings to figure out what to do. Again, the analogy is in biological development -- individual cells adjust their structure based on their neighbors, creating a functional animal when viewed as a whole.

In practice, we can define an NCA as a 3x3 convolution layer, applied recursively many times. This convolutional kernel represents a cell's receptive field – each cell updates its own state based on the states of its neighbors. This is repeated many times, until a final result is produced. Since everything is differentiable, we can plug in our favorite loss function and optimize.

By being creative with our objectives, we can train NCAs with various behaviors. The classic objective is to start with every cell set to zero except a single starting cell, and to grow a desired design in N timesteps. We can also train NCAs that maintain designs by using old states as a starting point, or NCAs which repair designs by starting with damaged states. There’s a lot of room for play here.

Different objectives for training NCAs.

The common shortcoming here is that the NCA needs to be retrained if we want to produce a different design. We’ve successfully learned an algorithm which locally generates a global design -- now can we learn an algorithm which locally generates many global designs?

StampCA: Conditional Neural Cellular Automata

One way to view biological development is as a system of two factors: DNA and DNA translation. DNA produces a code; a translation system then creates proteins accordingly. In general, the same translation system is present in every living thing. This lets evolution ignore the generic heavy work (how to assemble proteins) and focus on the variation that matters for their species (which proteins to create).

The goal of a conditional NCA is in the same direction: separate a system which grows designs from a system which encodes designs. The key distinction is that encodings are design-specific, whereas the growth system is shared for all designs.

It turns out that the NCA formulation gives us a straightforward way to do this, which we’ll refer to as StampCA. In a typical NCA, we initialize the grid with all zeros except for a seed cell. Instead of initializing this seed cell with a constant, let’s set its state to an encoding vector. Thus, the final state of the NCA becomes a function of its neural network parameters along with the given encoding.

A StampCA is an NCA where the seed is conditioned on a learned encoding.

Armed with this, we can construct an autoencoder-type objective. Given a desired design, an encoder network first compresses it into an encoding vector. This encoding vector is used as the initial state in the NCA, which then iterates until a final design has been created. We then optimize everything so the final design matches the desired design.

A correctly-trained StampCA learns general morphogenesis rules which can grow an arbitrary distribution of designs. There are two main advantages over single-design NCAs:

StampCAs can grow new designs without additional training
Many designs can all grow in the same world

Let’s look at some examples on Emoji and MNIST datasets.

StampCA Autoencoder: Growing Emojis

The original Distill paper trained NCAs to grow emojis, so let’s follow suit. We’ll take a dataset of ~3000 32x32 emojis from this font repo. Our aim here is to train a StampCA which can grow any emoji in the dataset.

After training for about 4 hours on a Colab GPU, here’s the outputs:

StampCA Emojis growing. Replicate these results yourself.

Overall, it’s doing the right thing! A single network can grow many kinds of emoji.

For a StampCA network to correctly work, two main problems need to be solved: cells need to share the encoding information with each other, and cells need to figure out where they are relative to the center.

Ripple growth pattern.

It’s interesting that the growth behavior for every emoji is roughly the same -- waves ripple out from the center seed, which then form into the right colors. Since cells can only interact with their direct neighbors, information has to travel locally, which can explain this behavior. The ripples may contain information on where each cell is relative to the original seed, and/or information about the overall encoding.

A cool thing about StampCAs is that multiple designs can grow in the same world. This works because all designs share the same network, they just have different starting seeds. Since NCAs by definition only affect their local neighbors, two seeds won’t interfere with each other, as long as they’re placed far enough apart.

Placing seeds down along a path.

Different designs can grow in the same world.

If seeds are placed too close, weird behavior.

StampCA GAN: Learning fake MNIST

A nice part about conditional NCAs is that they can approximate any kind of vector-to-image function. In the Emoji experiment we used an autoencoder-style loss to train NCAs that could reconstruct images. This time, let's take a look at a GAN objective: given a random vector, grow a typical MNIST digit.

OK, in an ideal world, this setup would be pretty straightforward. A GAN setup involves a generator network and a discriminator network. The generator attempts to produce fake designs, while the discriminator attempts to distinguish between real and fake. We train these networks in tandem, and over time the generator learns to produce realistic-looking fakes.

It turns out that this was a bit tricky to train, so I ended up using a hack. GANs are notoriously hard to stabilize, since the generator and discriminator need to be around the same strength. Early on, NCA behavior is quite unstable, so the NCA-based generator has a hard time getting anywhere.

Instead, the trick was to first train a traditional generator with a feedforward neural network. This generator learns to take in a random vector, and output a realistic MNIST digit. Then, we train an NCA to mimic the behavior of the generator. This is a more stable objective for the NCA to solve, and it eventually learns to basically match the generator's performance.

StampCA MNIST digits growing. Unlike with Emojis, we don't need to supply a dataset to encode->decode.
MNIST digits are generated from scratch with the GAN objective. Replicate these results yourself.

MNIST StampCA grows following the digits' curve, in comparison to the ripple of Emoji StampCA.

The cool thing about these GAN-trained StampCAs is that we don't need a dataset of base images anymore. Since GANs can create designs from scratch, every digit that is grown from this MNIST StampCA is a fake design.

Another interesting observation is how the MNIST StampCA grows its digits. In the Emoji StampCA, we saw a sort of ripple behavior, where the emojis would grow outwards from a center seed. In the MNIST case, growth more closely follows the path of the digit. This is especially visible when generating "0" digits, since the center is hollow.

Placing seeds down along a path.

Many MNIST digits in the same world. Note the "drifter" artifact coming from the 9 digit, reminiscent of cellular automata like Conway's Game of Life.

If seeds are placed too close, weird behavior.

Discussion

First off, you can replicate these experiments through these Colab notebooks, here for the Emojis and here for the MNIST.

The point of this post was to introduce conditional NCAs and show that they exist as a tool. I didn't focus much on optimizing for performance, so I didn't spend much time tuning hyperparameters or network structures. If you play around with the models, I'm sure you can improve on the results here.

An NCA is really just a fancy convolutional network, where the same kernel is applied over and over. If we view things this way, then it's easy to plug in an NCA to whatever kind of image-based tasks you're interested in.

The interesting part about NCAs are that it becomes easy to define objectives that last over time. In this work, we cared about NCAs that matched the appearance of an image after a certain number of timesteps. We could have also done more complicated objectives: e.g, grow into a pig emoji after 50 timesteps, then transform into a cow emoji after 100.

The other aspect distinguishing NCAs are their locality constraints. In an NCA, it's guaranteed that a cell can only interact with its direct neighbors. We took advantage of this by placing multiple seeds in the same world, since we can be confident they won't interfere with one another. Locality also means we can increase the size of an NCA world to an arbitrary size, and it won't change the behavior at any location.

Looking ahead, there are a bunch of weirder experiments we can do with linking NCAs to task-based objectives. We could view an NCA as defining a multi-cellular creature, and train it to grow and capture various food spots. We can also view NCAs as communication systems, since cells have to form channels to tranfer information from one point to another. Finally, there's a bunch of potential in viewing NCAs as evo-devo systems: imagine an NCA defining a plant which dynamically adjusts how it grows based on the surroundings it encounters.

Code for these experiments is availiable at these notebooks. Feel free to shout at me on Twitter with any thoughts.

Appendix

Cite as:

@article{frans2021stampca,
  title   = "StampCA: Growing Emoji with Conditional Neural Cellular Automata",
  author  = "Frans, Kevin",
  journal = "kvfrans.com",
  year    = "2021",
  url     = "http://kvfrans.com/stampca-conditional-neural-cellular-automata/"
}

Why a blog post instead of a paper? I like the visualizations that the blog format allows, and I don't think a full scientific comparison is needed here – the main idea is that these kinds of conditional NCA are possible, not that the method proposed is the optimal setup.

Things To Read

"Growing Neural Cellular Automata", by Mordvintsev et al. This article was the original inspiration for this work in NCAs. They show that you can construct a differentiable cellular automata out of convolutional units, and then train these NCA to grow emojis. With the correct objectives, these emojis are also stable over time and can self-repair any damages. Since it's on Distill, there's a bunch of cool interactive bits to play around with.

"Self-Organising Textures", by Niklasson et al. This Distill article is a continuation of the one above. They show that you can train NCAs towards producing styles or activating certain Inception layers. The driftiness of the NCAs leads to cool dynamic patterns that move around and are resistant to errors.

"Growing 3D Artefacts and Functional Machines with Neural Cellular Automata", by Sudhakaran et al. This paper takes the traditional NCA setup and shifts it to 3D. They train 3D NCAs to grow Minecraft structures, and are able to retain all the nice self-repair properties in this domain.

"Regenerating Soft Robots through Neural Cellular Automata", by Horibe et al. This paper looks at NCAs that define soft robots, which are tested in 2D and 3D on locomotion. The cool thing here is they look at regeneration in the context of achieving a behavior – robots are damaged, then allowed to repair to continue walking. Key differences is that everything is optimized through evolution, and a separate growth+generation network is learned.

"Endless Forms Most Beautiful", by Sean B. Carroll. This book is all about evolutionary development (evo-devo) from a biological perspective. It shows how the genome defines a system which grows itself, through many interacting systems such as repressor genes and axes of organization. One can argue that the greatest aspect of life on Earth isn't the thousands of species, but that they all share a genetic system that can generate a thousand species. There's a lot of inspiration here in how we should construct artificial systems: to create good designs, we should focus on creating systems that can produce designs.

"Neural Cellular Automata Manifold", by Ruiz et al. This paper tackles a similar problem of using NCAs to grow multiple designs. Instead of encoding the design-specific information in the cell state, they try to produce unique network parameters for every design. The trick is that you can produce these parameters through a secondary neural network to bypass retraining.

Quality Diversity: Evolving Ocean Creatures

Kevin Frans — Wed, 02 Dec 2020 02:05:00 GMT

If you've ever folded a paper airplane, chances are you folded The Dart. It's fast and simple, and kids all around the world have launched it through the air. Soon, however, they will move on to other designs – airplanes that glide, make loops, or twirl in flight. There are thousands of paper airplane designs, each with its own quirks and structure.

Paper Airplane Designs: https://www.foldnfly.com/#/1-1-1-1-1-1-1-1-2

In the artificial life world, we would call paper airplane design a quality-diversity algorithm. A paper airplane should be fly fast, but we don’t only care about an optimal design – we’re also interested in airplanes with unique bodies or wing types. Rather than a single design, quality-diversity algorithms aim to develop a population of diverse solutions to a problem.

In this post, let’s explore how quality-diversity algorithms can give us a peek into a world of swimming, soft creatures.

Oceanworld

An Oceanworld organism. Red squares represent nodes, and blue connections are bonds. Green lines represent rigid bonds that are affected by fluid resistance.

In Oceanworld, a creature is a 16x16 matrix of flexible nodes and bonds. Each node is a simulated physics object, and bonds can rotate around and deform.

A creature moves by a periodic heartbeat. Whenever a heartbeat begins, any muscle bonds will extend their length, deforming the shape of the creature. When the heartbeat ends, the muscle bonds will return to original length. In addition, a bond can also be rigid. Rigid bonds are affected by fluid resistance – when they move, a force will pushback in the opposite direction.

The great part about evolutionary methods is that this is all a black box -- we don't actually need to understand how the physics works, we just need to run the algorithm.

Various Oceanworld creatures formed by sampling random genomes. When rigid bonds move, fluid resistance applies a force.

Cool, we have some creatures. But how much of the total possible space do these creatures actually represent?

One way we can gain intuition about the possible space is through a technique called novelty search. Novelty search tries to find a population of creatures that are diverse from each other according to some metric. In our case, let’s use the number of nodes, and the ratio of bonds to nodes.

Novelty search over Oceanworld genomes. Grey points represent the current population of genomes. Red points represent genomes sampled from a random distribution.

Novelty Search works by a simple loop:

Maintain a population of 1000 genomes.
Mutate each genome and add it to the population.
Sum the distance each genome has to its 10 closest neighbors, in terms of some novelty metric.
Sort the populations by increasing distance, and keep only the top 1000.

Working correctly, the novelty search algorithm should optimize for a group of genomes that are spread out in terms of number of nodes and bonds-node ratio.

I always like to sanity check by looking in, so let’s take a peek at some sample points and see how the creatures look.

10% nodes, 100% bond ratio

50% nodes, 100% bond ratio

100% nodes, 100% bond ratio

10% nodes, 10% bond ratio

50% nodes, 10% bond ratio

100% nodes, 10% bond ratio

There’s a key thing missing in our Oceanworld creatures – they don’t actually do anything. The “Quality” in Quality-Diversity has to measure something. So let’s choose something: a creature that swims faster is better.

First, let’s try out a greedy evolutionary algorithm. We’ll start with a random creature, and measure how far it travels. Then, we’ll randomly mutate its genome and try again. If the new creature performs better, it replaces the old one.

Initial genome #1

Halfway through training

Fully trained

Initial genome #2

Halfway through training

Fully trained

Like a pair of deformed jellyfish, our creatures have evolved to move forwards by pulsating their muscle bonds. I doubt I would have been able to hand-design any of these creatures – the physics relationships are just too complex. Thankfully, evolution takes care of it for us.

But are these really the best creatures? The key observation here is that different starting genomes converge to different solutions. This tells us that our possibility space has local minima – genomes where any small change will hurt performance, even though a larger change could be beneficial.

It turns out that local minima are part of a classic paradox: “the path to success does not look like success”. When Carl Benz proposed the motor engine in a world of horse carriages, it was slow and heavy. But he kept at it, and motor cars emerged the victor.

In some form, Benz was using a quality-diversity algorithm: motor engines were different, so they deserved a chance. Quality-diversity algorithms help us escape local minima, because they force us to consider a group of genomes rather than a single leader. With more diversity, there is more chance to discover a path forwards.

MAP-Elites

MAP-Elites is a quality-diveristy algorithm that combines novelty search with evolution. Instead of evolving a single optimal genome, we want to evolve an optimal genome for every point in some grid of features.

MAP-Elites over Oceanworld genomes. Yellow spots represent genomes with higher performance.

In our case, we can use the two metrics from before – node count and bond ratio. Every spot in the matrix corresponds to some range of the metrics (e.g. 50-60 nodes and 30%-40% bonds), and we store the optimal genome that fits the criteria.

MAP-Elites works by:

Maintain a 32 x 32 matrix of genomes.
Pick a random genome, and mutate it.
Measure its metrics, and find which spot in the grid the genome belongs to.
If the new genome performs better than the old genome in the spot, replace it.

As always, let's get our hands dirty and check out what the optimal genomes end up looking like.

10% nodes, 100% bond ratio

50% nodes, 100% bond ratio

100% nodes, 100% bond ratio

10% nodes, 50% bond ratio

50% nodes, 50% bond ratio

100% nodes, 50% bond ratio

10% nodes, 10% bond ratio

50% nodes, 10% bond ratio

100% nodes, 10% bond ratio

In the MAP-Elites genomes, we get a whole bunch of different creatures, and some cool patterns emerge. The fastest creatures turn out to be the smaller ones, likely due to their lower mass. Larger creatures tend to develop a kind of "tail" -- my guess is that this acts as a stabilizer so the creature doesn't drift off direction.

The last row is especially interesting, since the larger creatures develop a completely different movement strategy. Instead of moving its entire body, the creature unrolls itself to stretch forwards.

Quality-diversity algorithms are an easy way to explore the possibilty space of an open-ended world. The techniques, such as novelty search and MAP-Elites, are simple but yield strong results when given proper metrics of diversity. In this Oceanworld, the limiting factor is the expressivity of our genome -- there are only so many ways to build swimming creatures out of a single type of bond.

Further work can involve:

How can we increase the expressivity of our genome? Can we make the genome more information-dense in relation to the creature it defines? What kinds of genomes allow innovations to transfer between creatures?
What are good definitions of diversity? Can we define diversity in terms of a creature's behavior, rather than its structure?

Appendix

In the MAP-Elites method, it turns to be helpful to first run novelty search to get a population of diverse genomes. We start the MAP-Elites matrix with this diverse population, rather than a population of random genomes.

MAP-Elites from random starting population.

MAP-Elites from novelty-search population.

Things To Read

For more on quality-diversity, a good starting point are the ICML tutorial slides below. The rest are some highlight papers with useful ideas.

"Recent Advances in Population-Based Search", ICML 2019 Tutorial, by Jeff Clune, Joel Lehman, and Ken Stanley. This is a great presentation that goes over recent works related to quality diversity. It introduces novelty search as a way to get around environments where greedily optimzing traps you in a local minimum, and then goes to to motivate quality-diversity algorithms where we simultaneously train populations of high-performing agents.

"Abandoning Objectives: Evolution through the Search for Novelty Alone", by Joel Lehman and Ken Stanley. This is the paper on Novelty Search. It argues that basing evolutionary search on fitness is limiting, since only creatures with high fitness will be considered. In environments such as hard mazes, the fitness landscape is non-convex and greedy optimization will fail. Instead, strong diversity metrics are often a better landscape to search through.

"Illuminating search spaces by mapping elites", by Jean-Baptiste Mouret and Jeff Clune. This is the MAP-Elites paper. They introduce the idea of "illumination algorithms", which aim to find the highest-performing solution for every point in a given feature space. The core MAP-Elites idea is pretty straightforward -- store the best solution at every point, and keep trying new solutions.

"POET: Paired Open-Ended Trailblazer", by Rui Wang et al. POET is a quality-diversity algorithm with a step towards open-endedness. The idea is similar to MAP-Elites, but instead of niches defined as properties of the genome, each niche is a task. New tasks are introduced randomly, and stay in the pool if agents can solve them. You end up with a growing population of tasks and population of agents that can solve those tasks.

What are good diversity metrics? Hand-designed metrics such as creature size only work to some degree. In my mind, this is an important question as "novelty" is hard to define concretely.

"Diversity is All You Need", by Benjamin Eysenbach et al. This paper describes a system where we want to find a set of diverse policies. They define diversity as:

A discriminator should be able to predict which policy is active based on state history.
A policy should cover as many states as it can.

"Stochastic Neural Networks for Hierarchical Reinforcement Learning", by Carlos Florensa, Yan Duan, and Pieter Abeel. Motivated by information theory, this work aims to find diverse policies by maximizing entropy. This ends up being a similar method as above -- policies should visit different states to be maximally diverse.

"Autonomous skill discovery with Quality-Diversity and Unsupervised Descriptors", by Antoine Cully. This work takes hints from dimensionality reducing methods such as PCA (Principle-component analysis). Basically, given a set of trajectories, we can map every trajectory onto a 2D space. This mapping becomes our diversity metric for running a quality-diversity method such as MAP-Elites.

Aside from how to discover diverse populations, the world and environment this all takes place in is crucial as well. These are some nice references for work in ALIFE and evolved creatures.

"Evolution Strategies", OpenAI. A good overview of how evolutionary systems can potentially outscale traditional reinforcement learning. The basic idea is that ES is inherently parallalizable, so systems with lots of CPU cores can scale well. ES also treats an entire episode with a single reward, so it doesn't suffer from things such as sparsity or time dependency.

"Evolving Virtual Creatures", by Karl Sims. This is a classic work from 1994, where virtual swimming robots are evolved to move forwards. The methods used as similar to what we did in this blog post, but the genome is defined as a hierarchy of parts, and there are basic neural networks for control.

"Evolving soft locomotion", by Francesco Corucci et al. A more recent paper that uses evolutionary methods to design soft swimming and walking robots. The world here uses creatures of deformable voxels, akin to legos, that apply a fluid resistance force when expanding (similar to what we did but in 3D). Their genome is a CPPN, a neural network C(x,y) that outputs the creature's properties at each coordinate.

Open-Endedness 3: Multicell World

Kevin Frans — Thu, 12 Nov 2020 01:35:00 GMT

How many shapes can be made from six 2x4 LEGO bricks? The answer is 915,103,765. By combining only a few simple parts, it’s surprisingly easy to achieve high levels of complexity.

A Lego Counting Problem: http://web.math.ku.dk/~eilers/lego.html

If you look at the single-celled organisms on Earth today, you’ll mostly be stuck with bacteria. But once you enter the multi-cellular domain, suddenly life shows all sorts of shapes and structures. The possibilty space of multi-celled organisms is much richer.

Recall our principles of open-endedness:

Possibility Space: A world consists of a set of organisms. The space of possible organisms must be arbitrarily large.
Peer Dependency: When new organisms appear, an environment must develop new niches.
Search Algorithm: A procedure must find new organisms to fit new niches.

In this post, let’s dive into how multi-cellularity can expand our possibility space towards increasingly complex genomes.

Multicell World

Last time, we worked with a simple Cellworld where cells could gain energy and reproduce. Starting with this is a base, let’s restructure the genome to support multi-cellularity, and also mix in the our bitwise chemical system.

In the Multicell world, an organism is made of a group of cells. Each cell performs a single chemical reaction, creating energy. When a cell has gathered enough energy, it will create a new cell in a neighboring space.

Similar to the Bitwise Chemicals project, there are eight chemicals. Every space in the grid has its own chemical counts, and a cell will only be able to perform a reaction if the required chemicals are there. Within an organism’s cells, chemical counts and energy are averaged out.

Green cells represent the reaction (01+02). These chemicals are provided periodically, at a higher rate near the bottom.

A brown reaction appears, which reacts the waste produced by the green reaction.

A wide variety of cells coexist, living off the outputs of neighboring reactions.

So, we now have a world that supports multi-cell organisms, and they can interact with each other and evolve. But there’s an obvious flaw: every cell is doing the same thing!

While we have different species performing different reactions, within an organism all the cells are the same. Let’s allow for cells to specialize, and perform different roles.

An organism comprised of different specialized cells. Cells behave differently depending on their pheromone.

In this setup, a genome is a decision tree. When a new cell is born, this tree will tell it what reaction to specialize in. Once a cell specializes, it will only perform that reaction.

Crucially, the genome decision tree takes as input a pheromone – four bits that are unique to each indicidual cell. When a cell makes a child, it can assign it a new pheromone, which can cause the child to specialize in a different reaction.

Put together, the genome of a organism is like an algorithm that tells the organism how to grow and what its cells should specialize in. Will species evolve to take advantage of multi-cellularity?

In the beginning, cells consist only of green reactions, living mostly near the bottom.

A species with both green and brown cells is twice as energy-efficient, since the output of green is used to power the brown reaction.

The next generation begins to differentiate, and we see diverse combinations of multi-cell reactions.

Here, the first species develops Killer cells. Killer cells cost energy, and have to be powered by neighboring reaction cells.

Descendants of the first Killer species spread. They develop different base reactions to power themselves.

The world eventually settles into species containing a few reaction cells along with sporadic Killer cells.

Through the multi-cell experiment, we see a bunch of species appear that use multiple specialized cells to perform chains of reactions. These reactions usually synergize, with one providing the materials for the other. As the world advances, we also see cells that use their excess energy to power Killer cells and gain space.

One interesting note are the patterns of cells. With the way genomes work, there’s no real definition of a “pattern” – it all depends on the dynamics of the pheromones. Perhaps a cell with a pheromone of [0001] will produce a child of pheromone [0010], whose own child will loop back to [0001].

In real-life plants, we often see the Fibonacci series, or spiral patterns. These are patterns that naturally arise when cells grow by replication, since they are easy ways to gain repetition. In our gridworld, we often see checker-board style patterns, which are similarly easy to achieve.

Further work can involve:

How can we increase the complexity of inter-species interactions beyong sharing chemicals and/or killing each other? What kinds of complexity lead to novel/interesing behavior?
Other than grid-based worlds, are there more interesting ways to represent physical organisms? What is the simplest physics system that will allow for movement?

Appendix

Here is a full run of the Multicell world. Shown on the left are average chemical counts throughout the grid, and on the right is a legend for what reactions each color represents.