Theory and Application of Image Generation

Image generation, theory and application

Luke Wood

The Code, Slides, Demos

About me

  • From San Diego
  • Work on the Keras team
  • The last year or so on KerasCV
  • Pursuing Doctorate at UC San Diego

Background in Generative Modeling

  • ML since 2015
  • Generative modeling since 2016 (off & on)
  • Recent work on StableDiffusion in KerasCV

Generative modeling, why should you care...

Historically, you could...

Generate fake shoe pictures

Learn the latent space of a dataset!

(More on this later...)

Generate DeepFakes

All quite interesting...

  • but nothing particularly useful
  • too difficult to control

Until... DALL-E 2!

And then... StableDiffusion!

Stable Diffusion is a deep learning, text-to-image model released by the startup Stability AI in 2022.

Most importantly, StableDiffusion is 100% open source... and generously licensed

"A gentleman otter in a 19th century portrait"

"A cute magical flying dog, fantasy art drawn by Disney concept artists"

"pencil sketch of robots playing poker"

"Multicolor hyperspace"

But that's not all!

Image to image workflows GUIDED by text

Image to image inpainting (as seen in the intro)!

... and outpainting!

... and variation generation!

Now that I have your attention...

Let's take a step back! How does this all work?

Representations & Continuity

AutoEncoders

  • AutoEncoders: travel back to 1987
  • early days of ML
  • no large scale data
  • unfortunately, no good visual results for you!
  • backprop "without a teacher"

Flash forward to the 2010s

TensorFlow, GPUs, large datasets

AutoEncoders are a form of compression
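To make that concrete, here is a minimal sketch (mine, not from the original slides) of a fully connected autoencoder in Keras; the layer sizes and the single training epoch are illustrative only:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Encoder: squeeze each 784-pixel digit down to a tiny latent code.
encoder = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(2),  # the compressed (latent) representation
])

# Decoder: reconstruct the original pixels from the latent code alone.
decoder = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(784, activation="sigmoid"),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255
# The "teacher" is the input itself: learn to compress and reconstruct it.
autoencoder.fit(x_train, x_train, epochs=1, batch_size=128)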

Caveats

  • data specific
  • lossy
  • "They are rarely used in practical applications" - Keras blog in 2016

... but what happens in between real samples?

import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras


def plot_label_clusters(vae, data, labels):
    # display a 2D plot of the digit classes in the latent space
    # (`vae` is a trained variational autoencoder with an `encoder` sub-model)
    z_mean, _, _ = vae.encoder.predict(data)
    plt.figure(figsize=(12, 10))
    plt.scatter(z_mean[:, 0], z_mean[:, 1], c=labels)
    plt.colorbar()
    plt.xlabel("z[0]")
    plt.ylabel("z[1]")
    plt.show()


(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train, -1).astype("float32") / 255

plot_label_clusters(vae, x_train, y_train)

Generate new images!

Continuity!

Latent space walking, or latent space exploration, is the process of sampling a point in latent space and incrementally changing the latent representation. Its most common application is generating animations where each sampled point is fed to the decoder and is stored as a frame in the final animation. For high-quality latent representations, this produces coherent-looking animations. These animations can provide insight into the feature map of the latent space, and can ultimately lead to improvements in the training process.
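As a sketch of what that looks like in code (a hypothetical helper, assuming the `vae` from the snippet above with `encoder` and `decoder` sub-models), a linear walk between two encoded images might be:

import numpy as np

def latent_walk(vae, image_a, image_b, steps=30):
    # Encode both endpoints, then move from one latent point to the other.
    z_a, _, _ = vae.encoder.predict(image_a[None, ...])
    z_b, _, _ = vae.encoder.predict(image_b[None, ...])
    frames = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1 - t) * z_a + t * z_b               # incremental step through latent space
        frames.append(vae.decoder.predict(z)[0])  # decode the point into one frame
    return frames  # stitch these together to get the animation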

Panda ➡️ Plane

Dog ➡️ Bowl of fruit

A quick aside on Variational AutoEncoders (VAEs)...

import tensorflow as tf
from tensorflow.keras import layers


class Sampling(layers.Layer):
    """Uses (z_mean, z_log_var) to sample z, the vector encoding a digit."""

    def call(self, inputs):
        # Reparameterization trick: z = mean + std * epsilon, so gradients can
        # flow back through the (otherwise random) sampling step.
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
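
For context, a sketch (roughly following the standard keras.io VAE recipe) of how `Sampling` slots into the encoder, plus the KL-divergence term that keeps the latent space smooth and continuous:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 2

# The encoder predicts a distribution (mean + log-variance) for each input,
# and the Sampling layer above draws the actual latent vector z from it.
encoder_inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(encoder_inputs)
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")

# During training, a KL-divergence penalty is added to the reconstruction loss,
# pushing the latent distribution toward a standard normal:
#   kl_loss = -0.5 * mean(1 + z_log_var - z_mean**2 - exp(z_log_var))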

Any Questions?

(on continuity only please)

Congratulations!

You now understand approximately 1/4 of StableDiffusion.

Diffusion Models

Denoising Diffusion Probabilistic Models, 2020

Super-resolution

Push super resolution to the limit!

  • start from pure noise and denoise it step by step (sketch below)
  • proposed in 2020
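
A heavily simplified sketch of that reverse process (the `denoiser` model and the `alphas` noise schedule are stand-ins passed in by the caller; the real DDPM sampler has a few more details):

import numpy as np

def ddpm_sample(denoiser, alphas, shape=(1, 64, 64, 3)):
    """Toy reverse-diffusion loop: start from pure noise, denoise step by step."""
    alpha_bars = np.cumprod(alphas)
    x = np.random.normal(size=shape).astype("float32")  # start from pure Gaussian noise
    for t in reversed(range(len(alphas))):
        eps = denoiser(x, t)  # the model predicts the noise present in x at step t
        x = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a little noise so each intermediate step stays a proper sample.
            x = x + np.sqrt(1 - alphas[t]) * np.random.normal(size=shape)
    return x  # an image "hallucinated" from the training distribution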

More reading on keras.io

Any questions?

(On diffusion models)

Latent diffusion models

  • run the diffusion process in a compressed latent space (improves efficiency)
  • use the VAE decoder to map latents back to pixel space
  • the denoiser itself is a UNet (sketch below)
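
Put together (a sketch only; the component models are passed in as callables rather than taken from any particular library): the expensive iterative denoising happens in a small latent space, and the VAE decoder maps the result back to pixels.

import numpy as np

def latent_diffusion_sample(unet_step, vae_decoder, text_encoding, num_steps=50):
    # The UNet denoises a tiny 64x64x4 latent instead of a 512x512x3 image,
    # which is where the big efficiency win comes from.
    latent = np.random.normal(size=(1, 64, 64, 4)).astype("float32")
    for t in reversed(range(num_steps)):
        latent = unet_step(latent, t, text_encoding)  # one denoising step, guided by the text
    return vae_decoder(latent)  # VAE decoder: 64x64x4 latent -> 512x512x3 image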

CLIP

... what you need to know

We just need the text encoder
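
In KerasCV you can see this directly (a short sketch): the prompt is tokenized and run through the CLIP text encoder, and that encoding, not the raw text, is what conditions the diffusion model.

import keras_cv

model = keras_cv.models.StableDiffusion(jit_compile=True)

# CLIP's text encoder turns the prompt into a sequence of token embeddings...
encoding = model.encode_text("a gentleman otter in a 19th century portrait")

# ...and generation is driven entirely by that encoding.
images = model.generate_image(encoding, batch_size=3)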

CLIP

More reading available on the OpenAI blog post

The Final Piece...

Conditioning!

Conditioning

  • a classic deep learning trick
  • concatenate the conditioning signal onto the input
  • e.g. 64x64x3 ➡️ 64x64x4 (sketch below)
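
A tiny sketch of the trick in isolation (generic, not Stable Diffusion's exact wiring): the conditioning signal is simply concatenated onto the input along the channel axis, so the network sees it like any other channel.

import tensorflow as tf

images = tf.random.normal((8, 64, 64, 3))     # batch of inputs
condition = tf.random.normal((8, 64, 64, 1))  # per-pixel conditioning signal

# Classic deep learning: concatenate along the channel axis and carry on.
conditioned = tf.concat([images, condition], axis=-1)
print(conditioned.shape)  # (8, 64, 64, 4)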

That's All!

You now know how StableDiffusion works!

How do I use it?

Text to Image Generation

"An astronaut riding a horse"

Code:

from tensorflow import keras
import keras_cv

keras.mixed_precision.set_global_policy("mixed_float16")
model = keras_cv.models.StableDiffusion(jit_compile=True)

images = model.text_to_image(
    "Teddy bears conducting machine learning research",
    batch_size=4,
)
plot_images(images)
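
(`plot_images` is not defined on the slide; a minimal matplotlib helper along these lines, an assumption rather than the talk's exact helper, is enough to follow along.)

import matplotlib.pyplot as plt

def plot_images(images):
    # Lay the generated images out in a single row.
    plt.figure(figsize=(20, 20))
    for i, image in enumerate(images):
        plt.subplot(1, len(images), i + 1)
        plt.imshow(image)
        plt.axis("off")
    plt.show()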

Variation generation

Remember CLIP?

Switch it out!

It's really that easy!
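
The point here is that the diffusion model only ever sees a CLIP embedding, so swapping the text encoder for CLIP's image encoder yields image variations. KerasCV does not ship the image encoder, but a simpler stand-in demonstrates the same property (reusing `model` and `plot_images` from above; the noise scale is illustrative): perturb the embedding and re-generate.

import tensorflow as tf

# Encode a prompt once, then perturb the encoding to get variations.
encoding = tf.cast(model.encode_text("a cute magical flying dog, fantasy art"), tf.float32)

for _ in range(4):
    noisy = encoding + tf.random.normal(tf.shape(encoding), stddev=0.2)
    plot_images(model.generate_image(noisy, batch_size=1))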


Textual Inversion

Teach new concepts to StableDiffusion!

Step 1: collect 3-5 images of your object

import numpy as np
import tensorflow as tf
from tensorflow import keras
from keras_cv import visualization

urls = [
    "https://i.imgur.com/VIedH1X.jpg",
    "https://i.imgur.com/iLkM4Ar.jpg",
    "https://i.imgur.com/eBw13hE.png",
]
files = [tf.keras.utils.get_file(origin=url) for url in urls]

# Resize images to the 512x512 resolution StableDiffusion works with
resize = keras.layers.Resizing(height=512, width=512, crop_to_aspect_ratio=True)
images = [keras.utils.load_img(img) for img in files]
images = [keras.utils.img_to_array(img) for img in images]
images = np.array([resize(img) for img in images])
visualization.plot_gallery(images, value_range=(0, 255), rows=1, cols=3)

Step 2: add a special token to the model vocabulary

your_token = '<any-special-name>'
tokenizer.add_token(your_token)

Step 3: construct an image-caption dataset

your_token = '<any-special-name>'
templates = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
    "the photo of a {}",
    # ...
]
templates = [t.format(your_token) for t in templates]

# Construct a TensorFlow dataset of the images + tokens
image_dataset = tf.data.Dataset.from_tensor_slices(images)
text_dataset = tf.data.Dataset.from_tensor_slices(templates)
# ... there is a bit more boilerplate to pre-process the text
# shuffle() needs an explicit buffer size
train_ds = tf.data.Dataset.zip((
    image_dataset.shuffle(buffer_size=len(images)),
    text_dataset.shuffle(buffer_size=len(templates)),
))

Step 4: Fine Tune the TextEncoder with your new dataset!

# Freeze the diffusion model and decoder; only the text encoder is updated.
stable_diffusion.diffusion_model.trainable = False
stable_diffusion.decoder.trainable = False
stable_diffusion.text_encoder.trainable = True

# StableDiffusionFineTuner is a custom keras.Model defined in the full example
# (not shown on this slide); it wires the frozen components into a training loop.
trainer = StableDiffusionFineTuner(stable_diffusion, name="trainer")
optimizer = keras.optimizers.SGD(learning_rate=5e-4)
trainer.compile(optimizer=optimizer, loss="mse")

# trainer trains the StableDiffusion model for you.
trainer.fit(
    train_ds,
    epochs=10,
    steps_per_epoch=200,
)

Results

images = stable_diffusion.text_to_image(
    "a photo of <any-special-name> wearing a top hat",
    batch_size=4,
)
plot_images(images)

Results

images = stable_diffusion.text_to_image(
    "An app icon of <any-special-name>.",
    batch_size=4,
)
plot_images(images)

Demo Time

Prompt requests?

Follow along on Colab!

Conclusions

  • limitless possibilities
  • the power of multi-modal models
  • how fast the field is evolving

More Workflows Coming Soon

Other workflows are coming to KerasCV soon!

Other links

Thank you!

Speaker notes:

Hello, I am Luke Wood and this is "Theory and Application of Image Generation". This talk covers a little history of generative models, modern image generation models, and more specifically the architecture of the popular StableDiffusion model. For the technical parts, I will assume you have some knowledge of general machine learning, but not generative modeling.

Let's start out with an anecdote: - I was in Bruges this past week - wanted to get a picture by the famous waterway - there were a few boats in the water - normally you would wait and take another picture

So while this might look photoshopped, it is not! The edits made to this photo were actually done by a generative image model. - used DALL-E 2 - realized I had no image of the landscape itself! - the weather was no longer good

There we go, now we have just the landscape. Then I decided, well, while I'm using a generative image model I may as well try something a bit more fun!

- as you can see, these models are quite capable - DALL-E 2 is a "multi-modal" model - introduce multi-modal models - point out the wide variety of use cases

Follow along with the slides linked above. If you have a laptop, use the web version; if you have a phone, use the PDF version. There is some sort of bug in the slide rendering system I use on mobile devices, but you can read the PDF version as a workaround.

- Got started with generative modeling around 2016 - These are the GitHub avatars of our ML group - Actually this guy (Ian) works on KerasCV now in a full-time role - Still do some ML research with this professor in my spare time - Keep an eye out for these guys a little later in the talk

But you could still do some fun stuff with generative modeling.

Girl with a Pearl Earring, painting by Johannes Vermeer

This is why they are not used everywhere

So we just covered the concept of a latent representation of an image dataset, and the fact that latent representations are continuous. Any questions?

Next, we will cover the diffusion model.

You may be familiar with the idea of super-resolution: it's possible to train a deep learning model to denoise an input image -- and thereby turn it into a higher-resolution version. The deep learning model doesn't do this by magically recovering the information that's missing from the noisy, low-resolution input -- rather, the model uses its training data distribution to hallucinate the visual details that would be most likely given the input.

Next we will discuss CLIP

Note that the ResNet blocks don't directly look at the text; the attention layers merge the text representations into the latents, and the next ResNet block can then use that incorporated text information in its processing.