- Hello, I am Luke Wood and this is "Theory and Application of Image Generation"
- This talk covers a little history of generative models, modern image generation models, and more specifically the architecture of the popular StableDiffusion model.
- For the technical parts, I will assume you have some knowledge of general machine learning, but not generative modeling.
Let's start out with an anecdote:
- I was in Bruges this past week
- wanted to get a picture by the famous waterway
- there were a few boats in the water
- normally you'd wait and take another picture
So while this might look photoshopped, it is not!
The edits made to this photo were actually done by a generative image model.
- used DALLE-2
- realized I had no image of the landscape itself!
- the weather was no longer good
there we go, now we have just the landscape.
Then I decided, well, while I'm using a generative image model I may as well
try something a bit more fun!
- as you can see, these models are quite capable
- DALLE-2 is a "multi-modal" model
- multi-modal model introduction
- point out wide variety of use cases
Follow along with the slides linked above.
If you have a laptop, use the web version; if you have a phone, use the PDF version.
There is some sort of bug in the slide rendering system I use on mobile
devices, but you can read the PDF version as a workaround.
- Got started with generative modeling around 2016
- These are the GitHub avatars of our ML group
- Actually this guy (Ian) works on KerasCV now in a full time role
- Still do some ML research with this professor in my spare time
- Keep an eye out for these guys a little later in the talk
But you could still do some fun stuff with generative modeling.
Girl with a Pearl Earring
Painting by Johannes Vermeer
This is why they are not used everywhere
So we just covered the concept of a latent representation of an
image dataset, and the fact that latent representations are
continuous. Any questions?
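One way to make "continuous" concrete: because nearby latent vectors decode to similar images, you can morph between two images by interpolating their latents. Here is a minimal sketch; `encoder` and `decoder` are hypothetical stand-ins for the two halves of a trained model, not a specific library API.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps=8):
    """Linearly interpolate between two latent vectors."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1 - a) * z_start + a * z_end for a in alphas])

# Usage with a hypothetical trained autoencoder:
# z_a, z_b = encoder(image_a), encoder(image_b)
# frames = decoder(interpolate_latents(z_a, z_b))  # images morphing a -> b
```

Because the latent space is continuous, every intermediate vector still decodes to a plausible image rather than to noise.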
Next, we will cover the diffusion model.
You may be familiar with the idea of super-resolution: it's possible to train a deep learning model to denoise an input image -- and thereby turn it into a higher-resolution version. The deep learning model doesn't do this by magically recovering the information that's missing from the noisy, low-resolution input -- rather, the model uses its training data distribution to hallucinate the visual details that would be most likely given the input.
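As a rough sketch of that idea, here is a toy Keras denoiser trained on (noisy, clean) image pairs. The architecture and the `clean_images` dataset are illustrative assumptions, not the actual super-resolution models described above; real models are far larger, but the training signal is the same: predict the clean image from a corrupted input.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_denoiser(image_size=64):
    """A small convolutional network mapping noisy images to clean ones."""
    inputs = keras.Input(shape=(image_size, image_size, 3))
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(3, 3, padding="same")(x)
    return keras.Model(inputs, outputs)

denoiser = build_denoiser()
denoiser.compile(optimizer="adam", loss="mse")

# Training pairs: corrupt clean images with Gaussian noise, then ask the
# model to recover the originals. `clean_images` is a hypothetical float32
# tensor of shape (num_images, 64, 64, 3) with values in [0, 1].
# noisy = clean_images + tf.random.normal(tf.shape(clean_images), stddev=0.1)
# denoiser.fit(noisy, clean_images, epochs=10)
```

The "hallucinated" details come entirely from what the model learned about the training data distribution, which is exactly the property diffusion models exploit.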
Next we will discuss CLIP
Note that the ResNet blocks don't directly look at the text. Instead, the attention layers merge the text representations into the latents, and the next ResNet block can then use that incorporated text information in its processing.
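Here is a minimal Keras sketch of that cross-attention step: the spatial latents act as queries, and the text embeddings act as keys and values, so text information gets mixed into the latents before the next ResNet block sees them. The shapes and layer sizes are illustrative assumptions, not the actual StableDiffusion dimensions.

```python
from tensorflow import keras
from tensorflow.keras import layers

latent_tokens = keras.Input(shape=(64 * 64, 320))   # flattened spatial latents
text_embeddings = keras.Input(shape=(77, 768))      # CLIP-style text sequence

# query = latents, key/value = text: each latent position attends to the prompt.
attention = layers.MultiHeadAttention(num_heads=8, key_dim=40)
attended = attention(query=latent_tokens, value=text_embeddings, key=text_embeddings)

# Residual connection keeps the original latent content and adds text context.
merged = layers.Add()([latent_tokens, attended])
cross_attention_block = keras.Model([latent_tokens, text_embeddings], merged)
```

The residual add is the key design choice: the latents are enriched with text information rather than replaced by it, so the downstream ResNet blocks still see the image content.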