In Lesson 10 of the fastai course (Part 2) Jeremy gives us the following homework assignment:
try picking one of the extra tricks we learned about like image-to-image, or negative prompts; see if you can implement negative prompt in your version of this; or try doing image-to-image; try adding callbacks
In this blog post I’ll implement image-to-image generation using the diffusion loop code provided in the course’s Stable Diffusion with Diffusers notebook.
I’ll start by copy/pasting all of the boilerplate code provided in that notebook, and running it to make sure we get the desired images.
The original image contains values between 0 and 255:
np.array(im).min(), np.array(im).max()
(0, 255)
The following lines load the image, resize it to the desired size (512x512), and convert it to a tensor using torchvision.transforms.ToTensor. I also make sure the image is on the GPU and in half-precision (which is what the VAE uses).
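Roughly, those lines look like the following (a minimal sketch; the variable names are my own, and the vae.encode call with the 0.18215 scaling factor mirrors the 1/0.18215 used when decoding later on):

import torch
import torchvision.transforms as T
from PIL import Image

# load and resize the initial image, then convert to a (1, 3, 512, 512) tensor in [0, 1]
im = Image.open(init_image_path).resize((512, 512))
img = T.ToTensor()(im).unsqueeze(0).to("cuda").half()

# the VAE encoder expects pixel values in [-1, 1], so rescale before encoding
with torch.no_grad():
    latents = vae.encode(img * 2 - 1).latent_dist.sample() * 0.18215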
We don’t want to start the diffusion loop with the original image’s latents because the UNet is trained to predict noise on noisy latents. In order to give the UNet its expected input (noisy latents) we need to add noise to our initial image’s latents!
We don’t want to literally add (with +) noise to the latents. Instead, we want to simulate the diffusion process as if it were starting from pure random noise. To do this, we need to prep the scheduler with the total number of steps (so it can calculate noise appropriately), pick some initial step for our noise, and add the noise to our latents with add_noise:
# set timesteps
steps = 70
scheduler.set_timesteps(steps)

# get start timestep
init_strength = 0.15  # can be anything 0-1
init_step = int(init_strength * steps)
ts_start = torch.tensor([scheduler.timesteps[init_step]])

# create noise
bs = 1
noise = torch.randn((bs, unet.in_channels, height//8, width//8)).to("cuda")

# add noise
latents = scheduler.add_noise(latents, noise, ts_start).half()
/tmp/ipykernel_33/3549084226.py:12: FutureWarning: Accessing config attribute `in_channels` directly via 'UNet2DConditionModel' object attribute is deprecated. Please access 'in_channels' over 'UNet2DConditionModel's config object instead, e.g. 'unet.config.in_channels'.
noise = torch.randn((bs, unet.in_channels, height//8, width//8)).to("cuda")
latents.shape
torch.Size([1, 4, 64, 64])
Note that init_step and ts_start are two different values: init_step is an index into scheduler.timesteps, while ts_start is the scheduler timestep value at that index.
init_step, ts_start
(10, tensor([854.2174]))
Running the diffusion loop
I’ll define the remaining inputs (like the guidance scale g and the prompt embeddings emb) so I can run the diffusion loop with our initial image’s noisy latents. Note that we don’t start the diffusion loop at the first timestep but rather at the init_step we calculated above:
for i, ts in enumerate(tqdm(scheduler.timesteps[init_step:])):
    inp = scheduler.scale_model_input(torch.cat([latents] * 2), ts)
    with torch.no_grad():
        u, t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)
    pred = u + g*(t-u)
    latents = scheduler.step(pred, ts, latents).prev_sample

with torch.no_grad():
    final_image = vae.decode(1/0.18215 * latents).sample
display(mk_img(final_image[0]))
As a reminder, here is the initial image; we can see the similarities in color structure (note the transitions from red → blue → green → yellow).
Image.open(init_image_path)
Varying init_strength
With the core functionality of image-to-image generation working properly, I’ll wrap it all into a function so I can loop through different init_strength values to see how it affects the generated image.
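Here’s a sketch of that function (the name img2img and the argument defaults are my own; the body just bundles the noising and diffusion loop code from above):

def img2img(latents, emb, init_strength, steps=70, g=7.5):
    # prep the scheduler and noise the image latents at the chosen step
    scheduler.set_timesteps(steps)
    init_step = int(init_strength * steps)
    ts_start = torch.tensor([scheduler.timesteps[init_step]])
    noise = torch.randn((1, unet.config.in_channels, height//8, width//8)).to("cuda")
    lats = scheduler.add_noise(latents, noise, ts_start).half()
    # run the diffusion loop from init_step onward
    for ts in tqdm(scheduler.timesteps[init_step:]):
        inp = scheduler.scale_model_input(torch.cat([lats] * 2), ts)
        with torch.no_grad():
            u, t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)
        pred = u + g*(t-u)
        lats = scheduler.step(pred, ts, lats).prev_sample
    # decode the final latents back to image space
    with torch.no_grad():
        return vae.decode(1/0.18215 * lats).sample

Looping over a handful of strength values is then straightforward (the values here are just examples):

for s in (0.0, 0.25, 0.5, 0.75, 0.99):
    display(mk_img(img2img(latents, emb, s)[0]))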
As init_strength goes from 0.0 (totally random initial noise) to 0.99 (a very lightly noised initial image), we can see the generated image conform more and more closely to the color and structure of the initial image.
Final Thoughts
Working through this implementation solidified my understanding of the diffusion loop. A few small but key points that I paid more attention to this time around:
The VAE encoder expects input image values between -1 and 1, so we have to transform our image tensor accordingly.
Adding noise to our initial image’s latents requires:
Picking a total number of inference steps.
Picking an initial step (at which we will apply the noise) and the corresponding scheduler timestep.
Using scheduler.add_noise.
The text encoding process remains untouched (see the sketch after this list).
The only change to the diffusion loop is starting at scheduler.timesteps[init_step] instead of the first timestep.
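For completeness, the standard text encoding from the course notebook looks roughly like this (the tokenizer and text_encoder objects come from the notebook’s boilerplate, and the prompt is hypothetical; emb stacks the unconditional and conditional embeddings because the loop duplicates the latents for classifier-free guidance):

prompt = ["a watercolor landscape"]  # hypothetical prompt

text_input = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
uncond_input = tokenizer([""], padding="max_length",
                         max_length=tokenizer.model_max_length,
                         return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(text_input.input_ids.to("cuda"))[0].half()
    uncond_emb = text_encoder(uncond_input.input_ids.to("cuda"))[0].half()

# unconditional first, conditional second, matching the chunk(2) order in the loop
emb = torch.cat([uncond_emb, text_emb])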
The last piece of this HW assignment will be implementing callbacks, which I’ll do in a future blog post! Thanks for reading!