Denoisers

DRLX generally uses conditioned denoisers for diffusion modelling. The library is currently built with text conditioning in mind, but the base classes are designed to generalize: the conditional denoiser supports any kind of conditioning signal that produces an embedding.

BaseConditionalDenoiser

class drlx.denoisers.BaseConditionalDenoiser(config: ModelConfig, sampler_config: SamplerConfig | None = None, sampler: Sampler | None = None)

Bases: Module

Base class for any denoiser that takes a conditioning signal during the denoising process, including text-conditioned denoisers.

Parameters:

config (ModelConfig) – Configuration for model
sampler_config (SamplerConfig) – Configuration for sampler (optional). If provided, will create a default sampler.
sampler (Sampler) – Can be provided as an alternative to sampler_config (also optional). If neither is provided, a default sampler will be used.
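The way the two optional sampler arguments interact can be sketched roughly as follows. This is a minimal illustration, not DRLX's actual constructor: the `SamplerConfig` and `Sampler` classes below are hypothetical stand-ins, the `num_inference_steps` field and the `resolve_sampler` helper are invented for the example, and the rule that an explicit sampler wins when both arguments are given is an assumption the source does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplerConfig:
    """Hypothetical stand-in for DRLX's SamplerConfig."""
    num_inference_steps: int = 50  # illustrative field, not taken from DRLX


class Sampler:
    """Hypothetical stand-in for DRLX's Sampler."""

    def __init__(self, config: Optional[SamplerConfig] = None):
        # Fall back to default settings when no config is given.
        self.config = config if config is not None else SamplerConfig()


def resolve_sampler(sampler_config: Optional[SamplerConfig] = None,
                    sampler: Optional[Sampler] = None) -> Sampler:
    """Pick the sampler a denoiser would end up using."""
    # Assumption: an explicitly passed sampler takes precedence.
    if sampler is not None:
        return sampler
    # A config on its own builds a default sampler from that config.
    if sampler_config is not None:
        return Sampler(sampler_config)
    # Neither argument given: default sampler with default settings.
    return Sampler()
```

Calling `resolve_sampler()` with no arguments yields a sampler with default settings, which mirrors the "if neither is provided" case described above.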

abstract decode(latent: Tensor) → Tensor[Tensor]: Decode latent vector into an image (typically called in postprocess)

abstract encode(pixel_values: Tensor[Tensor]) → Tensor: Encode image into latent vector

abstract forward(*inputs) → Tensor[Tensor]: Forward pass for denoiser. Output varies based on prediction type.
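To illustrate what "output varies based on prediction type" can mean in practice, the sketch below recovers the clean sample from a denoiser output under three common parameterizations. The names "epsilon", "v_prediction" and "sample" follow the convention used by diffusers schedulers; whether DRLX denoisers use exactly these is an assumption. Scalars stand in for tensors so the snippet runs without torch, and `predict_x0` is a hypothetical helper, not part of DRLX.

```python
import math


def predict_x0(model_output: float, x_t: float, alpha_bar: float,
               prediction_type: str) -> float:
    """Recover the clean sample x0 from a denoiser output.

    Assumes the forward process x_t = sqrt(alpha_bar) * x0
    + sqrt(1 - alpha_bar) * eps.
    """
    sqrt_ab = math.sqrt(alpha_bar)
    sqrt_one_minus_ab = math.sqrt(1.0 - alpha_bar)
    if prediction_type == "epsilon":
        # The output is the noise eps that was added to x0.
        return (x_t - sqrt_one_minus_ab * model_output) / sqrt_ab
    if prediction_type == "v_prediction":
        # The output is v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0.
        return sqrt_ab * x_t - sqrt_one_minus_ab * model_output
    if prediction_type == "sample":
        # The output is already the clean sample.
        return model_output
    raise ValueError(f"Unknown prediction type: {prediction_type!r}")
```

All three branches return the same x0 when fed consistent inputs, so the prediction type changes how the output must be interpreted, not what the sampler ultimately reconstructs.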

abstract get_input_shape() → Tuple

Get input shape for denoiser. Useful during training and sampling, when the shape of the input noise to the denoiser is needed.

Returns: Input shape as a tuple
Return type: Tuple[int]

abstract postprocess(output) → ndarray

Called on the output from the model after sampling to give the final image.

Returns: Final denoised image as uint8 numpy array
Return type: np.ndarray
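The conversion to uint8 usually comes down to a rescale, a clamp and a rounding step. The sketch below shows that arithmetic on a flat list of floats so it runs without torch or numpy. It assumes the decoded image lies in [-1, 1], which is the convention in diffusers' Stable Diffusion pipelines; the range DRLX uses is not stated here, and `to_uint8` is a hypothetical helper.

```python
from typing import Iterable, List


def to_uint8(values: Iterable[float]) -> List[int]:
    """Map floats in [-1, 1] to integers in [0, 255]."""
    out = []
    for v in values:
        scaled = v / 2.0 + 0.5                # [-1, 1] -> [0, 1]
        clamped = min(max(scaled, 0.0), 1.0)  # cut off out-of-range values
        out.append(int(round(clamped * 255.0)))
    return out
```

A real postprocess would apply the same steps to a whole image tensor and then convert it to a numpy array of dtype uint8.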

abstract preprocess(*inputs) → Tensor[Tensor]

Called on the conditioning input (typically: tokenizes text prompt)

Returns: Conditioning input embeddings (e.g. text embeddings) as tensors
Return type: torch.Tensor

sample(**kwargs)

Use the sampler to sample an image. The result must be passed through postprocess to obtain a final image. Note that different samplers have different outputs.

Parameters: kwargs – Keyword arguments to sampler
Returns: Varies per sampler but always includes denoised latent/images

training: bool

LDMUNet

class drlx.denoisers.ldm_unet.LDMUNet(config: ModelConfig, sampler_config: SamplerConfig | None = None, sampler: Sampler | None = None)

Bases: BaseConditionalDenoiser

Class for Latent Diffusion Model UNet denoiser. Sampler information can optionally be passed. Generally used in tandem with a diffusers pipeline.

Parameters:

config (ModelConfig) – Configuration for model
sampler_config (SamplerConfig) – Configuration for sampler (optional). If provided, will create a default sampler.
sampler (Sampler) – Can be provided as an alternative to sampler_config (also optional). If neither is provided, a default sampler will be used.

forward(pixel_values: Tensor[Tensor], time_step: Tensor[Tensor] | int, input_ids: Tensor[Tensor] | None = None, attention_mask: Tensor[Tensor] | None = None, text_embeds: Tensor[Tensor] | None = None) → Tensor[Tensor]: Forward pass for the text-conditioned UNet. Takes pixel_values and time_step, with the text conditioning given as input_ids and attention_mask or as text_embeds.

from_pretrained_pipeline(cls: Type, path: str)

Get the UNet from a pretrained model pipeline.

Parameters:

cls (Type) – Class to use for pipeline (e.g. StableDiffusionPipeline)
path (str) – Path to pretrained pipeline

Returns:

An LDMUNet object with the UNet, text encoder, VAE, tokenizer and scheduler from the pretrained pipeline. The pretrained pipeline is also returned in case the caller needs it.

Return type:

LDMUNet

get_input_shape() → Tuple[int]

Work out the latent noise input shape for the UNet. Requires that unet and vae are defined.

Returns: Input shape as a tuple
Return type: Tuple[int]
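As a rough illustration of why both the UNet and the VAE are needed, the latent shape can be derived from the UNet's input channel count and the VAE's downsampling factor. The sketch below uses the convention from diffusers, where the scale factor is 2 ** (len(block_out_channels) - 1). Whether DRLX computes the shape this way is an assumption, and `latent_input_shape` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def latent_input_shape(in_channels: int, image_size: int,
                       vae_block_out_channels: Sequence[int]) -> Tuple[int, int, int]:
    """Shape (channels, height, width) of the latent noise fed to the UNet."""
    # In a diffusers-style VAE encoder, every block except the last
    # downsamples by a factor of 2.
    scale = 2 ** (len(vae_block_out_channels) - 1)
    if image_size % scale != 0:
        raise ValueError("image_size must be divisible by the VAE scale factor")
    side = image_size // scale
    return (in_channels, side, side)
```

With Stable Diffusion v1 style values (4 latent channels, 512 pixel images, four VAE blocks), `latent_input_shape(4, 512, (128, 256, 512, 512))` gives `(4, 64, 64)`.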

postprocess(output: Tensor[Tensor], vae_device=None): Post-process the output from the model after sampling to give the final image.

preprocess(text: Iterable[str], mode='tokens', **embed_kwargs)

Preprocess text input, either into tokens or into embeddings.

Parameters:

mode (str) – Either “tokens” or “embeds”
text (Iterable[str]) – Text to preprocess

Returns:

Either a tuple of tensors for input_ids and attention_mask or a tensor of embeddings

Return type:

Union[Tuple[Tensor, Tensor], Tensor]
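The two modes and their different return types can be sketched with a toy tokenizer and embedder. Everything below is hypothetical and stands in for the real text encoder stack: the vocabulary, `toy_tokenize`, `toy_embed` and the fixed padding length are invented for illustration, and nested lists replace tensors so the snippet runs without torch.

```python
from typing import Iterable, List, Tuple, Union

# Toy vocabulary, purely illustrative.
VOCAB = {"<pad>": 0, "a": 1, "photo": 2, "of": 3, "cat": 4, "dog": 5}

Tokens = Tuple[List[List[int]], List[List[int]]]
Embeds = List[List[List[float]]]


def toy_tokenize(text: Iterable[str], max_length: int = 6) -> Tokens:
    """Return padded input_ids and the matching attention_mask."""
    input_ids, attention_mask = [], []
    for prompt in text:
        # Words outside the toy vocabulary are simply dropped.
        ids = [VOCAB[w] for w in prompt.lower().split() if w in VOCAB][:max_length]
        pad = max_length - len(ids)
        input_ids.append(ids + [VOCAB["<pad>"]] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return input_ids, attention_mask


def toy_embed(input_ids: List[List[int]]) -> Embeds:
    """Map each token id to a made-up 2-dimensional embedding."""
    return [[[float(i), i / len(VOCAB)] for i in ids] for ids in input_ids]


def preprocess(text: Iterable[str], mode: str = "tokens") -> Union[Tokens, Embeds]:
    input_ids, attention_mask = toy_tokenize(text)
    if mode == "tokens":
        # Tuple of (input_ids, attention_mask).
        return input_ids, attention_mask
    if mode == "embeds":
        # A single structure of embeddings.
        return toy_embed(input_ids)
    raise ValueError(f"mode must be 'tokens' or 'embeds', got {mode!r}")
```

Because the return type depends on mode, callers typically unpack the result in "tokens" mode and use it directly in "embeds" mode.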

training: bool