GLM-Image
Overview
GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture, effectively pushing the upper bound of visual fidelity and fine-grained detail. In general image generation quality, it is on par with industry-standard LDM-based approaches, while demonstrating significant advantages in knowledge-intensive image generation scenarios.
Model architecture: a hybrid autoregressive + diffusion decoder design:
- Autoregressive generator: a 9B-parameter model initialized from GLM-4-9B-0414, with an expanded vocabulary that incorporates visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands it to 1K–4K tokens, corresponding to 1K–2K high-resolution image outputs. The AR model is available as the GlmImageForConditionalGeneration class of the transformers library (see the sketch after this list).
- Diffusion Decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images.
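Both components are exposed on the pipeline object. A minimal sketch of inspecting them, assuming they are registered under the attribute names used in the GlmImagePipeline signature below (vision_language_encoder for the AR model, transformer for the DiT decoder):

import torch
from diffusers import GlmImagePipeline

pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16)

# AR generator (GlmImageForConditionalGeneration from transformers)
print(type(pipe.vision_language_encoder).__name__)
# Diffusion decoder (single-stream DiT)
print(type(pipe.transformer).__name__)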
Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality.
- Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness.
- Decoder module: delivers high-frequency feedback targeting detail fidelity and text accuracy, resulting in highly realistic textures, lighting, and color reproduction, as well as more precise text rendering.
GLM-Image supports both text-to-image and image-to-image generation within a single model:
- Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios.
- Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects.
This pipeline was contributed by zRzRzRzRzRzRzR. The codebase can be found here.
Usage examples
Text to Image Generation
import torch
from diffusers.pipelines.glm_image import GlmImagePipeline
pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")
prompt = "A beautifully designed modern food magazine style dessert recipe illustration, themed around a raspberry mousse cake. The overall layout is clean and bright, divided into four main areas: the top left features a bold black title 'Raspberry Mousse Cake Recipe Guide', with a soft-lit close-up photo of the finished cake on the right, showcasing a light pink cake adorned with fresh raspberries and mint leaves; the bottom left contains an ingredient list section, titled 'Ingredients' in a simple font, listing 'Flour 150g', 'Eggs 3', 'Sugar 120g', 'Raspberry puree 200g', 'Gelatin sheets 10g', 'Whipping cream 300ml', and 'Fresh raspberries', each accompanied by minimalist line icons (like a flour bag, eggs, sugar jar, etc.); the bottom right displays four equally sized step boxes, each containing high-definition macro photos and corresponding instructions, arranged from top to bottom as follows: Step 1 shows a whisk whipping white foam (with the instruction 'Whip egg whites to stiff peaks'), Step 2 shows a red-and-white mixture being folded with a spatula (with the instruction 'Gently fold in the puree and batter'), Step 3 shows pink liquid being poured into a round mold (with the instruction 'Pour into mold and chill for 4 hours'), Step 4 shows the finished cake decorated with raspberries and mint leaves (with the instruction 'Decorate with raspberries and mint'); a light brown information bar runs along the bottom edge, with icons on the left representing 'Preparation time: 30 minutes', 'Cooking time: 20 minutes', and 'Servings: 8'. The overall color scheme is dominated by creamy white and light pink, with a subtle paper texture in the background, featuring compact and orderly text and image layout with clear information hierarchy."
image = pipe(
prompt=prompt,
height=32 * 32,
width=36 * 32,
num_inference_steps=30,
guidance_scale=1.5,
generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("output_t2i.png")Image to Image Generation
import torch
from diffusers.pipelines.glm_image import GlmImagePipeline
from PIL import Image
pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")
image_path = "cond.jpg"
prompt = "Replace the background of the snow forest with an underground station featuring an automatic escalator."
image = Image.open(image_path).convert("RGB")
image = pipe(
prompt=prompt,
image=[image],  # multiple condition images can be passed for multi-image-to-image generation, e.g. [image, image1]
height=33 * 32,
width=32 * 32,
num_inference_steps=30,
guidance_scale=1.5,
generator=torch.Generator(device="cuda").manual_seed(42),
).images[0]
image.save("output_i2i.png")- Since the AR model used in GLM-Image is configured with
do_sample=Trueand a temperature of0.95by default, the generated images can vary significantly across runs. We do not recommend setting do_sample=False, as this may lead to incorrect or degenerate outputs from the AR model.
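Because of this sampling behavior, a common workflow is to generate several candidates and keep the best one. A minimal sketch, reusing only the pipeline arguments shown in the examples above:

import torch
from diffusers import GlmImagePipeline

pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16, device_map="cuda")

prompt = "A photo of an astronaut riding a horse on mars"

# Generate a few candidates; AR sampling means outputs can differ noticeably between runs.
for seed in (0, 1, 2):
    image = pipe(
        prompt=prompt,
        height=32 * 32,
        width=32 * 32,
        num_inference_steps=30,
        guidance_scale=1.5,
        generator=torch.Generator(device="cuda").manual_seed(seed),
    ).images[0]
    image.save(f"candidate_{seed}.png")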
GlmImagePipeline
class diffusers.GlmImagePipeline
< source >( tokenizer: ByT5Tokenizer processor: ProcessorMixin text_encoder: T5EncoderModel vision_language_encoder: PreTrainedModel vae: AutoencoderKL transformer: GlmImageTransformer2DModel scheduler: FlowMatchEulerDiscreteScheduler )
Parameters
- tokenizer (PreTrainedTokenizer) — Tokenizer for the text encoder.
- processor (AutoProcessor) — Processor for the AR model to handle chat templates and tokenization.
- text_encoder (T5EncoderModel) — Frozen text encoder for glyph embeddings.
- vision_language_encoder (GlmImageForConditionalGeneration) — The AR model that generates image tokens from text prompts.
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
- transformer (GlmImageTransformer2DModel) — A text-conditioned transformer (DiT) to denoise the encoded image latents.
- scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
Pipeline for text-to-image generation using GLM-Image.
This pipeline integrates both the AR (autoregressive) model for token generation and the DiT (diffusion transformer) model for image decoding.
__call__
< source >( prompt: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 1.5 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prior_token_ids: typing.Optional[torch.FloatTensor] = None prior_image_token_ids: typing.Optional[torch.Tensor] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) output_type: str = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 2048 ) → GlmImagePipelineOutput or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. Must contain shape info in the format "H W", where H and W are token dimensions (each token corresponds to 32 pixels). Example: "A beautiful sunset 36 24" generates a 1152x768 image.
- image — Optional condition images for image-to-image generation.
- height (int, optional) — The height in pixels. If not provided, derived from the prompt shape info.
- width (int, optional) — The width in pixels. If not provided, derived from the prompt shape info.
- num_inference_steps (int, optional, defaults to 50) — The number of denoising steps for the DiT.
- guidance_scale (float, optional, defaults to 1.5) — Guidance scale for classifier-free guidance.
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- generator (torch.Generator, optional) — Random generator for reproducibility.
- output_type (str, optional, defaults to "pil") — Output format: "pil", "np", or "latent".
Returns
GlmImagePipelineOutput or tuple
Generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import GlmImagePipeline
>>> pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt).images[0]
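>>> # Explicit pixel dimensions are also accepted (see the height/width parameters above);
>>> # token dimensions map to pixels with a factor of 32, e.g. 36 x 24 tokens -> 1152 x 768 px.
>>> image_hd = pipe(prompt, height=36 * 32, width=24 * 32).images[0]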
>>> image.save("output.png")encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None max_sequence_length: int = 2048 )
Parameters
- prompt (str or List[str], optional) — Prompt to be encoded.
- do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
- num_images_per_prompt (int, optional, defaults to 1) — Number of images that should be generated per prompt.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- device (torch.device, optional) — torch device to place the resulting embeddings on.
- dtype (torch.dtype, optional) — torch dtype.
- max_sequence_length (int, defaults to 2048) — Maximum sequence length of the encoded prompt. Can be set to other values but may lead to poorer results.
Encodes the prompt into text encoder hidden states.
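A minimal sketch of calling encode_prompt directly to pre-compute the glyph text embeddings. The exact return structure is not documented here; with classifier-free guidance enabled it is assumed to include both the positive and negative embeddings, so check the pipeline source before unpacking the result:

import torch
from diffusers import GlmImagePipeline

pipe = GlmImagePipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16).to("cuda")

# Pre-compute glyph text embeddings for a prompt (return structure assumed, see note above).
embeds = pipe.encode_prompt(
    prompt="A photo of an astronaut riding a horse on mars",
    do_classifier_free_guidance=True,
    num_images_per_prompt=1,
    device=pipe.device,
)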
GlmImagePipelineOutput
class diffusers.pipelines.glm_image.pipeline_output.GlmImagePipelineOutput
< source >( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )
Output class for GLM-Image pipelines.