No longer supported

This model is legacy (released November 2022) and is no longer actively maintained or recommended by Stability AI. Consider using their current flagship model instead.

Legacy · Image · Open Source

Stable Diffusion 2.0

by Stability AI · Released November 2022 · Cutoff 2022

Stable Diffusion 2.0 is a text-to-image diffusion model that generates high-resolution images from textual descriptions. It introduces a new text encoder (OpenCLIP) and supports image-to-image, inpainting, and depth-guided generation. This model marked a significant improvement over the original Stable Diffusion in terms of image quality and compositional understanding.
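As a rough illustration of the text-to-image mode described above, the sketch below uses the Hugging Face `diffusers` library. The library, the `stabilityai/stable-diffusion-2` checkpoint ID, the scheduler choice, and the `GenerationSettings` helper are assumptions of this example and are not stated on this page. The heavy imports are deferred so that the settings helper can be used without a GPU.

```python
from dataclasses import dataclass


@dataclass
class GenerationSettings:
    """Hypothetical helper (not part of any library) bundling common sampling options."""

    prompt: str
    height: int = 768  # assumed native resolution of the 768x768 checkpoint
    width: int = 768
    num_inference_steps: int = 50
    guidance_scale: float = 7.5

    def as_kwargs(self) -> dict:
        # The pipeline works in a latent space downsampled by 8,
        # so both sides must be multiples of 8.
        if self.height % 8 or self.width % 8:
            raise ValueError("height and width must be divisible by 8")
        return {
            "prompt": self.prompt,
            "height": self.height,
            "width": self.width,
            "num_inference_steps": self.num_inference_steps,
            "guidance_scale": self.guidance_scale,
        }


def generate(settings, model_id="stabilityai/stable-diffusion-2", device="cuda"):
    """Load the pipeline and render one image (needs torch, diffusers, and the weights)."""
    import torch
    from diffusers import EulerDiscreteScheduler, StableDiffusionPipeline

    scheduler = EulerDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, scheduler=scheduler, torch_dtype=torch.float16
    ).to(device)
    return pipe(**settings.as_kwargs()).images[0]
```

With the dependencies installed, a call such as `generate(GenerationSettings(prompt="an astronaut riding a horse")).save("out.png")` would write a single image to disk.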

Official Site · API Docs · 🤗 Hugging Face · 📄 Research Paper

Input cost

Free (open source)

Output cost

Free (open source)

Context window

—

Max output

—

Modalities

image

Parameters

Approx. 865M (UNet) + 84M (VAE) + 354M (text encoder)

License

CreativeML Open RAIL++-M

Capabilities

Text-to-Image · Image-to-Image · Inpainting · Depth-Guided Generation · Upscaling
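One way to reach each of these capabilities in code is through a separate pipeline class. The sketch below assumes the Hugging Face `diffusers` library and the checkpoint IDs published under the `stabilityai` organization; neither is named on this page, and `PIPELINES` and `load_pipeline` are hypothetical helper names.

```python
# Capability -> (diffusers pipeline class name, Hugging Face checkpoint ID).
# Both columns are assumptions of this sketch, not details from this page.
PIPELINES = {
    "text-to-image": ("StableDiffusionPipeline", "stabilityai/stable-diffusion-2"),
    "image-to-image": ("StableDiffusionImg2ImgPipeline", "stabilityai/stable-diffusion-2"),
    "inpainting": ("StableDiffusionInpaintPipeline", "stabilityai/stable-diffusion-2-inpainting"),
    "depth-guided": ("StableDiffusionDepth2ImgPipeline", "stabilityai/stable-diffusion-2-depth"),
    "upscaling": ("StableDiffusionUpscalePipeline", "stabilityai/stable-diffusion-x4-upscaler"),
}


def load_pipeline(capability: str):
    """Instantiate the pipeline for a capability; downloads weights on first use."""
    import diffusers  # deferred so the lookup table works without the library

    class_name, checkpoint = PIPELINES[capability.lower()]
    pipeline_cls = getattr(diffusers, class_name)
    return pipeline_cls.from_pretrained(checkpoint)
```

Keeping the table separate from the loader makes it easy to list what is available without triggering a multi-gigabyte download.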

Best For

High-quality image generation from text prompts with improved compositional accuracy.

Strengths

Improved image quality over Stable Diffusion 1.x
Better compositional understanding
Supports multiple generation modes (text-to-image, image-to-image, inpainting)
Open source and freely available

Limitations

Slower inference compared to newer models
May struggle with complex prompts involving fine details
Requires significant GPU memory for high-resolution outputs
Not as refined as later versions (e.g., SDXL)
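The memory limitation can be made concrete with some arithmetic. The sketch assumes the usual latent-diffusion layout (a VAE that downsamples each side by 8 into 4 latent channels) and a naive self-attention implementation whose largest attention map grows with the square of the latent token count. Neither detail is stated on this page, and the function names are illustrative.

```python
VAE_DOWNSAMPLE = 8   # assumed spatial downsampling factor of the VAE
LATENT_CHANNELS = 4  # assumed number of latent channels


def latent_shape(height: int, width: int) -> tuple:
    """Shape (channels, rows, cols) of the latent the UNet denoises."""
    if height % VAE_DOWNSAMPLE or width % VAE_DOWNSAMPLE:
        raise ValueError("height and width must be divisible by 8")
    return (LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE)


def attention_cost_ratio(height: int, width: int, base: int = 512) -> float:
    """Relative size of the largest self-attention map vs. a base x base image."""
    _, rows, cols = latent_shape(height, width)
    _, base_rows, base_cols = latent_shape(base, base)
    token_ratio = (rows * cols) / (base_rows * base_cols)
    return token_ratio ** 2  # attention maps are tokens x tokens


print(latent_shape(768, 768))          # (4, 96, 96)
print(attention_cost_ratio(768, 768))  # 5.0625
```

Under these assumptions, moving from 512x512 to 768x768 multiplies the pixel count by 2.25 but the largest attention map by about 5, which is why memory-efficient attention or attention slicing is commonly enabled at higher resolutions.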

Use Cases

Generating artwork and illustrations from text descriptions

Photo editing and manipulation via inpainting

Creating variations of existing images

Depth-aware image generation for 3D applications

Rapid prototyping for design concepts

Educational projects and research in generative AI
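For the image-variation use case, the main control is how much noise is added to the input image before it is denoised again. The sketch below assumes the convention used by the `diffusers` image-to-image pipelines, where a `strength` value between 0 and 1 selects what fraction of the scheduled denoising steps is actually run. The library and the function name `effective_steps` are assumptions of this example.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps actually executed for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Low strength keeps the result close to the input; 1.0 ignores it almost entirely.
    return min(int(num_inference_steps * strength), num_inference_steps)


print(effective_steps(50, 0.5))   # 25
print(effective_steps(50, 0.75))  # 37
print(effective_steps(50, 1.0))   # 50
```

A practical consequence is that low-strength variations are also cheaper, since fewer UNet passes are needed.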

Improvements Over Previous Model

New text encoder (OpenCLIP) replaces original CLIP, improving text understanding
Supports depth-guided generation via MiDaS depth estimation
Increased native image resolution to 768x768 (from 512x512)
Improved image quality and compositional accuracy
Added support for image-to-image and inpainting out of the box
