Learning Illumination Control in Diffusion Models

Nishit Anand, Manan Suri, Christopher Metzler^*, Dinesh Manocha^*, Ramani Duraiswami^*

University of Maryland, College Park

Accepted at ReALM-GEN, ICLR 2026

^*Equal advising

Abstract

Controlling illumination in images is essential for photography and visual content creation. While closed-source models have demonstrated impressive illumination control, open-source alternatives either require heavy control inputs like depth maps or do not release their data and code. We present a fully open-source and reproducible pipeline for learning illumination control in diffusion models.

Our approach builds a data engine that transforms well-lit images into supervised training triplets: a poorly illuminated input, a natural-language lighting instruction, and a well-illuminated output. We fine-tune a diffusion model on this data and report strong gains over SD 1.5, SDXL, and FLUX.1-dev img2img baselines in perceptual similarity, structural similarity, and identity preservation. We release code, data, and model weights for reproducibility.

Overview

We frame relighting as instruction-based image editing without paired real-world captures. Starting from high-quality portraits (FFHQ), we filter for consistent lighting, isolate the subject, estimate a lighting-neutral albedo, synthesize plausible poor lighting, and generate photographer-style lighting sentences with a vision–language model. The resulting triplets supervise InstructPix2Pix built on Stable Diffusion 1.5 (VAE and text encoder frozen; U-Net fine-tuned at $512 \times 512$).
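As a rough illustration of the triplet format described above, the sketch below stores one example and maps it onto a source image / edit prompt / edited image layout. The class name, field names, dictionary keys, file paths, and the sample instruction are all hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelightingTriplet:
    """One supervised example produced by the data engine (hypothetical layout)."""
    degraded_image: str  # path to the synthetically degraded, poorly lit input
    instruction: str     # natural-language lighting instruction
    target_image: str    # path to the original well-lit image


def to_edit_record(triplet):
    """Map a triplet onto a (source, prompt, target) record.

    The key names are illustrative assumptions, not the project's schema.
    """
    return {
        "input_image": triplet.degraded_image,
        "edit_prompt": triplet.instruction,
        "edited_image": triplet.target_image,
    }


# Placeholder paths and a made-up instruction, for illustration only.
example = RelightingTriplet(
    degraded_image="degraded/00001.png",
    instruction="Add soft, even frontal lighting.",
    target_image="well_lit/00001.png",
)
record = to_edit_record(example)
```

The degraded image acts as the conditioning input and the well-lit original as the target, so no real paired capture is needed.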

Data engine

CLIP filtering

We score each image with CLIP ViT-B/32 against seven lighting prompts and keep images whose average similarity exceeds 0.21, yielding roughly 12k well-lit faces (10k train / 1k val / 1k test).
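The filtering rule can be sketched as follows, assuming the score is the cosine similarity between an image embedding and each prompt embedding, averaged over the prompts. The embeddings are taken as precomputed vectors; the function names are illustrative, and the seven prompts themselves are not reproduced here.

```python
import math

THRESHOLD = 0.21  # average-similarity cutoff reported in the text


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def lighting_score(image_emb, prompt_embs):
    """Average cosine similarity of one image embedding to all prompt embeddings."""
    return sum(cosine(image_emb, p) for p in prompt_embs) / len(prompt_embs)


def keep_well_lit(image_embs, prompt_embs, threshold=THRESHOLD):
    """Return the indices of images whose average score exceeds the threshold."""
    return [
        i for i, emb in enumerate(image_embs)
        if lighting_score(emb, prompt_embs) > threshold
    ]
```

In practice the vectors would come from the CLIP ViT-B/32 image and text encoders; the toy vectors used to exercise this sketch only stand in for them.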

Figure 2: CLIP-based illumination filtering. Images scoring above our threshold of 0.21 (top row) show clear, well-lit faces, while images below the threshold (bottom row) show poor illumination or occluded faces.

Pipeline stages

Figure 3: Data engine pipeline. Starting from a large image collection, we filter for well-illuminated images using CLIP, segment the subject with SAM 3, extract a lighting-neutral albedo via Multi-Scale Retinex, apply synthetic illumination degradation using depth-aware Lambertian shading, and then generate natural-language lighting editing instructions with Qwen3-VL to complete each training triplet.
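To make the degradation stage concrete, here is a minimal sketch of depth-aware Lambertian shading on a grayscale albedo map. It assumes normals are estimated from depth by finite differences, a single directional light, and no ambient term; these choices, the sign convention for normals, and the function names are assumptions for illustration, not the released implementation.

```python
import math


def normals_from_depth(depth):
    """Estimate unit surface normals from a depth map with finite differences."""
    h, w = len(depth), len(depth[0])
    normals = []
    for y in range(h):
        row = []
        for x in range(w):
            # Central differences in the interior, one-sided at the borders.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dzdx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dzdy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            n = (-dzdx, -dzdy, 1.0)
            length = math.sqrt(sum(c * c for c in n))
            row.append(tuple(c / length for c in n))
        normals.append(row)
    return normals


def lambertian_relight(albedo, depth, light_dir):
    """Shade an albedo map with one directional light: I = albedo * max(0, n . l)."""
    norm = math.sqrt(sum(c * c for c in light_dir))
    light = tuple(c / norm for c in light_dir)
    normals = normals_from_depth(depth)
    return [
        [a * max(0.0, sum(nc * lc for nc, lc in zip(n, light)))
         for a, n in zip(a_row, n_row)]
        for a_row, n_row in zip(albedo, normals)
    ]
```

On a flat depth map every normal points along the z axis, so a frontal light leaves the albedo unchanged and a grazing light darkens it to zero. Moving the light direction is what produces the poorly lit inputs in this sketch.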

Lighting instructions

Figure 4: Editing instruction generation. We use Qwen3-VL to generate natural-language descriptions of lighting conditions, which serve as text instructions for our training triplets.

Training

We adopt InstructPix2Pix on SD 1.5 and train for 250 epochs with AdamW at a learning rate of $1 \times 10^{-5}$ and a per-GPU batch size of 24 on 8× NVIDIA A100 80GB GPUs (about 5.5 hours).
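For a sense of scale, the settings above imply the following effective batch size and step count. This is back-of-the-envelope arithmetic that assumes exactly 10,000 training triplets, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these details are stated in the text.

```python
import math

TRAIN_IMAGES = 10_000  # approximate training split size from the text
PER_GPU_BATCH = 24
NUM_GPUS = 8
EPOCHS = 250


def training_budget(n_train=TRAIN_IMAGES, per_gpu=PER_GPU_BATCH,
                    gpus=NUM_GPUS, epochs=EPOCHS):
    """Return (effective batch size, steps per epoch, total optimizer steps)."""
    effective_batch = per_gpu * gpus
    # Assumes no gradient accumulation and that the last partial batch is kept.
    steps_per_epoch = math.ceil(n_train / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


print(training_budget())  # (192, 53, 13250)
```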

Results

Quantitative evaluation uses LPIPS (↓), SSIM (↑), CLIP text–image score (↑), and an ArcFace identity score (↑) on 1k held-out FFHQ test triplets, with the same degraded inputs and instructions for all models.
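The identity score can be read as a cosine similarity between face embeddings, aggregated as mean ± standard deviation over the test set. The sketch below assumes each model output is compared against the well-lit ground-truth image and that the reported spread is a population standard deviation; neither detail is confirmed by the text, and the embeddings are treated as precomputed ArcFace vectors.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def identity_scores(output_embs, target_embs):
    """Per-image identity score: similarity of each output to its target face."""
    return [cosine(o, t) for o, t in zip(output_embs, target_embs)]


def mean_std(values):
    """Aggregate per-image scores as (mean, population standard deviation)."""
    return statistics.fmean(values), statistics.pstdev(values)
```

Because every model is scored on the same degraded inputs and instructions, the per-image scores are paired across models and the aggregates in the table are directly comparable.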

Metric	SD 1.5	SDXL	FLUX.1-dev	Ours
LPIPS ↓	0.6346 ± 0.0901	0.6292 ± 0.0896	0.6504 ± 0.0787	0.3002 ± 0.0904
SSIM ↑	0.3802 ± 0.0951	0.4333 ± 0.1009	0.3726 ± 0.0974	0.5667 ± 0.1002
CLIP ↑	0.2601 ± 0.0280	0.2567 ± 0.0291	0.2520 ± 0.0303	0.2504 ± 0.0314
Identity ↑	0.0712 ± 0.0788	0.1088 ± 0.0980	0.0437 ± 0.0796	0.7591 ± 0.1823

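Takeaway: relative to the strongest baseline on each metric (SDXL in all three cases), our model roughly halves LPIPS (0.3002 vs. 0.6292), raises SSIM (0.5667 vs. 0.4333), and sharply improves identity preservation (0.7591 vs. 0.1088), while its CLIP score (0.2504) stays close to the baselines (0.2520 to 0.2601). This is consistent with editing lighting rather than re-synthesizing identity to chase the text prior.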

Qualitative comparisons

Figure 5: Qualitative comparison on our FFHQ test set. Given degraded inputs and lighting instructions, our model produces realistic relighting while preserving subject identity. All three baselines largely disregard the editing instruction and fail to maintain facial identity.

Figure 6: Out-of-distribution generalization. Our model generalizes to unseen faces with diverse lighting instructions, while baselines exhibit inconsistent illumination and fail to preserve facial identity.

Limitations & future work

The pipeline is optimized for portrait data (FFHQ). Promising directions include extending the engine to general scenes, adopting stronger backbones, and supporting spatially localized or multi-light instructions. We publish our code, data, and model weights to make that easier.

BibTeX

@misc{anand2026learning,
  title         = {Learning Illumination Control in Diffusion Models},
  author        = {Anand, Nishit and Suri, Manan and Metzler, Christopher and Manocha, Dinesh and Duraiswami, Ramani},
  year          = {2026},
  eprint        = {2604.24877},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2604.24877}
}