Abstract
Controlling illumination in images is essential for photography and visual content creation. While closed-source models have demonstrated impressive illumination control, open-source alternatives either require heavy control inputs like depth maps or do not release their data and code. We present a fully open-source and reproducible pipeline for learning illumination control in diffusion models.
Our approach builds a data engine that transforms well-lit images into supervised training triplets: a poorly illuminated input, a natural-language lighting instruction, and a well-illuminated output. We fine-tune a diffusion model on this data and report strong gains over SD 1.5, SDXL, and FLUX.1-dev img2img baselines in perceptual similarity, structural similarity, and identity preservation. We release code, data, and model weights for reproducibility.
Overview
We frame relighting as instruction-based image editing without paired real-world captures. Starting from high-quality portraits (FFHQ), we filter for consistent lighting, isolate the subject, estimate a lighting-neutral albedo, synthesize plausible poor lighting, and generate photographer-style lighting sentences with a vision–language model. The resulting triplets supervise InstructPix2Pix built on Stable Diffusion 1.5 (VAE and text encoder frozen; U-Net fine-tuned).
Data engine
Figure 1: Overview of our data engine. Starting from a well-illuminated image, the pipeline filters for lighting quality, segments the subject, extracts albedo, applies synthetic degradation, and generates text instructions describing the target illumination.
CLIP filtering
We score each image with CLIP ViT-B/32 against seven lighting prompts and keep images whose average similarity exceeds 0.21, yielding roughly 12k well-lit faces (10k train / 1k val / 1k test).
Figure 2: CLIP-based illumination filtering. Images scoring above our threshold of 0.21 (top row) exhibit clear, well-lit faces, while images below this threshold (bottom row) show poor illumination or occluded faces.
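The filtering rule can be sketched in a few lines. This is a minimal illustration, assuming the CLIP ViT-B/32 image and text embeddings have already been computed elsewhere; the function names and the use of cosine similarity averaged over prompts are our reading of the text, not the released implementation. Only the 0.21 threshold and the count of seven prompts come from the description above.

```python
import numpy as np

THRESHOLD = 0.21  # threshold reported in the text


def mean_prompt_similarity(image_embs: np.ndarray, prompt_embs: np.ndarray) -> np.ndarray:
    """Average cosine similarity of each image to all lighting prompts.

    image_embs:  (N, D) image embeddings (assumed precomputed with CLIP ViT-B/32)
    prompt_embs: (P, D) text embeddings of the lighting prompts (P = 7 in the text)
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = img @ txt.T        # (N, P) cosine similarities
    return sims.mean(axis=1)  # (N,) average over prompts


def keep_well_lit(image_embs: np.ndarray, prompt_embs: np.ndarray,
                  threshold: float = THRESHOLD) -> np.ndarray:
    """Boolean mask of images whose average similarity exceeds the threshold."""
    return mean_prompt_similarity(image_embs, prompt_embs) > threshold
```

Averaging over several prompts rather than taking the maximum makes the score less sensitive to any single phrasing, which is presumably why an average is used here.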
Pipeline stages
Figure 3: Data engine pipeline. Starting from a large image collection, we filter for well-illuminated images using CLIP, segment the subject with SAM 3, extract a lighting-neutral albedo via Multi-Scale Retinex, apply synthetic illumination degradation using depth-aware Lambertian shading, and then generate natural language lighting editing instructions with Qwen3-VL to complete each training triplet.
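The albedo and degradation stages can be illustrated with a NumPy-only sketch. The text names Multi-Scale Retinex and depth-aware Lambertian shading but gives no parameters, so the Gaussian scales, the ambient term, the normal-from-depth convention, and all function names below are assumptions made for illustration.

```python
import numpy as np


def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of a 2-D array with edge padding."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    # Filter along rows, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def multi_scale_retinex(gray: np.ndarray, sigmas=(2.0, 8.0, 32.0), eps: float = 1e-6) -> np.ndarray:
    """Mean over scales of log(I) - log(G_sigma * I), rescaled to [0, 1]."""
    log_img = np.log(gray + eps)
    msr = np.mean([log_img - np.log(gaussian_blur(gray, s) + eps) for s in sigmas], axis=0)
    return (msr - msr.min()) / (msr.max() - msr.min() + eps)


def lambertian_shading(depth: np.ndarray, light_dir, ambient: float = 0.1) -> np.ndarray:
    """Shading ambient + (1 - ambient) * max(0, n . l), normals from depth gradients."""
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    # Assumed convention: the surface z = depth(x, y) has normal (-dz/dx, -dz/dy, 1).
    normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(dz_dx)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    light = np.asarray(light_dir, dtype=np.float64)
    light = light / np.linalg.norm(light)
    return ambient + (1.0 - ambient) * np.clip(normals @ light, 0.0, 1.0)


def degrade(albedo: np.ndarray, depth: np.ndarray, light_dir, ambient: float = 0.1) -> np.ndarray:
    """Re-shade an albedo map (H, W) or (H, W, 3) under a single directional light."""
    shade = lambertian_shading(depth, light_dir, ambient)
    if albedo.ndim == 3:
        shade = shade[..., None]
    return np.clip(albedo * shade, 0.0, 1.0)
```

Under these assumptions, a light aligned with the surface normal leaves the albedo unchanged, while a grazing light dims it toward the ambient level, which is one way to synthesize a poorly lit input from a well-lit target.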
Lighting instructions
Figure 4: Editing instruction generation. We use Qwen3-VL to generate natural language descriptions of lighting conditions, which serve as text instructions for our training triplets.
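Once the instruction is generated, each training triplet is complete. One possible way to store a triplet is sketched below; the class name, field names, and JSONL layout are hypothetical and are not taken from the released data format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RelightTriplet:
    """One supervised example: degraded input, instruction, well-lit target."""
    input_path: str    # poorly illuminated image (synthetic degradation)
    instruction: str   # natural-language lighting instruction
    target_path: str   # original well-illuminated image


def to_jsonl_line(triplet: RelightTriplet) -> str:
    """Serialize a triplet as one JSON line."""
    return json.dumps(asdict(triplet), ensure_ascii=False)


def from_jsonl_line(line: str) -> RelightTriplet:
    """Parse a triplet back from one JSON line."""
    return RelightTriplet(**json.loads(line))
```

Keeping paths rather than pixel data in each record lets the same degraded input and instruction be fed to every model under evaluation.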
Training
We adopt InstructPix2Pix on SD 1.5 and train for 250 epochs with AdamW at a per-GPU batch size of 24 on 8× NVIDIA A100 80GB GPUs (~5.5 hours).
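The training budget implied by these numbers can be worked out directly. The sketch below assumes no gradient accumulation and takes the roughly 10k-image training split from the CLIP filtering step as exactly 10,000; both are assumptions, as is the choice of whether the last partial batch is dropped.

```python
import math

PER_GPU_BATCH = 24     # from the text
NUM_GPUS = 8           # from the text
EPOCHS = 250           # from the text
TRAIN_IMAGES = 10_000  # approximate ("10k train"); treated as exact here


def effective_batch_size(per_gpu: int = PER_GPU_BATCH, gpus: int = NUM_GPUS,
                         grad_accum: int = 1) -> int:
    """Images consumed per optimizer step across all GPUs."""
    return per_gpu * gpus * grad_accum


def total_steps(n_images: int = TRAIN_IMAGES, epochs: int = EPOCHS,
                drop_last: bool = False) -> int:
    """Optimizer steps over the whole run."""
    batch = effective_batch_size()
    per_epoch = n_images // batch if drop_last else math.ceil(n_images / batch)
    return per_epoch * epochs


print(effective_batch_size())  # 192
print(total_steps())           # 13250
```

Under these assumptions the run amounts to about 13k optimizer steps at 192 images per step.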
Results
Quantitative evaluation uses LPIPS (↓), SSIM (↑), CLIP text–image score (↑), and an ArcFace identity score (↑) on 1k held-out FFHQ test triplets, with the same degraded inputs and instructions for all models.
| Metric | SD 1.5 | SDXL | FLUX.1-dev | Ours |
|---|---|---|---|---|
| LPIPS ↓ | 0.6346 ± 0.0901 | 0.6292 ± 0.0896 | 0.6504 ± 0.0787 | 0.3002 ± 0.0904 |
| SSIM ↑ | 0.3802 ± 0.0951 | 0.4333 ± 0.1009 | 0.3726 ± 0.0974 | 0.5667 ± 0.1002 |
| CLIP ↑ | 0.2601 ± 0.0280 | 0.2567 ± 0.0291 | 0.2520 ± 0.0303 | 0.2504 ± 0.0314 |
| Identity ↑ | 0.0712 ± 0.0788 | 0.1088 ± 0.0980 | 0.0437 ± 0.0796 | 0.7591 ± 0.1823 |
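Takeaway: relative to the strongest baseline on each metric, our model roughly halves LPIPS (0.3002 vs. 0.6292), raises SSIM (0.5667 vs. 0.4333), and lifts the identity score from 0.1088 to 0.7591. Its CLIP score (0.2504) is slightly below the baselines (0.2520 to 0.2601) but within one standard deviation of each, which is consistent with editing lighting rather than re-synthesizing identity to chase the text prior.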
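To make the comparison concrete, the relative change against the strongest baseline per metric can be computed from the table means. This is a small bookkeeping sketch; the helper names are ours, and the standard deviations are ignored.

```python
# Mean values copied from the results table (standard deviations omitted).
RESULTS = {
    "LPIPS":    {"SD 1.5": 0.6346, "SDXL": 0.6292, "FLUX.1-dev": 0.6504, "Ours": 0.3002},
    "SSIM":     {"SD 1.5": 0.3802, "SDXL": 0.4333, "FLUX.1-dev": 0.3726, "Ours": 0.5667},
    "CLIP":     {"SD 1.5": 0.2601, "SDXL": 0.2567, "FLUX.1-dev": 0.2520, "Ours": 0.2504},
    "Identity": {"SD 1.5": 0.0712, "SDXL": 0.1088, "FLUX.1-dev": 0.0437, "Ours": 0.7591},
}
LOWER_IS_BETTER = {"LPIPS"}


def best_baseline(metric: str) -> tuple[str, float]:
    """Name and score of the strongest baseline for a metric."""
    scores = {k: v for k, v in RESULTS[metric].items() if k != "Ours"}
    pick = min if metric in LOWER_IS_BETTER else max
    name = pick(scores, key=scores.get)
    return name, scores[name]


def relative_change(metric: str) -> float:
    """(ours - best baseline) / best baseline; negative is better for LPIPS."""
    _, base = best_baseline(metric)
    return (RESULTS[metric]["Ours"] - base) / base


for metric in RESULTS:
    name, _ = best_baseline(metric)
    print(f"{metric:8s} vs {name:10s}: {relative_change(metric):+.1%}")
```

Comparing against the best baseline per metric, rather than a single fixed model, gives the most conservative view of the gains.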
Qualitative comparisons
Figure 5: Qualitative comparison on our FFHQ test set. Given degraded inputs and lighting instructions, our model produces realistic relighting while preserving subject identity. All three baselines largely disregard the editing instruction and fail to maintain facial identity.
Figure 6: Out-of-distribution generalization. Our model generalizes to unseen faces with diverse lighting instructions, while baselines exhibit inconsistent illumination and fail to preserve facial identity.
Limitations & future work
The pipeline is optimized for portrait data (FFHQ). Promising directions include extending the engine to general scenes, adopting stronger backbones, and supporting spatially localized or multi-light instructions; we publish our artifacts to make that easier.
BibTeX
@misc{anand2026learning,
  title         = {Learning Illumination Control in Diffusion Models},
  author        = {Anand, Nishit and Suri, Manan and Metzler, Christopher and Manocha, Dinesh and Duraiswami, Ramani},
  year          = {2026},
  eprint        = {2604.24877},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2604.24877}
}