ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation

Seoul National University
ICCV 2025

Given a source image and a target prompt, ReFlex preserves the core information of the source image, including structure and background, while adapting high-level attributes in accordance with the target prompt. ReFlex is training-free, requires no user-specified mask, and can be used even in the absence of a source prompt, as demonstrated by the image translation examples.

Abstract

Rectified Flow (ReFlow) text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow to real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal diffusion transformer (MM-DiT) blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage a mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks against nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.
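
For intuition, below is a minimal sketch of the mid-step inversion the abstract describes, assuming a hypothetical `velocity_fn(x, t)` that wraps the MM-DiT velocity prediction and a simple Euler discretization of the rectified-flow ODE; the paper's actual solver and schedule may differ.

```python
import torch

@torch.no_grad()
def rf_invert(x0, velocity_fn, timesteps, mid_step):
    """Invert a clean latent x0 along the rectified-flow ODE (t: 0 -> 1).

    Returns both the mid-step latent (used for feature extraction)
    and the fully inverted latent (used as the initial noise).
    `velocity_fn(x, t)` is an assumed wrapper around the MM-DiT model.
    """
    x = x0
    x_mid = None
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        # Euler step of the flow ODE toward noise.
        x = x + (t_next - t) * velocity_fn(x, t)
        if i + 1 == mid_step:
            x_mid = x.clone()  # latent inverted only up to the mid-step
    return x_mid, x  # (mid-step latent, fully inverted latent)
```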

Analysis

Analysis of the MM-DiT block:
(a) Upper row: PCA results of I2I-SA and the attention maps of I2T-CA and T2I-CA for two words. I2I-SA encodes structural information, while I2T-CA and T2I-CA capture text-image relationships. Lower row: edited results generated by injecting the features indicated above each image. (b) PCA results of the residual and identity features, along with edited results generated by injecting these features from the MM-DiT blocks.
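
The three attention maps come from partitioning the single joint attention matrix that MM-DiT computes over the concatenated text and image tokens. A minimal sketch of that partition follows; the `[text; image]` token ordering is an assumption and may differ from the actual FLUX layout.

```python
import torch

def split_mmdit_attention(q, k, n_txt):
    """Partition MM-DiT joint attention into its cross- and self-attention blocks.

    q, k: (batch, heads, n_txt + n_img, head_dim) projections over the
    concatenated [text; image] token sequence (layout assumed here).
    Returns I2I-SA plus the two cross-attention blocks, I2T-CA and T2I-CA.
    """
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    i2i_sa = attn[..., n_txt:, n_txt:]  # image queries attending to image keys
    i2t_ca = attn[..., n_txt:, :n_txt]  # image queries attending to text keys
    t2i_ca = attn[..., :n_txt, n_txt:]  # text queries attending to image keys
    return i2i_sa, i2t_ca, t2i_ca
```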

Method

Overview of our method, ReFlex:
(a) We extract three key features (I2T-CA, I2I-SA, and the residual feature) from a mid-step latent, while the fully inverted latent serves as the initial noise for target image generation. To enhance text alignment, we propose two adaptation methods, one for each attention map: (b) I2T-CA and (c) I2I-SA; the residual feature is injected without modification. (d) For local editing, we generate an editing mask from the source I2T-CA. The injection process is applied only during the early timesteps of target image generation. A rough code sketch of steps (a) and (d) is given below.
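
As a rough illustration, the sketch below shows the two parts of this pipeline that are easiest to express in code: deriving a binary editing mask from the source I2T-CA, and restricting feature injection to the early timesteps of target generation. All names (`velocity_fn`, the `inject` flag, the thresholding scheme) are assumptions, not the paper's exact implementation.

```python
import torch

def editing_mask_from_i2t_ca(i2t_ca, token_idx, threshold=0.35):
    """Binary editing mask from the source I2T-CA map.

    i2t_ca: (heads, n_img, n_txt) attention of image tokens to text tokens.
    token_idx: index of the edited word; single-token aggregation and the
    threshold value are assumptions -- the paper's scheme may differ.
    """
    m = i2t_ca.mean(dim=0)[:, token_idx]             # average over heads
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)   # normalize to [0, 1]
    return (m > threshold).float()                   # 1 inside the edit region

@torch.no_grad()
def generate_with_injection(velocity_fn, z, timesteps, n_inject):
    """Denoise from the fully inverted latent; features extracted from the
    mid-step latent are injected (e.g., via forward hooks inside
    `velocity_fn`) only for the first `n_inject` timesteps."""
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        v = velocity_fn(z, t, inject=(i < n_inject))
        z = z + (t_next - t) * v  # Euler step of the flow ODE toward the image
    return z
```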

Comparisons

Qualitative results show that ReFlex outperforms both FLUX-based and SD-based approaches in accurately following the target prompt. On the PIE-Bench and Wild-TI2I-Real benchmarks, baseline methods fail at key edits, such as changing a heart to a circle or adjusting color details, while our method makes precise modifications with high image quality and structure preservation. Additional comparisons and diverse examples are provided below.

More Comparisons

More Results

Limitations

Limitations of our method.

(a) Incomplete preservation of source image details when the edited region overlaps with the subject.
(b) Limitations of editing mask generation.
(c) Variability in editing results depending on the random seed.