FaceStudio: Put Your Face Everywhere in Seconds

(* Equal contributions, ✦ Corresponding Author)
Capability demonstration of FaceStudio. Our model supports various applications and preserves the subject's identity in the synthesized images with high fidelity.


This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject's identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks: they require extensive resources and time for fine-tuning, as well as multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning and thereby enabling quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to support a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and its ability to preserve the subject's identity with high fidelity.


Overview of our model structure. We develop a hybrid-guidance identity-preserving image synthesis framework. Our model, built upon StableDiffusion, utilizes text prompts and reference human images to guide image synthesis while preserving human identity through an identity input.
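To illustrate the idea of hybrid guidance, here is a minimal numpy sketch that fuses a text token sequence with an identity embedding by appending it as an extra conditioning token that cross-attention can attend to. The function name, shapes, and scaling knob are hypothetical, not the paper's implementation:

```python
import numpy as np

def hybrid_condition(text_tokens, identity_emb, id_scale=1.0):
    """Fuse text and identity guidance into one conditioning sequence.

    text_tokens:  (T, d) token embeddings from a text encoder
    identity_emb: (d,)   identity embedding from a face encoder
    id_scale:     weight controlling the strength of identity guidance
    """
    id_token = (id_scale * identity_emb)[None, :]  # identity as one extra token
    # A diffusion U-Net's cross-attention layers can now attend to both sources.
    return np.concatenate([text_tokens, id_token], axis=0)
```

Concatenating along the sequence axis (rather than summing the embeddings) keeps the two guidance signals separable, so the network can weigh them independently at each layer.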
Comparison between standard cross-attention in single-identity modeling (a) and the advanced cross-attention tailored for multi-identity integration (b).
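The multi-identity cross-attention in (b) can be caricatured as restricting each identity's keys and values to that identity's spatial region, so identities do not bleed into each other. A minimal numpy sketch under that assumption (shapes and helper name are hypothetical):

```python
import numpy as np

def masked_cross_attention(queries, id_keys, id_values, region_masks):
    """Each identity attends only within its own spatial region.

    queries:      (N, d) latent features at N spatial positions
    id_keys:      list of (m_i, d) key embeddings, one array per identity
    id_values:    list of (m_i, d) value embeddings, one array per identity
    region_masks: list of (N,) boolean masks selecting each identity's region
    """
    d = queries.shape[1]
    out = np.zeros_like(queries)
    for keys, values, mask in zip(id_keys, id_values, region_masks):
        scores = queries[mask] @ keys.T / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out[mask] = weights @ values                 # convex combination of values
    return out
```

Because each region's output is a convex combination of only that identity's values, one subject's features cannot leak into another subject's region.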


Our model supports multi-human image synthesis. Its effectiveness is evident when compared with a variant that removes the proposed multi-human cross-attention mechanism.
Our model supports hybrid guidance. We combine textual prompts and reference images for image synthesis; the text prompts used here specify a cartoon style.
Our model supports identity mixing. We generate facial images that combine multiple identities using a mixing ratio to control the influence of different IDs.
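Identity mixing can be sketched as a convex combination of identity embeddings, with the mixing ratio controlling each ID's influence on the result. This is a minimal numpy illustration, not necessarily the paper's exact mechanism:

```python
import numpy as np

def mix_identities(id_embeddings, ratios):
    """Blend several identity embeddings with normalized mixing ratios.

    id_embeddings: list of (d,) identity embeddings
    ratios:        list of non-negative mixing weights, one per identity
    """
    ratios = np.asarray(ratios, dtype=float)
    ratios = ratios / ratios.sum()      # normalize so the weights sum to 1
    stacked = np.stack(id_embeddings)   # (num_ids, d)
    return ratios @ stacked             # weighted average, shape (d,)
```

Sliding one weight from 0 to 1 smoothly interpolates the generated face between the two source identities.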
Image-to-image synthesis with our proposed method. Our model preserves the identities of the humans and the layout of the raw images.
More qualitative results. Our model strikes a balance between stylistic expression and the need to maintain recognizable features of the subject.
Identity-preserving novel view synthesis. Our method excels at generating new views of a subject while maintaining its identity.
Comparison with state-of-the-art methods in identity-preserving text-to-image generation.
Comparison of face similarity and generation time for identity-preserving image generation. Our methods, guided by both texts and images, exhibit remarkable advantages compared to baseline approaches in terms of both face similarity and generation time. The variable N represents the number of reference images per identity.


      title={FaceStudio: Put Your Face Everywhere in Seconds}, 
      author={Yuxuan Yan and Chi Zhang and Rui Wang and Pei Cheng and Gang Yu and Bin Fu},