VT-DUDA: Visual Token Conditioning for Diffusion-guided Unsupervised Domain Adaptation

Qi, Xuan; Berardini, Daniele; Serez, Dario; Pastore, Vito Paolo; Murino, Vittorio

Xuan Qi¹,², Daniele Berardini¹, Dario Serez¹, Vito Paolo Pastore¹,², Vittorio Murino¹,³

¹AI for Good (AIGO), Istituto Italiano di Tecnologia
²University of Genoa
³University of Verona

Transactions on Machine Learning Research (TMLR)

VT-DUDA uses instance-level visual tokens to enrich diffusion conditioning for unsupervised domain adaptation, enabling the synthesis of more useful labeled target-style data.

Abstract

Unsupervised domain adaptation (UDA) aims to learn a target-domain classifier from labeled source data and unlabeled target data under distribution shift. Recent diffusion-based UDA methods approach this problem by synthesizing labeled target-style images and training on the resulting synthetic data. However, their performance depends heavily on the conditioning design: class prompts provide only coarse guidance, while domain adaptation modules mainly control appearance, which may leave target-style synthesis insufficiently specified. We propose VT-DUDA, a visual-token conditioning framework for diffusion-guided UDA. Instead of relying only on text prompts, VT-DUDA uses source images to provide additional instance-level visual context for target-style synthesis. Specifically, VT-DUDA maps each source image to a compact sequence of visual tokens and forms a hybrid conditioning context by concatenating these tokens with the corresponding text embeddings along the cross-attention context dimension of a latent diffusion model. This provides instance-dependent conditioning beyond text alone, while synthesis is performed with the target-domain adapter branch. Because guidance is represented explicitly as a token sequence, the same interface also permits inference-time manipulation of the conditioning signal through token selection and token-strength adjustment. The proposed method preserves the standard diffusion objective and can be integrated into existing adapter-based diffusion frameworks without modifying the backbone. Across Office-31, Office-Home, and VisDA-2017, VT-DUDA improves average target-domain accuracy over strong discriminative and diffusion-based UDA baselines. The results suggest that, in generation-based UDA, a stronger conditioning interface can improve the downstream usefulness of synthetic target-style data.

Method Overview

VT-DUDA introduces a visual-token conditioning interface for diffusion-guided UDA. Each source image is mapped into a compact sequence of visual tokens. These visual tokens are concatenated with text embeddings and injected through the standard cross-attention interface of a latent diffusion model.
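As a minimal sketch of the interface described above, the snippet below concatenates toy text embeddings and visual tokens along the context (sequence) axis and runs one softmax attention query over the result. The sizes, the text-then-visual ordering, the names `build_hybrid_context` and `attention_weights`, and the omission of learned key/value projections are assumptions made for illustration, not details taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def build_hybrid_context(text_embeddings: List[Vector],
                         visual_tokens: List[Vector]) -> List[Vector]:
    """Concatenate along the context-length axis; embedding widths must match."""
    widths = {len(v) for v in text_embeddings + visual_tokens}
    if len(widths) != 1:
        raise ValueError("text embeddings and visual tokens need the same width")
    # Assumed ordering: text first, then visual tokens.
    return text_embeddings + visual_tokens


def attention_weights(query: Vector, context: List[Vector]) -> List[float]:
    """Scaled dot-product attention weights of one query over the context.

    Learned query/key projections are left out to keep the sketch short.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in context]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sizes (assumed): 77 text positions, 4 visual tokens, width 8.
text = [[0.0] * 8 for _ in range(77)]
visual = [[1.0] * 8 for _ in range(4)]
context = build_hybrid_context(text, visual)
weights = attention_weights([0.5] * 8, context)
print(len(context), len(weights))
```

Because the softmax runs over whatever context length it is given, a longer context only changes the number of attention weights, which is consistent with the claim that the backbone itself does not need to change.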

During training, source samples are denoised using class prompts and source-image tokens, while target samples are denoised using a generic target prompt and target-image tokens. During generation, the model reuses the same token-augmented cross-attention format, pairing source class prompts with source-image tokens and synthesizing target-style images under the target-domain adapter branch.
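The pairing of prompts, token sources, and adapter branches in the paragraph above can be summarized in a small lookup sketch. The prompt wording, the `Conditioning` record, and the assumption that source samples are trained under a source adapter branch are hypothetical; only the pairing logic follows the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Conditioning:
    prompt: str        # text side of the hybrid context
    token_source: str  # domain of the image that supplies visual tokens
    adapter: str       # adapter branch used for denoising


GENERIC_TARGET_PROMPT = "a photo of an object"  # hypothetical wording


def class_prompt(class_name: str) -> str:
    return f"a photo of a {class_name}"  # hypothetical template


def training_conditioning(domain: str,
                          class_name: Optional[str] = None) -> Conditioning:
    if domain == "source":
        if class_name is None:
            raise ValueError("source samples are labeled and need a class name")
        # Assumption: source samples use the source adapter branch.
        return Conditioning(class_prompt(class_name), "source", "source")
    if domain == "target":
        # Target samples are unlabeled, so the prompt carries no class.
        return Conditioning(GENERIC_TARGET_PROMPT, "target", "target")
    raise ValueError(f"unknown domain: {domain}")


def generation_conditioning(class_name: str) -> Conditioning:
    # Source class prompt + source-image tokens, decoded by the target branch.
    return Conditioning(class_prompt(class_name), "source", "target")


print(training_conditioning("source", "bike"))
print(training_conditioning("target"))
print(generation_conditioning("bike"))
```

The only combination that never occurs during training is the one used at generation time: labeled source conditioning routed through the target adapter branch.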

Figure 1. Overview of VT-DUDA. We jointly train domain-specific diffusion adapters and an image-to-token encoder on source and target images.

Key Contributions

We propose VT-DUDA, a visual-token conditioning framework for diffusion-guided unsupervised domain adaptation.
VT-DUDA augments the standard cross-attention interface of an adapter-based latent diffusion model with instance-dependent visual tokens, enriching text conditioning without modifying the diffusion backbone, VAE, or denoising objective.
The framework supports both pure-noise target-style synthesis and DDIM-inversion-based target-style translation through the same token-conditioned interface.
Experiments on Office-31, Office-Home, and VisDA-2017 show improved average target-domain accuracy over strong discriminative and diffusion-based UDA baselines.
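To make the difference between the two starting points concrete, here is a scalar toy sketch of deterministic DDIM (eta = 0) updates. Pure-noise synthesis would start from a random latent, while inversion-based translation first walks a real latent toward noise and then denoises it. The schedule values, the constant noise predictor `toy_eps`, and the helper names are assumptions; in practice the predictor would be the token-conditioned diffusion model, and inversion is then only approximately invertible.

```python
import math
from typing import Callable, List


def ddim_move(x: float, eps: float, abar_from: float, abar_to: float) -> float:
    """One deterministic DDIM update between two noise levels.

    abar_* are cumulative alpha products; moving to a smaller abar adds noise
    (inversion), moving to a larger abar removes it (sampling).
    """
    x0_pred = (x - math.sqrt(1.0 - abar_from) * eps) / math.sqrt(abar_from)
    return math.sqrt(abar_to) * x0_pred + math.sqrt(1.0 - abar_to) * eps


def run(x: float, schedule: List[float],
        eps_fn: Callable[[float, float], float]) -> float:
    """Apply ddim_move along consecutive pairs of the schedule."""
    for abar_from, abar_to in zip(schedule[:-1], schedule[1:]):
        x = ddim_move(x, eps_fn(x, abar_from), abar_from, abar_to)
    return x


abars = [0.999, 0.9, 0.6, 0.3, 0.05]  # toy schedule, ordered clean -> noisy


def toy_eps(x: float, abar: float) -> float:
    return 0.25  # stand-in for a learned noise predictor


latent = 1.5
inverted = run(latent, abars, toy_eps)          # inversion: clean -> noisy
restored = run(inverted, abars[::-1], toy_eps)  # sampling: noisy -> clean
print(inverted, restored)
```

With a constant predictor the round trip is exact up to floating-point error. Changing the conditioning between the two passes is what would turn this reconstruction into a translation.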

Main Results

The full VT-DUDA configuration combines pure-noise target-style synthesis with inversion-based translation. Built on top of either MCC or ELS, it achieves the strongest average performance on Office-Home, Office-31, and VisDA-2017, with ELS + VT-DUDA giving the best numbers on all three (76.74%, 92.04%, and 88.42%, respectively).

Method	Office-Home Avg	Office-31 Avg	VisDA-2017 Mean
MCC	72.24	89.61	83.32
ELS	71.84	90.21	83.40
MCC + VT-DUDA	76.25	91.53	87.48
ELS + VT-DUDA	76.74	92.04	88.42

Average transfer accuracies (%) on the main UDA benchmarks.
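For convenience, the gains over each base method can be read off the table with a few lines of arithmetic. The dictionary layout and variable names are illustrative; the numbers are copied from the table above.

```python
benchmarks = ("Office-Home", "Office-31", "VisDA-2017")

baselines = {
    "MCC": (72.24, 89.61, 83.32),
    "ELS": (71.84, 90.21, 83.40),
}
with_vt_duda = {
    "MCC": (76.25, 91.53, 87.48),
    "ELS": (76.74, 92.04, 88.42),
}

# Absolute gain in percentage points, rounded to two decimals.
gains = {
    method: tuple(round(new - old, 2)
                  for new, old in zip(with_vt_duda[method], baselines[method]))
    for method in baselines
}

for method, deltas in gains.items():
    for name, delta in zip(benchmarks, deltas):
        print(f"{method} + VT-DUDA on {name}: +{delta:.2f} points")
```

The gains are largest on Office-Home and VisDA-2017 (roughly 4 to 5 points) and smaller on Office-31 (just under 2 points), where the baselines are already near 90%.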

Additional Visual Results

Figure 2. Token-strength scaling for target-style synthesis under Art → Clipart.

Figure 3. Token-subset manipulation under translation-based augmentation for Real-World → Clipart.
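The two manipulations named in the figure captions can be sketched directly on a token list. This assumes that strength is applied by multiplying the token embeddings and that a subset is chosen by index; the page does not say how either is implemented, so both helpers are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def scale_tokens(tokens: List[Vector], strength: float) -> List[Vector]:
    """Multiply every visual token by a scalar strength (assumed mechanism)."""
    return [[strength * value for value in token] for token in tokens]


def select_tokens(tokens: List[Vector], keep: Sequence[int]) -> List[Vector]:
    """Keep only the visual tokens at the given indices, in the given order."""
    return [tokens[i] for i in keep]


visual_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

weaker = scale_tokens(visual_tokens, 0.5)     # softer visual guidance
subset = select_tokens(visual_tokens, [0, 2])  # drop tokens 1 and 3
text_only = scale_tokens(visual_tokens, 0.0)  # visual guidance switched off

print(weaker[0], len(subset), text_only[0])
```

Both operations act only on the conditioning sequence at inference time, so no retraining is implied by this sketch.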

Inversion-free Protocol

VT-DUDA also remains effective under an inversion-free setting, where the synthetic labeled training set is constructed only from pure-noise target-style generation. In this protocol, VT-DUDA uses 50 generated images per class on Office-31 and Office-Home, and 1000 generated images per class on VisDA-2017.
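The per-class budgets translate into the following synthetic set sizes. The class counts (31, 65, and 12) come from the standard benchmark definitions rather than from this page, and the totals assume one synthetic set per adaptation task.

```python
# Number of classes in each benchmark (standard dataset definitions).
num_classes = {"Office-31": 31, "Office-Home": 65, "VisDA-2017": 12}

# Generated images per class in the inversion-free protocol.
images_per_class = {"Office-31": 50, "Office-Home": 50, "VisDA-2017": 1000}

synthetic_set_size = {
    name: num_classes[name] * images_per_class[name] for name in num_classes
}

for name, size in synthetic_set_size.items():
    print(f"{name}: {size} synthetic images per task")
```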

Method	Office-Home Avg	Office-31 Avg	VisDA-2017 Mean
MCC + VT-DUDA	73.50	91.37	86.92
ELS + VT-DUDA	73.72	91.92	87.56
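A quick check of the inversion-free numbers against the baselines and the full configuration, using values copied from the two tables on this page. The variable names are illustrative.

```python
benchmarks = ("Office-Home", "Office-31", "VisDA-2017")

baseline = {"MCC": (72.24, 89.61, 83.32), "ELS": (71.84, 90.21, 83.40)}
inversion_free = {"MCC": (73.50, 91.37, 86.92), "ELS": (73.72, 91.92, 87.56)}
full = {"MCC": (76.25, 91.53, 87.48), "ELS": (76.74, 92.04, 88.42)}

# Does the inversion-free variant beat its base method on every benchmark?
beats_baseline = all(
    free > base
    for method in baseline
    for free, base in zip(inversion_free[method], baseline[method])
)

# Points added by also using inversion-based translation.
translation_gain = {
    method: tuple(round(f - i, 2)
                  for f, i in zip(full[method], inversion_free[method]))
    for method in full
}

print(beats_baseline)
for method, deltas in translation_gain.items():
    print(method, dict(zip(benchmarks, deltas)))
```

The inversion-free variant stays above both baselines on every benchmark. Adding inversion-based translation contributes most on Office-Home (about 3 points) and under 1 point on Office-31 and VisDA-2017.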

BibTeX

@article{qi2026vtduda,
  title={VT-DUDA: Visual Token Conditioning for Diffusion-guided Unsupervised Domain Adaptation},
  author={Qi, Xuan and Berardini, Daniele and Serez, Dario and Pastore, Vito Paolo and Murino, Vittorio},
  journal={Transactions on Machine Learning Research},
  year={2026},
  url={https://openreview.net/forum?id=Y956680PCe}
}