VT-DUDA: Visual Token Conditioning for Diffusion-guided Unsupervised Domain Adaptation

1 AI for Good (AIGO), Istituto Italiano di Tecnologia
2 University of Genoa
3 University of Verona
Transactions on Machine Learning Research (TMLR)

VT-DUDA enriches diffusion conditioning with instance-level visual tokens for unsupervised domain adaptation, so that the synthesized labeled target-style data is more useful for downstream training.

Abstract

Unsupervised domain adaptation (UDA) aims to learn a target-domain classifier from labeled source data and unlabeled target data under distribution shift. Recent diffusion-based UDA methods approach this problem by synthesizing labeled target-style images and training on the resulting synthetic data. However, their performance depends heavily on the conditioning design: class prompts provide only coarse guidance, while domain adaptation modules mainly control appearance, which may leave target-style synthesis insufficiently specified. We propose VT-DUDA, a visual-token conditioning framework for diffusion-guided UDA. Instead of relying only on text prompts, VT-DUDA uses source images to provide additional instance-level visual context for target-style synthesis. Specifically, VT-DUDA maps each source image to a compact sequence of visual tokens and forms a hybrid conditioning context by concatenating these tokens with the corresponding text embeddings along the cross-attention context dimension of a latent diffusion model. This provides instance-dependent conditioning beyond text alone, while synthesis is performed with the target-domain adapter branch. Because guidance is represented explicitly as a token sequence, the same interface also permits inference-time manipulation of the conditioning signal through token selection and token-strength adjustment. The proposed method preserves the standard diffusion objective and can be integrated into existing adapter-based diffusion frameworks without modifying the backbone. Across Office-31, Office-Home, and VisDA-2017, VT-DUDA improves average target-domain accuracy over strong discriminative and diffusion-based UDA baselines. The results suggest that, in generation-based UDA, a stronger conditioning interface can improve the downstream usefulness of synthetic target-style data.

Method Overview

VT-DUDA introduces a visual-token conditioning interface for diffusion-guided UDA. Each source image is mapped into a compact sequence of visual tokens. These visual tokens are concatenated with text embeddings and injected through the standard cross-attention interface of a latent diffusion model.
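The concatenation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the shapes (77 text positions, 4 visual tokens, width 768), the single attention head, and the names `build_hybrid_context` and `cross_attention` are assumptions chosen for illustration only.

```python
import numpy as np


def build_hybrid_context(text_emb, visual_tokens):
    """Stack text embeddings and visual tokens along the context (sequence) axis."""
    if text_emb.shape[-1] != visual_tokens.shape[-1]:
        raise ValueError("text embeddings and visual tokens must share the same width")
    return np.concatenate([text_emb, visual_tokens], axis=0)


def cross_attention(queries, context, w_k, w_v):
    """Single-head scaled dot-product cross-attention over the hybrid context."""
    keys = context @ w_k
    values = context @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(77, 768))      # assumed shape of a prompt embedding
visual_tokens = rng.normal(size=(4, 768))  # assumed: 4 visual tokens per source image
context = build_hybrid_context(text_emb, visual_tokens)  # shape (81, 768)

d_head = 64
queries = rng.normal(size=(16, d_head))    # stand-in for projected latent features
w_k = rng.normal(size=(768, d_head)) / np.sqrt(768)
w_v = rng.normal(size=(768, d_head)) / np.sqrt(768)
out, weights = cross_attention(queries, context, w_k, w_v)
```

Because the visual tokens only lengthen the key/value sequence, the attention layer itself is unchanged, which is consistent with the statement that the standard cross-attention interface is reused.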

During training, source samples are denoised using class prompts and source-image tokens, while target samples are denoised using a generic target prompt and target-image tokens. During generation, the model reuses the same token-augmented cross-attention format, pairing source class prompts with source-image tokens and synthesizing target-style images under the target-domain adapter branch.
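The pairing of prompts, token sources, and adapter branches described above can be summarized as a small lookup. This is only a sketch of the protocol as stated in the text: the prompt wording, the `Conditioning` type, and `conditioning_for` are hypothetical, and the use of a source adapter branch for source training samples is inferred from the mention of domain-specific adapters rather than stated explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Conditioning:
    prompt: str        # text passed to the text encoder
    token_source: str  # domain of the image that supplies the visual tokens
    adapter: str       # domain-specific adapter branch that is active


# Hypothetical wording: the source only says "class prompts" and "a generic target prompt".
GENERIC_TARGET_PROMPT = "an image"


def class_prompt(class_name: str) -> str:
    return f"a photo of a {class_name}"


def conditioning_for(phase: str, domain: Optional[str] = None,
                     class_name: Optional[str] = None) -> Conditioning:
    """Return the conditioning setup for a training sample or a generation call."""
    if phase == "train" and domain == "source":
        # Labeled source sample: class prompt + tokens from the same source image.
        return Conditioning(class_prompt(class_name), "source", "source")
    if phase == "train" and domain == "target":
        # Unlabeled target sample: generic prompt + tokens from the target image.
        return Conditioning(GENERIC_TARGET_PROMPT, "target", "target")
    if phase == "generate":
        # Synthesis: source class prompt + source-image tokens, target adapter branch.
        return Conditioning(class_prompt(class_name), "source", "target")
    raise ValueError(f"unsupported combination: phase={phase!r}, domain={domain!r}")


gen = conditioning_for("generate", class_name="backpack")
```

The generation case is the only one that mixes domains: content cues come from the source side, while the active adapter branch is the target one.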

Figure 1. Overview of VT-DUDA. We jointly train domain-specific diffusion adapters and an image-to-token encoder on source and target images.

Key Contributions

  • We propose VT-DUDA, a visual-token conditioning framework for diffusion-guided unsupervised domain adaptation.
  • VT-DUDA augments the standard cross-attention interface of an adapter-based latent diffusion model with instance-dependent visual tokens, enriching text conditioning without modifying the diffusion backbone, VAE, or denoising objective.
  • The framework supports both pure-noise target-style synthesis and DDIM-inversion-based target-style translation through the same token-conditioned interface.
  • Experiments on Office-31, Office-Home, and VisDA-2017 show improved average target-domain accuracy over strong discriminative and diffusion-based UDA baselines.
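For reference, the unchanged denoising objective and the inversion step mentioned in the contributions can be written in standard latent-diffusion notation. The symbols below ($c_{\text{text}}$, $c_{\text{vis}}$, $z_t$, $\bar{\alpha}_t$, $\epsilon_\theta$) are the usual ones from the diffusion literature and are assumptions here, not notation taken from the paper; which conditioning is used during inversion is likewise not specified on this page.

```latex
\begin{align*}
% Hybrid context: text embeddings and visual tokens stacked along the context axis
c &= [\, c_{\text{text}} \,;\, c_{\text{vis}} \,] \\[4pt]
% Standard noise-prediction objective, now conditioned on the hybrid context
\mathcal{L} &= \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
  \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right] \\[4pt]
% One deterministic DDIM inversion step (clean-latent estimate, then re-noising)
\hat{z}_0 &= \frac{z_t - \sqrt{1 - \bar{\alpha}_t}\; \epsilon_\theta(z_t, t, c)}{\sqrt{\bar{\alpha}_t}} \\
z_{t+1} &= \sqrt{\bar{\alpha}_{t+1}}\; \hat{z}_0 + \sqrt{1 - \bar{\alpha}_{t+1}}\; \epsilon_\theta(z_t, t, c)
\end{align*}
```

Pure-noise synthesis starts the reverse process from $z_T \sim \mathcal{N}(0, I)$, whereas inversion-based translation would first map an encoded source image to a noisy latent with the last two equations and then denoise it under the target-domain adapter branch.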

Main Results

The full VT-DUDA configuration combines pure-noise target-style synthesis with inversion-based translation. Instantiated on top of MCC and ELS, it achieves the strongest average performance on Office-Home, Office-31, and VisDA-2017, improving the corresponding base method by roughly 1.8 to 5.0 points.

Method            Office-Home Avg    Office-31 Avg    VisDA-2017 Mean
MCC               72.24              89.61            83.32
ELS               71.84              90.21            83.40
MCC + VT-DUDA     76.25              91.53            87.48
ELS + VT-DUDA     76.74              92.04            88.42

Average transfer accuracies (%) on the main UDA benchmarks.
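The improvement over each base method can be read off the table with a few lines of Python. The numbers are copied from the table above; the dictionary layout and the helper `gains` are illustrative only.

```python
# Average accuracies (%) copied from the main results table.
results = {
    "MCC":           {"Office-Home": 72.24, "Office-31": 89.61, "VisDA-2017": 83.32},
    "ELS":           {"Office-Home": 71.84, "Office-31": 90.21, "VisDA-2017": 83.40},
    "MCC + VT-DUDA": {"Office-Home": 76.25, "Office-31": 91.53, "VisDA-2017": 87.48},
    "ELS + VT-DUDA": {"Office-Home": 76.74, "Office-31": 92.04, "VisDA-2017": 88.42},
}


def gains(base: str) -> dict:
    """Absolute accuracy gain (percentage points) of '<base> + VT-DUDA' over '<base>'."""
    improved = results[f"{base} + VT-DUDA"]
    return {bench: round(improved[bench] - acc, 2) for bench, acc in results[base].items()}


mcc_gains = gains("MCC")  # {'Office-Home': 4.01, 'Office-31': 1.92, 'VisDA-2017': 4.16}
els_gains = gains("ELS")  # {'Office-Home': 4.9, 'Office-31': 1.83, 'VisDA-2017': 5.02}
```

The gains are smallest on Office-31, where both base methods already sit near 90%.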

Additional Visual Results

Figure 2. Token-strength scaling for target-style synthesis under Art → Clipart.

Figure 3. Token-subset manipulation under translation-based augmentation for Real-World → Clipart.
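One plausible reading of the two manipulations shown in Figures 2 and 3 is sketched below: token strength as a scalar multiplier on the visual tokens, and token-subset selection as index-based filtering before concatenation. Both interpretations, the shapes, and the names `manipulate_tokens` and `hybrid_context` are assumptions; the page does not specify how the paper implements either operation.

```python
import numpy as np


def manipulate_tokens(visual_tokens, strength=1.0, keep=None):
    """Optionally keep a subset of visual tokens, then rescale them by `strength`."""
    tokens = np.asarray(visual_tokens, dtype=float)
    if keep is not None:
        tokens = tokens[list(keep)]  # token-subset selection (assumed: by index)
    return strength * tokens         # token-strength adjustment (assumed: scalar scale)


def hybrid_context(text_emb, visual_tokens, strength=1.0, keep=None):
    """Concatenate text embeddings with the manipulated visual tokens."""
    tokens = manipulate_tokens(visual_tokens, strength, keep)
    return np.concatenate([text_emb, tokens], axis=0)


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(77, 768))      # assumed prompt-embedding shape
visual_tokens = rng.normal(size=(4, 768))  # assumed: 4 visual tokens per image

full = hybrid_context(text_emb, visual_tokens)                  # all tokens, unit strength
weak = hybrid_context(text_emb, visual_tokens, strength=0.5)    # weaker visual guidance
subset = hybrid_context(text_emb, visual_tokens, keep=[0, 2])   # only tokens 0 and 2
```

Since the text embeddings are left untouched in every variant, the class information in the prompt is preserved while the instance-level guidance is varied at inference time.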

Inversion-free Protocol

VT-DUDA also remains effective under an inversion-free setting, where the synthetic labeled training set is constructed only from pure-noise target-style generation. In this protocol, VT-DUDA uses 50 generated images per class on Office-31 and Office-Home, and 1000 generated images per class on VisDA-2017.
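The per-class budgets translate into the following synthetic set sizes. The class counts (31, 65, and 12) are the standard ones for these benchmarks and are not stated on this page; the totals also assume one generation run per adaptation task.

```python
# Standard class counts for each benchmark (not stated in the text above).
NUM_CLASSES = {"Office-31": 31, "Office-Home": 65, "VisDA-2017": 12}

# Generated images per class in the inversion-free protocol (from the text above).
IMAGES_PER_CLASS = {"Office-31": 50, "Office-Home": 50, "VisDA-2017": 1000}


def synthetic_set_size(benchmark: str) -> int:
    """Total number of generated labeled images for one adaptation task."""
    return NUM_CLASSES[benchmark] * IMAGES_PER_CLASS[benchmark]


sizes = {name: synthetic_set_size(name) for name in NUM_CLASSES}
# {'Office-31': 1550, 'Office-Home': 3250, 'VisDA-2017': 12000}
```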

Method            Office-Home Avg    Office-31 Avg    VisDA-2017 Mean
MCC + VT-DUDA     73.50              91.37            86.92
ELS + VT-DUDA     73.72              91.92            87.56

Average transfer accuracies (%) under the inversion-free protocol.

BibTeX

@article{qi2026vtduda,
  title={VT-DUDA: Visual Token Conditioning for Diffusion-guided Unsupervised Domain Adaptation},
  author={Qi, Xuan and Berardini, Daniele and Serez, Dario and Pastore, Vito Paolo and Murino, Vittorio},
  journal={Transactions on Machine Learning Research},
  year={2026},
  url={https://openreview.net/forum?id=Y956680PCe}
}