VT-DUDA: Visual Token Conditioning for Diffusion-guided Unsupervised Domain Adaptation
Abstract
Unsupervised domain adaptation (UDA) aims to learn a target-domain classifier from labeled source data and unlabeled target data under distribution shift. Recent diffusion-based UDA methods approach this problem by synthesizing labeled target-style images and training on the resulting synthetic data. However, their performance depends heavily on the conditioning design: class prompts provide only coarse guidance, while domain adaptation modules mainly control appearance, which may leave target-style synthesis insufficiently specified. We propose VT-DUDA, a visual-token conditioning framework for diffusion-guided UDA. Instead of relying only on text prompts, VT-DUDA uses source images to provide additional instance-level visual context for target-style synthesis. Specifically, VT-DUDA maps each source image to a compact sequence of visual tokens and forms a hybrid conditioning context by concatenating these tokens with the corresponding text embeddings along the cross-attention context dimension of a latent diffusion model. This provides instance-dependent conditioning beyond text alone, while synthesis is performed with the target-domain adapter branch. Because guidance is represented explicitly as a token sequence, the same interface also permits inference-time manipulation of the conditioning signal through token selection and token-strength adjustment. The proposed method preserves the standard diffusion objective and can be integrated into existing adapter-based diffusion frameworks without modifying the backbone. Across Office-31, Office-Home, and VisDA-2017, VT-DUDA improves average target-domain accuracy over strong discriminative and diffusion-based UDA baselines. The results suggest that, in generation-based UDA, a stronger conditioning interface can improve the downstream usefulness of synthetic target-style data.
Method Overview
VT-DUDA introduces a visual-token conditioning interface for diffusion-guided UDA. Each source image is mapped into a compact sequence of visual tokens. These visual tokens are concatenated with text embeddings and injected through the standard cross-attention interface of a latent diffusion model.
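The concatenation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the shapes (77 text tokens, 8 visual tokens, width 64), the function names `build_hybrid_context` and `cross_attention`, and the identity key/value projections are all assumptions made for the example.

```python
import numpy as np


def build_hybrid_context(text_emb, visual_tokens):
    """Concatenate text embeddings and visual tokens along the context axis.

    text_emb:      (batch, n_text, dim)
    visual_tokens: (batch, n_visual, dim)
    returns:       (batch, n_text + n_visual, dim)
    """
    if text_emb.shape[0] != visual_tokens.shape[0]:
        raise ValueError("batch sizes differ")
    if text_emb.shape[2] != visual_tokens.shape[2]:
        raise ValueError("embedding widths differ")
    return np.concatenate([text_emb, visual_tokens], axis=1)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention over the context.

    Keys and values are the context itself (identity projections), which is
    a simplification for illustration only.
    """
    dim = queries.shape[-1]
    scores = queries @ context.transpose(0, 2, 1) / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context, weights


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(2, 77, 64))        # assumed text-encoder output
visual_tokens = rng.normal(size=(2, 8, 64))    # assumed image-to-token output
latent_queries = rng.normal(size=(2, 16, 64))  # assumed flattened latent features

context = build_hybrid_context(text_emb, visual_tokens)
attended, weights = cross_attention(latent_queries, context)
print(context.shape, attended.shape, weights.shape)
```

Only the context length changes (77 to 85 in this toy setup), so the attention layer needs no new parameters. That is one plausible reading of why the interface can be added without modifying the backbone.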
During training, source samples are denoised using class prompts and source-image tokens, while target samples are denoised using a generic target prompt and target-image tokens. During generation, the model reuses the same token-augmented cross-attention format, pairing source class prompts with source-image tokens and synthesizing target-style images under the target-domain adapter branch.
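The pairing rules in the paragraph above can be written down as a small lookup. The sketch below is illustrative only: the prompt templates, the `Conditioning` record, and both function names are hypothetical, and routing source training samples through a "source" adapter is an assumption inferred from the mention of domain-specific adapters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Conditioning:
    prompt: str       # text passed to the text encoder
    token_image: str  # image whose visual tokens are appended to the context
    adapter: str      # domain adapter branch used for denoising


def training_conditioning(domain: str, image_id: str,
                          class_name: Optional[str] = None) -> Conditioning:
    """Conditioning used when denoising a training sample."""
    if domain == "source":
        if class_name is None:
            raise ValueError("source samples are labeled and need a class name")
        # Hypothetical class prompt template.
        return Conditioning(f"a photo of a {class_name}", image_id, "source")
    if domain == "target":
        # Target images are unlabeled, so the prompt is class-agnostic.
        # Hypothetical generic target prompt.
        return Conditioning("a photo of an object", image_id, "target")
    raise ValueError(f"unknown domain: {domain!r}")


def generation_conditioning(source_image_id: str, class_name: str) -> Conditioning:
    """Source class prompt + source-image tokens, decoded by the target branch."""
    return Conditioning(f"a photo of a {class_name}", source_image_id, "target")


print(training_conditioning("source", "src_001.jpg", "backpack"))
print(training_conditioning("target", "tgt_042.jpg"))
print(generation_conditioning("src_001.jpg", "backpack"))
```

The generated image inherits its label from the source class prompt, which is what makes the synthetic target-style set usable as labeled training data.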
Figure 1. Overview of VT-DUDA. We jointly train domain-specific diffusion adapters and an image-to-token encoder on source and target images.
Key Contributions
- We propose VT-DUDA, a visual-token conditioning framework for diffusion-guided unsupervised domain adaptation.
- VT-DUDA augments the standard cross-attention interface of an adapter-based latent diffusion model with instance-dependent visual tokens, enriching text conditioning without modifying the diffusion backbone, VAE, or denoising objective.
- The framework supports both pure-noise target-style synthesis and DDIM-inversion-based target-style translation through the same token-conditioned interface.
- Experiments on Office-31, Office-Home, and VisDA-2017 show improved average target-domain accuracy over strong discriminative and diffusion-based UDA baselines.
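The two generation modes named in the contributions differ only in where the starting latent comes from. The sketch below uses the standard deterministic DDIM update with a toy noise predictor to make that concrete. Everything model-specific is assumed: `toy_eps` is a hypothetical stand-in for the adapter-equipped denoiser, the five-level noise schedule is arbitrary, and inverting under a source branch before decoding under the target branch is an assumption about the translation setup, not a detail taken from the text.

```python
import numpy as np


def ddim_move(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative noise levels.

    alpha_* are alpha-bar values in (0, 1). Moving to a smaller alpha adds
    noise (inversion); moving to a larger alpha removes it (sampling).
    """
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps


def toy_eps(x, context, adapter):
    """Hypothetical noise predictor: depends only on context and adapter."""
    shift = 0.1 if adapter == "target" else -0.1
    return np.full_like(x, np.tanh(context.mean()) + shift)


def run_path(x, alphas, context, adapter):
    """Apply DDIM updates along consecutive pairs of the given schedule."""
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_move(x, toy_eps(x, context, adapter), a_from, a_to)
    return x


rng = np.random.default_rng(0)
alphas = [0.999, 0.8, 0.5, 0.2, 0.02]  # clean -> noisy (assumed schedule)
context = rng.normal(size=(85, 64))    # hybrid text + visual-token context
source_latent = rng.normal(size=(4, 8, 8))

# Mode 1: pure-noise synthesis starts from Gaussian noise.
noise = rng.normal(size=source_latent.shape)
synthesized = run_path(noise, alphas[::-1], context, "target")

# Mode 2: translation first inverts the source latent, then decodes it
# under the target branch with the same conditioning context.
inverted = run_path(source_latent, alphas, context, "source")
translated = run_path(inverted, alphas[::-1], context, "target")

# Sanity check: decoding with the branch used for inversion recovers the input.
roundtrip = run_path(inverted, alphas[::-1], context, "source")
print(np.allclose(roundtrip, source_latent))
```

The round trip is exact here only because the toy predictor ignores `x`. With a real denoiser, DDIM inversion is approximate and the reconstruction error depends on the number of steps.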
Main Results
The full VT-DUDA configuration combines pure-noise target-style synthesis with inversion-based translation. Applied on top of MCC and ELS, it improves the average accuracy of both baselines on Office-Home, Office-31, and VisDA-2017, with ELS + VT-DUDA scoring highest on all three benchmarks.
| Method | Office-Home Avg | Office-31 Avg | VisDA-2017 Mean |
|---|---|---|---|
| MCC | 72.24 | 89.61 | 83.32 |
| ELS | 71.84 | 90.21 | 83.40 |
| MCC + VT-DUDA | 76.25 | 91.53 | 87.48 |
| ELS + VT-DUDA | 76.74 | 92.04 | 88.42 |
Average transfer accuracies (%) on the main UDA benchmarks.
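To read the table as gains over each baseline, the differences can be computed directly from the reported numbers. The snippet below only re-uses the values in the table; the dictionary layout and the `gains` helper are illustrative.

```python
# Accuracies (%) copied from the table above, in benchmark order.
benchmarks = ("Office-Home", "Office-31", "VisDA-2017")
results = {
    "MCC": (72.24, 89.61, 83.32),
    "ELS": (71.84, 90.21, 83.40),
    "MCC + VT-DUDA": (76.25, 91.53, 87.48),
    "ELS + VT-DUDA": (76.74, 92.04, 88.42),
}


def gains(baseline):
    """Absolute accuracy gain (percentage points) from adding VT-DUDA."""
    base = results[baseline]
    ours = results[f"{baseline} + VT-DUDA"]
    return {b: round(o - s, 2) for b, s, o in zip(benchmarks, base, ours)}


for name in ("MCC", "ELS"):
    row = ", ".join(f"{b}: {g:+.2f}" for b, g in gains(name).items())
    print(f"{name}: {row}")
```

This gives gains of +4.01, +1.92, and +4.16 points for MCC and +4.90, +1.83, and +5.02 points for ELS. The improvement is smallest on Office-31, where both baselines already exceed 89%.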
Additional Visual Results
Figure 2. Token-strength scaling for target-style synthesis under Art → Clipart.
Figure 3. Token-subset manipulation under translation-based augmentation for Real-World → Clipart.
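Because the visual guidance is an explicit token sequence, the two manipulations shown in Figures 2 and 3 can be expressed as simple array operations on the conditioning context. The sketch below is one possible realization under stated assumptions: scaling is implemented as multiplying the visual tokens by a scalar, subset selection as indexing along the token axis, and the name `manipulate_visual_tokens` and all shapes are hypothetical.

```python
import numpy as np


def manipulate_visual_tokens(context, n_text, strength=1.0, keep=None):
    """Rescale and/or subselect the visual tokens of a hybrid context.

    context:  (batch, n_text + n_visual, dim), text tokens first
    n_text:   number of leading text tokens, left untouched
    strength: scalar multiplier applied to the visual tokens (assumed form)
    keep:     optional indices of visual tokens to retain
    """
    text_part = context[:, :n_text]
    visual_part = context[:, n_text:]
    if keep is not None:
        visual_part = visual_part[:, list(keep)]
    return np.concatenate([text_part, strength * visual_part], axis=1)


rng = np.random.default_rng(0)
context = rng.normal(size=(1, 77 + 8, 64))  # 77 text + 8 visual tokens (assumed)

weaker = manipulate_visual_tokens(context, n_text=77, strength=0.5)
subset = manipulate_visual_tokens(context, n_text=77, keep=[0, 2, 5])
print(weaker.shape, subset.shape)
```

Setting `strength=0.0` is not the same as dropping the tokens: zeroed tokens remain in the context and still receive attention mass, whereas `keep=[]` removes them and falls back to text-only conditioning.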
Inversion-free Protocol
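VT-DUDA also remains effective in an inversion-free setting, where the synthetic labeled training set is built only from pure-noise target-style generation. In this protocol, VT-DUDA uses 50 generated images per class on Office-31 and Office-Home, and 1000 generated images per class on VisDA-2017. Both variants still exceed their MCC and ELS baselines on all three benchmarks, although the Office-Home average falls roughly 3 points short of the full configuration.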
| Method | Office-Home Avg | Office-31 Avg | VisDA-2017 Mean |
|---|---|---|---|
| MCC + VT-DUDA | 73.50 | 91.37 | 86.92 |
| ELS + VT-DUDA | 73.72 | 91.92 | 87.56 |
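For a sense of scale, the per-class budgets of the inversion-free protocol translate into the synthetic set sizes below. The class counts (31, 65, and 12) are the standard ones for these benchmarks and are not stated in the text above; the totals are per adaptation task and are derived here, not reported by the authors.

```python
# Per-class generation budget from the inversion-free protocol.
images_per_class = {"Office-31": 50, "Office-Home": 50, "VisDA-2017": 1000}

# Standard class counts for each benchmark (assumption: not given in the text).
num_classes = {"Office-31": 31, "Office-Home": 65, "VisDA-2017": 12}

synthetic_set_size = {
    name: images_per_class[name] * num_classes[name] for name in images_per_class
}

for name, size in synthetic_set_size.items():
    print(f"{name}: {size} synthetic images per task")
```

Under these assumptions a task uses 1,550 synthetic images on Office-31, 3,250 on Office-Home, and 12,000 on VisDA-2017.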
BibTeX
@article{qi2026vtduda,
title={VT-DUDA: Visual Token Conditioning for Diffusion-guided Unsupervised Domain Adaptation},
author={Qi, Xuan and Berardini, Daniele and Serez, Dario and Pastore, Vito Paolo and Murino, Vittorio},
journal={Transactions on Machine Learning Research},
year={2026},
url={https://openreview.net/forum?id=Y956680PCe}
}