Destylize to Stylize

Building High Quality Style Transfer Data

1Jilin University   2Nanjing University   3Shanghai Innovation Institute
4Adobe   5Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China
*Corresponding author

Figure: Overview of DST-100K dataset.

Abstract

We present DST-100K, a high-quality dataset for style transfer, constructed through a novel destylization-based pipeline. Destylization reverses the stylization process by recovering the underlying natural appearance of an artistic image. This formulation turns the original style image into an authentic supervision signal, enabling style transfer to be learned from real styles with aligned content. To achieve destylization, we design DST, a text-guided destylization model that removes stylistic features from artistic images and generates style-free natural counterparts guided by content text. Because imperfect destylization would propagate noise into downstream training, we further introduce DST-Filter, a Chain-of-Thought, multi-stage evaluation that jointly measures content preservation and style discrepancy and automatically discards low-quality pairs. Leveraging DST-100K, we build DST-Transfer, a feed-forward style transfer model based on FLUX.1-dev that adds no new modules or handcrafted designs. Despite its simplicity, DST-Transfer consistently surpasses state-of-the-art methods in qualitative and quantitative evaluations. Our approach reframes style transfer as a data problem and introduces a reliable supervision paradigm derived directly from authentic artistic styles, which helps address the critical challenge posed by the absence of ground-truth data in style transfer.

Dataset Overview

DST-100K Dataset Statistics

Image Triplets: 100K
Artists: 669
Art Movements: 117
Digital Styles: 65
Resolution: 1K

Destylization

DST: Text-Guided Destylization

Destylization Process

(a) Destylization dataset construction: we use high-resolution images from HQ-50K and FFHQ as content images, covering six categories: humans, animals, plants, objects, scenes, and architecture. These images are stylized by four models, and captions are generated using InternVL2.5-7B. This yields triplets of the form (stylized image, content image, caption). (b) Architecture of the DST model.
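The triplet construction in (a) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `Triplet`, `build_triplets`, and the `stylizers` and `captioner` callables are stand-ins for the four stylization models and for InternVL2.5-7B, and images are represented by plain strings. The sketch also assumes that every content image is passed through every stylizer, which the description above does not state.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

# The six content categories listed in the caption above.
CATEGORIES = ("humans", "animals", "plants", "objects", "scenes", "architecture")


@dataclass(frozen=True)
class Triplet:
    """One training example for the destylization model (hypothetical layout)."""
    stylized: str  # identifier of the stylized image (model input)
    content: str   # identifier of the original natural image (target)
    caption: str   # content text that guides destylization


def build_triplets(
    content_images: Iterable[str],
    stylizers: Sequence[Callable[[str], str]],
    captioner: Callable[[str], str],
) -> List[Triplet]:
    """Pair each content image with its caption and each stylized version.

    Assumption: every image is stylized by every model; the real pipeline
    may sample a single stylizer per image instead.
    """
    triplets = []
    for content in content_images:
        caption = captioner(content)  # caption the natural image once
        for stylize in stylizers:
            triplets.append(Triplet(stylize(content), content, caption))
    return triplets
```

With two content images and four stylizers, this sketch yields eight triplets, each sharing the caption of its source image.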

Image Pool

(a) Style image collection and (b) text-guided destylization pipeline.

DST-Filter

Multi-Stage Evaluation Pipeline

DST-Filter Pipeline

The pipeline of DST-Filter. DST-Filter assesses each pair from two aspects: content preservation and style discrepancy, using GPT-4o with region-level and attribute-level Chain-of-Thought reasoning.
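The two-aspect decision can be sketched in Python as follows. This is only an illustration under stated assumptions: the `judge` callable stands in for the GPT-4o Chain-of-Thought evaluation, and the `Verdict` type, the 0 to 1 score scale, the thresholds, and the rule that a pair must pass both checks are all hypothetical choices, not details given by the source.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# A pair is (style image, destylized image), represented here by identifiers.
Pair = Tuple[str, str]


@dataclass(frozen=True)
class Verdict:
    """Scores for one pair on an assumed 0..1 scale."""
    content_preservation: float  # higher: destylized image keeps the content
    style_discrepancy: float     # higher: the style was removed more fully


def filter_pairs(
    pairs: Iterable[Pair],
    judge: Callable[[Pair], Verdict],
    min_content: float = 0.8,    # hypothetical threshold
    min_style_gap: float = 0.8,  # hypothetical threshold
) -> List[Pair]:
    """Keep only pairs that pass both checks; discard the rest."""
    kept = []
    for pair in pairs:
        verdict = judge(pair)
        if (verdict.content_preservation >= min_content
                and verdict.style_discrepancy >= min_style_gap):
            kept.append(pair)
    return kept
```

Requiring both scores to pass reflects the idea that a pair is only useful for training when the content is intact and the style has actually been removed; a pair that fails either check would add noise.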

More Results

Diverse Style Transfer Results

Our method produces stylized results across a broad range of style categories, including 2D styles such as flat design, PS1 game style, cartoon, line art, illustration, and classic artworks, as well as 3D styles such as origami art, 3D voxel art, and 3D low-poly rendering.

Quantitative Results

Performance Comparison

Quantitative comparison with state-of-the-art methods.

Qualitative Comparison

Performance Comparison

Qualitative comparison of style transfer results.

Comparison to GPT-4o

DST-Transfer vs GPT-4o

Method Comparison

Comparison of DST-Transfer with GPT-4o on the stylization task.

DST vs GPT-4o

Comparison of DST with GPT-4o on the destylization task.