Zhiyang Chen

Postdoc, Westlake University (2024)

I am currently a member of the Machine Perception & Learning (MAPLE) Lab, Westlake University, working with Prof. Guo-Jun Qi. I obtained my Ph.D. from the Institute of Automation, Chinese Academy of Sciences in June 2024.
My research interests include multi-modal content perception and generation, and I currently focus on building large multimodal models.
If you are looking for Ph.D. or research intern opportunities in these areas, feel free to contact me via email.


News
2025
Sep 24: We release RemeDi 9B, a mask-based diffusion language model that can remask already-generated tokens for text refinement, outperforming other DLMs on math, code, and general tasks.
May 19: We release DCoLT, a novel reasoning framework for diffusion language models. Training and evaluation code here
Mar 29: TPDM is accepted to CVPR 2025. Check out our code
2024
Dec 06: We release TPDM, a Reinforcement-Tuned Diffusion Model that can adjust the noise schedule on the fly.
Selected Publications (view all)
Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models

Zemin Huang, Yuhang Wang, Zhiyang Chen#, Guo-Jun Qi# (# corresponding author)

arXiv 2025

We propose RemeDi, a new diffusion language model that introduces remasking, allowing the model to detect and resample low-confidence tokens during generation.

C-Evolve: Consensus-based Evolution for Prompt Groups

Tiancheng Li, Yuhang Wang, Zhiyang Chen, Zijun Wang, Liyuan Ma, Guo-Jun Qi

arXiv 2025

An evolutionary algorithm that discovers a group of prompts whose aggregated outputs after majority voting achieve optimal performance.

Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models

Zemin Huang*, Zhiyang Chen*, Zijun Wang, Tiancheng Li, Guo-Jun Qi (* equal contribution)

Neural Information Processing Systems (NeurIPS) 2025

Schedule On the Fly: Diffusion Time Prediction for Faster and Better Image Generation

Zilyu Ye, Zhiyang Chen#, Tiancheng Li, Zemin Huang, Weijian Luo, Guo-Jun Qi# (# corresponding author)

Computer Vision and Pattern Recognition (CVPR) 2025

A Reinforcement-Tuned Diffusion Model that adjusts the noise schedule on the fly, assigning fewer denoising steps to simple samples and more to complex ones.

Obj2Seq: Formatting objects as sequences with class prompts for visual tasks

Zhiyang Chen, Yousong Zhu, Zhaowen Li, Fan Yang, Wei Li, Hanxin Wang, Chaoyang Zhao, Liwei Wu, Rui Zhao, Jinqiao Wang, Ming Tang

Neural Information Processing Systems (NeurIPS) 2022 Spotlight

We propose a general definition that encompasses a wide range of visual tasks, so that all their outputs can be decoded in an identical way: treating objects as fundamental units and generating multiple sequences based on the input image and class prompts. Based on this definition, we build a language-guided general vision model that meets diverse task requirements and achieves performance comparable to specialized models.

MST: Masked Self-Supervised Transformer for Visual Representation

Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang

Neural Information Processing Systems (NeurIPS) 2021

This paper is an early work introducing Masked Image Modeling into self-supervised learning. MST uses the self-attention map to mask background image tokens, and supervises with a pixel-level restoration loss to preserve fine-grained information, in addition to standard contrastive learning. MST brings consistent gains on downstream tasks.

DPT: Deformable patch-based transformer for visual recognition

Zhiyang Chen, Yousong Zhu, Chaoyang Zhao, Guosheng Hu, Wei Zeng, Jinqiao Wang, Ming Tang

ACM Multimedia Conference (ACM MM) 2021 Oral

The fixed-size patch embedding in current vision transformers might overlook local spatial structures and extract inferior image features. To address this problem, we propose a new module (DePatch) which learns to adaptively split the images into patches with different positions and scales in a data-driven way, resulting in an enhanced backbone to extract more potent image features.

Education
  • Xi'an Jiaotong University
    B.S. in Automation
    Sep. 2015 - Jun. 2019
  • Institute of Automation, CAS
    Ph.D. at Foundation Model Research Center
    Sep. 2019 - Jun. 2024
Experience
  • SenseTime, Beijing
    General Vision Models, Self-Supervised Learning
  • MiHoYo, Beijing
    Large Language Models, AI for Games