Zhiyang Chen

Postdoc, Westlake University (2024)

I am currently a member of the Machine Perception & Learning (MAPLE) Lab at Westlake University, working with Prof. Guo-Jun Qi. I obtained my Ph.D. from the Institute of Automation, Chinese Academy of Sciences in June 2024.
My research interests include multi-modal content perception and generation, and I currently focus on building large multimodal models.
If you are looking for Ph.D. or research intern opportunities in these areas, feel free to contact me via email.


Education
  • Xi'an Jiaotong University
    B.S. in Automation
    Sep. 2015 - Jun. 2019
  • Institute of Automation, CAS
    Ph.D. at Foundation Model Research Center
    Sep. 2019 - Jun. 2024
Experience
  • SenseTime, Beijing
    General Vision Model, Self-Supervised Learning
  • MiHoYo, Beijing
    Large Language Models, AI for Game
Selected Publications
Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang

European Conference on Computer Vision (ECCV) 2024

Griffon is a Large Vision-Language Model that can accurately identify and locate objects of interest based on free-form text. It is realized with a unified, purely textual data format and a novel language-prompted localization dataset, without introducing any special tokens or expert models.
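
To make the unified, text-only format concrete, below is a minimal, hypothetical example of what a language-prompted localization sample could look like; the field names, prompt wording, and normalized box convention are illustrative assumptions rather than the actual Griffon data schema.

```python
# Illustrative sketch of a language-prompted localization sample written in a
# purely textual format (no special location tokens, no expert detector).
# Field names, prompt wording, and the normalized [x1, y1, x2, y2] convention
# are assumptions for illustration only.
sample = {
    "image": "example.jpg",
    "conversation": [
        {"from": "user",
         "value": "Locate every person riding a bicycle in the image."},
        {"from": "assistant",
         # Boxes are spelled out as plain text inside the response.
         "value": "person riding a bicycle: [0.12, 0.34, 0.28, 0.77]; "
                  "person riding a bicycle: [0.55, 0.30, 0.71, 0.80]"},
    ],
}
```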

The Devil is in Details: Delving into Lite FFN Design for Vision Transformers

Zhiyang Chen, Yousong Zhu, Zhaowen Li, Fan Yang, Chaoyang Zhao, Jinqiao Wang, Ming Tang

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024

In this paper, we find that the high-dimensional feed-forward networks account for a large share of the computation cost in vision transformers. To address this, we introduce a lightweight, plug-and-play substitute, SparseFFN, which reduces complexity along both the channel and the spatial dimension. SparseFFN effectively reduces model complexity across a broad spectrum of vision models.
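
As a rough, assumption-laden sketch of the general idea (cutting FFN cost along both the channel and the spatial dimension), the PyTorch snippet below pairs a smaller hidden expansion with a simple top-k token selection so that only salient tokens pass through the MLP; this is not the SparseFFN design from the paper, and all module and parameter names are made up.

```python
import torch
import torch.nn as nn

class LiteFFNSketch(nn.Module):
    """Illustrative FFN substitute that trims cost in two ways:
    - channel: a smaller hidden expansion than the usual 4x,
    - spatial: only the top-k most salient tokens go through the MLP.
    A sketch of the general idea, not the paper's SparseFFN."""
    def __init__(self, dim, expansion=2, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)            # crude per-token saliency
        self.fc1 = nn.Linear(dim, dim * expansion)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(dim * expansion, dim)

    def forward(self, x):                          # x: (B, N, C)
        B, N, C = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(x).squeeze(-1)         # (B, N)
        idx = scores.topk(k, dim=1).indices        # indices of kept tokens
        kept = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, C))
        out = self.fc2(self.act(self.fc1(kept)))   # MLP only on kept tokens
        # write processed tokens back (with a residual); the rest pass through unchanged
        y = x.clone()
        y.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, C), out + kept)
        return y
```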

Obj2Seq: Formatting Objects as Sequences with Class Prompts for Visual Tasks

Zhiyang Chen, Yousong Zhu, Zhaowen Li, Fan Yang, Wei Li, Hanxin Wang, Chaoyang Zhao, Liwei Wu, Rui Zhao, Jinqiao Wang, Ming Tang

Neural Information Processing Systems (NeurIPS) 2022 Spotlight

We propose a general definition that encompasses a wide range of visual tasks, so that all their outputs can be decoded in an identical way: treating objects as fundamental units and generating multiple sequences based on the input image and class prompts. Based on this formulation, we build a language-guided general vision model that meets diverse task requirements and achieves performance comparable to specialized models.
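
As a schematic illustration of the sequence formulation, the snippet below decodes each class-prompt-conditioned object feature into one flat attribute sequence whose interpretation depends on the task; the head, shapes, and sequence layout are assumptions for illustration, not the actual Obj2Seq architecture.

```python
import torch.nn as nn

class ObjectSequenceHeadSketch(nn.Module):
    """Schematic of 'objects as sequences': every object query, conditioned on a
    class prompt, is decoded into one flat sequence of attributes whose length
    depends on the requested task (e.g. 4 box values, optionally followed by
    17 keypoint coordinate pairs). Names and shapes are illustrative only."""
    def __init__(self, dim, seq_len=4 + 17 * 2):
        super().__init__()
        self.head = nn.Linear(dim, seq_len)

    def forward(self, object_features, task="detection"):
        # object_features: (B, num_objects, dim), already fused with class prompts
        seq = self.head(object_features).sigmoid()  # all outputs normalized to [0, 1]
        if task == "detection":
            return seq[..., :4]                     # read the first 4 values as a box
        return seq                                  # box followed by keypoint coordinates
```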

MST: Masked Self-Supervised Transformer for Visual Representation

Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang

Neural Information Processing Systems (NeurIPS) 2021

This paper is an early work introducing Masked Image Modeling into self-supervised learning. MST uses the self-attention map to mask background image tokens and adds a pixel-level restoration loss to preserve fine-grained information, on top of standard contrastive learning. MST substantially improves performance on downstream tasks.
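
A minimal sketch of the attention-guided masking idea, assuming the [CLS]-token attention over patch tokens is available: rank patches by attention and mask only the lowest-attention (background-like) ones so that semantically important patches remain visible. The ratio and shapes are illustrative, not MST's exact procedure.

```python
import torch

def attention_guided_mask(cls_attn, mask_ratio=0.3):
    """cls_attn: (B, N) attention of the [CLS] token over the N patch tokens
    (e.g. averaged over heads from the last block). Returns a boolean mask of
    shape (B, N) that is True for masked patches. Only the lowest-attention
    (background-like) patches are masked, so foreground structure is kept.
    An illustrative sketch of the idea, not MST's exact procedure."""
    B, N = cls_attn.shape
    num_mask = int(N * mask_ratio)
    # indices of the num_mask patches with the *lowest* attention per image
    low_idx = cls_attn.argsort(dim=1)[:, :num_mask]
    mask = torch.zeros(B, N, dtype=torch.bool, device=cls_attn.device)
    mask.scatter_(1, low_idx, True)
    return mask
```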

DPT: Deformable Patch-based Transformer for Visual Recognition

Zhiyang Chen, Yousong Zhu, Chaoyang Zhao, Guosheng Hu, Wei Zeng, Jinqiao Wang, Ming Tang

ACM Multimedia Conference (ACM MM) 2021 Oral

The fixed-size patch embedding in current vision transformers can overlook local spatial structure and yield inferior image features. To address this problem, we propose a new module, DePatch, which learns to adaptively split images into patches with different positions and scales in a data-driven way, resulting in an enhanced backbone that extracts more potent image features.
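
As a rough sketch of the deformable patch idea, the module below predicts a small per-patch offset from features sampled on a regular grid and re-samples each patch at the shifted location with grid_sample; this is a simplified illustration under assumed shapes, not the DePatch module itself.

```python
import torch.nn as nn
import torch.nn.functional as F

class DeformablePatchSketch(nn.Module):
    """Illustrative deformable patch embedding: each patch on a regular grid
    predicts a small offset for its sampling location, and the patch feature is
    re-sampled there with grid_sample. A simplified sketch of the idea only;
    the full method also learns a per-patch scale."""
    def __init__(self, dim):
        super().__init__()
        self.offset = nn.Linear(dim, 2)  # (dx, dy) predicted per patch; dim == C below

    def forward(self, feat, base_grid):
        # feat:      (B, C, H, W) feature map to sample patches from
        # base_grid: (B, Hp, Wp, 2) regular patch centers in [-1, 1] coordinates
        coarse = F.grid_sample(feat, base_grid, align_corners=False)   # (B, C, Hp, Wp)
        delta = self.offset(coarse.permute(0, 2, 3, 1)).tanh() * 0.1   # small learned shifts
        grid = (base_grid + delta).clamp(-1.0, 1.0)                    # deformed locations
        return F.grid_sample(feat, grid, align_corners=False)          # (B, C, Hp, Wp)
```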
