Zhiyang Chen
Postdoc, Westlake University (2024)
I am currently a member of Machine Perception & Learning (MAPLE) Lab, Westlake University, working with Prof. Guo-Jun Qi. I obtained my Ph.D. from Institute of Automation, Chinese Academy of Sciences in June 2024.
My research interests include multi-modal content perception and generation, and I currently focus on building large multimodal models.
If you are looking for Ph.D. or research intern opportunities in these areas, feel free to contact me via email.
Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang
European Conference on Computer Vision (ECCV) 2024
Griffon is a Large Vision-Language Model that can accurately identify and locate objects of interest based on free-form text. It is realized with a unified data format combining pure text and a novel language-prompted localization dataset, without introducing any task-specific tokens or expert models.
Zhiyang Chen, Yousong Zhu, Zhaowen Li, Fan Yang, Chaoyang Zhao, Jinqiao Wang, Ming Tang
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024
In this paper, we find that the high-dimensional feed-forward networks account for a large share of the computation cost in vision transformers. To this end, we introduce a lightweight, plug-and-play substitute, SparseFFN, which reduces complexity in both the channel and spatial dimensions. SparseFFN effectively reduces model complexity across a broad spectrum of vision models.
Zhiyang Chen, Yousong Zhu, Zhaowen Li, Fan Yang, Wei Li, Hanxin Wang, Chaoyang Zhao, Liwei Wu, Rui Zhao, Jinqiao Wang, Ming Tang
Neural Information Processing Systems (NeurIPS) 2022 Spotlight
We propose a general definition that encompasses a wide range of visual tasks, so that all their outputs can be decoded in an identical way: treating objects as fundamental units and generating multiple sequences based on the input image and class prompts. Building on this definition, we develop a language-guided general vision model that meets diverse task requirements and achieves performance comparable to specialized models.
Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang
Neural Information Processing Systems (NeurIPS) 2021
This paper is an early work introducing Masked Image Modeling into self-supervised learning. MST utilizes the self-attention map to mask background image tokens, and adds a pixel-level restoration loss to preserve fine-grained information alongside common contrastive learning. MST substantially improves performance on downstream tasks.
Zhiyang Chen, Yousong Zhu, Chaoyang Zhao, Guosheng Hu, Wei Zeng, Jinqiao Wang, Ming Tang
ACM Multimedia Conference (ACM MM) 2021 Oral
The fixed-size patch embedding in current vision transformers may overlook local spatial structures and extract inferior image features. To address this problem, we propose a new module (DePatch) that learns to adaptively split images into patches with different positions and scales in a data-driven way, yielding an enhanced backbone that extracts more potent image features.