- Title
CSFNet: a compact and efficient convolution-transformer hybrid vision model.
- Authors
Feng, Jian; Wu, Peng; Xu, Renjie; Zhang, Xiaoming; Wang, Tao; Li, Xuan
- Abstract
The Vision Transformer (ViT) has demonstrated impressive performance on a variety of visual tasks, but its high computational cost limits its applicability on edge devices. Conversely, convolutional neural networks (CNNs) are widely used in mobile applications, but their static kernels and weak global modeling hinder their performance. In this work, we propose CSFNet, a lightweight convolution-transformer hybrid model for classification and dense prediction that combines a strong local inductive bias with long-range modeling capability. To link local and global information, we introduce two hierarchical structures. First, a Local-Attention Block (LAB) with adaptive kernels and channel expansion ratios aggregates n × n local information layer by layer, capturing multi-stage detail features and providing an efficient local inductive bias. Second, a linear-complexity Channel-Spatial Fusion Attention (CSFA) projects the attention matrix along both the channel and token dimensions; token relationships are aggregated stage by stage to encode contextual information efficiently, using low-rank matrices and element-wise operations to reduce computational complexity. Experimental results show that our proposed CSFNet-XXS/XS/S models, with 1.4M/2.4M/5.6M parameters and 0.3G/0.5G/1.1G multiply-adds (MAdds), achieve 70.23%/74.91%/78.82% top-1 accuracy on ImageNet-1k, competitive with recent mainstream methods. Furthermore, CSFNet performs well on small-scale datasets as well as on MS-COCO2017 and ADE-20K.
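
The record does not include the authors' code. As a rough illustration of the linear-complexity channel-spatial attention idea the abstract describes, the PyTorch sketch below combines a low-rank token-attention branch with a channel gate and fuses them element-wise. The class name `ChannelSpatialFusionAttention`, the `rank` parameter, and the specific fusion scheme are assumptions made for illustration; they are not the paper's CSFA implementation.

```python
import torch
import torch.nn as nn

class ChannelSpatialFusionAttention(nn.Module):
    """Illustrative sketch of linear-complexity attention that mixes
    information along both the token (spatial) and channel dimensions.
    NOT the authors' code: names, rank, and fusion are assumptions."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        # Low-rank projections keep token interaction O(N * rank), not O(N^2)
        self.to_q = nn.Linear(dim, rank, bias=False)
        self.to_k = nn.Linear(dim, rank, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Channel branch: squeeze-style gate over the channel dimension
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(dim // 4, dim),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q = self.to_q(x).softmax(dim=-1)   # (B, N, r)
        k = self.to_k(x).softmax(dim=1)    # (B, N, r)
        v = self.to_v(x)                   # (B, N, C)
        # Contract keys with values first: (B, r, C). Computing k^T v
        # before applying q avoids materializing an N x N attention map.
        context = torch.einsum("bnr,bnc->brc", k, v)
        spatial = torch.einsum("bnr,brc->bnc", q, context)
        # Gate each channel by a global token average, then fuse the
        # spatial and channel branches element-wise.
        gate = self.channel_gate(x.mean(dim=1, keepdim=True))  # (B, 1, C)
        return self.proj(spatial * gate)

# Usage: 14 x 14 = 196 tokens with 64 channels (shapes are illustrative)
x = torch.randn(2, 196, 64)
attn = ChannelSpatialFusionAttention(dim=64, rank=8)
print(attn(x).shape)  # torch.Size([2, 196, 64])
```

Because the token-token interaction is factored through a rank-`r` bottleneck, cost grows linearly with the number of tokens, which is consistent with the linear complexity the abstract claims for CSFA.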
- Subjects
TRANSFORMER models; CONVOLUTIONAL neural networks; IMAGE recognition (Computer vision); LOW-rank matrices; MOBILE apps
- Publication
Multimedia Tools & Applications, 2024, Vol. 83, Issue 29, p. 72679
- ISSN
1380-7501
- Publication type
Article
- DOI
10.1007/s11042-024-18417-3