Crack images of rail surfaces contain small targets, substantial background interference, and high resolution, which makes cracks difficult to detect reliably. To address this, the paper proposes an improved YOLOv5s-based target detection algorithm for rail cracks. The VoV-GSCSP module is introduced in the neck network, and ordinary convolution is replaced with the lighter GSConv to reduce the network's computational cost while retaining more detailed information. The feature pyramid structure is improved with a proposed multipath cross-layer fusion structure, which merges backbone features across layers during the downsampling stage of the feature pyramid, retaining more of the original feature information and improving detection accuracy. In addition, the CA attention module and a Transformer structure are introduced to further strengthen the extraction of high-level semantic information. Experimental results show that the improved YOLOv5s algorithm achieves a mean average precision (mAP) of 62.4%, 6.2 percentage points higher than the original YOLOv5s algorithm, and a recall of 92.2%, 4.4 percentage points higher.
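
As a rough illustration of why replacing ordinary convolution with GSConv can lighten the neck network, the sketch below compares weight counts for a single layer. It assumes the commonly described GSConv layout, which this abstract does not spell out: a dense convolution produces half of the output channels, a 5×5 depthwise convolution produces the other half from that result, and the two halves are concatenated and channel-shuffled. The function names, the 5×5 depthwise kernel, and the example channel sizes are illustrative assumptions, not values taken from the paper; bias and normalization parameters are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a dense k x k convolution (bias and batch norm ignored)."""
    return k * k * c_in * c_out


def gsconv_params(c_in, c_out, k, dw_k=5):
    """Weights in a GSConv-style layer under the assumptions stated above."""
    half = c_out // 2
    # Dense convolution that produces only half of the output channels.
    dense = standard_conv_params(c_in, half, k)
    # Depthwise convolution: one dw_k x dw_k filter per channel of that half.
    depthwise = dw_k * dw_k * half
    # Concatenation and channel shuffle add no learnable weights.
    return dense + depthwise


if __name__ == "__main__":
    # Hypothetical layer sizes chosen only for illustration.
    c_in, c_out, k = 128, 128, 3
    std = standard_conv_params(c_in, c_out, k)
    gs = gsconv_params(c_in, c_out, k)
    print(f"standard conv: {std} weights")
    print(f"GSConv-style:  {gs} weights ({gs / std:.1%} of standard)")
```

Under these assumptions the GSConv-style layer needs roughly half the weights of the dense layer, since the depthwise branch is cheap compared with the dense branch. The actual savings in the paper's network depend on its layer configuration.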