Journal of Xi'an University of Posts and Telecommunications

2025, Vol. 30, No. 3, pp. 103-110

Document Layout Analysis Algorithm Based on a Composite Attention Mechanism

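XIE Hailong (谢海龙)^1,2, LUO Wei (罗玮)^1,2, XU Taotao (徐涛涛)^1,2, YANG Wenqing (杨文青)^1,2, CHEN Dandan (陈丹丹)^1,2, DONG Qianqian (董前前)^3

1. Zhixin Energy Technology Co., Ltd. (智信能源科技有限公司) 2. PowerChina Jiangxi Electric Power Construction Co., Ltd. (中国电建集团江西省电力建设有限公司) 3. School of Computer Science, Xi'an Polytechnic University (西安工程大学计算机科学学院)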

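Foundation: Jiangxi Electric Power Construction Scientific Research Project (JEPCC-KYXM-2023-052); Key Scientific Research Program of the Shaanxi Provincial Department of Education (22JS021)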

DOI: 10.13682/j.issn.2095-6533.2025.03.012

Abstract:

To address the challenge of rapidly extracting key information from massive volumes of unstructured documents, a document layout analysis algorithm based on a composite attention mechanism is proposed. The algorithm first incorporates a spatial attention mechanism into the feature pyramid network (FPN) to focus on information-dense regions of document images, and introduces deformable convolution to handle offset-field issues. A channel attention mechanism is then attached to adaptively adjust the weights of the feature channels, improving the quality of the document image feature representation. Finally, residual connections are employed to alleviate the vanishing-gradient problem in deep networks, enabling efficient fusion of image features. Experimental results show that the proposed algorithm achieves an mAP of 88.2% on the English PubLayNet dataset and 94.3% on the Chinese CDLA dataset, outperforming the comparison algorithms by 0.6% and 3.3%, respectively, and that it detects the diverse table styles found in complex documents more effectively.

Keywords: document layout analysis; large language model; feature pyramid network; spatial attention mechanism; channel attention mechanism
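
The pipeline summarized in the abstract (spatial attention, then channel attention, then a residual connection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which is not available on this page. It assumes a CBAM-like spatial gate and a squeeze-and-excitation-style channel gate, omits the deformable convolution and the FPN itself, and uses random weights in place of learned ones. All function names (`spatial_attention`, `channel_attention`, `composite_attention_block`) and the reduction ratio `r` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function; maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(feat):
    """Weight each (row, col) location of a (C, H, W) feature map."""
    avg_map = feat.mean(axis=0)  # (H, W): average over channels
    max_map = feat.max(axis=0)   # (H, W): maximum over channels
    # Assumption: a plain sum stands in for the learned convolution that
    # would normally mix the two pooled maps.
    mask = sigmoid(avg_map + max_map)
    return feat * mask[None, :, :]


def channel_attention(feat, w1, w2):
    """Rescale each channel with a squeeze-and-excitation style gate."""
    squeezed = feat.mean(axis=(1, 2))        # (C,): global average pooling
    hidden = np.maximum(w1 @ squeezed, 0.0)  # (C // r,): ReLU bottleneck
    gate = sigmoid(w2 @ hidden)              # (C,): per-channel weight
    return feat * gate[:, None, None]


def composite_attention_block(feat, w1, w2):
    """Spatial gate, then channel gate, then a residual connection."""
    refined = channel_attention(spatial_attention(feat), w1, w2)
    return feat + refined


# Hypothetical sizes: 8 channels, a 16 x 16 map, reduction ratio r = 4.
rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1  # random stand-ins for
w2 = rng.standard_normal((C, C // r)) * 0.1  # learned weights
out = composite_attention_block(feat, w1, w2)
print(out.shape)  # (8, 16, 16)
```

Because both gates lie in (0, 1), the refined branch can only attenuate the input, and the residual sum keeps an identity path through the block. That identity path is the property the abstract credits with easing vanishing gradients.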

For the full text, visit cnki.net

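Basic information:

CLC number: TP391.41

Citation:

[1] XIE Hailong, LUO Wei, XU Taotao, et al. Document layout analysis algorithm based on a composite attention mechanism [J]. Journal of Xi'an University of Posts and Telecommunications, 2025, 30(03): 103-110. DOI: 10.13682/j.issn.2095-6533.2025.03.012.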
