- Title
An Automatic Web Data Extraction Approach based on Path Index Trees.
- Authors
Yan Wen; Qingtian Zeng; Hua Duan; Feng Zhang; Xin Chen
- Abstract
This paper proposes a novel approach called ITE to extract web data records in a fully automatic way. The approach utilizes the tag index information in different layers of the HTML DOM tree and abstracts the concept of an index tree, together with its repetitiveness and consecutiveness, which characterize the key structural information in a web page. Repetitiveness indicates the structural similarities among data records, and consecutiveness represents the sequential features of multiple records. Based on these concepts, the complex DOM tree can be compressed into a set of index trees. We also provide a series of properties as theoretical support. The extraction process is divided into three steps: repetitiveness discovery, consecutiveness discovery, and index tree merging. To handle missing data fields, multiple record roots, and other complicated situations, we propose a digital sequence similarity measurement and a hierarchical clustering approach to find the repeating patterns. Data records are then identified using the consecutiveness discovery method, and the data blocks containing full data records are restored by merging the index trees. Experiments demonstrate the effectiveness and efficiency of the proposed approach. It outperforms existing classic approaches in accuracy and has a satisfactory execution time, which makes it applicable to large datasets. The time complexity is linear in the number of leaf nodes in the DOM tree of a web page.
- Subjects
DATA extraction; HTML (Document markup language); WEB databases; INDEX theorems; WEBSITES
- Publication
International Journal of Performability Engineering, 2018, Vol 14, Issue 10, p2449
- ISSN
0973-1318
- Publication type
Article
- DOI
10.23940/ijpe.18.10.p21.24492460
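
The abstract refers to tag index information in different layers of the DOM tree. As a rough illustration only, the following Python sketch records, for each text leaf, the tag and child index at every layer, then groups leaves that share a tag path. It is not the ITE algorithm from the paper: the names IndexPathCollector, leaf_index_paths, group_by_tag_path, and SAMPLE are hypothetical, and the grouping step is a simplified stand-in for the paper's similarity measurement and clustering, which this record does not describe in detail.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so the stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class IndexPathCollector(HTMLParser):
    """Collects, for every text leaf, the (tag, child index) path from the root."""

    def __init__(self):
        super().__init__()
        self.stack = []          # entries: [tag, index among siblings, child count]
        self.root_children = 0   # number of top-level elements seen so far
        self.leaves = []         # entries: (path, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.stack:
            parent = self.stack[-1]
            index = parent[2]
            parent[2] += 1
        else:
            index = self.root_children
            self.root_children += 1
        self.stack.append([tag, index, 0])

    def handle_endtag(self, tag):
        # Assumes well-formed markup: only pop when the tag matches the top.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = tuple((t, i) for t, i, _ in self.stack)
            self.leaves.append((path, text))


def leaf_index_paths(html):
    """Return a list of (path, text) pairs, one per non-empty text leaf."""
    collector = IndexPathCollector()
    collector.feed(html)
    collector.close()
    return collector.leaves


def group_by_tag_path(leaves):
    """Group leaves by tag path; each group keeps the per-layer index sequences."""
    groups = defaultdict(list)
    for path, text in leaves:
        tag_path = "/".join(tag for tag, _ in path)
        indexes = tuple(i for _, i in path)
        groups[tag_path].append((indexes, text))
    return dict(groups)


# Hypothetical page fragment: three records, the last one missing a field.
SAMPLE = (
    "<ul>"
    "<li><b>A</b><i>1</i></li>"
    "<li><b>B</b><i>2</i></li>"
    "<li><b>C</b></li>"
    "</ul>"
)

if __name__ == "__main__":
    for tag_path, entries in group_by_tag_path(leaf_index_paths(SAMPLE)).items():
        print(tag_path, entries)
```

In this sample, leaves under the same tag path differ only in the index at the li layer, and those indexes run consecutively. That loosely mirrors the repetitiveness and consecutiveness ideas named in the abstract, under the assumptions stated above.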