- Title
Hybrid Chinese text classification approach using general knowledge from Baidu Baike.
- Authors
Ren, Fuji; Li, Chao
- Abstract
Most previous studies have focused on enriching text representation to address the text classification (TC) task. However, conventional classification approaches for Chinese text based on the vector space model (VSM) intensively study only the words and their relationships in a specific corpus or dataset, and ignore the basic concepts of the categories and the general knowledge behind the words that people learn and use to recognize entities. This paper focuses on enriching text representation and proposes a novel approach that supplements Chinese TC with information from the online Chinese encyclopedia Baidu Baike. The similarities between each text and each category concept and the most related words from Baidu Baike are added to the feature space. The performance of the proposed approach is measured on the Fudan University TC corpus, an imbalanced Chinese dataset. In the experiments, the proposed Baidu Baike-based concept similarity approach obtains promising results compared with previous research and the conventional method, with macro-precision of 90.31%, recall of 75.45%, and F1 score of 80.32%, which are about 0.02%, 0.15%, and 0.12% higher, respectively, than the conventional method. The approach noticeably improves recall for some small categories while keeping precision at a high level and improving the macro F1 score. Moreover, the proposed approach has good expandability: many other knowledge bases could be integrated, and many other concepts could be referenced, to improve effectiveness. © 2016 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
- Subjects
DOCUMENT classification (Electronic documents); CHINESE language; VECTOR spaces; CHINESE encyclopedias & dictionaries; CORPORA
- Publication
IEEJ Transactions on Electrical & Electronic Engineering, 2016, Vol 11, Issue 4, p488
- ISSN
1931-4973
- Publication type
Article
- DOI
10.1002/tee.22266
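
The abstract describes adding text-to-concept similarities to a VSM feature space. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes pre-segmented tokens, raw term-frequency weighting, and cosine similarity, none of which the abstract specifies. The function names (`tf_vector`, `cosine`, `augment_features`) and the toy concept word lists are hypothetical stand-ins for descriptions that would, in the paper's setting, come from Baidu Baike.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Raw term-frequency vector as a Counter (assumed weighting, not the paper's)."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment_features(doc_tokens, vocabulary, concept_tokens):
    """Return VSM features over `vocabulary`, followed by one similarity per category concept."""
    doc_vec = tf_vector(doc_tokens)
    base = [float(doc_vec[term]) for term in vocabulary]
    # Sort category names so the appended similarity columns have a stable order.
    sims = [cosine(doc_vec, tf_vector(concept_tokens[c])) for c in sorted(concept_tokens)]
    return base + sims


# Hypothetical, pre-segmented toy data (not taken from Baidu Baike or the Fudan corpus).
concepts = {
    "economy": ["market", "trade", "finance", "bank"],
    "sports": ["match", "team", "athlete", "score"],
}
vocab = ["bank", "team", "trade"]
features = augment_features(["bank", "trade", "trade", "policy"], vocab, concepts)
print(features)
```

Here the first three values are the ordinary term counts over `vocab`, and the last two are the appended similarities to the "economy" and "sports" concept descriptions. A classifier trained on such vectors sees category-level knowledge in addition to corpus-specific word statistics, which is the kind of enrichment the abstract reports as helping recall on small categories.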