- Title
Subspace Clustering of Text Documents Using Collection and Document Frequencies of Terms.
- Authors
Prasad, N. Sivaram; Rao, K. Rajasekhara
- Abstract
The most widely used document representation model is the Vector Space Model. The high dimensionality and sparseness of this representation lead to poor clustering performance and demand more computational effort for clustering. Hence, dimension reduction techniques are used to find a feature subspace for document representation that could enhance clustering performance. This paper proposes a novel unsupervised filter method for feature selection. Feature selection methods represent documents using a subset of the original feature set that maximizes the separation among classes of documents in the collection. Filter methods analyze the intrinsic properties of the documents and select highly ranked features according to some criterion that is independent of the clustering task. Unsupervised feature selection methods do not use class labels to guide the selection of features. The proposed method assigns a score to a term using its collection and document frequencies. The collection frequency of a term is the total number of times it appears in a document collection, and its document frequency is the number of documents in which it appears. Empirical evaluations showed that the proposed method is not only effective in selecting the features that give the best clustering performance, but also less computationally complex than other unsupervised feature selection methods.
- Subjects
VECTOR spaces; COMPUTATIONAL complexity; FEATURE selection; CLUSTER analysis (Statistics); SUBSPACES (Mathematics)
- Publication
International Review on Computers & Software, 2014, Vol 9, Issue 10, p1692
- ISSN
1828-6003
- Publication type
Article
- DOI
10.15866/irecos.v9i10.3894
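
To make the two quantities defined in the abstract concrete, the following is a minimal sketch of how collection frequency and document frequency might be counted for a small tokenized corpus. The function name `term_frequencies` and the toy documents are illustrative assumptions, not taken from the paper. The abstract does not give the scoring formula that combines the two frequencies, so none is shown here.

```python
from collections import Counter


def term_frequencies(docs):
    """Return (collection_frequency, document_frequency) for tokenized docs.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    cf = Counter()  # total occurrences of each term across the collection
    df = Counter()  # number of documents containing each term
    for tokens in docs:
        cf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each document counts at most once per term
    return cf, df


# Toy corpus (assumed data), tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats".split(),
]
cf, df = term_frequencies(docs)
print(cf["the"], df["the"])  # collection vs. document frequency of "the"
```

A term such as "the" has a collection frequency higher than its document frequency because it repeats within documents, while a term that occurs once per document has equal values for both.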