Prasad, N. Sivaram; Rao, K. Rajasekhara

doi:10.15866/irecos.v9i10.3894

Title: Subspace Clustering of Text Documents Using Collection and Document Frequencies of Terms.
Authors: Prasad, N. Sivaram; Rao, K. Rajasekhara
Abstract: The Vector Space Model is the most widely used document representation model. The high dimensionality and sparseness of this representation lead to poor clustering performance and increase the computational effort that clustering requires. Hence, dimension reduction techniques are used to find a feature subspace for document representation that can enhance clustering performance. This paper proposes a novel unsupervised filter method for feature selection. Feature selection methods represent documents using a subset of the original feature set that maximizes the separation among classes of documents in the collection. Filter methods analyze the intrinsic properties of the documents and select highly ranked features according to a criterion that is independent of the clustering task. Unsupervised feature selection methods do not use class labels to guide the selection of features. The proposed method assigns a score to each term using its collection and document frequencies. The collection frequency of a term is the total number of times it appears in the document collection, and its document frequency is the number of documents in which it appears. Empirical evaluations showed that, compared with other unsupervised feature selection methods, the proposed method is both effective in selecting the features that give the best clustering performance and less computationally complex.
Subjects: VECTOR spaces; COMPUTATIONAL complexity; FEATURE selection; CLUSTER analysis (Statistics); SUBSPACES (Mathematics)
Publication: International Review on Computers & Software, 2014, Vol 9, Issue 10, p1692
ISSN: 1828-6003
Publication type: Article
DOI: 10.15866/irecos.v9i10.3894
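
The two quantities defined in the abstract, collection frequency and document frequency, can be illustrated with a minimal Python sketch. The function name `term_frequencies`, the lowercase whitespace tokenizer, and the sample documents are assumptions made for illustration only. The sketch does not reproduce the paper's scoring formula, which the abstract does not state.

```python
from collections import Counter


def term_frequencies(documents):
    """Return (collection_frequency, document_frequency) for a list of texts.

    Hypothetical helper: tokenization is plain lowercase whitespace
    splitting, which is an assumption, not the paper's preprocessing.
    """
    collection_freq = Counter()
    document_freq = Counter()
    for text in documents:
        tokens = text.lower().split()
        # Collection frequency counts every occurrence of a term.
        collection_freq.update(tokens)
        # Document frequency counts a term at most once per document.
        document_freq.update(set(tokens))
    return collection_freq, document_freq


# Illustrative sample collection (not data from the paper).
docs = [
    "clustering text documents",
    "text clustering with text features",
    "feature selection for documents",
]
cf, df = term_frequencies(docs)
print(cf["text"], df["text"])  # prints: 3 2
```

Here "text" occurs three times in total but in only two documents, so its collection frequency exceeds its document frequency. A term's document frequency can never be larger than its collection frequency.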
