Document Clustering using Word Clusters via the Information Bottleneck Method

N. Slonim and N. Tishby. In ACM SIGIR 2000, pages 208--215. ACM Press, 2000.

Abstract

We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Y_hat, maximally preserve the information on the documents. The resulting joint distribution, p(X, Y_hat), contains most of the original information about the documents, I(X; Y_hat) ~= I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
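The two-stage idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified greedy, hard-assignment merging scheme in which, at each step, the pair of clusters whose merge loses the least mutual information is joined. The function names (mutual_information, greedy_cluster_columns, double_cluster) and the toy count matrix are hypothetical and exist only for illustration. Rows stand for documents X and columns for words Y.

```python
import numpy as np

def mutual_information(joint):
    """Return I(X;Y) in bits for a joint distribution given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal over rows, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal over columns, shape (1, m)
    mask = joint > 0                        # skip zero cells: 0 * log(0) = 0
    ratio = joint[mask] / (px @ py)[mask]
    return float(np.sum(joint[mask] * np.log2(ratio)))

def greedy_cluster_columns(joint, n_clusters):
    """Greedily merge columns, each time keeping the merge that preserves
    the most mutual information with the rows (illustrative assumption)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    clusters = [[j] for j in range(joint.shape[1])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Merging two clusters adds their joint-probability columns.
                merged = np.delete(joint, b, axis=1)
                merged[:, a] = joint[:, a] + joint[:, b]
                mi = mutual_information(merged)
                if best is None or mi > best[0]:
                    best = (mi, a, b, merged)
        _, a, b, joint = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return joint, clusters

def double_cluster(joint, n_word_clusters, n_doc_clusters):
    """Cluster words first, then cluster documents on the reduced matrix."""
    reduced, word_clusters = greedy_cluster_columns(joint, n_word_clusters)
    _, doc_clusters = greedy_cluster_columns(reduced.T, n_doc_clusters)
    return reduced, word_clusters, doc_clusters

# Hypothetical toy counts: 4 documents (rows) by 6 words (columns).
counts = np.array([
    [4, 3, 5, 0, 1, 0],
    [5, 4, 3, 1, 0, 0],
    [0, 1, 0, 4, 5, 3],
    [1, 0, 0, 3, 4, 5],
])

reduced, word_clusters, doc_clusters = double_cluster(counts, 2, 2)
mi_full = mutual_information(counts)
mi_reduced = mutual_information(reduced)
print("word clusters:", word_clusters)
print("document clusters:", doc_clusters)
print("I(X;Y) =", mi_full, " I(X;Y_hat) =", mi_reduced)
```

Because merging columns can only discard information, I(X; Y_hat) never exceeds I(X; Y); the sketch simply picks the merges that keep it as close as possible. The exhaustive pair search is chosen for readability and would be too slow for a real vocabulary.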

Links and resources

BibTeX key: Slonim00documentclustering
entry type: inproceedings
booktitle: In ACM SIGIR 2000
year: 2000
pages: 208--215
publisher: ACM press
url: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.3062

Cite this publication

@inproceedings{Slonim00documentclustering,
  abstract = {We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(x, y), we first cluster the words, Y, so that the obtained word clusters, Y_hat, maximally preserve the information on the documents. The resulting joint distribution, p(X, Y_hat), contains most of the original information about the documents, I(X; Y_hat) ~= I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word clusters is preserved. Thus, we first find word clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.},
  added-at = {2009-12-14T01:17:02.000+0100},
  author = {Slonim, Noam and Tishby, Naftali},
  biburl = {https://www.bibsonomy.org/bibtex/2ee12e22bd9a34d9e8fa5c15d209caaf7/r.b.},
  booktitle = {In ACM SIGIR 2000},
  description = {Document Clustering using Word Clusters via the Information Bottleneck Method},
  interhash = {9296d4320e349858caedc904e6b7dfd3},
  intrahash = {ee12e22bd9a34d9e8fa5c15d209caaf7},
  keywords = {2009 bottleneck clustering co-clustering information seminar},
  pages = {208--215},
  publisher = {ACM press},
  timestamp = {2009-12-14T01:17:03.000+0100},
  title = {Document Clustering using Word Clusters via the Information Bottleneck Method},
  url = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.21.3062},
  year = 2000
}
