Similarity Measures for Text Document Clustering

Anna Huang
Department of Computer Science
The University of Waikato, Hamilton, New Zealand
lh92@waikato.ac.nz

ABSTRACT
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized as more suitable than hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy. In this paper, we compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Our experiments use the standard K-means algorithm, and we report results on seven text document datasets and five distance/similarity measures that have been most commonly used in text clustering.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Clustering; I.5.3 [Clustering]: Similarity measures

General Terms
Performance

Keywords
Similarity measures, partitional clustering, text clustering

This paper was published in the proceedings of the New Zealand Computer Science Research Student Conference 2008. Copyright is held by the author/owner(s). NZCSRSC 2008, April 2008, Christchurch, New Zealand.

1. INTRODUCTION
We are facing an ever-increasing volume of text documents. The abundant texts flowing over the Internet, huge collections of documents in digital libraries and repositories, and digitized personal information such as blog articles and emails are piling up quickly every day. These have brought challenges for the effective and efficient organization of text documents.

Clustering in general is an important and useful technique that automatically organizes a collection with a substantial number of data objects into a much smaller number of coherent groups [8, 20]. In the particular scenario of text documents, clustering has proven to be an effective approach for quite some time, and an interesting research problem as well. It is becoming even more interesting and demanding with the development of the World Wide Web and the evolution of Web 2.0. For example, results returned by search engines are clustered to help users quickly identify and focus on the relevant set of results. Customer comments are clustered in many online stores, such as Amazon.com, to provide collaborative recommendations. In collaborative bookmarking or tagging, clusters of users that share certain traits are identified by their annotations.

Text document clustering groups similar documents into coherent clusters, while documents that are different are separated into different clusters. However, the definition of a pair of documents being similar or different is not always clear and normally varies with the actual problem setting. For example, when clustering research papers, two documents are regarded as similar if they share similar thematic topics. When clustering is employed on web sites, we are usually more interested in clustering the component pages according to the type of information that is presented in the page. For instance, when dealing with universities' web sites, we may want to separate professors' home pages from students' home pages, and pages for courses from pages for research projects.
This kind of clustering can benefit further analysis and use of the dataset, such as information retrieval and information extraction, by grouping similar types of information sources together.

Accurate clustering requires a precise definition of the closeness between a pair of objects, in terms of either their pair-wise similarity or distance. A variety of similarity or distance measures have been proposed and widely applied, such as cosine similarity and the Jaccard correlation coefficient. Meanwhile, similarity is often conceived in terms of dissimilarity or distance as well [15]. Measures such as Euclidean distance and relative entropy have been applied in clustering to calculate the pair-wise distances.

Given the diversity of similarity and distance measures available, their effectiveness in text document clustering is still not clear. Although Strehl et al. compared the effectiveness of a number of measures [17], our experiments extend their work by including more measures and experimental datasets, such as the averaged Kullback-Leibler divergence, which has shown its effectiveness in clustering text and has attracted considerable research interest recently. More specifically, we evaluate five measures with empirical experiments: Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient and averaged Kullback-Leibler divergence. Each of the measures is discussed further in Section 3.

In order to come up with a sound conclusion we have performed an empirical evaluation with seven data sets that each have different characteristics. They contain such things as newspaper articles, newsgroup posts, research papers, and web pages (see Table 1). They all come with a set of categorizing labels, with one category attached to each document. These pre-assigned labels are very useful for cluster validation; we use them to measure the consistency between the resulting clusters and the categories created by human experts. We use two measures to evaluate the overall quality of clustering solutions, purity and entropy, which are commonly used in clustering [23, 22]. Section 5.2 further explains the evaluation approaches. However, manually assigned labels are normally not available in clustering, and in these cases other measures such as within-cluster distances and between-cluster distances [13] can be used for evaluation. These are not used in this paper because all the datasets already have labels.

This paper is organized as follows. The next section describes the document representation used in the experiments. Section 3 discusses the similarity measures and their semantics. Section 4 presents the K-means clustering algorithm, and Section 5 explains experiment settings, evaluation approaches, results and analysis. We point to some related work in Section 6, and Section 7 concludes and discusses future work.

2. DOCUMENT REPRESENTATION
There are several ways to model a text document. For example, it can be represented as a bag of words, where words are assumed to appear independently and their order is immaterial. The bag of words model is widely used in information retrieval and text mining [21]. Words are counted in the bag, which differs from the mathematical definition of a set. Each word corresponds to a dimension in the resulting data space and each document then becomes a vector consisting of non-negative values on each dimension.
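To make the bag of words model concrete, the short Python sketch below builds raw term-frequency vectors for a toy corpus. The tokenizer, the example documents and the function name are illustrative placeholders rather than part of the original experimental setup.

```python
from collections import Counter

def term_frequency_vectors(documents):
    """Bag of words model: map each document to a vector of raw term counts.

    Word order is ignored; every distinct term becomes one dimension and the
    value on that dimension is how often the term occurs in the document.
    """
    # Very naive whitespace tokenization, just to keep the sketch self-contained.
    tokenized = [doc.lower().split() for doc in documents]

    # The term set T = {t1, ..., tm}: all distinct terms occurring in the collection D.
    terms = sorted({t for doc in tokenized for t in doc})

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] for t in terms])  # tf(d, t) for every t in T
    return terms, vectors

if __name__ == "__main__":
    docs = ["clustering groups similar documents",
            "similar documents share similar topics",
            "web pages form another kind of document"]
    terms, vectors = term_frequency_vectors(docs)
    print(terms)
    print(vectors)
```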
Here we use the frequency of each term as its weight, which means terms that appear more frequently are more important and descriptive for the document.

Let D = {d_1, ..., d_n} be a set of documents and T = {t_1, ..., t_m} the set of distinct terms occurring in D. We discuss more precisely what we mean by "terms" below; for the moment just assume they are words. A document is then represented as an m-dimensional vector $\vec{t_d}$. Let tf(d, t) denote the frequency of term t ∈ T in document d ∈ D. Then the vector representation of a document d is

$$\vec{t_d} = (tf(d, t_1), \ldots, tf(d, t_m)).$$

Although more frequent words are assumed to be more important as mentioned above, this is usually not the case in practice. For example, words like a and the are probably the most frequent words that appear in English text, but neither is descriptive nor important for the document's subject. In fact, more complicated strategies such as the tfidf weighting scheme described below are normally used instead.

[Figure 1: Angle between documents]

With documents represented as vectors, we measure the degree of similarity of two documents as the correlation between their corresponding vectors, which can be further quantified as the cosine of the angle between the two vectors. Figure 1 shows the angle in two-dimensional space, but in practice the document space usually has tens of thousands of dimensions. Some useful properties of the cosine measure are discussed in Section 3.3.

Terms are basically words, but we applied several standard transformations to the basic term vector representation. First, we removed stop words. These are words that are non-descriptive for the topic of a document, such as a, and, are and do. Following common practice, we used the stop word list implemented in the Weka machine learning workbench, which contains 527 stop words. Second, words were stemmed using Porter's suffix-stripping algorithm [14], so that words with different endings are mapped onto a single word. For example, production, produce, produces and product will be mapped to the stem produc. The underlying assumption is that different morphological variations of words with the same root/stem are thematically similar and should be treated as a single word. Third, we considered the effect of including infrequent terms in the document representation on the overall clustering performance and decided to discard words that appear with less than a given threshold frequency. The rationale for discarding infrequent terms is that in many cases they are not very descriptive of the document's subject and make little contribution to the similarity between two documents. Meanwhile, including rare terms can also introduce noise into the clustering process and make similarity computation more expensive. Consequently, we select the top 2000 words ranked by their weights and use them in our experiments.
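The following Python sketch illustrates this preprocessing pipeline under a few assumptions: NLTK's PorterStemmer stands in for Porter's suffix-stripping algorithm, a tiny hand-written stop word list replaces Weka's 527-word list, and terms are ranked by simple collection frequency rather than by the weights used in the paper.

```python
from collections import Counter

from nltk.stem import PorterStemmer  # stand-in for Porter's suffix-stripping algorithm

# A tiny illustrative stop word list; the experiments used Weka's 527-word list.
STOP_WORDS = {"a", "an", "and", "are", "do", "is", "of", "the", "to"}

def preprocess(tokenized_docs, min_df=2, max_terms=2000):
    """Remove stop words, stem the remaining words, drop rare terms and keep
    only the `max_terms` most frequent stems.

    Ranking by collection frequency is a simplification; the paper selects the
    top 2000 words ranked by their weights.
    """
    stemmer = PorterStemmer()
    stemmed = [
        [stemmer.stem(w) for w in doc if w not in STOP_WORDS]
        for doc in tokenized_docs
    ]

    # Document frequency of each stem: in how many documents it occurs.
    df = Counter(t for doc in stemmed for t in set(doc))
    # Collection frequency of the stems that pass the rarity threshold.
    cf = Counter(t for doc in stemmed for t in doc if df[t] >= min_df)

    vocabulary = [t for t, _ in cf.most_common(max_terms)]
    vocab_set = set(vocabulary)
    return [[t for t in doc if t in vocab_set] for doc in stemmed], vocabulary
```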
In the clustering process, we also need to compare the dissimilarity/similarity between two clusters, or between a cluster and an object. In hierarchical clustering this is normally computed as the complete-link, single-link or average-link distance [8]. However, in partitional clustering algorithms, a cluster is usually represented with a centroid object. For example, in the K-means algorithm the centroid of a cluster is the average of all the objects in the cluster; that is, the centroid's value in each dimension is the arithmetic mean of that dimension over all the objects in the cluster. Let C be a set of documents. Its centroid is defined as

$$\vec{t_C} = \frac{1}{|C|} \sum_{\vec{t_d} \in C} \vec{t_d},$$

which is the mean value of all term vectors in the set. Moreover, we normalize the vectors to a unified length to avoid long documents dominating the cluster.

As mentioned previously, the most frequent terms are not necessarily the most informative ones. On the contrary, terms that appear frequently in a small number of documents but rarely in the other documents tend to be more relevant and specific for that particular group of documents, and therefore more useful for finding similar documents. In order to capture these terms and reflect their importance, we transform the basic term frequencies tf(d, t) into the tfidf (term frequency and inverse document frequency) weighting scheme. Tfidf weighs the frequency of a term t in a document d with a factor that discounts its importance according to its appearances in the whole document collection, which is defined as:

$$tfidf(d, t) = tf(d, t) \times \log\left(\frac{|D|}{df(t)}\right).$$

Here df(t) is the number of documents in which term t appears. In subsequent experiments we use the tfidf value instead of the absolute term frequency of each term to build term vectors. To generalize, we use $w_{t,d}$ to denote the weight of term t in document d in the following sections.
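A minimal sketch of the tfidf weighting and centroid computation described above is given below; it assumes plain Python lists as vectors and applies unit-length normalization, but it is an illustration rather than the exact implementation used in the experiments.

```python
import math

def tfidf_vectors(tf_vectors):
    """Turn raw term-frequency vectors into tfidf-weighted, length-normalized vectors.

    tfidf(d, t) = tf(d, t) * log(|D| / df(t)), where df(t) is the number of
    documents containing term t. Vectors are then normalized to unit length
    so that long documents do not dominate a cluster.
    """
    n_docs = len(tf_vectors)
    n_terms = len(tf_vectors[0])

    # df(t): number of documents with a non-zero count for term t.
    df = [sum(1 for vec in tf_vectors if vec[t] > 0) for t in range(n_terms)]

    weighted = []
    for vec in tf_vectors:
        w = [vec[t] * math.log(n_docs / df[t]) if df[t] else 0.0 for t in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        weighted.append([x / norm for x in w] if norm else w)
    return weighted

def centroid(vectors):
    """Centroid of a cluster: the arithmetic mean of its member vectors."""
    size = len(vectors)
    return [sum(dim) / size for dim in zip(*vectors)]
```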
3. SIMILARITY MEASURES
Before clustering, a similarity/distance measure must be determined. The measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics that are believed to distinguish the clusters embedded in the data. In many cases, these characteristics are dependent on the data or the problem context at hand, and there is no measure that is universally best for all kinds of clustering problems. Moreover, choosing an appropriate similarity measure is also crucial for cluster analysis, especially for a particular type of clustering algorithm. For example, density-based clustering algorithms, such as DBScan [4], rely heavily on the similarity computation. Density-based clustering finds clusters as dense areas in the data set, and the density of a given point is in turn estimated as the closeness of the corresponding data object to its neighbouring objects. Recalling that closeness is quantified as the distance/similarity value, we can see that a large number of distance/similarity computations are required for finding dense areas and estimating the cluster assignment of new data objects. Therefore, understanding the effectiveness of different measures is of great importance in helping to choose the best one.

In general, similarity/distance measures map the distance or similarity between the symbolic descriptions of two objects into a single numeric value, which depends on two factors: the properties of the two objects and the measure itself. In order to make the results of this study comparable to previous research, we include all the measures that were tested in [17] and add another one, the averaged Kullback-Leibler divergence. These five measures are discussed below. Different measures not only result in different final partitions, but also impose different requirements on the same clustering algorithm, as we will see in Section 4.

3.1 Metric
Not every distance measure is a metric. To qualify as a metric, a measure d must satisfy the following four conditions. Let x and y be any two objects in a set and d(x, y) be the distance between x and y.

1. The distance between any two points must be non-negative, that is, d(x, y) ≥ 0.
2. The distance between two objects must be zero if and only if the two objects are identical, that is, d(x, y) = 0 if and only if x = y.
3. Distance must be symmetric, that is, the distance from x to y is the same as the distance from y to x, i.e. d(x, y) = d(y, x).
4. The measure must satisfy the triangle inequality, which is d(x, z) ≤ d(x, y) + d(y, z).

3.2 Euclidean Distance
Euclidean distance is a standard metric for geometrical problems. It is the ordinary distance between two points and can be easily measured with a ruler in two- or three-dimensional space. Euclidean distance is widely used in clustering problems, including clustering text. It satisfies all four of the above conditions and therefore is a true metric. It is also the default distance measure used with the K-means algorithm.

Measuring distance between text documents, given two documents d_a and d_b represented by their term vectors $\vec{t_a}$ and $\vec{t_b}$ respectively, the Euclidean distance of the two documents is defined as

$$D_E(\vec{t_a}, \vec{t_b}) = \left( \sum_{t=1}^{m} |w_{t,a} - w_{t,b}|^2 \right)^{1/2},$$

where the term set is T = {t_1, ..., t_m}. As mentioned previously, we use the tfidf value as term weights, that is, $w_{t,a} = tfidf(d_a, t)$.

3.3 Cosine Similarity
When documents are represented as term vectors, the similarity of two documents corresponds to the correlation between the vectors. This is quantified as the cosine of the angle between vectors, that is, the so-called cosine similarity. Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications [21] and in clustering [9]. Given two documents $\vec{t_a}$ and $\vec{t_b}$, their cosine similarity is

$$SIM_C(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}| \times |\vec{t_b}|},$$

where $\vec{t_a}$ and $\vec{t_b}$ are m-dimensional vectors over the term set T = {t_1, ..., t_m}. Each dimension represents a term with its weight in the document, which is non-negative. As a result, the cosine similarity is non-negative and bounded in [0, 1].

An important property of the cosine similarity is its independence of document length. For example, combining two identical copies of a document d to get a new pseudo document d', the cosine similarity between d and d' is 1, which means that these two documents are regarded as identical. Meanwhile, given another document l, d and d' will have the same similarity value to l, that is, $sim(\vec{t_d}, \vec{t_l}) = sim(\vec{t_{d'}}, \vec{t_l})$. In other words, documents with the same composition but different totals will be treated identically. Strictly speaking, this does not satisfy the second condition of a metric, because after all the combination of two copies is a different object from the original document. However, in practice, when the term vectors are normalized to a unit length, the representations of d and d' are the same.
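As an illustration, the two measures defined so far can be written in a few lines of Python; the helper names are ours, and the vectors are assumed to be tfidf-weighted lists of equal length.

```python
import math

def euclidean_distance(a, b):
    """D_E(t_a, t_b) = (sum over terms of |w_ta - w_tb|^2)^(1/2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """SIM_C(t_a, t_b) = (t_a . t_b) / (|t_a| * |t_b|); in [0, 1] for non-negative weights."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

For instance, `cosine_similarity(v, [2 * x for x in v])` returns 1.0 for any non-zero vector `v`, which demonstrates the length independence discussed above.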
3.4 Jaccard Coefficient
The Jaccard coefficient, which is sometimes referred to as the Tanimoto coefficient, measures similarity as the intersection divided by the union of the objects. For text documents, the Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared. The formal definition is:

$$SIM_J(\vec{t_a}, \vec{t_b}) = \frac{\vec{t_a} \cdot \vec{t_b}}{|\vec{t_a}|^2 + |\vec{t_b}|^2 - \vec{t_a} \cdot \vec{t_b}}.$$

The Jaccard coefficient is a similarity measure and ranges between 0 and 1. It is 1 when $\vec{t_a} = \vec{t_b}$ and 0 when $\vec{t_a}$ and $\vec{t_b}$ are disjoint, where 1 means the two objects are the same and 0 means they are completely different. The corresponding distance measure is $D_J = 1 - SIM_J$ and we will use $D_J$ instead in subsequent experiments.

3.5 Pearson Correlation Coefficient
Pearson's correlation coefficient is another measure of the extent to which two vectors are related. There are different forms of the Pearson correlation coefficient formula. Given the term set T = {t_1, ..., t_m}, a commonly used form is

$$SIM_P(\vec{t_a}, \vec{t_b}) = \frac{m \sum_{t=1}^{m} w_{t,a} \times w_{t,b} - TF_a \times TF_b}{\sqrt{\left[m \sum_{t=1}^{m} w_{t,a}^2 - TF_a^2\right]\left[m \sum_{t=1}^{m} w_{t,b}^2 - TF_b^2\right]}},$$

where $TF_a = \sum_{t=1}^{m} w_{t,a}$ and $TF_b = \sum_{t=1}^{m} w_{t,b}$.

This is also a similarity measure. However, unlike the other measures, it ranges from +1 to −1 and it is 1 when $\vec{t_a} = \vec{t_b}$. In subsequent experiments we use the corresponding distance measure, which is $D_P = 1 - SIM_P$ when $SIM_P \geq 0$ and $D_P = |SIM_P|$ when $SIM_P < 0$.

3.6 Averaged Kullback-Leibler Divergence
In information-theoretic clustering, a document is considered as a probability distribution of terms. The similarity of two documents is measured as the distance between the two corresponding probability distributions. The Kullback-Leibler divergence (KL divergence), also called the relative entropy, is a widely applied measure for evaluating the differences between two probability distributions. Given two distributions P and Q, the KL divergence from distribution P to distribution Q is defined as

$$D_{KL}(P \| Q) = P \log\left(\frac{P}{Q}\right).$$

In the document scenario, the divergence between two distributions of words is:

$$D_{KL}(\vec{t_a} \| \vec{t_b}) = \sum_{t=1}^{m} w_{t,a} \times \log\left(\frac{w_{t,a}}{w_{t,b}}\right).$$

However, unlike the previous measures, the KL divergence is not symmetric, i.e. $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$. Therefore it is not a true metric. As a result, we use the averaged KL divergence instead, which is defined as

$$D_{AvgKL}(P \| Q) = \pi_1 D_{KL}(P \| M) + \pi_2 D_{KL}(Q \| M),$$

where $\pi_1 = \frac{P}{P+Q}$, $\pi_2 = \frac{Q}{P+Q}$ and $M = \pi_1 P + \pi_2 Q$. For documents, the averaged KL divergence can be computed with the following formula:

$$D_{AvgKL}(\vec{t_a} \| \vec{t_b}) = \sum_{t=1}^{m} \left( \pi_1 \times D(w_{t,a} \| w_t) + \pi_2 \times D(w_{t,b} \| w_t) \right),$$

where $\pi_1 = \frac{w_{t,a}}{w_{t,a}+w_{t,b}}$, $\pi_2 = \frac{w_{t,b}}{w_{t,a}+w_{t,b}}$ and $w_t = \pi_1 \times w_{t,a} + \pi_2 \times w_{t,b}$.

The average weighting between two vectors ensures symmetry, that is, the divergence from document i to document j is the same as the divergence from document j to document i. The averaged KL divergence has recently been applied to clustering text documents, such as in the family of Information Bottleneck clustering algorithms [18], to good effect.
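The remaining three measures, expressed as the distance values actually fed to the clustering algorithm, might look as follows in Python. The small `eps` smoothing of zero weights in the averaged KL divergence is our own assumption; the paper does not specify how zero term weights are handled.

```python
import math

def jaccard_distance(a, b):
    """D_J = 1 - SIM_J, with SIM_J = (a.b) / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - (dot / denom if denom else 0.0)

def pearson_distance(a, b):
    """Pearson correlation turned into a distance:
    D_P = 1 - SIM_P if SIM_P >= 0, else |SIM_P|."""
    m = len(a)
    tf_a, tf_b = sum(a), sum(b)
    num = m * sum(x * y for x, y in zip(a, b)) - tf_a * tf_b
    den = math.sqrt((m * sum(x * x for x in a) - tf_a ** 2) *
                    (m * sum(y * y for y in b) - tf_b ** 2))
    sim = num / den if den else 0.0
    return 1.0 - sim if sim >= 0 else abs(sim)

def averaged_kl_divergence(a, b, eps=1e-12):
    """Symmetric, averaged KL divergence between two weight vectors.

    For every term t: pi1 = w_ta/(w_ta+w_tb), pi2 = w_tb/(w_ta+w_tb),
    w_t = pi1*w_ta + pi2*w_tb, and the contributions
    pi1*D(w_ta||w_t) + pi2*D(w_tb||w_t) are summed over all terms.
    The eps smoothing of zero weights is an illustrative choice.
    """
    total = 0.0
    for w_ta, w_tb in zip(a, b):
        w_ta, w_tb = w_ta + eps, w_tb + eps
        pi1 = w_ta / (w_ta + w_tb)
        pi2 = w_tb / (w_ta + w_tb)
        w_t = pi1 * w_ta + pi2 * w_tb
        total += pi1 * w_ta * math.log(w_ta / w_t) + pi2 * w_tb * math.log(w_tb / w_t)
    return total
```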
4. CLUSTERING ALGORITHM
For all subsequent experiments, the standard K-means algorithm is chosen as the clustering algorithm. This is an iterative partitional clustering process that aims to minimize the least squares error criterion [15]. As mentioned previously, partitional clustering algorithms have been recognized as better suited for handling large document datasets than hierarchical ones, due to their relatively low computational requirements [16, 9, 3].

The standard K-means algorithm works as follows. Given a set of data objects D and a pre-specified number of clusters k, k data objects are randomly selected to initialize k clusters, each one being the centroid of a cluster. The remaining objects are then assigned to the cluster represented by the nearest or most similar centroid. Next, new centroids are re-computed for each cluster and in turn all documents are re-assigned based on the new centroids. This step iterates until a converged and fixed solution is reached, where all data objects remain in the same cluster after an update of centroids.

The generated clustering solutions are locally optimal for the given data set and the initial seeds. Different choices of initial seed sets can result in very different final partitions. Methods for finding good starting points have been proposed [1]. However, we use the basic K-means algorithm because optimizing the clustering is not the focus of this paper.

The K-means algorithm works with distance measures, and basically aims to minimize the within-cluster distances. Therefore, similarity measures do not directly fit into the algorithm, because smaller values indicate dissimilarity. The Euclidean distance and the averaged KL divergence are distance measures, while the cosine similarity, Jaccard coefficient and Pearson coefficient are similarity measures. We apply a simple transformation to convert the similarity measures to distance values. Because both cosine similarity and Jaccard coefficient are bounded in [0, 1] and monotonic, we take $D = 1 - SIM$ as the corresponding distance value. For the Pearson coefficient, which ranges from −1 to +1, we take $D = 1 - SIM$ when $SIM \geq 0$ and $D = |SIM|$ when $SIM < 0$.
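A compact sketch of the K-means procedure just described, with the distance function passed in as a parameter so that any of the five measures (after conversion to a distance) can be plugged in, is shown below; it is a simplified illustration, not the exact implementation used for the experiments.

```python
import random

def k_means(vectors, k, distance, max_iter=100, seed=0):
    """Standard K-means with a pluggable distance function.

    Similarity measures are converted to distances beforehand (e.g. D = 1 - SIM)
    so that assigning each document to its closest centroid minimizes
    within-cluster distances.
    """
    rng = random.Random(seed)
    # k randomly selected documents serve as the initial centroids (seeds).
    centroids = [list(v) for v in rng.sample(vectors, k)]

    assignment = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no document changed its cluster
        assignment = new_assignment

        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids
```

A call such as `k_means(vectors, k=4, distance=euclidean_distance)` (or with `jaccard_distance`, `pearson_distance`, etc. from the earlier sketches) then yields one clustering solution per measure.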
5. EXPERIMENT
It is very difficult to conduct a systematic study comparing the impact of similarity measures on cluster quality, because objectively evaluating cluster quality is difficult in itself. In practice, manually assigned category labels are usually used as a baseline criterion for evaluating clusters. As a result, the clusters, which are generated in an unsupervised way, are compared to the pre-defined category structure, which is normally created by human experts. This kind of evaluation assumes that the objective of clustering is to replicate human thinking, so a clustering solution is good if the clusters are consistent with the manually created categories. However, in practice datasets often come without any manually created categories, and this is the exact point where clustering can help. In this case, measures like cluster coherence in terms of the within-cluster distances and the separation between clusters in terms of the between-cluster distances can be used for evaluation [13]. In order to make the results of this investigation comparable to previous research, we choose datasets that have been commonly used for evaluating clustering as the experiment datasets. All of these datasets have appropriate pre-assigned category labels. The rest of this section first describes the characteristics of the datasets, then explains the evaluation measures, and finally presents and analyzes the experiment results.

5.1 Datasets
The seven data sets in Table 1 were chosen for the empirical experiment. They have been widely used for evaluating feature selection techniques, classification and clustering. Except for the 20news and webkb datasets, all are from the CLUTO package (http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/datasets.tar.gz). These datasets differ in terms of document type, number of categories, average category size, and subjects. In order to ensure diversity, the datasets are from different sources, some containing newspaper articles, some containing newsgroup posts, some being web pages and the remainder being academic papers. The characteristics and sources of these datasets are summarized in Table 1. The smallest contains 1504 documents and the largest contains 8282 documents. The number of classes in each dataset varies from 4 to 20. As mentioned previously, we removed stop words and applied stemming as described in Section 2, and only the top 2000 words are selected.

Table 1: Summary of datasets to evaluate the various similarity measures

Data     Documents  Classes  Terms   Average Class Size  Source                                    Description
20news   18828      20       28553   1217                20news-18828                              Newsgroup posts
classic  7089       4        12009   1774                CACM/CISI/CRANFIELD/MEDLINE               Academic papers
hitech   2301       6        13170   384                 San Jose Mercury (TREC, TIPSTER Vol. 3)   Newspaper articles
re0      1504       13       2886    131                 Reuters-21578 ([10])                      Newsgroup posts
tr41     878        10       7454    88                  TREC5 and TREC6 (TREC 1999)               Newspaper articles
wap      1560       20       8460    78                  WebACE ([6])                              Web pages
webkb    8282       7        20682   1050                Web Knowledge Base ([2])                  Web pages

More specifically, the 20news dataset contains newsgroup articles from 20 newsgroups on a variety of topics including politics, computers, etc. The classic dataset contains abstracts of scientific papers from four sources: CACM, CISI, CRAN and MED. It has been widely used to evaluate information retrieval systems. The hitech dataset consists of San Jose Mercury newspaper articles on six topics (computers, electronics, health, medical, research, and technology) and was part of the TREC collection [19]. The tr41 data set is also derived from the TREC-5, TREC-6 and TREC-7 collections. The wap and the webkb datasets both consist of web pages. The wap dataset is from the WebAce project and contains web pages from the Yahoo! subject hierarchy; the webkb dataset is from the Web Knowledge Base project and contains web pages from several universities about courses, students, staff, departments, projects and the like. The re0 dataset contains newspaper articles and has been widely used for evaluating clustering algorithms.

For each data set, we experimented with the different similarity measures, giving 5 × 7 = 35 experiments in total. Moreover, each experiment was run 10 times and the results are the averaged values over the 10 runs, with each run using a different initial seed set.

5.2 Evaluation
For each of the above datasets, we obtained a clustering result from the K-means algorithm. The number of clusters is set to be the same as the number of pre-assigned categories in the data set. The quality of a clustering result was evaluated using two evaluation measures, purity and entropy, which are widely used to evaluate the performance of unsupervised learning algorithms [23, 22]. To begin with, each cluster is labeled with the majority category that appears in that cluster. Moreover, even if a category label has already been assigned to a cluster, it can still be assigned to other clusters if it is the dominant category there. Based on the cluster labels, the purity and entropy measures are computed as follows.

The purity measure evaluates the coherence of a cluster, that is, the degree to which a cluster contains documents from a single category. Given a particular cluster C_i of size n_i, the purity of C_i is formally defined as

$$P(C_i) = \frac{1}{n_i} \max_h (n_i^h),$$

where $\max_h(n_i^h)$ is the number of documents that are from the dominant category in cluster C_i and $n_i^h$ represents the number of documents from cluster C_i assigned to category h. Purity can be interpreted as the classification rate under the assumption that all samples of a cluster are predicted to be members of the dominant class for that cluster. For an ideal cluster, which only contains documents from a single category, the purity value is 1. In general, the higher the purity value, the better the quality of the cluster.

The entropy measure evaluates the distribution of categories in a given cluster. The entropy of a cluster C_i with size n_i is defined to be

$$E(C_i) = -\frac{1}{\log c} \sum_{h=1}^{c} \frac{n_i^h}{n_i} \log\left(\frac{n_i^h}{n_i}\right),$$

where c is the total number of categories in the data set and $n_i^h$ is the number of documents from the hth class that were assigned to cluster C_i. The entropy measure is more comprehensive than purity because, rather than just considering the number of objects in and not in the dominant category, it considers the overall distribution of all the categories in a given cluster. Contrary to the purity measure, for an ideal cluster with documents from only a single category, the entropy of the cluster will be 0. In general, the smaller the entropy value, the better the quality of the cluster. Moreover, the averaged entropy of the overall solution is defined to be the weighted sum of the individual entropy values of each cluster, that is,

$$Entropy = \sum_{i=1}^{k} \frac{n_i}{n} E(C_i).$$

In many cases the two measures seem very similar; however, their meanings actually differ. For example, the two matrices below represent two clustering solutions, with each row being a cluster and each column being a pre-defined category, so v(i, j) indicates the number of documents in cluster i that are from category j. Intuitively, the solution represented by the first matrix is more balanced and therefore better than the solution represented by the second matrix. However, the two solutions have the same purity value, which is 0.5. Meanwhile, the entropy for each individual cluster and the overall solution is 0.40 for the first matrix and 0.57 for the second matrix. This indicates that the first solution is better than the other, which is consistent with our observation.

$$\begin{pmatrix} 3 & 1 & 1 & 1 \\ 1 & 3 & 1 & 1 \\ 1 & 1 & 3 & 1 \\ 1 & 1 & 1 & 3 \end{pmatrix} \qquad \begin{pmatrix} 3 & 0 & 2 & 1 \\ 2 & 3 & 1 & 0 \\ 2 & 0 & 3 & 1 \\ 1 & 0 & 2 & 3 \end{pmatrix}$$
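A small Python sketch of these two evaluation measures is given below; the example labels are taken from cluster[1] and cluster[2] of Table 3 purely for illustration.

```python
import math
from collections import Counter

def purity(cluster_labels):
    """Purity of one cluster: fraction of its documents from the dominant category."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def entropy(cluster_labels, num_categories):
    """Entropy of one cluster, normalized by log(c) so that it lies in [0, 1]."""
    n = len(cluster_labels)
    counts = Counter(cluster_labels)
    h = -sum((cnt / n) * math.log(cnt / n) for cnt in counts.values())
    return h / math.log(num_categories)

def overall_entropy(clusters, num_categories):
    """Weighted sum of per-cluster entropies, weighted by cluster size."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * entropy(c, num_categories) for c in clusters)

if __name__ == "__main__":
    # Each cluster is the list of true category labels of its documents
    # (counts taken from rows 1 and 2 of Table 3).
    solution = [["MED"] * 823 + ["CRAN"] * 1 + ["CACM"] * 40,
                ["CISI"] * 1444 + ["CACM"] * 661 + ["MED"] * 33 + ["CRAN"] * 14]
    print([round(purity(c), 2) for c in solution])
    print(round(overall_entropy(solution, 4), 2))
```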
5.3 Results
Table 2 and Table 4 show the average purity and entropy results for each similarity/distance measure on the seven datasets. As shown in Table 2, Euclidean distance performs worst, while the performance of the other four measures is quite similar. On average, the Jaccard and Pearson measures are slightly better at generating more coherent clusters, which means the clusters have higher purity scores.

Table 2: Purity Results

Data     Euclidean  Cosine  Jaccard  Pearson  KLD
20news   0.1        0.5     0.5      0.5      0.38
classic  0.56       0.85    0.98     0.85     0.84
hitech   0.29       0.54    0.51     0.56     0.53
re0      0.53       0.78    0.75     0.78     0.77
tr41     0.71       0.71    0.72     0.78     0.64
wap      0.32       0.62    0.63     0.61     0.61
webkb    0.42       0.68    0.57     0.67     0.75

The best result is achieved on the classic dataset. A closer look at this dataset reveals that its categories are well separated. Table 3 shows one partition generated by the KLD measure, which has the lowest purity score on this dataset (except for the Euclidean distance measure). The shape of the pre-defined category structure is still clear in the partition. Meanwhile, the Jaccard coefficient and the averaged KL divergence outperform the other measures by a considerable margin on the wap dataset and the webkb dataset respectively.

Table 3: Clustering Results from the KL Divergence measure

Clusters    CRAN   MED   CACM   CISI   Label
cluster[1]  1      823   40     0      MED(823)
cluster[2]  14     33    661    1444   CISI(1444)
cluster[3]  9      171   2375   15     CACM(2375)
cluster[4]  1374   6     127    1      CRAN(1374)

The overall entropy value for each measure is shown in Table 4.
Table 4: Entropy results

Data     Euclidean  Cosine  Jaccard  Pearson  KLD
20news   0.95       0.49    0.51     0.49     0.54
classic  0.78       0.29    0.06     0.27     0.3
hitech   0.92       0.64    0.68     0.65     0.63
re0      0.6        0.27    0.33     0.26     0.25
tr41     0.62       0.33    0.34     0.3      0.38
wap      0.75       0.39    0.4      0.39     0.4
webkb    0.93       0.6     0.74     0.61     0.51

As shown in Table 4, the overall entropy values are close, sometimes with only a 1% difference. However, there are still some results worth noticing. For example, similarly to the purity results above, the Euclidean distance again proves to be an ineffective metric for modeling the similarity between documents. The averaged KL divergence and Pearson coefficient tend to outperform the cosine similarity and the Jaccard coefficient. Except for the classic dataset, either the KL divergence measure or the Pearson coefficient has the best result on a given dataset. This means that these two measures are more effective in finding more balanced cluster structures.

Considering that the above results all seem very close, in order to test the statistical significance between different solutions, we used the t-test to compare the solutions. First, each solution is represented with a set of clusters and each individual cluster is in turn represented by its centroid vector. Meanwhile, the original dataset is represented as a set of pre-defined categories, and each category is also represented with its centroid. Then both the original dataset and the generated clustering solution are transformed into matrices, with each row being a centroid object and each column being a distinct term from the term set. Finally, the two matrices are tested with the t-test function in the R package (http://www.r-project.org/). The t-test results for any given pair of matrices show that there is a true difference between the two matrices, because all the p-values are less than 0.9.

6. RELATED WORK
The work most similar to this paper is [17], which compares four similarity measures on a collection of Yahoo! news pages. The present study differs in two respects. First, we extended the experiments by including the averaged KL divergence. Our results broadly agree with Strehl et al.'s [17]. We both found that the performance of the cosine similarity, Jaccard correlation and Pearson's coefficient are very close, and are significantly better than the Euclidean distance measure. In addition, we found that the KL divergence measure is comparable to, and in some cases better than, the others. Second, we also experimented with other types of data sets in addition to web page documents.

As for the averaged Kullback-Leibler divergence, Lin gave a clear explanation in [11]. This measure has more frequently been used to assess the similarity between words, especially for applications such as word sense disambiguation. It is only recently that this measure has been utilized for document clustering. Information-theoretic clustering algorithms such as the Information Bottleneck method [18] rely on this measure and have shown considerable improvement in overall performance.

Meanwhile, enhanced representation of documents has been a promising direction recently, especially the incorporation of semantic information and taking account of the semantic relatedness between documents. A number of researchers have reported results on these aspects. For example, Hotho et al. propose to extend the conventional bag of words representation with relevant terms from WordNet [7]. Experiments on a document clustering task show the effectiveness of the extended representation.
Moreover, the effectiveness of different representation strategies also depends on the type of task at hand. For example, when clustering journalistic text, proper names have been found to be a more appropriate representation of the text content [5]. This investigation differs from these strategies in that we use only the basic bag of words representation. However, combining these extended representations is likely to improve performance, and this is planned for future work.

7. CONCLUSIONS
To conclude, this investigation found that, except for the Euclidean distance measure, the other measures have comparable effectiveness for the partitional text document clustering task. The Pearson correlation coefficient and the averaged KL divergence measures are slightly better in that their resulting clustering solutions are more balanced and match the manually created category structure more closely. Meanwhile, the Jaccard and Pearson coefficient measures find more coherent clusters. Despite these differences, the overall performance of these measures is similar.

Considering the type of cluster analysis involved in this study, which is partitional and requires a similarity or distance measure, we can see that there are three components that affect the final results: the representation of the objects, the distance or similarity measure, and the clustering algorithm itself. This leads us to two directions for future work.

First, I plan to investigate the impact of using different document representations on clustering performance, and to combine the different representations with similarity measures. In particular, I will use Wikipedia as a background knowledge base, and enrich the document representation by adding related terms identified by the relationships between terms in Wikipedia. Wikipedia provides rich semantic relations between words and phrases, with extensive coverage [12]. This will also help to alleviate a problem with the bag of words document model, namely that words must co-occur literally and semantic relationships between words are neglected.

Second, I plan to investigate the effectiveness of these similarity measures with a multi-view clustering approach. In many cases we can view a given document from more than one perspective. For example, web pages intuitively provide at least three views: the content text that appears in the web page itself, the anchor text of the outgoing links that are embedded in the page, and the anchor texts from incoming links. Conventional clustering normally combines these different views (if they are taken into account at all) into a single mixed representation and uses it for clustering. We assume that dividing the mixed representation up and using the different aspects individually can provide relevant information that is well separated according to its characteristics, therefore benefiting subsequent clustering.

8. REFERENCES
[1] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Symposium on Discrete Algorithms, 2007.
[2] M. Craven, D. DiPasquo, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In AAAI-98, 1998.
[3] D. R. Cutting, J. O. Pedersen, D. R. Karger, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of ACM SIGIR, 1992.
[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on KDD, 1996.
[5] N. Friburger and D. Maurel. Textual similarity based on proper names. In Proceedings of the Workshop on Mathematical Formal Methods in Information Retrieval at the 25th ACM SIGIR Conference, 2002.
[6] E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. WebACE: A web agent for document categorization and exploration. In Proceedings of the 2nd International Conference on Autonomous Agents, 1998.
[7] A. Hotho, S. Staab, and G. Stumme. WordNet improves text document clustering. In Proceedings of the SIGIR Semantic Web Workshop, Toronto, 2003.
[8] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.
[9] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.
[10] D. D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. http://www.research.att.com/~lewis, 1999.
[11] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[12] D. Milne, O. Medelyan, and I. H. Witten. Mining domain-specific thesauri from Wikipedia: A case study. In Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI'2006), 2006.
[13] J. M. Neuhaus and J. D. Kalbfleisch. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, 54(2):638–645, June 1998.
[14] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[15] G. Salton. Automatic Text Processing. Addison-Wesley, New York, 1989.
[16] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[17] A. Strehl, J. Ghosh, and R. Mooney. Impact of similarity measures on web-page clustering. In AAAI-2000: Workshop on Artificial Intelligence for Web Search, July 2000.
[18] N. Z. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control and Computing, 1999.
[19] E. Voorhees and D. Harman. Overview of the fifth Text REtrieval Conference (TREC-5). In Proceedings of the Fifth Text REtrieval Conference (TREC-5), 1998.
[20] P. Willett. Recent trends in hierarchic document clustering: a critical review. Information Processing and Management, 24(5):577–597, 1988.
[21] R. B. Yates and B. R. Neto. Modern Information Retrieval. Addison-Wesley, New York, 1999.
[22] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the International Conference on Information and Knowledge Management, 2002.
[23] Y. Zhao and G. Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3), 2004.