International Journal of Computer Applications (0975 – 8887), Volume 68, No. 13, April 2013

A Survey of Text Similarity Approaches

Wael H. Gomaa
Computer Science Department, Modern Academy for Computer Science & Management Technology, Cairo, Egypt

Aly A. Fahmy
Computer Science Department, Faculty of Computers and Information, Cairo University, Cairo, Egypt

ABSTRACT
Measuring the similarity between words, sentences, paragraphs and documents is an important component in various tasks such as information retrieval, document clustering, word-sense disambiguation, automatic essay scoring, short answer grading, machine translation and text summarization. This survey discusses the existing work on text similarity by partitioning it into three approaches: String-based, Corpus-based and Knowledge-based similarities. Furthermore, samples of combinations between these similarities are presented.

General Terms
Text Mining, Natural Language Processing.

Keywords
Text Similarity, Semantic Similarity, String-Based Similarity, Corpus-Based Similarity, Knowledge-Based Similarity.

1. INTRODUCTION
Text similarity measures play an increasingly important role in text-related research and applications in tasks such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short answer scoring, machine translation, text summarization and others. Finding the similarity between words is a fundamental part of text similarity, which is then used as a primary stage for sentence, paragraph and document similarities. Words can be similar in two ways: lexically and semantically. Words are similar lexically if they have a similar character sequence. Words are similar semantically if they mean the same thing, are opposites of each other, are used in the same way, are used in the same context, or if one is a type of the other. Lexical similarity is introduced in this survey through different String-Based algorithms; semantic similarity is introduced through Corpus-Based and Knowledge-Based algorithms.

String-Based measures operate on string sequences and character composition. A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. Knowledge-Based similarity is a semantic similarity measure that determines the degree of similarity between words using information derived from semantic networks. The most popular measures of each type are presented briefly.

This paper is organized as follows: Section two presents String-Based algorithms, partitioning them into two types, character-based and term-based measures. Sections three and four introduce Corpus-Based and Knowledge-Based algorithms respectively. Samples of combinations between similarity algorithms are introduced in section five, and finally section six presents the conclusion of the survey.

[Fig 1: String-Based Similarity Measures. Character-Based: LCS, Damerau-Levenshtein, Jaro, Jaro-Winkler, Needleman-Wunsch, Smith-Waterman, N-gram. Term-Based: Block Distance, Cosine Similarity, Dice's Coefficient, Euclidean Distance, Jaccard Similarity, Matching Coefficient, Overlap Coefficient.]

2. String-Based Similarity
String similarity measures operate on string sequences and character composition.
A string metric is a metric that measures similarity or dissimilarity (distance) between two text strings for approximate string matching or comparison. This survey presents the most popular string similarity measures, which are implemented in the SimMetrics package [1]. As shown in figure 1, fourteen algorithms are introduced briefly; seven of them are character-based measures, while the other seven are term-based distance measures.

2.1 Character-Based Similarity Measures
The Longest Common SubString (LCS) algorithm considers the similarity between two strings to be based on the length of the longest contiguous chain of characters that exists in both strings.

Damerau-Levenshtein defines the distance between two strings as the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters [2, 3].

Jaro is based on the number and order of the common characters between two strings; it takes into account typical spelling deviations and is mainly used in the area of record linkage [4, 5].

Jaro-Winkler is an extension of the Jaro distance; it uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length [6].

The Needleman-Wunsch algorithm is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison. It performs a global alignment to find the best alignment over the entire length of two sequences. It is suitable when the two sequences are of similar length, with a significant degree of similarity throughout [7].

Smith-Waterman is another example of dynamic programming. It performs a local alignment to find the best alignment over the conserved domain of two sequences. It is useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context [8].

An N-gram is a sub-sequence of n items from a given sequence of text. N-gram similarity algorithms compare the n-grams from each character or word in two strings. Distance is computed by dividing the number of similar n-grams by the maximal number of n-grams [9].

2.2 Term-Based Similarity Measures
Block Distance, also known as Manhattan distance, boxcar distance, absolute value distance, L1 distance or city block distance, computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Block distance between two items is the sum of the differences of their corresponding components [10].

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them.

Dice's coefficient is defined as twice the number of common terms in the compared strings divided by the total number of terms in both strings [11].

Euclidean distance, or L2 distance, is the square root of the sum of squared differences between corresponding elements of the two vectors.

Jaccard similarity is computed as the number of shared terms over the number of all unique terms in both strings [12].

The Matching Coefficient is a very simple vector-based approach which simply counts the number of terms (dimensions) on which both vectors are non-zero.

The Overlap coefficient is similar to Dice's coefficient, but considers two strings a full match if one is a subset of the other.
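To make the two families of measures concrete, the following is a minimal Python sketch of one character-based measure (plain Levenshtein edit distance, without the transposition operation of Damerau-Levenshtein) and two term-based measures (Jaccard and Dice over word sets). It is an illustrative re-implementation, not the SimMetrics code, and the example strings are invented.

```python
# Minimal sketches of one character-based and two term-based measures.
# Illustrative only; not taken from the SimMetrics package.

def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(s1: str, s2: str) -> float:
    """Shared terms over all unique terms in both strings."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2)

def dice(s1: str, s2: str) -> float:
    """Twice the number of common terms over the total number of terms."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return 2 * len(t1 & t2) / (len(t1) + len(t2))

print(levenshtein("kitten", "sitting"))           # 3
print(jaccard("a survey of text similarity",
              "a review of text similarity"))     # 4 shared / 6 unique ~ 0.67
print(dice("a survey of text similarity",
           "a review of text similarity"))        # 2*4 / (5+5) = 0.8
```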
3. Corpus-Based Similarity
Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora. A corpus is a large collection of written or spoken texts that is used for language research. Figure 2 shows the Corpus-Based similarity measures.

[Fig 2: Corpus-Based Similarity Measures: HAL, LSA, GLSA, ESA, CL-ESA, PMI-IR, SOC-PMI, NGD and DISCO (DISCO1, DISCO2).]

Hyperspace Analogue to Language (HAL) [13, 14] creates a semantic space from word co-occurrences. A word-by-word matrix is formed in which each matrix element is the strength of association between the word represented by the row and the word represented by the column. The user of the algorithm then has the option to drop out low-entropy columns from the matrix. As the text is analyzed, a focus word is placed at the beginning of a ten-word window that records which neighboring words are counted as co-occurring. Matrix values are accumulated by weighting the co-occurrence inversely proportional to the distance from the focus word; closer neighboring words are thought to reflect more of the focus word's semantics and so are weighted higher. HAL also records word-ordering information by treating the co-occurrence differently based on whether the neighboring word appeared before or after the focus word.

Latent Semantic Analysis (LSA) [15] is the most popular technique of Corpus-Based similarity. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the two vectors formed by any two rows.
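As an illustration of the LSA idea described above, the following toy Python sketch (assuming NumPy is available) builds a small word-by-document count matrix from four invented documents, applies SVD, and compares two word vectors by cosine. A real LSA system would use a much larger corpus and a weighting scheme such as TF-IDF.

```python
# Toy LSA sketch: count matrix, SVD, reduced word vectors, cosine comparison.
import numpy as np

docs = ["human machine interface",
        "user interface system",
        "graph of trees",
        "graph minors of trees"]
vocab = sorted({w for d in docs for w in d.split()})

# Word-by-document count matrix (rows = words, columns = documents).
X = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# SVD; keep the top-k dimensions to obtain reduced word vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

i, j = vocab.index("graph"), vocab.index("trees")
# Close to 1.0 here, since "graph" and "trees" have identical document profiles.
print(cosine(word_vecs[i], word_vecs[j]))
```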
Generalized Latent Semantic Analysis (GLSA) [16] is a framework for computing semantically motivated term and document vectors. It extends the LSA approach by focusing on term vectors instead of the dual document-term representation. GLSA requires a measure of semantic association between terms and a method of dimensionality reduction. The GLSA approach can combine any kind of similarity measure on the space of terms with any suitable method of dimensionality reduction. The traditional term-document matrix is used in the last step to provide the weights in the linear combination of term vectors.

Explicit Semantic Analysis (ESA) [17] is a measure used to compute the semantic relatedness between two arbitrary texts. This Wikipedia-based technique represents terms (or texts) as high-dimensional vectors; each vector entry represents the TF-IDF weight between the term and one Wikipedia article. The semantic relatedness between two terms (or texts) is expressed by the cosine measure between the corresponding vectors.

Cross-Language Explicit Semantic Analysis (CL-ESA) [18] is a multilingual generalization of ESA. CL-ESA exploits a document-aligned multilingual reference collection, such as Wikipedia, to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations.

Pointwise Mutual Information - Information Retrieval (PMI-IR) [19] is a method for computing the similarity between pairs of words; it uses AltaVista's Advanced Search query syntax to calculate probabilities. The more often two words co-occur near each other on a web page, the higher their PMI-IR similarity score.

Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) [20, 21] is a semantic similarity measure that uses pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus. The advantage of SOC-PMI is that it can calculate the similarity between two words that do not co-occur frequently, because they co-occur with the same neighboring words.

Normalized Google Distance (NGD) [22] is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Google distance, while words with dissimilar meanings tend to be farther apart. Specifically, the Normalized Google Distance between two search terms x and y is

NGD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log M - min{log f(x), log f(y)})

where M is the total number of web pages searched by Google; f(x) and f(y) are the number of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur. If the two search terms x and y never occur together on the same web page, but do occur separately, the normalized Google distance between them is infinite. If both terms always occur together, their NGD is zero.
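The NGD formula above can be turned into a small worked example. The following Python sketch assumes hit counts obtained from some search engine; the counts and index size used here are invented purely for illustration.

```python
# Worked example of the NGD formula. All counts below are made up.
from math import log

def ngd(fx: float, fy: float, fxy: float, M: float) -> float:
    """Normalized Google Distance from page counts f(x), f(y), f(x, y) and
    index size M. Any log base works, since the formula is a ratio of
    differences of logarithms."""
    if fxy == 0:
        return float("inf")  # the terms never co-occur on the same page
    return ((max(log(fx), log(fy)) - log(fxy)) /
            (log(M) - min(log(fx), log(fy))))

# Two terms that frequently co-occur: small distance (~0.05).
print(ngd(fx=1_000, fy=800, fxy=500, M=1_000_000_000))
# Two terms that never co-occur: infinite distance.
print(ngd(fx=1_000, fy=800, fxy=0, M=1_000_000_000))
```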
Extracting DIStributionally similar words using CO-occurrences (DISCO) [23, 24] assumes that words with similar meaning occur in similar contexts. Large text collections are statistically analyzed to obtain distributional similarity. DISCO computes distributional similarity between words using a simple context window of size ±3 words for counting co-occurrences. When two words are submitted for exact similarity, DISCO simply retrieves their word vectors from the indexed data and computes the similarity according to the Lin measure [25]. If the most distributionally similar word is required, DISCO returns the second-order word vector for the given word. DISCO has two main similarity measures, DISCO1 and DISCO2: DISCO1 computes the first-order similarity between two input words based on their collocation sets, and DISCO2 computes the second-order similarity between two input words based on their sets of distributionally similar words.

4. Knowledge-Based Similarity
Knowledge-Based similarity is a semantic similarity measure based on identifying the degree of similarity between words using information derived from semantic networks [26]. WordNet [27] is the most popular semantic network in the area of measuring Knowledge-Based similarity between words; WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. As shown in figure 3, Knowledge-Based similarity measures can be divided roughly into two groups: measures of semantic similarity and measures of semantic relatedness.

[Fig 3: Knowledge-Based Similarity Measures. Similarity: information content (res, lin, jcn) and path length (lch, wup, path). Relatedness: hso, lesk, vector.]

Semantically similar concepts are deemed to be related on the basis of their likeness. Semantic relatedness, on the other hand, is a more general notion, not specifically tied to the shape or form of the concept. In other words, semantic similarity is one kind of relatedness between two words, while relatedness covers a broader range of relationships between concepts, including relations such as is-a-kind-of, is-a-specific-example-of, is-a-part-of and is-the-opposite-of [28].

There are six measures of semantic similarity; three of them are based on information content: Resnik (res) [29], Lin (lin) [25] and Jiang & Conrath (jcn) [30]. The other three measures are based on path length: Leacock & Chodorow (lch) [31], Wu & Palmer (wup) [32] and Path Length (path).

The relatedness value returned by the res measure is equal to the information content (IC) of the Least Common Subsumer (the most informative subsumer). This means that the value will always be greater than or equal to zero. The upper bound on the value is generally quite large and varies depending upon the size of the corpus used to determine information content values. The lin and jcn measures augment the information content of the Least Common Subsumer with the sum of the information content of concepts A and B themselves. The lin measure scales the information content of the Least Common Subsumer by this sum, while jcn takes the difference between this sum and the information content of the Least Common Subsumer.

The lch measure returns a score denoting how similar two word senses are, based on the shortest path that connects the senses and the maximum depth of the taxonomy in which the senses occur. The wup measure returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer. The path measure returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.

Furthermore, there are three measures of semantic relatedness: St-Onge (hso) [33], Lesk (lesk) [34] and vector pairs (vector) [35]. The hso measure works by finding lexical chains linking the two word senses. Three classes of relations are considered: extra-strong, strong, and medium-strong; the maximum relatedness score is 16. The lesk measure works by finding overlaps in the glosses of the two synsets; the relatedness score is the sum of the squares of the overlap lengths. The vector measure creates a co-occurrence matrix for each word used in the WordNet glosses from a given corpus, and then represents each gloss/concept with a vector that is the average of these co-occurrence vectors.

The most popular packages that cover Knowledge-Based similarity measures are WordNet::Similarity (http://wn-similarity.sourceforge.net/) and the Natural Language Toolkit (NLTK, http://nltk.org/).
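As a brief illustration of how these WordNet-based measures can be accessed through one of the packages just mentioned, the following Python sketch uses NLTK. It assumes the WordNet and WordNet-IC corpora have already been downloaded (for example via nltk.download('wordnet') and nltk.download('wordnet_ic')), and the two synsets chosen are arbitrary examples.

```python
# Querying the similarity measures discussed above through NLTK's
# WordNet interface; the synsets 'car.n.01' and 'bus.n.01' are arbitrary.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

car = wn.synset('car.n.01')
bus = wn.synset('bus.n.01')

# Path-length based measures: path, Wu & Palmer (wup), Leacock & Chodorow (lch).
print(car.path_similarity(bus))
print(car.wup_similarity(bus))
print(car.lch_similarity(bus))

# Information-content based measures need an IC corpus (here the Brown corpus).
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(car.res_similarity(bus, brown_ic))
print(car.lin_similarity(bus, brown_ic))
print(car.jcn_similarity(bus, brown_ic))
```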
5. Hybrid Similarity Measures
Hybrid methods use multiple similarity measures; a great deal of research has covered this area. Eight semantic similarity measures were tested in [26]: two of these measures were corpus-based and the other six were knowledge-based. First, the eight algorithms were evaluated separately; then they were combined together. The best performance was achieved using a method that combines several similarity metrics into one.

A method for measuring the semantic similarity between sentences or very short texts, based on semantic and word order information, was presented in [36]. First, semantic similarity is derived from a lexical knowledge base and a corpus. Second, the proposed method considers the impact of word order on sentence meaning; the derived word order similarity measures the number of different words as well as the number of word pairs in a different order.

The authors of [37] presented a method named Semantic Text Similarity (STS). This method determines the similarity of two texts from a combination of semantic and syntactic information. They considered two mandatory functions (string similarity and semantic word similarity) and an optional function (common-word order similarity). The STS method achieved a very good Pearson correlation coefficient on data sets of 30 sentence pairs and outperformed the results obtained in [36].

The authors of [38] presented an approach that combines a corpus-based semantic relatedness measure over the whole sentence with knowledge-based semantic similarity scores obtained for the words falling under the same syntactic roles in both sentences. All the scores were fed as features to machine learning models, such as linear regression and bagging models, to obtain a single score giving the degree of similarity between sentences. This approach showed a significant improvement in calculating the semantic similarity between sentences by combining the knowledge-based similarity measure and the corpus-based relatedness measure, compared with the corpus-based measure taken alone.

A promising correlation between manual and automatic similarity results was achieved in [39] by combining two modules. The first module calculates the similarity between sentences using N-gram based similarity, and the second module calculates the similarity between concepts in the two sentences using a concept similarity measure and WordNet.

A system named UKP, with reasonable correlation results, was introduced in [40]; it used a simple log-linear regression model, trained on the training data, to combine multiple text similarity measures. These measures covered string similarity, semantic similarity, text expansion mechanisms, and measures related to structure and style. The final UKP models consisted of a log-linear combination of about 20 features, out of the possible 300 features implemented.
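To show the general shape of such hybrid combinations (not the exact models of [38] or [40]), the following Python sketch assumes scikit-learn is installed and uses invented feature values: each sentence pair is described by a few individual similarity scores, and a regression model learns a single combined score from human judgments.

```python
# Hedged sketch of the hybrid idea: individual similarity scores become
# features and a regression model learns how to combine them.
# Feature values and gold scores below are invented.
from sklearn.linear_model import LinearRegression

# Each row: [string similarity, corpus-based score, knowledge-based score]
# for one sentence pair; y holds human-assigned similarity judgments (0-5).
X_train = [[0.90, 0.85, 0.80],
           [0.20, 0.35, 0.30],
           [0.55, 0.60, 0.70],
           [0.10, 0.15, 0.05]]
y_train = [4.8, 1.5, 3.6, 0.4]

model = LinearRegression().fit(X_train, y_train)

# Predict a combined similarity score for a new sentence pair.
print(model.predict([[0.70, 0.65, 0.75]]))
```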
6. Conclusion
In this survey, three text similarity approaches were discussed: String-based, Corpus-based and Knowledge-based similarities. String-Based measures operate on string sequences and character composition; fourteen algorithms were introduced, seven of them character-based and the other seven term-based distance measures. Corpus-Based similarity is a semantic similarity measure that determines the similarity between words according to information gained from large corpora; nine algorithms were explained: HAL, LSA, GLSA, ESA, CL-ESA, PMI-IR, SOC-PMI, NGD and DISCO. Knowledge-Based similarity is a semantic similarity measure based on identifying the degree of similarity between words using information derived from semantic networks; nine algorithms were introduced, six based on semantic similarity (res, lin, jcn, lch, wup and path) and three based on semantic relatedness (hso, lesk and vector). Some of these algorithms have been combined in many research efforts. Finally, useful similarity packages were mentioned, such as SimMetrics, WordNet::Similarity and NLTK.

7. REFERENCES
[1] Chapman, S. (2006). SimMetrics: a Java & C# .NET library of similarity metrics, http://sourceforge.net/projects/simmetrics/.
[2] Hall, P. A. V. & Dowling, G. R. (1980). Approximate string matching, Computing Surveys, 12:381-402.
[3] Peterson, J. L. (1980). Computer programs for detecting and correcting spelling errors, Communications of the ACM, 23:676-687.
[4] Jaro, M. A. (1989). Advances in record linkage methodology as applied to the 1985 census of Tampa, Florida, Journal of the American Statistical Association, vol. 84, no. 406, pp. 414-420.
[5] Jaro, M. A. (1995). Probabilistic linkage of large public health data files, Statistics in Medicine 14 (5-7), 491-498.
[6] Winkler, W. E. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association, 354-359.
[7] Needleman, S. B. & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48(3): 443-453.
[8] Smith, T. F. & Waterman, M. S. (1981). Identification of Common Molecular Subsequences, Journal of Molecular Biology 147: 195-197.
[9] Barrón-Cedeño, A., Rosso, P., Agirre, E. & Labaka, G. (2010). Plagiarism Detection across Distant Language Pairs, In Proceedings of the 23rd International Conference on Computational Linguistics, pages 37-45.
[10] Krause, E. F. (1987). Taxicab Geometry. Dover. ISBN 0-486-25202-7.
[11] Dice, L. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3).
[12] Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547-579.
[13] Lund, K., Burgess, C. & Atchley, R. A. (1995). Semantic and associative priming in a high-dimensional semantic space. Cognitive Science Proceedings (LEA), 660-665.
[14] Lund, K. & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments & Computers, 28(2), 203-208.
[15] Landauer, T. K. & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104.
[16] Matveeva, I., Levow, G., Farahat, A. & Royer, C. (2005). Generalized latent semantic analysis for term representation. In Proc. of RANLP.
[17] Gabrilovich, E. & Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 6-12.
[18] Potthast, M., Stein, B. & Anderka, M. (2008). A Wikipedia-based multilingual retrieval model. Proceedings of the 30th European Conference on IR Research (ECIR), pp. 522-530.
[19] Turney, P. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the Twelfth European Conference on Machine Learning (ECML).
[20] Islam, A. & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2, 2 (Jul. 2008), 1-25.
[21] Islam, A. & Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033-1038.
[22] Cilibrasi, R. L. & Vitanyi, P. M. B. (2007). The Google Similarity Distance, IEEE Trans. Knowledge and Data Engineering, 19:3, 370-383.
[23] Kolb, P. (2009). Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA '09), Odense, Denmark.
[24] Kolb, P. (2009). Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA '09), Odense, Denmark.
[25] Lin, D. (1998). Extracting Collocations from Text Corpora. In Workshop on Computational Terminology, Montreal, Canada, 57-63.
[26] Mihalcea, R., Corley, C. & Strapparava, C. (2006). Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the American Association for Artificial Intelligence (Boston, MA).
[27] Miller, G. A., Beckwith, R., Fellbaum, C. D., Gross, D. & Miller, K. (1990). WordNet: An online lexical database. Int. J. Lexicograph. 3, 4, pp. 235-244.
[28] Patwardhan, S., Banerjee, S. & Pedersen, T. (2003). Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, pp. 241-257.
[29] Resnik, P. (1995). Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada.
[30] Jiang, J. & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan.
[31] Leacock, C. & Chodorow, M. (1998). Combining local context and WordNet sense similarity for word sense identification. In WordNet: An Electronic Lexical Database. The MIT Press.
[32] Wu, Z. & Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico.
[33] Hirst, G. & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pp. 305-332. MIT Press.
[34] Banerjee, S. & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, pp. 136-145.
[35] Patwardhan, S. (2003). Incorporating dictionary and corpus information into a context vector measure of semantic relatedness. Master's thesis, University of Minnesota, Duluth.
[36] Li, Y., McLean, D., Bandar, Z., O'Shea, J. & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138-1149.
[37] Islam, A. & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data, 2(2), 1-25.
[38] Aggarwal, N., Asooja, K. & Buitelaar, P. (2012). DERI&UPM: Pushing Corpus Based Relatedness to Similarity: Shared Task System Description. First Joint Conference on Lexical and Computational Semantics (*SEM), pages 643-647, Montreal, Canada, June 7-8, 2012. Association for Computational Linguistics.
[39] Buscaldi, D., Tournier, R., Aussenac-Gilles, N. & Mothe, J. (2012). IRIT: Textual Similarity Combining Conceptual Similarity with an N-Gram Comparison Method. First Joint Conference on Lexical and Computational Semantics (*SEM), pages 552-556, Montreal, Canada, June 7-8, 2012. Association for Computational Linguistics.
[40] Bär, D., Biemann, C., Gurevych, I. & Zesch, T. (2012). UKP: Computing Semantic Textual Similarity by Combining Multiple Content Similarity Measures. First Joint Conference on Lexical and Computational Semantics (*SEM), pages 435-440, Montreal, Canada, June 7-8, 2012. Association for Computational Linguistics.

AUTHORS' PROFILES
Wael Hasan Gomaa is currently working as a teaching assistant in the Computer Science Department, Modern Academy for Computer Science & Management Technology, Cairo, Egypt. He is a Ph.D. student at the Faculty of Computers and Information, Cairo University, Egypt, in the field of Automatic Assessment, under the supervision of Prof. Aly Aly Fahmy. He received his B.Sc. and Master's degrees from the Faculty of Computers and Information, Helwan University, Egypt. His Master's thesis was entitled "Text Mining Hybrid Approach for Clustered Semantic Analysis". His research interests include Natural Language Processing, Artificial Intelligence, Data Mining and Text Mining.

Prof. Aly Aly Fahmy is the former Dean of the Faculty of Computers and Information, Cairo University, and a Professor of Artificial Intelligence and Machine Learning in the Department of Computer Science. He graduated from the Department of Computer Engineering, Technical College, with an honors degree. He specialized in Mathematical Logic and did his research with Dr. Hervé Gallaire, the former vice president of Xerox Global. He received a Master's degree from the National School of Space and Aeronautics (ENSAE), Toulouse, France, in 1976, in the field of logical database systems, and then obtained his Ph.D. from the Centre for Studies and Research (CERT-DERI), Toulouse, France, in 1979, in the field of Artificial Intelligence. He received practical training in the field of Operating Systems and Knowledge-Based Systems in Germany and the United States of America. He participated in several national projects, including the establishment of the Egyptian Universities Network (currently hosted at the Egyptian Academy of Scientific Research at the Ministry of Higher Education), building expert systems in the field of the iron and steel industry, and building decision support systems for many national entities. Prof. Fahmy's main research areas are Data and Text Mining, Mathematical Logic, Computational Linguistics, Text Understanding, Automatic Essay Scoring, and Technologies of the Man-Machine Interface in Arabic. He published many refereed papers and authored the book "Decision Support Systems and Intelligent Systems" in Arabic. He was the Director of the first Center of Excellence in Egypt in the field of Data Mining and Computer Modeling (DMCM) in the period 2005-2010. DMCM was a virtual research center with more than 40 researchers from universities and industry; it was founded by an initiative of Dr. Tarek Kamel, Minister of Communications and Information Technology.
Prof. Aly Fahmy is currently involved in the implementation of the exploration project of Master's and Doctorate theses of Cairo University, under the supervision of Prof. Dr. Hussein Khaled, Vice President of Cairo University for Postgraduate Studies and Research. The project aims to assess the practical implementation of Cairo University's late strategic research plan and to assist in the formulation of the new strategic research plan for the coming period 2011-2015.