Open-source Resources and Standards for Arabic Word Structure Analysis: Fine Grained Morphological Analysis of Arabic Text Corpora

By Majdi Shaker Salem Sawalha

Submitted in accordance with the requirements for the degree of Doctor of Philosophy

The University of Leeds
School of Computing
October, 2011

The candidate confirms that the work submitted is his own and that appropriate credit has been given where reference has been made to the work of others. This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement.

Memory

I dedicate this thesis to the memory of my most beloved Father, Shaker Sawalha (March 3, 1949 - March 5, 2011), who lived a life of dignity, courage, wisdom, patience and above all affection, and who brought me up on the true values of life. Father, you will remain my personal hero and my inspiration forever. May God bless his soul, Amen.

Acknowledgements

I thank my God, Allāh, for giving me the health, patience and strength to write this thesis, and for all the graces He has granted me.

I would like to thank my supervisor, Dr. Eric Atwell, for supervising me during these four years. Thank you very much for your patience, guidance and encouragement. I learnt from you how to be a real researcher, how to think differently and how to understand life better. I would also like to thank the NLP group members for the great seminars we used to enjoy almost every week. This is also a great opportunity to thank Dr. Latifa AlSulaiti for her support, encouragement and advice.

I would like to thank all my friends here in the UK and back home in Jordan. I would like to thank Claire Brierley for being a true friend, and for the discussions and the sharing of ideas and plans for future research. I am looking forward to producing many publications from our great ideas.

To my best friend Dr.
Mohammad Haji, thank you very much for being a true friend whom I trust. Your wise advice, encouragement and unending generosity made my research and life in the UK easy and enjoyable. Thank you for being there during the good times and the hard times. I wish you the best of luck in your life and career.

Finally, I dedicate this thesis to my family, who have always supported me in my studies and life. Without your love, care and patience, I would not have achieved this. I would like to thank my eldest brother Rami and his family: my sister-in-law Dina, my nephew Faris, and my nieces Tala, Layan and Jude. My sister Noor and her family: my brother-in-law Husam, my niece Hadeel, and my nephew Mohammed (who has just been born). My sister Dua’ and her family: my brother-in-law Mohammed and my nieces Dana and Heba. My sister Eman and her family: my brother-in-law Omar and my niece Hala (who has just been born). My youngest brother Mohammed, I wish you the brightest future. My youngest sister Rahma, we are all lucky to have you as our beloved sister. To my beloved Grandma, I wish you prosperity and a long, happy life.

The special dedication of this thesis is to my most beloved Mum. Thank you for your patience, care and everything you have done to keep our family together in peace and happiness. Thank you for giving us the love we need to survive in this life. I will always love you, Mum.

Declaration

I declare that the work presented in this thesis is, to the best of my knowledge of the domain, original and my own work. Most of the work presented in this thesis has been published. Publications are listed below:

(Majdi Sawalha)

Chapter 3
1- Sawalha, M. and E. Atwell (2008). Comparative evaluation of Arabic language morphological analysers and stemmers. Proceedings of COLING 2008 22nd International Conference on Computational Linguistics.

Chapter 4
2- Sawalha, M. and E. Atwell (2010).
Constructing and Using Broad-Coverage Lexical Resource for Enhancing Morphological Analysis of Arabic. Language Resources and Evaluation Conference LREC 2010, Valletta, Malta.

Chapters 5 and 6
3- Sawalha, M. and E. Atwell (Under review). "A Theory Standard Tag Set Expounding Traditional Morphological features for Arabic Language Part-of-Speech Tagging." Word Structure journal, Edinburgh University Press.

Chapter 7
4- Sawalha, M. and E. Atwell (2011). [Arabic title] "Morphological Analysis of Classical and Modern Standard Arabic Text". 7th International Computing Conference in Arabic (ICCA11), Imam Mohammed Ibn Saud University, Riyadh, KSA.

Chapters 8 and 9
5- Sawalha, M. and E. Atwell (2009). [Arabic title] (Adapting Language Grammar Rules for Building Morphological Analyzer for Arabic Language). Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by the Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City for Science and Technology (KACST) and the Arabic Language Academy, Damascus, Syria.
6- Sawalha, M. and E. Atwell (2009). Linguistically Informed and Corpus Informed Morphological Analysis of Arabic. Proceedings of the 5th International Corpus Linguistics Conference CL2009, Liverpool, UK.
7- Sawalha, M. and E. Atwell (2010). Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. Language Resources and Evaluation Conference LREC 2010, Valletta, Malta.

Chapter 10
8- Sawalha, M. and E. Atwell (2011). Accelerating the Processing of Large Corpora: Using Grid Computing Technologies for Lemmatizing 176 Million Words Arabic Internet Corpus. Advanced research computing open event. University of Leeds, Leeds, UK.
9- Sawalha, M. and E. Atwell (2011). Corpus Linguistics Resources and Tools for Arabic Lexicography. Workshop on Arabic Corpus Linguistics, Lancaster University, Lancaster, UK.

Abstract

Morphological analyzers are preprocessors for text analysis.
Many Text Analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis, particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text. We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required.

Tag-assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis, particularly for probabilistic taggers which require training data, if some words can change grammatical tag depending on function and context; on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context.

The SALMA – Tagger is a fine-grained morphological analyzer which depends mainly on linguistic information extracted from traditional Arabic grammar books and on a prior-knowledge broad-coverage lexical resource, the SALMA – ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA – Tag Set is a theory standard for encoding, which captures long-established traditional fine-grained morphological features of Arabic, in a notation format intended to be compact yet transparent.

The SALMA – Tagger has been used to lemmatize the 176-million-word Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an with syllable and primary stress information, as well as for fine-grained morphological tagging.
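The idea of one subtag per morpheme, written in a compact positional notation, can be illustrated with a minimal sketch. It is not the SALMA – Tagger implementation. The position names are paraphrased from the 22 positions listed under Appendix A in the Contents; the one-character value codes, the `-` placeholder for positions that do not apply, the placeholder surface forms and all identifiers (`encode_tag`, `Morpheme`, `POSITIONS`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass

# Position names paraphrase the 22 positions listed under Appendix A.
POSITIONS = [
    "main_pos", "noun_subcategory", "verb_subcategory", "particle_subcategory",
    "other_subcategory", "punctuation_subcategory", "gender", "number", "person",
    "inflectional_morphology", "case_or_mood", "case_and_mood_marks",
    "definiteness", "voice", "emphasis", "transitivity", "rational",
    "declension_and_conjugation", "augmentation", "root_letter_count",
    "verb_root", "noun_finals",
]

# The five word parts named in the abstract.
MORPHEME_KINDS = ("proclitic", "prefix", "stem", "suffix", "enclitic")

NOT_APPLICABLE = "-"  # assumption: filler for positions that do not apply


def encode_tag(features):
    """Build a fixed-width tag with one character per position, in order."""
    unknown = set(features) - set(POSITIONS)
    if unknown:
        raise ValueError(f"unknown feature(s): {sorted(unknown)}")
    for name, value in features.items():
        if len(value) != 1:
            raise ValueError(f"value for {name!r} must be a single character")
    return "".join(features.get(name, NOT_APPLICABLE) for name in POSITIONS)


@dataclass
class Morpheme:
    kind: str  # one of MORPHEME_KINDS
    form: str  # surface form of this word part
    tag: str   # positional subtag produced by encode_tag

    def __post_init__(self):
        if self.kind not in MORPHEME_KINDS:
            raise ValueError(f"unknown morpheme kind: {self.kind!r}")
        if len(self.tag) != len(POSITIONS):
            raise ValueError("tag must have exactly one character per position")


# A hypothetical three-part word. Forms and value codes are invented
# placeholders, not real SALMA values.
word = [
    Morpheme("proclitic", "<proclitic>", encode_tag({"main_pos": "p"})),
    Morpheme("stem", "<stem>",
             encode_tag({"main_pos": "n", "gender": "m", "number": "s"})),
    Morpheme("enclitic", "<enclitic>",
             encode_tag({"main_pos": "n", "person": "t"})),
]

for morpheme in word:
    print(f"{morpheme.kind:<9} {morpheme.tag}")
```

Each word part carries its own fixed-width subtag, so a given feature is always read from the same character position, which is one way a notation can be both compact and transparent.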
- vi - Contents Memory ...................................................................................................................... ii Acknowledgements .................................................................................................. iii Declaration................................................................................................................ iv Abstract ...................................................................................................................... v Contents .................................................................................................................... vi Figures ...................................................................................................................... xv Tables ....................................................................................................................... xx List of Abbreviations ........................................................................................... xxiv Part I: Introduction and Background Review ....................................................... 1 Chapter 1 Introduction............................................................................................. 2 1.1 This Thesis ................................................................................................... 3 1.2 Computational Morphology ......................................................................... 3 1.3 Arabic Computational Morphology ............................................................. 4 1.4 The Complexity of Arabic Morphology ...................................................... 7 1.5 Motivation and Objectives for this Thesis ................................................... 8 1.6 Thesis Structure ......................................................................................... 10 Chapter 2 Literature Review: Morphosyntactic Analysis of Arabic Text ........ 
13 2.1 Introduction ................................................................................................ 13 2.2 Arabic Corpora........................................................................................... 14 2.3 Morphological Analysis for Text Corpora................................................. 16 2.3.1 Approaches to Morphological Analysis......................................... 18 2.3.2 MorphoChallege Competition ....................................................... 19 2.3.3 Applications of Morphological analysis ........................................ 20 2.3.4 Morphological Analysis for Arabic Text ....................................... 21 2.3.4.1 Challenges of Arabic Morphology..................................... 22 2.3.4.2 Basic Concepts of Arabic Morphological Analysis ........... 27 2.3.4.3 Morphological Analysis of Classical Quranic Arabic Text 28 2.3.4.4 Four Approaches to Morphological Analysis for MSA Arabic Text ........................................................................... 30 2.3.4.5 Requirements for Developing Morphological Analysers for Arabic Text ........................................................................... 31 2.3.4.6 Morphological Analysers for Modern Standard Arabic Text31 - vii 2.3.4.7 The ALECSO/KACST Initiative of developing and evaluating Morphological Analysers of Arabic text ............. 36 2.4. Part-of-Speech Tagging ............................................................................ 37 2.4.1 Part-of-Speech Taggers for Arabic Text ........................................ 39 2.5 Chapter Summary ...................................................................................... 40 Part II: Background Analysis and Design ............................................................ 42 Chapter 3 Comparative Evaluation of Arabic Morphological Analyzers and Stemmers ........................................................................................................ 
43 3.1 Introduction ................................................................................................ 44 3.2 Three Stemming Algorithms...................................................................... 45 3.2.1 Shereen Khoja’s Stemmer.............................................................. 45 3.2.2 Tim Buckwalter’s Morphological Analyzer .................................. 46 3.2.3 Triliteral Root Extraction Algorithm ............................................. 46 3.3 Stemming by Ensemble or Voting ............................................................. 47 3.4 Gold standard for Evaluation ..................................................................... 49 3.5 Four Experiments and Results ................................................................... 51 3.6 Comparative Evaluation Conclusions ........................................................ 55 3.7 Analytical Study of Arabic Triliteral Roots ............................................... 56 3.7.1 A Study of Triliteral Roots in the Qur’an ..................................... 56 3.7.2. A Study of Triliteral Roots in Traditional Arabic Lexicons ......... 58 3.7.3 Discussion of the Analytical Study of Arabic Triliteral Roots ...... 60 3.8 Summary and Conclusions ........................................................................ 61 Chapter 4 The SALMA-ABCLexicon: Prior-Knowledge Broad-Coverage Lexical Resource to Improve Morphological Analyses ............................. 63 4.1 Introduction ................................................................................................ 64 4.1.1 Morphological Lexicons of Other Languages ............................... 64 4.1.2 Morphological Lexicons for Arabic............................................... 68 4.2 Traditional Arabic Lexicons and Lexicography ........................................ 
69 4.3 Methodologies for Ordering Lexical Entries in the Traditional Arabic Lexicons .................................................................................................. 73 4.3.1 The al-ẖalῑl Methodology .............................................................. 73 4.3.2 The abū ‘ubayd Methodology........................................................ 74 4.3.3 The al-ğawharῑ Methodology ........................................................ 74 4.3.4 The al-barmakῑ Methodology ........................................................ 75 4.4 Constructing the SALMA-ABCLexicon ................................................... 76 4.4.1 The Text Corpus ............................................................................ 78 - viii 4.4.2 Morphological Knowledge Used to Extract the Lexical Entries ... 78 4.4.3 Combining the Processed Lexicons into the SALMA-ABCLexicon81 4.4.4 Format of the SALMA-ABCLexicon ............................................ 82 4.4.5 Retrieval of the Lexical Entries ..................................................... 84 4.5 Evaluation of the SALMA-ABCLexicon .................................................. 86 4.6 The Corpus of Traditional Arabic Lexicons .............................................. 89 4.7 Discussion of the Results, Limitations and Improvement ......................... 91 4.8 Chapter Summary ...................................................................................... 93 Chapter 5 Survey of Arabic Morphosyntactic Tag Sets and Standards; Background to Designing the SALMA Tag Set .......................................... 95 5.1 Introduction ................................................................................................ 96 5.2 Traditional Arabic Part-of-Speech Classification ...................................... 97 5.3 Existing Arabic Part-of-Speech Tag Sets .................................................. 
98 5.3.1 Khoja’s Arabic Tag Set .................................................................. 99 5.3.2 Penn Arabic Treebank (PATB) Part-of-Speech Tag Set ............... 99 5.3.3 ARBTAGS Tag Set...................................................................... 103 5.3.4 MorphoChallenge 2009 Qur’an Gold Standard Part-of-Speech Tag Set ................................................................................................ 104 5.3.5 The Quranic Arabic Corpus Part-of-Speech Tag Set ................... 105 5.3.6 Columbia Arabic Treebank CATiB Part-of-Speech Tag Set ....... 106 5.3.7 Comparison of Arabic Part-of-Speech Tag Sets .......................... 107 5.4 Morphological Features in Tag Set Design Criteria ................................ 110 5.4.1 Mnemonic Tag Names ................................................................. 111 5.4.2 Underlying Linguistic Theory...................................................... 112 5.4.3 Classification by Form or Function ............................................. 112 5.4.4 Idiosyncratic Words ..................................................................... 113 5.4.5 Categorization Problems .............................................................. 113 5.4.6 Tokenisation: What Counts as a Word?....................................... 114 5.4.7 Multi-Word Lexical Items ........................................................... 114 5.4.8 Target Users and/or Applications ................................................ 115 5.4.9 Availability and/or Adaptability of Tagger Software .................. 115 5.4.10 Adherence to Standards ............................................................. 115 5.4.11 Genre, Register or Type of Language ........................................ 115 5.4.12 Degree of Delicacy of the Tag Set ............................................. 116 5.5 Complex Morphology of Arabic .............................................................. 
118 - ix 5.6 Chapter Summary .................................................................................... 119 Part III: Proposed Standards for Arabic Morphological Analysis .................. 121 Chapter 6 The SALMA – Tag Set ....................................................................... 122 6.1 The Theory Standard Tag Set Expounding Morphological Features ...... 123 6.2 The Morphological Features of the SALMA Tag Set ............................. 125 6.2.1 Main Part-of-Speech Categories .................................................. 126 6.2.2 Part-of-Speech Subcategories of Noun ........................................ 127 6.2.3 Part-of-Speech Subcategories of Verb ......................................... 133 6.2.4 Part-of-Speech Subcategories of Particles ................................... 134 6.2.5 Part-of-Speech Subcategories of Others (Residuals) ................... 138 6.2.6 Part-of-Speech Subcategories of Punctuation Marks .................. 141 6.2.7 Morphological Feature of Gender ................................................ 142 6.2.8 Morphological Feature of Number .............................................. 144 6.2.9 Morphological Feature of Person................................................. 147 6.2.10 Morphological Feature Category of Inflectional Morphology .. 148 6.2.11 Morphological Feature Category of Case or Mood ................... 150 6.2.12 The Morphological Feature of Case and Mood Marks .............. 153 6.2.13 The Morphological Feature of Definiteness .............................. 155 6.2.14 Morphological Feature of Voice ................................................ 156 6.2.15 Morphological Feature of Emphasized and Non-emphasized ... 156 6.2.16 The Morphological Feature of Transitivity................................ 157 6.2.17 The Morphological Feature of Rational ..................................... 159 6.2.18 The Morphological Feature of Declension and Conjugation ..... 
160 6.2.19 The Morphological Feature of Unaugmented and Augmented . 163 6.2.20 The Morphological Feature of Number of Root Letters ............ 165 6.2.21 The Morphological Feature of Verb Root ................................. 166 6.2.22 The Morphological Feature of Types of Noun Finals ............... 168 6.3 Chapter Summary .................................................................................... 171 Chapter 7 Applying the SALMA – Tag Set ........................................................ 172 7.1 Introduction .............................................................................................. 173 7.2 Why was Manual Annotation not Applied?............................................. 174 7.3 Methodologies for Evaluating the SALMA Tag Set ............................... 174 7.4 Mapping the Quranic Arabic Corpus (QAC) Morphological Tags to SALMA Tags ........................................................................................ 176 7.4.1 Mapping Classical to Modern Character-Set ............................... 176 -x7.4.2 Splitting Whole-Word Tags into Morpheme-Tags ...................... 177 7.4.3 Mapping of Feature-Labels .......................................................... 178 7.4.4 Adjustments to Morpheme Tokenization..................................... 179 7.4.5 Extrapolation of Missing Fine-Grain Features ............................ 182 7.4.6 Manual proofreading and correction of the mapped SALMA tags ...................................................................................... 184 7.5 Evaluation of the Mapping Process ......................................................... 185 7.6 Discussion of Evaluation of the SALMA Tag Set ................................... 188 7.7 Conclusions and Summary ...................................................................... 189 Part IV: Tools and Applications for Arabic Morphological Analysis ............. 
191 Chapter 8 The SALMA Tagger for Arabic Text ............................................... 192 8.1 Introduction .............................................................................................. 193 8.2 Specifications and Standards of Arabic Morphological Analyses ........... 193 8.2.1 ALECSO/KACST Initiative on Morphological Analyzers for Arabic Text .................................................................................. 194 8.2.2 ALECSO/KACST Prerequisites for a Good Morphological Analyser for Arabic Text ............................................................. 195 8.2.3 ALECSO/KACST: Design Recommendations............................ 195 8.2.3.1 ALECSO/KACST: Design Recommendations of Inputs 196 8.2.3.2 ALECSO/KACST: Design Recommendations of Analysis196 8.2.3.3 ALECSO/KACST: Design Recommendations of Outputs201 8.2.4 Discussion of ALECSO/KACST Recommendations .................. 202 8.3 The SALMA – Tagger Algorithm ........................................................... 203 8.3.1 Module 1: SALMA – Tokenizer .................................................. 204 8.3.1.1 Step 1, Tokenization ........................................................ 205 8.3.1.2 Step 2, Spelling Errors Detection and Correction ............ 206 8.3.1.3 Step 3, Word Segmentation (Clitics, Affixes and Stems) 207 8.3.1.4 Which Segmentation to Use? ........................................... 207 8.3.1.5 Constructing the Clitics and Affixes Dictionaries ........... 209 8.3.1.6 Matching the Affixes and Clitics with the Word’s Segments ............................................................................. 211 8.3.2 Module 2: SALMA- Lemmatizer and Stemmer .......................... 213 8.3.2.1 The Use of the SALMA ABCLexicon............................. 214 8.3.2.2 Step 1, Root extraction ..................................................... 215 8.3.2.3 Step 2, Function Words.................................................... 
216 8.3.2.4 Step 3, Lemmatizing ........................................................ 216 - xi 8.3.3 Module 3: SALMA – Pattern Generator ...................................... 217 8.3.3.1 Constructing the Patterns Dictionary ............................... 220 8.3.3.2 Pattern Matching Algorithm 1 ......................................... 221 8.3.3.3 Pattern Matching Algorithm 2 ......................................... 222 8.3.4 Module 4: SALMA – Vowelizer ................................................. 226 8.3.5 Module 5: SALMA – Tagger ....................................................... 226 8.3.5.1 Initially-assigned SALMA Tags ...................................... 227 8.3.5.2 Rule-Based System to Predict the Morphological Feature Values of the Word’s Morphemes ...................................... 228 8.3.5.3 Colour Coding the Analyzed Words ................................ 230 8.4 Rules for Predicting the Morphological features of Arabic Word Morphemes ........................................................................................... 231 8.4.1 Rules for Predicting the Morphological Feature of Person ......... 233 8.4.2 Rules for Predicting the Morphological Feature of Rational ....... 235 8.4.3 Rules for Predicting the Morphological Feature of Noun Finals . 237 8.5 Output Format .......................................................................................... 238 8.6 Chapter Summary .................................................................................... 243 Chapter 9 Evaluation for the SALMA – Tagger................................................ 245 9.1 Introduction .............................................................................................. 246 9.2 ALECSO/KACST Initiative Guidelines for Evaluating Morphological Analyzers for Arabic Text .................................................................... 247 9.2.1 Evaluation of the Linguistic Specifications ................................. 
248 9.2.2 Evaluation of the Technical Specifications.................................. 248 9.2.2.1 The Approach to Implementation .................................... 248 9.2.2.2 User Friendliness ............................................................. 249 9.2.2.3 Database Management ..................................................... 249 9.2.2.4 Copyright and licensing ................................................... 249 9.2.2.5 Evaluation Metrics of Recall and Precision ..................... 249 9.3 MorphoChallenge Guidelines for Evaluating Morphological Analyzers for Arabic Text ........................................................................................... 249 9.3.1 MorphoChallenge 2009 Competition 1: Evaluation using Gold Standard ....................................................................................... 250 9.3.2 MorphoChallenge 2009 Qur’an Gold Standard ........................... 251 9.4 Gold Standard for Evaluation .................................................................. 252 9.4.1 Problem domain ........................................................................... 253 9.4.2 The Corpora ................................................................................. 253 - xii 9.4.3 Gold Standard Format .................................................................. 253 9.4.4 Gold Standard Size ...................................................................... 254 9.5 Building the SALMA – Gold Standard ................................................... 254 9.5.1 The Qur’an Gold Standard ........................................................... 255 9.5.1.1 Specifications of the Qur’an part of the SALMA Gold Standard .............................................................................. 256 9.5.2 The Corpus of Contemporary Arabic Gold Standard .................. 
259 9.5.2.1 Specifications of the CCA part of the SALMA Gold Standard .............................................................................. 259 9.6 Deciding on Accuracy Measurements ..................................................... 262 9.7 Evaluating the SALMA – Tagger Using Gold Standards ........................ 263 9.8 Discussion of Results ............................................................................... 274 9.8.1 Results of Predicting the Value of Main Part of Speech ............. 275 9.8.2 Results of Predicting the Value of the Part-of-Speech Subcategory of Noun ........................................................................................ 275 9.8.3 Results of Predicting the Value of the Part-of-Speech Subcategories of Verb and Particle .............................................. 276 9.8.4 Results of Predicting the Value of the Part-of-Speech Subcategory of Others (Residuals) ................................................................... 276 9.8.5 Results of Predicting the Value of Punctuations.......................... 276 9.8.6 Results of Predicting the Value of the Morphological Features of Gender, Number and Person ........................................................ 277 9.8.7 Results of Predicting the Value of the Morphological Features of Inflectional Morphology, Case or Mood, and Case and Mood Marks ........................................................................................... 278 9.8.8 Results of Predicting the Value of the Morphological Feature of Definiteness.................................................................................. 280 9.8.9 Results of Predicting the Value of the Morphological Feature of Voice ............................................................................................ 280 9.8.10 Results of Predicting the Value of the Morphological Feature of Emphasized and Non-Emphasized .............................................. 
281 9.8.11 Results of Predicting the Value of the Morphological Feature of Transitivity ................................................................................... 281 9.8.12 Results of Predicting the Value of the Morphological Feature of Rational ........................................................................................ 281 9.8.13 Results of Predicting the Value of the Morphological Feature of Declension and Conjugation ........................................................ 282 - xiii 9.8.14 Results of Predicting the Value of the Morphological Features of Unaugmented and Augmented, Number of Root Letters, and Verb Roots ............................................................................................ 282 9.8.15 Results of Predicting the Value of the Morphological Feature of Noun Finals .................................................................................. 283 9.8.16 More Conclusions ...................................................................... 283 9.9 Limitations and improvements ................................................................ 284 9.10 Extension of the SALMA – Tag Set ...................................................... 285 9.11 Chapter Summary .................................................................................. 287 Chapter 10 Practical Applications of the SALMA – Tagger ............................ 290 10.1 Introduction ............................................................................................ 291 10.2 Lemmatizing the 176-million words Arabic Internet Corpus ................ 291 10.2.1 Evaluation of the Lemmatizer Accuracy ................................... 294 10.3 Corpus Linguistics Resources and Tools for Arabic Lexicography ...... 296 10.4 Chapter Summary .................................................................................. 301 Part V: Conclusions and Future Work ............................................................... 
303 Chapter 11 Conclusions and Future Work ........................................................ 304 11.1 Overview ................................................................................................ 304 11.2 Thesis Achievements and Conclusions .................................................. 304 11.2.1 The Practical Challenge of Morphological Analysis for Arabic Text .............................................................................................. 305 11.2.2 Resources for improving Arabic Morphological Analysis ........ 306 11.2.3 Standards for Arabic Morphosyntactic Analysis ....................... 308 11.2.4 Applications and Implementations ............................................ 310 11.2.5 Evaluation .................................................................................. 311 11.3 Future work ............................................................................................ 316 11.3.1 Improving the SALMA – Tagger .............................................. 316 11.3.2 A Syntactic Analyzer (parser) for Arabic Text .......................... 318 11.3.3 Open Source Morphosyntactically Annotated Arabic Corpus... 319 11.3.4 Arabic Phonetics and Phonology for Text Analytics and Natural Language Processing Applications .............................................. 320 11.4 Summary: PhD impact, originality, and contributions to research field 321 11.4.1 Utilizing the Linguistic Wisdom and Knowledge in Arabic NLP322 11.4.2 Dimensions of Contributions to Arabic NLP ............................ 322 11.4.3 Impact ........................................................................................ 323 - xiv References .............................................................................................................. 324 Appendix A The SALMA Tag Set for Arabic text............................................. 
A.1 Position 1; Main part-of-speech
A.2 Position 2; Part-of-Speech Subcategories of Noun
A.3 Position 3; Part-of-Speech Subcategories of Verb
A.4 Position 4; Part-of-Speech Subcategories of Particle
A.5 Position 5; Part-of-Speech Subcategories of Other (Residuals)
A.6 Position 6; Part-of-Speech Subcategories of Punctuation Marks
A.7 Position 7; Morphological Feature of Gender
A.8 Position 8; Morphological Feature of Number
A.9 Position 9; Morphological Feature of Person
A.10 Position 10; Morphological Feature of Inflectional Morphology
A.11 Position 11; Morphological Feature Category of Case or Mood
A.12 Position 12; The Morphological Feature of Case and Mood Marks
A.13 Position 13; The Morphological Feature of Definiteness
A.14 Position 14; The Morphological Feature of Voice
A.15 Position 15; The Morphological Feature of Emphasized and Non-emphasized
A.16 Position 16; The Morphological Feature of Transitivity
A.17 Position 17; The Morphological Feature of Rational
A.18 Position 18; The Morphological Feature of Declension and Conjugation
A.19 Position 19; The Morphological Feature of Unaugmented and Augmented
A.20 Position 20; The Morphological Feature of Number of Root Letters
A.21 Position 21; The Morphological Feature of Verb Root
A.22 Position 22; The Morphological Feature of Noun Finals
Appendix B Summary of Arabic Part-of-Speech Tagging Systems

Figures

Figure 1.1 Example of ambiguous Arabic word
Figure 2.1 Sample of the morphological and part-of-speech tags of the Quranic Arabic Corpus taken from chapter 29
Figure 3.1 The statistical, computational and representational methods for better and more accurate ensemble (Dietterich 2000)
Figure 3.2 Sample from Gold Standard first document taken from Chapter 29 of the Qur’an (left) and the CCA (right).
Figure 3.3 Accuracy rates resulting from the four different experiments for the Qur’an test document
Figure 3.4 Sample output of the three algorithms, the voting experiments and the gold standard of the Qur’an test document
Figure 3.5 Accuracy rates results of the four different experiments for the CCA test document
Figure 3.6 Sample output of the three algorithms, the voting experiments and the gold standard of the CCA test document
Figure 3.7 Root distribution (left) and word distribution (right) of the Qur’an
Figure 3.8 Root distribution (left) and Word type distribution (right) of the broad-lexical resource
Figure 4.1 A sample of text from the traditional Arabic lexicons corpus “lisān al-‘arab”, the target lexical entries are underlined and highlighted in blue
Figure 4.2 A Human translation of the sample of text from the traditional Arabic lexicons “lisān al-‘arab”, the target lexical entries are highlighted in blue and square brackets.
Figure 4.3 A Sample of the definition of the root ktb from an Arabic-English Lexicon by Edward Lane (Lane 1968), http://www.tyndalearchive.com/TABS/Lane/, the target lexical entries are underlined.
Figure 4.4 A sample of text from the traditional Arabic lexicon “al-muğrib fῑ tartῑb al-mu‘rib”, the target lexical entries are underlined and highlighted in blue.
Figure 4.5 A sample of a traditional Arabic lexicon aṣ-ṣiḥāḥ fῑ al-luḡah ‘The Correct Language’, the original manuscript.
Figure 4.6 Using linguistic knowledge to select word-root pairs from traditional Arabic lexicons. The selected word-root pairs are underlined and highlighted in blue
Figure 4.7 The first 60 lexical entries of the root k-t-b ‘wrote’ stored in the SALMA – ABCLexicon
Figure 4.8 XML and tab separated column files formats of the SALMA-ABCLexicon
Figure 4.9 The entity relationship diagram of the SALMA-ABCLexicon
Figure 4.10 Lexicon Python Classes interface – implementation of the methods is not included
Figure 4.11 Web interface for searching the traditional Arabic lexicons
Figure 4.12 The coverage of the SALMA-ABCLexicon using exact match method
Figure 4.13 Coverage percentage of the SALMA-ABCLexicon using the lemmatizer
Figure 4.14 A sample of common words which are not covered by the lexicon
Figure 4.15 The Corpus of Traditional Arabic Lexicons frequency list
Figure 4.16 XML structure of The Corpus of Traditional Arabic Lexicons
Figure 5.1 Example sentence illustrating rival English part-of-speech tagging (from the AMALGAM multi-tagged corpus)
Figure 5.2 Example of tagged sentence using Khoja’s tag set
Figure 5.3 The Penn Arabic Treebank Tag Set; basic tags, which can be combined
Figure 5.4 Buckwalter morphological analysis of a sentence from the Arabic Treebank
Figure 5.5 Disambiguated sentence from the Arabic Treebank using FULL tag set
Figure 5.6 Buckwalter morphological analysis of a sentence from the Quran
Figure 5.7 Disambiguated sentence from the Quran using FULL tag set
Figure 5.8 A sample of tagged sentence using the FULL, RTS and ERTS tag sets
Figure 5.9 The 28 general tags of the ARBTAGS tag set
Figure 5.10 Sample of tagged text taken from the MorphoChallenge 2009 Qur’an Gold Standard. The first part uses Arabic script and the second one uses romanized letters using Tim Buckwalter transliteration scheme.
Figure 5.11 A sample of a tagged sentence taken from the Quranic Arabic Corpus
Figure 5.12 Example of part-of-speech tagged sentence using CATiB tag set
Figure 5.13 Example of tokenization, the SALMA tag assignment for separate morphemes and the combination of the morphemes tags into the word tag
Figure 6.1 Sample of Tagged vowelized Qur’an text using the SALMA Tag Set
Figure 6.2 Sample of Tagged non-vowelized newspaper text using the SALMA Tag Set
Figure 6.3 Main part-of-speech category attributes and letters used to represent them at position 1
Figure 6.4 The classification attributes of noun part-of-speech subcategories with letter at position 2
Figure 6.5 Part-of-Speech subcategories of verb, with letter at position 3
Figure 6.6 Subcategories of Particle, with letter at position 4
Figure 6.7 The word structure and the residuals that belong to each part of the word, with letter at position 5
Figure 6.8 Punctuation marks used in Arabic, with letters at position 6
Figure 6.9 Arabic classification of nouns according to gender, with letter at position 7
Figure 6.10 Morphological feature of number category attributes, with letter at position 8
Figure 6.11 Morphological feature of person category attributes, with letter at position 9
Figure 6.12 The morphological feature subcategories of Morphology attributes, with letter at position 10
Figure 6.13 The morphological feature of Case or Mood, with letter at position 11
Figure 6.14 The morphological feature Case and Mood Marks, with letter at position 12
Figure 6.15 The morphological feature of Definiteness, with letter at position 13
Figure 6.16 The morphological feature of Voice, with letter at position 14
Figure 6.17 The morphological feature of Emphasized and Non-emphasized, with letter at position 15
Figure 6.18 The morphological feature of Transitivity, with letter at position 16
Figure 6.19 Morphological feature category of Rational, with letter at position 17
Figure 6.20 The classification of nouns and verbs according to the morphological feature of Declension and Conjugation, with letter at position 18
Figure 6.21 The Unaugmented and Augmented category attributes, with letter at position 19
Figure 6.22 The Number of Root Letters category, with letter at position 20
Figure 6.23 Verb Root attributes, with letter at position 21
Figure 6.24 The classification of nouns according to their final letters, for the morphological feature of Noun Finals, with letter at position 22
Figure 7.1 Examples of spelling / tokenization variations between the Othmani script and MSA script
Figure 7.2 Mapping example, preserving the part-of-speech tag
Figure 7.3 Example of tokenizing Quranic Arabic Corpus words and their morphological tags into morphemes and their morpheme tags
Figure 7.4 Part of the dictionary data structure used to map the Quranic Arabic Corpus tag set to the morphological features tag set
Figure 7.5 A sample of the morphological features tag templates
Figure 7.6 Examples of the clitics and affixes lists
Figure 7.7 A sample of the mapped SALMA tags after applying mapping steps 1 to 4
Figure 7.8 A Sample of the QAC tags and their mapped SALMA tags after applying the mapping procedure’s steps 1-4, step 5 and manually correcting the tags.
Figure 7.9 Accuracy of mapping after step 4 and step 5 of mapping QAC to SALMA tags
Figure 7.10 Recall of mapping after step 4 and step 5 of mapping QAC to SALMA tags
Figure 7.11 Precision of mapping after step 4 and step 5 of mapping QAC to SALMA tags.
Figure 8.1 Examples of the output verb analyses
Figure 8.2 Examples of the output noun analyses
Figure 8.3 Examples of the output particle analyses
Figure 8.4 The SALMA Tagger algorithm
Figure 8.5 The word data structure
Figure 8.6 A sample output of the tokenization module component after processing the Qur’an, chapter 29
Figure 8.7 Example of applying letter-vowelization templates to a word. The matching templates are highlighted in bold.
Figure 8.8 Example of tokenization of some words
Figure 8.9 Sample of the proclitics and prefixes with their morphological tags, attributes and descriptions
Figure 8.10 Sample of the suffixes and enclitics with their morphological tags, attributes and descriptions
Figure 8.11 Example of prefix-stem-suffix agreement between a word’s morphemes
Figure 8.12 Example set of words grouped to root and lemma
Figure 8.13 Example of root extraction module
Figure 8.14 Sample of the function words list
Figure 8.15 Examples of the three named entities gazetteers
Figure 8.16 Examples of broken plurals
Figure 8.17 Sample of the patterns dictionary
Figure 8.18 Example of extracting the pattern of the words using the first method (the word and its root)
Figure 8.19 Example on Pattern Matching Algorithm 2 processing steps
Figure 8.20 Example of using the Pattern Matching Algorithm 2
Figure 8.21 Vowelization process example
Figure 8.22 Example of assigning initial SALMA Tags to all word’s morphemes
Figure 8.23 Examples of the linguistic rules applied to validate and predict the values of the morphological features
Figure 8.24 Colour codes used to colour code the morphemes of the analyzed words
Figure 8.25 Colour-coded example of a word from the Qur’an gold standard
Figure 8.26 SALMA – Tagger output formatted in a tab separated column file
Figure 8.27 SALMA – Tagger outputs format stored in XML file
Figure 8.28 SALMA – Tagger outputs formatted in HTML file
Figure 8.29 Colour coded output of the analyzed text samples of the Qur’an and MSA.
Figure 10.1 Sample of lemmatized sentence from the Arabic Internet Corpus
Figure 10.2 Lemma and root accuracy of the lemmatized Arabic internet corpus
Figure 10.3 Example of the concordance line of the word ğāmi‘at “University” from the Arabic Internet Corpus
Figure 10.4 Example of the collocations of the word ğāmi‘at “University” from the Arabic Internet Corpus
Figure 10.5 The Corpus of Traditional Arabic Lexicons frequency lists
Figure 10.6 A proposed web interface for Arabic dictionary
Figure A.1 Sample of Tagged document of vowelized Qur’an Text using SALMA Tag Set
Figure A.2 SALMA tag structure

Tables

Table 2.1 The submitted unsupervised morpheme analysis compared to the Gold Standard in non-vowelized Arabic (Competition 1).
Table 2.2 ALECSO/KACST competition participants
Table 3.1 Summary of detailed analysis of the Arabic text documents used in the experiments
Table 3.2 Results of the four evaluation experiments of the 3 stemming algorithms tested using the Qur’an text sample
Table 3.3 Tokens and word types accuracy of the 3 stemming algorithms and voting algorithms tested on CCA sample
Table 3.4 Category distribution of Roots-Types and Word-Tokens extracted from the Qur’an
Table 3.5 Summary of category distribution of root and tokens of the Qur’an
Table 3.6 Category distribution of Root and Word type extracted from the lexicon
Table 3.7 Summary of category distribution of root and word types of the lexicons
Table 4.1 Statistical analysis of the lexicon text used to construct the broad-coverage lexical resource
Table 4.2 Statistics of the traditional Arabic lexicons and morphological databases used to construct the SALMA-ABCLexicon
Table 4.3 Number of records extracted from 7 analyzed lexicons, and the number and the percentage of records combined to the SALMA-ABCLexicon.
Table 4.4 The coverage of the lexicon using exact word-match method
Table 4.5 Coverage including function words
Table 4.6 Coverage excluding function words
Table 5.1 Comparison of Arabic part-of-speech tag sets
Table 5.2 The upper limit of possible combinations of SALMA features
Table 6.1 Arabic Morphological Feature Categories
Table 6.2 Noun types as classified in traditional Arabic grammar
Table 6.3 Verb types as classified by Arab grammarians
Table 6.4 Examples of part-of-speech category attributes
Table 6.5 Examples of the part-of-speech category of Others (residuals)
Table 6.6 Subcategories of punctuation and examples of their attributes
Table 6.7 Examples of gender category attributes for nouns, verbs, adjectives and pronouns
Table 6.8 Examples of the morphological feature category of Number
Table 6.9 The three main attributes of person category with examples
Table 6.10 Examples of the morphological feature category of Inflectional Morphology
Table 6.11 The different attribute values of Case under each part-of-speech heading, as recommended by EAGLES
Table 6.12 Examples of morphological feature category of Case or Mood
Table 6.13 Examples of each attribute of the Case and Mood Marks category
Table 6.14 Examples of the morphological feature of Definiteness
Table 6.15 Examples of Voice category attributes in sentences
Table 6.16 Examples of the morphological feature Emphasized and Non-emphasized
Table 6.17 Examples of the Transitivity category attributes in sentences
Table 6.18 Examples of the morphological feature category of Rational
Table 6.19 Examples of the Declension and Conjugation morphological feature
Table 6.20 Examples of Unaugmented and Augmented category attributes
Table 6.21 Examples of Number of Root Letters category attributes
Table 6.22 Verb Root category attributes and their tags at position 21
Table 6.23 Examples of the attributes of the morphological feature of Noun Finals
Table 7.1 The mapping success rate after applying the first four mapping steps
Table 7.2 The mapping success rate after applying the fifth mapping step
Table 7.3 Accuracy, recall and precision of the mapping procedure after steps 4 and 5
Table 8.1 The 18 subcategories of nouns with examples
Table 8.2 Example of the process of selecting the matched clitics and affixes
Table 8.3 Rules for predicting the values of the morphological features of Person, Number and Gender for perfect verbs
Table 8.4 Rules for predicting the values of the morphological features of Person, Number and Gender for imperfect verbs
Table 8.5 Rules for predicting the values of the morphological features of Person, Number and Gender for imperative verbs
Table 8.6 Rules for predicting the values of the morphological features of Rational
Table 8.7 Default value of Rational and Irrational for sub part-of-speech categories of nouns, with a tag symbol at position 2
Table 8.8 Rules for predicting the values of the morphological features of Noun Finals
Table 9.1 Accuracy metrics for evaluating the CCA test sample
Table 9.2 Accuracy metrics for evaluating the Qur’an – Chapter 29 test sample
Table 9.3 Extended attributes of the Part-of-speech subcategories of Other (Residuals) and their tags at position 5
Table 9.4 Extended attributes of the Part-of-speech subcategories of Punctuation Marks and their tags at position 6
Table 10.1 Lemma accuracy
Table 10.2 Root accuracy
Table A.1 SALMA Tag Set categories
Table A.2 Main part-of-speech category attributes and tags at position 1
Table A.3 Part-of-Speech subcategories of Noun attributes and their tags at position 2
Table A.4 Part-of-Speech subcategory of verb attributes and their tags at position 3
Table A.5 Part-of-speech subcategories of Particles attributes and their tags at position 4
Table A.6 Part-of-speech subcategories of Other (Residuals) attributes and their tags at position 5
Table A.7 Part-of-speech subcategories of Punctuation Marks attributes and their tags at position 6
Table A.8 Morphological feature of Gender attributes and their tags at position 7
Table A.9 Morphological feature of Number attributes and their tags at position 8
Table A.10 Morphological feature of Person category attributes and their tags at position 9
Table A.11 The morphological feature category of Inflectional Morphology attributes and their tags at position 10
Table A.12 The morphological feature of Case or Mood category attributes and their tags at position 11
Table A.13 The morphological feature category of Case and Mood Marks attributes and tags at position 12
Table A.14 The morphological feature of Definiteness category attributes and their tags at position 13
Table A.15 The morphological feature of Voice category attributes and their tags at position 14
Table A.16 The morphological feature of Emphasized and Non-emphasized category attributes and their tags at position 15
Table A.17 The morphological feature of Transitivity category attributes and their tags at position 16
Table A.18 Morphological feature category of Rational attributes and their tags at position 17
Table A.19 The morphological feature of Declension and Conjugation category attributes and their tags at position 18
Table A.20 The morphological feature of Unaugmented and Augmented category attributes and their tags at position 19
Table A.21 The morphological feature of Number of Root Letters category attributes and their tags at position 20
Table A.22 The morphological feature of Verb Root category attributes and their tags at position 21
Table A.23 The morphological feature of Noun Finals category attributes and their tags at position 22

List of Abbreviations

Abbreviation                Meaning
BAMA                        Buckwalter’s Morphological Analyzer
CCA                         The Corpus of Contemporary Arabic
MSA                         Modern Standard Arabic
LDC                         Linguistic Data Consortium
APT                         Khoja’s Arabic Part-of-speech Tagger
FST                         Finite state transducer
NLTK                        Natural Language Toolkit
SALMA-ABCLexicon            Sawalha Atwell Leeds Morphological Analysis – Arabic Broad-Coverage Lexicon
SALMA-Tag Set               Sawalha Atwell Leeds Morphological Analysis – Tag Set
SALMA-Tokenizer             Sawalha Atwell Leeds Morphological Analysis – Tokenizer
SALMA-Lemmatizer & Stemmer  Sawalha Atwell Leeds Morphological Analysis – Lemmatizer and Stemmer
SALMA-Pattern Generator     Sawalha Atwell Leeds Morphological Analysis – Pattern Generator
SALMA-Vowelizer             Sawalha Atwell Leeds Morphological Analysis – Vowelizer
SALMA-Tagger                Sawalha Atwell Leeds Morphological Analysis – Tagger
CML                         Croatian Morphological Lexicon
EAGLES                      Expert Advisory Group on Language Engineering Standards
SKEL                        Software and Knowledge Engineering Laboratory
Lefff                       Lexique des formes fléchies du français – Lexicon of French inflected forms
LMF                         Lexical Markup Framework, the ISO/TC37 standard for NLP lexicons
XML                         Extensible Markup Language
ACL SIGLEX                  The Special Interest Group on the Lexicon of the Association for Computational Linguistics
COMLEX                      COMmon LEXicon
OTA                         Oxford Text Archive
AWN                         Arabic WordNet
PWN                         Princeton WordNet
CLAWS                       The Constituent Likelihood Automatic Word Tagging System
BNC                         The British National Corpus
AMALGAM                     Automatic Mapping Among Lexico-Grammatical Annotation Models
ICE                         International Corpus of English
LLC                         London-Lund Corpus
LOB                         Lancaster-Oslo/Bergen Corpus
SKRIBE                      Spoken Corpus Recordings In British English
PoW                         Polytechnic of Wales corpus
SEC                         Spoken English Corpus
UPenn                       University of Pennsylvania corpus
SALMA Tag Set               Sawalha Atwell Leeds Morphological Analysis – Tag Set
ALECSO/KACST                Arab League Educational, Cultural and Scientific Organization / King Abdul-Aziz City of Science and Technology
PADT                        Prague Arabic Dependency Treebank
PATB                        The Penn Arabic Treebank
MWEs                        Multi-Word Expressions
HMM                         Hidden Markov Model

Part I: Introduction and Background Review

Chapter 1 Introduction

“أنا البحر في أحشائه الدر كامن فهل سألوا الغواص عن صدفاتي”
’anā al-baḥru fῑ ’aḥšā’ihi ad-durru kāminun fahal sa’alū al-ḡawwāṣ ‘an ṣadafātῑ
“Arabic says: I am the sea where pearls are hidden inside. Have they (the people) asked the diver about my seashells?”
Hafiz Ibrahim (1872 – 1932)

Chapter Summary

Morphological analysis for Arabic text corpora is the topic of this thesis, and the first section of this chapter introduces it. The chapter also provides a general definition of computational morphology, and presents Arabic computational morphology and the complexity of Arabic morphology. The motivations and objectives of the thesis, and the original contributions of developed resources, proposed standards and tools, are summarized in section 1.5. Finally, this chapter presents the structure of the thesis.

1.1 This Thesis

The topic of this thesis is morphological analysis for Arabic text corpora.
Morphological analysis for text corpora is a prerequisite for many text analytics applications, which has attracted many researchers from different disciplines such as linguistics (computational and corpus linguistics), artificial intelligence, and natural language processing, to morphosyntactically analyze text of different languages including Arabic. Recently, several researchers have investigated different approaches to morphological and syntactic analysis for Arabic text. Many systems have been developed, ranging in complexity from light stemmers and root extraction systems, through lemmatizers and complex morphological analyzers, to part-of-speech taggers and parsers.

This introduction will detail what is special about morphological analysis for Arabic text corpora. We will introduce computational morphology and the complexity of Arabic morphology that has inspired this research. The motivation and the objectives for this thesis will be discussed. Both research and practical perspectives on the value of carrying out this research will be explained.

We present the argument that the linguistic wisdom in traditional Arabic grammars and lexicons can be utilized (i.e. renewed and re-validated) in an Arabic NLP toolkit which is easy to access and implement. We believe that such detailed knowledge is applicable to Modern Standard Arabic and that it can be used to restore orthographic (e.g. short vowels) and morphological features which signify important linguistic distinctions. Moreover, fine-grained morphological analysis is possible (i.e. achievable) and advantageous. The implemented Arabic NLP toolkit is general-purpose, adherent to standards and reusable, which will fulfil many researchers’ and users’ needs.

1.2 Computational Morphology

Morphology is the study, identification, analysis and description of the minimal meaning-bearing units that constitute a word. The minimal meaning-bearing unit of a word is called a morpheme.
Categorizing and building a representative structure of the component morphemes is called morphological analysis. Both orthographic rules and morphological rules are important for categorizing a word’s morphemes. For instance, orthographic rules for pluralizing English words ending with –y such as party indicate changing the –y to -i- and adding –es. Morphological rules tell us that fish has a null plural and that the plural of goose is formed by a vowel change. The morphological analysis of the surface or input form going is the verbal stem go plus the –ing morpheme: VERB-go + GERUND-ing (Jurafsky and Martin 2008). Section 2.3 defines morphological analysis in general, while section 2.3.4 redefines morphological analysis for Arabic text.

Computational morphology is a branch of computational linguistics (i.e. natural language processing or language engineering). The main concern of computational morphology is to develop computer applications (i.e. toolkits) that analyze words of a given text and deal with the internal structure of words such as determining their part-of-speech and morphological features (e.g. gender, number, person, case, mood, voice, etc.) (Kiraz 2001); see sections 2.3 and 2.3.4.

Morphological analysis has many applications throughout speech and language processing. In web searching for morphologically complex languages, morphological analysis enables searching for the inflected forms of a word even if the search query contains only the base form. Morphological analysis gives the most important information for a part-of-speech tagger to select the most suitable analysis for a given context. Dictionary construction and spell-checking applications rely on robust morphological analysis. Machine translation systems rely on highly accurate morphological analysis to specify the correct translation of an input sentence (Jurafsky and Martin 2008). Lemmatization is an aspect of morphological analysis.
Google’s search facilities use lemmatization to produce hits of all inflectional forms of the input word. Statistical models of language in machine translation and speech recognition also use lemmatization. Lexicographic applications use lemmatizers as an essential tool for corpus-based compilation (Pauw and Schryver 2008).

Morphological analysis techniques form the basis of most natural language processing systems. Such techniques are very useful for many applications, such as information retrieval, text categorization, dictionary automation, text compression, data encryption, vowelization and spelling aids, automatic translation, and computer-aided instruction (Al-Sughaiyer and Al-Kharashi 2004); see also section 2.3.3.

1.3 Arabic Computational Morphology

Arabic is a living language that belongs to the Semitic group of languages. The Semitic group of languages includes other living languages such as Modern Hebrew, Amharic, Aramaic, Tigrinya and Maltese (Haywood and Nahmad 1965). The main characteristic feature of Semitic languages is their non-concatenative morphology, where words are derived from mostly triliteral consonantal roots. Roots of Semitic languages carry the basic conceptual meanings, while varying the vowelling of the simple root and adding prefixes, suffixes and infixes produces the different variations in shade of meaning (Haywood and Nahmad 1965). For example, from the Arabic root كتب k-t-b ‘wrote’ we can derive the following words by filling in the vowels: كِتَاب kitāb ‘book’, كُتُب kutub ‘books’, كَاتِب kātib ‘writer’, كُتَّاب kuttāb ‘writers’, كَتَبَ kataba ‘he wrote’, يَكْتُبُ yaktubu ‘he writes’, etc. Sections 1.4 and 2.3.4.1 discuss in detail the complexity of Arabic morphology.

Arabic is classified into Classical Arabic (e.g. the Qur’an); Modern Standard Arabic (e.g. newspapers and magazines); and Spoken or Colloquial Arabic. Modern Standard Arabic varies in idiom and vocabulary from Classical Arabic.
However, the grammar of the 6th century Classical Arabic still applies largely to modern written Arabic. This is because Classical Arabic was the vehicle of God’s Revelation in the Qur’an (Haywood and Nahmad 1965).

The study of traditional Arabic grammar started in the 8th century. The main reason for Arabic linguistic studies was to preserve the original Arabic language, given the wide expansion of the Islamic community, which came to include many Muslims who were not native speakers of Arabic but used it to perform daily worship. The first order to establish traditional Arabic grammar was given by the fourth Khalifa, Imam Ali bin Abi Talib الإمام علي بن أبي طالب al-’imām ‘alī bin ’abī ṭālib, to Abu Al-Aswad Ad-Du’aly أبو الأسود الدؤلي ’abū ’al-’aswad ad-du’alī, to write the fundamentals of Arabic grammar. Early scholars such as Abū Amr bin Al-Ala’ أبو عمرو بن العلاء ’abū ‘amr bin al-‘alā’ established the relations between language and its grammar rules, and the connections of Qur’an recitation styles. Al-Khalil bin Ahmad Al-Farahidi الخليل بن أحمد الفراهيدي al-ẖalīl bin ’aḥmad al-farāhīdī is the founder of Arabic grammar as a discipline: he defined its rules, regulations and documentation methodologies. These methodologies allowed Sibawayh سيبويه sībawayh to write the first comprehensive traditional Arabic grammar book, called Al-Kitab الكتاب al-kitāb ‘The Book’ (Wlad Abah 2008).

Present-day Arabic language scholars are still interested in studying traditional Arabic grammar books. These interests include rewriting and verifying manuscripts and studying the life of their authors and their methodologies. Among the recent interests of Arabic linguists is the study of new international linguistic knowledge and its application to Arabic. Moreover, researchers are interested in connecting the results of modern linguistic studies applied to Arabic with the findings and conclusions of the early Arabic traditional grammar scholars (Wlad Abah 2008).
Modern linguistic theories of Arabic morphology have studied the derivation process of Arabic words from two points of view: root-based and stem-based (or word-based). The theory of Prosodic Morphology (McCarthy and Prince 1990b; McCarthy and Prince 1990a) defines the basic character of phonological structure and its consequences for morphology. The true templatic morphology is represented by the derivational categories of the Arabic verbs. Using multiple levels of representation, Arabic verbs have three auto-segmental tiers: consonantal tier (i.e. the root), CV skeleton (i.e. patterns) and vocalic melody (i.e. short vowels).

Benmamoun (1999) studied the nature and role of the imperfective verb in Arabic. The imperfective verb is not specified for tense. Hence, it is the default form of the verb that does not carry temporal features. This unmarked status of imperfective verbs is consistent with their central role in word formation, which allows for a unified analysis of nominal and verbal morphology. On this view, a word-based approach to Arabic word formation is more important than a root-based one.

Morphological analysis for Arabic entails computer applications that analyze Arabic words of a given text and deal with the internal structure. It involves a series of processes that identify all possible analyses of the orthographic word. These processes are both form-based and function-based (Thabet 2004; Hamada 2009a; Habash 2010; Hamada 2010). Morphological analyzers for Arabic text are required to develop processes that deal with both the form and the function of the word. These processes include tokenization, spell-checking, stemming and lemmatization, pattern matching, diacritization, predicting the morphological features of the word’s morphemes, part-of-speech tagging and parsing.

Many morphological analyzers for Arabic text were developed using a range of methodologies.
These methodologies are: Syllable-Based Morphology (SBM), which depends on analyzing the syllables of the word; Root-Pattern Methodology, which depends on the root and the pattern of the word for analysis; Lexeme-based Morphology, where the stem of the word is the crucial information that needs to be extracted from the word; and Stem-based Arabic lexicons with grammar and lexis specifications (Soudi, Cavalli-Sforza and Jamari 2001; Soudi, Bosch and Neumann 2007).

Morphological analyzers differ in their methodologies and their tasks. Stemmers are responsible for extracting the stem/root of words (Khoja 2001; Al-Sughaiyer and Al-Kharashi 2002; Al-Shalabi, Kanaan and Al-Serhan 2003; Khoja 2003; Al-Shalabi 2005; AlSerhan and Ayesh 2006; Boudlal et al. 2011). Lemmatizers identify the canonical form, dictionary form, or citation form of words, which is also called the lemma (Dichy 2001; Al-Shammari and Lin 2008). Pattern matching algorithms generate the templatic form (i.e. patterns) and vocalism of the analysed words. However, the representation of the templatic forms and vocalism might vary from one algorithm to another (Dichy and Farghaly 2003; Al-Shalabi 2005; Alqrainy 2008; Yousfi 2010). General purpose morphological analyzers generate all possible analyses of the words out of their contexts. Key morphological analyzers for Arabic text are: Xerox system (Beesley 1996; Beesley 1998), Buckwalter’s Morphological Analyzer (BAMA) (Buckwalter 2002; Buckwalter 2004), ElixirMF (Smrz 2007), AlKhalil (Boudlal et al. 2010), MORPH2 (Hamado, Belghayth and Sha’baan 2009; Kammoun, Belguith and Hamadou 2010) and MIDAD (Sabir and Abdul-Mun’im 2009).

1.4 The Complexity of Arabic Morphology

Arabic is a highly inflectional language, which makes processing tasks for Arabic text extremely hard. Morphological analysis of Arabic text is not an easy task and it affects higher level applications such as part-of-speech tagging and parsing.
Due to the rich “root-and-pattern” non-concatenative (or nonlinear) morphology and the highly complex word formation process of root and patterns, hundreds of words can be derived from a single root by following certain patterns and conjoining affixes and clitics to the word. The attachment of affixes and clitics significantly increases the number of derived words.

Ambiguity in Arabic text is a major challenge for processing. Ambiguity is due to the absence of short vowels in most Arabic texts and to the interaction between the letters of affixes or clitics and the original letters that compose the root, especially if one or two long vowels are part of the root letters. Clitics and affixes of Arabic words are productive. Therefore, storing word forms in a dictionary and doing morphological analysis by dictionary lookup is not possible, as we cannot list all morphological variants of every Arabic word. Thus, morphological analysis done dynamically is unavoidable. A word such as بِوَالِدَيْهِ bi-wālidayhi ‘in his parents’ consists of four morphemes: بِ bi ‘in’ is a preposition, وَالِدَ wālida ‘parent’ is the noun stem morpheme, يْ y ‘two’ is a dual letter, and هِ hi ‘his’ is an object relative pronoun. The proclitic بِ bi ‘in’ and the enclitic هِ hi ‘his’ are productive clitics.

The root letters can be hard to guess and increase text ambiguity if one or two root letters are long vowels or belong to the letters of affixes and clitics. The absence of short vowels can make morphological analysis even harder. For example, the word ولدينا wldynā has two possible morphological analyses, see figure 1.1. First, وَلَدَيْنَا waladaynā ‘Our two sons’ has the root ولد w-l-d ‘descendant, offspring, child, son’ and has three morphemes: وَلَدَ walada ‘son or boy’, يْنَ yna ‘dual letters’, and ا ā ‘our’, a nominative suffixed pronoun. Second, وَلَدَيْنَا wa-ladaynā ‘and we have got’ of the root لدي l-d-y has three morphemes: وَ wa ‘and’ is a conjunction proclitic, لَدَيْ laday ‘have got’ is a perfect verb stem, and نَا nā ‘we’ is a genitive suffixed pronoun. In this example, the interaction between the clitic letter and the underlying letter of the word increases the complexity of morphological analysis for Arabic text. The first letter of the word, و wa, is one of the underlying letters of the word in the first analysis, and it can be analyzed as a conjunction letter as shown in the second analysis. Section 2.3.4.1 discusses the challenges of complex Arabic morphology. Sections 5.5 and 8.3.1.4 define our approach to defining the word’s morphemes.

ولدينا wldynā
Analysis 1: وَلَدَ + يْنَ + ا = وَلَدَيْنَا waladaynā ‘Our two sons’ has the root ولد w-l-d ‘descendant, offspring, child, son’
Analysis 2: وَ + لَدَيْ + نَا = وَلَدَيْنَا wa-ladaynā ‘and we have got’ of the root لدي l-d-y
Figure 1.1 Example of ambiguous Arabic word

Gemination is one of the orthographic issues that the morphological analyzer has to deal with correctly. Besides short vowels ( ◌َ ◌ُ ◌ِ ) and gemination šaddah ( ◌ّ ), other orthographic issues of Arabic are: hamzah ( ء أ إ ؤ ئ ); tā’ marbūṭah ( ة ) and hā’ ( ه ); yā’ ( ي ) and ’alif maqṣūrā ( ى ); and maddah ( آ ) or extension, which is a compound letter of hamzah and ’alif ( أ ا ). Chapter 2 discusses the morphological complexity of Arabic text.

1.5 Motivation and Objectives for this Thesis

Our research into morphological analysis of Arabic text corpora involves original scientific research, and focuses on the question of how to widen the scope of Arabic morphological analyses, to develop an NLP toolkit that can process Arabic text in a wide range of formats, domains, and genres, of both vowelized and non-vowelized Arabic text.

The inspiration behind this research is centuries-old linguistic wisdom and knowledge captured and readily available in traditional Arabic grammars and lexicons.
The knowledge can be utilized in an Arabic NLP toolkit which can be accessed, standardized, reused and implemented in Arabic natural language processing. The detailed knowledge is applicable to both Classical and Modern Standard Arabic and can be used to restore orthographic (e.g. short vowels) and morphological features which signify important linguistic distinctions. Fine-grained morphological analysis is possible, achievable and advantageous in processing Arabic text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications. We foresee the advantage of enriching the text with part-of-speech tags of very fine-grained grammatical distinctions, which reflect expert interest in syntax and morphology, but not specific needs of end-users, because end-user applications are not known in advance.

The objective of the thesis has been achieved through developing a novel language-engineering toolkit for morphological analysis of Arabic text, the SALMA – Tagger. The SALMA – Tagger combines sophisticated modules that break down the complex morphological analysis problem into achievable tasks which each address a particular problem and also constitute stand-alone units. These modules are:
• The SALMA – Tokenizer which tokenizes the input text files and identifies the Arabic words, spell-checks and corrects the words, and identifies the word’s parts or morphemes.
• The SALMA – Lemmatizer and Stemmer which extracts the lemma and the root of the analysed word.
• The SALMA – Pattern Generator which is responsible for matching the word with its pattern.
• The SALMA – Vowelizer which is responsible for adding the short vowels to the analysed words.
• The SALMA – Tagger module that predicts the fine-grained morphological features for each of the analysed word’s morphemes.
These modules are useful as stand-alone tools which users can select and/or customise to their own applications.
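The modular decomposition described above can be pictured as a chain of stand-alone stages that a user selects and composes. The following is a minimal sketch under stated assumptions, not the SALMA implementation: every name in it (`Analysis`, `TOY_LEXICON`, `tokenize`, `lemmatize_and_stem`, `generate_pattern`, `run_pipeline`) is hypothetical, the lexicon holds only the transliterated k-t-b examples from section 1.3, and each stage is a deliberately naive placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Analysis:
    """Hypothetical record that accumulates what each stage learns about a word."""
    word: str
    lemma: Optional[str] = None
    root: Optional[str] = None
    pattern: Optional[str] = None


# A stage takes an Analysis and returns it, possibly enriched.
Stage = Callable[[Analysis], Analysis]

# Toy lexicon (assumption): transliterated forms of the root k-t-b only.
TOY_LEXICON: Dict[str, Dict[str, str]] = {
    "kutub": {"lemma": "kitāb", "root": "ktb"},
    "kitāb": {"lemma": "kitāb", "root": "ktb"},
}


def tokenize(text: str) -> List[Analysis]:
    """Placeholder tokenizer: whitespace splitting only."""
    return [Analysis(word=w) for w in text.split()]


def lemmatize_and_stem(a: Analysis) -> Analysis:
    """Look the word up in the toy lexicon; unknown words are left unanalysed."""
    entry = TOY_LEXICON.get(a.word)
    if entry is not None:
        a.lemma, a.root = entry["lemma"], entry["root"]
    return a


def generate_pattern(a: Analysis) -> Analysis:
    """Replace the root consonants, matched greedily left to right, by 'C'."""
    if a.root is None:
        return a
    remaining = list(a.root)
    template = []
    for ch in a.word:
        if remaining and ch == remaining[0]:
            template.append("C")
            remaining.pop(0)
        else:
            template.append(ch)
    a.pattern = "".join(template)
    return a


def run_pipeline(text: str, stages: List[Stage]) -> List[Analysis]:
    """Tokenize, then apply only the stages the caller selected, in order."""
    analyses = tokenize(text)
    for stage in stages:
        analyses = [stage(a) for a in analyses]
    return analyses
```

Because each stage has the same signature, a caller who needs only lemmas can pass `[lemmatize_and_stem]`, while a caller who also wants templates passes `[lemmatize_and_stem, generate_pattern]`. A vowelizer or feature tagger would slot in the same way.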
The previously mentioned original Arabic NLP toolkit depends on two novel and original resources and proposed standards developed throughout this project. These are:
• The SALMA – Tag Set. The theory informing the morphological features tag set developed in this thesis is to base the tag set on traditional morphological features as defined in long-established Arabic grammar, in a notation format intended to be compact yet transparent.
• The SALMA – ABCLexicon, a novel broad-coverage lexical resource constructed by extracting information from many traditional Arabic lexicons, constructed over 1200 years, of disparate formats.

An additional resource resulting from the construction of the SALMA – ABCLexicon is the Corpus of Traditional Arabic Lexicons. The Corpus of Traditional Arabic Lexicons is a special corpus of Arabic which is compiled from the text of 23 traditional Arabic lexicons that cover a period of 1,300 years and shows the evolution of Arabic vocabulary. It contains about 14 million word tokens and about 2 million word types.

In summary, this research has contributed to Arabic NLP in three dimensions: resources, proposed standards and tools (i.e. practical software). The following is a list of the contributions classified into the three dimensions:

A. Resources
1. The SALMA – ABCLexicon.
2. The Corpus of Traditional Arabic Lexicons.
3. The morphological lists of the SALMA – Patterns Dictionary and the SALMA – Clitics and Affixes lists.
4. The several linguistic lists that are used by the SALMA – Tagger such as: function words list, named entities lists, broken plural list, conjugated and non-conjugated verbs list, and transitive verbs lists.
5. The Lemmatized version of the Arabic Internet Corpus.

B. Proposed Standards
6. The SALMA – Tag Set.
7. The SALMA – Gold Standard for evaluating morphological analyzers for Arabic text.
8. The MorphoChallenge 2009 Qur’an Gold Standard.
9. Proposed standards for developing morphological analyzers for Arabic text.
10. Proposed standards for evaluating morphological analyzers for Arabic text.

C. Tools (practical software)
11. The SALMA – Tagger
12. The SALMA – Tokenizer
13. The SALMA – Lemmatizer and Stemmer
14. The SALMA – Vowelizer
15. The SALMA – Pattern Generator

Finally, a potential future application of these contributions is as a language-engineering toolkit for Arabic lexicography to construct Arabic monolingual and bilingual dictionaries (Section 10.3).

1.6 Thesis Structure

This thesis is organized into five parts. Part I: Introduction includes Chapter 1. Part II: Background Review includes Chapters 2, 3, 4 and 5. Part III: Standards for Arabic Morphological Analysis includes Chapters 6 and 7. Part IV: Tools and Applications for Arabic Morphological Analysis includes Chapters 8, 9 and 10. Part V: Conclusions and Future Work includes Chapter 11. The following highlights the thrust of the work presented in this thesis:
• Part I: Introduction and Background Review includes:
o Chapter 1: Introduction where the previous sections have given an introduction to the problems associated with studying morphological analysis in general and for Arabic text in particular. Section 1.5 discussed the motivations and objectives for this thesis. It also summarized the original contributions to the Arabic NLP field of study.
o Chapter 2: Literature Review: Morphological Analyses of Arabic Text presents coverage of background and literature surveys relevant to the research. First, a survey of Arabic text corpora is discussed in section 2.2. Second, a literature survey of morphological analysis in general and morphological analysis for Arabic text in particular is discussed in section 2.3. This section presents the general methodologies of morphological analysis and those which have been applied to Arabic text. It also surveys the existing key
morphological analyzers for Arabic text and discusses their attributes. Third, a survey of part-of-speech taggers for Arabic text is presented in section 2.4. It comparatively evaluates existing part-of-speech taggers for Arabic text.
• Part II: Background Analysis and Design includes:
o Chapter 3: Comparative Evaluation of Arabic Morphological Analyzers and Stemmers surveys stemming algorithms for Arabic text used in the comparative evaluation in section 3.2. Then it discusses four different fair and precise evaluation experiments using a gold standard for evaluation in sections 3.4 and 3.5. Finally, it presents an analytical study of the triliteral Arabic roots in section 3.7.
o Chapter 4: The SALMA-ABCLexicon: Prior-Knowledge Broad-Coverage Lexical Resource to Improve Morphological Analyses surveys morphological lexicons for Arabic and other languages in section 4.1. Traditional Arabic lexicons and lexicography are presented in section 4.2. Twenty-three traditional Arabic lexicons are listed and classified according to their ordering methodology in section 4.3. The construction methodology of the SALMA – ABCLexicon using the traditional Arabic lexicons and its evaluation are discussed in sections 4.4 and 4.5. The Corpus of Traditional Arabic Lexicons is described in section 4.6.
o Chapter 5: The survey of Arabic Morphosyntactic Tag Sets and Standards for Designing the SALMA Tag Set presents existing part-of-speech tagging systems and tag sets for Arabic text in sections 5.2 and 5.3. Section 5.4 discusses the morphological features in Tag Set design criteria.
• Part III: Proposed Standards for Arabic Morphological Analysis includes:
o Chapter 6: The SALMA Tag Set analyzes 22 morphological features of Arabic word morphemes. It defines the attributes of each morphological feature by identifying their characteristics and deciding which attributes are used for the analysis of specific morphological categories.
o Chapter 7: Applying the SALMA Tag Set explores the evaluation methodologies of the SALMA – Tag Set in section 7.3. A practical application of the SALMA – Tag Set has been achieved by mapping from the Quranic Arabic Corpus morphological tag set in section 7.4. The evaluation of the mapping process is reported in section 7.5 and discussed in section 7.6.
• Part IV: Tools and Applications for Arabic Morphological Analysis includes:
o Chapter 8: The SALMA Tagger for Arabic Text discusses morphological analysis for Arabic text. It presents standards for developing a robust morphological analyzer for Arabic text based on our experiences in participating in two contests for developing morphological analyzers for Arabic text: the ALECSO/KACST initiative and MorphoChallenge 2009 competition (section 8.2). The SALMA – Tagger algorithm is described in section 8.3. The SALMA – Tagger is decomposed into sophisticated modules that break down the complex morphological analysis problem into achievable tasks so they solve particular problems and are useful in their own right. These modules are: the SALMA – Tokenizer; the SALMA – Lemmatizer and Stemmer; and the SALMA – Pattern Generator. A rule-based system for predicting the morphological features of Arabic word morphemes is discussed in section 8.4. Finally, standard output formats of the SALMA – Tagger are described in section 8.5.
o Chapter 9: Evaluation for the SALMA – Tagger depends on developing agreed standards for evaluating morphological analyzers for Arabic text, based on our experiences and participation in two evaluation contests: the ALECSO/KACST initiative for developing and evaluating morphological analyzers; and the MorphoChallenge 2009 competition (section 9.2). The construction of a reusable general purpose gold standard (the SALMA – Gold Standard) for evaluating the SALMA – Tagger and morphological analyzers for Arabic text in general is described in sections 9.4 and 9.5.
Sections 9.6 and 9.7 discuss the process of evaluating the SALMA – Tagger using gold standards. Evaluation metrics are discussed and the results of the evaluation reported. The discussion of the results analyzes the prediction process, the challenges and suggestions for improvement for each morphological feature category in section 9.8.
o Chapter 10: Practical Applications of the SALMA Tagger describes two practical applications of the resources, standards, and tools developed in this thesis. The first application, lemmatizing the 176-million word Arabic Internet Corpus, is described in section 10.2. The second, an exemplar for using the resources, standards and tools as a language-engineering toolkit for Arabic lexicography to construct Arabic monolingual and bilingual dictionaries, is described in section 10.3.
• Part V: Conclusions and Future Work includes:
o Chapter 11: Conclusions and Future Work summarizes the conclusions of this thesis. It reviews the motivations and objectives for this thesis and lists the main contributions and their impact on Arabic NLP. The second part of the chapter discusses future work that can be done to improve the developed resources, standards and tools. It also shows example projects of higher NLP applications that can benefit directly from our contributions and from our research interests.

Chapter 2
Literature Review: Morphosyntactic Analysis of Arabic Text

2.1 Introduction

This chapter surveys existing morphosyntactic analysis systems for text corpora. The survey studies these systems in three dimensions. First, it explores Arabic text corpora as a background prerequisite for morphosyntactic analysis. Second, it studies morphological analysers for text corpora concentrating on methodologies, challenges, examples of existing morphological analysers, and evaluation standards. Third, it surveys part-of-speech tagging technology and existing part-of-speech taggers for Arabic text.

Arabic corpora started to appear in the late 1980s.
Most of the existing Arabic corpora are of MSA written text, mainly newspaper text. Only two corpora are open-source and available to download. These are the Corpus of Contemporary Arabic (CCA) (Al-Sulaiti and Atwell 2006) and the Quranic Arabic Corpus (QAC) (Dukes, Atwell and Sharaf 2010; Dukes and Habash 2010). The CCA represents MSA and contains 1 million words of raw text, and the QAC represents Classical Arabic and consists of the Qur’an text of about 80,000 words. The QAC is enriched with morphological and syntactic annotation layers. Section 2.2 surveys existing Arabic corpora.

Several morphological analysers for Arabic text exist. Morphological analysis is an important pre-processing step for many text analytics applications. The aim of morphological analysis is to define words in a corpus in terms of morphosyntactic information such as: (i) information about the word structure (i.e. root, affixes, clitics, patterns and vowelization); (ii) part-of-speech of the word (i.e. noun, verb and particle); (iii) part-of-speech subcategories of the word (e.g. gerund, noun of place, active participle, generic noun, proper nouns, pronouns, perfect verb, imperfect verb, imperative verbs, prepositions, etc.); and (iv) the morphological features of the word (e.g. Gender, Number, Person, Case or Mood, Transitivity, Rational, Number of root letters, etc.). The information resulting from morphological analysers can be used in different levels of NLP applications. Section 2.3 surveys morphological analysis of text corpora focusing on its approaches, applications, the specific definition of morphological analysis for Arabic text, challenges of Arabic morphology, and morphological analysis of both Classical and MSA text. It also surveys state-of-the-art morphological analysers and evaluation methodologies.

Morphological analysers are designed to generate all possible analyses of the analysed words out of their context.
Disambiguating the analysis to suit the context is done by using part-of-speech taggers. Section 2.4 surveys part-of-speech tagging technology. It lists state-of-the-art part-of-speech taggers for English, the tagged corpora and the standards. The section surveys existing part-of-speech taggers for Arabic text. It briefly lists existing part-of-speech taggers, their development approaches and their accuracy as reported by their developers.

2.2 Arabic Corpora

Arabic corpora started to appear in the late 1980s. The following list of Arabic corpora, developed from Al-Sulaiti and Atwell (2006), outlines their size, type, purpose of development and the materials used to develop them:
• Buckwalter Arabic Corpus (1986-2003) consists of about 3 million words of public resources on the web to be used in lexicography.
• Leuven Corpus (1990-2004) developed at the Catholic University of Leuven, Belgium, consists of about 3 million words of written and spoken text from internet sources, radio and TV and primary school books, to be used in the development of Arabic-Dutch / Dutch-Arabic learner’s dictionaries.
• Arabic Newswire Corpus (1994) developed at the University of Pennsylvania LDC, consists of 80 million words of written text collected from Agence France Presse (AFP), Xinhua News Agency, and Umma Press, to be used in education and the development of technology.
• CALLFRIEND Corpus (1995) developed at the University of Pennsylvania LDC. This corpus comprises 60 telephone conversations by Egyptian native speakers, to be used in the development of language identification technology.
• Nijmegen Corpus (1996) developed at Nijmegen University consists of over 2 million written words collected from magazines and fiction, to be used in Arabic-Dutch / Dutch-Arabic dictionaries.
• CALLHOME Corpus (1997) developed at the University of Pennsylvania LDC, consists of 120 telephone conversations of Egyptian native speakers, to be used in telephony and speech recognition.
• CLARA (1997) developed at Charles University, Prague, consists of 50 million words collected from periodicals, books, internet sources from 1975-present, to be used for lexicography.
• Egypt (1999) developed at Johns Hopkins University, a parallel corpus of the Qur’an in English and Arabic to be used in machine translation.
• Broadcast News Speech (2000) developed at University of Pennsylvania LDC, consists of more than 110 News broadcasts from the Voice of America radio station, to be used in speech recognition.
• DINAR Corpus (2000) developed at Nijmegen University and SOTETEL-IT, in co-ordination with Lyon2 University, consists of 10 million words, to be used in lexicography, general research, and NLP.
• An-Nahar Corpus (2001) developed by ELRA, consists of 140 million words of written text collected from An-Nahar newspaper (Lebanon), to be used in general text research.
• Al-Hayat Corpus (2002) developed by ELRA consists of 18.6 million words of written text collected from Al-Hayat newspaper (Lebanon), to be used for language engineering and information retrieval applications.
• Arabic Gigaword (2002) developed at the University of Pennsylvania LDC, consists of around 400 million words collected from Agence France Presse (AFP), Al-Hayat news agency, An-Nahar news agency and Xinhua news agency, to be used in natural language processing, information retrieval and language modelling.
• E-A Parallel Corpus (2003) developed at the University of Kuwait, consists of 3 million words of written text collected from publications from Kuwait National Council, to be used in teaching, translation and lexicography.
• General Scientific Arabic Corpus (2004) developed at UMIST, UK, consists of 1.6 million words of written text, to be used in investigating Arabic compounds.
• Classical Arabic Corpus (CAC) (2004) developed at UMIST, UK, consists of 5 million words of written text, to be used in lexical analysis.
• Multilingual Corpus (2004), developed at UMIST, UK, consists of 11.5 million words of written text including 2.5 million words in Arabic, collected from IT-specialized websites (computer system and online software help) and one book, to be used in translation studies.
• SOTETEL Corpus, developed at SOTETEL-IT, Tunisia, consists of 8 million words of written text collected from literature, academic and journalistic materials, to be used in lexicography.
• Corpus of Contemporary Arabic (CCA) (2004), developed at the University of Leeds, consists of 1 million words of written and spoken data, collected from websites and online magazines, to be used in language teaching and language technology.
• DARPA Babylon Levantine Arabic Speech and Transcripts (2005), developed at the University of Pennsylvania LDC, consists of about 2000 telephone calls collected from Fisher style telephone speech collection, to be used in machine translation, speech recognition and spoken dialogue systems.
• The Penn Arabic Treebank (2001): Part 1 consists of 166,000 words of written Modern Standard Arabic newswire from the Agence France Presse corpus; and Part 2 consists of 144,000 words from Al-Hayat distributed by Ummah Arabic News Text, to be used in computational linguistics. New features of annotation in the UMAAH (UMmah Arabic Al-Hayat) corpus include complete vocalization (including case endings), lemma IDs, and more specific part-of-speech tags for verbs and particles. The Arabic Treebank corpora are annotated for morphological information, part-of-speech, English gloss (all in the “part-of-speech” phase of annotation), and for syntactic structure (Maamouri and Bies 2004).
• The Quranic Arabic Corpus (2009) contains the classical Arabic source text of the Quran, the holy book of Islam. The text consists of nearly 80,000 words, divided into numbered chapters and verses.
The text is being enriched with morphological analysis, part-of-speech tagging, dependency parsing, coreference resolution, and other linguistic markup, via a collaborative web-based project. The annotated corpus is online, used by Quranic scholars, linguists, and the general public with an interest in Islam.

Nearly all these corpora have been collected by Arabic corpus linguistics research groups for their own purposes, and are not freely downloadable. The Corpus of Contemporary Arabic (CCA) developed at the University of Leeds (Al-Sulaiti and Atwell 2004; Al-Sulaiti and Atwell 2005; Al-Sulaiti and Atwell 2006) is the only freely available corpus on the web which has been widely reused for linguistic research. However, it has not been annotated with part-of-speech tags. The only annotated corpus of the Arabic language used widely in computational linguistics research is the Penn Arabic Treebank (Maamouri and Bies 2004), developed at the University of Pennsylvania and distributed (at cost) by the Linguistic Data Consortium (LDC). The Quranic Arabic Corpus, developed recently, is starting to be used in tagging and parsing research.

2.3 Morphological Analysis for Text Corpora

Morphology is the study, identification, analysis and description of the minimal meaning-bearing units (morphemes) that constitute a word. Morphological analysis is the process of categorizing and building a representative structure of the component morphemes, where both orthographic rules and morphological rules are important for categorizing a word’s morphemes. For instance, the plural of party is parties, where orthographic rules indicate changing the -y to -i- and adding -es, and morphological rules tell us that fish has a null plural (Jurafsky and Martin 2008).

Automatic morphological analysis started in the 1950s to support machine translation systems. The Porter stemmer (Porter 1980) is an example of an early morphological analysis system which is widely used in information retrieval applications.
Automatic morphological analysis was beneficial for many early applications such as spelling correction, text input systems and text-to-speech synthesis. There was little interest in evaluating the correctness of results obtained by morphological analysers in early applications. The concern was with the soundness of the results rather than the methods (Roark and Sproat 2007).

Finite-state methodology has been dominant since the 1980s. The finite-state approach to automatic morphological analysis was originally investigated at Xerox, and the first practical application was due to Koskenniemi (1983); this has been used to develop wide-coverage morphological analysers for several languages. The two main approaches to computational morphology are: explicitly finite-state approaches, which are based on a finite-state model and morphotactics; and approaches that integrate finite-state morphology and phonology with unification of morphosyntactic features (Roark and Sproat 2007).

Morphological analyzers have been developed for a wide range of languages; the following are some examples. EMERGE¹ is a morphological analyzer for Spanish. It analyzes words and shows their canonical form, grammatical category and the inflection or derivation they come from. ExtraLink is an information extraction (IE) and automatic hyperlinking system that uses ontologies to define the relationships. Its IE system is SProUT², a generic multilingual shallow analysis platform, which can process English, German, Italian, French, Spanish, Czech, Polish, Japanese, and Chinese. It has modules for tokenization, morphological analysis, and named entity recognition. FLEMM³ is a rule-based program (lemmatizer) for French that performs flexional morphological analysis for a text tagged using the Brill Tagger or TreeTagger, and extracts the lemma of words. It uses a small lexicon of 3,000 entries to handle exceptions.
FreeLing⁴ is a library that provides language analysis services for Spanish, English, and Catalan such as tokenizing, sentence splitting, morphological analysis, NE detection, date/number/currency recognition, PoS tagging, and chart-based shallow parsing. POSTAG⁵ provides morphological analysis plus part-of-speech tagging with a morpheme dictionary for Korean. ROSANA⁶ (RObust Syntax-based ANAphor resolution) is a coreference resolution system for English text. It identifies the co-reference of anaphoric expressions such as third person pronouns, possessives, reflexives, common nouns, and names. TWOL⁷ is a set of two-level morphological analysis tools for English, German, Swedish, Finnish, Danish, and Norwegian. XeLDA⁸ is a framework that provides a general-purpose text retrieval system which includes several language processing operations such as: language identification; tokenization; morphological analysis; part-of-speech disambiguation; noun phrase extraction; contextual dictionary lookup; idiomatic expression recognition; relational morphology; and shallow parsing. It supports processing for text of several languages (Dutch, English, French, German, Italian, Portuguese, Spanish, Czech, Hungarian, Polish, Russian, Danish, Swedish, Finnish, Norwegian, and Chinese) and other languages in development (Czech, Arabic, Japanese and Korean). It also includes bilingual dictionaries of English, French and German to English, French, German, Italian and Spanish.

¹ EMERGE http://protos.dis.ulpgc.es/morfolog/morfolog.htm
² SProUT http://sprout.dfki.de/
³ FLEMM http://www.univ-nancy2.fr/pers/namer/Telecharger_Flemm.htm
⁴ FreeLing http://www.lsi.upc.edu/~nlp/freeling
⁵ POSTAG http://nlp.postech.ac.kr/DownLoad/k_api.html
⁶ ROSANA http://www.stuckardt.de/rosana.htm
⁷ TWOL http://www.lingsoft.fi/
⁸ XeLDA http://www.mkms.xerox.com/

2.3.1 Approaches to Morphological Analysis

The two-level formalism is the most widely used theoretical approach to morphological analysis.
It is based on the construction of a collection of finite-state transducers, each of which implements a particular morphological rule. The transducers attempt to map between the surface and the lexical realizations of a given morpheme. The main drawbacks of this approach are that it is language dependent and that it needs manual construction of the transducers for each language, which makes developing a morphological analyzer very costly and time consuming (Pauw and Schryver 2008).

The minimum requirements for building a morphological analyzer using the two-level formalism are as follows. First, it requires a lexicon of stems and affixes together with basic information about them. Second, it is informed by morphotactics, where the model of morpheme ordering is explained and the relations between morpheme classes inside a word are determined. Third, orthographic rules that govern the spelling of the word are used to model the changes that occur in a word (Jurafsky and Martin 2008).

Corpus-based approaches to morphological analysis use morphologically annotated corpora to build a morphological database rather than depending on linguistic knowledge. For example, CELEX is a lexical database for English, Dutch and German. It contains detailed information on orthography and phonology such as phonetic transcription of variant pronunciations, syllable structure and primary stress. CELEX morphology includes derivational and compositional structure and inflexional paradigms. Syntactic information includes word class, word class-specific subcategorizations and agreement structure. It also contains information about word frequency such as word and lemma counts based on representative text corpora (Baayen, Piepenbrock and Rijn 1995). Corpus-based approaches can be used to provide a morphological database that is then used by statistical processing and machine-learning techniques for morphological analysis.
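The three requirements of the two-level approach listed earlier in this section (a lexicon, morphotactics and orthographic rules) can be illustrated with a toy sketch. The Python code below is not a finite-state transducer and is not taken from any of the systems surveyed here; the lexicon entries, the `+PL` feature label and the function names are all hypothetical, chosen only to show how the three knowledge sources interact for English noun plurals such as party/parties and fish/fish.

```python
# Toy illustration only (hypothetical names and data, not a real two-level system).

# 1. Lexicon: stems with basic information about them.
LEXICON = {
    "party": {"pos": "noun"},
    "cat": {"pos": "noun"},
    "fox": {"pos": "noun"},
    "fish": {"pos": "noun", "null_plural": True},
}

# 2. Morphotactics: a noun is a stem optionally followed by one plural morpheme.
PLURAL = "+PL"


def undo_spelling(surface):
    """3. Orthographic rules run in reverse: yield (stem, features) candidates."""
    yield surface, []                      # bare stem, no suffix
    if surface.endswith("ies"):            # party + s -> parties (y -> i, add -es)
        yield surface[:-3] + "y", [PLURAL]
    if surface.endswith("es"):             # fox + s -> foxes (e-insertion)
        yield surface[:-2], [PLURAL]
    if surface.endswith("s"):              # cat + s -> cats
        yield surface[:-1], [PLURAL]


def analyse(surface):
    """Return every analysis of a surface form that the lexicon licenses."""
    analyses = []
    for stem, feats in undo_spelling(surface):
        entry = LEXICON.get(stem)
        if entry is None:
            continue                       # candidate stem is not in the lexicon
        analyses.append((stem, entry["pos"], feats))
        # Morphological (not orthographic) knowledge: some nouns have a null plural.
        if not feats and entry.get("null_plural"):
            analyses.append((stem, entry["pos"], [PLURAL]))
    return analyses


print(analyse("parties"))  # [('party', 'noun', ['+PL'])]
print(analyse("fish"))     # [('fish', 'noun', []), ('fish', 'noun', ['+PL'])]
```

The sketch over-accepts some ill-formed surface strings (for example, it would analyse "cates" as a plural of "cat"), which is one reason practical systems encode the spelling rules as carefully constrained transducers rather than as simple suffix tests.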
Statistical processing and machine-learning techniques are language independent, so in principle they can be ported to new domains and languages. Moreover, data-driven approaches to morphological analysis can outperform manually constructed rule-based analyzers (Pauw and Schryver 2008).

Recently, unsupervised approaches to morphological analysis have been explored, based on using minimum-distance edit metrics and pattern-matching techniques to automatically guess the morphological properties of a language on the basis of raw, unannotated text (Pauw and Schryver 2008). The unsupervised morpheme analysis contest MorphoChallenge is a challenge to design a statistical machine-learning algorithm for morphological analysis. The challenge has been run 5 times since 2005. The next section gives more detail about MorphoChallenge 2009 in particular.

2.3.2 MorphoChallenge Competition

The MorphoChallenge task is to develop an unsupervised learning algorithm which can return the morpheme analyses of each word given lists of words of several languages; for MorphoChallenge 2009 these were Arabic, English, Finnish, German and Turkish. The preferred algorithm needs to be as language independent as possible. All words in the training corpus occur in sentences, so the algorithm might utilize information about word context (Kurimo, Virpioja and Turunen 2009).

The training corpora were 3 million sentences for English, Finnish and German, and 1 million sentences for Turkish, in plain unannotated text files. The training corpus for Arabic was the Qur’an, which is a small corpus consisting of only 78K words. The text of the Qur’an corpus is available in both vowelized and non-vowelized formats. For Arabic, the participants could test their algorithms using the vowelized words, the non-vowelized words, or both. The algorithms were separately evaluated against the vowelized and the non-vowelized gold standard analyses.
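The evaluation results for this competition are reported in Table 2.1 as precision, recall and F-measure. As a minimal sketch, assuming the F-measure is the balanced harmonic mean of precision and recall (the thesis gives its own definitions of the accuracy measures in section 9.3.1; the function name below is hypothetical), the relation between the three figures can be checked as follows:

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) of the "letters" reference in Table 2.1.
print(round(f_measure(70.48, 53.51), 2))  # 60.83
```

Under this assumption the computed value agrees with the F-measure reported for the "letters" reference; for other rows, small differences in the last digit can arise because the published precision and recall are themselves rounded.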
For all Arabic data, the Arabic script was provided as well as the Roman script (Buckwalter transliteration); see figure 9.1. However, only the morpheme analysis submitted in Roman script was evaluated (Kurimo et al. 2009).

In Competition 1 the proposed unsupervised morpheme analyses were compared to the correct grammatical morpheme analyses, called here the linguistic gold standard. The gold standard morpheme analyses were prepared in exactly the same format as the result file the participants were asked to submit: alternative analyses separated by commas. For Arabic the gold standard had in each line: the word, the root, the pattern and then the morphological and part-of-speech analysis (Kurimo et al. 2009). Section 9.3 discusses the MorphoChallenge competition as a standard for evaluating morphological analyzers.

Twelve algorithms were evaluated against the Arabic Qur’an gold standard. The evaluation results for Arabic turned out to be quite surprising, because most algorithms gave rather low recall and F-measure, and the simple “letters” reference outperformed all other participating algorithms; see section 9.3.1 for the definitions of the accuracy measures. The “Promodes” and “Ungrade” methods scored clearly better than the rest of the participants in Arabic. Table 2.1 shows the evaluation results for the twelve algorithms compared to the non-vowelized gold standard, as reported by Kurimo et al. (2009).

Table 2.1 The submitted unsupervised morpheme analysis compared to the Gold Standard in non-vowelized Arabic (Competition 1).

AUTHOR(S)             METHOD                    PRECISION  RECALL   F-MEASURE
-                     letters                   70.48%     53.51%   60.83%
Spiegler et al.       PROMODES 2                76.96%     37.02%   50.00%
Spiegler et al.       PROMODES committee        77.06%     36.96%   49.96%
Spiegler et al.       PROMODES                  81.10%     20.57%   32.82%
Golénia et al.        UNGRADE                   83.48%     15.95%   26.78%
Virpioja & Kohonen    Allomorfessor             91.62%     6.59%    12.30%
-                     Morfessor Baseline        91.77%     6.44%    12.03%
Bernhard              MorphoNet                 90.49%     4.95%    9.39%
Monson et al.         ParaMor-Morfessor Union   93.72%     4.81%    9.14%
Monson et al.         ParaMor-Morfessor Mimic   93.76%     4.55%    8.67%
Lavallée & Langlais   RALI-ANA                  92.40%     4.40%    8.41%
Tchoukalov et al.     MetaMorph                 95.05%     2.72%    5.29%
Monson et al.         ParaMor Mimic             91.29%     2.56%    4.97%
Lavallée & Langlais   RALI-COF                  94.56%     2.13%    4.18%

2.3.3 Applications of Morphological Analysis

Morphological analysis has many applications throughout speech and language processing. Morphological analysis techniques form the basis of most natural language processing systems (Kiraz 2001; Al-Sughaiyer and Al-Kharashi 2004; Jurafsky and Martin 2008; Pauw and Schryver 2008). Such applications include:

• Searching the Web: In web searching for morphologically complex languages, morphological analysis enables searching for the inflected forms of a word even if the search query contains only the base form.
• Part-of-speech taggers: Morphological analysis gives the most important information for a part-of-speech tagger to select the most suitable analysis for a given context.
• Dictionaries and Spell-checkers: Dictionary construction and spell-checking applications rely on a robust morphological analysis.
• Machine translators: Machine translation systems rely on highly accurate morphological analysis to specify the correct translation of an input sentence (Jurafsky and Martin 2008).
• Lemmatizers: lemmatization is part of morphological analysis. Google’s search facilities use lemmatization to produce hits of all inflectional forms of the input word. Statistical models of language in machine translation and speech recognition also use lemmatization. Lexicographic applications use lemmatizers as an essential tool for corpus-based compilation (Pauw and Schryver 2008).
• Other applications: morphological analysis is useful for many applications, such as information retrieval, text categorization, dictionary automation, text compression, data encryption, vowelization and spelling aids, automatic translation, and computer-aided instruction (Al-Sughaiyer and Al-Kharashi 2004).

2.3.4 Morphological Analysis for Arabic Text

Morphological analysis is the process of assigning the morphological features of a word, such as its root or stem, its morphological pattern, and its morphological attributes (the part-of-speech of the word, whether it is noun, verb or particle). It also involves specifying the number of the word (singular, dual or plural), and the case or mood (nominative, accusative, genitive or jussive). Moreover, it identifies the internal structure of the word such as prefixes, suffixes, clitics and the root or stem (Thabet 2004); see section 1.2 for a general definition of morphology and morphological analysis.

Hamada (2009; 2010) defined morphological analysis of Arabic text as a series of processes. Morphological analysis for Arabic text includes extracting the root of the analyzed word, deriving all possible derivatives of a given root, analyzing the words into their morphemes, distinguishing the stem of the word by separating its prefixes and suffixes, and stripping the conjugated or inflectional affixes of the word.

Habash (2010) distinguished between two types of approaches to morphology: form-based morphology and functional morphology. The morpheme, as the smallest meaningful unit in a language, is the central concept in form-based morphology. However, the central concept of functional morphology is the study of words and morphemes in terms of their morpho-syntactic and morpho-semantic behaviour in context. Habash (2010) defined morphological analysis as the process of determining all possible morphological analyses of the orthographic word.
This process includes identifying the main part-of-speech of the analyzed word. The morphological analysis is either form-based, where the word’s morphemes are identified, or based on functional morphology, where the functions (grammatical features) of each morpheme are determined.

The previous definitions of morphological analysis for Arabic text agree with the general definition of computational morphology in section 1.2. A pragmatic definition of morphological analysis for Arabic is: computer applications that analyze Arabic words of a given text and deal with their internal structure. This involves a series of processes that identify all possible analyses of the orthographic word. These processes are both form-based and function-based. Orthographic words can be fully-vowelized, partially-vowelized or non-vowelized. They can also be Classical Arabic or Modern Standard Arabic. Form-based analysis deals with the orthographic word to identify its morphemes. These processes include tokenization, spell-checking, stemming and lemmatization, pattern matching and diacritization. Function-based processes deal with identifying the morphosyntactic features and functions of the word. These processes include predicting the morphological features of the word’s morphemes, part-of-speech tagging and parsing.

The following subsections survey Arabic morphological analysis. The first subsection explores the challenges for Arabic morphological analysers. The second subsection defines basic related concepts which are used throughout this thesis. The third and fourth subsections discuss morphological analysis of Classical and Modern Standard Arabic respectively. The fifth subsection surveys the approaches for morphological analysis development. The sixth subsection discusses the requirements of developing Arabic morphological analysers. The seventh subsection surveys existing morphological analysis systems for MSA text.
The last subsection gives an example of a community-based approach for evaluating Arabic morphological analysers, the ALECSO/KACST initiative for developing and evaluating morphological analysers for Arabic text; see also section 8.2.

2.3.4.1 Challenges of Arabic Morphology

Arabic is a morphologically complex and highly inflectional language. Its root-pattern nonconcatenative (i.e. nonlinear) morphology makes both theoretical and computational processing tasks for Arabic text extremely hard. Morphological analysis of Arabic text affects higher level applications such as part-of-speech tagging and parsing. It affects both syntactic and phonological levels of analysis (Beesley 1996; Al-Sughaiyer and Al-Kharashi 2004; Smrz 2007; Soudi et al. 2007; Attia 2008; Habash 2010). Chapter 8 discusses practical solutions for these challenges as implemented in the SALMA – Tagger. Here is a list of major challenges that face Arabic morphological analysis:

1- The orthography of Arabic: the orthography of Arabic is based on standard Arabic script. The Arabic alphabet consists of: 25 consonants; 6 vowels divided into three long vowels (ا و ي) (ā, w, y) and three short vowels written as diacritics ( ◌َ ◌ُ ◌ِ ) (a, u, i); and a glottal stop hamzah. In addition, the writing system for Arabic contains other shapes of letters such as ’alif maqṣūrah (ى). Arabic letters change their shape according to their position in the word, as Arabic script requires connection of the word’s letters. Other orthographic issues in Arabic are the use of diacritics above or below letters. These diacritics include sukūn ( ◌ْ ) to mark silent letters (i.e. absence of short vowel); gemination or incorporation⁹ šaddah ( ◌ّ ) to indicate a doubled letter; and tanwīn ( ◌ً ◌ٌ ◌ٍ ), the syntactic case mark of indefinite singular nouns. hamzah has 5 shapes (ء أ إ ؤ ئ). tā’ marbūṭah (ة) shares phonetic properties of the two consonants tā’ (ت) and hā’ (ه) and is used to mark feminine singular nouns.
maddah (آ) or extension is a compound letter of hamzah and ’alif.

⁹ Gemination or incorporation is used in the thesis to indicate a doubled letter, which is usually marked by šaddah ( ◌ّ ) in vowelized text. šaddah does not appear in non-vowelized text. Therefore, the absence of šaddah represents a challenge to morphological analyzers for Arabic text.

2- Nonconcatenative nature: the rich “root-and-pattern” nonconcatenative (or nonlinear) morphology results in a highly complex word formation process of roots and patterns. Hundreds of words can be derived from a single root by following certain patterns. These patterns are abstract templates where root radicals (i.e. mostly triliteral roots) and vocalism (i.e. short vowels) are inserted in certain positions within the pattern. The pattern also has prefixed letters appearing before the position of the first root radical; suffixed letters appearing after the position of the last root radical; and infixed letters appearing between the root radicals. Patterns transmit morphological and semantic features to the derived words. During the derivation process changes might occur to the original root letters such as assimilation, elision and gemination. Broken plurals exemplify the nonconcatenative nature of Arabic (Clark 2007). For example, the plural form of the word قَلْب qalb ‘heart’ is قُلُوب qulūb ‘hearts’, and this is formed by adding the letter و wāw as an infix between the second and the third radicals. And the plural form of the word مِصْبَاح miṣbāḥ ‘light’ is مَصَابِيح maṣābīḥ, which is formed using the special pattern of broken plural مَفَاعِيل mafā‘īl that re-arranges the root radicals and the infixes. This “root and pattern” morphology also brings problems for western linguistic terminology. A “morpheme” in Western traditions is an indivisible “atomic” lexical unit, and the “stem” is the core morpheme of a word. In Arabic, the “stem” combines root and pattern.
In this thesis, we refer to the stem as a morpheme, but purists may argue a stem is really 2 morphemes: root and pattern.

3- Arabic clitics: clitics and affixes of Arabic words are productive. Clitics are conjunctions, prepositions, particles, and genitive suffix-pronouns that are attached to the beginnings and ends of words. According to our classification into clitics or affixes as explained later in sections 8.3.1.4 and 8.3.1.5, the definite article is classified as a proclitic rather than a prefix because the definite article is not part of the pattern even though it cannot appear as a stand-alone word. Therefore, storing word forms in a dictionary and doing morphological analysis by dictionary lookup is not possible, as we cannot list all morphological variants of every Arabic word. Thus, morphological analysis done dynamically is unavoidable. A word such as بِوَالِدَيْهِ bi-wālidayhi ‘in his parents’ consists of four morphemes: بِ bi ‘in’ is a preposition, وَالِدَ wālida ‘parent’ is the noun stem, يْ y ‘two’ is a dual letter, and هِ hi ‘his’ is an object relative pronoun. The proclitic بِ bi ‘in’ and the enclitic هِ hi ‘his’ are productive clitics.

4- High degree of ambiguity: Arabic also has a high degree of ambiguity for many reasons such as:

a. Assimilation or elision of vowels: the presence of long vowels in some root radicals causes these weak radicals to be deleted or changed during the derivation process. For example, the weak radical و wāw of the root قول q-w-l is changed into another vowel or is deleted according to the vocalic environment. It is changed into ا ’alif in the past verb قَالَ qāl ‘he said’; into ي yā’ in the passive past verb قِيلَ qīla ‘it is said’; and deleted in the first person past verb قُلْتُ qultu ‘I said’.

b. Interaction between affix or clitic letters and the root radicals: word affixes and clitics can be homographic with the underlying letters of the word, which means the morphological analyzer must deal with words whose clitics and affixes interact with the underlying letters by producing all possible analyses of these words. For example, the word بِطَاقَات biṭāqāt can have two possible analyses. One way is to treat the first letter of the word as a prepositional proclitic بِ bi ‘with’, where the root is طوق ṭ-w-q and it means ‘with abilities’. The other way is to treat the first letter as an underlying letter, where the root is بطق b-ṭ-q and it means ‘cards’, where it has no clitic or prefix. Section 8.2.3.2 gives more examples.

c. Tokenization¹⁰ (i.e. segmentation) of words into their morphemes: word tokens out of context can be segmented into different sequences of morpheme tokens. Therefore, morphological analyzers need to investigate all possible variants correctly for words out of context. Morphemes such as ت tā’ can be attached to verbs to indicate second person masculine subject or second person feminine subject. For example, the ت tā’ morpheme of the word فرمت frmt can be analyzed as: فَرَمْتَ faramta ‘you (2MS) chopped’; or فَرَمْتِ faramti ‘you (2FS) chopped’. The same form can involve one morpheme, فَرْمَتَ farmata ‘he formatted’, which represents a foreign word; or three morphemes, فَرُمْتَ (ف + رم + ت) farumta ‘you (2MS) desired’, which has the root روم r-w-m; or فَرَمَتْ (ف + رم + ت) faramat ‘she (3FS) threw’ from the root رمي r-m-y.

d. Extracting the root letters of the word: root letters can be hard to extract or predict, and increase the text ambiguity, if one or two of the root letters are long vowels or belong to the affix and clitic letters. For example, the form يسر ysr involves two roots: يسر y-s-r, where the word يَسِر yasir means ‘ease or prosperity’; and سرر s-r-r, where the word يَسِرُّ yasirru means ‘he tells a secret’.
Moreover, assimilation or elision occurring on root radicals or affix letters increases the complexity of root extraction algorithms, especially those that assume that letters which are not shared with clitic and affix letters are original root radicals. For example, the letter ط ṭah of the word اِصْطَدَمَ ’iṣṭadama ‘impact’, which has the root صدم ṣ-d-m, will be treated as a root radical, whereas it has changed from the underlying letter ت tā’ of the pattern اِفْتَعَلَ ’ifta‘ala.

¹⁰ Tokenization refers to both word tokenization and morpheme tokenization throughout the thesis.

e. The omission of short vowels, especially in MSA text, affects the functional behaviour and the part-of-speech classification of words. For example, ورد wrd can be وَرْدٌ wardun ‘roses’ representing a noun or وَرَدَ warada ‘to come’ representing a verb; for رب rb, رَبٌّ rabbun ‘God’ is a noun, while رُبَّ rubba ‘many’ is a particle. A non-vowelized word can be noun, verb and particle. Thus for بل bl, بَلٌّ ballun ‘moistening’ is a noun; بَلَّ balla ‘to moisten, wet, make wet’ is a verb; and بَلْ bal ‘nay, rather …, (and) even, but, however, yet’ is a particle.

5- Phonology, morphology and syntax: morphology interacts with phonology and syntax. Phonology deals with phonemes, which are sound units smaller than morphemes, and syntax deals with rules of composing sentences by combining words. Phonological processes cannot be separated from morphology. Therefore, morphological analyzers need to deal with the different kinds of phonological processes such as assimilation, syncope or deletion, epenthesis or insertion, and gemination or doubling. Syllabification is a well-studied phonological phenomenon in English dictionaries, but it is not established in Arabic dictionaries. On the other hand, syntax interacts significantly with morphology such that many words require contextual knowledge to solve their morphological ambiguities.
In conclusion, morphological analysis modules must account for phonology and syntax, which increases the complexity of developing morphological analysis systems for Arabic text (Kiraz 2001).

6- Punctuation: punctuation has been introduced recently into the Arabic writing system. MSA text is characterized by inconsistency and irregularity in the use of punctuation marks. In addition to the late introduction of punctuation to MSA text, the absence of a comprehensive treatment of punctuation in Arabic grammar books increases the problem of inconsistency in the use of punctuation in MSA text. Moreover, the use of punctuation in Arabic text is prescriptive rather than based on a linguistic description of actual usage in authentic written samples (Khafaji 2001; Attia 2008). Punctuation plays a significant part in phrase break prediction for English, and serves as an input to the classifier along with POS tags in both rule-based (Liberman and Church 1992) and probabilistic (Taylor and Black 1998; Ingulfsen et al. 2005) approaches.

2.3.4.2 Basic Concepts of Arabic Morphological Analysis

This section defines the basic concepts related to Arabic morphological analysis. These terms will be used in this thesis according to these definitions. Some of them are drawn from Wikipedia: although Wikipedia is not an authoritative academic source, it is a widely-used explanatory source.

• Tokenization or segmentation: is the process of defining the word’s morphemes. These morphemes can be classified into 5 types: proclitics, prefixes, stem, suffixes and enclitics. A word must have at least one stem morpheme. Combinations of clitics and affixes can be attached to the word. A morphological analyzer is responsible for defining all possible variations of segmenting a word into its morphemes.
• Stemming: is the process of assigning morphological variants of words to equivalence classes, such that each class corresponds to a single stem.
It is also defined as reducing inflected words to their stem, base, or root form¹¹. For example, words such as writing, write, writer and written are reduced to the root write. For distinguishing between stem and root in Arabic, see item 2 in section 2.3.4.1.
• Lemmatization: is the process of grouping a set of words into the canonical form, dictionary form, or citation form, which is also called the lemma. E.g., in English, run, runs, ran and running are forms of the same lexeme, with run as the lemma¹². The lemma is usually also the stem.
• Root: is the smallest lexical unit. An Arabic root usually consists of three letters (i.e. radicals) which carry the aspects of semantic content¹³. Both root and pattern are used to derive Arabic words. In the derivation process the root radicals are inserted into their positions in the pattern. These positions are not necessarily consecutive.
• Morpheme: is the minimal meaning-bearing unit that constitutes a word. The principal difference between morpheme and word is that morphemes may or may not be standalone units, while a word is a meaningful freestanding unit¹⁴.
• Patterns: are the templates of combinations of consonants and vowels. The consonants represent slots for the root radicals to be inserted and the vowels represent the vocalism. The pattern is represented by sequences of Cs representing the consonants and Vs representing vocalism. The CV approach for representing patterns is widely used across languages (McCarthy and Prince 1990b; McCarthy and Prince 1990a; Smrz 2007; Attia 2008; Habash 2010).
The original representation of patterns was proposed by Arabic grammar scholars as الميزان الصرفي al-mῑzān aṣ-ṣarfῑ ‘the morphological scale’, which uses the past verb فَعَلَ ‘did’ to represent the root radicals (Ali 1987; al-Saydawi 2006).

11 Wikipedia explanation, http://en.wikipedia.org/wiki/Stemming
12 Wikipedia explanation of Lemma, http://en.wikipedia.org/wiki/Lemma_(linguistics)
13 Wikipedia explanation of Root, http://en.wikipedia.org/wiki/Root_(linguistics)
14 Wikipedia explanation of Morpheme, http://en.wikipedia.org/wiki/Morpheme

• Pattern matching: is the process of matching words with their possible patterns, either morphosyntactic patterns or morphophonemic patterns. The pattern matching algorithm must deal with three types of changes: incorporation or assimilation, substitution and deletion of vowel letters.
• Function words: are words with little semantic content. They serve as important elements in the structure of sentences. They define grammatical relationships with other words within the sentence. They also signal the structural relationships that words have with one another. Function words are pronouns, prepositions, determiners, conjunctions, auxiliary and modal verbs (Baker, Hardie and McEnery 2006). In some languages, some function words are not free-standing, but clitics attached to content words.
• Diacritization or vowelization: is the process of adding the correct short vowels and diacritics to words. Vowelization is an important characteristic of the Arabic word. Vowelization helps in determining some morphological features of words. The presence of the short vowel on the last letter helps in determining the case or mood of the word, and the presence of a vowel on the first letter determines whether the verb is active or passive. The presence of other diacritics such as šaddah and maddah (extension) resolves some ambiguities of words.
• Part-of-speech tagging: is the process of assigning part-of-speech grammatical category labels to the words of a corpus. Tagging is done automatically using part-of-speech tagger programs, followed by manual proofreading to correct errors.
• Parsing: is the process of analysing the grammatical structure of a sequence of words or tokens. Parsing is accomplished automatically by syntactic parser programs, which output the syntax trees of the analysed text.

2.3.4.3 Morphological Analysis of Classical Quranic Arabic Text

The Quranic Arabic Corpus is a newly available resource enriched with multiple layers of annotation including morphological segmentation and part-of-speech tagging. The motivation behind this work is to produce a resource that enables further syntactic and semantic analysis of the Qur’an, a genre difficult to compare with other forms of Arabic, since its vocabulary and spelling differ from Modern Standard Arabic (Dukes and Habash 2010). The Quranic Arabic Corpus uses the old Arabic script called the Othmani script; this is the same script used in writing the first copies of the Qur’an about 1,400 years ago. In addition, dots, short vowels and diacritics were added to the same word skeletons of the first written Qur’an.

Buckwalter’s Arabic Morphological Analyzer (BAMA) was used to generate the initial tagging. The analyzer was adapted to work with Quranic Arabic text. The annotated corpus was then put online to allow for collaborative proofreading and correction of the annotation (Dukes and Habash 2010). Mapping was required to convert from the Modern Standard Arabic BAMA tag set to the classical grammar model used in the Quranic Arabic Corpus tag set. Manual disambiguation was required for some cases, such as particles, where one-to-one mapping was not applicable.

In order to adapt BAMA to process the Quranic Arabic Corpus text, three main modifications were made. First, spelling of the Qur’an differs from MSA.
The differences involve orthographic variations of hamzah, ’alif and the long vowel ā. Second, the multiple diacritized analyses produced by BAMA for each processed word were ranked in terms of their edit-distance from the Qur’anic diacritization, with closer matches ranked higher. Finally, filtering is done by choosing the part-of-speech of the highest-ranked analysis as the solution (Dukes and Habash 2010).

Manual annotation involves adding some parts of the morphological analysis, such as missing verb voice (active/passive), the energetic mood for verbs, the interrogative alif prefix, identifying particles, verb forms, and disambiguating the lām prefix (Dukes and Habash 2010). Figure 2.1 shows a sample of the morphological and part-of-speech tags of the Quranic Arabic Corpus taken from chapter 29.

Index      Word         QAC morphological tag
29|1|1     الٓمٓ          POS:INL
29|2|1     أَحَسِبَ        A:INTG+ POS:V PERF ROOT:Hsb 3MS
29|2|2     ٱلنَّاسُ        Al+ POS:N LEM:an
29|2|3     …            …
29|2|4     يُتْرَكُوٓا۟      POS:V IMPF PASS ROOT:trk 3MP MOOD:SUBJ
29|2|5     أَن           POS:SUB LEM:>an
29|2|6     يَقُولُوٓا۟      POS:V IMPF ROOT:qwl 3MP MOOD:SUBJ
29|2|7     ءَامَنَّا        POS:V PERF (IV) ROOT:Amn 1MP
29|2|8     وَهُمْ         wa+ POS:PRON 3MP
29|2|9     لَا           POS:NEG LEM:laA
29|2|10    يُفْتَنُونَ      POS:V IMPF PASS ROOT:ftn 3MP

Figure 2.1 Sample of the morphological and part-of-speech tags of the Quranic Arabic Corpus taken from chapter 29

The automatic algorithm produced an analysis for 67,516 out of 77,430 words, followed by manual annotation done by native Arabic speakers. In the first stage the annotators corrected 21,550 words (28%), including 9,914 words missed by the analyzer and 11,636 corrections to existing analyses. In the second stage, another annotator made changes to 1,014 words (1.38% of all words). In the final stage, the corpus was put online for community volunteer correction, resulting in over 2,000 (2.6%) approved corrections to words (Dukes and Habash 2010).
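The edit-distance ranking of candidate analyses described in this subsection can be illustrated with a minimal sketch. It is not the code used by Dukes and Habash: the function names, the candidate list and the Buckwalter-style transliterated forms below are hypothetical, and plain Levenshtein distance is assumed as the edit-distance measure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_analyses(quranic_form, analyses):
    """Sort (diacritized_form, pos) candidates, closest to the Quranic form first."""
    return sorted(analyses, key=lambda a: edit_distance(quranic_form, a[0]))


# Hypothetical candidate analyses for one undiacritized word (illustration only).
candidates = [("kutubN", "N"), ("kataba", "V"), ("kutiba", "V")]
ranked = rank_analyses("kataba", candidates)

# Filtering step: keep the part-of-speech of the highest-ranked analysis.
best_pos = ranked[0][1]
```

In this sketch ties are broken by the original candidate order, because Python's `sorted` is stable; how ties were resolved in the actual corpus construction is not stated in the source.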
The Quranic Arabic Corpus tag set adapts traditional Arabic grammar, leading to morphological annotation that uses familiar terminology. This terminology enables people with experience of Quranic syntax to participate in the online annotation, and allows the annotation to be verified against existing recognized standard textbooks on Quranic grammar (Dukes and Habash 2010).

2.3.4.4 Four Approaches to Morphological Analysis for MSA Arabic Text

Generally, there are four main methodologies for developing robust morphological analysers. Arabic morphological analysis techniques include two-level and finite-state morphology (Al-Sughaiyer and Al-Kharashi 2004). The four main methodologies used for Arabic morphological analysis are:

• Syllable-Based Morphology (SBM), which depends on analysing the syllables of the word.
• Root-Pattern Methodology, which depends on the root and the pattern of the word for analysis. Using this method, the root of the word is extracted by matching the word with lists of patterns and affixes.
• Lexeme-based Morphology, where the stem of the word is the crucial information that needs to be extracted from the word.
• Stem-based Arabic lexicon with grammar and lexis specifications, which holds that a stem-grounded lexical database, with entries associated with grammar and lexis specifications, is the most appropriate organization for the storage of Arabic lexical information.

All these methodologies (Al-Sughaiyer and Al-Kharashi 2004; Soudi et al. 2007) use pre-stored lists of roots, stems, patterns and affixes, together with grammar and linguistic information encoded within the analysers. A fifth methodology is using tagged corpora and computer algorithms to extract a morphological database of the tagged words. Machine learning algorithms do not readily apply, given the absence of morphologically tagged corpora and the absence of tractable learning algorithms.
Moreover, other challenges that face the application of machine learning algorithms to Arabic morphological analysis are: the encoding differences between Arabic text samples coded in Unicode and systems which only accept text coded in ASCII; the nature of Arabic as a highly inflected language; its variable word order (e.g. VSO), which for morphologically rich languages could lead to greater contextual ambiguity and would therefore require a higher-order model than languages like English, as well as a larger training corpus (Sánchez León and Nieto Serrano 1997; Hardie 2004); and the large size of the tag set used.

2.3.4.5 Requirements for Developing Morphological Analysers for Arabic Text

A robust and well-designed morphological analyzer for Arabic text has to meet the following conditions. First, it can correctly divide the analysed word into morphemes such as proclitics, prefixes, stem or root, suffixes and enclitics and specify the morphological features for each morpheme. Second, it can generate the correct pattern of the word and specify whether the generated pattern is a noun pattern, verb pattern or both. Third, it can extract the correct root of the word, whether it is a tri-literal root or quadriliteral root. Fourth, it can deal with unambiguous words (inert or stop words), irregular words, rare words and borrowed words. Fifth, it can specify the rules of transitive and intransitive verbs. Sixth, it can specify the derivation rules of past verbs, progress verbs and imperative verbs. Finally, it can deal with the orthographic aspects of the words such as vowelizing, incorporation, substitution and the writing of hamzah, which helps in correcting spelling mistakes (Al-Bawaab 2009; Hamada 2009a). Section 8.2 discusses the requirements and specifications for developing an Arabic morphological analyser.

2.3.4.6 Morphological Analysers for Modern Standard Arabic Text

In this section, we will survey existing morphological analysers of Arabic text.
Each morphological analyzer is studied in terms of the approach used to build it, the definition of a word’s morphemes, the database used to support morphological analysis, the morphological features that the analyzer can determine and the tag set used to encode these features.

1- Xerox Arabic Finite-State Morphological Analysis and Generation System (1998)

The Xerox system deals with Modern Standard Arabic text. It accepts input text which is fully-vowelized, partially-vowelized or non-vowelized, and outputs the root, pattern, and affixes of the analysed word with feature tags such as: part-of-speech, person, number, mood, voice and aspect. The Xerox system aims to solve three challenges of Arabic: morphotactics, short vowels and Arabic lexicon lookup. The Xerox system is based on a lexicon of root-pattern representation of 5000 roots and 400 phonologically distinct patterns. It is based on the large two-level morphological analyzer for Arabic ALPNET. Xerox finite-state calculus was used to insert roots into their patterns, which generated 85,000 valid stems. The lexicon transducer also contains suitable prefixes and suffixes which are added to stems in the normal concatenative way. The analysis returns the upper-side string as the root base-form followed by the relevant morphosyntactic features of the analysis (Beesley 1996; Beesley 1998). The advantages of the Xerox system are its large coverage; the reconstruction of short vowels; and the English glossary provided for each word. However, it has disadvantages such as lack of specification for multiword expressions (MWEs) and improper spelling relaxation rules.
The major disadvantages of Xerox are: overgeneration in word derivation due to uneven distribution of patterns for roots; the coarse-grained classification of words, which is limited to 4 part-of-speech tags (verbs, nouns including adjectives and adverbs, particles and function words); and the high level of ambiguity, where it produces many analyses for most words (Attia 2008).

2- ElixirFM Functional Arabic Morphology (2007)

ElixirFM is an implementation of a novel computational model of the morphological processes in Modern Written Arabic. It is still in active development and related to the Prague Arabic Dependency Treebank (PADT) project (Hajič et al. 2004; Smrž et al. 2008). The system includes two essential components, namely a multipurpose programming library promoting clear style and abstraction in the model, and a linguistically refined, yet intuitive and efficient, morphological lexicon. ElixirFM provides the user with four different modes of operation:

• Resolve provides tokenization and morphological analysis of the inserted text, even if one omits some symbols or does not spell everything correctly (Smrz 2007; Smrž 2009). The tokenization decision follows the conventions of PADT and PATB. For example the word للكتب lil-kutub ‘for the books’ has the following analyses (Habash 2010):
  o P--------- li ‘l’ ‘li’
  o N-----P2D al-kutub ‘k t b’ al >| FuCuL | << ‘i’
• Inflect transforms words into the forms required by context.
• Derive converts words into their counterparts of similar meaning but different grammatical category, specified via natural language descriptions or morphological tags. Word forms are encoded using morphophonemic patterns pertaining to morphological stem and reflect their phonological qualities.
• Lookup retrieves lexical entries by the citation form and nests of entries by the root.
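The way a root such as ‘k t b’ merges with a pattern such as FuCuL in the lil-kutub analysis above can be illustrated with a minimal sketch. It assumes that the capital letters F, C and L mark the slots of the first, second and third radicals and that every other character of the pattern is copied unchanged; the function name is hypothetical, and the code is not taken from ElixirFM, which is a separate implementation.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Insert the radicals of a triliteral root into the F, C, L slots of a pattern."""
    radicals = root.split()
    if len(radicals) != 3:
        raise ValueError("this sketch only handles triliteral roots")
    slots = dict(zip("FCL", radicals))
    # Slot letters are replaced by radicals; vowels and other symbols pass through.
    return "".join(slots.get(ch, ch) for ch in pattern)


stem = interdigitate("k t b", "FuCuL")          # stem of al-kutub 'the books'
other = interdigitate("k t b", "FaCaL")         # same root, different vocalism
```

The slots need not be adjacent in the pattern, which mirrors the observation in section 2.3.4.2 that radical positions are not necessarily consecutive. Assimilation, gemination and deletion, which real morphophonemic patterns must handle, are outside the scope of this sketch.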
The lexicon of ElixirFM is derived from the open-source Buckwalter lexicon, which contains about 40,000 entries that are grouped into about 10,000 nested entries. Word forms are encoded via carefully designed morphophonemic patterns that interlock with roots or literal word stems. ElixirFM implements the comprehensive rules that draw the information from the lexicon and generate the word forms given the appropriate morphosyntactic parameters. ElixirFM also implements derivation, in any direction, between verbs, active or passive participles, and masdars (i.e. de-verbal nouns). ElixirFM effectively exploits the inflectional invariant during the resolution of word forms from their roots. ElixirFM presents the results of tokenization and morphological analysis in the form of MorphoTrees, which introduce intuitive hierarchies over the tokens and their readings that can be further pruned and disambiguated (Smrz 2007; Smrž 2009).

The advantage of ElixirFM is its use of morphophonemic patterns, which removes the need to design special rules for the challenges of assimilation, gemination and deletion, and to list the forms for each lexical item. However, the system contains only 4,290 morphophonemic patterns, so it might suffer from coverage problems. Moreover, because it uses the open-source Buckwalter lexicon of about 40,000 entries, the system inherits that lexicon’s disadvantages, such as the lack of specification for MWEs; improper spelling relaxation rules; and the lack of grammar-lexis specifications.

3- AlKhalil Morpho Sys (2010)

Alkhalil Morpho Sys is a morphological analyzer for Standard Arabic text. Alkhalil processes non-vowelized, partially vowelized and fully-vowelized MSA text.
It is based on modeling a very large set of Arabic morphological rules, and on integrating linguistic resources that are useful to the analysis, such as (i) the root database; (ii) vowelized morphophonemic patterns associated with roots; and (iii) proclitic and enclitic lists. The outputs of analyzing Arabic words are presented in a table which shows: the fully-vowelized stem; its grammatical category and morphosyntactic features in natural language phrases; its possible roots associated with corresponding patterns; and its proclitics and enclitics (Boudlal et al. 2010).

The lists of noun patterns and verb patterns were obtained using Sarf (Arabic Morphology System) (ALECSO 2008b) and the NEMLAR corpus (Attia et al. 2005). These lists contain about 28,000 morphophonemic patterns with full vowelization. Alkhalil contains about 7000 roots obtained from Sarf, where each root is connected with the specific derivation patterns used to derive words of that root (Mazroui et al. 2009; Boudlal et al. 2011). Matching the roots with their vowelized patterns gives the analyzer control over the derivations of each root, which solves the over-generation problem. However, using morphophonemic patterns has the shortcoming of undergeneration. Moreover, Alkhalil inherited Sarf’s limitation of not covering all derivatives, such as broken plurals, and non-derived words.

Alkhalil processes words by first segmenting them into (proclitics + stem + enclitics) and matching the stem against the non-derived words list. In the second phase it treats the word as a derived word and identifies the possible roots and patterns by analyzing the clitics and matching the word with the patterns. The system classifies nouns into 5 categories: gerund, active participle, passive participle, noun of place and time, and instrumental noun. It identifies morphological features of gender, number and syntactic form. Verbs are classified into perfect, imperfect and imperative.
The morphological features of voice, syntactic form, number of root letters, conjugation, person and transitivity are identified for analyzed verbs. Particles are classified into their subcategories (Mazroui et al. 2009; Boudlal et al. 2011).

No evaluation was reported due to the unavailability of a test corpus. A basic evaluation was carried out to show the ability of the system to analyze words, by examining the outputs of Alkhalil on a sample of the Qur’an, chapter 20, which has about 1000 words. The outputs of Alkhalil showed that about 13.37% of the sample (132 words out of 987) have no analysis. Most of the non-analyzed words belong to the function word and proper noun categories.

4- MORPH2: A Morphological Analyzer for Arabic Text (2006-2010)

MORPH2 is a morphological analyzer for Arabic text and an extension of MORPH (Hadrich and Chaâben 2006). The focus of the improvement was adding a new step of vocalization and validation. MORPH2 uses a standard model of Arabic morphology. The model interprets all possible rules that govern the derivation of a word from its morpheme (root). MORPH2 takes into account the orthographic issues of Arabic words such as incorporation, substitution, vowelization and omission. The inputs are either fully vowelized words, partially vowelized words or non-vowelized words. The outputs are stored in a structured format in an XML file with an .xsl stylesheet. MORPH2 depends on a pre-stored list of patterns and generated patterns to deal with substitution and vowelization cases. The analysis of words is carried out by following 5 steps:

• Tokenization step: is based on contextual exploration of punctuation that divides the text into sentences, then detection of words within sentences.
• Morphological pre-processing step: extracts clitics of the analysed words. Then, a filter process classifies the stem of the analysed word into particle, number, date or proper noun.
• Affix analysis step: identifies the basic elements of the word, namely: root and affixes. This process is accomplished following a five-stage process of (i) prefix and suffix identification; (ii) candidate affix identification; (iii) lexical filtering; (iv) association control of root radicals and affixes; and (v) transformation recognition.
• Morphological analysis step: determines all possible morphosyntactic features, which is done in three stages: (i) identification of the part-of-speech of the word (i.e. noun, verb and particle); (ii) identification of the morphological features (i.e. gender, number, time and person); and (iii) filtering of the feature lists.
• Vocalization and validation step: depends on the previous two steps of affix and morphological analysis. The vowelization of the analysed word is done according to the morphosyntactic features and by matching the analysed word with its pattern. The validation process deals with transformation, omission and assimilation operations which occur for the analysed words.

MORPH2 contains many XML lexicons that provide necessary information for each step. Such lexicons are: the lexicon of proclitics, enclitics, and particles; the lexicon of affixes and roots; and the lexicon of derived and primitive nouns. The most important lexicon is that of triliteral and quadriliteral roots, with 5,754 entries, where patterns are connected with their corresponding roots. This combination provides 15,212 verbal stems and 28,024 nominal stems (Kammoun et al. 2010).

The evaluation of MORPH2 is done by calculating the recall and precision of analysing 23,121 word types of the test corpus, which has all possible analyses of each word without taking into account the context of the words. The reported average recall and precision are 89.77% and 82.51% respectively. The limitation of the system is its failure to detect relation nouns and non-derived (primitive) nouns (Hamado et al. 2009; Kammoun et al. 2010).
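Recall and precision over word types, as used in the MORPH2 evaluation above, can be illustrated with a minimal sketch. The source does not state how the averages were computed, so the sketch assumes a macro-average: precision and recall are computed per word type over sets of analyses and then averaged. The function names, analysis labels and sample data are hypothetical.

```python
from statistics import mean


def precision_recall(predicted: set, gold: set):
    """Precision and recall of one word type's predicted analyses against the gold set."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


def macro_average(word_types):
    """Average per-word precision and recall over (predicted, gold) pairs."""
    scores = [precision_recall(pred, gold) for pred, gold in word_types]
    return mean(s[0] for s in scores), mean(s[1] for s in scores)


# Hypothetical out-of-context analyses for two word types (illustration only).
sample = [
    ({"N:kutub", "V:kataba", "V:kutiba"}, {"N:kutub", "V:kataba"}),  # one spurious analysis
    ({"V:qAla"}, {"V:qAla", "N:qawl"}),                              # one missed analysis
]
avg_precision, avg_recall = macro_average(sample)
```

Because the gold standard lists every possible analysis of a word out of context, an analyzer that over-generates loses precision while one that under-generates loses recall, which is why both figures are reported.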
5- MIDAD Morphological Analyzer for Arabic Text (2009)

MIDAD applies linguistic knowledge of Arabic morphology to develop computer algorithms and rules that simulate human methods for deriving and analyzing words. The analyzer uses a database of Arabic roots and irregular words that need special processing. This database can be used to generate a larger database which includes most Arabic vocabulary. The use of the roots and irregular words database makes the program small, fast and robust (Sabir and Abdul-Mun’im 2009).

6- Application Oriented Arabic Morphological Analyzer (2009)

The analyzer depends on a novel algorithm that classifies the word’s letters into letters belonging to affixes and underlying letters. The algorithm applies rules governing the relations between the word’s letters, and does not depend on any pre-stored dictionaries. The analyzer uses this algorithm to extract the root or stem, the affixes and the pattern of the analysed word. The inputs are either fully vowelized words, partially vowelized words or non-vowelized words. The outputs show all possible roots, affixes and patterns of the analysed word. The authors report an accuracy rate of 97.7% and claim that the analyzer is five times faster than any existing analyser. As reported, the analyzer can be integrated into other applications and parts of the analyzer might be reused (Sonbul, Ghnaim and Dusouqi 2009).

2.3.4.7 The ALECSO/KACST Initiative of developing and evaluating Morphological Analysers of Arabic text

The Arab League Educational, Cultural and Scientific Organization (ALECSO) and King Abdul-Aziz City for Science and Technology (KACST) have promoted an initiative on morphological analysers for Arabic text. The initiative aims to encourage research in developing an open source morphological analyzer for Arabic text which has high accuracy, is easy to develop and can be integrated into higher levels of applications for processing Arabic text.
Six morphological analysers entered the ALECSO/KACST competition for evaluating morphological analysers for Arabic text. Table 2.3 lists the names, affiliations and the major contributions of the participants. According to the evaluation methodology, the organizers of the ALECSO/KACST workshop evaluated the results of the morphological analysers. The highest scores were achieved by Mazroui, Meziane et al. (2009), and Boudlal, Lakhouaja et al. (2010). The official results and scores of the ALECSO/KACST competition have not been published, for unspecified reasons. Only specifications for development and evaluation methodology were published (Al-Bawaab 2009; Hamada 2009b; Hamada 2009a; Hamada 2010). Section 9.2 discusses the initiative as guidelines for evaluating Arabic morphological analysers.

Table 2.2 ALECSO/KACST competition participants

Author(s) | Affiliation | Algorithm Name | Methodology
bin Hamdo et al | MIRACL Labs, Tunis. | MORPH | Depends on pre-stored list of patterns and generated patterns
Mazroui et al | University of Mohammed I, Morocco. | Alkhalil | Depends on databases of verbs, derived nouns and original nouns derived using Sarf (Arabic Morphology System)
Sabir and Abdul-Mun’im | MIDAD, Egypt. | MIDAD | Depends on rules that simulate the human methods of deriving and analyzing words and a database of Arabic roots and irregular words.
Sawalha and Atwell | University of Leeds, UK. | SALMA | Depends on linguistic knowledge of the language as well as corpora. Broad-coverage lexicon and comprehensive lists of roots, clitics, affixes and patterns.
Sonbul et al | Higher Institute of Applied Science and Technology (HIAST), Syria. | - | Depends on a novel algorithm that classifies the word’s letters into letters belonging to the affixes or original letters.
Smrz | Charles University in Prague, Czech Republic. | ElixirFM | An implementation of a novel computational model of the morphological processes in Modern Written Arabic.

2.4 Part-of-Speech Tagging

Part-of-speech taggers are used to enrich a corpus by adding a part-of-speech category label to each word, showing the broad grammatical class of the word, and morphological features such as tense, number, gender, etc. The list of all grammatical category labels is called the tag set. The design of the tag set is an important prerequisite to this annotation task. The task requires a tagging scheme, where each tag or label is practically defined by showing the words and contexts where each tag applies; and a tagger, a program responsible for assigning a tag to each word in the corpus by implementing the tag set and tagging scheme in a tag-assignment algorithm (Atwell 2008).

Automatic taggers have been used from the early years of Corpus Linguistics. TAGGIT in 1971 achieved an accuracy of 77% tested on the Brown corpus. In the late 1970s, CLAWS1, a data-driven statistical tagger, was built to carry out the annotation of the Lancaster/Oslo-Bergen corpus (LOB), and had an accuracy rate of 96-97%. Later tagger development included systems based on Hidden Markov Models (HMM); HMM taggers have been made for several languages. The Brill tagger (Brill 1995) is an example of a data-driven symbolic tagger. ENGCG and EngCG-2 are based on a framework known as Constraint Grammar (CG) (Voutilainen 2003).

Recently, many new systems based on a variety of Markov Model and Machine Learning (ML) techniques have appeared for many languages. Hybrid solutions have also been investigated (Voutilainen 2003). ACOPOST¹⁵, A Collection of POS Taggers, consists of four taggers of different frameworks: Maximum Entropy Tagger (MET), Trigram Tagger (T3), Error-driven Transformation-Based Tagger (TBT) and Example-based tagger (ET). The SNoW-based Part of Speech Tagger¹⁶ and LBJ Part of Speech Tagger¹⁷ make use of the Sequential Model.
NLTK¹⁸, the Natural Language Toolkit, includes Python re-implementations of several POS taggers, such as the Regexp Tagger, NGram Tagger, Brill Tagger and HMM Tagger; in addition, NLTK includes tutorials and documentation on tagging. RelEx¹⁹ provides English-language part-of-speech tagging, entity tagging, as well as other types of tags (gender, date, money, etc.). Spejd²⁰, the Shallow Parsing and Disambiguation Engine, is a tool for simultaneous rule-based morphosyntactic disambiguation and partial parsing. VISL Constraint Grammar²¹ is an example of rule-based disambiguation.

Enriching the source text samples of corpora with part-of-speech information for each word, as a first level of linguistic enrichment, results in more useful research resources. English corpora have been developed for a long time and in a variety of formats, types and genres. Several English corpora have been enriched with part-of-speech tagging, and a variety of different English corpus part-of-speech tag sets have been developed, including: the Brown corpus (BROWN), the Lancaster/Oslo-Bergen corpus (LOB), the Spoken English Corpus (SEC), the Polytechnic of Wales corpus (PoW), the University of Pennsylvania corpus (UPenn), the London-Lund Corpus (LLC), the International Corpus of English (ICE), the British National Corpus (BNC), the Spoken Corpus Recordings In British English (SCRIBE), etc. (Atwell 2008). The AMALGAM²² multi-tagged corpus amalgamates all these tagging schemes in a common collection of English texts: in the AMALGAM corpus, the different part-of-speech tag sets used in these English general-purpose corpora are applied to illustrate the range of rival English corpus tagging schemes, and the texts are also parsed according to a range of rival parsing schemes, so each sentence has more than one parse-tree, called “a forest” (Atwell et al. 2000).

Part-of-speech tag sets and taggers have also been developed for other European languages.
The EAGLES (European Advisory Group on Language Engineering Standards) project drew up standards for tag sets, morphological classes and codes for (western) European languages, including: EAGLES recommendations for the morphosyntactic annotation of corpora (Leech and Wilson 1999); a synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora: a common proposal and applications to European languages (Monachini and Calzolari 1996); and an EAGLES study of the relation between tag sets and taggers (Teufel et al. 1996).

15 ACOPOST http://acopost.sourceforge.net/
16 SNoW-based Part of Speech Tagger http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=POS
17 LBJ Part of Speech Tagger http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=FLBJPOS
18 NLTK http://www.nltk.org/
19 RelEx http://opencog.org/wiki/RelEx
20 Spejd http://nlp.ipipan.waw.pl/Spejd/
21 VISL Constraint Grammar http://beta.visl.sdu.dk/cg3.html
22 Automatic Mapping Among Lexico-Grammatical Annotation Models (AMALGAM) http://www.comp.leeds.ac.uk/amalgam/amalgam/amalghome.htm

The potential uses of a part-of-speech tagged corpus are key factors in deciding the range and number of part-of-speech tags. Many linguistic analyses use part-of-speech tagged corpora to analyze text and extract information, where part-of-speech tags play an essential role in classifying text and directing search to the actions, events, places, etc. described in the text. The most obvious applications are in lexicography and NLP/computational linguistics. Further applications include using the tags in data compression (Teahan 1998); and as a possible guide in the search for extra-terrestrial intelligence (Elliott and Atwell 2000). Other generic applications that make use of part-of-speech tag information are: searching and concordancing, grammatical error detection in Word Processing, training Neural Networks for grammatical analysis of text, or training statistical language processing models (Atwell 2008).
Part-of-speech tagging is a key technology in discovering suspicious events from text. Part-of-speech tagging is required for partial parsing, which is a first step for named entity (NE) recognition as one module of the Information Extraction (IE) pipeline. IE is the main text extraction methodology used for counter-terrorism text analysis tools (Zolfagharifard 2009), and processing Arabic is a key task in discovering these suspicious events.

2.4.1 Part-of-Speech Taggers for Arabic Text

Arabic part-of-speech tagging development started more recently. A range of different techniques have been used to solve the problem of part-of-speech tagging of Arabic. The APT tagger uses a combination of the statistical Viterbi algorithm and rule-based techniques (Khoja 2001). Brill’s “transformation-based” or “rule-based” part-of-speech tagger has been applied to Arabic (Freeman 2001). Harmain (2004) developed a web-based Arabic tagger. Diab, Hacioglu et al. (2004) used Support Vector Machines (SVM), a supervised learning algorithm, to achieve an accuracy of 95%. Habash and Rambow (2005) developed another part-of-speech tagger that uses SVM and Viterbi decoding. HMM has been widely used in part-of-speech tagging for Arabic, with reported accuracy of 97% on LDC’s Arabic Treebank of Modern Standard Arabic (Al-Shamsi and Guessoum 2006) and 70% when tested on CallHome Egyptian Colloquial Arabic (ECA) and the LDC Levantine Arabic (Duh and Kirchhoff 2005). Applications of Memory-Based learning to morphological analysis and part-of-speech tagging of written Arabic have been explored (Marsi, Bosch and Soudi 2005). Combinations of rule-based and machine learning methods have also been used for tagging Arabic words (Tlili-Guiassa 2006). A multi-agent architecture was developed to address the problem of part-of-speech tagging of Arabic text with vowel marks (Zibri, Torjmen and Ahmad 2006).
A rule-based PoS tagging system, the Arabic Morphosyntactic Tagger AMT (Alqrainy 2008), uses two different techniques: a pattern-based technique, which relies on a Pattern-Matching Algorithm (PMA), and lexical and contextual techniques. The AMT tagger makes use of the last diacritic mark of Arabic words to reduce tagging ambiguity. The reported accuracy of the AMT tagger was 91%.

Nearly all these Arabic part-of-speech taggers were developed by NLP research groups for their own internal use, and are not freely downloadable by other researchers. The taggers use different tag sets, and accuracies are reported on different test corpora. Appendix B compares these part-of-speech taggers for Arabic text in terms of methodology, corpus used, tag set, evaluation methodology, and evaluation metrics.

2.5 Chapter Summary

This chapter studied existing morphosyntactic analysis systems for text corpora in three dimensions. First, it explored Arabic text corpora as a background prerequisite for morphosyntactic analysis. Second, it studied morphological analysers for text corpora, concentrating on methodologies, challenges, examples of existing morphological analysers, and evaluation standards. Third, it surveyed part-of-speech tagging technology and existing part-of-speech taggers for Arabic text.

Arabic corpora started to appear in the late 1980s. Most of the existing Arabic corpora are of MSA written text, mainly newspaper text. Only two corpora are open-source and available to download. These are the Corpus of Contemporary Arabic (CCA) (Al-Sulaiti and Atwell 2006) and the Quranic Arabic Corpus (QAC) (Dukes et al. 2010; Dukes and Habash 2010). A new, third open-source corpus is the Corpus of Traditional Arabic Lexicons, which is discussed in Chapter 4.

Several morphological analysers for Arabic text exist. Morphological analysis is an important pre-processing step for many text analytics applications.
The aim of morphological analysis is to define the morphosyntactic information of the words of a corpus. Automatic morphological analysis started in the 1950s. Finite-state methodology has dominated since the 1980s. It was originally investigated at Xerox and it has been used to develop wide-coverage morphological analysers for several languages. The four main methodologies used for Arabic morphological analysis are: Syllable-Based Morphology (SBM); Root-Pattern Methodology; Lexeme-based Morphology; and Stem-based Arabic lexicon with grammar and lexis specifications. A fifth methodology uses tagged corpora and computer algorithms to extract a morphological database of the tagged words.

This chapter surveyed existing Arabic morphological analysers, focusing on the morphological analysers that participated in the ALECSO/KACST competition. The surveyed morphological analysers are: (i) Xerox Arabic Finite-State Morphological Analysis and Generation System (1998); (ii) ElixirFM Functional Arabic Morphology (2007); (iii) Alkhalil Morpho Sys (2010); (iv) MORPH2: A Morphological Analyzer for Arabic Text (2006-2010); (v) MIDAD Morphological Analyzer for Arabic Text (2009); and (vi) Application Oriented Arabic Morphological Analyzer (2009). Community-based approaches to developing and evaluating morphological analysers for Arabic text, namely the MorphoChallenge competition and the ALECSO/KACST initiative, were discussed. A more detailed discussion of them is presented in Chapter 8 and Chapter 9.

Morphological analysers are designed to generate all possible analyses of the analysed words out of their context. Selecting the analysis that suits the context is done by part-of-speech taggers. Part-of-speech tagging technology was surveyed in this chapter. The survey listed state-of-the-art part-of-speech taggers for English, the tagged corpora and the standards.
Then, existing part-of-speech taggers for Arabic text were briefly listed, focusing on their development approaches and their accuracy as reported by their developers.

Part II: Background Analysis and Design

Summary of Part II

Part II is an attempt to plan ahead for what is required for the full SALMA – Tagger in Chapter 8. Firstly, an analysis of the failings of morphological analyzers and stemmers is presented in Chapter 3. Secondly, the development of a broad-coverage lexical resource, the SALMA – ABCLexicon, required for the development of the morphological analyzer, is presented in Chapter 4. Finally, an analysis of existing tag sets as background to designing the SALMA – Tag Set is presented in Chapter 5. Chapters 3, 4 and 5 are a necessary prior step to developing the SALMA – Tagger.

Chapter 3 Comparative Evaluation of Arabic Morphological Analyzers and Stemmers

This chapter is based on the following sections of published papers:
Sections 2, 3, 4, 5 and 6 are based on sections 1, 2, 3 and 4 in (Sawalha and Atwell 2008)
Section 7 is based on section 3.1 in (Sawalha and Atwell 2009a)

Chapter Summary

Arabic morphological analysers and stemming algorithms have become a popular area of research. Several computational linguists have designed and developed algorithms to tackle the problem of morphology and syntax, but each researcher proposed an evaluation methodology based on different text corpora. Therefore, we cannot make comparisons between these algorithms. This chapter discusses four fair and precise evaluation experiments using a gold standard for evaluation consisting of two 1000-word text documents from the Holy Qur’an and the Corpus of Contemporary Arabic. Secondly, it discusses a combination of the results of these morphological analysers and stemming algorithms to allow “voting” on the analysis of each word. The evaluation of the algorithms shows that Arabic morphology is still a challenge.
Finally, it presents an analytical study of triliteral Arabic roots, based on the roots of the Qur’an as a corpus and the triliteral roots of a broad-coverage lexical resource built from traditional Arabic lexicons. The study shows that more than 25% of Arabic triliteral roots are hard to analyze.

3.1 Introduction

Stemming is the process of assigning morphological variants of words to equivalence classes, such that each class corresponds to a single stem. It is also defined as reducing inflected words to their stem, base, or root form²³. For example, words such as writing, write, writer and written are reduced to the root write. Stemming has been widely used in several fields of natural language processing such as data mining, information retrieval, text analytics applications (e.g. compression, spell checking, text searching, and text analysis), and multivariate analysis.

A widely used simple stemming algorithm for English is the Porter Stemmer (Porter 1980). It is available as a freely distributed implementation written in several programming languages²⁴. The stemmer is based on a series of simple cascaded rewrite rules which can be viewed as a lexicon-free finite-state transducer (FST) stemmer. However, modern stemmers need to be more complicated than the Porter Stemmer. For instance, the word Illustrator (i.e. a software package) does not share the stem illustrate with the word illustrator (i.e. one who gives or draws illustrations) (Jurafsky and Martin 2008). A stemmer also needs to distinguish whether part of a word is a suffix or merely looks like one, e.g. the –ion in lion looks like a suffix (Khoja 2003). The Natural Language Toolkit²⁵ (NLTK) provides three stemmers for English, namely: the Porter Stemmer (nltk.stem.porter.PorterStemmer), the Lancaster Stemmer (nltk.stem.lancaster.LancasterStemmer) and the Regular Expression Stemmer (nltk.stem.regexp.RegexpStemmer).
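The suffix look-alike problem noted above can be illustrated with a minimal sketch of a regular-expression stemmer in which the user supplies the affix list. It uses only Python's standard re module rather than NLTK; the function name make_suffix_stemmer, the suffix list and the min_stem length guard are assumptions made for illustration and do not reproduce any stemmer discussed in this chapter.

```python
import re

def make_suffix_stemmer(suffixes, min_stem=3):
    """Build a stemmer that strips one user-supplied suffix from the end of a word.

    min_stem is a hypothetical guard: a suffix is only removed when at least
    min_stem characters remain, which protects short words such as "lion".
    """
    ordered = sorted(suffixes, key=len, reverse=True)
    pattern = re.compile("(?:%s)$" % "|".join(re.escape(s) for s in ordered))

    def stem(word):
        match = pattern.search(word)
        if match and match.start() >= min_stem:
            return word[:match.start()]
        return word  # no suffix found, or the remainder would be too short

    return stem

# Illustrative suffix list (an assumption, not a linguistically complete set).
stem = make_suffix_stemmer(["ing", "er", "ten", "ion"])

for word in ["writing", "writer", "written", "write", "lion", "million"]:
    print(word, "->", stem(word))
```

The length guard keeps lion intact, but million is still reduced to mill, and write is not conflated with writ: a purely orthographic rule cannot tell a real suffix from a look-alike, which is why lexicon-free stemmers of this kind make systematic errors.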
The Porter and Lancaster stemmers are used as black boxes, while the Regular Expression stemmer requires the user to provide the affixes that the stemmer should deal with.

Many stemming algorithms have been developed for many languages including Arabic; see section 2.3.4. They attempt to reduce morphological variants of words which have similar semantic interpretations to their common stem. Arabic has a complex morphological structure, which makes it difficult to deal with. Arabic is considered to be a root-based language: Arabic words are morphologically derived from roots following derivational templates called patterns, where many affixes (i.e. prefixes, infixes and suffixes) and clitics (i.e. proclitics and enclitics) can be attached to form surface words. These roots are made up of three, four or five consonants (Thabet 2004).

The motivation for comparing different stemming algorithms and morphological analysers is that such systems are prerequisites for part-of-speech tagging and then parsing. Stemming is also considered an essential step in many computational linguistic applications.

23 Wikipedia definition, http://en.wikipedia.org/wiki/Stemming
24 The Porter Stemmer implementation http://tartarus.org/~martin/PorterStemmer/
25 The Natural Language Toolkit (NLTK) http://www.nltk.org

3.2 Three Stemming Algorithms

Many stemming algorithms for Arabic already exist (Al-Sughaiyer and Al-Kharashi 2002; Al-Shalabi et al. 2003; Thabet 2004; Al-Shalabi 2005; AlSerhan and Ayesh 2006; Yusof, Zainuddin and Baba 2010; Hijjawi et al. 2011), but few are open-source or readily accessible. The study is limited to three stemming algorithms for which ready access to the implementation and/or results is available, namely: Khoja’s stemmer (Khoja 2003), Buckwalter’s Morphological Analyzer (BAMA) (Buckwalter 2002) and the triliteral root extraction algorithm of Al-Shalabi et al. (Al-Shalabi et al. 2003).
These three stemmers are freely available online or through personal communication with the authors. It is worth mentioning that the selected systems differ in the implementation methodology used in their development. This means that our comparative evaluation compares three different stemming methodologies as well as three existing stemmers and morphological analyzers.

3.2.1 Shereen Khoja’s Stemmer

We obtained a Java implementation of Shereen Khoja’s stemmer²⁶. Khoja’s stemmer is the rule-based component of her Arabic part-of-speech tagger (APT). It removes the longest suffix and the longest prefix. Then, it matches the remaining word against verbal and noun patterns to extract the root. To produce the correct root, it handles language-specific exceptions to the general rules, such as: weak letters (’alif, wāw, and yā’) and hamzah, which change their form during derivation; root letters deleted during derivation; and stop words (function words), which do not have roots. The stemming algorithm restores the weak root letter to wāw as the default solution. It does not deal with the orthographic issues of writing the hamzah and it always places the hamzah on ’alif (Khoja 2003). The stemmer makes use of several linguistic data files such as a list of all diacritic characters (7), punctuation characters (38), definite articles (5), stop words (168), prefixes (11), suffixes (28), triliteral roots (3,822), quadriliteral roots (926) and triliteral root patterns (46) (Larkey and Connell 2001).

The purpose of constructing the stemmer was to identify the affixes and to find the pattern of the word, because the affixes and the pattern of the word provide linguistic information useful for guessing the tag of the word. Khoja reports an accuracy of 96% for her stemmer on newspaper text, presumably evaluated on the corpus she developed. The errors are mainly proper nouns and borrowings from foreign languages (Khoja 2003).
However, no details are given of the evaluation methodology, the text used in evaluation or the accuracy metrics. Figures 3.4 and 3.6 in section 3.5 show sample output of Khoja’s stemmer.

26 Java version of Khoja’s stemmer is available to download from http://zeus.cs.pacificu.edu/shereen/research.htm

3.2.2 Tim Buckwalter’s Morphological Analyzer

Tim Buckwalter developed a morphological analyzer for Arabic (BAMA) (Buckwalter 2002). Buckwalter compiled three Arabic-English lexicon files: the prefixes file contains 299 entries, the suffixes file contains 618 entries, and the stems file contains 82,185 entries representing 38,600 lemmas. To control prefix-stem-suffix combinations, the analyzer is provided with three morphological compatibility tables which consist of 1,648 prefix-stem combinations, 1,285 stem-suffix combinations and 598 prefix-suffix combinations. Short vowels and diacritics were included in the lexicons²⁷ (Maamouri and Bies 2004; Maamouri et al. 2004).

BAMA was used to morphologically annotate the Penn Arabic Treebank distributed by the Linguistic Data Consortium (LDC). The results of the Arabic Treebank part 1 v 2.0, part 2 v 2.0 and part 3 v 1.0 were recycled through the system to modify the system and update the lexicon. With each cycle, the accuracy of the morphological analyzer and the coverage of the lexicon were improved, from 90.63% for part 1 v 2.0 and 99.24% for part 2 v 2.0 to 99.25% for part 3 v 1.0. The most frequent accuracy problems were the absence of non-Arabic proper names (i.e. geographical and organizational names), which caused 38% of errors; false positives (i.e. foreign names recognized as valid Arabic words); missing Arabic proper names (15% of errors); incorrect vocalization (21% of errors); and cases where the analyzer failed to identify the passive voice or provide the proper verbal prefix or suffix (Maamouri and Bies 2004; Maamouri et al. 2004). Figures 3.4 and 3.6 in section 3.5 show sample output of BAMA.
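The way three lexicons and three compatibility tables constrain an analysis can be sketched as follows. This is a toy illustration under stated assumptions, not BAMA itself: the lexicon entries (given in Buckwalter-style transliteration), the category names, the compatibility pairs and the affix length limits are all hypothetical stand-ins for the much larger files described above.

```python
# Hypothetical length limits on affixes (assumptions for this sketch).
MAX_PREFIX_LEN = 4
MAX_SUFFIX_LEN = 6

# Toy lexicons: surface form -> list of morphological categories.
PREFIXES = {"": ["Pref-0"], "Al": ["NPref-Al"], "wa": ["Pref-Wa"]}
STEMS = {"kitAb": ["N"], "katab": ["PV"]}
SUFFIXES = {"": ["Suff-0"], "At": ["NSuff-At"], "at": ["PVSuff-at"]}

# Toy compatibility tables: which category pairs may co-occur.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-0", "PV"), ("NPref-Al", "N"),
               ("Pref-Wa", "N"), ("Pref-Wa", "PV")}
STEM_SUFFIX = {("N", "Suff-0"), ("N", "NSuff-At"),
               ("PV", "Suff-0"), ("PV", "PVSuff-at")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "NSuff-At"),
                 ("Pref-0", "PVSuff-at"), ("NPref-Al", "Suff-0"),
                 ("NPref-Al", "NSuff-At"), ("Pref-Wa", "Suff-0"),
                 ("Pref-Wa", "NSuff-At"), ("Pref-Wa", "PVSuff-at")}

def analyse(word):
    """Return every (prefix, stem, suffix) split licensed by the three tables."""
    analyses = []
    for i in range(0, min(MAX_PREFIX_LEN, len(word) - 1) + 1):
        for j in range(i + 1, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if len(suffix) > MAX_SUFFIX_LEN:
                continue
            for p in PREFIXES.get(prefix, []):
                for s in STEMS.get(stem, []):
                    for x in SUFFIXES.get(suffix, []):
                        # All three pairwise checks must succeed.
                        if ((p, s) in PREFIX_STEM and (s, x) in STEM_SUFFIX
                                and (p, x) in PREFIX_SUFFIX):
                            analyses.append((prefix, stem, suffix))
    return analyses

print(analyse("AlkitAb"))    # definite article + noun stem
print(analyse("wakatabat"))  # conjunction + perfect verb stem + suffix
print(analyse("Alkatab"))    # article + verb stem: rejected by PREFIX_STEM
```

Because every segmentation that passes the three checks is returned, an analyzer built this way yields one or more analyses per word and leaves the choice among them to a later disambiguation step.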
3.2.3 Triliteral Root Extraction Algorithm

Al-Shalabi, Kanaan and Al-Serhan developed a root extraction algorithm which does not use any dictionary. It assigns each letter of a word a weight, multiplied by the letter’s position. Consonants were assigned a weight of zero, and different weights were assigned to the letters of augmentation (أ hamzah, ا ’alif, ت tā’, س sīn, ل lām, م mīm, ن nūn, هـ hā’, و wāw, ي yā’), since all affixes are formed by combinations of these letters. The algorithm selects the letters with the lowest weights as root letters. The algorithm achieved an accuracy rate of about 93% when tested on a sample of Modern Standard Arabic text comprising 242 non-vowelized Arabic abstracts chosen randomly from the proceedings of the Saudi Arabian National Computer Conference (Al-Shalabi et al. 2003). Figures 3.4 and 3.6 show sample output of the triliteral root extraction algorithm.

27 Tim Buckwalter’s web site: http://www.qamus.org

3.3 Stemming by Ensemble or Voting

Natural language engineering aims to design systems that make as few errors as possible with as little effort and cost as possible. There are many ways to reduce errors. First, a better representation of the problem will reduce errors. Second, spending more time on encoding language knowledge in hand-crafted systems, or on finding more training data for data-driven systems, will reduce the errors of the system as well. However, these solutions are not always available because of lack of resources (Chan and Stolfo 1995; Atwell et al. 2000; Borin 2000; Džeroski, Erjavec and Zavrel 2000; Escudero, Màrquez and Rigau 2000; Banko and Brill 2001; Halteren, Zavrel and Daelemans 2001; Marques and Lopes 2001; Hu and Atwell 2003; Banko and Moore 2004; Glass and Bangay 2005; Yonghui et al. 2006).
Rather than giving a better representation of the problem, or spending more time on encoding language knowledge and finding more training data, combining different systems of known representation will, hopefully, reduce the errors of a system. The idea behind combining different systems is that systems designed differently, in terms of using different formalisms or containing different knowledge, will produce different types of errors. Provided that these differences are (i) complementary (i.e. systems produce different types of errors, where one system’s errors are neither the same as nor a subset of another system’s errors) and (ii) systematic (i.e. errors are not random), fixing some types of errors will reduce the errors of the combined system. By exploiting these disagreements between systems we might get better results and fewer errors from the combined system (Borin 2000; Halteren et al. 2001).

Much research has been done in the field of machine learning to find ways to improve the accuracy of supervised classifiers. An ensemble of classifiers that generate uncorrelated decisions can be more accurate than any of its component classifiers. There are many varieties of ensemble classifiers, in terms of how individual classifiers are selected or how they are combined (Halteren et al. 2001). If the classifiers are accurate and diverse, then the ensemble of classifiers will be more accurate than any of its individual members. An accurate classifier has an error rate better than random guessing on new values. Diversity means that two classifiers make different errors on new data points (Dietterich 2000).

A question raised is: is it possible in practice to build an ensemble that outperforms any of its individual members? There are three sources of evidence for the possibility of building a good ensemble. The first is statistical. Suppose that H is the space of hypotheses that a learning algorithm searches to identify the best hypothesis.
If the amount of training data is too small compared to the size of the hypothesis space, then the learning algorithm can find many different hypotheses in H that all give the same accuracy. An ensemble that combines all of these accurate classifiers can “average” their votes, and so reduces the risk of choosing the wrong classifier. The second reason is computational: many learning algorithms get stuck in local optima while performing some form of local search. Constructing an ensemble that runs the search from different starting points may provide a better approximation to the true unknown function than any of the individual classifiers. The final reason is representational: the true function f in most machine learning applications cannot be represented by any hypothesis in H. It may be possible to expand the space of representable functions by forming weighted sums of hypotheses drawn from H. Figure 3.1 below depicts the three reasons (Dietterich 2000).

Figure 3.1 The statistical, computational and representational methods for better and more accurate ensemble (Dietterich 2000)

The reuse of existing components is an established principle in software engineering. A voting program was developed to allow “voting” on the analysis of each word, using the results obtained from several candidate systems: for each word, examine the set of candidate analyses. Where all systems are in agreement, the common analysis is copied; where contributing systems disagree on the analysis, take the “majority vote”, i.e. the analysis given by most systems. If there is a tie, take the result produced by the system with the highest accuracy (Atwell and Roberts 2007).

The output files of the stemming algorithms are the input to the “voting” program. The program reads in these files, tokenizes them, and stores the words and the roots extracted by each stemming algorithm in temporary lists to be used by the voting procedures.
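The voting rule described above (copy the common analysis, otherwise take the majority, and on a tie prefer the most accurate system) can be sketched as below. This is a minimal illustration assuming one candidate root per system, supplied in best-first order of system accuracy; the function name vote and the transliterated sample roots are hypothetical.

```python
from collections import Counter

def vote(candidates):
    """Pick one root from the candidates proposed for a single word.

    candidates: list of roots, one per system, ordered from the most
    accurate system to the least accurate one (assumption of this sketch).
    """
    if not candidates:
        return None
    counts = Counter(candidates)
    top = max(counts.values())
    # Scan in best-first order, so ties go to the most accurate system.
    for root in candidates:
        if counts[root] == top:
            return root

print(vote(["ktb", "ktb", "ktb"]))  # full agreement: the common root is copied
print(vote(["qAl", "qwl", "qwl"]))  # two systems outvote the best one
print(vote(["qwl", "qAl", "qyl"]))  # three-way tie: the first system wins
```

With three systems a tie can only occur when all three disagree, so the tie-break reduces to trusting the most accurate system for that word.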
The temporary lists work as a bag of words that contains all the analyses produced by the stemming algorithms. The roots are ranked in best-first order according to the accuracy results; see section 3.6. Khoja’s stemmer results are inserted into the list first, then the results from the triliteral stemming algorithm, and finally the results of BAMA. After the construction of the lists of all words and their roots, a majority voting procedure is applied to select the most common root in the list. If the systems disagree on the analysis, the voting algorithm selects the “majority vote” root as the root of the word. If there is a tie, where each stemming algorithm generates a different root analysis, then the voting algorithm selects the root in one of two ways.

• In experiment 1, the algorithm simply selects the root randomly from the list using the FreqDist() Python function.
• In experiment 2, the algorithm selects the root generated by the stemming algorithm with the highest accuracy. This root is simply the one in the first position of the list, because the candidate roots of the word are inserted into the list best-first in terms of accuracy.

Figures 3.4 and 3.6 in section 3.5 show sample output of the voting algorithm for both experiments.

3.4 Gold standard for Evaluation

A gold standard for evaluating morphological analyzers and stemming algorithms for Arabic text was built using a randomly selected chapter of the Qur’an: chapter number 29, سورة العنكبوت sūra al-ankabūt “The Spider”, consisting of about 1000 words and representing classical Arabic text; see figure 3.2. In addition, a Modern Standard Arabic (MSA) text sample of about 1000 words from the Corpus of Contemporary Arabic²⁸ CCA (Al-Sulaiti and Atwell 2006) was used. The MSA text sample is selected from three genres (politics, sports and economics) of newspaper and magazine articles; see figure 3.2.
The gold standard is constructed by manually extracting the root of each word of the test documents. The manually extracted roots have been checked by Arabic language experts. Figures 3.4 and 3.6 in section 3.5 show samples of the gold standard’s roots for both text types.

Table 3.1 shows the number of word tokens, the number of word types and detailed frequencies for 4 texts: the gold standard’s Qur’an text document, the full Qur’an as a corpus, the gold standard’s CCA text document and a daily MSA newspaper article from Al-Rai daily newspaper²⁹ published in Jordan. The analysis also shows that function words such as في fī “in”, من min “from”, على ‘alā “on” and الله ’allāh “GOD” are the most frequent words in any Arabic text. On the other hand, non-function words with high frequency such as الجامعات al-ğāmi‘āt “Universities” and الكويت al-kuwayt “Kuwait” give a general idea about the main topic or the theme of the article.

Simple tokenization is applied to the text of the gold standard documents. This ensures that the test documents can be used to test any stemming algorithm smoothly and correctly.

28 The Corpus of Contemporary Arabic http://www.comp.leeds.ac.uk/eric/latifa/research.htm
29 Al-Rai daily newspaper http://www.alrai.com/

Figure 3.2 Sample from Gold Standard first document taken from Chapter 29 of the Qur’an (left) and the CCA (right). [Arabic sample text not recoverable]

Table 3.1 Summary of detailed analysis of the Arabic text documents used in the experiments

| Document | Tokens | Word types |
|---|---|---|
| Qur’an as Corpus | 77,787 | 19,278 |
| Gold standard document 1 (Chapter 29) | 987 | 616 |
| Gold standard document 2 (CCA document) | 1005 | 710 |
| Al-Rai newspaper article | 977 | 678 |

Frequencies of the ten most frequent tokens in each document [Arabic token column not recoverable]:

| Rank | Gold standard document 1 | Gold standard document 2 | Al-Rai newspaper article |
|---|---|---|---|
| 1 | 21 | 35 | 39 |
| 2 | 17 | 21 | 16 |
| 3 | 14 | 12 | 13 |
| 4 | 12 | 12 | 10 |
| 5 | 12 | 11 | 9 |
| 6 | 12 | 10 | 8 |
| 7 | 11 | 10 | 8 |
| 8 | 8 | 8 | 7 |
| 9 | 8 | 8 | 7 |
| 10 | 8 | 7 | 7 |

Qur’an as Corpus, frequencies of the ten most frequent tokens: 1179, 872, 832, 808, 652, 640, 605, 499, 464, 416.

3.5 Four Experiments and Results

In order to compare fairly between different stemming algorithms, four different experiments were applied to compute the accuracy of each algorithm. The accuracy of each experiment is measured using f-score; see formula (1). Each time the experiment is done, a comparison of the results with the gold standard is performed.

$$\text{Accuracy} = \frac{\text{number of correctly extracted roots}}{\text{total number of words}} \times 100\% \qquad (1)$$

The first experiment compares each token’s root output by the three stemming algorithms separately against the token’s roots in the gold standard. The second experiment excludes stop words (function words). The third experiment compares all word-type roots. Finally, word-type roots excluding the stop words (function words) are compared to the gold standard roots.
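The four experimental settings can be sketched as follows, under the assumption that each token is stored as a (word, gold root) pair and that a word type is a distinct such pair. The function names, the transliterated toy data and the placeholder gold label STOP for function words are hypothetical; only the accuracy computation follows formula (1).

```python
def accuracy(pairs, predict):
    """Percentage of (word, gold_root) pairs whose predicted root matches."""
    if not pairs:
        return 0.0
    correct = sum(1 for word, gold in pairs if predict(word) == gold)
    return 100.0 * correct / len(pairs)

def four_experiments(tokens, predict, stop_words):
    """Accuracy over tokens and word types, with and without stop words."""
    types = sorted(set(tokens))  # each distinct (word, gold_root) pair once

    def no_stop(pairs):
        return [p for p in pairs if p[0] not in stop_words]

    return {
        "exp1_all_tokens": accuracy(tokens, predict),
        "exp2_tokens_no_stop": accuracy(no_stop(tokens), predict),
        "exp3_all_types": accuracy(types, predict),
        "exp4_types_no_stop": accuracy(no_stop(types), predict),
    }

# Toy gold standard and toy stemmer output (illustrative values only).
tokens = [("min", "STOP"), ("kitAb", "ktb"), ("min", "STOP"),
          ("kAtib", "ktb"), ("qAl", "qwl"), ("kitAb", "ktb")]
stemmer_output = {"min": "mnn", "kitAb": "ktb", "kAtib": "ktb", "qAl": "qAl"}

results = four_experiments(tokens, stemmer_output.get, stop_words={"min"})
for name, value in results.items():
    print(name, round(value, 2))
```

Counting by word type removes the influence of repeated words, and excluding stop words removes items that have no root to extract, which is why the same stemmer can receive four different scores on one document.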
The evaluation is done by comparing the roots produced by the three algorithms, according to the four experimental specifications, against the manually extracted gold standard roots. Then the accuracy rate of each algorithm is computed using formula (1). Table 3.2 and figure 3.3 show the accuracy rates resulting from the four different experiments for the Qur’an test document. Table 3.3 and figure 3.5 show the accuracy rates resulting from the four different experiments for the CCA test document. Figures 3.4 and 3.6 show sample outputs of the stemming algorithms and the gold standard.

Table 3.2 Results of the four evaluation experiments of the 3 stemming algorithms tested using the Qur’an text sample

| Experiment | Algorithm | Errors | Fault Rate | Accuracy |
|---|---|---|---|---|
| Experiment 1: All Tokens (978 tokens) | Khoja’s Stemmer | 311 | 31.8% | 68.2% |
| | BAMA | 419 | 42.8% | 57.16% |
| | Triliteral | 394 | 40.3% | 59.71% |
| | Voting Exp.1 | 434 | 44.4% | 55.6% |
| | Voting Exp.2 | 405 | 41.4% | 58.6% |
| Experiment 2: Tokens excluding Stop words (554 tokens) | Khoja’s Stemmer | 209 | 37.73% | 62.27% |
| | BAMA | 325 | 58.66% | 41.34% |
| | Triliteral | 279 | 50.36% | 49.64% |
| | Voting Exp.1 | 266 | 48.0% | 52.0% |
| | Voting Exp.2 | 229 | 41.3% | 58.7% |
| Experiment 3: All Word Types (616 word types) | Khoja’s Stemmer | 224 | 36.36% | 63.64% |
| | BAMA | 267 | 43.34% | 56.66% |
| | Triliteral | 266 | 43.18% | 56.82% |
| | Voting Exp.1 | 242 | 39.3% | 60.7% |
| | Voting Exp.2 | 219 | 35.6% | 64.4% |
| Experiment 4: Word Types excluding Stop words (451 word types) | Khoja’s Stemmer | 155 | 34.37% | 65.63% |
| | BAMA | 251 | 55.65% | 44.34% |
| | Triliteral | 214 | 47.45% | 52.55% |
| | Voting Exp.1 | 174 | 38.6% | 61.4% |
| | Voting Exp.2 | 151 | 33.5% | 66.5% |

Figure 3.3 Accuracy rates resulting from the four different experiments for the Qur’an test document [bar chart: accuracy (0% to 100%) of Khoja’s Stemmer, BAMA, Triliteral, Voting Exp.1 and Voting Exp.2 for Exp. 1: All Tokens; Exp. 2: Tokens - Stop words; Exp. 3: All Word Types; Exp. 4: Word Types - Stop words]

Figure 3.4 Sample output of the three algorithms, the voting experiments and the gold standard of the Qur’an test document [table with columns Word, Khoja's stemmer, BAMA, Triliteral, Voting Exp. 1, Voting Exp. 2 and Gold Standard; Arabic content not recoverable]

The results shown in table 3.2 and figure 3.3 are computed by running the four experiments using the Qur’an text sample. The results of each stemming and voting algorithm in the four experiments are compared against the gold standard roots, and then accuracy rates are computed.

In experiment 1, containing all word tokens, Khoja’s stemmer achieved the highest accuracy of 68.2%. The triliteral root extraction algorithm and BAMA achieved quite similar results of 59.71% and 57.16% respectively. Neither voting experiment achieved better accuracy rates: 55.6% for voting experiment 1 and 58.6% for voting experiment 2.

In the second experiment, excluding stop words, Khoja’s stemmer scored the highest accuracy at 62.27%, then the triliteral root extraction algorithm at 49.64%, and finally BAMA at 41.34%. The voting algorithm scored 52.0% in voting experiment 1 and 58.7% in voting experiment 2.

The third experiment compares the results of each algorithm with respect to word-type roots. Khoja’s stemmer achieved the highest accuracy at 63.64%. The triliteral root extraction algorithm and BAMA achieved similar accuracy rates of 56.82% and 56.66% respectively. The voting algorithm in this experiment performed better and achieved an accuracy of 64.40% for voting experiment 2 and 60.70% for voting experiment 1. Voting experiment 2 outperforms the best algorithm results by 0.76%.

The final experiment evaluates word-type accuracy excluding stop words. Khoja’s stemmer achieved the highest accuracy rate at 65.63%. The triliteral root extraction algorithm achieved 52.55%, and finally BAMA achieved 44.34%.
The voting algorithm achieved better results at 66.5% and 61.4% for voting experiment 2 and voting experiment 1 respectively. Voting experiment 2 outperforms the best algorithm results by 0.87%.

In summary, Khoja’s stemmer achieved the highest accuracy rate at 68.2% in experiment 1. The rank of the stemming algorithms is Khoja’s stemmer, then the triliteral root extraction algorithm, and finally BAMA. The voting algorithm of voting experiment 2 outperforms the best algorithm results by about 0.8% in experiments 3 and 4.

Table 3.3 Tokens and word types accuracy of the 3 stemming algorithms and voting algorithms tested on CCA sample

| Experiment | Algorithm | Errors | Fault Rate | Accuracy |
|---|---|---|---|---|
| Experiment 1: All Tokens (1005 tokens) | Khoja’s Stemmer | 231 | 22.99% | 77.01% |
| | BAMA | 596 | 59.30% | 40.70% |
| | Triliteral | 234 | 23.28% | 76.72% |
| | Voting Exp.1 | 303 | 30.15% | 69.85% |
| | Voting Exp.2 | 266 | 26.47% | 73.53% |
| Experiment 2: Tokens excluding Stop words (766 tokens) | Khoja’s Stemmer | 212 | 27.7% | 72.3% |
| | BAMA | 431 | 60.70% | 39.30% |
| | Triliteral | 253 | 35.63% | 64.37% |
| | Voting Exp.1 | 303 | 39.56% | 60.44% |
| | Voting Exp.2 | 266 | 34.73% | 65.27% |
| Experiment 3: All Word Types (710 word types) | Khoja’s Stemmer | 232 | 32.68% | 67.32% |
| | BAMA | 431 | 60.70% | 39.30% |
| | Triliteral | 253 | 35.63% | 64.37% |
| | Voting Exp.1 | 248 | 34.93% | 65.07% |
| | Voting Exp.2 | 215 | 30.28% | 69.71% |
| Experiment 4: Word Types excluding Stop words (640 word types) | Khoja’s Stemmer | 184 | 28.75% | 71.25% |
| | BAMA | 423 | 66.09% | 33.91% |
| | Triliteral | 224 | 35.00% | 65.00% |
| | Voting Exp.1 | 252 | 39.4% | 60.6% |
| | Voting Exp.2 | 195 | 30.5% | 69.5% |

Figure 3.5 Accuracy rates results of the four different experiments for the CCA test document [bar chart: accuracy (0% to 100%) of Khoja’s Stemmer, BAMA, Triliteral, Voting Exp.1 and Voting Exp.2 for Exp 1: All Tokens; Exp 2: All Tokens - Stop words; Exp 3: All Word Types; Exp 4: All Word Types - Stop words]

Figure 3.6 Sample output of the three algorithms, the voting experiments and the gold standard of the CCA test document [table with columns Word, Khoja's stemmer, BAMA, Triliteral roots alg., Voting Exper. 1, Voting Exper. 2 and Gold Standard; Arabic content not recoverable]

The results shown in table 3.3 and figure 3.5 are computed by running the four experiments using the CCA text sample. The results of each stemming and voting algorithm in the four experiments are compared against the gold standard’s roots, and then accuracy rates are computed.

In experiment 1, containing all tokens, Khoja’s stemmer achieved the highest accuracy at 77.01%. The triliteral root extraction algorithm achieved 76.72%, and finally BAMA achieved 40.70%. Neither voting experiment achieved better accuracy rates: 69.85% for voting experiment 1 and 73.53% for voting experiment 2.

In the second experiment, excluding stop words, Khoja’s stemmer scored the highest accuracy at 72.30%, then the triliteral root extraction algorithm at 64.37%, and finally BAMA at 39.30%. The voting algorithm scored 60.44% in voting experiment 1 and 65.27% in voting experiment 2.

The third experiment compares the results of each algorithm by word type. Khoja’s stemmer achieved the highest accuracy at 67.32%, then the triliteral root extraction algorithm at 64.37%, then BAMA at 39.30%. The voting algorithm in this experiment performed better and achieved 69.71% for voting experiment 2 and 65.07% for voting experiment 1. Voting experiment 2 outperforms the best algorithm results by 2.39%.

The final experiment excludes stop words when comparing word-type roots. Khoja’s stemmer achieved the highest accuracy rate at 71.25%, then the triliteral root extraction algorithm at 65.00%, and finally BAMA at 33.91%. The voting algorithm achieved accuracy rates of 69.50% and 60.60% for voting experiment 2 and voting experiment 1 respectively, neither of which exceeds Khoja’s stemmer.

In summary, Khoja’s stemmer achieved the highest accuracy rate at 77.01% in experiment 1.
The rank of the stemming algorithms is Khoja’s stemmer, then the triliteral root extraction algorithm, and finally BAMA. The voting algorithm of voting experiment 2 outperforms the best algorithm results by 2.39% in experiment 3.

3.6 Comparative Evaluation Conclusions

This study compared three existing stemming algorithms: Khoja’s stemmer, BAMA and the triliteral root extraction algorithm. Results of the stemming algorithms were compared with the gold standard of classical and MSA text samples of 1,000 words each. Four experiments were performed to fairly and accurately compare the outputs of the three stemming and morphological analysis algorithms for Arabic text. The four experiments on both text samples show the same accuracy rank for the stemming algorithms: Khoja’s stemmer achieved the highest accuracy, then the triliteral root extraction algorithm, and finally BAMA. Khoja’s and the triliteral stemming algorithms generate only one analysis for each input word, while BAMA generates one or more analyses. The voting algorithm achieves about 62% average accuracy for Qur’an text and about 70% average accuracy for newspaper text.

The results show that the stemming algorithms used in the experiments work better on MSA text (i.e. newspaper text) than on classical Arabic (i.e. Qur’an text), not unexpectedly as they were originally designed for stemming MSA text. All stemming algorithms involved in the experiments agreed on, and generated, correct analyses for simple roots that do not require detailed analysis. So, more detailed analysis and enhancements are recommended as future work.

Most stemming algorithms are designed for information retrieval systems where accuracy of the stemmers is not such an important issue. On the other hand, accuracy is vital for natural language processing. The accuracy rates show that even the best algorithm did not exceed 77.01% accuracy, and stayed below 75% in most experiments.
This indicates that more research is required, as Part-of-Speech tagging and then Parsing cannot rely on such stemming algorithms because errors from the stemming algorithms will propagate to such systems.

The experiments are limited to the three stemming algorithms. Other algorithms are not freely available on the web, and it is hard to acquire them from the authors. Open-source development of resources is important to advance research on Arabic NLP.

3.7 Analytical Study of Arabic Triliteral Roots

To understand the nature of Arabic roots and the derivation process of words, triliteral roots are classified into 22 groups depending on the internal structure of the root itself: whether it contains only consonant letters, hamzah, or defective letters (Dahdah 1987; Wright 1996; Al-Ghalayyni 2005; Ryding 2005). Section 6.2.21 discusses the classification of triliteral roots.

The distribution of Arabic triliteral roots over the 22 categories is studied by analyzing two real text resources. The first is the Qur’an as a corpus, which contains 45,534 triliteral-root words (i.e. not including function words which do not have triliteral roots, such as demonstrative pronouns, e.g. hāḏā “this”, words with quadriliteral roots, such as darāhim “dirhams” from the root d-r-h-m, or words with quinqueliteral roots). This is an example of a natural corpus where words are repeated in different contexts. The second is a set of 376,167 word types derived from triliteral roots, an example of a dictionary of Arabic where each word of the test sample occurs once. Chapter 4 will discuss the processing steps, statistics and evaluation of the broad-coverage lexical resource, the SALMA-ABCLexicon.

3.7.1 A Study of Triliteral Roots in the Qur’an

In general it is said that an Arabic word has a root of 3 consonants. However, there are many exceptions which cause problems for analysis. Hamzah is a special letter which is not a normal consonant but can appear in a root. Also, a few roots include vowels, and these are called “defective”.
Sometimes a consonant is doubled, and this also causes ambiguity in analysis.

The results show that 68% of the triliteral roots of the Qur’an are intact roots, represented by categories 1 to 5 in table 3.4, and that 61% of the words of the Qur’an are derived from them. 29% of the triliteral roots of the Qur’an are defective roots (i.e. they contain one or two vowels in their root), represented by categories 6-11 in table 3.4; 32% of the words of the Qur’an belong to this category. The third category contains one or two vowels and hamzah in its root, and is represented by categories 12-22 in table 3.4. Such roots form 3% of the triliteral roots of the Qur’an, and 7% of the words of the Qur’an belong to this category. Table 3.5 and figure 3.7 show the distribution of the Qur’an’s words and roots into the three main root categories.

Table 3.4 Category distribution of Root-Types and Word-Tokens extracted from the Qur’an

| No. | Category | Pattern | Root types: count | Root types: percentage | Word tokens: count | Word tokens: percentage |
|---|---|---|---|---|---|---|
| 1 | Sound | C1 C2 C3 | 870 | 54.04% | 20,007 | 43.94% |
| 2 | Doubled | C1 C2 C2 | 136 | 8.45% | 3,814 | 8.38% |
| 3 | Initially-hamzated | H C2 C3 | 44 | 2.73% | 3,243 | 7.12% |
| 4 | Medially-hamzated | C1 H C3 | 15 | 0.93% | 281 | 0.62% |
| 5 | Finally-hamzated | C1 C2 H | 32 | 1.99% | 459 | 1.01% |
| 6 | Initially-defective | V C2 C3 | 70 | 4.35% | 1,252 | 2.75% |
| 7 | Medially-defective | C1 V C3 | 198 | 12.30% | 8,162 | 17.93% |
| 8 | Finally-defective | C1 C2 V | 167 | 10.37% | 3,584 | 7.87% |
| 9 | Separated doubly-weak | V C2 V | 12 | 0.75% | 710 | 1.56% |
| 10 | Finally-adjacent doubly-weak | C1 V1 V2 | 19 | 1.18% | 473 | 1.04% |
| 11 | Initially-adjacent doubly-weak | V1 V2 C3 | 2 | 0.12% | 445 | 0.98% |
| 12 | Initially-hamzated and doubled | H C2 C2 | 7 | 0.43% | 175 | 0.38% |
| 13 | Initially-defective and doubled | V C2 C2 | 2 | 0.12% | 40 | 0.09% |
| 14 | Initially-hamzated and finally-defective | H C2 V | 13 | 0.81% | 958 | 2.10% |
| 15 | Initially-hamzated and medially-defective | H V C3 | 6 | 0.37% | 153 | 0.34% |
| 16 | Adjacent doubly-weak and initially-hamzated | H V1 V2 | 2 | 0.12% | 418 | 0.92% |
| 17 | Finally-defective and medially-hamzated | C1 H V | 2 | 0.12% | 330 | 0.72% |
| 18 | Separated doubly-weak and medially-hamzated | V1 H V2 | 0 | 0.00% | 0 | 0.00% |
| 19 | Initially-defective and medially-hamzated | V H C3 | 3 | 0.19% | 15 | 0.03% |
| 20 | Medially-defective and finally-hamzated | C1 V H | 8 | 0.50% | 998 | 2.19% |
| 21 | Initially-defective and finally-hamzated | V C2 H | 2 | 0.12% | 17 | 0.04% |
| 22 | Adjacent doubly-weak and finally-hamzated | V1 V2 H | 0 | 0.00% | 0 | 0.00% |
| | Totals | | 1610 | 100.00% | 45,534 | 100.00% |

Table 3.5 Summary of category distribution of roots and tokens of the Qur’an

| Category | Roots: total | Roots: percentage | Tokens: total | Tokens: percentage |
|---|---|---|---|---|
| Intact | 1097 | 68.14% | 27,804 | 61.06% |
| Defective | 468 | 29.07% | 14,626 | 32.12% |
| Defective and hamzated | 45 | 2.80% | 3,104 | 6.82% |
| Totals | 1610 | 100.00% | 45,534 | 100.00% |

[Figure 3.7: two pie charts. Qur'an Roots: Intact 68.14%, Defective 29.07%, Defective and hamzated 2.80%. Qur'an Tokens: Intact 61.06%, Defective 32.12%, Defective and hamzated 6.82%]

Figure 3.7 Root distribution (left) and word distribution (right) of the Qur’an

3.7.2 A Study of Triliteral Roots in Traditional Arabic Lexicons

Similar root and word distributions were obtained from the roots and the word types stored in the broad-coverage lexical resource. About 63% of the roots stored in the broad-coverage lexical resource are intact roots, categories 1-5 in table 3.6, and slightly more than 68% of the word types belong to this category. Defective roots, represented by categories 6-11 in table 3.6, form about 33% of the roots of the broad-coverage lexical resource, and 29% of the word types belong to this category. Finally, defective and hamzated roots, represented by categories 12-22 in table 3.6, are approximately 4% of the roots of the broad-coverage lexical resource, and about 2% of the word types belong to this category. Figure 3.8 and table 3.7 show the root and word type distribution after analyzing the broad-coverage lexical resource.
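The broad-category summaries reported in this section are plain aggregations of the per-category counts. As a minimal sketch of that arithmetic, the following Python reproduces the Qur'an root figures of table 3.5 from the root-type counts of table 3.4. The function and variable names are illustrative assumptions and are not taken from the thesis software.

```python
# Root-type counts for categories 1..22, copied from Table 3.4 (Qur'an).
ROOT_COUNTS = [870, 136, 44, 15, 32, 70, 198, 167, 12, 19, 2,
               7, 2, 13, 6, 2, 2, 0, 3, 8, 2, 0]

# Broad groups as defined in the text: categories 1-5, 6-11 and 12-22.
GROUPS = {
    "Intact": range(1, 6),
    "Defective": range(6, 12),
    "Defective and hamzated": range(12, 23),
}


def summarise(counts):
    """Return {group: (subtotal, percentage of the grand total)}."""
    total = sum(counts)
    summary = {}
    for name, categories in GROUPS.items():
        # Category numbers are 1-based, list indices are 0-based.
        subtotal = sum(counts[c - 1] for c in categories)
        summary[name] = (subtotal, round(100 * subtotal / total, 2))
    return summary
```

Calling `summarise(ROOT_COUNTS)` yields the subtotals 1097, 468 and 45 with their shares of the 1610 roots; the same function can be applied to the word-token column or to the lexicon counts.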
Table 3.6 Category distribution of Roots and Word types extracted from the lexicon

| No. | Category | Pattern | Roots: count | Roots: percentage | Word types: count | Word types: percentage |
|---|---|---|---|---|---|---|
| 1 | Sound | C1 C2 C3 | 4147 | 48.78% | 201,385 | 53.54% |
| 2 | Doubled | C1 C2 C2 | 446 | 5.25% | 32,007 | 8.51% |
| 3 | Initially-hamzated | H C2 C3 | 289 | 3.40% | 10,449 | 2.78% |
| 4 | Medially-hamzated | C1 H C3 | 216 | 2.54% | 3,909 | 1.04% |
| 5 | Finally-hamzated | C1 C2 H | 270 | 3.18% | 8,985 | 2.39% |
| 6 | Initially-defective | V C2 C3 | 386 | 4.54% | 19,219 | 5.11% |
| 7 | Medially-defective | C1 V C3 | 1115 | 13.11% | 43,512 | 11.57% |
| 8 | Finally-defective | C1 C2 V | 1151 | 13.54% | 41,295 | 10.98% |
| 9 | Separated doubly-weak | V C2 V | 45 | 0.53% | 2,372 | 0.63% |
| 10 | Finally-adjacent doubly-weak | C1 V1 V2 | 106 | 1.25% | 4,057 | 1.08% |
| 11 | Initially-adjacent doubly-weak | V1 V2 C3 | 22 | 0.26% | 211 | 0.06% |
| 12 | Initially-hamzated and doubled | H C2 C2 | 30 | 0.35% | 888 | 0.24% |
| 13 | Initially-defective and doubled | V C2 C2 | 29 | 0.34% | 463 | 0.12% |
| 14 | Initially-hamzated and finally-defective | H C2 V | 74 | 0.87% | 2,111 | 0.56% |
| 15 | Initially-hamzated and medially-defective | H V C3 | 47 | 0.55% | 892 | 0.24% |
| 16 | Adjacent doubly-weak and initially-hamzated | H V1 V2 | 7 | 0.08% | 135 | 0.04% |
| 17 | Finally-defective and medially-hamzated | C1 H V | 42 | 0.49% | 1,041 | 0.28% |
| 18 | Separated doubly-weak and medially-hamzated | V1 H V2 | 2 | 0.02% | 52 | 0.01% |
| 19 | Initially-defective and medially-hamzated | V H C3 | 15 | 0.18% | 292 | 0.08% |
| 20 | Medially-defective and finally-hamzated | C1 V H | 42 | 0.49% | 1,590 | 0.42% |
| 21 | Initially-defective and finally-hamzated | V C2 H | 21 | 0.25% | 1,302 | 0.35% |
| 22 | Adjacent doubly-weak and finally-hamzated | V1 V2 H | 0 | 0.00% | 0 | 0.00% |
| | Totals | | 8502 | 100.00% | 376,167 | 100.00% |

Table 3.7 Summary of category distribution of roots and word types of the lexicons

| Category | Roots: total | Roots: percentage | Word types: total | Word types: percentage |
|---|---|---|---|---|
| Intact | 5368 | 63.30% | 256,735 | 68.25% |
| Defective | 2803 | 33.05% | 110,666 | 29.42% |
| Defective and hamzated | 309 | 3.64% | 8,766 | 2.33% |
| Totals | 8480 | 100.00% | 376,167 | 100.00% |

[Figure 3.8: two pie charts. Lexicons' Roots: Intact 63.30%, Defective 33.05%, Defective and hamzated 3.64%. Lexicons' Word Types: Intact 68.25%, Defective 29.42%, Defective and hamzated 2.33%]

Figure 3.8 Root distribution (left) and word type distribution (right) of the broad-coverage lexical resource

3.7.3 Discussion of the Analytical Study of Arabic Triliteral Roots

The above analysis gives a clear picture of the distribution of the 22 categories and 3 broad categories of triliteral roots, words and word types. The study shows that about a third of the words in Arabic text have roots belonging to the defective or the defective and hamzated root categories. Words belonging to these two root categories are hard to analyze, and the root extraction process for such words has higher error rates than for words belonging to the intact root category. Stemmers and morphological analyzers are prone to mistakes when analyzing words belonging to these two broad categories.

Similar distribution results were obtained by analyzing the Qur’an’s roots and words and the broad-coverage lexicon’s roots and word types. About 65% of roots, words and word types belong to intact triliteral roots. About 30% of the roots, words and word types are classified into the defective triliteral root category. Finally, about 5% of the roots, words and word types belong to the defective and hamzated triliteral root category. These figures indicate that any successful stemming and morphological analysis system has to deal with issues specific to Arabic word derivation such as incorporation, substitution and deletion of a weak vowel letter. Moreover, dealing with orthographic issues such as the writing of hamzah is critical for stemming and morphological analysis of Arabic text. A stemmer or morphological analyzer which does not deal with these language-specific features will not achieve a root extraction accuracy of more than about 65% in the best case.
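To make the 22-way classification concrete, the sketch below maps a triliteral root to its category number, category name and broad group, using the C/H/V patterns listed in the category tables of this section. It is an illustration only: the letter sets chosen for hamzah and for the weak letters are assumptions, as are all function and variable names, and real text would first need diacritics removed.

```python
# Assumed letter sets (illustrative; not taken from the thesis).
HAMZA = set("\u0621\u0622\u0623\u0624\u0625\u0626")  # hamzah and its seated forms
WEAK = set("\u0648\u064A\u0627\u0649")               # waw, ya', alif, alif maqsura

# (radical shape, second and third radicals identical?) -> (number, name)
CATEGORIES = {
    ("CCC", False): (1, "Sound"),
    ("CCC", True): (2, "Doubled"),
    ("HCC", False): (3, "Initially-hamzated"),
    ("CHC", False): (4, "Medially-hamzated"),
    ("CCH", False): (5, "Finally-hamzated"),
    ("VCC", False): (6, "Initially-defective"),
    ("CVC", False): (7, "Medially-defective"),
    ("CCV", False): (8, "Finally-defective"),
    ("VCV", False): (9, "Separated doubly-weak"),
    ("CVV", False): (10, "Finally-adjacent doubly-weak"),
    ("VVC", False): (11, "Initially-adjacent doubly-weak"),
    ("HCC", True): (12, "Initially-hamzated and doubled"),
    ("VCC", True): (13, "Initially-defective and doubled"),
    ("HCV", False): (14, "Initially-hamzated and finally-defective"),
    ("HVC", False): (15, "Initially-hamzated and medially-defective"),
    ("HVV", False): (16, "Adjacent doubly-weak and initially-hamzated"),
    ("CHV", False): (17, "Finally-defective and medially-hamzated"),
    ("VHV", False): (18, "Separated doubly-weak and medially-hamzated"),
    ("VHC", False): (19, "Initially-defective and medially-hamzated"),
    ("CVH", False): (20, "Medially-defective and finally-hamzated"),
    ("VCH", False): (21, "Initially-defective and finally-hamzated"),
    ("VVH", False): (22, "Adjacent doubly-weak and finally-hamzated"),
}


def shape(root):
    """Map each radical to H (hamzah), V (weak letter) or C (consonant)."""
    return "".join(
        "H" if ch in HAMZA else "V" if ch in WEAK else "C" for ch in root
    )


def classify(root):
    """Return (number, name, broad group), or None for an unlisted pattern."""
    if len(root) != 3:
        raise ValueError("expected a root of exactly three letters")
    s = shape(root)
    # Doubling is only distinguished when radicals 2 and 3 are consonants.
    doubled = s[1:] == "CC" and root[1] == root[2]
    entry = CATEGORIES.get((s, doubled))
    if entry is None:
        return None
    number, name = entry
    if number <= 5:
        broad = "Intact"
    elif number <= 11:
        broad = "Defective"
    else:
        broad = "Defective and hamzated"
    return number, name, broad
```

For example, `classify("كتب")` (k-t-b) falls in category 1 and the intact group, while `classify("قول")` (q-w-l) falls in category 7 and the defective group. A table lookup is used rather than a rule such as "contains a weak letter" because the broad groups follow the category numbering, so that categories 12 and 13 are counted with the third group.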
A question raised in this context is: how can stemming and morphological analysis be improved so that the algorithm deals successfully with the hard cases, the 35% of words belonging to the defective and the defective and hamzated triliteral root categories? Two methodologies can be followed: either building a sophisticated algorithm that deals with the hard cases, or simply providing the algorithm with a prior-knowledge broad-coverage lexical resource that contains most of the hard-case words and their triliteral roots. In the second case, the stemming algorithm looks up the word to be analyzed in the lexicon and gets the correct analysis for that word. A look-up methodology is needed here.

Chapter 4 discusses the motivation and the processing steps in constructing the prior-knowledge broad-coverage lexical resource, the SALMA-ABCLexicon [30]. The lexicon was constructed by analyzing the text of 23 traditional Arabic lexicons which are freely available open-source documents (PDF and MS-Word files). The main purpose of constructing the SALMA-ABCLexicon was to improve the morphological analysis of Arabic text. Constructing a broad-coverage lexical resource to improve the accuracy of Arabic morphological analysis has advantages over developing a sophisticated stemming algorithm. These advantages are discussed in detail in section 4.4. The constructed lexicon has about half a million different Arabic words, which cover about 85% of the words in the Arabic corpora tested in Chapter 4.

3.8 Summary and Conclusions

Arabic morphological analysers and stemming algorithms have become a popular area of research. Several computational linguists have designed and developed algorithms to solve the problems of morphology and syntax. Stemming algorithms have been developed for many languages including Arabic. Several stemming algorithms for Arabic already exist, but each researcher proposed an evaluation methodology based on different text corpora. Therefore, we cannot make direct comparisons between these evaluations.
This chapter discussed four different fair and precise evaluation experiments using a gold standard for evaluation consisting of two 1000-word text documents from the Holy Qur’an and the Corpus of Contemporary Arabic. The selection of the stemming algorithms was limited to the algorithms where we have ready access to the implementation and/or results. The three selected algorithms are Khoja’s stemmer (Khoja 2003), Buckwalter’s Morphological Analyzer (BAMA) (Buckwalter 2002) and Al-Shalabi et al.’s triliteral root extraction algorithm (Al-Shalabi et al. 2003). A voting program was developed that reuses the results of the three algorithms to allow “voting” on their analyses.

The four experiments on both text samples show the same accuracy rank for the stemming algorithms: Khoja’s stemmer achieved the highest accuracy, then the triliteral root extraction algorithm, and finally BAMA. The results show that the stemming algorithms used in the experiments work better on MSA text (i.e. newspaper text) than on classical Arabic (i.e. Qur’an text), not unexpectedly as they were originally designed for stemming MSA text. All stemming algorithms involved in the experiments agreed on, and generated, correct analyses for simple roots that do not require detailed analysis. So, more detailed analysis and enhancements are recommended as future work.

[30] SALMA-ABCLexicon (Sawalha Atwell Leeds Morphological Analysis - Arabic Broad-Coverage Lexicon) http://www.comp.leeds.ac.uk/cgi-bin/scmss/arabic_roots.py

Most stemming algorithms are designed for information retrieval systems where accuracy of the stemmers is not such an important issue. On the other hand, accuracy is vital for natural language processing. The accuracy rates show that even the best algorithm did not exceed 77.01% accuracy, and stayed below 75% in most experiments.
This indicates that more research is required, as Part-of-Speech tagging and then Parsing cannot rely on such stemming algorithms because errors from the stemming algorithms will propagate to such systems.

A clear picture was presented of how triliteral roots, words and word types are distributed over the 22 categories of triliteral roots. The study showed that about one third of Arabic text words have roots belonging to the defective or the defective and hamzated root categories. Words belonging to these two root categories are hard to analyze, and the root extraction process for such words has higher error rates than for words belonging to the intact root category. Existing stemmers and morphological analyzers are prone to mistakes when analyzing words belonging to these two categories.

The construction of a broad-coverage lexical resource to improve the accuracy of Arabic morphological analysis was proposed as a practical solution. Chapter 4 will discuss the motivation and the processing steps in constructing the prior-knowledge broad-coverage lexical resource, the SALMA-ABCLexicon. The lexicon is constructed by analyzing the text of 23 traditional Arabic lexicons which are freely available open-source documents. The main purpose of constructing the SALMA-ABCLexicon is to improve morphological analysis of Arabic text. The constructed lexicon has about half a million different Arabic words, which cover about 85% of the words in the Arabic corpora tested in Chapter 4.

Chapter 4
The SALMA-ABCLexicon: Prior-Knowledge Broad-Coverage Lexical Resource to Improve Morphological Analyses

This chapter is based on the following sections of published papers: Sections 1, 2, 3, 4, 5 and 6 are based on sections 1, 2, 3, 4, 5, 6 and 7 in (Sawalha and Atwell 2010a).

Chapter Summary

Broad-coverage language resources which provide prior linguistic knowledge should improve the accuracy and the performance of NLP applications.
A broad-coverage lexical resource, the SALMA-ABCLexicon (Sawalha Atwell Leeds Morphological Analysis Arabic Broad-Coverage Lexicon), was constructed to improve the accuracy of morphological analyzers and part-of-speech taggers of Arabic text. Over the past 1200 years, many different kinds of Arabic language lexicons have been constructed; these lexicons are different in ordering, size and aim of construction. 23 machine-readable lexicons, which are freely available on the web as portable document format (.pdf) or MS-Word (.doc) documents, were collected. Lexical resources were combined into one large broad-coverage lexical resource, the SALMA-ABCLexicon, by extracting information from disparate formats and merging traditional Arabic lexicons. The construction process followed agreed criteria for constructing morphological lexical resources from raw text. To evaluate the broad-coverage lexical resource, coverage was computed over the Qur’an, the Corpus of Contemporary Arabic, and a sample from the Arabic Internet Corpus, using two methods. Counting exact word matches between test corpora and lexicon scored about 65-68%; Arabic has a rich morphology with many combinations of roots, affixes and clitics, so about a third of words in the corpora did not have an exact match in the lexicon. The second approach is to compute coverage in terms of use in a lemmatizer program, which strips clitics to look for a match for the underlying lexeme; this scored about 82-85%.

4.1 Introduction

Lexicography is the applied part of lexicology. It is concerned with collating and ordering entries, derivations and their meanings, depending on the aim of the lexicon to be constructed and its size. Lexicography is defined as “…the branch of applied linguistics concerned with the design and construction of lexica for practical use.” (Eynde and Gibbon 2000).
On the other hand, lexicology is defined as “…the branch of descriptive linguistics concerned with the linguistic theory and methodology for describing lexical information, often focusing specifically on issues of meaning.” (Eynde and Gibbon 2000). Long-term efforts in lexicographic projects have greatly accelerated since the advent and use of computers: this is known as computational lexicography. However, constructing a large-scale broad-coverage lexicon involves time-consuming development of specifications, design, collection of lexical data, information structuring, and user-oriented presentation formatting (Eynde and Gibbon 2000).

A realistic and useful lexicon for NLP requires an efficiently stored machine-readable database with a large number of words with associated syntactic and semantic information (Russell et al. 1986). Morphological lexicons are based on the idea of generating all possible combinations of morphemes. However, filtering out the non-established, yet theoretically possible, combinations of morphemes is the major problem of lexicon generation (Tadić and Fulgosi 2003). Morphological lexicons are useful for many natural language applications such as spelling and syntactic checkers integrated into word processing applications, development of morphological and syntactic analyzers, search engines, machine translation, and information filtering and extraction systems (Petasis et al. 2001). Morphosyntactic lexicons are valuable resources for many NLP applications. However, these lexicons need to meet certain specifications: high coverage, a high level of quality, direct reusability in NLP tools, and free availability to potential users (Sagot 2010).

4.1.1 Morphological Lexicons of Other Languages

Morphological lexicons exist for many languages. The Special Interest Group on the Lexicon of the Association for Computational Linguistics (ACL SIGLEX) maintains a comprehensive online list of lexical resources [31].
The lists and files with linguistic information include: the Brown Corpus Lexicon of 52,000 words; the XTAG project with an associated 300,000 word English lexicalized grammar; COMLEX (COMmon LEXicon), a monolingual English dictionary consisting of 38,000 head words; the Oxford Text Archive (OTA) of machine-readable dictionaries for many languages; Adam Kilgarriff’s list of the 6,318 most frequent lemmas extracted from the British National Corpus; the Moby lexicon project, consisting of sub-lexicons including Moby Hyphenator (185,000 entries), Moby Part-of-Speech (230,000 entries), Moby Thesaurus (30,000 entries) and Moby Words (610,000 words and phrases); and the Upper Cyc Ontology, containing about 3,000 words capturing the most general concepts of human consensus reality.

[31] Online lexical resources by ACL SIGLEX http://www.clres.com/online.html

Russell, Pulman et al. (1986) developed a dictionary and morphological analyzer for English. They assumed that correct syntactic analyses are built into the lexical entries, while allowing adaptation by users to suit different analyses. The morphological lexicon itself consists of a sequence of entries, each in the form of a Lisp s-expression which consists of five elements: first, the head word in written form; second, the head word in phonological transcription; third, a syntactic field consisting of a syntactic category; fourth, a semantic field into which users can insert any Lisp s-expression; and finally, a user field which allows users to include any additional information they desire. The prototype lexicon contains about 3,500 entries.

MULTEXT lexicons [32] are part of the MULTEXT project, which aims to develop tools, corpora, and linguistic resources for a wide variety of languages. The MULTEXT lexicons include four developed lexicons for German, Italian, Spanish and French.
The lexicons are stored in tab-separated column files where the first column represents the word form, the second column represents the lemma and the last column represents the lexical tag.

MULTEXT-East [33] language resources are multilingual datasets for language engineering focused on the morphosyntactic level of linguistic description. These resources cover 16 languages, mainly of central and eastern Europe, and include the EAGLES-based morphosyntactic specifications and morphosyntactic lexica. MULTEXT-East followed the same lexicon format as the original MULTEXT lexicons. The size of MULTEXT-East lexicons ranges from 13,006 entries for Persian to 2,461,491 entries for Slovak (Erjavec 2010).

The Croatian Morphological Lexicon (CML) is a lexicon developed to model the Croatian morphological system. The CML has two sub-lexicons: derivative/compositional (i.e. a list of lexical morphemes and a list of derivational morphemes with rules for combining them) and inflectional (i.e. a list of generated stems and a list of inflectional morphemes with rules for combining them), which are produced by two morphological generators according to morphotactic rules. The CML followed the same lexicon format as MULTEXT-East. The CML contains 36,000 lemmas extracted from the Croatian dictionary. Word-form generation then produced 171,308 nouns, 232,276 verbs, 1,207,786 adjectives and 11,706 adverbs (Tadić and Fulgosi 2003).

[32] MULTEXT Lexicons http://aune.lpl.univ-aix.fr/projects/multext/MUL5.html
[33] MULTEXT-East http://nl.ijs.si/ME/V4/

A large-scale Greek morphological lexicon was developed by the Software and Knowledge Engineering Laboratory (SKEL) to be used to develop a lemmatizer and morphological analyzer in a controlled language checker for Greek.
The SKEL lexicon is organized into two components: the query component, which facilitates querying the lexicon about a specific form and retrieving the associated linguistic information; and the generation component, which is responsible for generating all possible word forms for a given lemma. The generation component also utilizes language-specific rules regarding syllabication and accentuation. The morphological database consists of a fixed number of pages, where each page contains a set of morphological entries. Each entry contains a fixed number of morphological features such as lemma, stem, suffix, syllabication, part-of-speech and other morphological features such as number, inflectional type, gender, case, inflection, tense, person, voice and mood. The SKEL lexicon contains 60,000 unique lemmas which generate 710,000 word forms. The morphological database contains about 2,500,000 morphological entries (Petasis et al. 2001).

A Latvian lexicon was developed as part of a lexicon-based morphological analyzer for Latvian, which implements word inflection based on a stem and its properties already stored in the lexicon. The lexicon’s core data are the dictionary’s lexical units, which contain word stems, their morphological types and any other linguistic information related to the stems. The lexicon contains about 27,000 stems. The coverage of the lexicon was measured at 85%-90% after analyzing an unrestricted text corpus. A heuristic, based on the last letter of the analyzed word, is integrated with the morphological analyzer for guessing the part-of-speech of the remaining uncovered words. XML files are used to store the lexicon and other data files (Paikens 2007).

Lefff [34] (Lexique des formes fléchies du français, Lexicon of French inflected forms), a freely available and wide-coverage morphosyntactic lexicon for French, is used in many NLP tools including large-coverage parsers.
The Lefff uses the Alexina framework to ensure reusability of the lexicon in many NLP tools. Alexina is a lexical modelling and acquisition framework for both the morphological and syntactic levels, which is independent of language and grammatical formalism and compatible with Lexical Markup Framework (LMF) standards. The Alexina lexicon consists of entries (i.e. lexemes) where each entry is associated with a lemma, a category and an inflectional class. The Lefff (3.0.1) contains 536,375 entries corresponding to 110,477 lemmas covering the grammatical categories of verbs, verbal idioms, nouns, adjectives, adverbs, prepositions, proper nouns and others. The Lefff is evaluated by a quantitative comparison with other existing lexical resources for French. It has also been evaluated in terms of its use in a POS tagger and a deep parser. Integrating Lefff in a maximum-entropy-based part-of-speech tagger for French trained on the French Treebank increased the accuracy from 97.0% (86.1% for unknown words) to 97.7% (90.1% for unknown words) (Sagot et al. 2006; Nicolas et al. 2008; Sagot 2010).

[34] Lefff http://www.labri.fr/perso/clement/lefff/

Sagot (2005) developed a lexicon for Slovak from a raw corpus and a morphological description of the language. Both inflectional and derivational morphology are used to enhance the accuracy (recall and precision) and to acquire the derivational relations in the lexicon. A three-step procedure is followed for the acquisition of the lexicon. First, given the morphological description of the language, all lemmas that can possibly explain the inflected forms found in the corpus are built. Second, the lemmas are ranked according to their likelihood in the corpus. Finally, the best-ranked lemmas are manually validated. Sagot claims that this methodology can be used for morphologically rich languages. The lexicon acquired by following this methodology contains 2,000 lemmas generating more than 50,000 inflected forms (Sagot 2005).
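The three-step acquisition procedure summarized above can be illustrated with a toy sketch. It is not Sagot's implementation: the suffix paradigm is a hypothetical English-like one, the ranking simply counts how many attested forms a candidate lemma explains (a crude stand-in for likelihood in the corpus), and all names are assumptions introduced for illustration.

```python
from collections import defaultdict

# Hypothetical toy morphological description: one paradigm of suffixes
# that may be attached to a lemma ("" means the bare lemma itself).
SUFFIXES = ("", "s", "ed", "ing")


def candidate_lemmas(form):
    """Step 1: build every lemma that could explain an inflected form."""
    for suffix in SUFFIXES:
        if form.endswith(suffix) and len(form) > len(suffix):
            yield form[: len(form) - len(suffix)]


def rank_lemmas(corpus_forms):
    """Step 2: rank candidates by the number of attested forms they explain."""
    explained = defaultdict(set)
    for form in set(corpus_forms):
        for lemma in candidate_lemmas(form):
            explained[lemma].add(form)
    # Highest support first; ties broken alphabetically for determinism.
    return sorted(explained.items(), key=lambda item: (-len(item[1]), item[0]))


def shortlist(ranked, top_n):
    """Step 3 (input to it): the best-ranked lemmas go to a human validator."""
    return [lemma for lemma, _forms in ranked[:top_n]]
```

With the forms `walk`, `walks`, `walked`, `walking` and `wed`, the candidate `walk` explains four attested forms and is ranked first, whereas the spurious candidate `w` (from `wed`) explains only one form and sinks down the list, which is why the final validation step can concentrate on the top of the ranking.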
A morphological analyzer and a language-specific web crawler (i.e. a tool used to collect a list of word types) have the potential to enhance lexical resources for morphologically rich but resource-poor languages such as Tigrinya. Tigrinya is an Ethio-Semitic language spoken by about 6 million people in the Tigray region of northern Ethiopia and in central Eritrea. The web crawler collected a list of 227,984 word types. Then, the list was filtered and passed to the morphological analyzer. 65,732 words passed the lexical analysis, and 46,979 words have at least one analysis generated by the guesser analyzer (Gasser 2010).

In summary, many existing morphological lexicons were constructed from raw text (Sagot 2005). The general requirements for constructing a morphological lexicon from raw text are:
• A representative corpus.
• A generation program or a morphological description of the language.
• A Lexical Markup Framework (LMF) for providing a compatible structure to store the lexical entries, to ensure reusability of the lexicon in many NLP tools.
• A searching facility over the lexical entries (querying the constructed lexicon).
• An evaluation methodology for the morphological lexicon, by computing the coverage of the lexicon, and by measuring the accuracy gained after integrating the lexicon into an NLP application such as a part-of-speech tagger or syntactic parser.

4.1.2 Morphological Lexicons for Arabic

The Buckwalter Arabic morphological analyzer (BAMA) (Buckwalter 2002; Buckwalter 2004) contains three Arabic-English lexicon files: a prefixes file containing 299 entries, a suffixes file containing 618 entries, and a stems file containing 82,185 entries representing 38,600 lemmas; see section 3.2.2. The lexicon component of BAMA is reused in other Arabic NLP tools such as the large-scale lexeme-based Arabic morphological generation system Aragen (Habash 2004), and spell checking lexicons such as Duali [35], Baghdad [36] and Arabic-spell [37].
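A lexicon split into prefix, stem and suffix files naturally supports analysis by segmentation: a word is accepted when it can be cut into three parts that are each listed in the corresponding file. The sketch below shows only that look-up idea. It is not BAMA's code; the tiny tables are hypothetical entries written in a Latin transliteration, and any checks a real analyzer applies on which prefixes, stems and suffixes may combine are not modelled.

```python
# Hypothetical toy tables standing in for the three lexicon files.
PREFIXES = {"", "w", "Al", "wAl"}
STEMS = {"ktAb", "ktb"}
SUFFIXES = {"", "h", "At"}


def segmentations(word, prefixes=PREFIXES, stems=STEMS, suffixes=SUFFIXES):
    """Return every (prefix, stem, suffix) split licensed by the tables."""
    results = []
    for i in range(len(word) + 1):             # end of the prefix
        for j in range(i + 1, len(word) + 1):  # end of the (non-empty) stem
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in prefixes and stem in stems and suffix in suffixes:
                results.append((prefix, stem, suffix))
    return results
```

With these toy tables, `segmentations("wAlktAb")` finds the single split `("wAl", "ktAb", "")`, and a word with no licensed split returns an empty list, which is the case a broad-coverage lexicon is meant to make rarer.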
The AyaSpell38 project aims to develop open-source resources for Arabic NPL including Arabic spell checker. The shortage of existing Arabic spell checkers comes from the lexicon they depend on. A lexicon is developed to support the AyaSpell checker. The lexicon consists of two components: the vocabulary list built by analyzing 5 traditional Arabic lexicons; and the affixes and morphological rules list. Each entry in the vocabulary list has its morphological description associated with it. The vocabulary list contains more than 50,000 entries distributed on more than 10,000 verbs and more than 40,000 nouns, particles and residuals (Zarrouki and Kebdani 2009; Zerrouki and Balla 2009). WordNet is a broad coverage lexical resource which is developed to support many information retrieval applications. The basic idea behind WordNet is that knowledge of words is represented by meanings and the context in which they occur. The desired conceptual information is provided by linking words to appropriate concepts. Concepts in the WordNet are the organizational units. They can be single words, compounds, collocations, idiomatic phrases and phrasal verbs. The foundation of the Global WordNet Association and the Global WordNet project coordinates the production and the linkage of wordnets for all languages of the world including Arabic (Elkateb, Black and Farwell 2006). Arabic WordNet (AWN) is a lexical resource for MSA which is based on the design and the contents of the Princeton WordNet (PWN) for English. The AWN is constructed following the same methods developed for Euro WordNet, which is compatible with other wordnets and focuses on manual encoding of the most complicated and important concepts. The AWN structure consists of four principal structures. First, the items represent conceptual entities including synsets, ontology classes and instances. Second, a word entity represents a word sense. Third, a form entity contains lexical information. 
Fourth, a link connects two items in a relation. The AWN is stored in XML files and in a relational database implemented in MySQL. A large ontology of 1,000 terms and 4,000 definition statements was built to provide the semantic background for the AWN (Elkateb and Black 2001; Black and El-Kateb 2004; Elkateb et al. 2006; Rodríguez et al. 2008).

Arabic Verbnet is a large-coverage verb taxonomy for Arabic, a lexicon of Arabic verbs. Arabic Verbnet provides key information about the syntax and semantics of Arabic verbs using the notion of verb classes, similar to the Verbnet for English. Arabic Verbnet contains verb entries where each entry is a third person masculine singular perfect verb. Each verb entry contains four child nodes: the verb, its root, its verbal noun(s), and its participle(s). It uses 23 thematic roles which are already used in the English Verbnet. It has 173 classes which contain 4,392 verbs and 498 frames. In addition to the information in the four child nodes of each verb entry, the frames provide subcategorization frames and a syntactic and semantic description of each verb. The Arabic Verbnet uses XML format to store its frames (Mousser 2010).

In summary, the surveyed Arabic lexicons are common morphological and linguistic lists, each specific to a particular Arabic NLP application. They are not general purpose and they are small in size. Moreover, all of them deal only with Modern Standard Arabic (MSA). Arabic WordNet and Verbnet are based on models for English and Indo-European languages, rather than on Semitic templatic root-based lexical principles.

[35] Duali Arabic spell-checker http://www.arabeyes.org/project.php?proj=Duali
[36] Baghdad Arabic spell checker http://home.foolab.org/cgi-bin/viewcvs.cgi/projects/baghdad/
[37] Arabic-spell http://sourceforge.net/projects/arabic-spell/
[38] AyaSpell Arabic spell checker http://ayaspell.sourceforge.net/index.php
4.2 Traditional Arabic Lexicons and Lexicography

Traditional Arabic lexicons are not available in computerized lexicographic databases. Moreover, traditional Arabic lexicons are arranged differently from modern English dictionaries. Common English dictionaries list lexical entries (i.e. words in the form of lemmas) alphabetically, each followed by its meaning, whereas Arabic lexicons are mainly arranged with the root as the main lexical entry. Each root is followed by a definition part which may span several pages. The definition part is written as a unit or an article (i.e. an encyclopaedia entry) which defines all the words derived from that root. These derived lexical entries are not arranged or distinguished by special formatting.

A study of a traditional Arabic lexicon called al-qāmūs al-muḥῑṭ “The comprehensive lexicon” showed three major drawbacks of traditional Arabic lexicons. First, they do not represent the development of the language over different periods. Second, there are ambiguities in defining and explaining the lexical meaning of the derived words. Third, the ordering of the derived words is unorganized and lacks reference to the origin of the derivations. Khalil (1998) highlighted the importance of ordering the derivations of each lexical entry to give direct access to the meaning of the derivations, and to show the origin of the Arabic word and its specifications.

Arabic lexicography is one of the original and deep-rooted arts of Arabic literature. The first lexicon constructed was kitāb al-‘ayn ‘al-‘ayn lexicon’ by al-farāhῑdῑ (died in 791). Over the past 1300 years, many different kinds of Arabic lexicons were constructed; these lexicons differ in ordering, size and purpose. Many Arabic linguists and lexicographers have studied the construction and development of these lexicons and the different methodologies used to construct them.
Several traditional Arabic lexicons have been scanned and put online as portable document format (.pdf) files. A few have been keyed in and put online as MS-Word (.doc) or HTML text files. Figures 4.1 and 4.4 show samples of text taken from traditional Arabic lexicons; the target lexical entries are underlined and highlighted in blue. Figure 4.2 shows the human translation of the sample in figure 4.1; the target lexical entries are marked by square brackets. Figure 4.3 is a sample of the Arabic-English lexicon by Edward Lane (Lane 1968), volume 7, pages 117-119; the target lexical entries are underlined. Figure 4.5 shows a sample of the original manuscript of the traditional Arabic lexicon aṣ-ṣiḥāḥ fῑ al-luḡah ‘The Correct Language’.

Figure 4.1 A sample of text from the traditional Arabic lexicons corpus “lisān al-‘arab”, the target lexical entries are underlined and highlighted in blue.

k t b: [Alkitab] the book; is well known. The plural forms are [kutubun] and [kutbun]. [kataba Alshay’] He wrote something. [yaktubuhu] the action of writing something. [katban], [kitaban] and [kitabatan] means the art of writing. And [kattabahu] writing it means draw it up. Abu Al-Najim said: I returned back from Ziyad’s house [after meeting him] and behaved demented, my legs drawn up differently (means walking in a different way). They wrote [tukattibani] on the road the letters of Lam Alif (describing how he was walking crazily and in a different way). He said: I saw in a different version, the word “they wrote” [tikittibani] using the short vowel kasrah on the first letter [taa], as it is used by Bahraa’ [Arab tribe] dialect. They say: [ti’lamuwn] (you know). Then the short vowel kasrah is propagated to the following letter (kaf). Moreover, [Alkitab] the book is a noun. Al-lihyani Al-Azhari definition is: [Alkitab] The book is the name of a collection of what has been written (a collection of written materials or texts). And the book has gerund [Alkitabatu] writing (art of writing) for whoever has a profession, similar to drafting and sewing. And [Alkitabatu]: is copying a book [copying a book in several copies]. It is said: [iktataba] someone subscribed another means; he asked to write him a letter in something. [istaktabahu] He dictated someone something means to write him something. Ibn Sayyedah: [Iktatabahu] is similar to [katabahu]. It is said: [katabahu] write something down means draw up. And [Iktatabahu] writing something down means dictate someone something, which is the same meaning of [Istaktabahu]. [Iktatabahu] registering (masculine), and [Iktatabathu] registering (feminine). In the Qur’an: [Iktatabaha] He registered it, he has dictated it every sunrise and sunset, which means dictating it. It is said: [Iktataba Al-rajul] The man registered, if he registered himself in the Sultan’s office. In Hadith: a man said to him (the prophet): my wife is pilgrimaging (to Mecca), and I have registered [Oktutibtu] in a conquest, which means that I have written my name among the conquerors. And you say: [Aktibny] let me copy this poem, means dictate me the poem. Also, [Alkitab] the book is something which has been written on. And in Hadith: who looks at his brother’s book without permission is as looking to hell. Ibn Al-Atheer said: it is a similarity; which means as he avoids hell, he should avoid doing this. He said: the meaning (of the Hadith) is the punishment by hell will be applied if someone looks at a book without permission. He said: it might be the punishment of visual explorers as the crime is done by sight. Hearing explorer is punished if someone intentionally listened to other people who do not like anyone to listen to them. He said: this Hadith is specific for books of secrets and secure books, whose owners hate anybody to look at these books. It is also said: the Hadith is general; applied to any type of books.

Figure 4.2 A human translation of the sample of text from the traditional Arabic lexicon “lisān al-‘arab”, the target lexical entries are highlighted in blue and square brackets.

Figure 4.3 A sample of the definition of the root ktb from an Arabic-English Lexicon by Edward Lane (Lane 1968), http://www.tyndalearchive.com/TABS/Lane/ , the target lexical entries are underlined.

Figure 4.4 A sample of text from the traditional Arabic lexicon “al-muğrib fῑ tartῑb al-mu‘rib”, the target lexical entries are underlined and highlighted in blue.
Figure 4.5 A sample of a traditional Arabic lexicon aṣ-ṣiḥāḥ fῑ al-luḡah ‘The Correct Language’, the original manuscript.

4.3 Methodologies for Ordering Lexical Entries in the Traditional Arabic Lexicons

Traditional Arabic lexicons follow one of four methodologies for ordering their lexical entries. First, the al-ẖalῑl methodology was developed by al-ẖalῑl bin aḥmad al-farāhῑdῑ (died in 791). Second, the abū ‘ubayd methodology was developed by abū ‘ubayd al-qāsim bin sallām (died in 838). Third, the al-ğawharῑ methodology was developed by ’ismā’ῑl bin ḥammād al-ğawharῑ (died in 1002). Finally, the al-barmakῑ methodology was developed by abū al-ma‘ālῑ moḥammad bin tamῑm al-barmakῑ, who lived in the same period as al-ğawharῑ. al-barmakῑ did not construct a new lexicon; instead, he alphabetically re-arranged al-ğawharῑ’s lexicon aṣ-ṣiḥāḥ fῑ al-luḡah ‘The Correct Language’, adding little information to it.

4.3.1 The al-ẖalῑl Methodology

The al-ẖalῑl methodology was developed by al-ẖalῑl bin aḥmad al-farāhῑdῑ (died in 791). His lexicon, kitāb al-‘ayn “al-‘ayn lexicon”, was the first traditional Arabic lexicon. It lists the lexical entries phonologically according to the places of articulation of phonemes in the mouth and throat, working forwards from the glottal through to the labial regions. He divided the lexicon into books, with one book for each letter. The books were then divided into 4 sections according to the internal structure of the roots: doubled biliteral roots; intact triliteral roots; doubly-defective roots; and quadriliteral and quinqueliteral roots. Many lexicons followed al-ẖalῑl’s methodology with slight changes in ordering. The following traditional Arabic lexicons followed this ordering methodology:
1. kitābu al-‘ayn “al-‘ayn Lexicon” by al-ẖalῑl bin aḥmad al-farāhῑdῑ, died in 175H / 791AD.
2. mu’ğam al-muḥῑṭ fῑ al-luḡah “The Comprehensive Language” by aṣ-ṣāḥib bin ‘abbād, died in 385H / 995AD.
3. al-muḥkam wa al-muḥῑṭ al-’a‘aẓam “The Greatest Verified and Comprehensive Lexicon” by ’ibn sayyidah, abū al-ḥasan bin ’ismā‘ῑl an-naḥawῑ al-laḡawῑ al-’andalusῑ, died in 458H / 1065AD.
4. lisān al-‘arab “Arab tongue” by ğamāl ad-dῑn moḥammed bin manẓūr, died in 711H / 1311AD.
5. mu’ğam tahḍῑb al-luḡah “The Lexicon of Refined Language” by abū manṣūr al-’azharῑ, died in 370H / 980AD.

4.3.2 The abū ‘ubayd Methodology

The abū ‘ubayd methodology was developed by abū ‘ubayd al-qāsim bin sallām (died in 838). The first lexicon constructed following this methodology was al-ḡarῑb al-muṣannaf fῑ al-luḡah “The Irregular Classified Language”. This methodology arranges lexical entries according to their concepts or topics. The lexicon consists of many small books, each of which describes a topic or a concept, such as books describing horses, milk, honey, flies, insects, palms, and human creation. These small books are then collated into one large lexicon, which consists of more than thirty small books. The following traditional Arabic lexicons followed the abū ‘ubayd methodology:
6. al-ḡarῑb al-muṣannaf fῑ al-luḡah “The Irregular Classified Language” by ’abi ‘ubayd al-qāsim bin sallām, died in 223H / 838AD.
7. al-munağğad fῑ al-luḡah “The Decorated Language” by ‘ali bin ḥasan al-hunā’ῑ al-’azdῑ, died in 310H / 922AD.
8. al-muẖaṣṣaṣ fῑ al-luḡah “The Specified Language” by ’ibn sayyidah, abū al-ḥasan bin ’ismā‘ῑl an-naḥawῑ al-laḡawῑ al-’andalusῑ, died in 458H / 1065AD.
4.3.3 The al-ğawharῑ Methodology

The al-ğawharῑ methodology was developed by ’ismā’ῑl bin ḥammād al-ğawharῑ (died in 1002). The first lexicon which followed this methodology is called aṣ-ṣiḥāḥ fῑ al-luḡah ‘The Correct Language’. This methodology orders the lexical entries alphabetically, but the entries are arranged according to the last letter of the word first, and then the first letter. The lexicon was organized into chapters, where each chapter corresponds to the last letter of the word. Each chapter includes sections corresponding to the first letter of the word, then the second letter of triliteral roots, then the third letter of quadriliteral roots, then the fourth letter of quinqueliteral roots. For example, the word baṣaṭ “spread” is found in chapter ṭ, which represents the last letter of the word, and then in section b, which represents its first letter. The following lexicons followed this ordering methodology:
9. aṣ-ṣiḥāḥ fῑ al-luḡah “The correct language” by abū naṣr ’ismā‘ῑl bin ḥammād al-ğawharῑ al-farābῑ, died in 400H / 1009AD.
10. al-‘ibāb az-zāẖir fῑ al-luḡah “The High Flood Water of Language” by al-ḥasan bin muḥammad aṣ-ṣaḡānῑ, died in 650H / 1252AD.
11. tağ al-‘arūs min ğawāhir al-qāmūs “Bridal Crown Jewel of Dictionaries” by az-zubaydῑ, died in 1205H / 1790AD.
12. al-qāmūs al-muḥῑṭ “The Comprehensive Dictionary” by mağd ad-dῑn abū ṭāhir muḥammad bin ya‘qūb al-fayrūz’ābādῑ, died in 817H / 1414AD.

4.3.4 The al-barmakῑ Methodology

The al-barmakῑ methodology was developed by abū al-ma‘ālῑ muḥammad bin tamῑm al-barmakῑ, who lived in the same period as al-ğawharῑ. The al-barmakῑ methodology arranges lexical entries alphabetically, starting from the first root letter. al-barmakῑ did not construct a new lexicon. Rather, he used this ordering methodology to rearrange the lexical entries of aṣ-ṣiḥāḥ fῑ al-luḡah, which al-ğawharῑ had developed and ordered using the al-ğawharῑ methodology. Little information was added to this reordered version of the lexicon. Later, az-zamaẖšarῑ (died in 1143) followed the same methodology and constructed a lexicon called asās al-balāḡah “Fundamentals of Rhetoric”. This methodology became the most widely used ordering methodology for Arabic lexicons. The following lexicons followed this ordering methodology:
13. mu‘ğam al-ğῑm “The jῑm Lexicon” by abū ‘amr aš-šῑbānῑ, died in 206H / 821AD.
14. ğamharat al-luḡah “The Gathering of the Language” by ’ibn durayd, died in 321H / 933AD.
15. mu‘ğam maqāyῑs al-luḡah “The Lexicon of the Standard Language” by ’abῑ al-ḥusayn aḥmad bin fāris bin zakaryyiā, died in 395H / 1004AD.
16. mu‘ğam mā ’ista‘ğam “A Lexicon of Foreign Words” by […]
17. […]
18. […]
19. al-muğrib fῑ tartῑb al-mu‘rib “Irregular Declinable Words” by ’abū al-fatḥ nāṣir ad-dῑn al-muṭrazῑ, died in 610H / 1213AD.
20. muẖtār aṣ-ṣiḥāḥ “The Selected of the Correct Language” by abū bakr ar-rāzῑ, died in 666H / 1267AD.
21. al-muṣbāḥ al-munῑr fῑ ḡarῑb aš-šarḥ al-kabῑr “The Illuminating Light on the Irregularity of the Great Explanations” by aḥmad bin muḥammad ‘alῑ al-fayyūmῑ ṯumma al-ḥamawῑ, abū al-‘abbās, died around 770H / 1368AD.
22. al-mu’ğam al-wasῑṭ “The Intermediary Lexicon” by ibrāhῑm muṣṭafā, aḥmad az-zayyāt, ḥāmid ‘abdul-qādir, muḥammad an-nağğār, published in 1960.
23. mu‘ğam al-’af‘āl al-muta‘adyyah bi ḥarf “The Lexicon of Transitive Verbs” by mūsā bin muḥammad al-malyānῑ al-’aḥmadῑ, published in 1979.
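The difference between the al-ğawharῑ ordering (section 4.3.3) and the al-barmakῑ ordering (section 4.3.4) can be illustrated with two sort keys. This is only a minimal sketch under stated assumptions: roots are written as plain Latin strings with one character per root letter, Python's default character order stands in for the Arabic alphabetical order, and the function names `jawhari_key` and `barmaki_key` are hypothetical.

```python
# Roots are written as plain Latin strings, one character per root letter.
# Python's default character order stands in for the Arabic alphabetical
# order; both choices are assumptions made for this illustration.

def jawhari_key(root):
    """al-ğawharῑ order: chapter by last letter, section by first letter,
    then the remaining middle letters in sequence."""
    return (root[-1], root[0], root[1:-1])


def barmaki_key(root):
    """al-barmakῑ order: alphabetical, starting from the first root letter."""
    return root


roots = ["ktb", "bst", "drs", "ksb", "bhr"]

print(sorted(roots, key=jawhari_key))   # grouped by last letter first
print(sorted(roots, key=barmaki_key))   # plain alphabetical order
```

Under the first key a root such as bst is filed under its last letter and then its first letter, mirroring the chapter ṭ, section b example given for baṣaṭ; under the second key it is simply filed under b.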
4.4 Constructing the SALMA-ABCLexicon

Many existing morphological lexicons were constructed from raw text (Sagot 2005). The general requirements for constructing a morphological lexicon from raw text are: a corpus; a generation program or a morphological description of the language; a Lexical Markup Framework (LMF) that provides a compatible structure for storing the lexical entries; a searching facility over the lexical entries (querying the constructed lexicon); and an evaluation methodology for the lexicon (Russell et al. 1986; Petasis et al. 2001; Tadi and Fulgosi 2003; Sagot 2005; Sagot et al. 2006; Paikens 2007; Nicolas et al. 2008; Erjavec 2010; Sagot 2010).

Broad-coverage language resources which provide prior linguistic knowledge should improve the accuracy and performance of NLP applications. The main aim in constructing a broad-coverage lexical resource is to improve the accuracy of morphological analyzers and part-of-speech taggers for Arabic text. Chapter 3 discussed the shortcomings of the existing stemming algorithms for Arabic text. Constructing a broad-coverage lexical resource to improve the accuracy of Arabic morphological analysis has advantages over developing a sophisticated stemming algorithm. These advantages are:
• A prior-knowledge lexical resource will improve Arabic morphological analysis.
• A lexical resource can be integrated into different stemming algorithms to give prior knowledge about the analyzed words.
• It can enhance the performance of morphological analyzers by reducing complex analysis steps to a simpler look-up procedure.
• The broad-coverage lexical resource can be a standalone resource which can be integrated into different Arabic natural language processing systems, each of which gains the benefits of integration.
• It is easier to update the lexical resource, by adding new content and correcting it, than to update a sophisticated algorithm, which needs specialized developers.
• It can also be used as a teaching resource to assist both teachers and students in the teaching-learning process.

The SALMA-ABCLexicon (Sawalha Atwell Leeds Morphological Analyses – Arabic Broad-Coverage Lexicon) was developed following the general requirements for constructing morphological lexicons from raw text. However, the absence of open-source Arabic corpora and the absence of a generation program led to the use of traditional Arabic lexicons as a corpus. A generation program for Arabic can generate verbs and derived nouns, but its major shortcomings are over-generation and under-generation. Over-generation produces many lexical entries which are correctly structured but are not part of the real vocabulary of the language, while under-generation occurs when the program cannot generate all possible vocabulary of the language. In theory, any morphological generation program for Arabic will suffer from both problems unless it is provided with a comprehensive database that contains all the non-generated vocabulary (i.e. non-inflected words, primitive nouns and non-conjugated verbs) and with comprehensive morphological descriptions of the language encoded within the generation program. Both the database and the morphological descriptions of the language need huge amounts of manual work.

As an alternative, selecting traditional Arabic lexicons as the text corpus for constructing the SALMA-ABCLexicon provides three benefits. First, it gives wide coverage of Arabic vocabulary (derived and non-derived words), most of which appears in the lexicons in different forms as it is defined in the lexical entries. Second, the lexicons span the past 13 centuries (i.e. from 800 to 2000), covering a wide range of both classical and modern Arabic vocabulary and its development.
Third, they provide a basic and comprehensive morphological dataset by mapping words to their roots, especially for hard cases where stemming algorithms and morphological analyzers fail. This morphological dataset can be re-used by different text analytics applications.

This section discusses the construction steps for the SALMA-ABCLexicon, following the general requirements mentioned above for constructing morphological lexicons from raw text. Section 4.4.1 describes the text corpus used to construct the lexicon. Section 4.4.2 discusses the morphological knowledge used to extract the lexical entries and their basic morphological information. Section 4.4.3 describes the process of combining the lexical entries into one large lexical resource. Section 4.4.4 discusses the format of the lexicon. Section 4.4.5 explains the querying of the lexicon and the retrieval of its information.

4.4.1 The Text Corpus

As mentioned above, due to the absence of an open-source representative Arabic corpus and the absence of a generation program, traditional Arabic lexicons were selected as the corpus for building the morphological lexicon. Twenty-three freely available lexicons were collected from different resources on the web. These lexicons are listed in section 4.3. Meshkat Islamic Network[39] (šabakat miškāt al-’islāmiyyah) provides most of these lexicons in machine-readable format as MS Word files or HTML web pages.

Common processing steps were applied to all lexicons. First, all lexicon files were converted from MS Word or HTML web pages into standard text files in Unicode ‘utf-8’ encoding. Second, a statistical analysis computed the word frequency and the vocabulary size for both the vowelized and the non-vowelized text of each lexicon. The complete corpus of 23 lexicon texts contains 14,369,570 words, 2,184,315 vowelized word types and 569,412 non-vowelized word types.
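The vowelized and non-vowelized counts described above can be sketched as follows. This is an illustrative sketch, not the program used for the thesis: it assumes that "vowelization" means the Arabic diacritics U+064B to U+0652 plus the superscript alef U+0670, that a word is a maximal run of Arabic letters and diacritics, and the name `corpus_statistics` is hypothetical.

```python
import re
from collections import Counter

# Arabic short vowels and related diacritics (U+064B..U+0652) plus the
# superscript alef (U+0670). Treating exactly this set as "vowelization"
# is an assumption of this sketch.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# A token is taken to be a maximal run of Arabic letters and diacritics.
TOKEN = re.compile("[\u0621-\u0652\u0670]+")


def corpus_statistics(text):
    """Count words, vowelized word types and non-vowelized word types."""
    tokens = TOKEN.findall(text)
    vowelized_types = Counter(tokens)
    non_vowelized_types = Counter(DIACRITICS.sub("", token) for token in tokens)
    return {
        "words": len(tokens),
        "vowelized_types": len(vowelized_types),
        "non_vowelized_types": len(non_vowelized_types),
    }


# Three tokens that share one consonantal skeleton (k-t-b) but differ in
# their diacritics, written with escapes to avoid rendering ambiguity.
sample = (
    "\u0643\u064E\u062A\u064E\u0628\u064E "
    "\u0643\u064F\u062A\u064F\u0628\u064C "
    "\u0643\u062A\u0628"
)
print(corpus_statistics(sample))
```

Stripping the diacritics collapses several vowelized types onto one non-vowelized type, which is why the non-vowelized vocabulary of the corpus is much smaller than the vowelized one.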
Table 4.1 shows the summary of the statistical analyses of the lexicon texts used to construct the SALMA-ABCLexicon. Section 4.6 discusses the corpus of traditional Arabic lexicons.

Table 4.1 Statistical analysis of the lexicon text used to construct the broad-coverage lexical resource

Number of files                                       247
Size                                                  178.32 MB
Vowelized word analysis       Number of words         14,369,570
                              Number of word types    2,184,315
Non-vowelized word analysis   Number of words         14,369,570
                              Number of word types    569,412

[39] Meshkat Islamic Network http://www.almeshkat.net

4.4.2 Morphological Knowledge Used to Extract the Lexical Entries

Each lexicon was constructed following one of four methodologies for ordering its lexical entries, although most of them used the root as the main lexical entry. Moreover, the 23 lexicons were typed into machine-readable files in different formats, without using any computerized lexicographic representation. These factors add further processing challenges. Therefore, each lexicon was processed separately using specialized programs.

An important preprocessing step converts each lexicon text into a unified format by choosing the most common format for all the root entries in the lexicon. This step was done manually, which involved going through all the text in the lexicon files and reformatting the root entries that did not follow the selected format. The common basic structure of all lexicons is a root-definition structure, where each root entry in the lexicon is followed by the definition part that groups all the derived words and their meanings.

After that, a program was written to extract the roots and the words derived from each root. The tokenizing module in the program identifies the root entries and their definition parts. Then, a bag of words is extracted from the definition text. The bag of words stores word-root pairs, where each word appearing in the definition part is associated with the root of that part.
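The extraction of the bag of words can be sketched as follows. The unified root-entry format itself is not specified here, so this sketch assumes a hypothetical format in which each entry is one line of the form `root: definition text`; the function name `extract_bag_of_words` and the transliterated sample lines are likewise illustrative only.

```python
def extract_bag_of_words(lines, separator=":"):
    """Return (word, root) pairs from lines of the assumed form
    'root: definition text'. Every word of the definition part is paired
    with the root of that part, whether or not it is derived from it."""
    pairs = []
    for line in lines:
        root, found, definition = line.partition(separator)
        if not found:
            continue  # not a root entry in the assumed format
        root = root.strip()
        for word in definition.split():
            pairs.append((word, root))
    return pairs


# Transliterated toy entries; real input would be Arabic lexicon text.
entries = [
    "ktb: kitab kataba qala",
    "drs: darasa",
]
print(extract_bag_of_words(entries))
```

At this stage no filtering takes place: a word such as qala is paired with ktb simply because it occurs in that definition part, and such pairs have to be removed by a later verification step.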
The definition parts of the roots are written as encyclopaedia articles that define each root and the lexical entries derived from it. The writing style of the definition part connects the lexical entries and their meanings together without following any structure or ordering methodology. In this style, the lexical entries appear conjoined with all kinds of clitics and affixes. Clitics, such as conjunctions and pronouns, are used to connect the definitions of the lexical entries together as one unit. Although the use of clitics and affixes adds a greater challenge to the construction of the broad-coverage lexical resource, it substitutes and compensates for the generation program, because the words derived from a given root (i.e. lexical entry) appear in different shapes and formats. Moreover, the use of different lexicons, which share most of their lexical entries but differ in defining them, increases the potential for gathering a wider range of forms and shapes of the same derived words. Finally, because the definition part of the lexical entry is written as natural language text, the different forms of a derived word found there are valid parts of the language vocabulary, and over-generated words are excluded; see figure 4.7. Non-derived words related to certain root lexical entries are also gathered and included in the lexicon.

Many words appearing in the definition part are not relevant to the root associated with that definition. Such words are nevertheless found in the bag of words of that root. A normalization analysis verifies the word-root pairs by applying linguistic knowledge that governs the derivation of words from their roots. This knowledge is expressed as two conditions, which can be described simply as follows:
• Condition 1 (check consonants): If all consonant letters forming the root appear in the analyzed word, then check condition 2.
• Condition 2 (consonants order): If all root letters appear in the word in the same order as in the root, then the word-root combination is a candidate analysis and can be inserted into the lexicon.

In the first condition (check consonants), we classified Arabic letters into four groups: letters that appear in clitics or affixes; vowels; hamzah; and letters that might be changed in derivation due to substitution (’iqlāb) to simplify the pronunciation of the word. Then, a procedure is applied to verify each letter of the word. Another procedure is applied to match the order of the letters of the analyzed word against that of its root. The analyses that meet the two conditions are candidate analyses and are stored in the lexicon database. The information about clitics, affixes and stem is also stored with the word-root combination. Figure 4.6 shows the process of selecting word-root pairs. Table 4.2 shows the number of words and the percentage of words extracted from the original text of the lexicons.
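The two conditions can be sketched in a few lines. This is a simplified illustration rather than the procedure actually implemented: words and roots are written in a Buckwalter-style Latin transliteration, the equivalence classes `WEAK_LETTERS` and `HAMZA_FORMS` are hypothetical stand-ins for the letter groups described above, and affix and clitic letters are not modelled separately.

```python
from collections import Counter

# Hypothetical equivalence classes, in a Buckwalter-style transliteration.
# Letters inside one class are treated as interchangeable so that
# substitutions in derivation do not block a match (an assumption of this
# sketch, not the thesis's actual letter classification).
WEAK_LETTERS = set("Awy")      # alif, waw, ya
HAMZA_FORMS = set("'><&}|")    # hamza and its carrier forms


def normalize(letter):
    if letter in WEAK_LETTERS:
        return "WEAK"
    if letter in HAMZA_FORMS:
        return "HAMZA"
    return letter


def normalized(text):
    return [normalize(letter) for letter in text]


def contains_root_letters(word, root):
    """Condition 1: every root letter occurs in the word."""
    available = Counter(normalized(word))
    required = Counter(normalized(root))
    return all(available[letter] >= count for letter, count in required.items())


def root_letters_in_order(word, root):
    """Condition 2: the root letters occur in the word in root order."""
    remaining = iter(normalized(word))
    # 'letter in remaining' consumes the iterator up to the first match,
    # so each root letter must be found after the previous one.
    return all(letter in remaining for letter in normalized(root))


def is_candidate(word, root):
    return contains_root_letters(word, root) and root_letters_in_order(word, root)


print(is_candidate("mktwb", "ktb"))   # root letters present and in order
print(is_candidate("btk", "ktb"))     # root letters present, wrong order
```

Strictly, the in-order test subsumes the membership test; both functions are kept here only to mirror the two-step description.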
[Figure 4.6 panels: “Bag of words of the root k-t-b ‘wrote’” and “Selected word-root pairs that satisfy the 2 linguistic conditions”]

Figure 4.6 Using linguistic knowledge to select word-root pairs from traditional Arabic lexicons. The selected word-root pairs are underlined and highlighted in blue.

Table 4.2 Statistics of the traditional Arabic lexicons and morphological databases used to construct the SALMA-ABCLexicon

#   Lexicon name                                   Word types   Words extracted   (%)      Roots extracted
1   tağ al-‘arūs min ğawāhir al-qāmūs              831,504      474,351           57.05%   11,101
2   lisān al-‘arab                                 507,860      274,305           54.01%   9,355
3   mu’ğam al-muḥῑṭ fῑ al-luḡah                    168,870      66,763            39.54%   6,411
4   kitābu al-‘ayn                                 141,098      54,970            38.96%   5,826
5   al-mu’ğam al-wasῑṭ                             112,164      45,614            40.67%   6,489
6   al-muṣbāḥ al-munῑr fῑ ḡarῑb aš-šarḥ al-kabῑr   61,422       29,742            48.42%   2,947
7   muẖtār aṣ-ṣiḥāḥ                                40,295       17,636            43.77%   3,420
8   al-muğrib fῑ tartῑb al-mu‘rib                  39,930       13,798            34.56%   2,322
9   Arabic WordNet                                 -            16,998            -        2,589
10  Buckwalter’s Lexicon                           -            82,158            -        -

4.4.3 Combining the Processed Lexicons into the SALMA-ABCLexicon

After manually converting each lexicon text into a unified format by choosing the most common format for all the root entries in the lexicon, information such as roots, words and meanings is automatically extracted using specialized programs. The results are stored in separate dictionary files which include roots, words, and meanings. A combination algorithm then merges the information from the separate lexicons into one large broad-coverage lexical resource, the SALMA-ABCLexicon.

The algorithm starts by selecting a large lexicon, lisān al-‘arab ‘Arab tongue’, as the seed of the SALMA-ABCLexicon. Then, the other lexicons are combined one by one. Figure 4.7 shows the first 60 lexical entries of the root k-t-b ‘wrote’ stored in the SALMA-ABCLexicon. After combining each lexicon, the percentage of records added to the SALMA-ABCLexicon is computed. The percentage starts at 100% for the seed lexicon and decreases during the combination process.
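The combination step can be sketched as a running set union with a per-lexicon report. This is a minimal sketch under stated assumptions: each lexicon is represented as a set of (word, root) records, a record counts as added when it is not yet in the combined resource, and the name `combine_lexicons` and the toy data are hypothetical.

```python
def combine_lexicons(lexicons):
    """Combine lexicons one by one, starting from the seed (first item).

    lexicons: list of (name, set of (word, root) records).
    Returns the combined set and, for each lexicon, a tuple of
    (name, records offered, records added, percentage added).
    """
    combined = set()
    report = []
    for name, records in lexicons:
        added = records - combined          # records not seen so far
        combined |= added
        percentage = 100.0 * len(added) / len(records) if records else 0.0
        report.append((name, len(records), len(added), round(percentage, 2)))
    return combined, report


# Toy data in transliteration; the seed contributes all of its records.
seed = {("kitab", "ktb"), ("kataba", "ktb")}
second = {("kataba", "ktb"), ("kitab", "ktb"), ("maktab", "ktb"), ("darasa", "drs")}

combined, report = combine_lexicons([("seed", seed), ("second", second)])
for row in report:
    print(row)
```

Because later lexicons overlap with what has already been collected, the percentage of newly added records can be read as a measure of how much each further lexicon still contributes.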
The percentage tells us when the combination process should stop, and which lexicons are better suited to constructing the SALMA-ABCLexicon. Table 4.3 shows the number of records extracted from the 7 analyzed lexicons, and the number and percentage of records combined to form the SALMA-ABCLexicon.

The SALMA-ABCLexicon contains 2,774,866 word-root pairs, which represent 509,506 different words, corresponding to 261,125 different non-vowelized words. It contains 12 different biliteral roots; 8,585 different triliteral roots; 4,038 different quadriliteral roots; 63 different quinqueliteral roots; and 31 different sexiliteral roots. The word types of the lexicon are distributed as follows: 117 word types of biliteral roots; 483,356 word types of triliteral roots; 30,873 word types of quadriliteral roots; 615 word types of quinqueliteral roots; and 335 word types of sexiliteral roots.

Table 4.3 Number of records extracted from 7 analyzed lexicons, and the number and the percentage of records combined to the SALMA-ABCLexicon.

#   Lexicon                              Word types [B]   Records inserted [A]   (A/B)%    (A/C)%
1   lisān al-‘rab                        207,992          207,992                100.00%   47.80%
2   mu’ğam al-muḥῑṭ fῑ al-luḡah          74,507           61,113                 82.02%    14.04%
3   tağ al-‘arūs min ğawāhir al-qāmūs    128,119          95,415                 74.47%    21.93%
4   muẖtār aṣ-ṣiḥāḥ                      19,540           16,573                 84.82%    3.81%
5   al-muğrib fῑ tartῑb al-mu‘rib        12,396           9,805                  79.10%    2.25%
6   kitābu al-‘ayn                       30,292           18,878                 62.32%    4.34%
7   al-mu’ğam al-wasῑṭ                   36,660           25,364                 69.19%    5.83%
    Totals                               509,506          435,140 [C]            85.40%    100.00%

[Figure 4.7 content, transliterations only: ’aktabahu, ’aktaba, ’aktabtu, ’aktibnῑ, ’iktāban, ’istaktabahu, ’istaktabahu, ’istaktabahā, ’iktataba, ’iktataba, ’iktatabahu, ’iktatabahā, ’uktub, ’uktutibtu, ’iktitābuk, ’iktitābuka, al-’iktitābu, at-takātubu, al-kātib, al-kātibu, al-kitāb, al-kitābat, al-kitābata, al-kitābat, al-katātῑb, al-kitbat, al-katῑbat, wa katῑbat, al-katā’iba, al-katā’ibu, al-katῑbata, al-katā’iba, al-katabat, al-katbu, al-katbi, al-kutabu, al-kutaybatu, al-kuttāba, al-kuttābi, al-kutbat, al-kutbatu, al-kutbatu, al-kitāb, al-kitābatu, al-kitāba, al-kitābatu, al-kitābu, al-kitābi, al-mukātib, al-mukātibat, al-maktab, al-maktabat, al-maktūbat, al-kuttābu, al-kitāba, al-kitābatu, al-kitābati, al-maktabu, al-maktūbatu, ’istaktaba]

Figure 4.7 The first 60 lexical entries of the root k-t-b ‘wrote’ stored in the SALMA-ABCLexicon

4.4.4 Format of the SALMA-ABCLexicon

Modern English dictionaries are stored using computerized lexicographic databases. The most widely accepted lexicographic database representation is lexical text markup using SGML (Standard Generalised Markup Language) or SGML-derived languages such as XML. Database Management Systems (DBMS) can also be used, such as relational databases, object-oriented DBMS with inheritance mechanisms, and hybrid object-oriented/relational databases (Eynde and Gibbon 2000). The Russell, Pulman et al. (1986) English morphological dictionary is stored as a sequence of entries, each in the form of a Lisp s-expression. MULTEXT, MULTEXT-East and CML are stored in tab separated column files (Erjavec 2010). The SKEL lexicon is organized as a fixed number of pages, where each page contains a set of morphological entries (Petasis et al. 2001). The Latvian lexicon is stored in XML files (Paikens 2007). Lefff and the Slovak lexicons use the Alexina framework (Sagot 2005; Sagot et al. 2006; Nicolas et al. 2008; Sagot 2010). Buckwalter’s lexicon is stored as a relational database (Maamouri and Bies 2004; Maamouri et al. 2004).

Of these disparate formats, the SALMA-ABCLexicon is stored as XML (Extensible Markup Language) files, as a relational database and as tab separated column files.
The three formats are used to ensure wider re-use of the lexicon in different text analytics applications for Arabic. Figure 4.8 shows the XML and tab separated column files. Figure 4.9 shows the entity relationship diagram of the SALMA-ABCLexicon.

[Figure 4.8 content: an XML listing of lexical entries and a two-column tab separated listing with the columns Word and Root]

Figure 4.8 XML and tab separated column files formats of the SALMA-ABCLexicon

Figure 4.9 The entity relationship diagram of the SALMA-ABCLexicon

The first format uses XML to store the lexical entries of the SALMA-ABCLexicon. Each lexical entry has three pieces of information: Root, Word and Count. The Count is the number of times the word-root pair appeared in the text of the lexicons, and serves as a verification criterion for the lexical entries. The second format uses a tab-separated column file where the first column represents the word and the second column represents the root. The last format uses a relational database to store the SALMA-ABCLexicon. The lexicon_words table represents the combined lexicon and stores the Root, the Word and the Count. SQLite3 [40] was used to store and manage the lexicon database tables. SQLite is an open-source embedded SQL database engine which does not have a separate server process. SQLite reads and writes directly to ordinary disk files (i.e. a complete database is contained in a single disk file), which makes it a suitable choice for distributing the lexicon database file as a downloadable morphological database for Arabic.

4.4.5 Retrieval of the Lexical Entries

The lexicon has a searching facility that looks up a given lexical entry in the lexicon and returns a Python object of type LexiconEntry. The LexiconEntry object encapsulates the word and its root as a unit of information; see figure 4.10.
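A lookup of this kind against the relational format could be sketched as follows. Only the table name lexicon_words and the fields Root, Word and Count come from the description above; the column types, the primary key, the function names build_lexicon_db and words_for_root, and the toy records are assumptions made for illustration. The sketch uses Python's standard sqlite3 module.

```python
import sqlite3

# Assumed schema: the text names the table and its three fields only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS lexicon_words (
    root    TEXT    NOT NULL,
    word    TEXT    NOT NULL,
    "count" INTEGER NOT NULL,
    PRIMARY KEY (root, word)
)
"""


def build_lexicon_db(records, path=":memory:"):
    """Create the lexicon table and load (root, word, count) records."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.executemany(
        'INSERT INTO lexicon_words (root, word, "count") VALUES (?, ?, ?)',
        records,
    )
    conn.commit()
    return conn


def words_for_root(conn, root, min_count=1):
    """Return (word, count) rows for a root, optionally filtered by count."""
    cursor = conn.execute(
        'SELECT word, "count" FROM lexicon_words '
        'WHERE root = ? AND "count" >= ? ORDER BY word',
        (root, min_count),
    )
    return cursor.fetchall()
```

Passing min_count=2 restricts the result to word-root pairs that were seen more than once, which is one way the Count field could be used as a verification filter. Because the whole database lives in one file, passing a file path instead of ":memory:" would yield a single distributable file.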
A specialized interface is provided to enable the morphological analyzer to communicate with the lexicon file; see section 8.3.2. This communication allows the morphological analyzer to retrieve the root(s) of the analyzed words. The constructLexicon function reads the tab separated column file and stores the lexicon in a dictionary data structure where each key is a non-vowelized word (a string) and each value is a list of LexiconEntry objects. The dictionary has the format Lexicon = {nv_word: [LexiconEntry, ...], ...}. The Lexicon class interface represents the actual lexicon data and the communication facility between the lexicon and the morphological analyzer. isLexiconEntry checks whether the passed non-vowelized Arabic word is found in the lexicon, and getLexiconEntry returns the list of LexiconEntry objects for a non-vowelized word that is found. Figure 4.10 shows the interface of the lexicon Python classes and the lexicon construction method; the implementation of the class methods is not included.

[40] SQLite: http://www.sqlite.org/

    # ArabicWord is defined elsewhere in the SALMA software and is not shown here.

    class LexiconEntry(object):

        def __init__(self, word, root):
            self.word = ArabicWord(word)
            self.root = ArabicWord(root)

        def __str__(self):
            pass  # implementation not included

        def printLexEntry(self):
            pass  # implementation not included


    def constructLexicon():
        '''This procedure reads the lexicon file and constructs the
        lexicon dictionary of the following format
        {nv_word: [LexiconEntry, ...], ...}'''
        lexicon = {}
        # reading of the lexicon file is not included
        return lexicon


    class Lexicon(object):
        '''Lexicon class constructs the lexicon dictionary'''
        LexDict = constructLexicon()

        @classmethod
        def printLexicon(cls):
            pass  # implementation not included

        @classmethod
        def isLexiconEntry(cls, nv_word):
            # return True or False
            return nv_word in cls.LexDict

        @classmethod
        def getLexiconEntry(cls, nv_word):
            return cls.LexDict[nv_word]

Figure 4.10 Lexicon Python Classes interface – implementation of the methods is not included

A web interface [41] was developed to allow users to access the contents of the lexicon and to search for a given root.
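The dictionary construction described for constructLexicon can be illustrated with a minimal sketch. It is an assumption-laden illustration rather than the thesis code: the namedtuple stands in for the LexiconEntry class (ArabicWord is not reproduced), the set of diacritic marks removed to obtain the non-vowelized key (U+064B to U+0652 and U+0670) is our guess, and the names strip_vowels and construct_lexicon are hypothetical.

```python
import re
from collections import defaultdict, namedtuple

# Stand-in for the LexiconEntry class of figure 4.10.
LexiconEntry = namedtuple("LexiconEntry", ["word", "root"])

# Assumed set of short-vowel and related marks to remove.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_vowels(word):
    """Return the non-vowelized form of an Arabic word."""
    return _DIACRITICS.sub("", word)


def construct_lexicon(lines):
    """Build {nv_word: [LexiconEntry, ...]} from tab-separated lines.

    Each line holds the word in the first column and the root in the
    second column, as in the tab-separated format of the lexicon.
    """
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, root = fields[0], fields[1]
        lexicon[strip_vowels(word)].append(LexiconEntry(word, root))
    return dict(lexicon)
```

Any iterable of lines can be passed, for example an open file object. Two vowelized words that share the same consonantal skeleton end up in the same list, which is why each dictionary value is a list rather than a single entry.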
The web interface searches the lexicon’s relational database tables for the entered root and displays the definition parts from the analyzed lexicons. Figure 4.11 shows the web interface of the 7 analyzed traditional Arabic lexicons.

Figure 4.11 Web interface for searching the traditional Arabic lexicons

[41] A web interface for searching the traditional Arabic lexicons for a certain root: http://www.comp.leeds.ac.uk/cgi-bin/scmss/arabic_roots.py

4.5 Evaluation of the SALMA-ABCLexicon

The SALMA-ABCLexicon was evaluated by computing the coverage of the lexicon on different types of text corpora: the Qur’an; the Arabic Internet Corpus [42]; and the Corpus of Contemporary Arabic (CCA). Two experiments were carried out to compute the coverage of the SALMA-ABCLexicon. The first used exact match, where each non-vowelized word in the test corpora is searched for in the lexicon. The results showed that the coverage of the three corpora lies between about 65.4% and 67.5%. The highest coverage, 67.53%, was achieved on the Qur’an. The coverage of the Arabic Internet Corpus and the CCA was 65.58% and 65.44% respectively. Table 4.4 and figure 4.12 show the coverage percentage of the SALMA-ABCLexicon using exact match. Table 4.4 also shows the number of tokens and words in each corpus. Some tokens are not Arabic words but numbers, dates, currency symbols, punctuation, HTML or XML tags and English words. Only Arabic words were selected to compute the coverage of the SALMA-ABCLexicon.

Table 4.4 The coverage of the lexicon using exact word-match method

Corpus     Tokens      Words     Covered words   Coverage %
Qur’an     77,800      77,799    52,536          67.53%
CCA        684,726     594,664   389,133         65.44%
Internet   1,128,114   833,916   546,880         65.58%

Figure 4.12 The coverage of the SALMA-ABCLexicon using exact match method

An Arabic word in any text may appear with many different forms of clitics attached to it, which complicates matching the word against the lexical entries and decreases the coverage.
The second experiment computes the coverage of the SALMA-ABCLexicon through an application that depends on it. The lemmatizer (Sawalha and Atwell 2011a) for Arabic text is used to process large-scale real data: the Arabic Internet Corpus, which consists of 176 million words of Arabic collected from web pages. The lemmatizer depends on the SALMA-ABCLexicon to extract the root and generate the lemma of the word. Each word is tokenized into different forms consisting of proclitics, stem and enclitics, and then each stem is searched for in the lexicon. If the stem is found in the lexicon, then the root and the vowelized stems stored in the SALMA-ABCLexicon are retrieved. More details about the lemmatizer are given in chapters 8 and 10.

[42] Leeds collection of Internet corpora, Arabic Internet Corpus: http://corpus.leeds.ac.uk/internet.html

When a correct analysis is retrieved from the lexicon, it is counted as a valid lexicon reference. The coverage of the SALMA-ABCLexicon is computed as the percentage of valid lexicon references relative to the number of words in the test sample. The lemmatizer uses three other linguistic lists: a list of function words (stop words) which have a fixed syntactic analysis in any context (Diwan 2004); a named entities list (Benajiba, Diab and Rosso 2008); and a list of broken plurals [43] (Elghamry 2010). The coverage of the SALMA-ABCLexicon was computed once with these lists included (i.e. the function words list, the named entities list and the broken plurals list), and once without them. Tables 4.5 and 4.6 show the coverage percentage of the lexicon computed using the lemmatizer program. Figure 4.13 shows a summary of the coverage of the SALMA-ABCLexicon using the lemmatizer.
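Both coverage computations follow the same pattern, which the following minimal sketch tries to capture. It is an illustration under stated assumptions, not the evaluation code used in the thesis: the Arabic-word filter (a token consisting only of characters in the range U+0621 to U+0652) is our approximation, and the segment parameter is a hypothetical hook standing in for the lemmatizer's clitic tokenization. With the default segment, the function performs exact matching.

```python
import re

# Assumed filter: Arabic letters plus short-vowel marks only.
_ARABIC_WORD = re.compile("^[\u0621-\u0652]+$")


def is_arabic_word(token):
    """True if the token consists only of Arabic letters and vowel marks."""
    return _ARABIC_WORD.match(token) is not None


def coverage(tokens, lexicon, segment=lambda word: [word], is_word=is_arabic_word):
    """Compute lexicon coverage over a token list.

    tokens  -- all corpus tokens (words, numbers, punctuation, ...)
    lexicon -- a set (or dict) of non-vowelized entries
    segment -- returns candidate stems for a word; the default keeps the
               word unchanged, which gives exact-match coverage
    is_word -- predicate selecting the tokens that count as words
    """
    words = [token for token in tokens if is_word(token)]
    covered = sum(
        1 for word in words if any(stem in lexicon for stem in segment(word))
    )
    percent = round(100.0 * covered / len(words), 2) if words else 0.0
    return {
        "tokens": len(tokens),
        "words": len(words),
        "covered": covered,
        "coverage": percent,
    }
```

As a check of the arithmetic, the Qur’an row of table 4.4 gives 52,536 covered words out of 77,799 words, which is 67.53%. Supplying a segment function that also proposes stems with clitics removed can only keep or raise the covered count, which matches the higher figures obtained with the lemmatizer.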
Table 4.5 Coverage including function words

Corpus     Tokens      Words     Covered words   Coverage %
Qur’an     77,804      77,803    64,065          82.34%
CCA        685,161     595,099   507,943         85.35%
Internet   1,128,624   834,426   708,101         84.86%

Table 4.6 Coverage excluding function words

Corpus     Tokens      Words     Covered words   Coverage %
Qur’an     77,804      54,004    42,532          78.76%
CCA        685,161     411,482   338,790         82.33%
Internet   1,128,624   576,407   476,190         82.61%

Figure 4.13 Coverage percentage of the SALMA-ABCLexicon using the lemmatizer

[43] Broken plural list source: http://sites.google.com/site/elghamryk/arabiclanguageresources

The coverage is about 85% of the words when function words are included, and about 82% when they are excluded. The CCA and the Arabic Internet Corpus achieved similar results when tested using the lemmatizer program with function words included: their coverage was 85.35% and 84.86% respectively. A coverage of 82.34% was achieved when analysing the words of the Qur’an. The second part of the experiment excluded the function words, and similar results were achieved. The Arabic Internet Corpus and the CCA scored 82.61% and 82.33% respectively. The coverage for the Qur’an text was 78.76%.

Common words which are not covered by the SALMA-ABCLexicon include: function words (stop words); new Arabic terms; relative nouns; and borrowed words (Arabized words). Function words (stop words) such as ḏālika “that”; wa-’ilā “and to”; ’innahum “they are”; and allatī “which” can easily be added to the lexicon along with their syntactic and morphological analysis by collecting them from traditional Arabic grammar books such as (Diwan 2004). New Arabic terms include dardašat “chat” and ’unqur “click”.

[Figure content, examples of words not covered by the lexicon: ḏālika “That”; as-samāwāti “Skies”; ’innahum “They are”; billāhi “Swear to God”; ʿanhum “After them”; bilḥaqqi “By the right”; fa’ulā’ika “And those”; fabi’ayyi “In what”; wa-’ilā “And to”; fasawfa “It will”; allatī “which”; al-muttaḥida “United”; ad-duktūr “Doctor”; as-siyāḥiyya “Tourism”; al-ḡarbiyya “Western”]

Figure 4.16 XML structure of The Corpus of Traditional Arabic Lexicons

4.7 Discussion of the Results, Limitations and Improvement

The SALMA-ABCLexicon contains a large number of entries representing a wide coverage of Arabic words, word types and roots. The evaluation showed that the lexicon has wide coverage: about 85% of the words of the test corpora have a valid reference to the lexicon entries. Although the traditional Arabic lexicons from which the SALMA-ABCLexicon has been derived span 13 centuries, 15% of the words of the test corpora are not captured. The latest analyzed Arabic lexicon is al-mu‘ğam al-wasῑṭ, which appeared in the 1960s; so, new vocabulary items added to Arabic in the past 50 years are not included in the lexicon. Moreover, the use of borrowed words from foreign languages which do not have a proper translation in Arabic, but are written using Arabic letters (transliterated), has increased due to technological advances. Advances in technology and communication mean that new products and their names have entered Arab countries, where these products keep their original names, which have been widely used and have become part of contemporary Arabic vocabulary. Moreover, the use of dialectal Arabic has increased in the written language due to open systems such as chat rooms, blogs and forums, which allow people to write text on the web without restrictions and where they use dialectal words quite frequently.

The lexicon did not undergo any manual correction, due to limited funding for the correction process and a lack of volunteers to correct the lexicon. However, part of the lexicon was verified by counting how many times each word-root pair appears in the analyzed traditional Arabic lexicons.
976,427 word-root pairs, representing 35.19% of the lexicon’s word-root pairs, scored a count of 2 or more. This means that these word-root pairs appeared in different lexicons and satisfied the linguistic knowledge of the two extraction conditions. Therefore, these word-root pairs have a high potential to be valid and correct.

This is the first version of the SALMA-ABCLexicon. It can be extended to include the full morphological analyses of the lexical entries and other useful information that will enhance the accuracy of NLP applications. Special linguistic lists such as compounds, collocations, idiomatic phrases, phrasal verbs and named entities can be added to extend the lexicon. Moreover, morphological lists such as broken plurals, intransitive and transitive verbs, rational and irrational words and primitive nouns can be another extension to the lexicon. Chapter 8 will discuss the extension of the SALMA-ABCLexicon by adding special linguistic and morphological lists to enhance the guessing of the morphological features of the words by the developed morphological analyzer. The SALMA-ABCLexicon can also be extended by adding modern and dialect vocabulary from the Corpus of Contemporary Arabic and the Arabic Internet Corpus. However, these corpora can only extend the vocabulary; they do not provide a root for each word. Manual correction of the word-root pairs can be done in the future to make the SALMA-ABCLexicon an authenticated resource which can serve as a wide-coverage gold standard against which stemming algorithms are evaluated. The SALMA-ABCLexicon is an open-source lexicon. There is also an online access method to its contents and searching facilities [44].

[44] SALMA-ABCLexicon: http://www.comp.leeds.ac.uk/sawalha/SALMA-ABCLexicon.html

4.8 Chapter Summary

This chapter showed the process of constructing the SALMA-ABCLexicon to be used in Arabic text analytics applications such as lemmatizers, morphological analyzers and part-of-speech taggers.
The motivations for constructing the SALMA-ABCLexicon are: the poor results achieved by comparing the outputs of existing morphological analyzers and stemmers discussed in chapter 3; the benefits gained by developing a morphological resource over developing a sophisticated stemming algorithm; the ability to reuse the SALMA-ABCLexicon in different Arabic text analytics applications; and the use of the text to construct the Corpus of Traditional Arabic Lexicons. The chapter started by surveying morphological lexicons especially for Arabic and morphologically rich languages (mainly east European languages). The survey focused on the language of the lexicon, the construction methodology, the size and the evaluation of the lexicons. This was followed by the study of traditional Arabic lexicons focusing on the arrangement methodologies and the challenges and drawbacks of these lexicons. The focus of the survey was to investigate the agreed standard requirements for constructing morphological lexicons from raw text. The development of constructing the SALMA-ABCLexicon followed the agreed standard for constructing a morphological lexicon from raw text. However, the absence of a large open-source representative Arabic corpus, the absence of an open-source generation programme and the generation programme problems of over-generation and under-generation, directed the selection of the raw text corpus to be the text of the traditional Arabic lexicons to substitute for the corpus and the generation program requirements. The major advantages of using the traditional Arabic lexicons text as a corpus are: the corpus contains a large number of words and word types and the possibility of finding the different forms of the derived words of a given root. The SALMA-ABCLexicon is constructed by combining extracted information from disparate lexical resource formats and merging Arabic lexicons. 
The processing steps in constructing the SALMA-ABCLexicon are as follows. First, the lexicon texts are analyzed separately by manually converting each lexicon text into a unified format, choosing the most common format for all root entries. Then, for each lexicon, a specialized program extracts the root and the words derived from that root, depending on linguistic knowledge that governs the derivation of words from their roots. Second, a combination algorithm merges the information extracted in the previous step into one large broad-coverage lexical resource, the SALMA-ABCLexicon.

The SALMA-ABCLexicon was evaluated by computing its coverage using two methods. The first method computed the coverage by matching the words of the test corpora to the words in the lexicon, which scored about 67%. The second method used a lemmatizer program to compute the coverage, and scored about 82%.

The SALMA-ABCLexicon contains 2,781,796 vowelized word-root pairs which represent 509,506 different non-vowelized words. The lexicon is stored in three different formats: tab-separated column files; XML files; and a relational database. It is also provided with access and searching facilities and a web interface that supports searching for a certain root and retrieving the original root definitions from the analyzed traditional Arabic lexicons. The different formats and the access and search facilities will increase the reusability of the lexicon in different Arabic text analytics applications. The SALMA-ABCLexicon is an open-source morphological resource.

The Corpus of Traditional Arabic Lexicons is a special corpus which is constructed from the text of 23 traditional Arabic lexicons. The corpus contains 14,369,570 words and 2,184,315 word types. The corpus is stored using three formats: text files encoded in Unicode UTF-8; XML files; and a relational database. The corpus is an open-source resource for Arabic.
Chapter 5
Survey of Arabic Morphosyntactic Tag Sets and Standards; Background to Designing the SALMA Tag Set

This chapter is based on the following sections of published papers: Sections 2, 3, 4, and 5 are based on sections 1.3, 1.4, 2 and 3 from (Sawalha and Atwell, under review).

Chapter Summary

A range of existing Arabic part-of-speech tag sets are illustrated and compared, and generic design criteria for corpus part-of-speech tag sets are reviewed in this chapter. Eight existing morphosyntactic annotation schemes for Arabic are compared in terms of the purpose of design, tag set characteristics, tag set size, and their applications. The main characteristics of the SALMA – Tag Set are to be: general purpose; reusable; and adhering to standards. The SALMA – Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between, and reuse of, Arabic taggers and tagged corpora. Sophisticated morphological and syntactic knowledge was extracted from traditional Arabic grammar books, then classified and used as a standard for the design of the SALMA – Tag Set. The tag set design criteria proposed by Atwell (2008) were applied, and design decisions were investigated to handle each design dimension.

5.1 Introduction

The prerequisite for part-of-speech annotation of corpora is a previously defined part-of-speech annotation scheme (Hardie 2004). The annotation scheme describes the morphosyntactic categories and enables annotators (humans or computers) to label the corpus words by giving each word a label from the list of morphosyntactic categories according to its context; this list is called a tag set. Since the development of the Brown Corpus in 1963-1964, tag sets for English have evolved. The Brown Corpus tagset has 87 tags. A smaller tagset for English is the 45-tag Penn Treebank tagset used to tag the Penn Treebank.
A medium-sized tagset for English is the 61-tag C5 tagset used by the Lancaster UCREL project’s CLAWS (the Constituent Likelihood Automatic Word-tagging System) to tag the British National Corpus (BNC). The current standard tagset for CLAWS is the 164-tag C7 tagset (Jurafsky and Martin 2008).

The AMALGAM [45] (Automatic Mapping Among Lexico-Grammatical Annotation Models) multi-tagged corpus is POS-tagged according to a range of rival English corpus tagging schemes. These tagging schemes include: Brown corpus; ICE (International Corpus of English); LLC (London-Lund Corpus); LOB (Lancaster-Oslo/Bergen Corpus); PARTS (i.e. the tag set used to tag the Spoken Corpus Recordings In British English, SCRIBE); PoW (Polytechnic of Wales corpus); SEC (Spoken English Corpus); and UPenn (University of Pennsylvania corpus). Figure 5.1 shows an example of a sentence from the AMALGAM multi-tagged corpus illustrating the 8 tagging schemes used to tag the same sentence (Atwell 2007; Atwell 2008).

Word      Brown   ICE              LLC    LOB   PARTS   PoW   SEC   UPenn
select    VB      V(montr,imp)     VA+0   VB    adj     M     VB    VB
the       AT      ART(def)         TA     ATI   art     DD    ATI   DT
text      NN      N(com,sing)      NC     NN    noun    H     NN    NN
you       PPSS    PRON(pers)       RC     PP2   pron    HP    PP2   PRP
want      VB      V(montr,pres)    VA+0   VB    verb    M     VB    VBP
to        TO      PRTCL(to)        PD     TO    verb    I     TO    TO
protect   VB      V(montr,infin)   VA+0   VB    verb    M     VB    VB
.         .       PUNC(per)        .      .     .       .     .     .

Figure 5.1 Example sentence illustrating rival English part-of-speech tagging (from the AMALGAM multi-tagged corpus)

[45] The AMALGAM project: http://www.comp.leeds.ac.uk/amalgam/amalgam/amalghome.htm

Besides the evolution of the part-of-speech tag sets, standards and guidelines for morphosyntactic annotation of text corpora appeared. These standards and guidelines provide sophisticated knowledge of morphology and syntax, and the tagging manuals give various heuristics to help humans and computers make decisions when POS-tagging the corpus (Jurafsky and Martin 2008).
EAGLES (Expert Advisory Group on Language Engineering Standards) has become a widely used and important recent standard for the morphosyntactic annotation of Indo-European languages. The EAGLES guidelines were proposed in the interest of comparability, interchangeability and reusability of annotated corpora (Leech and Wilson 1996). Many morphosyntactic schemes for different languages applied the EAGLES guidelines. Example projects are: the MULTEXT project; the GRACE project; the CRATER project; and the morphosyntactic tag set of Urdu. The four projects and the tag set of Urdu are discussed in Hardie (2003 and 2004).

This chapter provides a background review of existing Arabic tag sets and discusses the design standards and guidelines applied in designing the morphological features tag set of Arabic, the SALMA Tag Set. The chapter starts by introducing traditional Arabic grammar in section 5.2. A survey and a comparative evaluation of existing Arabic part-of-speech tag sets are presented in section 5.3. Section 5.4 discusses the design criteria proposed by Atwell (2008), which are applied in the design of the SALMA Tag Set. Finally, the complex morphology of Arabic is discussed in section 5.5.

5.2 Traditional Arabic Part-of-Speech Classification

Arabic, unlike English and modern European languages, has a long tradition of scholarly research into its grammatical description, spanning over a millennium. Most traditional Arabic grammar studies follow the order established by Sῑbawayh, about twelve hundred years ago. It starts with syntax (naḥw), followed by morphology (taṣrῑf) and phonology (‘ilm al-’aṣwāt). The grammarian’s main preoccupation was the explanation of the case endings of the words in the sentence, called ’i‘rāb. The term originally meant the correct use of Arabic according to the language of the Bedouins but came to mean declension.
Classical Arabic linguists classify words into three main parts of speech: Noun, the name of a person, place, or object, which does not have any tense; Verb, a word which indicates an action and has tense; and Particle, a word which cannot be understood without being joined with a noun or a verb or both. However, there are also morphological criteria for this classification: a verb can be defined as a word derived from a specified morphological pattern, which has morphological features such as person and mood; while a noun can be definite or indefinite and has number and gender features. Derived nouns, which are derived from verbs, may have the same pattern as verbs. Particles are considered the most idiosyncratic words in Arabic, as they might span several grammatical categories. For example, the particle wa can indicate a conjunction between two adjectives, as in qaḍaytu waqtan sa‘ῑdan wa mumti‘an fῑ al-ḥaflati ‘I spent an interesting and happy time at the party’. In another case, the same particle wa functions as a locative preposition, as in the sentence mašaytu wa an-nahra ‘I walked along the river’ (Al-Ghalayyni 2005).

Arabic is a highly inflectional language, and the traditional classification into nouns, verbs and particles does not say much about word structure. Arabic has many morphological and grammatical features, including sub-categories, person, number, gender, case, mood, etc. (Atwell 2008). A more fine-grained tag set is more appropriate for morphology research. The additional information may also help to disambiguate the base grammatical class (Schmid and Laws 2008). We aim to develop a part-of-speech tagger for annotating general-purpose Arabic corpus resources, in a wide range of text formats, domains and genres, including both vowelized and non-vowelized text; enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications.
We foresee an advantage in enriching the text with part-of-speech tags showing very fine-grained grammatical distinctions, which reflect expert interest in syntax and morphology, rather than specific needs of end-users, because end-user applications are not known in advance. Very fine-grain distinctions may cause problems for automatic tagging if some words can change grammatical tag depending on function and context (Atwell 2008); on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context. Practical experiments using a fine-grain morphological tag set were reported by (Schmid and Laws 2008). Their experiments were carried out using German and Czech as examples of highly inflectional languages. Their HMM part-of-speech tagger makes good use of the fine-grain tag set; it splits the part-of-speech into attribute vectors and estimates the conditional probabilities of the attribute with decision trees. This method achieved a higher tagging accuracy than two state-of-the-art general-purpose part-of-speech taggers (TnT and SVMTool). We believe that this kind of approach may yield better results for an Arabic part-of-speech tag set including fine-grained morphological features. 5.3 Existing Arabic Part-of-Speech Tag Sets This section covers the most important Arabic tag sets and tag set design methodologies. These tag sets are; (1) Khoja’s Arabic tag set, (2) Penn Arabic Treebank tag set, (3) ARBTAGS, (4) The Quranic Arabic Corpus morphological tag set, (5) The MorphoChallenge 2009 Qur’an Gold Standard tag set and (6) CATiB part-of-speech tag set. The section describes each tag set and their characteristics, and a comparison table illustrates the differences between the different Arabic tag sets. The tag sets range from a small set of short tags analogous to BNC or LOB tag sets for English on one hand, to - 99 longer more detailed morphological tag sets (e.g. 
Penn Arabic Treebank (FULL) tag set) which are analogous to the ICE tag set for English.

5.3.1 Khoja’s Arabic Tag Set

During early research on developing a part-of-speech tagger for Arabic text, Khoja, Garside and Knowles (2001) and Khoja (2003) developed a tag set for Arabic which is based on traditional Arabic grammar categories rather than the modern European EAGLES standards. The reasons for not following the EAGLES morphosyntactic guidelines were that Arabic belongs to the Semitic language family while the EAGLES guidelines were designed for European languages, and that following the EAGLES guidelines would not capture some Arabic morphosyntactic information such as imperative or jussive mood, dual number and inheritance. Inheritance is an important aspect of Arabic, where all subclasses of words inherit properties from the classes they are derived from. Khoja’s tag set contains 177 tags: 103 types of noun, 57 verbs, 9 particles, 7 residuals and 1 punctuation. Khoja’s tag set includes the morphological features of gender, number, person, case, definiteness and mood. Figure 5.2 shows an example of a part-of-speech annotated sentence, tanfῑḏan li-tawjῑhāt ẖādim al-ḥaramayn aš-šarῑfayn “Implementation of the directives of the Custodian of the Two Holy Mosques”, taken from the training corpus of the APT tagger (Khoja 2003).

  Word            Gloss             Khoja’s part-of-speech tag
  tanfῑḏan        Implementation    NCSgMI
  li-tawjῑhāt     directives        PPr’NCSgMI
  ẖādim           Custodian         NCSgMI
  al-ḥaramayn     Two Mosques       NCDuMD
  aš-šarῑfayn     Holy              NCDuMD

Figure 5.2 Example of tagged sentence using Khoja’s tag set

5.3.2 Penn Arabic Treebank (PATB) Part-of-Speech Tag Set

The most widely used tag set for Arabic is the Penn Arabic Treebank tag set, used to annotate the Penn Arabic Treebank (PATB) with part-of-speech tags. Tim Buckwalter’s morphological analyser was used to compute a set of candidate solutions or analyses for each word, and then Arabic linguists selected the solution which best fitted the context.
The Penn Arabic Treebank model postulates a FULL tag set which comprises over 2200 tag types (Diab 2007; Habash, Faraj and Roth 2009). This includes combinations of 114 basic tags listed in the Linguistic Data Consortium (LDC) Arabic part-of-speech/morphological tagging documentation [46] (Maamouri and Bies 2004; Maamouri et al. 2004; Habash 2010). Figure 5.3 shows these basic tags. The FULL tag set exhibits a wider range of morphological features: case, gender, number, definiteness, mood, person, voice, tense and aspect. The LDC also introduced the reduced tag set (RTS) of 25 tags, which is designed to maximize the performance of Arabic syntactic parsing. The RTS follows the tag set designed for the English Wall Street Journal. The morphological features marked by the RTS tag set are case, mood, gender, person and definiteness (Diab 2007).

  ABBREV ADJ ADV CONJ
  DEM_PRON_F DEM_PRON_FD DEM_PRON_FS DEM_PRON_MD DEM_PRON_MP DEM_PRON_MS
  DET EMPHATIC_PARTICLE EXCEPT_PART FUNC_WORD FUT INTERJ INTERROG_PART
  IV1P IV1S IV2D IV2FS IV2MP IV2MS IV3FD IV3FP IV3FS IV3MD IV3MP IV3MS
  IVSUFF_DO:1P IVSUFF_DO:1S IVSUFF_DO:2MP IVSUFF_DO:2MS IVSUFF_DO:3D IVSUFF_DO:3FS IVSUFF_DO:3MP IVSUFF_DO:3MS
  IVSUFF_SUBJ:2FS_MOOD:SJ IVSUFF_SUBJ:D_MOOD:I IVSUFF_SUBJ:D_MOOD:SJ IVSUFF_SUBJ:FP IVSUFF_SUBJ:MP_MOOD:I IVSUFF_SUBJ:MP_MOOD:SJ
  NEG_PART NO_FUNC NON_ALPHABETIC NON_ARABIC NOUN NOUN_PROP
  NSUFF_FEM_DU_ACCGEN NSUFF_FEM_DU_ACCGEN_POSS NSUFF_FEM_DU_NOM NSUFF_FEM_DU_NOM_POSS NSUFF_FEM_PL NSUFF_FEM_SG
  NSUFF_MASC_DU_ACCGEN NSUFF_MASC_DU_ACCGEN_POSS NSUFF_MASC_DU_NOM NSUFF_MASC_DU_NOM_POSS
  NSUFF_MASC_PL_ACCGEN NSUFF_MASC_PL_ACCGEN_POSS NSUFF_MASC_PL_NOM NSUFF_MASC_PL_NOM_POSS NSUFF_MASC_SG_ACC_INDEF
  NUM NUMERIC_COMMA PART
  POSS_PRON_1P POSS_PRON_1S POSS_PRON_2FS POSS_PRON_2MP POSS_PRON_2MS POSS_PRON_3D RESULT_CLAUSE_PARTICLE POSS_PRON_3FP POSS_PRON_3FS POSS_PRON_3MP POSS_PRON_3MS
  PREP PRON_1P PRON_1S PRON_2FS PRON_2MP PRON_2MS PRON_3D PRON_3FP PRON_3FS PRON_3MP PRON_3MS PUNC
  PVSUFF_DO:1P PVSUFF_DO:1S PVSUFF_DO:3D PVSUFF_DO:3FS PVSUFF_DO:3MP PVSUFF_DO:3MS
  PVSUFF_SUBJ:1P PVSUFF_SUBJ:1S PVSUFF_SUBJ:2FS PVSUFF_SUBJ:2MP PVSUFF_SUBJ:3FD PVSUFF_SUBJ:3FP PVSUFF_SUBJ:3FS PVSUFF_SUBJ:3MD PVSUFF_SUBJ:3MP PVSUFF_SUBJ:3MS
  REL_PRON REL_ADV SUBJUNC VERB_IMPERFECT VERB_PERFECT VERB_PASSIVE

Figure 5.3 The Penn Arabic Treebank Tag Set; basic tags, which can be combined

[46] LDC Arabic POS tagging documentation http://www.ircs.upenn.edu/arabic/Jan03release/POS-info.txt

  LOOK-UP WORD: tm
  * SOLUTION 1: (tam~) tam~/VERB_PERFECT
       (GLOSS): + conclude/take place +

  LOOK-UP WORD: AEdAd
    SOLUTION 1: (>aEodAd) >aEodAd/NOUN
       (GLOSS): + numbers/issues +

  LOOK-UP WORD: >wl
    SOLUTION 1: (>aw~al) >aw~al/VERB_PERFECT
       (GLOSS): + explain/interpret +
  * SOLUTION 2: (>aw~al) >aw~al/ADJ
       (GLOSS): + first +
    SOLUTION 3: (>uwal) >uwal/ADJ
       (GLOSS): + first +

  LOOK-UP WORD: rHlp
  * SOLUTION 1: (riHolap) riHol/NOUN+ap/NSUFF_FEM_SG
       (GLOSS): + journey/career + [fem.sg.]

  LOOK-UP WORD: TyrAn
  * SOLUTION 1: (TayarAn) TayarAn/NOUN
       (GLOSS): + airline/aviation +

  LOOK-UP WORD: EvmAnyp
    SOLUTION 1: (EuvomAniy~ap) EuvomAniy~/NOUN+ap/NSUFF_FEM_SG
       (GLOSS): + Ottoman + [fem.sg.]
  * SOLUTION 2: (EuvomAniy~ap) EuvomAniy~/ADJ+ap/NSUFF_FEM_SG
       (GLOSS): + Ottoman + [fem.sg.]

  LOOK-UP WORD: fwq
  * SOLUTION 1: (fawoq) fawoq/PREP
       (GLOSS): + above/over +
    SOLUTION 2: (fawoq) fawoq/NOUN
       (GLOSS): + top/upper part +

  LOOK-UP WORD: AlblAd
  * SOLUTION 1: (AlbilAd) Al/DET+bilAd/NOUN
       (GLOSS): the + (native) country/countries +

  LOOK-UP WORD: AlErbyp

Figure 5.4 Buckwalter morphological analysis of a sentence from the Arabic Treebank

  (tam~) tam~/VERB_PERFECT
  (<iEodAd)
  (>aw~al) >aw~al/ADJ
  (riHolap) riHol/NOUN+ap/NSUFF_FEM_SG
  (TayarAn) TayarAn/NOUN
  (EuvomAniy~ap) EuvomAniy~/ADJ+ap/NSUFF_FEM_SG
  (fawoq) fawoq/PREP
  (AlbilAd) Al/DET+bilAd/NOUN
  (AlEarabiy~ap) Al/DET+Earabiy~/ADJ+ap/NSUFF_FEM_SG

Figure 5.5 Disambiguated sentence from the Arabic Treebank using FULL tag set

  LOOK-UP WORD: wwSynA
  * SOLUTION 1: (wawaS~ayonA) [waS~aY_1] wa/CONJ+waS~ay/VERB_PERFECT+nA/PVSUFF_SUBJ:1P
       (GLOSS): and + recommend/advise + we
    SOLUTION 2: (wawaSiy~nA) [waSiy~_1] wa/CONJ+waSiy~/NOUN+nA/POSS_PRON_1P
       (GLOSS): and + authorized agent/trustee + our

  ...

    SOLUTION 2: (HasunA) [Hasun-u_1] Hasun/VERB_PERFECT+A/PVSUFF_SUBJ:3MD
       (GLOSS): + be beautiful/be good + they (both)
    SOLUTION 3: (Has~an~A) [Has~an_1] Has~an/VERB_PERFECT+nA/PVSUFF_SUBJ:1P
       (GLOSS): + improve/decorate + we
    SOLUTION 4: (Has~anA) [Has~an_1] Has~an/VERB_PERFECT+A/PVSUFF_SUBJ:3MD
       (GLOSS): + improve/decorate + they (both)
  * SOLUTION 5: (HusonAF) [Huson_1] Huson/NOUN+AF/NSUFF_MASC_SG_ACC_INDEF
       (GLOSS): + good/beauty + [acc.indef.]
    SOLUTION 6: (HasanAF) [Hasan_2] Hasan/NOUN+AF/NSUFF_MASC_SG_ACC_INDEF
       (GLOSS): + good + [acc.indef.]
    SOLUTION 7: (HasanA) [Hasan_2] Hasan/NOUN+A/NSUFF_MASC_DU_NOM_POSS
       (GLOSS): + good + two
    SOLUTION 8: (HasanAF) [Hasan_2] Hasan/ADV+AF/NSUFF_MASC_SG_ACC_INDEF
       (GLOSS): + well + [acc.indef.]
    SOLUTION 9: (Has~anA) [Has~-i_1] Has~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+nA/PVSUFF_DO:1P
       (GLOSS): + feel + he/it us
    SOLUTION 10: (Has~nA) [Has~_1] Has~/NOUN+nA/POSS_PRON_1P
       (GLOSS): + perception/feeling + our
    SOLUTION 11: (His~nA) [His~_1] His~/NOUN+nA/POSS_PRON_1P
       (GLOSS): + sensation/perception + our

Figure 5.6 Buckwalter morphological analysis of a sentence from the Quran

  (wawaS~ayonA) wa/CONJ+waS~ay/VERB_PERFECT+nA/PVSUFF_SUBJ:1P
  ...

Figure 5.7 Disambiguated sentence from the Quran

The Qur’anic example is the sentence wa waṣṣaynā al-‘insāna biwālidayhi ḥusnan ‘We have enjoined on man kindness to parents’ from the Qur’an (chapter 29). Figures 5.4 and 5.6 show the full outputs of the Buckwalter morphological analyser, including several possible solutions for some words, and Figures 5.5 and 5.7 show the correct disambiguated solution for each word in context.

Diab (2007) compared the FULL and RTS tag sets introduced by the LDC for part-of-speech tagging of the Arabic Treebank, with the aim of designing the optimal part-of-speech tag set for Arabic. On the basis of an analysis of the Arabic Treebank data, the RTS tag set was extended from 25 tags to 75 tags. Only morphological features which are explicitly marked on the words were added to the RTS. The new tag set is called the ERTS (extended reduced tag set). The ERTS has only the explicit or marked morphological features of gender, number and definiteness on nominals, while maintaining the existing features from the RTS. Figure 5.8 illustrates some differences between the three tag sets, FULL, RTS and ERTS, from (Diab 2007).
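The disambiguated analyses in Figures 5.5 and 5.7 are strings of morpheme/tag pairs joined by "+", so they can be split mechanically into morphemes and basic tags. The following is a minimal sketch of that idea, not part of the Buckwalter analyser or the LDC tools; the function names split_analysis and word_tag are hypothetical, and the sketch assumes that neither "+" nor "/" occurs inside a transliterated morpheme.

```python
def split_analysis(analysis):
    """Split 'morph/TAG+morph/TAG' into a list of (morpheme, tag) pairs.

    Hypothetical helper: assumes '+' separates morphemes and the last '/'
    in each segment separates the morpheme from its basic tag.
    """
    pairs = []
    for segment in analysis.split("+"):
        morpheme, _, tag = segment.rpartition("/")
        pairs.append((morpheme, tag))
    return pairs


def word_tag(analysis):
    """Join the basic tags of all morphemes into one combined word tag."""
    return "+".join(tag for _, tag in split_analysis(analysis))


# Analysis string taken from Figure 5.7.
example = "wa/CONJ+waS~ay/VERB_PERFECT+nA/PVSUFF_SUBJ:1P"
for morpheme, tag in split_analysis(example):
    print(morpheme, tag, sep="\t")
print(word_tag(example))
```

Joining the basic tags in this way gives a combined tag of the same shape as the entries in the FULL column of Figure 5.8, which is one way to see how a small inventory of basic tags can yield a very large number of tag types.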
  Word                      FULL                              RTS   ERTS
  HSylp ‘result’            NOUN+NSUFF_FEM_SG+CASE_IND_NOM    NN    NNF
  nhA}yp ‘final’            ADJ+NSUFF_FEM_SG+CASE_IND_NOM     JJ    JJF
  HAdv ‘accident’           NOUN+CASE_DEF_ACC                 NN    NNM
  AlnAr ‘the-fire’          DET+NOUN+CASE_DEF_GEN             NN    DNNM
  AlimAEy ‘group’           DET+ADJ+CASE_DEF_GEN              JJ    DJJM
  $xSyn ‘two-persons’       NOUN+NSUFF_MASC_DU_GEN            NN    NNMDu

Figure 5.8 A sample of tagged sentence using the FULL, RTS and ERTS tag sets

5.3.3 ARBTAGS Tag Set

Alqrainy (2008) developed a new part-of-speech tag set called ARBTAGS to be used in the development of a part-of-speech tagger. The tag set design followed the criteria proposed by Atwell (2008). Like Khoja, Alqrainy built on traditional Arabic grammar books to design the tag set. Six morphological features of Arabic words were included: gender, number, case, mood, person and state. ARBTAGS contains 161 detailed tags and 28 general tags to cover the main part-of-speech classes and sub-classes. The 161 detailed tags are divided into 101 nouns, 50 verbs, 9 particles and 1 punctuation mark. Figure 5.9 shows the 28 general tags of the ARBTAGS tag set.

  TAG    DESCRIPTION           TAG    DESCRIPTION
  VePe   Perfect verb          NuCd   Conditional noun
  VePi   Imperfect verb        NuDe   Demonstrative noun
  VePm   Imperative verb       NuIn   Interrogative noun
  NuPo   Proper noun           NuAd   Adverb
  NuCn   Common noun           NuNn   Numeral noun
  NuAj   Adjective noun        Fw     Foreign noun
  NuIf   Infinitive noun       Pun    Punctuation mark
  NuRe   Relative noun         PrPp   Preposition
  NuDm   Diminutive noun       PrVo   Vocative Particle
  NuIs   Instrument noun       PrCo   Conjunction Particle
  NuPn   Noun of Place         PrEx   Exception Particle
  NuTn   Noun of Time          PrAn   Annulment Particle
  NuPs   Pronoun               PrSb   Subjunctive Particle
  NuCv   Conjunctive noun      PrJs   Jussive Particle

Figure 5.9 The 28 general tags of the ARBTAGS tag set

5.3.4 MorphoChallenge 2009 Qur’an Gold Standard Part-of-Speech Tag Set

The MorphoChallenge 2009 [47] Qur’an gold standard was developed using the data of the Morphological Tagging of the Qur’an database (Talmon and Wintner 2003; Dror et al. 2004).
It was developed to be used to evaluate morphological analyzers in the MorphoChallenge 2009 competition (Kurimo et al. 2009), which aimed to develop an unsupervised morphological analyzer to be used for different languages including Arabic. It contains the full morphological analysis for each word, according to the tagged database of the Qur’an but reformatted to match the other MorphoChallenge test sets in other languages. The morphological analysis is shown after each word, with the morphological features separated by a space and a “+” sign. These features include the part-of-speech of the word, number, gender, person, case, definiteness, voice and others. Figure 5.10 shows a sample of the Qur’an gold standard.

This tag set was called a “gold standard” for the purpose of the MorphoChallenge 2009 contest, as it was the “target” or “solution” which the competitor systems had to try to produce. The tagged texts in other languages (i.e. English, German, French, Finnish and Turkish) were also “gold standards” for the purposes of the MorphoChallenge contest. The term “gold standard” does not imply that the tag set is better than the others reviewed in this chapter.

[47] MorphoChallenge 2009 Qur’an Gold Standard http://www.cis.hut.fi/morphochallenge2009/datasets.shtml

  wawaS~ayonaA   wSy   yufaE~ilu
      wa          +Particle +Conjunction
      waSSaynaA   +Verb +Perf +Act +1P +Pl +Masc/Fem
  al-‘insāna
                  +Noun +Triptotic +Sg +Masc +Acc +Def
  biwālidayhi
                  +Prep
                  +Noun +Triptotic +Dual +Masc +Obliquus
                  +Pron +Dependent +3P +Sg +Masc
  ḥusnan
                  +Noun +Triptotic +Sg +Masc +Acc +Tanwiin
  sā’iḥ   zārū   ’aylūl      al-māḍῑ
                 September   Past
  CATiB annotation:
                 NOM         NOM

Figure 5.12 Example of part-of-speech tagged sentence using CATiB tag set

5.3.7 Comparison of Arabic Part-of-Speech Tag Sets

Table 5.1 shows a comparison of the eight Arabic tag sets studied in this section. The comparison summarizes the characteristics of each tag set and helps to show the differences between them clearly. The drawbacks of the existing tag sets for Arabic were found to be:

• Existing Arabic tag sets vary in size from 6 tags to 2000 or more tags.
• Some of these tag sets, such as the PATB tag sets, follow standards for tag set design for English, and these may not always be appropriate for Arabic.
• The tag sets share common morphological features such as gender, number, person, case, mood and definiteness, but the attributes of the morphological feature categories are not standardized.
• These tag sets lack standardization in defining a suitable scheme for tokenizing Arabic words into their morphemes, and they mix morpheme tagging with whole-word tagging.
• They also lack suitable documentation that illustrates the decision made for each design dimension of the tag set.
• The tags assigned to words in a corpus are not consistent in either the presentation of the tag itself or the morphological features which are encoded within the tag.

Moreover, the most widely used and important morphosyntactic annotation standards and guidelines, namely EAGLES, are designed for Indo-European languages. These guidelines are not entirely suitable for Arabic. These drawbacks of existing tag sets are the motivation behind designing the SALMA (Sawalha Atwell Leeds Morphological Analysis) Tag Set for Arabic.

The comparison of the morphological features used in the different tag sets of Arabic shows shared common features such as gender, number, person, case, mood and definiteness. Features such as voice, tense and aspect are included in the PATB FULL tag set.
State is included in the ARBTAGS tag set. Diptotic is a feature of the MorphoChallenge 2009 tag set, and verb form and derivation are features of the QAC tag set. Chapter 6 discusses the 22 morphological features of the SALMA Tag Set.

Table 5.1 Comparison of Arabic part-of-speech tag sets

1. Khoja’s Tag set
   Purpose of design: Compiling a tag set as a standard tag set.
   Main characteristics: Based on traditional Arabic grammar rather than on an Indo-European one. Only the main classes and subclasses have been chosen.
   Tag set size: 177 tags (103 types of noun, 57 verbs, 9 particles, 7 residuals, 1 punctuation).
   Morphological features: Gender, Number, Case, Definiteness, Person, Mood.
   Applications: Used in the design of the APT tagger, and in the annotation of the training data of the APT tagger.

2. Penn Arabic Treebank (PATB) Part-of-Speech Tag Set (FULL)
   Purpose of design: Annotating the Arabic Treebank with part-of-speech tags.
   Main characteristics: Aims to cover detailed grammar features.
   Tag set size: The FULL tag set comprises over 2000 tag types. This includes combinations of 114 basic tags.
   Morphological features: Case, Gender, Number, Definiteness, Mood, Person, Voice, Tense, Aspect.
   Applications: Used with Tim Buckwalter’s morphological analyser to annotate the Penn Arabic Treebank with part-of-speech tags.

3. Penn Arabic Treebank (PATB) Reduced Part-of-Speech Tag Set (RTS)
   Purpose of design: Maximizing the performance of Arabic syntactic parsing.
   Main characteristics: Follows the tag set designed for the English Wall Street Journal.
   Tag set size: 25 tags.
   Morphological features: Case, Mood, Gender, Person, Definiteness.
   Applications: Used in the syntactic annotation of the Penn Arabic Treebank.

4. Penn Arabic Treebank (PATB) Extended Reduced Part-of-Speech Tag Set (ERTS)
   Purpose of design: To be used for higher-order processing of the language.
   Main characteristics: An extension of the RTS tag set which has only the explicit or marked morphological features of gender, number and definiteness on nominals.
   Tag set size: 75 tags.
   Morphological features: Gender, Number, Definiteness on nominals.
   Applications: To be used for parsing.

5. ARBTAGS
   Purpose of design: Standardizing and building a comprehensive Arabic tag set.
   Main characteristics: The tag set hierarchy follows the tradition of Arabic grammar.
   Tag set size: 161 detailed tags (101 nouns, 50 verbs, 9 particles, 1 punctuation mark), plus 28 general POS tags to cover the main part-of-speech classes and sub-classes.
   Morphological features: Gender, Number, Case, Mood, Person, State.
   Applications: Used in the Arabic Morphosyntactic Tagger AMT.

6. MorphoChallenge 2009 Qur’an gold standard tag set
   Purpose of design: To annotate the Qur’an as a gold standard to be used to evaluate morphological analyzers in the MorphoChallenge 2009 competition.
   Main characteristics: It was developed using the data of the Morphological Tagging of the Qur’an database.
   Tag set size: The tag set consists of combinations of the POS main and sub-classes and the morphological features of the analysed words.
   Morphological features: Gender, Number, Person, Case, Mood, Aspect, Voice, Definiteness, Diptotic.
   Applications: Used to construct the Qur’an gold standard for evaluating morphological analyzers in the MorphoChallenge 2009 competition.

7. Quranic Arabic Corpus POS tag set
   Purpose of design: To annotate the Qur’an with morphological and part-of-speech tagging information.
   Main characteristics: Used Tim Buckwalter’s morphological analyzer for initial tagging, followed by a mapping from Buckwalter’s tag set to the Quranic Arabic Corpus tag set. It adapts traditional Arabic grammar.
   Tag set size: The tag set involves combinations of the POS main and sub-classes and the morphological features of the analysed words.
   Morphological features: Person, Gender, Number, Aspect, Mood, Voice, Verb form, Derivation, State.
   Applications: Used in the morphological and part-of-speech annotation of the Quranic Arabic Corpus.

8. Columbia Arabic Treebank POS tag set
   Purpose of design: To be used for the part-of-speech annotation of the Columbia Arabic Treebank (CATiB).
   Main characteristics: CATiB avoids the annotation of redundant linguistic information that is determinable automatically from syntax and morphological analysis, e.g. nominal case. CATiB uses linguistic representation and terminology inspired by the long tradition of Arabic syntactic studies.
   Tag set size: 6 part-of-speech tags (VRB: all verbs; VRB-PASS: passive-voice verbs; NOM: all nominals; PROP: proper nouns; PRT: particles; PNX: all punctuation marks).
   Morphological features: No morphological features are encoded in the part-of-speech tag set of the Columbia Arabic Treebank (CATiB).
   Applications: Used in the part-of-speech annotation of the Columbia Arabic Treebank (CATiB).

5.4 Morphological Features in Tag Set Design Criteria

EAGLES [48] (Leech and Wilson 1996) proposed recommendations (guidelines) for morphosyntactic categories for European languages. The aim of the EAGLES guidelines is to propose standards for developing tag sets for morphosyntactic tagging, in the interest of comparability, interchangeability and reusability of annotated corpora. In addition to preferred standards, the EAGLES guidelines also cater for extensibility, allowing specifications to extend to language-specific phenomena. The guidelines proposed standardisation in three important areas:

• Representation/Encoding: transparency, processability, brevity and unambiguity.
• Identifying categories/subcategories/structure: agreement on common categories and allowance for variation: obligatory, recommended and optional specification.
• Annotation schemes and their application to text: detailed annotation schemes should be made available to end-users and to annotators.

EAGLES recognizes four degrees of constraint in the description of word categories for morphosyntactic tags. First, obligatory: attributes that have to be included in any morphosyntactic tag set, namely the main categories of part-of-speech: Noun, Verb, Adjective, Pronoun/Determiner, Article, Adverb, Adposition, Conjunction, Interjection, Unique/Unassigned, Residual, Punctuation. Second, recommended: attributes and values of widely-recognized grammatical categories which occur in conventional grammatical description (e.g. Gender, Number, Person). Third, generic special extensions: attributes and values which are not usually encoded, but might be included for particular purposes, for example semantic classes such as temporal nouns, manner adverbs, place names, etc. Finally, language-specific special extensions: additional attributes or values which may be important for a particular language.

[48] EAGLES Recommendations for the Morphosyntactic Annotation of Corpora. EAGLES document EAG-TCWG-MAC/R. http://www.ilc.cnr.it/EAGLES96/pub/eagles/corpora/annotate.ps.gz

Khoja et al. (2001) compared their Arabic tag set against the EAGLES guidelines. The comparison showed the following. First, the EAGLES tag set guidelines are based on Latin as a common ancestor, while Arabic has some novel features not found in Latin, for example certain categories and subcategories that inherit properties from their parent categories. Second, a Classical Arabic tag set has three main categories (nouns, verbs and particles), while EAGLES has eleven major part-of-speech categories. Third, apart from nouns and verbs, other major categories in EAGLES such as pronouns, numerals and adjectives are described as subcategories of the major categories in a Classical Arabic tag set. Fourth, Arabic has not only singular and plural number but also dual number.
Moreover, Arabic verbs are classified as being perfect, imperfect and imperative, which differs from the EAGLES classification into past, present and future tenses. Finally, the mood morphological feature is not covered by the EAGLES guidelines.

Atwell (2008) proposed criteria for tag set development, and stated that there are dimensions (choices) to be decided by developers of a new part-of-speech tag set. Developers must decide on the set of grammatical tags or categories, and their definitions and boundaries. These criteria were applied to Arabic when the ARBTAGS tag set (Alqrainy 2008) was designed. We followed the same criteria as Atwell (2008) in designing the general-purpose morphological features tag set. Sections 5.4.1 to 5.4.12 explain the criteria and how they are applied in the SALMA Tag Set.

5.4.1 Mnemonic Tag Names

Generally, tag names for English PoS tag sets are chosen to help linguists to remember the grammatical categories, such as CC for Coordinating Conjunction and VB for VerB. The SALMA Tag Set for Arabic has to encode much richer morphology: the tag is represented by a string of 22 characters. Each character represents a value or attribute which belongs to a morphological feature category. The position of the character in the tag string is important, as it identifies the morphological feature category. The value of the feature is represented by one lowercase character, which is intended to remain readable, such as: v in the first position to indicate verb; n in the second position to indicate name; and gender category values in the seventh position, where masculine is represented by m, feminine is represented by f and common gender is represented by x. If a certain feature is not applicable to the tagged word, then a dash “-” is used to indicate this. A question mark “?” indicates “unknown”: a certain feature normally belongs to the word, but at the moment it is not available or the automatic tagger could not guess it.
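The positional encoding described above can be illustrated with a small decoder. This is a minimal sketch under stated assumptions rather than the SALMA implementation: it covers only the first position (main part of speech, with the symbols n, v, p, r and u used in Table 5.2) and the seventh position (gender), the names VALUES, decode and matches are hypothetical, and the example tag is constructed for illustration and is not taken from an annotated corpus.

```python
TAG_LENGTH = 22          # a SALMA tag is a string of 22 characters
NOT_APPLICABLE = "-"     # feature does not apply to the word
UNKNOWN = "?"            # feature applies but its value is not available

# Position (1-based) -> (feature category, {symbol: value}).
# Only positions whose symbols are spelled out in the text are listed;
# the remaining positions are deliberately left undecoded in this sketch.
VALUES = {
    1: ("main part-of-speech", {"n": "noun", "v": "verb", "p": "particle",
                                "r": "other (residual)", "u": "punctuation"}),
    7: ("gender", {"m": "masculine", "f": "feminine", "x": "common gender"}),
}


def decode(tag):
    """Interpret the listed positions of a 22-character tag string."""
    if len(tag) != TAG_LENGTH:
        raise ValueError("a tag must have exactly 22 characters")
    result = {}
    for position, (category, symbols) in VALUES.items():
        symbol = tag[position - 1]
        if symbol == NOT_APPLICABLE:
            result[category] = "not applicable"
        elif symbol == UNKNOWN:
            result[category] = "unknown"
        else:
            result[category] = symbols.get(symbol, "unrecognised symbol")
    return result


def matches(tag, position, symbol):
    """Search-tool style test: does the tag carry `symbol` at `position`?"""
    return tag[position - 1] == symbol


# Hypothetical tag: noun, feminine, every other feature marked not applicable.
example = "n" + "-" * 5 + "f" + "-" * 15
print(decode(example))
print(matches(example, 7, "f"))
```

Because each feature has a fixed position, a query such as "all feminine words" reduces to a single character comparison at position 7, which is what makes the tag easy for software to process.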
The interpretation of the tag is handled by referring to the attribute value and its position in the tag string. The position of the attribute in the tag string identifies the morphological feature category, while the attribute value is identified by searching the morphological feature category for the specified symbol. Then, all these single interpretations of attributes are grouped together to represent the full tag of the word. The tag is still readable by linguists. Moreover, the tag is straightforwardly readable by software, for example by a search tool matching specified feature-value(s).

5.4.2 Underlying Linguistic Theory

Linguists who develop new tag sets will inevitably be swayed by the linguistic theories they espouse. In the case of English, there is disagreement between grammar theories on the range of grammatical categories and features to be tagged, and on more complicated structural issues. It is difficult to have theory-neutral annotation, because every tagging scheme makes some theoretical assumptions (Atwell 2008). Khoja’s morphosyntactic tag set was derived from classical Arabic grammar (Khoja et al. 2001; Khoja 2003). ARBTAGS also tried to follow the Arabic grammatical system, which is based upon three main part-of-speech classes (verbs, nouns and particles) and enriched with inflectional features (Alqrainy 2008). The Arabic Penn Treebank tag set follows the same criteria used to develop the English Treebank (Maamouri and Bies 2004). The ERTS (extended reduced tag set) extends the LDC reduced tag set (RTS) with morphological features, so that it marks case, mood, definiteness, gender, number and person. This extends the 25 tags of the RTS to the 75 tags of the ERTS (Diab 2007). The proposed SALMA Tag Set adds more fine-grained details to the existing tag sets.
The tag set follows traditional Arabic grammar theory (Dahdah 1987; Dahdah 1993; Wright 1996; Al-Ghalayyni 2005; Ryding 2005) in specifying 22 morphological feature categories and their attributes or values. Section 6.2.1 justifies the SALMA tags in terms of this underlying theory.

5.4.3 Classification by Form or Function

For English, an ambiguous word like ‘open’ is tagged according to its function, and only its inflected forms are tagged by their form. Arabic words are highly inflected and hence word classification tends to be dependent on form. Classification by form is dependent on the word itself, while classification by function is dependent on the function of the word in context. For Arabic, the word class is heavily constrained by form: if the form admits only one analysis, the class is determined by form alone; if there are two analyses, context has to be taken into account, which means the class is partially determined by function.

Arabic word class is dependent on form. Traditional Arabic grammar groups words according to their inflexional behaviour. A challenging characteristic of Arabic is the treatment of short vowels, which are normally omitted in written Arabic. These short vowels can help in specifying some of the morphological feature information of grammatical categories. The Qur’an is fully vowelized to ensure it is pronounced correctly. This makes the Qur’an a potential “Gold Standard” corpus for Arabic tagging and NLP research (Atwell 2008).

Another challenge arises when classifying Arabic words according to certain morphological features such as gender. Classifying nouns as masculine or feminine can be viewed from two perspectives. First, morphologically, according to the word’s structure: masculine nouns are not normally marked by any suffix, while feminine nouns have a suffix, normally -ah, added at the end of the noun.
Second, semantically: nouns are arbitrarily classified as masculine or feminine, except when a noun refers to a human being or other creature having natural gender (sex), in which case it normally conforms to natural gender (Ryding 2005). Therefore, a noun can have the feminine suffix -ah, and so be classified as morphologically feminine, while indicating a male, such as ḥamzah ‘Hamza (male proper name)’, or vice versa, such as maryam ‘Mary (female proper name)’.

5.4.4 Idiosyncratic Words

Arabic has some words with special, idiosyncratic behaviour, such as particles which cannot be analyzed morphologically according to root and pattern. Khoja et al. (2001) include examples of this type in an “Exception” category, which covers a group of particles that are equivalent to the English word “except” and the prefixes non-, un- and im-.

5.4.5 Categorization Problems

A detailed categorisation scheme requires each tag to be defined clearly and unambiguously, by giving examples in a “case-law” document. This definition should include how to decide difficult, borderline cases, so that all examples in the corpus can be tagged consistently. Many words can belong to more than one grammatical category, depending on the context of use. Tagging schemes should specify how to choose one tag as appropriate if a word can have different part-of-speech tags in different contexts (Atwell 2008).

Vowelized Arabic text has less ambiguity than non-vowelized Arabic text. Short vowels and some affixes add linguistic information which reduces the ambiguity. In the SALMA Tag Set, each feature category is described and clearly documented, and examples are provided. Moreover, tagging guidelines define the appropriate attribute for each morphological feature category.

5.4.6 Tokenisation: What Counts as a Word?

Arabic text tokenisation is not an easy task. Simple tokenisation of text can be carried out by dividing text into words by spaces, or punctuation.
This tokenisation process is primitive and is only the first step in tokenising Arabic text. The majority of Arabic words are complex words: one or more clitics can be attached to the beginning and the end of the word [clitic(s) + word + clitic(s)]. These clitics are particles, pronouns or the definite article. A tag is provided for each clitic attached to a word, along with the tag of the word. For instance, the word wabiḥasanātihim ‘and with their good deeds’ consists of four parts: the conjunction letter wa ‘and’, the preposition bi ‘with’, the word ḥasanāti ‘good deeds’ and the pronoun him ‘their’. The tag of this word will be the tags of the four morphemes and the whole-word tag, which is a combination of the morpheme tags. The clitics help the tagging scheme to identify some of the morphological feature attributes; for example, the preposition bi governs the genitive case of the noun.

5.4.7 Multi-Word Lexical Items

Multi-word lexical items are rare in Arabic (Alqrainy 2008). Such items might consist of two words: a noun followed by an adjective describing the preceding noun; some compound proper names such as ’abdu allāh ‘Abdullah’; or compound particles such as fῑmā, which consists of the preposition fῑ and the non-human relative noun mā. In the case of proper names, a single tag might be more appropriate, while for the other cases a separate tag for each part of the lexical item will give more morphological detail about the multi-word lexical item. The Penn Arabic Treebank guidelines ignore multi-word lexical items and tag each word of a compound word separately: “....Divided/compound proper names in Arabic (Abdul Ahmed, e.g.): Label all parts of the name with the "Is a name" button.
Idioms: (for example, in what in them = 'included'): Label each word independently for its own part of speech (ignore the idiomatic meaning)....” [49]

[49] Penn Arabic Treebank annotation guidelines http://www.ircs.upenn.edu/arabic/pos.html

5.4.8 Target Users and/or Applications

Fitness for purpose and customer satisfaction are the most important practical criteria for a new tag set. One common use of part-of-speech tagged corpora is language teaching and research. A detailed tag set is required in teaching and learning to reflect fine distinctions of grammar, even though Machine Learning systems could cope better with a smaller tag set. General-purpose tag set developers should be more aware of potential re-use: detailed and more sophisticated part-of-speech tag schemes allow wider re-use of the corpus in future research (Atwell 2008).

The SALMA Tag Set is a general-purpose tag set. It encodes detailed information about the morphological features embedded in any word. This morphological feature information enables the tag set to be widely re-used.

5.4.9 Availability and/or Adaptability of Tagger Software

If a part-of-speech tag set is implemented in automatic tagger software, this has a clear advantage over a purely theoretical tag set (Atwell 2008). HMM taggers can be reused for any language, including Arabic. Experiments on highly inflectional languages such as German and Czech using an HMM tagger with a fine-grained tag set achieved higher tagging accuracy than two state-of-the-art general-purpose part-of-speech taggers, the TnT tagger and SVMTool (Schmid and Laws 2008). Another experiment that uses a fine-grained tag set was done for Latin. Latin words require morphological analysis of nine features: part-of-speech, person, number, tense, mood, voice, gender, case and degree. The experiment used the TreeTagger analyzer, which achieved an accuracy of 83% in correctly disambiguating the full morphological analysis (Bamman and Crane 2008).
5.4.10 Adherence to Standards

The EAGLES guidelines are designed for European languages. However, the Arabic language is different from Indo-European languages and has its own structure and morphological features. Instead, the standard adhered to in the SALMA Tag Set is that of traditional Arabic grammar books, e.g. (Dahdah 1987; Dahdah 1993; Wright 1996; Al-Ghalayyni 2005; Ryding 2005).

5.4.11 Genre, Register or Type of Language

The SALMA Tag Set is intended to be general-purpose and to be used in part-of-speech tagging of different text types, formats and genres, of both vowelized and non-vowelized text. The tagging schemes and the tag set can be evaluated on a variety of text types, formats and genres. Corpora can include text in Classical Arabic, such as the Qur’an, Classical Arabic dictionaries and poems from ancient Arabic literature, as well as Modern Standard Arabic text from newspapers, magazines, web pages, blogs, children’s books, school text books, etc.

5.4.12 Degree of Delicacy of the Tag Set

The total number of tags is an indicator of the level of fine-grainedness of analysis. Existing Arabic corpus tag sets have degrees of delicacy ranging from 6 tags for CATiB, 25 for the RTS tag set of the Penn Arabic Treebank, 75 for ERTS, 161 for ARABTAGS, 177 for Khoja’s tag set and 2200 for the PATB FULL tag set, to an unspecified number of function combinations for the QAC and MorphoChallenge 2009 tag sets. The SALMA Tag Set is a fine-grain tag set. It is infeasible to enumerate all possible tags that can be generated from valid combinations of the 22 morphological feature categories; however, we can count the attributes of each feature category, and use these counts to estimate an upper bound on the degree of delicacy of the SALMA Tag Set. Chapter 6 discusses the 22 morphological features of the SALMA – Tag Set and their attributes.
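The counting argument described above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming the per-feature attribute counts and the per-part-of-speech applicable counts given in Table 5.2; the variable names are hypothetical and are not part of any SALMA software.

```python
from math import prod  # math.prod is available from Python 3.8

# Number of attributes for each of the 22 feature positions (Table 5.2).
ATTRIBUTES_PER_FEATURE = [5, 34, 3, 22, 15, 12, 3, 9, 3, 4, 4,
                          10, 2, 2, 2, 4, 2, 9, 5, 3, 30, 6]

# For each main part of speech, the attribute counts of the features that
# apply to it (Table 5.2); inapplicable features contribute a factor of 1
# and are therefore omitted.
APPLICABLE_COUNTS = {
    "noun": [34, 3, 9, 3, 3, 7, 2, 2, 4, 5, 3, 6],
    "verb": [3, 3, 2, 3, 6, 2, 2, 4, 2, 6, 5, 2, 30],
    "particle": [22, 1, 4, 2, 1],
    "other": [15, 3, 3, 3, 1, 4],
    "punctuation": [12],
}

# Absolute bound: every attribute combined with every other attribute.
absolute_upper_bound = prod(ATTRIBUTES_PER_FEATURE)

# More realistic bound: one product per main part of speech, then summed.
per_pos_bounds = {pos: prod(counts) for pos, counts in APPLICABLE_COUNTS.items()}
realistic_upper_bound = sum(per_pos_bounds.values())

print(f"{absolute_upper_bound:.2e}")   # absolute bound in scientific notation
print(per_pos_bounds)
print(f"{realistic_upper_bound:,}")
```

Summing per part of speech rather than multiplying across all positions excludes combinations such as a noun subcategory together with a verb subcategory, which can never occur in a single tag.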
An upper limit of possible feature combinations is 4.07E+16, the total number of possible combinations of features in the SALMA Tag Set of Arabic, calculated by multiplying together the number of attributes of each of the 22 morphological features. But, of course, this includes many invalid tags that will never be used. A more realistic upper bound is given by counting the possible feature combinations for each major part of speech, and summing these. Table 5.2 shows the absolute upper limit of possible feature combinations for each major part of speech (Noun, Verb, Particle, Other (Residual), Punctuation); this gives an upper limit of 101,945,168 possible morphological feature combinations: about one hundred million possible SALMA tags.

Table 5.2 The upper limit of possible combinations of SALMA features
(For each part of speech, the cell gives the template character followed by the number of combinations.)

Position | Feature | Number of attributes | Noun | Verb | Particle | Other | Punctuation
1 | Main Part-of-Speech | 5 | n 1 | v 1 | p 1 | r 1 | u 1
2 | Part-of-Speech: Noun | 34 | ? 34 | - 1 | - 1 | - 1 | - 1
3 | Part-of-Speech: Verb | 3 | - 1 | ? 3 | - 1 | - 1 | - 1
4 | Part-of-Speech: Particle | 22 | - 1 | - 1 | ? 22 | - 1 | - 1
5 | Part-of-Speech: Other | 15 | - 1 | - 1 | - 1 | ? 15 | - 1
6 | Punctuation marks | 12 | - 1 | - 1 | - 1 | - 1 | ? 12
7 | Gender | 3 | ? 3 | - 1 | - 1 | ? 3 | - 1
8 | Number | 9 | ? 9 | - 1 | - 1 | ? 3 | - 1
9 | Person | 3 | - 1 | ? 3 | - 1 | ? 3 | - 1
10 | Inflectional morphology | 4 | ? 3 | ? 2 | ? 1 | ? 1 | - 1
11 | Case or Mood | 4 | ? 3 | ? 3 | - 1 | - 1 | - 1
12 | Case and Mood marks | 10 | ? 7 | ? 6 | ? 4 | ? 4 | - 1
13 | Definiteness | 2 | ? 2 | - 1 | - 1 | - 1 | - 1
14 | Voice | 2 | - 1 | ? 2 | - 1 | - 1 | - 1
15 | Emphasized and non-emphasized | 2 | - 1 | ? 2 | - 1 | - 1 | - 1
16 | Transitivity | 4 | - 1 | ? 4 | - 1 | - 1 | - 1
17 | Rational | 2 | ? 2 | ? 2 | ? 2 | - 1 | - 1
18 | Declension and Conjugation | 9 | ? 4 | ? 6 | ? 1 | - 1 | - 1
19 | Unaugmented and Augmented | 5 | ? 5 | ? 5 | - 1 | - 1 | - 1
20 | Number of root letters | 3 | ? 3 | ? 2 | - 1 | - 1 | - 1
21 | Verb root | 30 | - 1 | ? 30 | - 1 | - 1 | - 1
22 | Noun finals | 6 | ? 6 | - 1 | - 1 | - 1 | - 1
Totals | | 4.1E+16 | 83,280,960 | 18,662,400 | 176 | 1620 | 12
Upper limit of possible morphological feature combinations: 101,945,168

5.5 Complex Morphology of Arabic

Most Arabic words are derived from their roots following certain templates called patterns. The derivation process adds prefixes, suffixes and infixes to the root letters to generate a new word, which has a new function or meaning but preserves the main concept or meaning carried by the root. Moreover, using the derived word in a certain context will require clitics to be added to the beginning and the end of the word. Proclitics include prepositions, conjunctions and definite articles, and enclitics include relative pronouns. In addition, one or more affixes or clitics can be added to the derived word. In conclusion, most Arabic words are complex words consisting of multiple morphemes. To specify a word’s morphemes, tokenization is needed to analyse the word’s morphemes as clitics, affixes or stem. For example, the tokenizer will specify the morphemes of the word wasayaktubūnahā ‘and they will write it’ as follows: the proclitic wa ‘and’ (conjunction), the prefixes sa ‘will’ and ya (imperfect prefix), the stem kataba ‘write’, the suffix ūn ‘they’ and the enclitic hā ‘it’ (object suffixed pronoun). The word consists of 6 morphemes. Each morpheme carries morphological features and belongs to a specific part-of-speech category. The SALMA Tag Set assigns a tag to each morpheme of the word. Then, in principle, the morphemes’ tags are combined into one whole word tag. The word tag inherits its morphological feature attributes using an algorithm that establishes agreement on morphological feature attributes. The description of the algorithm is beyond the scope of this chapter. This chapter is about the output of the tagger rather than the algorithm for tagging and combining morpheme tags into word tags.
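The segmentation in the example above can be held in a small data structure. The following is a minimal sketch only: the `Morpheme` class, its field names and the `roles` helper are illustrative assumptions rather than part of the SALMA tools, and the roles, transliterations and glosses are simply those listed for wasayaktubūnahā.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    """One morpheme of a tokenized word (hypothetical structure)."""
    role: str   # proclitic, prefix, stem, suffix or enclitic
    form: str   # transliterated form of the morpheme
    gloss: str  # English gloss as given in the text


# The six morphemes of wasayaktubūnahā 'and they will write it'.
SEGMENTATION = [
    Morpheme("proclitic", "wa", "and (conjunction)"),
    Morpheme("prefix", "sa", "will"),
    Morpheme("prefix", "ya", "imperfect prefix"),
    Morpheme("stem", "kataba", "write"),
    Morpheme("suffix", "ūn", "they"),
    Morpheme("enclitic", "hā", "it (object suffixed pronoun)"),
]


def roles(morphemes):
    """Return the sequence of morpheme roles, in surface order."""
    return [m.role for m in morphemes]
```

In such a representation each morpheme would additionally carry its own 22-character tag, and the word tag would be computed from the morpheme tags by the combination algorithm that this chapter does not describe.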
The following example in figure 5.13 shows the tokenization of the word into morphemes, the assignment of the part-of-speech tag for each morpheme and the result of combining the morpheme tags into one whole word tag.

Tokenization is a known problem even for English corpus tagging. The tagged LOB corpus defines the word or graphic word as a sequence of characters surrounded by spaces (or punctuation marks). Each word is assigned a tag. Differences in tagging occurred due to: first, variation in segmentation of compound terms, as in: fancy free given the tags NN (noun, singular, common) JJ (adjective), and fancy-free given the tag JJ (adjective). Second, hyphenated sequences, as in: an above-the-rooftops position given the tag JJB (adjective, attributive-only). Third, syntactic boundaries, as in: Henry NP (noun, singular, proper) 8’s CD$ (numeral, cardinal, genitive) hall. In some cases, the LOB Corpus tagging guidelines have changed from a ‘one-word-one-tag’ approach to idiom tagging to handle the cases of recurrent multiword sequences functioning as units (Johansson et al. 1986). On the other hand, contractions forming regular patterns such as I’ll, she’s, John’s, let’s, d’you, etc. are split up in the tagged LOB corpus as follows: I’ ll, she’ s, John’ s, let’ s, d’ you. Each part is treated as a separate word and assigned a single tag. The exception is where ’s is a possessive suffix: the word then gets a single tag ending in $, e.g. John’s gets the tag NP$ (Johansson et al. 1986).

Analyzed sentence: ’aqamtu bimadῑnatῑ al-ğadῑdat limuddat ‘āmayn “I have stayed in my new city for two years”
Analyzed word: bimadῑnatῑ ‘in my city’

Step 1: Tokenization of words into morphemes
Word | Proclitics | Prefixes | Stem | Suffixes | Enclitics
bimadῑnatῑ | bi ‘in’ | ------- | madῑna ‘city’ | t (feminine tā’) | ῑ ‘my’

Step 2: Assign morpheme tags
Morpheme | Tag | Description
bi ‘in’ | p--p------------------ | Particle; Preposition
madῑna ‘city’ | nl-------vg?i----tat-s | Noun; Noun of place; Varied; Genitive; Indefinite; Primitive/Concrete noun; Augmented by one letter; Triliteral root; Sound noun.
t (feminine tā’) | r---f-fs-s-k---------- | Other (Residual); tā’ of femininization; feminine; Singular; Invariable; kasrah
ῑ ‘my’ | r---r-msfsgs---------- | Other (Residual); Connected pronoun; Common gender; Singular; First person; Invariable; Genitive; sukūn (Silence)

Step 3: Assign word tag
Word | Tag | Description
bimadῑnatῑ | nl----fs-vgki----tat-s | Noun; Noun of place; feminine; Singular; Declined; Genitive; kasrah; Indefinite; Primitive/Concrete noun; Augmented by one letter; Triliteral root; Sound noun.

Figure 5.13 Example of tokenization, the SALMA tag assignment for separate morphemes and the combination of the morphemes’ tags into the word tag

5.6 Chapter Summary

The release of the first Brown corpus in 1964 represented the start of tag set design as a scheme for morphosyntactic annotation of corpora. Then, standards and guidelines for morphosyntactic annotation evolved. Eight Arabic tag sets are surveyed and compared in terms of purpose of design, characteristics, tag set size, and their applications. The most widely used and important morphosyntactic annotation standards and guidelines, the EAGLES guidelines, are designed for Indo-European languages. These guidelines are not entirely suitable for Arabic. Therefore, the design of the SALMA Tag Set applied the standards of traditional Arabic grammar instead. Many Arabic grammar books have been written. A collection of comprehensive, widely used and widely referenced traditional Arabic grammar books was used as the basic reference for morphosyntactic knowledge extraction. The SALMA Tag Set adds more fine-grained details to the existing tag sets. It encodes 22 morphological feature categories of the word’s morphemes, where attributes or values are specified by referring to the widely-referenced traditional Arabic grammar books.
Chapter 6 describes in detail the morphological feature categories and illustrates each feature and its possible values. The SALMA Tag Set applied the tag set design criteria proposed by Atwell (2008). The design criteria are dimensions; in effect, choices to be made by the designers of new part-of-speech tag sets. Throughout section 5.4, design decisions are investigated to handle each design dimension. Moreover, references to the existing Arabic tag sets showed the decisions made by these tag sets to handle each design dimension.

Part III: Proposed Standards for Arabic Morphological Analysis

Chapter 6 The SALMA – Tag Set

This chapter is based on the following sections of published papers: Sections 1 and 2 are based on section 4 from (Sawalha and Atwell Under review)

Chapter Summary

The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of Arabic, in a compact yet transparent notation. For a morphologically-rich language like Arabic, the part-of-speech tag set should be defined in terms of morphological features characterizing word structure. A detailed description of the SALMA – Tag Set explains and illustrates each feature and its possible values. In our analysis, a tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash “-” represents a feature not relevant to a given word. The first character shows the main part of speech, from: noun, verb, particle, punctuation, and Other (residual); these last two are an extension to the traditional three classes to handle modern texts. Characters 2, 3 and 4 are used to represent subcategories; traditional Arabic grammar recognizes 34 subclasses of noun (letter 2), 3 subclasses of verb (letter 3) and 22 subclasses of particle (letter 4).
Others (residuals) and punctuation marks are represented in letters 5 and 6 respectively. The next letters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), declension and conjugation (18). Finally, there are four characters representing morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), types of nouns according to their final letters (22). The SALMA – Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora.

6.1 The Theory Standard Tag Set Expounding Morphological Features

The SALMA – Tag Set is a general-purpose fine-grained tag set. The aim of this tag set is to be used by part-of-speech tagging software to annotate corpora with detailed morphological information for each word, and to enable direct comparisons between tagging algorithms and taggers using the same tag set. The tag set has been designed by grouping 22 morphological feature categories in one tag. Most of these morphological categories are described in any traditional Arabic language grammar book. In our study, all the morphological features are attested in five well-known traditional Arabic grammar books (Dahdah 1987; Dahdah 1993; Wright 1996; Al-Ghalayyni 2005; Ryding 2005). Table 6.1 shows the 22 morphological feature categories. The tag string consists of 22 characters. Each character represents a value or attribute which belongs to a morphological feature category. The position of the character in the tag string is important to identify the morphological feature category.
Each morphological feature category attribute is represented by one lowercase letter, which is still human-readable, such as: v in the first position to indicate verb, n in the second position to indicate a proper noun (name), and the gender category values in the seventh position: masculine represented by m, feminine represented by f and common gender represented by x. If a certain feature is not applicable to the word, then a dash ‘-’ is used to indicate this; e.g. mood is not a feature of nouns. In contrast, a question mark ‘?’ means that a certain feature belongs to a word but, at the moment, the feature value is not available or the automatic tagger could not guess it. The tag is intended to remain readable by linguists. Moreover, it can be rendered more readable if the interpretation of the tag string features is generated automatically: software can convert each position+letter to a human-readable English and/or Arabic grammar term.

Figures 6.1 and 6.2 show examples of two sentences tagged by the SALMA Tag Set. The first sentence is a newspaper text taken from the Arabic Treebank: tamma ‘i’dād al-waṯā’iqa al-mutawaffirati ḥawla ’awwali riḥlati ṭayyarānin ‘uṯmāniyyatin fawqa al-bilādi al-‘arabiyyati ‘Many available documents relate to the first Ottoman’s flight over the Arab countries’. The second sentence is taken from the Qur’an (chapter 29): wa waṣṣaynā al-’insāna biwālidayhi ḥusnan ‘We have enjoined on man kindness to parents’.
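The positional reading of a tag described above can be sketched as a small decoder. This is a hedged illustration, not the SALMA software: the function name `decode_tag` and the returned tuple layout are assumptions made here, the category names follow the 22 positions listed in this chapter, and the letters are reported as they stand rather than expanded into attribute names (the full attribute inventory is given in Appendix A).

```python
# The 22 feature categories, in tag-string order (positions 1 to 22).
FEATURE_CATEGORIES = (
    "Main part-of-speech",
    "Part-of-speech: noun",
    "Part-of-speech: verb",
    "Part-of-speech: particle",
    "Part-of-speech: other (residual)",
    "Punctuation marks",
    "Gender",
    "Number",
    "Person",
    "Inflectional morphology",
    "Case or mood",
    "Case and mood marks",
    "Definiteness",
    "Voice",
    "Emphasized and non-emphasized",
    "Transitivity",
    "Rational",
    "Declension and conjugation",
    "Unaugmented and augmented",
    "Number of root letters",
    "Verb root",
    "Noun finals",
)


def decode_tag(tag):
    """Return (position, category, letter, status) for each applicable feature.

    '-' marks a feature that is not applicable and is skipped;
    '?' marks an applicable feature whose value is not yet known.
    """
    if len(tag) != len(FEATURE_CATEGORIES):
        raise ValueError("a SALMA tag must have exactly 22 characters")
    decoded = []
    for position, (category, letter) in enumerate(zip(FEATURE_CATEGORIES, tag), start=1):
        if letter == "-":
            continue
        status = "unknown" if letter == "?" else "set"
        decoded.append((position, category, letter, status))
    return decoded


# Word tag of bimadῑnatῑ 'in my city' from Figure 5.13.
for row in decode_tag("nl----fs-vgki----tat-s"):
    print(row)
```

A fuller version would look each letter up in a per-position table of attribute names, which is what makes the rendering into English or Arabic grammar terms possible.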
Word | Morpheme | Gloss | Tag
wa waṣṣaynā ‘And We have enjoined’ | wa | And | p--c------------------
 | waṣṣay | Have enjoined | v-p---mpfs-s-amohvtt&-
 | nā | We | r---r-xpfs-s----hn----
al-’insāna ‘(on) man’ | al- | The | r---d-----------------
 | ’insāna | man | nq----ms-pafd---htbt-s
bi-wālidayhi ‘His parents’ | bi | To | p--p------------------
 | wālida | Parents | nu----md-vgki---htot-s
 | y | Both | r---r-xdts-s----------
 | hi | His | r---r-msts-k----------
ḥusnan ‘Kindness’ | ḥusn | kindness | ng----ms-vafi---ndst-s
 | an | | r---k------f----------

Figure 6.1 Sample of Tagged vowelized Qur’an text using the SALMA Tag Set

Word | Morpheme | Gloss | Tag
tamma ‘Accomplished’ | tamma | Accomplished | v-p---msts-f-amihdstb-
‘i’dādu ‘Preparing’ | ‘i’dādu | Preparing | ng----ms-vndi---?db3-s
al-waṯā’iqa ‘Documents’ | al | The | r---d-----------------
 | waṯā’iqa | Documents | nq----fb-vafd---ndbt-s
al-mutawaffirati ‘Available’ | al | The | r---d-----------------
 | mutawaffira | Available | nj----fs-vafd---ndtt-s
 | ti | | r---t-fs--------------
bi kaṯratin ‘In Many’ | bi | In | p--p------------------
 | kaṯra | Many | nj----fb-vgki----dat-s
 | tin | | r---t-fs--------------
ḥawla ‘About’ | ḥawla | About | nv----m--s-fi----nst-s
’awwali ‘First’ | ’awwali | First | n+----ms-vgki----dst-s
riḥlati ‘Trip’ | riḥla | Trip | no----fs-vgki----dat-s
 | ti | | r---t-fs--------------
tayyarānin ‘Flight’ | tayyarānin | Flight | ng----ms-vgki----dbt-s
uṯmāniyyat ‘Ottomani’ | uṯmān | Ottoman | n*----fs-pgki----daq-s
 | iyya | | r---y-----------------
 | t | tā’ marbūṭah | r---t-fs--------------
fawqa ‘Over’ | fawqa | Over | nv----m--s-fi----nst-s
al-bilādi ‘Countries’ | al | the | r---d-----------------
 | bilād | countries | nl----mb-vgkd---ndat-s
al-‘arabiyyati ‘Arabian’ | al | the | r---d-----------------
 | ‘arab | Arab | n*----fb-vgkd---hdst-s
 | iyya | | r---y-----------------
 | ti | tā’ marbūṭah | r---t-fs--------------

Figure 6.2 Sample of Tagged non-vowelized newspaper text using the SALMA Tag Set

The categories and features are drawn from traditional Arabic grammar books (Dahdah 1987; Dahdah 1993; Wright 1996; Al-Ghalayyni 2005; Ryding 2005). In most cases there is agreement among them, but in some cases there are discrepancies. When there is agreement, the approach taken is simply a matter of presenting the agreed features. When there is a discrepancy, in most cases the difference is that one text has more fine-grained subcategories which are merged in other texts; so the wider, more fine-grained sub-classification is adopted. The only significant disagreement is in the number of noun types (see section 6.2.2), and in that case we adopted the widest, most fine-grained subclassification system.

Arabic grammar terms used to describe the attributes of the morphological feature categories in the SALMA – Tag Set are the same terms used by traditional Arabic grammar. The equivalent English translations of these grammar terms were extracted from four well-known traditional Arabic grammar reference books written in English. These books are: Wright, W. (1996), Ryding, K. C. (2005), Dahdah, A. (1993) and Cachia, P. (1973). These reference books agree on translating general Arabic grammar terms such as noun, verb, adjective, person, number, case and mood. However, they do not agree on translating some fine-grained attribute names such as al-fi‘l as-sālim, which is translated as ‘the strong verb’ by Wright, W. (1996), ‘regular (sound) root’ by Ryding, K. C. (2005), ‘intact verb’ by Dahdah, A. (1993), and ‘sound verb; strong verb; verbum firmum’ by Cachia, P. (1973). The agreed English translations of the grammar terms were used directly. For the non-agreed English translations, Professor James Dickins (head of Arabic and Middle Eastern Studies, University of Leeds, UK) was consulted to give advice on those English translations of Arabic grammar terms that would be clearest to English-speaking linguists.
Appendix A lists the morphological feature categories and their attribute values at each of the 22 positions of the tag string.

6.2 The Morphological Features of the SALMA Tag Set

The SALMA Tag Set of Arabic merges 22 morphological features of Arabic into one compact morphological feature tag. The morphological feature categories used to construct the SALMA tags are listed in table 6.1 below. The following sub-sections 6.2.1 to 6.2.22 describe each morphological category and its attributes in more detail.

Table 6.1 Arabic Morphological Feature Categories

Position | Morphological Features Categories | Arabic term (transliterated)
1 | Main Part-of-Speech | ’aqsām al-kalām ar-ra’īsiyya
2 | Part-of-Speech: Noun | ’aqsām al-kalām al-far‘iyya (al-’ism)
3 | Part-of-Speech: Verb | ’aqsām al-kalām al-far‘iyya (al-fi‘l)
4 | Part-of-Speech: Particle | ’aqsām al-kalām al-far‘iyya (al-ḥarf)
5 | Part-of-Speech: Other (Residual) | ’aqsām al-kalām al-far‘iyya (’uẖrā)
6 | Punctuation marks | ’aqsām al-kalām al-far‘iyya (‘alāmāt at-tarqīm)
7 | Gender | al-muḏakkar wa al-mu’annaṯ
8 | Number | al-‘adad
9 | Person | al-’isnād
10 | Inflectional morphology | aṣ-ṣarf
11 | Case or Mood | al-ḥāla al-’i‘rābiyya lil-’ism ’aw al-fi‘l
12 | Case and Mood marks | ‘alāmāt al-’i‘rāb wa al-binā’
13 | Definiteness | al-ma‘rifa wa an-nakira
14 | Voice | al-mabnī lil-ma‘lūm wa al-mabnī lil-mağhūl
15 | Emphasized and non-emphasized | al-mu’akkad wa ḡayr al-mu’akkad
16 | Transitivity | al-lāzim wa al-muta‘addi
17 | Rational | al-‘āqil wa ḡayr al-‘āqil
18 | Declension and Conjugation | at-taṣrīf
19 | Unaugmented and Augmented | al-muğarrad wa al-mazīd
20 | Number of root letters | ‘adad ’aḥruf al-ğaḏr
21 | Verb root | bunya al-fi‘l
22 | Noun finals | ’aqsām al-’ismi tib‘an li-lafẓi ’āẖirhi

6.2.1 Main Part-of-Speech Categories

Generally, there is agreement among existing Arabic tag sets on the classification of main part-of-speech categories in traditional Arabic grammar books, e.g. (Dahdah 1987; Dahdah 1993; Wright 1996; Al-Ghalayyni 2005; Ryding 2005; ALECSO 2008a). Arabic language scholars classify Arabic words into three main part-of-speech categories, namely: nouns, verbs and particles. Khoja’s tag set added categories of punctuation marks and residuals. The punctuation marks used in Arabic are (! ؛ : ؟ - . ،). Others (residuals) include other non-Arabic words appearing in the text, such as currency, numbers or words in other languages. Figure 6.3 lists the attributes of the main part-of-speech category, which occupies the first character in the tag string.

Main Part-of-Speech:
Noun (n)
Verb (v)
Particle (p)
Punctuation mark (u)
Other (Residual) (r)

Figure 6.3 Main part-of-speech category attributes and letters used to represent them at position 1

6.2.2 Part-of-Speech Subcategories of Noun

A noun is defined as a word that has complete meaning and no tense associated with it. The Arabic concept of complete meaning corresponds approximately to content words, except that it also includes pronouns. Traditional Arabic grammar uses the concept of meaning to separate nouns and verbs from particles. This is roughly equivalent to content vs. function or lexical vs. grammatical in contemporary lexical terminology. This is not an exact correspondence, since pronouns (a grammatical category) are a subclass of nouns. Arabic linguists distinguish many kinds of nouns.
According to Dahdah (1987), nouns are classified into 21 kinds. Other classifications overlap. We classified nouns into 34 different types. Table 6.2 shows the 34 different types of nouns and examples of each type. Figure 6.4 shows the classification attributes of the noun part-of-speech category, which occupies the second character in the tag string.

Table 6.2 Noun types as classified in traditional Arabic grammar

No. | Noun type | Tag | Meaning and Examples
1 | Gerund / verbal noun, al-maṣdar | g | A noun which indicates a case or an action that is not related to time or tense. E.g. faraḥun ‘happiness’.
2 | Gerund / verbal noun with initial mῑm, al-maṣdar al-mῑmῑ | m | A noun which indicates a case or an action that is not related to time or tense. It has certain patterns which have the augmented letter mῑm at the beginning of the word. E.g. munqalib ‘turned over’, maw‘id ‘date’.
3 | Gerund of instance, maṣdar al-marra | o | A noun that describes an action that has taken place once. It is formed by adding the feminine termination (tā’ marbūṭah) to the verbal noun. E.g. waqfah ‘one stop’, ziyārah ‘a visit’.
4 | Noun of state, maṣdar al-hay’ah / maṣdar al-naw‘ | s | A noun that describes an action. It indicates the manner (state, character and representation) of the action expressed by the verb. It always has the form fi‘latun. E.g. mašā mišyata al-’asad ‘he walked like a lion’.
5 | Gerund of emphasis, maṣdar al-tawkῑd | e | A noun that emphasizes an action. E.g. ṣawwara allāhu al-ẖalqa taṣwῑran ‘God does shape the creatures’.
6 | Gerund of profession, al-maṣdar al-ṣinā‘ῑ | i | A noun which indicates an industry or profession. The gerund of industry ends with doubled yā’ followed by feminine tā’ marbūṭah.
7 | Pronoun, al-ḍamῑr | p | E.g. ’anā muğtahidun ‘I am a hard worker’, and mā ’iğtahada ’illā ’anā ‘no one worked hard except me’. There are 24 pronouns, classified into 12 nominative pronouns and 12 accusative pronouns. The nominative pronouns are: ’anā ‘I’, naḥnu ‘We’, ’anta ‘You’, ’anti ‘You’, ’antumā ‘You’, ’antum ‘You’, ’antunna ‘You’, huwa ‘He’, hiya ‘She’, humā ‘They’, hum ‘They’, and hunna ‘They’. See table 11. The accusative pronouns are: ’iyyāya ‘me’, ’iyyānā ‘us’, ’iyyāka ‘you’, ’iyyāki ‘you’, ’iyyākumā ‘you’, ’iyyākum ‘you’, ’iyyākunna ‘you’, ’iyyāhu ‘him’, ’iyyāhā ‘her’, ’iyyāhumā ‘them’, ’iyyāhum ‘them’, ’iyyāhunna ‘them’.
8 | Demonstrative pronoun, ’ism al-’išārah | d | A noun that indicates by a tangible sign a person, an animal, a thing or a place, such as ğā’ hāḏā ar-rağul ‘this man came’, and ra’aytu tayna al-fatātayn ‘I saw these two girls’.
9 | Specific relative pronoun, ’ism al-mawṣūl al-ẖāṣ | r | A group of nouns that connect two sentences to give a full meaning. The specific relative pronouns are affected by three morphological feature categories: number, gender and humanness. E.g. al-laḏῑ ‘who’ is a singular masculine human pronoun; al-latῑ ‘who’ is a singular feminine human pronoun; al-lawātῑ ‘who’ is a plural feminine human pronoun.
10 | Non-specific relative pronoun, ’ism al-mawṣūl al-muštarak | c | A group of nouns that connect two sentences to give a full meaning. The common relative pronouns are not affected by gender and number, so they have an invariable form. They are affected by the morphological feature of humanness. E.g. man ‘who’ is used for human nouns, mā ‘what’ is used for non-human nouns, and ḏā ‘what’ and ’ayyu ‘which’ are used for non-human nouns.
11 | Interrogative pronoun, ’ism al-’istifhām | b | A pronoun used to make a query or question about a thing or an action, e.g. man haḏā? ‘who is this?’, mā al-‘amal? ‘what shall we do?’. The nouns man ‘who’ and mā ‘what’ are interrogative nouns.
12 | Conditional noun, ’ism al-šarṭ | h | A noun which connects two sentences. It indicates that the action in the second sentence does not occur unless the action of the first sentence has occurred, e.g. ’ayyu tilmῑḏin yağtahid yanğaḥ ‘if any student studies hard, then he will succeed’. The noun ’ayyu ‘if any’ is a conditional noun.
13 | Allusive noun, al-kināyah | a | A noun which indicates a specific intention by means of unclear terms. These nouns are: ka’ayyi ‘Any’, kaḏā ‘So and so’, kam ‘How …’, kayta ‘So and so’, ḏayta ‘So and so’, fulān ‘someone’, biḍ‘u ‘few’, e.g. ka’ayyi ‘usfūr ’isṭadta ‘Like any bird you have hunted’. The word ka’ayyi ‘As any’ is a generalization.
14 | Adverb, aẓ-ẓarf | v | A noun which indicates the time or place of the action. It incorporates into its overall meaning a sense of relative locality in time or place, e.g. ḥῑna ‘when’, mudda ‘at a period of’, and ’amām ‘straight forward (direction)’.
15 | Active participle, ’ism al-fā‘il | u | A form that describes the doer of the action. This noun is derived from the action or the verb itself. E.g. kātibun ‘writer’. This noun is derived from the action of writing or the verb kataba ‘write’.
16 | Intensive active participle, mubālaḡat ’ism al-fā‘il | w | A noun which has the same basic meaning as the present participle ’ism al-fā‘il but indicates an augmentation of the meaning of the present participle. E.g. kattābun ‘writer’, which indicates that the writer writes a lot. kattābun is derived from the verb kataba ‘write’.
17 | Passive participle, ’ism al-maf‘ūl | k | A derived noun which indicates an abstract meaning that describes something or someone affected by an action. E.g. maksūrun ‘broken’. This noun is derived from the verb kasara ‘break’.
18 | Adjective, aṣ-ṣifa al-mušabbaha | j | A derived noun which indicates a meaning of firmness, i.e. the absolute existence of the quality in its possessor. E.g. al-ğundiyyu šuğā‘un ‘brave soldier’. The word šuğā‘un ‘brave’ describes the soldier. This word is an adjective.
19 | Noun of place, ’ism al-makān | l | A derived noun which indicates the place of an action. E.g. maṭbaẖun ‘kitchen’ indicates the place of cooking.
20 | Noun of time, ’ism zamān | t | A derived noun which indicates the time of the action or a verb. E.g. maḡribun ‘sunset’.
21 | Instrumental noun, ’ism al-’ālah | z | A derived noun which indicates a tool used to do some work. E.g. miftāḥun ‘key’, minšār ‘saw’, and miṣbāḥ ‘light’.
22 | Proper noun, ’ism al-‘alam | n | The name of a dedicated or specific instance in a group or type. E.g. ẖālidun ‘Khalid’, ‘abdu allāhi ‘Abdullah’, bayrūt ‘Beirut (the capital city of Lebanon)’.
23 | Generic noun, ’ism al-ğins | q | Indicates what is common to every element of the genus without being specific to any one of them. E.g. kitābun ‘book’, rağul ‘man’, and bayt ‘home’.
24 | Numeral, ’ism al-‘adad | + | A noun that indicates the quantity and order of countable nouns by transferring the numbers into the correct form of Arabic words. E.g. rağulun wāḥidun ‘one man’, rağulāni ’iṯnāni ‘two men’, ṯalāṯatu riğālin ‘three men’. The words wāḥid, ’iṯnāni and ṯalāṯah ‘one’, ‘two’ and ‘three’ are cardinal numeral nouns.
25 | Verb-like noun, ’ism al-fi‘l | & | A noun which acts as a verb in its meaning. It indicates time of action, e.g. šattāna ‘how different they are!’, hayhāt ‘but oh! far from the mark!’ and ba‘uda ‘far away’.
26 | The five nouns, al-’asmā’ al-ẖamsah | f | The five nouns are a group of five nouns belonging to the category of noun of genus. However, unlike standard nouns, which have three root letters, each of these nouns has only two root letters, the third root letter being deemed to have been deleted. The five nouns are ’abun ‘father’, ’aẖun ‘brother’, ḥamun ‘father in law’, fū (fam) ‘mouth’, and ḏū ‘owner’.
27 | Relative noun, ’ism mansūb | * | A declinable noun which has the suffix -iyy. It indicates affiliation of something to this noun. E.g. ’urduniyyun ‘Jordanian’ (i.e. affiliated to Jordan).
28 | Diminutive, ’ism taṣḡīr | y | A declinable noun which has the sound -ai- after its second root letter. It indicates paucity, contempt or affection. E.g. duraihimāt ‘a few dirhams’, šuway‘ir ‘poetaster’, and bunayya ‘my (little) son’.
29 | Form of exaggeration, ṣῑḡat al-mubālaḡah | x | It indicates exaggeration of the quality of the qualified noun and occurs as a derived noun with the basic meaning of the present participle. E.g. zarrā‘ ‘a very good cultivator’.
30 | Collective noun, ’ism ğam‘ | $ | A noun which indicates two or more. A singular form cannot be derived from this kind of noun. E.g. ğayš ‘army’, the corresponding singular being ğundῑ ‘a soldier’, or ẖayl ‘horses’, the corresponding singular being faras ‘a horse’.
31 | Plural collective noun, ’ism ğins ğam‘ī | # | A noun of genus where the singular and plural share the same basic form in meaning and pronunciation. The singular form is distinguished by adding the feminine tā’ marbūṭah or the relative suffix -ῑ. E.g. zahr (zahrah) ‘flowers’ (‘a flower’), and ‘arab (‘arabῑ) ‘Arabs’ (‘an Arab’).
32 | Elative noun, ’ism tafḍῑl | @ | A derived noun used for the comparative and superlative when comparing persons or things. E.g. al-’asadu ’aqwā mina ar-rağuli ‘The lion is stronger than the man’. The noun ’aqwā ‘stronger’ is used for comparing the strength of the lion and the man.
33 | Blend noun, ’ism manḥūt | % | This consists in composing a single word by the fusion of two or more words, so that some letters are dropped from each word, on condition that the resulting form has an authentically acceptable pronunciation and meaning. E.g. ğa‘falu ‘Could I but sacrifice myself for you’, composed from the words ğa‘altu fidāka (same meaning).
34 | Ideophonic interjection, ’ism ṣawt | ! | A noun improvised by human spontaneity and used initially as a verbal noun to talk to animals and small children, e.g. āh “Oh”, hāl used for horses.

Noun
  Non-inflected nouns: Pronoun (p); Demonstrative pronoun (d); Relative pronoun (r, c); Conditional noun (h); Interrogative pronoun (b); Allusive noun (a); Adverb (v); Numeral (+)
  Inflected nouns
    Derived nouns: Passive participle (k); Active participle (u); Form of exaggeration (x); Adjective (j); Noun of place (l); Elative noun (@); Instrumental noun (z); Noun of time (t)
    Primitive noun
      Concrete noun, which has the following sub-types: 1- Proper noun (n); 2- Generic noun (q); 3- Some nouns of place (l); 4- Some instrumental nouns (z)
      Abstract noun, which has the following sub-types: 1- Stripped gerund / verbal noun (g); 2- Some gerunds / verbal nouns with initial mῑm (m)
  Augmented gerund / verbal noun
  Origin of derived words: Stripped perfect verb; Stripped gerund / verbal noun (g)

Figure 6.4 The classification attributes of noun part-of-speech subcategories with letter at position 2.

6.2.3 Part-of-Speech Subcategories of Verb

A verb is defined as a word that indicates a meaning by itself which is united with a tense or time; verbs take words or affixes as indicators, such as the particles qad and sawfa, suffixed pronouns, or the prefixes /s/, /t/, /n/ (Al-Ghalayyni 2005). Verbs can be classified according to tense and morphological form into three groups. Table 6.3 shows the 3 attributes of the part-of-speech subcategories of verbs with their definitions and examples of each attribute. Figure 6.5 below shows the subcategories of the verb, represented at position 3 of the tag string.

Verb:
Perfect verb (p)
Imperfect verb (c)
Imperative verb (i)

Figure 6.5 Part-of-Speech subcategories of verb, with letter at position 3

Table 6.3 Verb types as classified by Arab grammarians

Verb type | Tag | Meaning and Examples
Perfect verb, al-fi‘l al-māḍῑ | p | Indicates that the action occurred in the past. E.g. kataba aṭ-ṭālibu ad-darsa ‘The student wrote the lesson’. The verb kataba ‘wrote’ is a perfect verb.
Imperfect verb, al-fi‘l al-muḍāri‘ | c | Indicates an action or case in the progressive tense, or an action that occurs at the time of speaking. E.g. yatakallamu ‘someone is talking now’.
Imperative verb, fi‘l al-’amr | i | Indicates a required action in the future, or a request (order) to do an action. E.g. ’uktub ‘write’ as a request or order.

6.2.4 Part-of-Speech Subcategories of Particles

Particles are classified in two broad categories. The first category is non-meaningful particles, ḥurūf al-mabānῑ, or alphabet letters.
"5 Derived n': nouns X:?\ 1- Stripped gerund / verbal noun (g) "%Y: @5: 2- Some gerunds /verbal noun with initial mῑm (m) -:: @": m6 Origin of derived words Stripped Perfect verb "%Y: D: S n': S)< Stripped gerund / verbal noun (g) "%Y: @5: - Figure 6.4 The classification attributes of noun part-of-speech subcategories with letter at position 2. 6.2.3 Part-of-Speech Subcategories of Verb A verb is defined as a word that indicates a meaning by itself which is united with a tense or time; verbs takes words or affixes as indicators such as the particles 5 qad, 3' sawfa , or suffixed pronouns or the prefixes v /s/, ` /t/, k /n/ (Al-Ghalayyni 2005). Verbs can be classified according to tense and morphological form into three groups. Table 6.3 shows the 3 attributes of the part-of-speech subcategories of verbs with their definition and examples of each attribute. Figure 6.5 below shows the subcategories of the verb, represented at position 3 of the tag string. - 134 - Verb S Perfect verb (p) D: S Imperfect verb (c) `@c: S Imperative verb (i) %\ S Figure 6.5 Part-of-Speech subcategories of verb, with letter at position 3 Table 6.3 Verb types as classified by Arab grammarians Verb types Perfect verb D: S al-fi’l al-māḍῑ Imperfect verb `@c: S al-fi’l al-muḍāri’ Imperative verb %\ S fi’l al-‘amr T Meaning and Examples p Indicates the occurrence of an action is in the past. E.g. p@5 8 q + ++ kataba aṭ-ṭāilbu ad-darsa ‘The student + wrote the lesson’. The verb + ++ kataba ‘wrote’ is a perfect verb. c Indicates an action or case in the progressive tense or the action occurs at the time of speaking. E.g. H8 + M+ +(+& yatakallamu ‘someone is talking now’. i Indicates a required action in the future, or a request (order) to do an action. E.g. , 88 ’uktub ‘write’ as a request or order. 6.2.4 Part-of-Speech Subcategories of Particles Particles are classified in two broad categories. The first category is non-meaningful particles ¢2 m 3 ḥurūf al-mabānῑ or alphabet letters. 
From these alphabet letters Arabic words are constructed. The second category is meaningful particles ḥurūf al-ma’ānῑ. They are words which do not belong to noun or verb, but they add specific meaning to the noun or verb in a sentence, or they connect two or more sentences. They are also classified according to their ‘effect’ on nouns or verbs into two groups: governing particles ḥurūf ‘āmilah, which affect the form of the following noun or verb; and non-governing particles ḥurūf ḡayr ‘āmilah, which do not affect the form of the following nouns or verbs (Dahdah 1987; Dahdah 1993). Governing particles affect the following noun or verb by changing the mood of the verb or the case of the noun. They affect the verb by changing its mood to jussive, subjunctive or partially subjunctive, and they affect the case of the noun in the genitive, vocative or exception. Conjunctions ḥurūf al-‘aṭf affect both nouns and verbs. Table 6.4 shows definitions and examples of the 22 subcategories of particles. Figure 6.6 shows the particles category attributes, represented at position 4 of the tag string.

[Figure 6.6, tree diagram. Particles divide into meaningful particles and non-meaningful particles; meaningful particles divide into governing particles and non-governing particles. Governing particles that affect the verb: Jussive-governing particles (j), Subjunctive-governing particles (o), Partially subjunctive-governing particles (u). Governing particles that affect the noun: Prepositions (p), Annulling particles (a), Vocative particles (v), Exceptive particles (x). Governing particles that affect both: Conjunctions (c).]
Figure 6.6 Subcategories of Particle, with letter at position 4

Table 6.4 Examples of part-of-speech category attributes
1 Jussive-governing particle (j), ḥarf ğazim: A group of particles that have the meaning of negation and prevention. They govern a following imperfect verb in the jussive mood. E.g. lā tay’as min raḥma al-lā ‘Do not give up God’s mercy’.
2 Subjunctive-governing particle (o), ḥarf naṣib: A group of particles that govern a following imperfect verb in the subjunctive mood. Mainly used for conditions. E.g. ği’tu likay at‘allama ‘I came to study’.
3 Partially subjunctive-governing particle (u), ḥarf naṣib far‘ῑ: A group of particles that govern a following imperfect verb in the subjunctive mood through an implicit ’an. E.g. muqāwamatuka al-‘aduwwa ṯumma tantaṣira faẖrun ‘aẓῑmun ‘your resistance to the enemy, then your victory, are the source of a great pride’.
4 Preposition (p), ḥarf ğarr: A group of particles that govern a following noun in the genitive case. This group consists of true and fundamental markers of location and direction. E.g. darastu ’ilā almasā’i ‘I studied up to the night’.
5 Annulling particle (a), ḥarf nāsiẖ: A group of particles that ‘intervene’ in the nominal sentence and induce a change in the case of the following noun. These particles include ’inna wa ’aẖawātihā ‘indeed and its sisters’, lā an-nāfiyah lil-ğins ‘generic negative lā’ and mā wa ’aẖawātihā ‘mā and its sisters’. E.g. ’inna aṭ-ṭaqsa ğamῑlun ‘Indeed, the weather is nice’.
6 Conjunction (c), ḥarf ‘aṭf: A group of particles used to connect elements of equal status in pronunciation or in meaning. This group includes ten conjunctions. E.g. ğā’a ‘aliyyun wa ẖālidun ‘Ali and Khalid came’.
7 Vocative particle (v), ḥarf nidā’: A group of particles used to call or alert the person addressed. There are eight vocative particles. A noun preceded by a vocative particle is called a vocative noun. E.g. ’ayā ṭālibu ’istami‘ ‘Oh student, listen’.
8 Exceptive particle (x), ḥarf ’istiṯnā’: A group of particles used to exclude the following noun from the scope of the words before it. E.g. ğā’ at-talāmῑḏu ’illā samῑran ‘The students came except Samir’.
9 Interrogative particle (i), ḥarf ’istifhām: A group of particles used to elicit understanding, conception or approval. This group includes three interrogative particles. The noun which follows an interrogative particle is called an interrogative noun. E.g. hal ğā’ zaydun? ‘Did Zaid come?’
10 Particle of futurity (f), ḥarf ’istiqbāl: A group of particles which modify the verb tense from the present to the future. The particles of futurity include the letter sῑn and the particle sawfa, both meaning ‘will’. E.g. sawfa ’a‘ūdu ‘I will come back’.
11 Causative particle (s), ḥarf ta‘lῑl: A group of particles used to express and confirm the logic of an argument. These eight particles are: ’iḏ ‘since’, ḥattā ‘in order to’, ‘alā ‘on’, ‘an ‘about’, fῑ ‘in’, kay ‘so that’, lām ‘so that’, min ‘from’. E.g. ’udrus ḥattā tanğaḥ ‘Study in order to succeed’.
12 Negative particle (n): A group of particles used to negate the proposition expressed after them, or to deny its affirmation. There are eight negative particles.
These particles, ḥarf nafῑ, are: ’in ‘not’ (with the more standard sense of ‘if’), kallā ‘never’, lam ‘not (in the past)’, lammā ‘not yet’, lan ‘not (in the future)’, lā ‘not’, lāta ‘not’, mā ‘not’. E.g. lammā ya’tῑ al-qiṭāru ‘The train has not (yet) arrived’.
13 Jurative particle (q), ḥarf qasam: A group of particles used to swear by the divine majesty or by another feature. There are four jurative particles: bā’, tā’, lām, wāw. E.g. bi-allāhi la-’af‘alanna ‘By God I will surely do it’.
14 Yes/No response particle (w), ḥarf ğawāb: A group of particles used to reply to an invocation, a question, a statement, a correspondence or an objection. There are eleven response particles: ’ağal ‘yes’, ’iḏan ‘in that case’, ’iḏan ‘then’, ’ῑ ‘yes’, balā ‘yes’, ğalal ‘yes’, ğayr ‘yes’, fā’, lām, lā ‘no’, na‘am ‘yes’. E.g. ’iḏan anta nāğiḥun ‘Then you have succeeded’.
15 Jussive-governing conditional particle (k), ḥarf šart ğāzim: A group of particles used to express the occurrence of one event in connection with another one. There are two jussive-governing conditional particles: ’iḏ mā ‘whenever’ and wa ’in ‘even if’. E.g. ’iḏ mā tata‘allam tataqaddam ‘Whenever you learn, you will progress’.
16 Incitement particle (m), ḥarf taḥḍῑḍ: A group of particles used to request something with force, incitement, and harassment. There are five incitement particles: ’alā ‘is it (etc.) not’, ’allā ‘lest’, lawlā ‘were it (etc.) not’, lawmā ‘if it were (etc.) not’, hallā ‘is it (etc.) not’. E.g. hallā taqūmu bi wāğibika ‘Will you not carry out your duty’.
17 Gerund-equivalent particle (g): A group of particles used to ‘intervene’ in a sentence which can be replaced by a gerund.
These four particles, ḥarf maṣdarῑ, are: hamzah, ’an ‘that’, kay ‘so’, law ‘if’. E.g. ’uḥibbu ’an aẖdima waṭanῑ ‘I like to serve my country’.
18 Particle of attention (t), ḥarf tanbῑh: A group of particles used to clarify the matter for the orientation of the alert listener. There are two attention particles: alā ‘is it not’, and hā’ ‘attention’. E.g. yā’ayyuhā ar-rağulu al-mu‘allimu ḡayrahu ‘I call on you, man who teaches others’.
19 Emphatic particle (z), ḥarf tawkῑd: A group of particles used to emphasise intention and to consolidate a pledge. There are eight emphatic particles: ’ammā ‘as for’, ’an ‘that’, ’inna ‘indeed’, bā’, ‘alā ‘on’, kāf, nūn, nna. E.g. ’inna aṭ-ṭaqsa ğamῑlun ‘Indeed, the weather is nice’.
20 Explanatory particle (d), ḥarf tafsῑr: A group of particles used to clarify the meaning of a word, to discover the purpose of a question and to interpret it. There are two explanatory particles: ’an ‘that’, and ’ay ‘that is’. E.g. haḏā ‘asğadun ’ay ḏahabun ‘This is a precious metal, that is gold’.
21 Particle of comparison (l), ḥarf tašbῑh: A group of particles used to liken one thing to another, but not in the same way as a metaphor. There are two particles of comparison: kāf, and ka’anna ‘as if’.
22 Non-governing particle (b), ḥarf ḡayr ‘āmil.
E.g. ‘allamanī ‘he taught me’: the nūn of protection appears between the perfect verb ‘allama and the object suffixed pronoun -ī ‘me’.
A morpheme that is attached to the end of the verb to add emphasis to the word by adding the letter nūn or a doubled nūn (nūn-nūn).
One of a group of morphemes attached at the beginning of the verb stem which mark the verb as being imperfect (or progressive) rather than perfect.
11 Definite article (d), ’adāt ta‘rῑf: A ‘definiteness particle’, added to the beginning of nouns or adjectives to make them definite rather than indefinite.
12 Masculine sound plural letters (m), ḥurūf ğam‘ al-muḏakkar as-sālim: A morpheme that is attached to the end of singular nouns or adjectives to form sound plurals. These letters are used to derive the masculine plural.
13 Feminine sound plural letters (l), ḥurūf ğam‘ al-mu’annaṯ as-sālim: A morpheme that is attached to the end of singular nouns or adjectives to form sound plurals. These letters are used to derive the feminine plural.
14 Dual letters (u), ḥurūf al-muṯannā: A morpheme that is attached to the end of singular nouns or adjectives to derive the dual noun or adjective. To derive the feminine dual, these letters must be preceded by the feminine letter tā’.
15 Imperative prefix (i), ḥarf al-’amr: A morpheme that is attached at the beginning of the verb stem and changes it from a perfect to an imperative verb.

[Figure 6.7, word-structure diagram. Labels in order: WORD (الكلمة). Proclitic(s): Definite article (d), Prepositions*, Conjunctions*, Interrogative particles*, Particles of futurity* (* belong to Particles). Prefix(es): Prefix (p), Imperfect prefix (a), Imperative prefix (i). Stem. Suffix(es): Suffix (s), Relative yā’ (y), Emphatic nūn (z), nūn of protection (n), Dual letters (u), Masculine sound plural letters (m), Feminine sound plural letters (l). Enclitic(s): Suffixed pronouns (r), tanwῑn (k), tā’ marbūṭah (t), tā’ of feminization (f).]
Figure 6.7 The word structure and the residuals that belong to each part of the word, with letter at position 5

6.2.6 Part-of-Speech Subcategories of Punctuation Marks
Punctuation appears in most Arabic texts.
Punctuation marks include: full stop, comma, colon, semi-colon, parentheses, square brackets, quotation mark, dash, question mark, exclamation mark, ellipsis and continuation mark. “Punctuation usage in original Arabic text is characterized by a great deal of fluidity” (Khafaji 2001). Figure 6.8 shows the punctuation marks that are used in Arabic text. Table 6.6 lists the 12 subcategories of punctuation marks and their use. The part-of-speech category of punctuation marks is represented at the sixth position of the tag string.

[Figure 6.8, diagram. Punctuation marks: Full stop . (s); Comma ، (c); Parentheses ( ) (p); Quotation mark " " (t); Dash (d); Colon : (n); Exclamation mark ! (e); Ellipsis mark … (i); Semi-colon ؛ (l); Square brackets [ ] (b); Question mark ؟ (q); Continuation mark = (f).]
Figure 6.8 Punctuation marks used in Arabic, with letters at position 6

Table 6.6 Subcategories of punctuation and examples of their attributes
1 Full stop ( . ) (s), nuqṭah: A full stop is used at the end of a paragraph, or after the meaning is completed. E.g. ṭala‘at aš-šamsu. ‘The sun has risen.’
2 Comma ( ، ) (c), fāṣilah: A comma is used after the vocative and to separate phrases or clauses. E.g. yā rağulu, ’innaka muhaddadun bilkhaṭar ‘Hey man, you are in danger.’
3 Colon ( : ) (n), nuqṭatān: A colon is used after reported speech. E.g. qāla: ’anā ḏāhibun. ‘He said: I am leaving.’
4 Semi-colon ( ؛ ) (l), fāṣilah manqūṭah: A semi-colon is used between two linked clauses, e.g. if one is the cause of the other. E.g. ‘alimtu ’annahu qadimun; wahal yu’qalu ’allā ya’tῑ? ‘I knew that he is coming; is it possible that he is not coming?’
5 Parentheses ( ( ) ) (p), qawsān: Parentheses are used around numbers, and sometimes for limitations. E.g. ğā’ (8) nisā’ ‘8 women have come’.
6 Square brackets ( [ ] ) (b), qawsān ḥāṣiratān: Square brackets are used for limitation, and are also used around a sentence added to a quotation. E.g. qāl al-ma‘rrῑ: “haḏā ğanāhu ’abῑ ‘alayya [ma‘ ’anna alğunāta ‘alyhi kuṯurun] wamā ğanaytu ‘lā ’aḥad”. ‘Al-Ma’arry said: “This is what my father did to me [although many people hurt him] and I have never hurt anybody”.’
7 Quotation mark ( " " ) (t), ‘alāmatu ‘iqtibās: Quotation marks are used for quotations without changing the original text. E.g. qāl ğubrān: “ta‘almtu aṣ-ṣmta mina aṯ-ṯarṯār…” ‘Jubran said: “I learnt how to be silent from a talkative person”.’
8 Dash ( – ) (d), šarṭah mu‘tariḍah: A dash is used at the beginning and end of a parenthetical clause. It is also used when the speaker changes. E.g. mā ’ismuka? – ’ismῑ samῑrun ‘What’s your name? – My name is Samir’.
9 Question mark ( ؟ ) (q), ‘alāmatu ’istifhām: A question mark is used after a question. E.g. mā ’ismuka? ‘What’s your name?’
10 Exclamation mark ( ! ) (e), ‘alāmatu ta’ağğub: An exclamation mark is used after an exclamation. E.g. mā ’ağmala ar-rabῑ‘a! ‘What a beautiful spring!’
11 Ellipsis mark ( ... ) (i), ‘alāmatu ḥaḏf: An ellipsis mark is used to mark an elided word or phrase in a text. E.g. ğā’ al-mu‘alimu wa bada’a … ‘The teacher came and started …’
12 Continuation mark ( = ) (f), ‘alāmatu at-tabi‘yyah: A continuation mark is used in a footnote to indicate that the text is continued on another page.

6.2.7 Morphological Feature of Gender
Arabic classifies nouns according to gender into three classes (footnote 50): nouns which are only masculine muḏakkar, nouns which are only feminine mu’annaṯ, and nouns which are both masculine and feminine (common gender or neuter gender) muḏakkar ’aw mu’annaṯ, such as milḥ ‘salt’ and rūḥ ‘spirit’ (Wright 1996). Figure 6.9 shows the morphological feature of gender subcategories. Table 6.7 lists the 3 subcategories, with examples of masculine, feminine and common gender words. The morphological feature of gender is represented at position 7 in the tag string.
50 According to Wright’s (1986) classification. Ryding (2005) classifies nouns according to gender into two classes, masculine and feminine, and the “dual gender noun” is mentioned in a footnote on page 119.

Table 6.7 Examples of gender category attributes for nouns, verbs, adjectives and pronouns
1 Masculine (m), muḏakkar. Noun: kitāb ‘book’. Verb: yaktubūn ‘they are writing’ (pl., masc.). Adjective: kātib ‘writer’ (sing., masc.). Pronoun: huwa ‘he’.
2 Feminine (f), mu’annaṯ. Noun: maktabah ‘library’. Verb: taktubῑn ‘you are writing’ (sing., fem.). Adjective: kātibah ‘writer’ (sing., fem.). Pronoun: hiya ‘she’.
3 Common gender (x), muḏakkar ’aw mu’annaṯ. Noun: milḥ ‘salt’. Verb: naktubu ‘we are writing’ (pl., masc. or fem.). Adjective: nā’ib ‘parliament member’ (sing., masc. or fem.; footnote 51). Pronoun: humā ‘they’ (dual).

[Figure 6.9, tree diagram. Gender: Masculine (m), with natural masculine and non-natural masculine; Feminine (f), with natural feminine and non-natural feminine; Common gender (x).]
Figure 6.9 Arabic classification of nouns according to gender, with letter at position 7

Morphologically, the masculine form is the simplest and most basic shape (word structure), whereas feminine nouns usually have a suffix that marks their gender. Semantically, on the other hand, nouns are arbitrarily classified into masculine or feminine, except where a noun refers to a human being or other creature, when it normally conforms to natural gender (Ryding 2005).
Therefore, we can distinguish between two types of the morphological feature of gender that nouns can indicate. Semantic gender is where nouns indicate the natural gender of humans, animals or things (male or female), whether the gender is a true characteristic of the human being or animal, or is figurative for things that do not have natural gender. Morphological gender is defined by whether the noun is in its simplest form or has a feminine suffix attached to it. Discussion of the detailed classification of the morphological feature of gender into morphological gender and semantic gender is beyond the scope of this thesis.
51 Recently the word nā’ib has come to be used for both masculine and feminine, as the regular feminine form of this word, nā’ibah, means ‘disaster’, which is not suitable to indicate a female parliament member.

6.2.8 Morphological Feature of Number
Singular, dual and plural are the number morphological features identified in traditional Arabic grammar books. Singular applies to one entity of a category, dual applies to two entities of a category, and plural applies to three or more entities. Number applies to nouns, adjectives, pronouns and verbs (i.e. the doer or the subject of the verb). Other morphological categories, namely gender and rationality, affect the formation of the plural of nouns, particles or adjectives (Ryding 2005). Table 6.8 gives examples of singular, dual and plural words.
We distinguish between two types of plural: the sound plural ğam‘ sālim and the broken plural ğam‘ taksῑr. Sound plurals take specific suffixes to form the plural of certain masculine and feminine nouns. Broken plurals of nouns, by contrast, do not follow regular rules but take one of a number of templatic patterns. For instance, the word kitābun ‘book’ has the plural kutubun ‘books’, following the templatic pattern fu‘ulun.
Broken plurals are formed by adding letters to the singular form, by deleting letters from the singular form, or by changing the short vowels of the singular form. The plural of paucity ğam‘ qillah indicates few instances of a certain entity or type, while the plural of multitude ğam‘ kaṯrah indicates any number of instances more than three of a certain entity or type. The ultimate plural muntahā al-ğumū‘ is a kind of plural of multitude, but it follows only certain patterns. The ultimate plural has an infix ’alif added to generate the broken plural from its corresponding singular noun, followed by two consonants, or by three consonants where the middle letter is silent (not followed by a vowel). Sometimes a broken plural can be further pluralized by a sound plural. If the broken plural is rational then the plural takes masculine plural suffixes, while, if it is an irrational broken plural, the feminine plural suffix is used to form the plural of the plural ğam‘ al-ğam‘, e.g. buyūtāt ‘houses’, which is formed by adding the feminine plural suffix āt to the broken plural buyūt ‘houses’, which has the singular bayt ‘house’.
The category ‘undefined’ in the parser indicates cases where it is hard to guess the morphological feature of number of a particular word. For example, in the sentence kataba aṭ-ṭālibu ad-darsa ‘the student wrote the lesson’, the verb kataba ‘wrote’ is singular and there is agreement between the verb and the subject of the sentence aṭ-ṭālibu ‘the student’, which is also singular. On the other hand, in the sentence kataba aṭ-ṭālibān ad-darsa ‘the two students wrote the lesson’, the verb kataba ‘wrote’ is singular while the subject aṭ-ṭālibān ‘the two students’ is dual. The sentence kataba aṭ-ṭullābu ad-darsa ‘the students wrote the lesson’ similarly has no agreement in number between the singular form of the verb kataba ‘wrote’ and the plural form of the subject aṭ-ṭullābu ‘the students’. The attribute ‘undefined’ is added to the number category of the verb to mark these cases. Table 6.8 shows examples of the number category of nouns, verbs, adjectives and pronouns, and illustrates the effects of gender and humanness on the formation of the plural. Figure 6.10 shows the attributes of the morphological feature of number, represented at position 8 in the tag string.

[Figure 6.10, tree diagram. Number: Singular (s); Dual (d); Sound plural (p); Broken plural (b); Undefined (x); Plural of paucity (m); Plural of multitude (j); Ultimate plural (u); Plural of plural (l).]
Figure 6.10 Morphological feature of number category attributes, with letter at position 8

Table 6.8 Examples of the morphological feature category of Number
Singular (s). Noun: qalamun ‘pen’ (masculine); waraqah ‘paper’ (feminine). Verb: qara’a ‘he read’; qara’at ‘she read’. Adjective: ğamῑl ‘beautiful’ (masculine, singular); ğamῑlah ‘beautiful’ (feminine, singular). Pronoun (footnote 52): huwa ‘he’; hiya ‘she’.
Dual (d). Noun: qalamāni ‘two pens’ (masculine); waraqatāni ‘two papers’ (feminine). Verb: yaqra’āni ‘they (two) are reading’ (masculine); taqra’āni ‘they (two) are reading’ (feminine). Adjective: ğamῑlāni ‘beautiful’ (masculine, dual); ğamῑlatān ‘beautiful’ (feminine, dual). Pronoun: humā ‘they’ (common gender, dual).
Sound plural (p). Noun: murāsilūn ‘agents’ (masculine); murāsilāt ‘agents’ (feminine). Verb: yaqra’ūn ‘they are reading’ (masculine); yaqra’na ‘they are reading’ (feminine). Adjective: ğamῑlūn ‘beautiful’ (masculine, plural); ğamῑlāt ‘beautiful’ (feminine, plural). Pronoun: none.
Broken plural (b). Noun: nisā’ ‘women’; ‘arab ‘Arabs’. Verb: none. Adjective: kibār ‘senior’ (masculine, plural). Pronoun: hum ‘they’ (M); hunna ‘they’ (F).
Plural of paucity (m). Noun: ’abwābun ‘doors’. Verb, adjective, pronoun: none.
Plural of multitude (j). Noun: kutubun ‘books’. Adjective: rukka‘un ‘people who bow to the ground’. Verb, pronoun: none.
Ultimate plural (u). Noun: masāğid ‘mosques’. Verb, adjective, pronoun: none.
Plural of plural (l). Noun: riğālāt ‘men’. Verb, adjective, pronoun: none.
Undefined (x). Verb: kataba aṭ-ṭālibu ad-darsa ‘the student wrote the lesson’; kataba aṭ-ṭālibān ad-darsa ‘the two students wrote the lesson’; kataba aṭ-ṭullābu ad-darsa ‘the students (plural) wrote the lesson’. Noun, adjective, pronoun: none.
52 The number category applies to pronouns. They can be classified into singular, dual, and broken plural even though they are not templatic.

6.2.9 Morphological Feature of Person
Arabic has three main person attributes: first person al-mutakallim, second person al-muẖāṭab and third person al-ḡā’ib. First person refers to the person or people speaking. The second person refers to the person or people who are present and sharing the talk or speech. The third person refers to the person or people who are absent and do not participate in the talk or speech (Ryding 2005). The person category is affected by other morphological feature categories, namely gender and number. Thirteen personal pronouns and verb forms of the person category, which are affected by gender and number, can be distinguished.
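As a quick check on the count of thirteen, the person, gender and number combinations described in this section can be enumerated. The following Python sketch is illustrative only and is not part of the analyser described in this thesis; the function name and the tuple representation are assumptions made for the example.

```python
# Illustrative enumeration of person/gender/number combinations.
# "common" marks forms that do not distinguish masculine from feminine.

def person_forms():
    forms = []
    # First person: no gender distinction; the plural also serves as dual.
    forms += [("first", "common", "singular"),
              ("first", "common", "dual/plural")]
    # Second person: gender is distinguished in singular and plural, not in dual.
    forms += [("second", "masculine", "singular"),
              ("second", "feminine", "singular"),
              ("second", "common", "dual"),
              ("second", "masculine", "plural"),
              ("second", "feminine", "plural")]
    # Third person: gender is distinguished in singular, dual and plural.
    forms += [("third", gender, number)
              for number in ("singular", "dual", "plural")
              for gender in ("masculine", "feminine")]
    return forms


if __name__ == "__main__":
    for person, gender, number in person_forms():
        print(person, gender, number)
    print(len(person_forms()))  # 2 + 5 + 6 = 13
```

Listing the combinations explicitly, rather than taking a full cross product of three persons, two genders and three numbers, reflects the fact that several cells of that product share one form.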
There is no gender distinction in the first person, which has two forms: singular, and plural, which is used as dual as well. There are five forms of second person: masculine singular, feminine singular, dual (masculine or feminine), masculine plural and feminine plural. The third person distinguishes between six forms of personal pronouns or verbs: masculine singular, feminine singular, masculine dual, feminine dual, masculine plural and feminine plural (Ryding 2005). Table 6.9 shows the three main category attributes of person and how they are affected by the gender and number categories, with examples of both verbs and personal pronouns. Figure 6.11 shows the attributes of the morphological feature of person, represented at position 9 in the tag string.

Table 6.9 The three main attributes of person category with examples
Singular, masculine. First person (f): ’anā ‘I’, katabtu ‘I wrote’. Second person (s): ’anta ‘you’, katabta ‘you wrote’. Third person (t): huwa ‘he’, kataba ‘he wrote’.
Singular, feminine. First person (f): as for the masculine. Second person (s): ’anti ‘you’, katabti ‘you wrote’. Third person (t): hiya ‘she’, katabat ‘she wrote’.
Dual, masculine. First person (f): naḥnu ‘we’, katabnā ‘we wrote’. Second person (s): ’antumā ‘you’, katabtumā ‘you wrote’. Third person (t): humā ‘they’, katabā ‘they wrote’.
Dual, feminine. First and second person: as for the masculine. Third person (t): humā ‘they’, katabatā ‘they wrote’.
Plural, masculine. First person (f): naḥnu ‘we’, katabnā ‘we wrote’. Second person (s): ’antum ‘you’, katabtum ‘you wrote’. Third person (t): hum ‘they’, katabū ‘they wrote’.
Plural, feminine. First person (f): as for the masculine. Second person (s): ’antunna ‘you’, katabtunna ‘you wrote’. Third person (t): hunna ‘they’, katabna ‘they wrote’.

[Figure 6.11, tree diagram. Person: First person (f); Second person (s); Third person (t).]
Figure 6.11 Morphological feature of person category attributes, with letter at position 9

6.2.10 Morphological Feature Category of Inflectional Morphology
Inflectional morphology aṣ-ṣarf is an important feature of most Arabic words. Words are classified according to inflectional morphology into (i) invariable mabnῑ or (ii) declined or conjugated mu‘rab. Declined or conjugated words mu‘rab are defined as those words which are affected by the preceding word in context. The effect causes a change in the case or mood of the word, changing its case or mood mark. By contrast, invariable words mabnῑ are defined as words that do not change their case or mood marks in context, even when they are preceded by words that otherwise have an effect on the following words (Dahdah 1987; Al-Ghalayyni 2005). A declined or conjugated word can be an imperfect verb, e.g. yaktubu ‘he is writing’, and most nouns, such as as-samā’ ‘the sky’, al-‘arḍ ‘the earth’ and ar-rağul ‘the man’. An invariable word can be any particle, past and imperative verbs, and some nouns, such as qad ‘already or perhaps’, kataba ‘he wrote’, ’uktub ‘write (order)’, hāḏihi ‘this (fem.)’, ‘ayna ‘where’, and man ‘who’ (Dahdah 1987; Al-Ghalayyni 2005).
Most nouns are declined, an exception being some nouns that are similar to particles. For example, pronouns are indeclinable nouns. Declined nouns are classified into (i) triptote or fully declined munṣarif, and (ii) diptote or non-declinable mamnū’ min aṣ-ṣarf. Triptote or fully declined nouns are regular nouns which change their case in context, affected by the preceding word. The case mark can be any short vowel, tanwῑn or a letter such as ’alif and yā’. Diptote or non-declinable nouns, by contrast, cannot accept tanwῑn or kasrah as case mark; for example, ’aḥmadu ‘Ahmad’, ya‘qūba ‘Jacob’, and ‘aṭšānu ‘thirsty’ (Dahdah 1987; Al-Ghalayyni 2005). Figure 6.12 shows the attributes of the morphological feature of Inflectional Morphology.
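The positional encoding used in this chapter can be illustrated with a small lookup sketch. The Python fragment below decodes the letters at positions 7 to 10 of a tag string, using only the attribute letters given in Figures 6.9 to 6.12. The function name, the use of '-' for a feature that does not apply, and the example tag are assumptions made for illustration; this is not the implementation of the analyser.

```python
# Hypothetical decoder for tag-string positions 7-10.
# Positions are 1-based, following the numbering used in the text.
FEATURES = {
    7: ("gender", {"m": "masculine", "f": "feminine", "x": "common gender"}),
    8: ("number", {"s": "singular", "d": "dual", "p": "sound plural",
                   "b": "broken plural", "m": "plural of paucity",
                   "j": "plural of multitude", "u": "ultimate plural",
                   "l": "plural of plural", "x": "undefined"}),
    9: ("person", {"f": "first person", "s": "second person",
                   "t": "third person"}),
    10: ("inflectional morphology", {"s": "invariable",
                                     "v": "triptote / fully declined",
                                     "p": "diptote / non-declinable",
                                     "d": "conjugated"}),
}


def decode(tag):
    """Return {feature: attribute} for positions 7-10 of `tag`."""
    result = {}
    for position, (feature, letters) in FEATURES.items():
        letter = tag[position - 1]   # 1-based position to 0-based index
        if letter == "-":            # assumption: '-' means "not applicable"
            continue
        result[feature] = letters[letter]
    return result


if __name__ == "__main__":
    # Made-up tag: positions 1-6 are padded with '-' for this example only.
    print(decode("------msts"))
```

Keeping one dictionary per position mirrors the way each feature has its own letter inventory, so the same letter (for example s) can mean different things at different positions without ambiguity.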
Table 6.10 lists examples and definitions of the 4 attributes of the morphological feature category of Inflectional Morphology, represented at position 10 in the tag string.

Table 6.10 Examples of the morphological feature category of Inflectional Morphology
POS | Morphology attribute | Examples
Noun الاسم al-’ism | Invariable (s) مبني mabnī | An invariable noun does not change its case marks in context, even though it is preceded by special words that have effects on the following words. E.g. pronouns: أنتم ’antum ‘you (second person, plural)’.
Noun الاسم al-’ism | Declined معرب mu‘rab: Triptote / fully declined (v) منصرف munṣarif | Triptote or fully declined nouns are regular nouns which change their case in context due to the effect of the preceding word. E.g. السماء as-samā’ ‘the sky’, الأرض al-’arḍ ‘the earth’, الرجل ar-rağul ‘the man’.
Noun الاسم al-’ism | Declined معرب mu‘rab: Diptote / non-declined (p) ممنوع من الصرف mamnū‘ min aṣ-ṣarf | Diptote or non-declined nouns cannot accept tanwīn or kasrah as case mark, e.g. أحمدُ ’aḥmadu ‘Ahmad’, يعقوبَ ya‘qūba ‘Jacob’, عطشانُ ‘aṭšānu ‘thirsty’.
Verb الفعل al-fi‘l | Invariable (s) مبني mabnī | An invariable verb is defined as a word that does not change its mood marks in context, e.g. كتبَ kataba ‘he wrote’ and اكتبْ ’uktub ‘write (order)’.
Verb الفعل al-fi‘l | Conjugated (d) معرب mu‘rab | A conjugated verb is affected by the preceding word in context. E.g. يكتبُ yaktubu ‘he is writing’, لن يكتبَ lan yaktuba ‘he will not write’, لم يكتبْ lam yaktub ‘he did not write’.

[Figure 6.12: tree diagram. Noun: Declined, subdivided into Triptote / fully declined (v) and Diptote / non-declinable (p); Invariable (s). Verb: Conjugated (d); Invariable (s).]
Figure 6.12 The morphological feature subcategories of Morphology attributes, with letter at position 10

6.2.11 Morphological Feature Category of Case or Mood

Case or mood is the morphological feature that determines the appropriate ending of a word, whether the word ends with a letter, short vowel or tanwīn.
Case applies to nouns, and mood applies to verbs; since a word cannot be a noun and a verb at the same time, no word can have both case and mood: they are mutually exclusive. We therefore use position 11 to encode both the case of nouns and the mood of verbs.

Case الحالة الإعرابية للاسم al-ḥālah al-’i‘rābiyyah lil-’ism is a morphological feature which applies to nouns and the subclasses of noun such as adjectives. There are three attributes of the case category: nominative مرفوع marfū‘, genitive مجرور mağrūr and accusative منصوب manṣūb. Case marks are short vowel suffixes: ḍammah ضمة /u/ for nominative, kasrah كسرة /i/ for genitive and fatḥah فتحة /a/ for accusative, with some exceptions to these general rules. Case is classified under morphology because it is part of word structure. Case is also classified under syntax because it is determined by the syntax of the sentence or clause. Subjects are marked by nominative case, direct objects of transitive verbs are marked by accusative case, and the object of a preposition and the possessor in a possessive structure are marked by genitive case (Ryding 2005).

Mood الحالة الإعرابية للفعل al-ḥālah al-’i‘rābiyyah lil-fi‘l is a morphological feature which applies to verbs. There are three attributes of this category, namely indicative الرفع ar-raf‘, subjunctive النصب an-naṣb and imperative or jussive الجزم al-ğazm. Straightforward statements or questions involve the indicative mood, whereas the subjunctive mood indicates an attitude toward the action (doubt, desire, wishing, necessity), and the imperative or jussive mood indicates an attitude of command or need (Ryding 2005). Imperative here describes the mood of the verb, while in section 6.2.3 imperative describes a verb category. Like case, mood is classified under morphology because it is reflected in word structure. Mood is indicated by suffixes attached to the end of the verb stem. Mood is marked by ḍammah ضمة
/u/ to indicate the indicative mood, by fatḥah فتحة /a/ to indicate the subjunctive mood, and by sukūn سكون to indicate the imperative or jussive mood. Mood marking is determined by particular particles or by narrative context. This marking applies only to imperfect and imperative verbs. Perfect verbs do not have mood (Ryding 2005).

The EAGLES guidelines for morphosyntactic annotation recommend placing attributes under part-of-speech headings. The standard requirement for these attributes and values is that the tag set of a language is advised to encode them. The recommended attributes include type of noun, gender, number, case, person, definiteness, verb form / mood, tense, voice, status, degree, possessive, category of pronouns, and type for pronoun, determiner, article, adposition, conjunctions, numerals, and residuals. Case is a recommended attribute for nouns (N), adjectives (AJ), pronouns and determiners (PD), articles (AT) and numerals (NU). Table 6.11 shows the different attribute values of case under each part-of-speech heading recommended by EAGLES. Mood or verb form is a recommended attribute specified for verbs. The EAGLES guidelines distinguish eight attributes of mood for European languages: indicative, subjunctive, imperative and conditional, which are applicable to finite verbs, and infinitive, participle, gerund and supine, which are applicable to non-finite verbs.

Table 6.11 The different attribute values of Case under each part-of-speech heading, as recommended by EAGLES
Part of Speech | Attributes of Case
Nouns (N) | 1. Nominative 2. Genitive 3. Dative 4. Accusative 5. Vocative
Adjectives (AJ) | 1. Nominative 2. Genitive 3. Dative 4. Accusative
Pronouns and Determiners (PD) | 1. Nominative 2. Genitive 3. Dative 4. Accusative 5. Non-genitive 6. Oblique
Articles (AT) | 1. Nominative 2. Genitive 3. Dative 4. Accusative
Numerals (NU) | 1. Nominative 2. Genitive 3. Dative 4.
Accusative

Case and mood are also important morphological features of an Arabic word. A good morphosyntactic annotation of Arabic text should include the case or mood of the word and the two main attributes associated with it, namely the morphological feature of Inflectional Morphology and the morphological feature of Case and Mood Marks. For morphosyntactic annotation of Arabic text, these three morphological feature categories are obligatory attributes. Specifying the attributes of these morphological feature categories is a major topic of linguistic and grammatical studies of the morphology and syntax of Arabic.

“… Morphology and Syntax: Arabic words have two states: stand-alone words (out-of-context words) and in-context words. Searching for an out-of-context word to specify its pattern and form is the subject of morphology علم الصرف ‘ilm aṣ-ṣarf. Searching for a word in context to specify its case or mood according to the methods of Arabic grammar, by determining the attribute of case or mood of the word, such as nominative, accusative, genitive or jussive mood, or determining whether the word has only one state wherever it appears in context, is the subject of syntax, which is called علم الإعراب ‘ilm al-’i‘rāb …” (translated from Al-Ghalayyni 2005 p.8)

Table 6.12 shows examples of Case or Mood attributes within sentences. Figure 6.13 shows the 6 attributes of the morphological feature of Case or Mood category, represented at position 11 in the tag string.

Table 6.12 Examples of morphological feature category of Case or Mood
Case or mood | T | Example
Case of noun الحالة الإعرابية للاسم al-ḥālatu al-’i‘rābiyyatu lil-’ism
Nominative | n | Marked by ḍammah ضمة /u/.
Example (nominative, مرفوع marfū‘): ذهبَ الطالبُ إلى المدرسةِ ḏahaba aṭ-ṭālibu ’ilā al-madrasati ‘The student went to the school’. The word الطالبُ aṭ-ṭālibu ‘the student’ is the subject of the sentence and is in the nominative case.
Accusative منصوب manṣūb | a | Marked by fatḥah فتحة /a/. قرأَ الطالبُ الدرسَ qara’a aṭ-ṭālibu ad-darsa ‘The student read the lesson’. The word الدرسَ ad-darsa ‘the lesson’ is the direct object of the transitive verb قرأَ qara’a ‘read’, and is in the accusative case.
Genitive مجرور mağrūr | g | Marked by kasrah كسرة /i/. ذهبَ الطالبُ إلى المدرسةِ ḏahaba aṭ-ṭālibu ’ilā al-madrasati ‘The student went to the school’. The word المدرسةِ al-madrasati ‘the school’ is the object of the preposition إلى ’ilā ‘to’ and is in the genitive case.
Mood of verb الحالة الإعرابية للفعل al-ḥālah al-’i‘rābiyyah lil-fi‘l
Indicative الرفع ar-raf‘ | n | Marked by ḍammah ضمة /u/. يعملُ في الإدارةِ ya‘malu fī al-’idārati ‘He works in administration’. The verb يعملُ ya‘malu ‘he works’ is in the indicative mood.
Subjunctive النصب an-naṣb | a | Marked by fatḥah فتحة /a/. يجبُ أن نقومَ بزيارة yağibu ’an naqūma bi-ziyārah ‘It is necessary that we undertake a visit’. The verb نقومَ naqūma ‘we undertake’ is in the subjunctive mood because it is preceded by the subjunctive particle أن ’an.
Imperative or jussive الجزم al-ğazm | j | Marked by sukūn سكون, or by shortening of the final vowel of the verb if this vowel is otherwise long. إصلاحاتٌ لم تكتملْ منذ عامين ’iṣlāḥātun lam taktamil munḏu ‘āmayni ‘renovations that have not been completed for two years’. لا تنسَ! lā tansa! ‘Don’t forget!’. The verb تكتملْ taktamil ‘completed’ is in the jussive mood because it is preceded by the negative particle لم lam. The verb تنسَ tansa ‘forget’ is in the jussive mood, and is marked by shortening of the final vowel letter ’alif of the original verb تنسى tansā.

Figure 6.13 (tree diagram), first branch. Case: Nominative (n), Accusative (a), Genitive (g).
Figure 6.13 (tree diagram), second branch. Mood: Indicative (n), Subjunctive (a), Imperative/Jussive (j).
Figure 6.13 The morphological feature of Case or Mood, with letter at position 11

6.2.12 The Morphological Feature of Case and Mood Marks

The case or mood is an important morphological feature of the word. The case or mood of a word changes in context, and it is affected by the preceding words. A change of case or mood affects the end of the word, by either changing or omitting the word’s last letter or the short vowel which appears on it. There are three kinds of case or mood marks: short vowel, letter or omission. The short vowels are ḍammah ضمة /u/, fatḥah فتحة /a/ and kasrah كسرة /i/. The letters are ’alif (ا) /ā/, nūn (ن) /n/, wāw (و) /w/ and yā’ (ي) /y/. Finally, omission is of three kinds: the deletion of the short vowel, which is called sukūn سكون, the deletion of the vowel letter (’alif, wāw, yā’) and the deletion of the letter nūn (Al-Ghalayyni 2005).

The nominative case or indicative mood has four marks: ḍammah, wāw, ’alif and nūn. The default mark for nominative case or indicative mood is ḍammah. The accusative case or subjunctive mood has five marks: fatḥah, ’alif, yā’, kasrah and the deletion of the letter nūn. The default mark is fatḥah. The genitive case has three marks: kasrah, yā’ and fatḥah, as listed in Table 6.13. The default mark is kasrah. Finally, the imperative or jussive mood has three marks: sukūn, the deletion of the vowel letter (’alif, wāw, yā’) and the deletion of the letter nūn. The default mark is sukūn (Al-Ghalayyni 2005). Table 6.13 shows examples of the 10 attributes of the Case and Mood Marks category. Figure 6.14 shows the 10 attributes of the morphological feature category of Case and Mood Marks, represented in position 12 of the tag string.
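The relation between the case or mood at position 11 and the mark at position 12 can be sketched as a consistency check. This is a hedged illustration, not the thesis implementation: the letter codes are taken from Figures 6.13 and 6.14, the genitive set follows Table 6.13 (kasrah, yā’, fatḥah), and the names MARK_NAMES, MARKS_BY_CASE_OR_MOOD, default_mark and is_consistent are hypothetical.

```python
# Position-12 letter codes, following Figure 6.14.
MARK_NAMES = {
    "d": "dammah", "f": "fathah", "k": "kasrah",       # short vowels
    "a": "alif", "y": "ya", "w": "waw", "n": "nun",    # letters
    "s": "sukun",                                      # deletion of the short vowel
    "v": "deletion of vowel letter",
    "o": "deletion of nun",
}

# Position-11 letter -> (default mark, all marks listed for that case or mood).
MARKS_BY_CASE_OR_MOOD = {
    "n": ("d", frozenset("dwan")),   # nominative case / indicative mood
    "a": ("f", frozenset("fayko")),  # accusative case / subjunctive mood
    "g": ("k", frozenset("kyf")),    # genitive case (set as in Table 6.13)
    "j": ("s", frozenset("svo")),    # imperative or jussive mood
}


def default_mark(case_or_mood: str) -> str:
    """Return the default position-12 letter for a position-11 letter."""
    return MARKS_BY_CASE_OR_MOOD[case_or_mood][0]


def is_consistent(case_or_mood: str, mark: str) -> bool:
    """True if the mark is one of those listed for the given case or mood."""
    if mark not in MARK_NAMES:
        raise ValueError(f"unknown case/mood mark code: {mark!r}")
    return mark in MARKS_BY_CASE_OR_MOOD[case_or_mood][1]
```

A check of this kind could flag tag strings in which, for example, a jussive verb is annotated with ḍammah; because nominative and indicative share the letter n at position 11, the sketch merges their mark sets, as the prose above does.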
Table 6.13 Examples of each attribute of the Case and Mood Marks category
Case or mood | Mark | T | Example
Case (noun)
Nominative مرفوع marfū‘ | ḍammah | d | يُحَبُّ الصادقُ yuḥabbu aṣ-ṣādiqu ‘The honest (man) is loved’.
Nominative مرفوع marfū‘ | wāw (و) | w | أفلحَ المؤمنونَ ’aflaḥa al-mu’minūna ‘The believers won’.
Nominative مرفوع marfū‘ | ’alif (ا) | a | يُكرَمُ التلميذانِ المجتهدانِ yukramu at-tilmīḏāni al-muğtahidāni ‘Both of the hardworking students are rewarded’.
Accusative منصوب manṣūb | fatḥah | f | جانب الشرَّ فتسلم ğānib aš-šarra fa-taslam ‘If you avoid evil, then you will be fine’.
Accusative منصوب manṣūb | ’alif (ا) | a | أعطِ ذا الحقِّ حقَّه ’a‘ṭi ḏā al-ḥaqqi ḥaqqahu ‘Give the rightful man his right’.
Accusative منصوب manṣūb | yā’ (ي) | y | يحبُّ اللهُ المتقينَ yuḥibbu ’allāhu al-muttaqīna ‘God likes righteous people’.
Accusative منصوب manṣūb | kasrah | k | أكرم الفتياتِ المجتهداتِ ’akrim al-fatayāti al-muğtahidāti ‘Reward the hardworking girls’.
Genitive مجرور mağrūr | kasrah | k | تمسّك بالفضائلِ tamassak bil-faḍā’ili ‘Keep doing good deeds’.
Genitive مجرور mağrūr | yā’ (ي) | y | أطع أمرَ أبيكَ ’aṭi‘ ’amra ’abīka ‘Obey your father’s order’.
Genitive مجرور mağrūr | fatḥah | f | ليس فاعلُ الخيرِ بأفضلَ من الساعي فيه laysa fā‘ilu al-ẖayri bi-’afḍala mina as-sā‘ī fīhi ‘The one who does good deeds is not better than the one who helps in them’.
Mood (verb)
Indicative الرفع ar-raf‘ | ḍammah | d | يُحَبُّ الصادقُ yuḥabbu aṣ-ṣādiqu ‘The honest (man) is loved’.
Indicative الرفع ar-raf‘ | inflectional nūn (ن) | n | تنطقونَ بالصدقِ tanṭiqūna biṣ-ṣidqi ‘You speak the truth’.
Subjunctive النصب an-naṣb | fatḥah | f | يجبُ أن نقومَ بزيارة yağibu ’an naqūma bi-ziyārah ‘It is necessary that we undertake a visit’.
Subjunctive النصب an-naṣb | deletion of nūn | o | لن تنالوا البرَّ حتى تنفقوا مما تحبون lan tanālū al-birra ḥattā tunfiqū mimmā tuḥibbūn ‘You will not earn profit unless you spend what you like’.
Imperative or jussive الجزم al-ğazm | sukūn | s | إصلاحاتٌ لم تكتملْ منذ عامين ’iṣlāḥātun lam taktamil munḏu ‘āmayni ‘renovations that have not been completed for two years’.
Imperative or jussive الجزم al-ğazm | deletion of vowel letter | v | لا تنسَ! lā tansa! ‘Don’t forget!’.
Imperative or jussive الجزم al-ğazm | deletion of nūn | o | قولوا خيراً تغنموا qūlū ẖayran taḡnamū ‘If you speak well, you will get benefit’.
[Figure 6.14: tree diagram. Case and Mood Marks. Short vowel: ḍammah (d), fatḥah (f), kasrah (k). Letter: ’alif (a), yā’ (y), wāw (w), nūn (n). Deletion: sukūn (s), deletion of vowel letter (’alif, wāw, yā’) (v), deletion of nūn (o).]
Figure 6.14 The morphological feature Case and Mood Marks, with letter at position 12

6.2.13 The Morphological Feature of Definiteness

Definiteness in Arabic has two attributes (markers): definiteness معرفة ma‘rifah and indefiniteness نكرة nakirah. The prefix (ال) ’alif-lām is the definiteness prefix for nouns or adjectives, while the diacritical suffix تنوين tanwīn /-n/ is the indefiniteness suffix. The tanwīn is a diacritic mark which does not appear in non-vowelized text, while the definiteness mark, the definite article (ال) ’alif-lām, appears on definite nouns or adjectives in non-vowelized text (Ryding 2005). Table 6.14 shows examples of the morphological feature of Definiteness. Figure 6.15 shows the 2 attributes of the morphological feature of Definiteness, represented at position 13 in the tag string.

Table 6.14 Examples of the morphological feature of Definiteness
Definiteness | T | Example
1 Definiteness معرفة ma‘rifah | d | البيت al-bayt ‘the home’ is a definite noun marked with the prefix (ال) ’alif-lām.
2 Indefiniteness نكرة nakirah | i | بيتٌ baytun ‘home’ is an indefinite noun marked with the diacritical suffix tanwīn /un/.

[Figure 6.15: tree diagram. Definiteness: Definiteness (d), Indefiniteness (i).]
Figure 6.15 The morphological feature of Definiteness, with letter at position 13

6.2.14 Morphological Feature of Voice

Verbs in Arabic are either in the active voice مبني للمعلوم mabnī lil-ma‘lūm or the passive voice مبني للمجهول mabnī lil-mağhūl.
The active voice standardly indicates that the doer of the action is the subject of the verb, while in the passive voice the subject of the verb is the direct object of the corresponding active, and the doer of the action (the active-voice subject) is unknown or not mentioned (Ryding 2005). Table 6.15 shows examples of the 2 Voice category attributes in sentences. Figure 6.16 shows the 2 attributes of the morphological feature of Voice, represented at position 14 in the tag string.

Table 6.15 Examples of Voice category attributes in sentences
Voice | T | Example
Active مبني للمعلوم mabnī lil-ma‘lūm | a | كتبَ الطالبُ الدرسَ kataba aṭ-ṭālibu ad-darsa ‘The student wrote the lesson’. The verb كتبَ kataba ‘wrote’ is an active verb. The subject الطالبُ aṭ-ṭālibu ‘the student’ appears in the sentence.
Passive مبني للمجهول mabnī lil-mağhūl | p | كُتِبَ الدرسُ kutiba ad-darsu ‘The lesson was written’. The verb كُتِبَ kutiba ‘was written’ is a passive verb. The subject of the verb is the direct object الدرسُ ad-darsu ‘the lesson’.

[Figure 6.16: tree diagram. Voice: Active voice (a), Passive voice (p).]
Figure 6.16 The morphological feature of Voice, with letter at position 14

6.2.15 Morphological Feature of Emphasized and Non-emphasized

The morphological feature of Emphasized and Non-emphasized المؤكد وغير المؤكد al-mu’akkad wa ḡayr al-mu’akkad applies to verbs only. It has three attributes: non-emphasized غير مؤكد ḡayr mu’akkad, which applies to past or perfect verbs, obligatorily emphasized yağibu at-ta’kīd and optionally emphasized masmūḥ at-ta’kīd. Imperfect verbs must be emphasized in some circumstances when certain conditions are met, such as interrogation, wish, demand, encouragement, prevention, negation, and swearing. Emphasized verbs are marked by the suffix letter نْ /n/ added to the end of the verb stem; see table 6.5.
There are two types of emphatic nūn /n/: one is the intensive nūn نّ /nn/, نون ثقيلة nūn ṯaqīlah, and the other is the non-intensive nūn نْ /n/, نون خفيفة nūn ẖafīfah (Dahdah 1987; Dahdah 1993). Table 6.16 shows examples of Emphasized and Non-emphasized category attributes in sentences. Figure 6.17 shows the 2 attributes of the morphological feature of Emphasized and Non-emphasized, represented at position 15 in the tag string.

[Figure 6.17: tree diagram. Emphasized and Non-emphasized: Non-emphatic verb (m), Emphatic verb (n).]
Figure 6.17 The morphological feature of Emphasized and Non-emphasized, with letter at position 15

Table 6.16 Examples of the morphological feature Emphasized and Non-emphasized
Emphasized or Non-emphasized | T | Example
Non-emphatic verb فعل غير مؤكد fi‘l ḡayr mu’akkad | m | ذهبَ الطالبُ إلى المدرسةِ ḏahaba aṭ-ṭālibu ’ilā al-madrasati ‘The student went to the school’. The perfect verb ذهبَ ḏahaba ‘went’ is not emphasized.
Emphatic verb فعل مؤكد fi‘l mu’akkad | n | هل تذهبنَّ؟ hal taḏhabanna? ‘Would you go?’ The verb تذهبنَّ taḏhabanna ‘go’ is emphasized. The suffix letter نّ /nn/ (nūn ṯaqīlah) is added to the original verb تذهبُ taḏhabu ‘go’. اذهبنَّ! ’iḏhabanna ‘Go!’ The imperative verb اذهبنَّ ’iḏhabanna ‘Go!’ is emphasized. The suffix letter نّ /nn/ (nūn ṯaqīlah) is added to the original verb اذهبْ ’iḏhab ‘go’.

6.2.16 The Morphological Feature of Transitivity

Verbs in Arabic are either transitive متعدٍّ muta‘addī or intransitive لازم lāzim. Intransitive verbs are verbs which give full meaning in a sentence without the need for an object. On the other hand, transitive verbs require an object to complete the meaning of the sentence. There are three types of transitive verbs. First, the singly transitive verb متعدٍّ إلى مفعول واحد muta‘addī ’ilā maf‘ūlin wāḥid, where there is only one object in the sentence. Second, the doubly transitive verb متعدٍّ إلى مفعولين
muta‘addī ’ilā maf‘ūlayn, which requires two objects to complete the meaning of a sentence. Third, the triply transitive verb متعدٍّ إلى ثلاثة مفاعيل muta‘addī ’ilā ṯalāṯati mafā‘īl, which requires three objects to complete the meaning of a sentence; there are only seven of these verbs: أرى ’arā ‘showed’, أعلمَ ’a‘lama ‘notified’, حدّثَ ḥaddaṯa ‘narrated’, خبّرَ ẖabbara ‘informed’, أخبرَ ’aẖbara ‘gave information’, أنبأَ ’anba’a ‘advised’ and نبّأَ nabba’a ‘announced’, which share the meaning of telling or informing (Dahdah 1987; Dahdah 1993). Table 6.17 shows examples of the 4 Transitivity category attributes in sentences. Figure 6.18 shows the 4 attributes of the morphological feature of Transitivity, represented at position 16 in the tag string.

[Figure 6.18: tree diagram. Transitivity: Intransitive (i), Singly transitive (o), Doubly transitive (b), Triply transitive (t).]
Figure 6.18 The morphological feature of Transitivity, with letter at position 16

Table 6.17 Examples of the Transitivity category attributes in sentences
Transitivity | T | Example
Intransitive verb لازم lāzim | i | ماتَ القائدُ māta al-qā’idu ‘The commander has died’. The verb ماتَ māta ‘has died’ is an intransitive verb. The sentence is meaningful without the need for an object.
Singly transitive verb متعدٍّ إلى مفعول واحد muta‘addī ’ilā maf‘ūlin wāḥid | o | يطلبُ الباحثُ المعرفةَ yaṭlubu al-bāḥiṯu al-ma‘rifata ‘The researcher asks for knowledge’. The verb يطلبُ yaṭlubu ‘asks’ is a singly transitive verb. The sentence is not meaningful without the object المعرفةَ al-ma‘rifata ‘knowledge’.
Doubly transitive verb متعدٍّ إلى مفعولين muta‘addī ’ilā maf‘ūlayn | b | تأمرونَ الناسَ خيراً ta’murūna an-nāsa ẖayran ‘You order people [to do] good’. The verb تأمرونَ ta’murūna ‘order’ is a doubly transitive verb.
The sentence is not meaningful without the first object الناسَ an-nāsa ‘people’ and the second object خيراً ẖayran ‘for good’.
Triply transitive verb متعدٍّ إلى ثلاثة مفاعيل muta‘addī ’ilā ṯalāṯati mafā‘īl | t | أرى اللهُ المذنبينَ أعمالَهم حسراتٍ ’arā allāhu al-muḏnibīna ’a‘mālahum ḥasarātin ‘God shows sinners what they did as repentances’. The verb أرى ’arā ‘shows’ is a triply transitive verb. The sentence is not meaningful if any of the three objects is missing: المذنبينَ al-muḏnibīna ‘sinners’, أعمالَهم ’a‘mālahum ‘what they did’, and حسراتٍ ḥasarātin ‘repentances’.

6.2.17 The Morphological Feature of Rational

The morphological feature of rational describes the ability to be endowed with reason and comprehension, as human beings, angels and demons are. The opposite is irrational. The morphological feature of “rational” or “rationality” differs from the linguistic concept of animacy because the latter divides nouns/entities into two categories, animate versus inanimate, while the former is used to denote human or human-like entities (e.g. djinn) at the top of the person hierarchy (Zaenen et al. 2004), endowed with the faculty of reason as distinct from all other entities, whether animate or inanimate. Rational is a morphological feature which is applicable to some types of nouns such as singular proper nouns (names) اسم العلم المفرد ’ism al-‘alam al-mufrad, demonstrative pronouns أسماء الإشارة ’asmā’ al-’išārah, conditional nouns أسماء الشرط ’asmā’ aš-šarṭ, relative pronouns الأسماء الموصولة al-’asmā’ al-mawṣūlah, interrogative pronouns أسماء الاستفهام ’asmā’ al-’istifhām and allusive nouns الكناية al-kināyah (Dahdah 1987; Dahdah 1993). Table 6.18 shows the 2 attributes of the morphological feature Rational, with rational and irrational examples for these noun types. Figure 6.19 shows the noun types that have the Rational morphological feature, represented at position 17 in the tag string.

Table 6.18 Examples of the morphological feature category of Rational
Noun | Rational | Irrational
Singular proper name
(اسم العلم المفرد ’ism al-‘alam al-mufrad) | سمير samīr ‘Samir’, جبريل ğibrīl ‘Gabriel’, إبليس ’iblīs ‘Satan’. | Irrational compound proper names such as بيت لحم bayt laḥm ‘Bethlehem’, بعلبك ba‘labak ‘Baalbak’.
Demonstrative pronouns أسماء الإشارة ’asmā’ al-’išārah | أولئك ’ulā’ika ‘these’. | تلك tilka ‘that’.
Interrogation pronouns أسماء الاستفهام ’asmā’ al-’istifhām | من man ‘who’, من ذا man ḏā ‘who is he’. | ما mā ‘that which’, ماذا māḏā ‘what’.
Conditional nouns أسماء الشرط ’asmā’ aš-šarṭ | من man ‘who’. | ما mā ‘that which’, مهما mahmā ‘whatever’.
Relative pronouns الأسماء الموصولة al-’asmā’ al-mawṣūlah | من man ‘who’. | ما mā ‘that which’.
Allusive nouns الكناية al-kināyah | فلان fulān (used to refer to a rational singular masculine proper name). | (none)

[Figure 6.19: tree diagram. Rational and irrational: Rational (h), Irrational (n). Noun types that carry the feature: 1) singular proper nouns, 2) conditional nouns, 3) allusive nouns, 4) interrogation pronouns, 5) relative pronouns, 6) demonstrative pronouns.]
Figure 6.19 Morphological feature category of Rational, with letter at position 17

6.2.18 The Morphological Feature of Declension and Conjugation

Declension means a class of nouns or adjectives having the same type of inflectional forms, and conjugation is the schematic arrangement of the inflectional forms of a verb [53]. In Arabic, both terms also carry the sense of being subject to change. In Arabic grammatical terminology, declension and conjugation fall under the ‘science’ (area of enquiry) that describes the rules of word structure. It identifies the underlying letters of the word, the word’s consonant letters and vowels. It also identifies which of the word’s letters are changed during derivation.
In addition, the meaning includes changing the word into different forms with different meanings, such as deriving the perfect verb الفعل الماضي al-fi‘l al-māḍī, imperfect verb الفعل المضارع al-fi‘l al-muḍāri‘, imperative verb فعل الأمر fi‘l al-’amr, active participle اسم الفاعل ’ism al-fā‘il, passive participle اسم المفعول ’ism al-maf‘ūl, relative noun الاسم المنسوب al-’ism al-mansūb, diminutive اسم التصغير ’ism at-taṣḡīr and others from the gerund المصدر al-maṣdar (Al-Ghalayyni 2005).

Nouns are classified into inflected nouns أسماء متصرفة ’asmā’ mutaṣarrifah and non-inflected nouns أسماء غير متصرفة ’asmā’ ḡayr mutaṣarrifah. The inflected noun has number, i.e. it can be dual or plural as well as singular. It can be a relative noun اسم منسوب ’ism mansūb or a diminutive اسم مصغّر ’ism muṣaḡḡar. The non-inflected noun الاسم غير المتصرف al-’ism ḡayr al-mutaṣarrif, by contrast, has only one form, which does not change in context. Non-inflected nouns include pronouns الضمائر al-ḍamā’ir, demonstrative pronouns أسماء الإشارة ’asmā’ al-’išārah, relative pronouns الأسماء الموصولة al-’asmā’ al-mawṣūlah, conditional nouns أسماء الشرط ’asmā’ aš-šarṭ, interrogative pronouns أسماء الاستفهام ’asmā’ al-’istifhām, allusive nouns الكناية al-kināyah, adverbs الظروف al-ẓurūf and numerals أسماء الأعداد ’asmā’ al-’a‘dād.

Inflected nouns الأسماء المتصرفة al-’asmā’ al-mutaṣarrifah are classified into derived nouns اسم مشتق ’ism muštaqq and primitive nouns اسم جامد ’ism ğāmid. The derived noun is derived from its verb; for example, عالم ‘ālim ‘scientist’ and متعلم muta‘allim ‘learner’ are derived from the verbs علِمَ ‘alima ‘knew’ and تعلّمَ ta‘allama ‘he learnt’ respectively.
Derived nouns include 10 types of nouns: active participle اسم فاعل ’ism fā‘il, passive participle اسم مفعول ’ism maf‘ūl, adjective صفة مشبهة ṣifah mušabbahah, intensive active participle مبالغة اسم الفاعل mubālaḡat ’ism al-fā‘il, elative noun اسم تفضيل ’ism tafḍīl, noun of time اسم زمان ’ism zamān, noun of place اسم مكان ’ism makān, gerund with initial mīm المصدر الميمي al-maṣdar al-mīmī, instrumental noun اسم الآلة ’ism al-’ālah and the gerund of the unaugmented verb consisting of more than three letters مصدر الفعل فوق الثلاثي المجرد maṣdar al-fi‘l fawq aṯ-ṯulāṯī al-muğarrad (Al-Ghalayyni 2005).

[53] Merriam-Webster Dictionary

The primitive noun الاسم الجامد al-’ism al-ğāmid cannot be derived from a verb. Examples are حجر ḥağar ‘stone’, سقف saqf ‘ceiling’ and درهم dirham ‘Dirham (currency)’. Primitive nouns also include the gerunds of unaugmented triliteral verbs مصادر الأفعال الثلاثية المجردة maṣādir al-’af‘āl aṯ-ṯulāṯiyyah al-muğarradah such as علم ‘ilm ‘science’ and قراءة qirā’ah ‘reading’ (Al-Ghalayyni 2005).

Verbs are classified into conjugated verbs أفعال متصرفة ’af‘āl mutaṣarrifah and non-conjugated verbs أفعال جامدة ’af‘āl ğāmidah according to whether the verb has a tense or not. Verb forms are changed to indicate the tense of an action: past, present or future. If a verb does not indicate any tense or action, there is no need to change its form, because its meaning does not change when the tense or action changes. Only a change of tense or action requires changing the form of the verb to indicate different meanings in different tenses.

The non-conjugated verb الفعل الجامد al-fi‘l al-ğāmid is similar to particles. It indicates an abstract meaning that has no tense or action. Therefore, the non-conjugated verb has only one form, which does not change in any context. Non-conjugated verbs are either restricted to the perfect ملازم للماضي mulāzim lil-māḍī, such as عسى ‘asā ‘might’ and ليس laysa ‘not (negation)’, or restricted to the imperfect ملازم للمضارع mulāzim lil-muḍāri‘, as in يهيط
yahīṭu ‘scream’, or restricted to the imperative, as in هبْ hab ‘suppose’.

Finally, the conjugated verb الفعل المتصرف al-fi‘l al-mutaṣarrif indicates an action or tense. So, it accepts the changes of form which reflect the different meanings of different tenses. The majority of verbs belong to the class of fully conjugated verbs فعل تام التصريف fi‘l tāmm at-taṣrīf, where the three types of signification are found, as in كتبَ kataba ‘he wrote’ (perfect), يكتبُ yaktubu ‘he is writing’ (imperfect) and اكتبْ ’uktub ‘write’ (imperative). The partially conjugated verb فعل ناقص التصريف fi‘l nāqiṣ at-taṣrīf has only two types of signification: either perfect and imperfect but not imperative, as in كاد kāda, يكاد yakādu ‘[be] close [to] or almost [to]’ and أوشك ’awšaka, يوشك yūšiku ‘[be] about [to]’, or imperfect and imperative but not perfect, as in يدعُ yada‘u ‘he leaves’, دعْ da‘ ‘leave’ and يذرُ yaḏaru ‘he leaves’, ذرْ ḏar ‘leave’ (Al-Ghalayyni 2005).

Table 6.19 shows examples of the 9 attributes of the Declension and Conjugation morphological feature. Figure 6.20 shows the classifications of nouns and verbs according to the Declension and Conjugation morphological feature, represented at position 18 in the tag string.

Table 6.19 Examples of the Declension and Conjugation morphological feature
Declension and Conjugation | T | Examples
Noun
Non-inflected غير متصرف ḡayr mutaṣarrif | n | The pronoun هو huwa ‘he’.
Primitive / Concrete noun | t | The concrete noun is perceptible by one or more of the five senses and includes the generic noun امرأة ’imra’ah ‘woman’, the proper noun مصر miṣra ‘Egypt’, and some nouns of place and instrument: مزمار mizmār ‘pipe’.
Primitive / Abstract noun | a | The abstract noun is not perceptible by the five senses and includes the unaugmented gerund, e.g. شربٌ šurbun ‘drinking’, and some gerunds with initial mīm, e.g. مطلبٌ maṭlabun ‘claim’.
Inflected / Derived noun | d | عالم ‘ālim ‘scientist’, derived from the verb
علِمَ ‘alima ‘knew’, and متعلم muta‘allim ‘learner’, derived from the verb تعلّمَ ta‘allama ‘he learnt’.
Arabic terms for the noun attributes above: Primitive / Concrete noun: متصرف – جامد – اسم ذات mutaṣarrif – ğāmid – ’ism ḏāt; Primitive / Abstract noun: متصرف – جامد – اسم معنى mutaṣarrif – ğāmid – ’ism ma‘nā; Inflected / Derived noun: متصرف – اسم مشتق mutaṣarrif – ’ism muštaqq.
Verb
Non-conjugated / restricted to the perfect فعل جامد – ملازم للماضي fi‘l ğāmid – mulāzim lil-māḍī | p | ليس laysa ‘not (negation)’, عسى ‘asā ‘might’.
Non-conjugated / restricted to the imperfect فعل جامد – ملازم للمضارع fi‘l ğāmid – mulāzim lil-muḍāri‘ | c | يهيط yahīṭu ‘scream’.
Non-conjugated / restricted to the imperative فعل جامد – ملازم للأمر fi‘l ğāmid – mulāzim lil-’amr | i | هبْ hab ‘suppose’.
Conjugated / fully conjugated verb متصرف – فعل تام التصريف mutaṣarrif – fi‘l tāmm at-taṣrīf | v | كتبَ kataba ‘he wrote’, يكتبُ yaktubu ‘he writes’ and اكتبْ ’uktub ‘write’.
Conjugated / partially conjugated verb متصرف – فعل ناقص التصريف mutaṣarrif – fi‘l nāqiṣ at-taṣrīf | m | كاد kāda, يكاد yakādu ‘[be] close [to] or almost [to]’; أوشك ’awšaka, يوشك yūšiku ‘[be] about [to]’; يدعُ yada‘u ‘he leaves’, دعْ da‘ ‘leave’; يذرُ yaḏaru ‘he leaves’, ذرْ ḏar ‘leave’.

[Figure 6.20: tree diagram. Noun: Non-inflected (n); Inflected, subdivided into Derived (d) and Primitive, the latter into Concrete noun (t) and Abstract noun (a). Verb: Conjugated, subdivided into Fully conjugated (v) and Partially conjugated (m); Non-conjugated, subdivided into Restricted to the perfect (p), Restricted to the imperfect (c) and Restricted to the imperative (i).]
Figure 6.20 The classification of nouns and verbs according to the morphological feature of Declension and Conjugation, with letter at position 18

6.2.19 The Morphological Feature of Unaugmented and Augmented

Arabic verbs have roots consisting of three or four letters. From these roots many verbs can be derived by following certain patterns. There are many patterns for Arabic verbs.
The standard way of determining the pattern of a verb is to refer to an Arabic lexicon or dictionary. Nonetheless, Arabic linguists have constructed general rules to extract these patterns. Verbs have two basic patterns consisting of three or four letters: فعل fa‘ala and فعلل fa‘lala respectively. Any verb derived following these two patterns is called an unaugmented verb (فعل مجرد) fi‘l muğarrad. From fa‘ala, the basic triliteral pattern, 10 more patterns can be derived, and from fa‘lala, the basic quadriliteral pattern, 3 more patterns can be derived. These new patterns are derived by adding one, two or three letters to the basic patterns or by duplicating the second letter ع ‘ayn of the basic pattern. The letters that are added to the basic patterns to produce the other 13 patterns are ا ، أ ، ت ، س ، ل ، م ، ن ، ه ، و ، ي (ā, ’, t, s, l, m, n, h, w, y), which combine in the word سألتمونيها sa’altumūnīhā ‘you (second person, plural) asked me it (feminine, singular)’ (Dahdah 1987; Dahdah 1993; Al-Ghalayyni 2005).

Unaugmented declinable nouns are either triliteral ثلاثي ṯulāṯī such as حجر ḥağr ‘stone’, quadriliteral رباعي rubā‘ī such as جعفر ğa‘far ‘male proper name’, or quinquiliteral خماسي ẖumāsī such as سفرجل safarğal ‘quince [kind of fruit]’. A noun which consists of more than five letters is an augmented noun. A noun can be augmented by one letter مزيد بحرف mazīd bi ḥarf, such as حصان ḥiṣān ‘horse’ (augmented by ā) and قنديل qindīl ‘light’ (augmented by ī); augmented by two letters مزيد بحرفين mazīd bi ḥarfayn, such as مصباح miṣbāḥ ‘lamp’ (augmented by m م and ā); augmented by three letters مزيد بثلاثة أحرف mazīd bi ṯalāṯati ’aḥruf, such as انطلاق ’inṭilāq ‘starting’ (augmented by ’, n ن and ā) and احرنجام ’iḥranğām ‘crowded’ (augmented by ’, n ن and ā); or augmented by four letters مزيد بأربعة أحرف mazīd bi ’arba‘ati ’aḥruf, such as استغفار ’istiḡfār ‘asking for forgiveness’ (augmented by ’, s س, t ت and ā).
Table 6.20 shows examples of the 5 Unaugmented and Augmented category attributes. Figure 6.21 shows the 5 attributes of the Unaugmented and Augmented category, represented at position 19 in the tag string.

Table 6.20 Examples of Unaugmented and Augmented category attributes

Unaugmented and Augmented | Tag | Triliteral verbs | Quadriliteral verbs | Nouns
Unaugmented, المجرد al-muğarrad | s | فتح fataḥa ‘he opened’ | دحرج daḥrağa ‘rolled’ | حجر ḥağr ‘stone’, جعفر ğa‘far ‘a name’, سفرجل safarğal ‘quince [kind of fruit]’
Augmented by one letter, مزيد بحرف mazīd bi ḥarf | a | يفتح yaftaḥu ‘he is opening’. The letter ي yā’ is added to the beginning of the verb stem fataḥa. | يدحرج yudaḥriğu ‘he is rolling’. The letter ي yā’ is added to the beginning of the verb stem daḥrağa. | حصان ḥiṣān ‘horse’, قنديل qindīl ‘light’
Augmented by two letters, مزيد بحرفين mazīd bi ḥarfayn | b | انكسر ’inkasara ‘has broken’. The letters ا ’alif and ن nūn are added to the beginning of the verb stem كسر kasara ‘broke’. | يتدحرج yatadaḥrağu ‘is rolling’. The letters ي yā’ and ت tā’ are added to the verb stem daḥrağa ‘rolled’. | مصباح miṣbāḥ ‘lamp’
Augmented by three letters, mazīd bi ṯalāṯati ḥurūf | t | استخرج ’istaẖrağa ‘has extracted’. The letters ا ’alif, س sīn and ت tā’ are added to the beginning of the verb stem خرج ẖarağa ‘extracted’. | (none) | احرنجام ’iḥranğām ‘crowded’, انطلاق ’inṭilāq ‘starting’
Augmented by four letters, مزيد بأربعة أحرف mazīd bi ’arba‘ati ’aḥruf | q | (none) | (none) | استغفار ’istiḡfār ‘asking for forgiveness’

[Figure 6.21: Unaugmented (s); Augmented by one letter (a); Augmented by two letters (b); Augmented by three letters (t); Augmented by four letters (q)]

Figure 6.21 The Unaugmented and Augmented category attributes, with letter at position 19

6.2.20 The Morphological Feature of Number of Root Letters

“Root is a relatively invariable discontinuous bound morpheme, represented by two to five phonemes, typically three consonants in same order, which interlocks with a pattern to form a stem and which has lexical meaning” (Ryding 2005). Discontinuous means that vowels can be interspersed between the root consonants, e.g. درس d-r-s ‘study’. These consonants must always be present in the same sequence in the derived words: first د /d/, then ر /r/, then س /s/ (Ryding 2005).

Verbs, as mentioned in the previous section, have triliteral ثلاثي ṯulāṯī or quadriliteral رباعي rubā‘ī roots. The general Arabic rule is that any noun with fewer than three letters or more than five letters has either had letters deleted from it or added to it (Dahdah 1987). According to this rule, Arabic nouns are either triliteral ثلاثي ṯulāṯī such as حجر ḥağr ‘stone’, quadriliteral رباعي rubā‘ī such as جعفر ğa‘far ‘a name’, or quinquiliteral خماسي ẖumāsī such as سفرجل safarğal ‘quince’. Table 6.21 shows examples of the 3 attributes of the Number of Root Letters category. Figure 6.22 shows the 3 attributes of the Number of Root Letters category, represented at position 20 in the tag string.
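The two properties described above (root consonants kept in the same order, and a root-size attribute at position 20) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names are hypothetical, words are given in a simplified one-character-per-letter ASCII transliteration rather than Arabic script, and the order check is only a necessary condition, since any word that happens to contain the letters in sequence will pass.

```python
# Illustrative sketch only; names and the ASCII transliteration are assumptions.

# Tag letters for position 20, as listed for the Number of Root Letters category.
ROOT_SIZE_TAGS = {3: "t", 4: "q", 5: "f"}


def contains_root_in_order(word, root):
    """Return True if the root letters occur in `word` in the same sequence.

    Other letters (vowels, affixes) may be interspersed between them.
    """
    remaining = iter(word)
    # `letter in remaining` consumes the iterator up to the first match,
    # so each root letter must be found after the previous one.
    return all(letter in remaining for letter in root)


def root_size_tag(root):
    """Return the position-20 letter for a root, or None if it has no tag."""
    return ROOT_SIZE_TAGS.get(len(root))


if __name__ == "__main__":
    for word in ("darasa", "madrasa", "mudarris", "sadara"):
        print(word, contains_root_in_order(word, "drs"))
    print(root_size_tag("ktb"), root_size_tag("dHrj"), root_size_tag("sfrjl"))
```

A real analyser would also need the pattern that interlocks with the root; the subsequence test alone cannot tell a genuine derivation from a coincidental match.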
[Figure 6.22: Triliteral (t); Quadriliteral (q); Quinquiliteral (f)]

Figure 6.22 The Number of Root Letters category, with letter at position 20

Table 6.21 Examples of Number of Root Letters category attributes

Number of root letters | Tag | Examples
Triliteral, ثلاثي ṯulāṯī | t | ك ت ب k t b ‘wrote’
Quadriliteral, رباعي rubā‘ī | q | د ح ر ج d ḥ r ğ ‘rolled’
Quinquiliteral, خماسي ẖumāsī | f | س ف ر ج ل s f r ğ l ‘quince’

6.2.21 The Morphological Feature of Verb Root

Arabic linguists classify Arabic triliteral verbs (roots) into two main categories according to the groups of letters which construct the verb. These categories are the intact verb الفعل الصحيح al-fi‘l aṣ-ṣaḥīḥ and the defective verb الفعل المعتل al-fi‘l al-mu‘tall. Intact verbs are classified into three subcategories: sound verb الفعل السالم al-fi‘l as-sālim, verb containing hamzah الفعل المهموز al-fi‘l al-mahmūz, and doubled verb الفعل المضعف al-fi‘l al-muḍa‘‘af. All the underlying (original) letters of the sound verb belong to the consonant letter group only, i.e. all letters except for the vowels and hamzah. The second subcategory, the verb containing hamzah, has hamzah ( أ , إ , ؤ , ئ , ء ) as one of its underlying (original) letters, as either the first, second or third letter. The doubled subcategory has the same letter as its second and third radicals (Al-Ghalayyni 2005).

The second category is the defective verb الأفعال المعتلة al-’af‘āl al-mu‘tallah, where one or two of the underlying (original) letters belong to the set of vowels ا , و , ي (’alif, wāw, yā’). This category has four subcategories. The first contains a vowel as the first letter of its root. This is called an initial-weak verb الفعل المثال al-fi‘l al-miṯāl. The second subcategory contains a vowel as the second letter of the root. This is called a hollow verb الفعل الأجوف al-fi‘l al-’ağwaf. The third subcategory contains a vowel as the third letter of its root. This is called a final-weak verb الفعل الناقص al-fi‘l an-nāqiṣ.
The last subcategory contains two vowels in its root. If these vowels are adjacent, as the first and second letters of the root, or as the second and third letters of the root, this is called an adjacent doubly-weak verb لفيف مقرون lafīf maqrūn. If it contains two vowels as the first and third root letters, it is called a separated doubly-weak verb لفيف مفروق lafīf mafrūq (Al-Ghalayyni 2005). Figure 6.23 shows part of this classification of 30 Verb Root attributes. More detailed subclassification of triliteral verbs can be derived by combining the subcategories of verbs containing hamzah, doubled letters and defective letters. Table 6.22 shows the 30 Verb Root attributes with an example of each attribute. The Verb Root category is represented at position 21 of the tag string.

Table 6.22 Verb Root category attributes and their tags at position 21

# | Category attributes | Tag | Examples
1 | Sound verb, صحيح ṣaḥīḥ | a | حسب ḥasaba ‘calculated’
2 | Doubled verb, مضعف muḍa‘‘af | b | حب ḥabba ‘loved’
3 | Initially-hamzated verb, مهموز الفاء mahmūz al-fā’ | c | أكل ’akala ‘ate’
4 | Initially-hamzated doubled verb, مهموز الفاء مضعف mahmūz al-fā’ muḍa‘‘af | d | أن ’anna ‘moan’
5 | Initially- and finally-hamzated verb, مهموز الفاء ومهموز اللام mahmūz al-fā’ wa mahmūz al-lām | e | أثأ ’aṯa’a ‘hit’
6 | Medially-hamzated verb, مهموز العين mahmūz al-‘ayn | f | سأل sa’ala ‘asked’
7 | Finally-hamzated verb, مهموز اللام mahmūz al-lām | g | بدأ bada’a ‘started’
8 | wāw-initial verb, مثال واوي miṯāl wāwī | h | وعد wa‘ada ‘promised’
9 | wāw-initial and doubled verb, مثال واوي مضعف miṯāl wāwī muḍa‘‘af | i | ود wadda ‘wished’
10 | wāw-initial and medially-hamzated verb, مثال واوي مهموز العين miṯāl wāwī mahmūz al-‘ayn | j | وئب wa’iba ‘be angry’
11 | wāw-initial and finally-hamzated verb, مثال واوي مهموز اللام miṯāl wāwī mahmūz al-lām | k | وطئ waṭi’a ‘trampled’
12 | yā’-initial verb, مثال يائي miṯāl yā’ī | l | يقن yaqina ‘was certain’
13 | yā’-initial and doubled verb, مثال يائي مضعف miṯāl yā’ī muḍa‘‘af | m | يم yamma ‘to betake’
14 | yā’-initial and medially-hamzated verb, مثال يائي مهموز العين miṯāl yā’ī mahmūz al-‘ayn | n | يئس ya’isa ‘to despair’
15 | Hollow with wāw, أجوف واوي ’ağwaf wāwī | o | قام qāma ‘to stand up’
16 | Hollow with wāw and initially-hamzated verb, أجوف واوي مهموز الفاء ’ağwaf wāwī mahmūz al-fā’ | p | آب āba ‘to return’
17 | Hollow with wāw and finally-hamzated verb, أجوف واوي مهموز اللام ’ağwaf wāwī mahmūz al-lām | q | ناء nā’a ‘to fall down’
18 | Hollow with yā’, أجوف يائي ’ağwaf yā’ī | r | باع bā‘a ‘to sell’
19 | Hollow with yā’ and initially-hamzated verb, أجوف يائي مهموز الفاء ’ağwaf yā’ī mahmūz al-fā’ | s | أيس ’ayisa ‘to despair’
20 | Hollow with yā’ and finally-hamzated verb, أجوف يائي مهموز اللام ’ağwaf yā’ī mahmūz al-lām | t | شاء šā’a ‘to want’
21 | Defective with wāw verb, ناقص واوي nāqiṣ wāwī | u | saraw ‘to rid s.o. of worries’
22 | Defective with wāw and initially-hamzated verb, ناقص واوي مهموز الفاء nāqiṣ wāwī mahmūz al-fā’ | v | أسا ’asā ‘to nurse’
23 | Defective with wāw and medially-hamzated verb, ناقص واوي مهموز العين nāqiṣ wāwī mahmūz al-‘ayn | w | ma’ā ‘to extend’
24 | Defective with yā’ verb, ناقص يائي nāqiṣ yā’ī | x | خشي ẖašiya ‘to fear’
25 | Defective with yā’ and initially-hamzated verb, ناقص يائي مهموز الفاء nāqiṣ yā’ī mahmūz al-fā’ | y | أذي ’aḏiya ‘to damage, suffer’
26 | Defective with yā’ and medially-hamzated verb, ناقص يائي مهموز العين nāqiṣ yā’ī mahmūz al-‘ayn | z | رأى ra’ā ‘saw’
27 | Adjacent doubly-weak verb, لفيف مقرون lafīf maqrūn | * | قوي qawiya ‘to become strong’
28 | Adjacent doubly-weak and initially-hamzated verb, لفيف مقرون مهموز الفاء lafīf maqrūn mahmūz al-fā’ | $ | أوى ’awā ‘to seek refuge’
29 | Separated doubly-weak verb, لفيف مفروق lafīf mafrūq | & | وقى waqā ‘to guard’
30 | Separated doubly-weak and medially-hamzated verb, لفيف مفروق مهموز العين lafīf mafrūq mahmūz al-‘ayn | @ | وأى wa’ā ‘to guarantee’

[Figure 6.23: classification tree]
Verb Root
  Intact verb
    Sound (a)
    Hamzated: Initially-hamzated (c), Medially-hamzated (f), Finally-hamzated (g)
    Doubled (b)
  Defective verb
    Initial-weak verb: wāw-initial (h), yā’-initial (l)
    Hollow verb: Hollow with wāw (o), Hollow with yā’ (r)
    Final-weak verb: Defective with wāw (u), Defective with yā’ (x)
    Doubly-weak verb: Adjacent doubly-weak verb (*), Separated doubly-weak verb (&)

Figure 6.23 Verb Root attributes, with letter at position 21

6.2.22 The Morphological Feature of Types of Noun Finals

Nouns are classified according to their final letters into six categories.

1. The sound noun الاسم صحيح الآخر al-’ism ṣaḥīḥ al-’āẖir is a noun which ends with a consonant rather than a vowel or extended ’alif ألف ممدودة ’alif mamdūdah, which is an ’alif followed by hamzah. Case and mood marks appear at the end of sound nouns. Examples of sound nouns are الرجل ar-rağul ‘the man’, المرأة al-mar’a ‘the woman’, الكتاب al-kitāb ‘the book’, and القلم al-qalam ‘the pen’ (Al-Ghalayyni 2005).

2. The semi-sound noun الاسم شبه الصحيح al-’ism šibh aṣ-ṣaḥīḥ is a noun which ends with a vowel preceded by a silent consonant.
Examples are دلو dalw ‘bucket’, ظبي ẓaby ‘oryx’, هدي hady ‘guidance’ and سعي sa‘y ‘striving’. Case and mood marks appear on the end of semi-sound nouns; for example, the genitive case of the word دلو dalw ‘bucket’ is marked by tanwīn kasr and the nominative case of the word ظبي ẓaby ‘oryx’ is marked by tanwīn ḍamm, as in the following sentence: يشرب ظبي من دلو yašrabu ẓabyun min dalwin ‘an oryx is drinking from a bucket’. Similarly, the accusative case of the word ظبي ẓaby ‘oryx’ is marked by tanwīn fatḥ, as in the following: رأيت ظبيا ra’aytu ẓabyan ‘I saw an oryx’ (Al-Ghalayyni 2005).

3. The noun with shortened ending الاسم المقصور al-’ism al-maqṣūr is a declinable noun ending with ’alif, written in either the ’alif or the yā’ shape. The final ’alif is not the underlying (original) letter; it is either changed or augmented. The underlying (original) letter of the changed ’alif is the vowel wāw or the vowel yā’. The underlying (original) vowel of the changed ’alif appears in the dual form of the noun. The noun final is affected by other morphological features such as number, root letters, and case and mood marks. For example, the underlying (original) vowel of the final ’alif of the noun عصا ‘aṣā ‘stick’ is wāw, which appears in the dual form عصوان ‘aṣawān ‘two sticks’, and the underlying (original) vowel of the final ’alif of the noun فتى fatā ‘boy’ is yā’, which appears in the dual form فتيان fatayān ‘two boys’. The augmented ’alif is added to the noun to make it similar to other nouns or to match a certain pattern, such as أرطى ’arṭā ‘kind of trees’ and ذفرى ḏifrā ‘bone behind the ear’. The final ’alif is written either as ’alif or yā’. If the word consists of four or more letters, such as مستشفى mustašfā ‘hospital’, or if it is derived from yā’, which is its third underlying radical, as in فتى fatā ‘boy’, it is written as yā’. It is written as an ’alif if it is derived from the vowel letter wāw, which is its third underlying radical.
An example is nadā ‘dew’, where the root is n-d-w (Al-Ghalayyni 2005).

4. The noun with extended ending الاسم الممدود al-’ism al-mamdūd is a declinable noun ending with hamzah preceded by augmented ’alif, such as سماء samā’ ‘sky’ and صحراء ṣaḥrā’ ‘desert’. The hamzah at the end of the noun is either underlying (original), as in قراء qurrā’ ‘readers’, or derived from wāw or yā’, as in سماء samā’ ‘sky’ and بناء binā’ ‘building’, where the former is derived from wāw and the latter is derived from yā’. The hamzah might be an added letter indicating feminine nouns, as in حسناء ḥasnā’ ‘beautiful’, or might be added to make the noun similar to certain patterns, as in حرباء ḥirbā’ ‘chameleon’ (Al-Ghalayyni 2005).

5. The noun with curtailed ending الاسم المنقوص al-’ism al-manqūṣ is a declinable noun ending with yā’ preceded by a letter with the short vowel kasrah, such as القاضي al-qāḍī ‘the judge’ and الراعي ar-rā‘ī ‘shepherd’. The final yā’ is deleted if the noun is indefinite, where the definite article ’alif-lām (ال) is not attached to the beginning of the word, and the noun is in the nominative or genitive case, as in حكم قاض على جان ḥakama qāḍin ‘alā ğānin ‘A judge judged a criminal’. However, the final yā’ appears if the definite article is attached to the noun or if the noun is added to another noun which defines it, as in حكم القاضي على الجاني ḥakama al-qāḍī ‘alā al-ğānī ‘The judge judged the criminal’ and جاء قاضي القضاة ğā’a qāḍī al-quḍāt ‘A chief justice came’ (Al-Ghalayyni 2005).

6. The noun with deleted ending الاسم محذوف الآخر al-’ism maḥḏūf al-’āẖir is a noun whose final underlying vowel is deleted. This kind of noun may consist of two letters, such as يد yad ‘hand’, where the final underlying vowel yā’ is deleted (y-d-y). Other examples are سنة sanah ‘year’, where the final underlying vowel wāw is deleted (s-n-w), and لغة luḡah ‘language’, where the underlying vowel wāw is deleted (l-ḡ-w) (Al-Ghalayyni 2005).
Figure 6.24 shows this classification of Noun Finals. Table 6.23 shows examples of the 6 attributes of the morphological feature of Noun Finals, represented at position 22 of the tag string.

[Figure 6.24: Sound (s); Semi-sound (i); Noun with shortened ending (t); Noun with extended ending (e); Noun with curtailed ending (c); Noun with deleted ending (d)]

Figure 6.24 The classification of nouns according to their final letters, for the morphological feature of Noun Finals, with letter at position 22

Table 6.23 Examples of the attributes of the morphological feature of Noun Finals

Attributes of noun final letters category | Tag | Examples
Sound noun, الاسم صحيح الآخر al-’ism ṣaḥīḥ al-’āẖir | s | الرجل ar-rağul ‘the man’, المرأة al-mar’ah ‘the woman’, الكتاب al-kitāb ‘the book’, and القلم al-qalam ‘the pen’.
Semi-sound noun, الاسم شبه الصحيح al-’ism šibh aṣ-ṣaḥīḥ | i | دلو dalw ‘bucket’, ظبي ẓaby ‘oryx’, هدي hady ‘guidance’ and سعي sa‘y ‘striving’.
Noun with shortened ending, الاسم المقصور al-’ism al-maqṣūr | t | عصا ‘aṣā ‘stick’, فتى fatā ‘boy’, مستشفى mustašfā ‘hospital’, أرطى ’arṭā ‘kind of trees’, ذفرى ḏifrā ‘a bone behind the ear’ and nadā ‘dew’.
Noun with extended ending, الاسم الممدود al-’ism al-mamdūd | e | سماء samā’ ‘sky’, صحراء ṣaḥrā’ ‘desert’, بناء binā’ ‘building’, حسناء ḥasnā’ ‘beautiful’ and حرباء ḥirbā’ ‘chameleon’.
Noun with curtailed ending, الاسم المنقوص al-’ism al-manqūṣ | c | القاضي al-qāḍī ‘the judge’ and الراعي ar-rā‘ī ‘shepherd’, حكم قاض على جان ḥakama qāḍin ‘alā ğānin ‘A judge judged a criminal’ and جاء قاضي القضاة ğā’a qāḍī al-quḍāt ‘A chief justice came’.
Noun with deleted ending, الاسم محذوف الآخر al-’ism maḥḏūf al-’āẖir | d | يد yad ‘hand’, سنة sanah ‘year’, and لغة luḡah ‘language’.

6.3 Chapter Summary

This chapter discussed the SALMA Tag Set morphological feature categories and their attribute values.
The SALMA Tag Set captures long-established traditional morphological features of Arabic in a compact yet transparent notation. For a morphologically rich language like Arabic, the part-of-speech tag set should be defined in terms of morphological features characterizing word structure. A detailed description of the SALMA Tag Set explains and illustrates each feature and its possible values. In our analysis, a tag consists of 22 characters; each position represents a feature, and the letter at that position represents a value or attribute of the morphological feature; the dash “-” represents a feature not relevant to a given word. The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between, and reuse of, Arabic taggers and tagged corpora. The SALMA Tag Set has been applied to a sample from the Quranic Arabic Corpus (QAC) to demonstrate that it can be used to annotate Arabic text with a very fine-grained morphological analysis of each morpheme of the corpus words. The next chapter (Chapter 7) discusses the steps in applying the SALMA Tag Set to annotate a sample of 1000 words from the Quranic Arabic Corpus.

Chapter 7
Applying the SALMA – Tag Set

This chapter is based on the following sections of published papers:
Section 3 depends on section 5 from (Sawalha and Atwell Under review)
Sections 4 and 5 are based on sections 3 and 4 from (Sawalha and Atwell 2011c)

Chapter Summary

Morphosyntactic tag sets are evaluated by studying external and internal design criteria. The external design criterion measures whether the tag set can make the linguistic distinctions required by higher-level NLP applications. The internal design criterion evaluates the application of the tag set in tagging a corpus. The SALMA – Tag Set has been validated in two ways.
First, it was validated by proposing it as a standard to the Arabic language computing community, and it has been adopted in several Arabic language processing systems. Second, an empirical approach to evaluating the SALMA – Tag Set of Arabic showed that it can be applied to an Arabic text corpus, by mapping from an existing tag set to the more detailed SALMA Tag Set. The morphological tags of a 1000-word test text, chapter 29 of the Quranic Arabic Corpus, were automatically mapped to SALMA tags. The SALMA – Tag Set and the SALMA – Gold Standard tagged corpus are open-source resources and standards that promote comparability and interoperability of Arabic morphological analyzers and part-of-speech taggers.

7.1 Introduction

The evaluation of morphosyntactic tag sets has been less studied in the literature than the evaluation of morphosyntactic tools (Dejean 2000). Tag sets can be evaluated against external and internal design criteria. The external criterion checks whether the tag set is capable of making the linguistic distinctions required by higher-level NLP applications such as part-of-speech taggers and parsers. The internal criterion evaluates how accurately the tag set can be applied in tagging a corpus (Elworthy 1995; Dejean 2000; Melamed and Resnik 2000; Sharoff et al. 2008; Zeman 2008). One evaluation approach is to modify the tag set (e.g. decreasing its cardinality by omitting certain attributes) and to compare the tagging accuracy obtained with the modified tag set against the accuracy obtained with the original tag set (Dejean 2000; Dzeroski, Erjavec and Zavrel 2000; Melamed and Resnik 2000; Diab 2007).
Another evaluation methodology involves mapping from an existing coarse tag set to a fine-grained tag set and enriching the corpus with linguistically informed knowledge, then measuring the increase in accuracy gained by using the mapped tag set to train part-of-speech tagging systems (Melamed and Resnik 2000; MacKinlay 2005). Dickinson and Jochim (2010) evaluated different tag set mappings and their distributional properties with respect to the external and internal design criteria. Theoretical comparison of tag sets against the specifications and requirements of an application, or of the tagging scheme of a corpus, is also regarded as an evaluation methodology for tag sets (Gopal, Mishra and Singh 2010). However, evaluating a tag set by measuring whether it is useful for a certain application depends on how much information the application needs (Jurafsky and Martin 2008).

Moreover, tag sets are always associated with a certain annotated corpus or annotation system. For instance, the Brown tag set is used in the part-of-speech tagging of the Brown corpus; the C5 tag set is associated with both the CLAWS part-of-speech tagger and the BNC; the Penn Arabic Treebank tag set is used by the Buckwalter morphological analyzer and to part-of-speech tag the Penn Arabic Treebank; and the QAC tag set is used in the morphosyntactic annotation layer of the Quranic Arabic Corpus. Applying a tag set to real-life data or applications, represented by text corpora and part-of-speech taggers, is the validation methodology for tag sets.

Section 7.3 discusses two proposed methodologies for evaluating the SALMA Tag Set: first, proposing the morphosyntactic annotation scheme for use by the wider NLP community; second, tagging a test corpus by mapping from an existing tag set to the SALMA Tag Set.

7.2 Why was Manual Annotation not Applied?
An essential prerequisite to implementing an automatic morphosyntactic analyzer is to try out the tag set manually. Two benefits are gained by trying the tag set manually. First, tag sets which are designed on the basis of the published grammar of the language, rather than by direct reference to data, need to be applied to text to confirm that their categories reflect valid distinctions in the language, and to identify phenomena which are difficult to categorize or intrinsically ambiguous. Second, the manually tagged text provides training data for tagging systems that apply machine learning algorithms, and it provides a gold standard for evaluating morphosyntactic analyzers in general (Hardie 2004).

Due to limitations of time, a lack of funds to hire annotators, and the limited availability of professional annotators, especially in a non-Arabic-speaking country such as the UK where the project is taking place, purely manual annotation of an Arabic corpus was not practical. However, samples of both Classical Quranic Arabic and Modern Standard Arabic (MSA) were morphologically annotated using the SALMA – Tag Set. Section 7.4 and Chapter 9 discuss the construction of the SALMA – Gold Standard.

Moreover, fine-grained distinctions might affect inter-annotator agreement. Hence, measuring inter-annotator agreement and defining clear decision criteria for suitable tags are time-consuming and require major effort. On balance, it was more practical to adapt an existing tagged text. The mapping from the Quranic Arabic Corpus morphological tags to SALMA tags allowed the construction of a gold standard and verified that the SALMA Tag Set is applicable and can be used to enrich Arabic text corpora with fine-grained morphosyntactic information. As future work, applying the SALMA Tag Set to a larger representative Arabic corpus will be of high priority. Chapter 11 discusses this future work project.
7.3 Methodologies for Evaluating the SALMA Tag Set

There are two ways to validate the SALMA Tag Set of Arabic. The first is to propose it as a standard to the Arabic language computing community and have the standard adopted by others. The second, empirical evaluation is to see how readily it can be applied to a sample of Arabic text, for example by mapping from an existing tagged corpus to the SALMA Tag Set.

The SALMA Tag Set has been used in the SALMA Tagger (Sawalha Atwell Leeds Morphological Analysis Tagger). It is used as the standard for specifying the word’s morphemes and for encoding the morphological features of each morpheme (Sawalha and Atwell 2009b; Sawalha and Atwell 2009a). The SALMA Tag Set has been published online (http://www.comp.leeds.ac.uk/sawalha/tagset.html) and has been adopted as a standard by other Arabic language computing researchers. For instance, part of the tag set is also used in the Arabic morphological analyzer and part-of-speech tagger Qutuf (Altabbaa, Al-Zaraee and Shukairy 2010). Qutuf uses the main part-of-speech; the subcategories of nouns; the subcategories of verbs, named verb aspects; the subcategories of particles; the morphological features of gender, number, person, case or mood, definiteness, voice and transitivity; and part of the declension and conjugation category, named perfectness. Qutuf does not use the SALMA tag format. Rather, it uses a tag consisting of one slot per feature, with the slots separated by commas. Another reuse of the SALMA – Tag Set has been reported as a standard for evaluating Arabic morphological analyzers, and for building a Gold Standard for evaluating Arabic morphological analyzers and part-of-speech taggers (Hamada 2010).

The second method for evaluating the SALMA Tag Set is to apply it to a sample of Arabic text, by mapping from an existing broad tag set to the more fine-grained SALMA Tag Set.
Morphologically annotated sample text from the Quranic Arabic Corpus (QAC), chapter 29, consisting of about 1000 words, was selected. Then, an automated mapping algorithm was developed to map the QAC morphological tags to the SALMA tags. After that, the automatically mapped morphological feature tags were manually verified and corrected, to provide a new fine-grained Gold Standard for evaluating Arabic morphological analyzers and part-of-speech taggers. The mapping from the QAC morphological tag set to the SALMA Tag Set was done by the following six-step procedure.

1. Mapping classical to modern character-set: the QAC uses the classical Othmani script of the Qur’an (77,430 words), which was mapped to Modern Standard Arabic (MSA) script (77,797 words).

2. Splitting whole-word tags into morpheme-tags: the morphological tag in the QAC is a whole-word tag, composed by combining the prefix with the stem and suffix morphological tags, while the SALMA Tag Set is designed for word morpheme tagging.

3. Mapping of feature-labels: the mnemonics of the Quranic Arabic Corpus tags were mapped to their equivalents in the SALMA Tag Set.

4. Adjustments to morpheme tokenization: due to differences between the underlying word tokenization model used in the QAC and the one required for the SALMA Tag Set, the mapped tags of the prefixes and suffixes were replaced with SALMA tags by matching them to the clitics and affixes lists used by the SALMA Tagger (Sawalha and Atwell 2009a; Sawalha and Atwell 2010b).

5. Extrapolation of missing fine-grain features: for the morphological features which are not included in the QAC tag set, automatic “feature-guessing” procedures applied linguistic knowledge extracted from traditional Arabic grammar textbooks, encoded as a computational rule-based system, to automatically predict the values of the missing morphological features of the word.

6.
Manual proofreading and correction of the mapped SALMA tags: proofreading and correction is done by an Arabic language expert. The result is a sample Gold Standard annotated corpus for evaluating morphological analyzers and part-of-speech taggers for Arabic text.

Section 7.4 explains the mapping procedures followed to map the QAC morphological tags to the SALMA tags.

7.4 Mapping the Quranic Arabic Corpus (QAC) Morphological Tags to SALMA Tags

The reuse of existing components is an established principle in software engineering. The Quranic Arabic Corpus (QAC) is a newly available resource enriched with multiple layers of annotation including morphological segmentation and part-of-speech tagging (Dukes and Habash 2010). A morphologically annotated test text sample from the QAC, chapter 29, consisting of about 1000 words, was selected. Then, an automated mapping methodology mapped the QAC morphological tags to SALMA morphological feature tags. The mapping from the QAC morphological tags to the SALMA morphological feature tags is done by following a six-step procedure. The following sub-sections describe the mapping steps in detail, highlight the challenges of mapping and show examples of mapping the QAC morphological tags to the SALMA morphological feature tags.

7.4.1 Mapping Classical to Modern Character-Set

The QAC uses the Othmani script of the Qur’an. Most Arabic NLP applications deal with MSA script, and these programs need some modifications to deal with the Othmani script. However, the Qur’an is also available in MSA script. A one-to-one mapping between the Qur’anic words written in Othmani script and the Qur’an written in MSA script can be applied to the QAC except for a few special cases. Such cases exist due to spelling variations between the Othmani script and the MSA script. For instance, the vocative particle يا yā is written connected to the next word in Othmani script, and it is written as a standalone token in MSA script, e.g.
the word يٰموسى yāmūsā ‘O Musa “Moses”!’ is one token in Othmani script, but it is written as two tokens in MSA script as يا موسى yā mūsā ‘O Musa “Moses”!’. Therefore, the QAC has 77,430 words while the Qur’an written in MSA script has 77,797 tokens. Figure 7.1 gives some examples of the spelling variations between the Othmani script and MSA script.

Othmani | Standard Arabic | Meaning
يٰموسى yāmūsā | يا موسى yā mūsā | O Musa (Moses)!
يٰأهل yā’ahla | يا أهل yā ’ahla | O people of
يٰليتني yālaytanī | يا ليتني yā laytanī | I wish if I had
وألو wa’allaw | وأن لو wa’n law | And if not
يٰعيسى yā‘isā | يا عيسى yā ‘isā | O Issa (Jesus)!
يٰقوم yāqawm | يا قوم yā qawm | O people

Figure 7.1 Examples of spelling / tokenization variations between the Othmani script and MSA script

The one-to-one mapping was done automatically. The difference of 375 tokens between the two writing schemes was manually corrected, by grouping two tokens of MSA that match one token of the Othmani script. This grouping is done to preserve the morphological tag of the words. From the previous example, the word يٰموسى yāmūsā ‘O Musa “Moses”!’ has the QAC morphological tag ya+ POS:PN LEM:muwsaY` M NOM, which is mapped to the two tokens يا and موسى yā mūsā ‘O Musa “Moses”!’, and these two tokens are given the same morphological tag, as illustrated in Figure 7.2.

Script | Word | QAC morphological tag
Othmani | يٰموسى | ya+ POS:PN LEM:muwsaY` M NOM
MSA | يا موسى | ya+ POS:PN LEM:muwsaY` M NOM

Figure 7.2 Mapping example, preserving the part-of-speech tag

7.4.2 Splitting Whole-Word Tags into Morpheme-Tags

Tokenizing the word into its morphemes is not an easy task for Arabic words. The tokenization of QAC words into morphemes was done automatically using BAMA. However, there is no resource provided by the QAC that tokenizes the words into their morphemes and assigns the morphological tags for each morpheme.
The given morphological tags are whole-word tags, combining the prefix, stem and suffix morphological components separated by a + sign. So, for our mapping process, the words and their morphological tags were automatically tokenized into morphemes and morpheme tags. Figure 7.3 shows an example of tokenizing a word and its morphological tag into morphemes and morpheme tags.

| Unit | QAC morphological tag |
|---|---|
| Whole word, word no. (16:72:16) | A:INTG+ f:REM+ bi+ Al+ POS:N ACT PCPL LEM:ba`Til ROOT:bTl M GEN |
| Morpheme [1] | A:INTG |
| Morpheme [2] | f:REM |
| Morpheme [3] | Bi |
| Morpheme [4] | Al |
| Morpheme [5] | POS:N ACT PCPL LEM:ba`Til ROOT:bTl M GEN |

Figure 7.3 Example of tokenizing Quranic Arabic Corpus words and their morphological tags into morphemes and their morpheme tags

The QAC has 18,994 word types (Othmani script) and 18,123 different morphological tags. This large number of different morphological tags can be reduced to 1,067 different morpheme tags after dividing the morphological tag of the whole word into morpheme tags and removing the ROOT: and LEM: parts of the QAC morphological tags.

7.4.3 Mapping of Feature-Labels

The third mapping step starts by mapping the mnemonics of the QAC to their equivalents in the SALMA – Tag Set, followed by application of the morphological feature templates that determine the applicable and non-applicable morphological features of the analyzed morphemes. A mapping dictionary was constructed to map the QAC mnemonics that capture the morphological features of the analyzed morphemes to their SALMA Tag Set equivalent attribute values and the attributes’ positions in the SALMA tag string. Figure 7.4 shows part of the dictionary data structure used to map between QAC and SALMA tags.
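The splitting step of section 7.4.2 can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the program used in the thesis: the function name `split_qac_tag` is hypothetical, prefix segments are assumed to be the fields ending in `+` (as in Figure 7.3), and a `PRON:` field is assumed to mark a suffixed pronoun that forms its own morpheme.

```python
def split_qac_tag(word_tag, keep_lexical=False):
    """Split a whole-word QAC tag into a list of morpheme tags (illustrative)."""
    prefixes, stem, suffixes = [], [], []
    for field in word_tag.split():
        # ROOT: and LEM: are lexical fields; dropping them is what reduces
        # the number of distinct morpheme tags.
        if not keep_lexical and field.startswith(("ROOT:", "LEM:")):
            continue
        if field.endswith("+"):
            prefixes.append(field[:-1])   # prefix segment marked by '+'
        elif field.startswith("PRON:"):
            suffixes.append(field)        # assumption: suffixed pronoun
        else:
            stem.append(field)            # everything else describes the stem
    return prefixes + [" ".join(stem)] + suffixes


if __name__ == "__main__":
    tag = "A:INTG+ f:REM+ bi+ Al+ POS:N ACT PCPL LEM:ba`Til ROOT:bTl M GEN"
    for morpheme_tag in split_qac_tag(tag):
        print(morpheme_tag)
```

Passing `keep_lexical=True` keeps the `LEM:` and `ROOT:` fields on the stem tag, which corresponds to the form shown in Figure 7.3.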
The dictionary, consisting of 158 entries, was used via a specialized program that matches the QAC morpheme tags after tokenization, and returns the attribute values and the positions for the mapped features. Then, the attributes are placed in their specified positions in the SALMA tag string.

    {"1FP"     : [(7,'f'),(8,'p'),(9,'f')],   # 1st person / Feminine / Plural
     "1FS"     : [(7,'f'),(8,'s'),(9,'f')],   # 1st person / Feminine / Singular
     "1MP"     : [(7,'m'),(8,'p'),(9,'f')],   # 1st person / Masculine / Plural
     "1P"      : [(8,'p'),(9,'f')],           # 1st person / Plural
     "1S"      : [(8,'s'),(9,'f')],           # 1st person / Singular
     "2D"      : [(8,'d'),(9,'s')],           # 2nd person / Dual
     "2FD"     : [(7,'f'),(8,'d'),(9,'s')],   # 2nd person / Feminine / Dual
     "2MS"     : [(7,'m'),(8,'s'),(9,'s')],   # 2nd person / Masculine / Singular
     "POS:ACC" : [(1,'p'),(4,'o')],           # Accusative particle
     "POS:ADJ" : [(1,'n'),(2,'j')],           # Adjective
     "POS:N"   : [(1,'n')],                   # Noun
     "POS:P"   : [(1,'p'),(4,'p')],           # Preposition
     "POS:V"   : [(1,'v')],                   # Verb
     ... }

Figure 7.4 Part of the dictionary data structure used to map the Quranic Arabic Corpus tag set to the morphological features tag set

The SALMA tag string consists of 22 features. Not all these features are applicable for a given part-of-speech. For instance, gender and number at positions 7 and 8 respectively are noun features, while person and voice at positions 9 and 14 respectively are verb features. The SALMA Tag Set uses ‘-’ to show that the feature in that position is not applicable, and it uses ‘?’ to show that the feature is applicable but its attribute value is not known yet. A matrix of the main and sub parts of speech and their applicable features (or possible attributes) has been constructed and used by the mapping program and the SALMA – Tagger (Sawalha and Atwell 2009b; Sawalha and Atwell 2009a; Sawalha and Atwell 2010b). Chapter 8 discusses the SALMA – Tagger algorithm in detail. The matrix is used as SALMA tag string templates.
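How the dictionary lookup places attributes in the tag string can be sketched in a few lines of Python. This is only an illustration of the mechanism: the entries are copied from Figure 7.4, while the function name `map_morpheme_tag` and the `.` placeholder for "position not decided yet" are assumptions introduced here.

```python
TAG_LENGTH = 22
UNSET = "."  # hypothetical placeholder for a position that is not decided yet

# A few entries copied from Figure 7.4; positions are 1-based.
QAC_TO_SALMA = {
    "1P": [(8, "p"), (9, "f")],
    "2MS": [(7, "m"), (8, "s"), (9, "s")],
    "POS:ADJ": [(1, "n"), (2, "j")],
    "POS:N": [(1, "n")],
    "POS:P": [(1, "p"), (4, "p")],
    "POS:V": [(1, "v")],
}


def map_morpheme_tag(qac_morpheme_tag, dictionary=QAC_TO_SALMA):
    """Return (initial SALMA tag string, list of mnemonics that had no entry)."""
    tag = [UNSET] * TAG_LENGTH
    unmapped = []
    for mnemonic in qac_morpheme_tag.split():
        entries = dictionary.get(mnemonic)
        if entries is None:
            unmapped.append(mnemonic)      # left for later steps or manual review
            continue
        for position, value in entries:
            tag[position - 1] = value      # convert 1-based position to index
    return "".join(tag), unmapped


if __name__ == "__main__":
    print(map_morpheme_tag("POS:V 2MS"))
    print(map_morpheme_tag("POS:N GEN"))
```

Returning the unmapped mnemonics separately keeps the lookup itself free of guesses; positions still holding the placeholder are the ones a template would have to fill with ‘-’ or ‘?’.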
For each main or sub part-of-speech there is a template that shows the applicable and non-applicable morphological features. The main part of speech and some of the sub part of speech categories are already marked in the initially mapped tag. A string, formed by grouping the attributes of the first 6 positions of the initial SALMA tag string representing the main and the sub part of speech categories, is used as a key to search the templates dictionary that stores the SALMA tag templates. These templates are used to add ‘-’, ‘?’ or any other specified attributes to the initially mapped tag string. Figure 7.5 shows a sample of SALMA tag templates.

    {'n?----' : 'n?----??-????---????-?',   # Noun
     'v-?---' : 'v-?-----????-????????-',   # Verb
     'p--?--' : 'p--?-----????---?-----',   # Particle
     'r---?-' : 'r---?-??????????------',   # Residual
     'u----?' : 'u----?----------------',   # Punctuation
     'ng----' : 'ng----??-v???---?d??-?',   # Gerund
     'np----' : 'np----???s-??---?ns---',   # Pronoun
     'v-p---' : 'v-p-----?s-?-?m??????-',   # Past verb
     'v-c---' : 'v-c-----?d??-????????-',   # Present verb
     'v-i---' : 'v-i-----?s-?-a???????-',   # Imperative verb
     'p--p--' : 'p--p-----s-?-----n----',   # Preposition
     'p--a--' : 'p--a-----s-?-----n----',   # Annular
     'p--c--' : 'p--c-----s-?-----n----',   # Conjunction
     'r---r-' : 'r---r-???s-?----------',   # Connected pronoun
     'r---t-' : 'r---t-fs-s-?----------',   # tā' Marbouta
     'r---d-' : 'r---d-------d---------',   # Definite article
     'u----s' : 'u----s----------------',   # Full stop
     'u----c' : 'u----c----------------',   # Comma
     'u----n' : 'u----n----------------',   # Colon
     ... }

Figure 7.5 A sample of the morphological features tag templates

7.4.4 Adjustments to Morpheme Tokenization

Due to the differences between the underlying word’s morpheme tokenization models used in the QAC and the one required for the SALMA – Tag Set, adjustment to morpheme tokenization is required. The fine-grained SALMA – Tagger divides the word into five parts: proclitic(s), prefix(es), stem, suffix(es) and enclitic(s).
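The template lookup described in section 7.4.3 can be sketched as follows. The three templates are copied from Figure 7.5; everything else is an assumption made for illustration: the function name `apply_template`, the `.` placeholder for undecided positions, and the choice to read undecided positions among the first six as ‘-’ when building the key. The templates actually used by the mapping program may differ from this sample.

```python
UNSET = "."  # hypothetical placeholder for positions the dictionary did not fill

# Templates copied from Figure 7.5, keyed by the first six tag positions.
TEMPLATES = {
    "v-p---": "v-p-----?s-?-?m??????-",   # Past verb
    "v-c---": "v-c-----?d??-????????-",   # Present verb
    "p--p--": "p--p-----s-?-----n----",   # Preposition
}


def apply_template(initial_tag, templates=TEMPLATES):
    """Fill undecided positions of an initial tag from its part-of-speech template."""
    # Assumption: undecided positions in the part-of-speech slots count as '-'.
    key = "".join("-" if c == UNSET else c for c in initial_tag[:6])
    template = templates[key]              # raises KeyError for an unknown key
    # Keep every mapped attribute; take '-', '?' or a fixed value from the template.
    return "".join(t if c == UNSET else c for c, t in zip(initial_tag, template))


if __name__ == "__main__":
    initial = "v.p...mst" + UNSET * 13     # main POS, sub POS, gender, number, person
    print(apply_template(initial))
```

Mapped attributes take priority over the template in this sketch, so the template only ever supplies values for positions that the dictionary lookup left open.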
A word can carry more than one clitic or affix of each kind. The underlying word’s morpheme tokenization model used by the QAC is inherited from BAMA. So, the SALMA – Tagger is used to tokenize the words into morphemes and to assign the morpheme tag by matching the clitic and affix morphemes of the analyzed words with the clitics and affixes from the clitics and affixes dictionaries of the SALMA – Tagger.

The clitics and affixes dictionaries contain detailed information about proclitic and prefix combinations and suffix and enclitic combinations. This information includes suitable SALMA tags and three information labels that help in matching the correct combination of proclitics and prefixes on one side with the suffixes and enclitics on the other side. The first label [proc, pref, suf, enc] indicates whether the clitic or affix is a proclitic, prefix, suffix or enclitic respectively. The second label [n, v, x] represents the main part-of-speech of the stem morpheme, and indicates whether the clitic or affix belongs to nouns, verbs or both. The final label [y, n] indicates whether the clitic or affix is part of the pattern or not. This information is useful for pattern generator and lemmatizer programs. The construction and the properties of the clitics and affixes dictionaries are discussed in more detail in chapter 8. The SALMA – Tagger selects the clitic and affix combinations that match this information and match the main part of speech of the stem. Figure 7.6 shows examples from the clitics and affixes lists. Figure 7.7 shows a sample of the mapped morphological features tags after applying step 4.

Proclitics and prefixes list, for the word walaya‘lamanna “And he will surely make evident”:

| No. | Morpheme | SALMA tag | Type | Stem POS | Part of pattern | Description |
|---|---|---|---|---|---|---|
| 1 | wa | p--c------------------ | proc | x | n | Conjunction |
| 2 | la | p--z-----s-f---------- | proc | v | n | Emphatic particle |
| 3 | ya | r---a----------------- | pref | v | y | Imperfect prefix |

Suffixes and enclitics list, for the word wataṭbῑqātihā “And its applications”:

| No. | Morpheme | SALMA tag | Type | Stem POS | Part of pattern | Description |
|---|---|---|---|---|---|---|
| 1 | āti | r---l-fp-------------- | suf | n | y | Feminine sound plural letters |
| 2 | hā | r---r-fsts-s---------- | enc | x | n | Suffixed pronoun |

Figure 7.6 Examples of the clitics and affixes lists

| QAC morpheme tag | SALMA tag after the 4th step |
|---|---|
| POS:INL | p--?-----????---?----- |
| A:INTG+ | p--i-----s------------ |
| POS:V PERF 3MS | v-p---mst--?-?-??????- |
| Al+ | r---d----------------- |
| POS:N MP NOM | n?----mp-?n??---????-? |
| POS:SUB | p--g-------?---------- |
| NULL | r---a----------------- |
| POS:V IMPF PASS 3MP MOOD:SUBJ | v-c---mptda?-p???????- |
| PRON:3MP | r---r-mptsnw---------- |
| POS:SUB | p--g-------?---------- |
| NULL | r---a----------------- |
| POS:V IMPF 3MP MOOD:SUBJ | v-c---mptda?-????????- |
| PRON:3MP | r---r-mptsnw---------- |
| POS:V PERF (IV) 1MP | v-p---mpf--?-?-??????- |
| PRON:1MP | r---r-xpfs??---------- |
| wa+ | p--c------------------ |
| POS:PRON 3MP | np----mpt--??---?----- |
| POS:NEG | p--n-------?---------- |
| NULL | r---a----------------- |
| POS:V IMPF PASS 3MP | v-c---mpt-??-p???????- |
| PRON:3MP | r---r-mp?snn---------- |

Figure 7.7 A sample of the mapped SALMA tags after applying mapping steps 1 to 4

After applying the four-step mapping procedure to a sample of 1000 words, chapter 29 of the Qur’an, the success rate in mapping each morphological feature category was computed by comparing with the final version after proofreading. Table 7.1 shows how successful the mapping was for each individual target feature. Full mapping was done for the main part-of-speech and sub part of speech categories, with a success rate of nearly 100% except for noun sub-categories of which only about 50% were mapped successfully.
The morphological categories of gender, number, person, inflectional morphology and case or mood were mapped with a success rate of 68% to 89%. Case and mood marks, definiteness, voice, emphasized and non-emphasized, and declension and conjugation were poorly mapped with a success rate of 5% to 17%. Transitivity, rational, unaugmented and augmented, number of root letters, verb root and noun finals were not mapped at all, because these morphological features do not exist in the QAC tag set.

Table 7.1 The mapping success rate after applying the first four mapping steps

| No. | Category | ‘?’ | ‘-’ | Applicable | Not mapped | Mapped |
|---|---|---|---|---|---|---|
| 1 | Main Part-of-Speech | 16 | 0 | 1935 | 0.83% | 99.17% |
| 2 | Part-of-Speech: Noun | 247 | 1435 | 500 | 49.40% | 50.60% |
| 3 | Part-of-Speech: Verb | 0 | 1675 | 260 | 0.00% | 100.00% |
| 4 | Part-of-Speech: Particle | 31 | 1424 | 511 | 6.07% | 93.93% |
| 5 | Part-of-Speech: Other | 0 | 1287 | 648 | 0.00% | 100.00% |
| 6 | Punctuation marks | 0 | 1935 | 0 | 0.00% | 100.00% |
| 7 | Gender | 125 | 785 | 1150 | 10.87% | 89.13% |
| 8 | Number | 244 | 847 | 1088 | 22.43% | 77.57% |
| 9 | Person | 103 | 1267 | 668 | 15.42% | 84.58% |
| 10 | Inflectional morphology | 85 | 1141 | 794 | 10.71% | 89.29% |
| 11 | Case and Mood | 280 | 1043 | 892 | 31.39% | 68.61% |
| 12 | Case and Mood marks | 1120 | 581 | 1354 | 82.72% | 17.28% |
| 13 | Definiteness | 402 | 1467 | 468 | 85.90% | 14.10% |
| 14 | Voice | 220 | 1698 | 237 | 92.83% | 7.17% |
| 15 | Emphasized and non-emphasized | 114 | 1805 | 130 | 87.69% | 12.31% |
| 16 | Transitivity | 260 | 1675 | 260 | 100.00% | 0.00% |
| 17 | Rational | 712 | 1223 | 712 | 100.00% | 0.00% |
| 18 | Declension and Conjugation | 482 | 1428 | 507 | 95.07% | 4.93% |
| 19 | Unaugmented and Augmented | 603 | 1332 | 603 | 100.00% | 0.00% |
| 20 | Number of root letters | 654 | 1281 | 654 | 100.00% | 0.00% |
| 21 | Verb root | 260 | 1675 | 260 | 100.00% | 0.00% |
| 22 | Nouns finals | 394 | 1541 | 394 | 100.00% | 0.00% |

7.4.5 Extrapolation of Missing Fine-Grain Features

As previously discussed, the SALMA – Tag Set is a fine-grained tag set that captures 22 morphological features in the tag string.
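The percentages in Table 7.1 appear to follow directly from the counts of ‘?’ and ‘-’ at each tag position: the applicable count is the number of morphemes minus the ‘-’ count, and the not-mapped rate is the ‘?’ count divided by the applicable count (for example, 247 / 500 = 49.40% for noun sub-categories). The Python sketch below reproduces that arithmetic; the function name `position_stats` is hypothetical, and treating a position with no applicable cases as fully mapped is an assumption taken from the punctuation row of the table.

```python
def position_stats(tags, position):
    """Count '?' and '-' at a 1-based tag position and derive the mapping rates."""
    values = [tag[position - 1] for tag in tags]
    unknown = values.count("?")            # applicable, but value not known yet
    not_applicable = values.count("-")
    applicable = len(values) - not_applicable
    # Assumption: with no applicable cases, nothing is counted as not mapped.
    not_mapped = unknown / applicable if applicable else 0.0
    return {
        "?": unknown,
        "-": not_applicable,
        "applicable": applicable,
        "not_mapped": not_mapped,
        "mapped": 1.0 - not_mapped,
    }


if __name__ == "__main__":
    # Three tags taken from Figure 7.7 (after mapping steps 1 to 4).
    sample = [
        "v-p---mst--?-?-??????-",
        "r---d-----------------",
        "n?----mp-?n??---????-?",
    ]
    for pos in (2, 7, 12):
        print(pos, position_stats(sample, pos))
```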
As shown in table 7.1 above, some of these morphological features are poorly mapped, such as case and mood marks; definiteness; voice; emphasized and non-emphasized; and declension and conjugation; while others are not mapped because they are not represented by the QAC morphological tag set. The non-mapped features are: transitivity; rational; unaugmented and augmented; number of root letters; verb root; and types of nouns according to their final letters.

The morphological features which are not included in the QAC tag set are automatically guessed using the SALMA – Tagger. The SALMA – Tagger has specialized procedures that apply the linguistic knowledge extracted from traditional Arabic grammar books as a computational rule-based system to automatically guess the value of the remaining morphological features of the word’s morphemes. Chapter 8 discusses these procedures in detail.

A rule-based approach was used for morphological analysis of the 22 morphological features. Rules were extracted from traditional Arabic grammar books. Then, these rules were programmed and integrated into the SALMA – Tagger to predict the morphological feature values of each morpheme of the analyzed word. The rules depend on the structure of the analyzed words and their morphemes to predict the value of a given category. For instance, if the analyzed word has a prefix yā and a suffixed pronoun ūna, then the appropriate tag of the person category is ‘t’, representing third person, and the guessed values of the subject’s number and gender are ‘p’ and ‘m’, representing plural and masculine respectively. The rules also depend on linguistic lists for the features that are hard to predict from the structure of the analyzed words.
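The prefix and suffix rule quoted above can be written as a small rule table. The Python sketch below is a minimal illustration and not the SALMA – Tagger rule base: only the single rule stated in the text is included, the names `RULES` and `apply_rules` are hypothetical, morphemes are represented by their transliterations, and the sketch fills a position only when it still holds ‘?’.

```python
# Each rule: (prefix, suffixed pronoun, {1-based tag position: value}).
# Positions follow the SALMA tag string: 7 = gender, 8 = number, 9 = person.
RULES = [
    ("yā", "ūna", {7: "m", 8: "p", 9: "t"}),   # the one rule given in the text
]


def apply_rules(tag, prefix, suffix, rules=RULES):
    """Fill unknown ('?') positions of a tag using structure-based rules."""
    chars = list(tag)
    for rule_prefix, rule_suffix, features in rules:
        if prefix == rule_prefix and suffix == rule_suffix:
            for position, value in features.items():
                if chars[position - 1] == "?":   # never overwrite a known value
                    chars[position - 1] = value
    return "".join(chars)


if __name__ == "__main__":
    stem_tag = "v-c---???d??-????????-"          # illustrative imperfect-verb tag
    print(apply_rules(stem_tag, "yā", "ūna"))
    print(apply_rules(stem_tag, "ta", "ūna"))    # no rule matches: unchanged
```

A fuller system would also consult the linguistic lists mentioned in the text (broken plurals, transitivity lists and so on) before or after such structural rules.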
The SALMA – Tagger has linguistic lists such as a broken plural list to predict the number feature of nouns; a list of doubly transitive verbs and a list of triply transitive verbs to predict the values of the transitivity feature; and lists of verbs restricted to the perfect, restricted to the imperfect, restricted to the imperative, and partially conjugated verbs, which are used to guess the values of the declension and conjugation morphological feature. Table 7.1 showed that the mapping percentage for these morphological features after applying the first four mapping steps is less than 20%, and most of them have 0% mapping.

These procedures are also used to verify the already mapped morphological features such as number, gender, person and case or mood. After applying these rule-based procedures the mapping success rate increased and reached 83% to 100% for most of the morphological features. Table 7.2 shows the mapping success rate after applying the fifth mapping step, the rule-based system for morphological analysis.

Table 7.2 The mapping success rate after applying the fifth mapping step

| No. | Category | ‘?’ | ‘-’ | Applicable | Not mapped | Mapped |
|---|---|---|---|---|---|---|
| 1 | Main Part-of-Speech | 0 | 0 | 1935 | 0.00% | 100.00% |
| 2 | Part-of-Speech: Noun | 247 | 478 | 1457 | 16.95% | 83.05% |
| 3 | Part-of-Speech: Verb | 0 | 716 | 1219 | 0.00% | 100.00% |
| 4 | Part-of-Speech: Particle | 26 | 758 | 1177 | 2.21% | 97.79% |
| 5 | Part-of-Speech: Other | 0 | 976 | 959 | 0.00% | 100.00% |
| 6 | Punctuation marks | 0 | 976 | 959 | 0.00% | 100.00% |
| 7 | Gender | 123 | 219 | 1716 | 7.17% | 92.83% |
| 8 | Number | 305 | 218 | 1717 | 17.76% | 82.24% |
| 9 | Person | 0 | 673 | 1262 | 0.00% | 100.00% |
| 10 | Inflectional morphology | 0 | 0 | 1935 | 0.00% | 100.00% |
| 11 | Case and Mood | 250 | 241 | 1694 | 14.76% | 85.24% |
| 12 | Case and Mood marks | 262 | 0 | 1935 | 13.54% | 86.46% |
| 13 | Definiteness | 0 | 478 | 1457 | 0.00% | 100.00% |
| 14 | Voice | 0 | 716 | 1219 | 0.00% | 100.00% |
| 15 | Emphasized and non-emphasized | 0 | 716 | 1219 | 0.00% | 100.00% |
| 16 | Transitivity | 0 | 716 | 1219 | 0.00% | 100.00% |
| 17 | Rational | 0 | 218 | 1717 | 0.00% | 100.00% |
| 18 | Declension and Conjugation | 0 | 218 | 1717 | 0.00% | 100.00% |
| 19 | Unaugmented and Augmented | 0 | 346 | 1589 | 0.00% | 100.00% |
| 20 | Number of root letters | 0 | 336 | 1599 | 0.00% | 100.00% |
| 21 | Verb root | 0 | 721 | 1214 | 0.00% | 100.00% |
| 22 | Nouns finals | 121 | 478 | 1457 | 8.30% | 91.70% |

7.4.6 Manual proofreading and correction of the mapped SALMA tags

I manually proofread and corrected the mapped morphological features tags. The result of correcting the automatically mapped morphological features tags is a sample gold standard for evaluating morphological analyzers and part-of-speech taggers for Arabic text. Constructing the gold standard for evaluating morphological analyzers is one of the objectives of evaluating the SALMA – Tag Set. The gold standard is stored in different formats and published online (footnote 54) to allow the wider Arabic NLP community to use it in evaluating morphosyntactic systems for Arabic. Chapter 9 discusses in detail the construction and the specifications of the SALMA – Gold Standard. Figure 7.8 shows an example of mapping from the QAC into SALMA tags: the results after applying steps 1 to 4, the results after applying step 5 and the results after manually correcting the tags.
Footnote 54: The SALMA Gold Standard http://www.comp.leeds.ac.uk/sawalha/goldstandard.html

| QAC morpheme tag | SALMA tag after mapping steps 1-4 | SALMA tag after mapping step 5 | Corrected SALMA tag |
|---|---|---|---|
| POS:INL | p--?-----????---?----- | p--?-----s-s---------- | p--b-----s-s---------- |
| A:INTG+ | p--i-----s------------ | p--i-----s------------ | p--i-----s------------ |
| POS:V PERF 3MS | v-p---mst--?-?-??????- | v-p---msts-f-ambhvsta- | v-p---msts-f-amohvsta- |
| Al+ | r---d----------------- | r---d----------------- | r---d----------------- |
| POS:N MP NOM | n?----mp-?n??---????-? | n?----mp-vndd---ndst-s | n#----mj-vndd---hdst-s |
| POS:SUB | p--g-------?---------- | p--g-----s-s---------- | p--g-----s-s---------- |
| NULL | r---a----------------- | r---a----------------- | r---a----------------- |
| POS:V IMPF PASS 3MP MOOD:SUBJ | v-c---mptda?-p???????- | v-c---mptdao-pmbhvtta- | v-c---mptdao-pmohvtta- |
| PRON:3MP | r---r-mptsnw---------- | r---r-mptsnw---------- | r---r-mpts-s---------- |
| POS:SUB | p--g-------?---------- | p--g-----s-s---------- | p--g-----s-s---------- |
| NULL | r---a----------------- | r---a----------------- | r---a----------------- |
| POS:V IMPF 3MP MOOD:SUBJ | v-c---mptda?-????????- | v-c---mptdao-amohvtto- | v-c---mptdao-amohvtto- |
| PRON:3MP | r---r-mptsnw---------- | r---r-mptsnw---------- | r---r-mpts-s---------- |
| POS:V PERF (IV) 1MP | v-p---mpf--?-?-??????- | v-p---mpfs-s-amohvttc- | v-p---mpfs-s-amohvttc- |
| PRON:1MP | r---r-xpfs??---------- | r---r-xpfs??---------- | r---r-xpfs-s---------- |
| wa+ | p--c------------------ | p--c------------------ | p--c-----s-f---------- |
| POS:PRON 3MP | np----mpt--??---?----- | np----mpts-si---hn---? | np----mpts-si---hn---- |
| POS:NEG | p--n-------?---------- | p--n-----s-s---------- | p--n-----s-s---------- |
| NULL | r---a----------------- | r---a----------------- | r---a----------------- |
| POS:V IMPF PASS 3MP | v-c---mpt-??-p???????- | v-c---mptdnn-pmohvtta- | v-c---mptdnn-pmohvtta- |
| PRON:3MP | r---r-mp?snn---------- | r---r-mp?snn---------- | r---r-mpts-f---------- |

Figure 7.8 A sample of the QAC tags and their mapped SALMA tags after applying the mapping procedure’s steps 1-4, step 5 and manually correcting the tags.

7.5 Evaluation of the Mapping Process

The correction process of the automatically mapped tags involves correcting the individual morphological feature category tags of each morpheme. This process specifies whether a morphological feature category is applicable or not. If it is applicable, the automatically mapped attribute is checked and corrected. Otherwise, if it is not applicable and the mapped tag is not “-”, the correction replaces the attribute by “-”. During the correction process, the following types of correction were observed.

• Changing the automatic tag from “-” to the correct tag of a certain morphological feature attribute.
• Changing the automatic tag from “?” to the correct tag of a certain morphological feature attribute.
• Changing an automatic tag which is not “-” or “?” to the correct tag of a certain morphological feature attribute.
• Changing the automatic tag from “?” to “-” where a given morphological feature is not applicable to a given morpheme.
• Changing an automatic tag which is not “-” or “?” to “-” where a given morphological feature is not applicable to a given morpheme.

Depending on the above observed correction types and the standard definitions of accuracy metrics (footnote 55), the rules for measuring the accuracy of the mapping process were inferred.
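The five correction types listed above can be recognised mechanically by comparing an automatically mapped tag with its corrected version position by position. The Python sketch below is an illustration only; the function names and the type labels are hypothetical, and the example tags are taken from Figure 7.8.

```python
from collections import Counter


def correction_type(automatic, corrected):
    """Name the correction applied to one tag position (labels are illustrative)."""
    if automatic == corrected:
        return "unchanged"
    if corrected == "-":                       # feature turned out not applicable
        return "'?' to '-'" if automatic == "?" else "value to '-'"
    if automatic == "-":
        return "'-' to value"
    if automatic == "?":
        return "'?' to value"
    return "value to value"


def count_corrections(automatic_tag, corrected_tag):
    """Count correction types over the 22 positions of one morpheme tag."""
    return Counter(
        correction_type(a, c) for a, c in zip(automatic_tag, corrected_tag)
    )


if __name__ == "__main__":
    # Step-5 tag and corrected tag of the POS:N MP NOM row in Figure 7.8.
    print(count_corrections("n?----mp-vndd---ndst-s", "n#----mj-vndd---hdst-s"))
```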
The following classifications of the different cases of the corrected SALMA tags are used as the basis for measuring the accuracy of the mapping process.

• TN: True and not applicable; case was not applicable and predicted not applicable.
• TP: True and applicable; case was applicable and predicted correctly.
• FN: False and not applicable; case was not applicable and predicted applicable.
• FP: False and applicable; case was applicable and predicted not applicable.

The accuracy metrics of the automatically mapped tags are based on the above observations to calculate the recall, precision and accuracy. Accuracy is the percentage of predictions that were correct. Formula [2] illustrates the computation of accuracy.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

Recall is defined as the percentage of applicable cases that are correctly mapped from the mapped cases. Formula [3] illustrates the computation of recall.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

Precision is defined as the percentage of the applicable cases which are correctly predicted from the total number of the applicable cases. Formula [4] illustrates the computation of precision.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

Table 7.3 shows accuracy, recall and precision after applying the first four mapping steps and after applying the fifth mapping step. Figures 7.9, 7.10 and 7.11 show the increase in accuracy, recall and precision after using the procedures of linguistic rules for mapping the QAC morphological tags to the SALMA tags.
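The classification and the three formulas can be combined in a short Python sketch. It follows the TN, TP, FN and FP definitions exactly as worded in this section (which label the two error cases differently from some textbooks). Two points are assumptions: any wrong value for an applicable case, including ‘?’, is counted as FP, and a metric whose denominator is zero is returned as `None`, in the way Table 7.3 leaves recall and precision blank for punctuation marks. The function names are hypothetical.

```python
from collections import Counter


def classify(predicted, gold):
    """Classify one (predicted, corrected) attribute pair as TN, TP, FN or FP."""
    if gold == "-":                                # case is not applicable
        return "TN" if predicted == "-" else "FN"
    return "TP" if predicted == gold else "FP"     # case is applicable


def mapping_metrics(pairs):
    """Compute accuracy, recall and precision following formulas (2) to (4)."""
    counts = Counter(classify(p, g) for p, g in pairs)
    tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
    }


if __name__ == "__main__":
    # (predicted, corrected) values of one feature position over four morphemes.
    print(mapping_metrics([("m", "m"), ("-", "-"), ("?", "f"), ("s", "-")]))
```

To evaluate one feature category, the pairs would be the characters at that tag position in the automatically mapped tags and in the corrected gold-standard tags.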
Footnote 55: Standard definition of Recall and Precision http://en.wikipedia.org/wiki/Recall_and_precision

Table 7.3 Accuracy, recall and precision of the mapping procedure after steps 4 and 5

| Category | Steps 1-4: Accuracy | Steps 1-4: Recall | Steps 1-4: Precision | Steps 1-5: Accuracy | Steps 1-5: Recall | Steps 1-5: Precision |
|---|---|---|---|---|---|---|
| Main part-of-speech | 72.30% | 100.00% | 72.30% | 97.99% | 99.43% | 97.99% |
| Part-of-speech: Noun | 58.96% | 99.16% | 46.81% | 86.15% | 99.16% | 46.81% |
| Part-of-speech: Verb | 87.18% | 99.62% | 99.62% | 99.95% | 99.62% | 99.62% |
| Part-of-speech: Particle | 83.73% | 100.00% | 88.37% | 96.24% | 98.03% | 86.63% |
| Part-of-speech: Other | 72.45% | 30.84% | 19.31% | 94.90% | 95.50% | 86.43% |
| Punctuation marks | 100.00% | - | - | 100.00% | - | - |
| Gender | 71.11% | 100.00% | 79.11% | 89.03% | 97.66% | 88.72% |
| Number | 63.13% | 100.00% | 64.82% | 79.09% | 97.09% | 70.91% |
| Person | 79.40% | 100.00% | 96.23% | 94.28% | 96.11% | 89.02% |
| Inflection | 15.65% | 100.00% | 22.04% | 88.47% | 95.30% | 86.73% |
| Case and Mood | 18.54% | 100.00% | 75.31% | 79.71% | 99.56% | 94.98% |
| Case and Mood marks | 0.41% | 100.00% | 0.58% | 74.25% | 94.20% | 66.11% |
| Definiteness | 16.68% | 100.00% | 12.96% | 96.40% | 100% | 88.46% |
| Voice | 67.97% | 100.00% | 5.38% | 98.61% | 100% | 89.62% |
| Emphasis | 68.07% | 100.00% | 6.15% | 99.95% | 100% | 99.62% |
| Transitivity | 67.25% | 0.00% | 0.00% | 99.69% | 100% | 98.45% |
| Rationality | 6.59% | 0.00% | 0.00% | 94.34% | 100% | 86.68% |
| Declension and conjugation | 34.65% | 95.65% | 2.89% | 90.11% | 99.83% | 75.03% |
| Unaugmented and augmented | 33.37% | 0.00% | 0.00% | 95.21% | 98.56% | 86.19% |
| Number of root letters | 33.42% | 0.00% | 0.00% | 99.74% | 100% | 100% |
| Verb root | 73.84% | 0.00% | 0.00% | 100.00% | 100% | 100% |
| Noun finals | 46.96% | 0.00% | 0.00% | 93.31% | 100% | 97.64% |

Figure 7.9 Accuracy of mapping after step 4 and step 5 of mapping QAC to SALMA tags

Figure 7.10 Recall of mapping after step 4 and step 5 of mapping QAC to SALMA tags

Figure 7.11 Precision of mapping after step 4 and step 5 of mapping QAC to SALMA tags

7.6 Discussion of Evaluation of the SALMA Tag Set

Arabic has a complex morphology and fine-grain tag assignment is significantly challenging.
Arabic words should be decomposed into five parts: proclitics, prefixes, stem or root, suffixes and enclitics. The morphological analyzer should add appropriate linguistic information to each of these parts of the word. Instead of a tag for the whole word, sub-tags are required for each part. More detailed morphological feature information that describes each part of the word is generally more useful.

The software engineering principle of reuse was applied to build a morphologically tagged corpus enriched with detailed analysis of each word’s morphemes, by recycling an existing morphologically tagged corpus, the Quranic Arabic Corpus (QAC). This chapter demonstrated that this resource can be reused and enriched with detailed analysis by mapping the existing morphological analysis of a sample chapter of the QAC to the detailed morphological analysis using the SALMA – Tag Set and the SALMA – Tagger. This empirical study followed a 6-step procedure which involves direct mapping of the existing features and building a rule-based system which depends on the linguistic knowledge extracted from traditional Arabic grammar books.

A measure of accuracy is “exact match”. The exact-match rate for the test sample, where all 22 features of a morpheme’s whole tag are predicted correctly, is 53.5%, but some of the errors were very minor, such as replacing one ‘?’ by ‘-’. The error rate of individual features was 2.01% for main part of speech, between 3% and 15% for morphological features coded in the QAC tags, and between 2% and 24% for features which do not exist in the QAC tags but can be automatically guessed. Given that 22 morphological feature categories are used for each morpheme, which increases the potential for annotation mistakes, this result demonstrates that reusing and enriching an existing resource with more detailed morphological feature information is feasible and can provide tagged Arabic corpora with fine-grain analysis.
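The “exact match” measure and the remark about minor errors can be illustrated with a short Python sketch. It is not the evaluation script used for the reported 53.5%; the function names are hypothetical, and the sample tags are three step-5 and corrected rows from Figure 7.8.

```python
def exact_match_rate(predicted_tags, gold_tags):
    """Fraction of morpheme tags whose 22 positions all agree with the gold tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("tag lists must have the same length")
    if not gold_tags:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return hits / len(gold_tags)


def differing_positions(predicted, gold):
    """1-based positions at which two tags disagree (shows how minor an error is)."""
    return [i + 1 for i, (p, g) in enumerate(zip(predicted, gold)) if p != g]


if __name__ == "__main__":
    step5 = ["p--i-----s------------", "v-p---msts-f-ambhvsta-", "r---d-----------------"]
    gold = ["p--i-----s------------", "v-p---msts-f-amohvsta-", "r---d-----------------"]
    print(exact_match_rate(step5, gold))
    print(differing_positions(step5[1], gold[1]))
```

A tag that differs from the gold standard in a single position counts as a complete miss under exact match, which is why the per-feature error rates are much lower than the whole-tag error rate.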
7.7 Conclusions and Summary

A range of Arabic Part-of-Speech taggers exist, each with a different tag set. The existing tag sets for Arabic were illustrated and compared, and this comparison suggests the need for a common standard to simplify and promote comparisons and sharing of resources. Generic design criteria for corpus tag sets were reviewed in chapter 5. Some of these principles have been applied in existing tag sets, but there is still room for improvement in the design of a theory-neutral standard tag set for Arabic Part-of-Speech taggers and tagged corpora.

The SALMA – Tag Set captures long-established traditional morphological features of Arabic, in a compact yet transparent notation. A tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash ‘-’ represents a feature not relevant to a given word. The SALMA – Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora. The SALMA – Tag Set design decisions were presented in chapter 6.

The SALMA – Tag Set has been validated in two ways. First, it was validated by proposing it as a standard to the Arabic language computing community, and it has been adopted in Arabic language processing systems. The SALMA – Tag Set has been used in the SALMA – Tagger to encode the morphological features of each morpheme (Sawalha and Atwell 2009a; Sawalha and Atwell 2010b). Parts of the SALMA – Tag Set were also used in the Arabic morphological analyzer and part-of-speech tagger Qutuf (Altabbaa et al. 2010). Moreover, the SALMA – Tag Set has been reported as a standard for evaluating morphological analyzers for Arabic text and for building a gold standard for evaluating morphological analyzers and part of speech taggers for Arabic text (Hamada 2010).
Second, an empirical approach to evaluating the SALMA – Tag Set of Arabic showed that it can be applied to an Arabic text corpus, by mapping from an existing tag set to the more detailed SALMA – Tag Set. The morphological tags of a 1000-word test text, chapter 29 of the Quranic Arabic Corpus, were automatically mapped to SALMA tags. Then, the mapped tags were proofread and corrected. The result of mapping and correction of the SALMA tagging of this corpus is a new Gold Standard for evaluating Arabic morphological analyzers and part-of-speech taggers with a detailed fine-grain description of the morphological features of each morpheme, encoded using SALMA tags. We invite other Arabic language computing researchers to take up the SALMA – Tag Set and the SALMA – Gold Standard tagged corpus, to promote comparability and interoperability of Arabic morphological analyzers and Part-of-Speech taggers.

Part IV: Tools and Applications for Arabic Morphological Analysis

Chapter 8 The SALMA Tagger for Arabic Text

This chapter is based on the following sections of published papers:
Section 3 is expanded from section 2 in (Sawalha and Atwell 2009b) and section 3.2 in (Sawalha and Atwell 2009a)
Section 5 is based on section 3 in (Sawalha and Atwell 2010b)

Chapter summary

Morphological analyzers and part-of-speech taggers are key technologies for most text analysis applications. The main aim of this thesis is to develop a morphosyntactic tagger for annotating a wide range of Arabic text formats, domains and genres including both vowelized and non-vowelized text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications. We foresee the advantage of enriching the text with part-of-speech tags of very fine-grained grammatical distinctions, which reflect expert interest in syntax and morphology, but not specific needs of end-users, because end-user applications are not known in advance.
This chapter describes the fine-grained Arabic morphological analyzer algorithm, the SALMA – Tagger. The SALMA – Tagger adheres to an agreed standard of the ALECSO/KACST initiative for designing and evaluating morphological analyzers for Arabic text. The SALMA – Tagger is enriched with dictionaries: the SALMA – ABCLexicon, pre-stored lists of clitics and affixes, roots, a patterns dictionary, a function words list, and other linguistic lists such as a broken plural list and a proper noun list. The SALMA – Tagger combines sophisticated modules that break the complex morphological analysis problem down into achievable tasks, each of which addresses a particular problem and also constitutes a stand-alone unit. These modules are: the SALMA – Tokenizer, the SALMA – Lemmatizer and Stemmer, the SALMA – Pattern Generator, the SALMA – Vowelizer and the SALMA – Tagger module. These modules are useful as stand-alone tools which users can select and/or customise for their own applications.

8.1 Introduction

A morphological analyzer is a program which analyzes words. It extracts the root from the derived word and/or generates all possible words from a certain root. It analyzes the word into morphemes by dividing the word into proclitics, prefixes, stem or root, suffixes and enclitics. Moreover, it identifies the word’s part of speech and generates the correct derivation pattern of the analyzed word.

Morphological analysis is defined as the process of analysing a word in its orthographic form and generating all possible analyses of the analysed word. The morphological analyser, a program that does the morphological analysis of the word, must generate all possible analyses and identify the morphological features for each morpheme of the analysed word. The morphological features should be encoded using a specified scheme of morphological features tags, which can be used by higher level text analytics applications such as part-of-speech tagging and parsing.
Moreover, morphological analysis involves extracting the root and matching the pattern of the word. Morphological analysers can be used to add the correct vowelization (diacritics) for each letter of the analysed word. Section 2.3 in chapter 2 has more background on morphological analysis for Arabic text. 8.2 Specifications and Standards of Arabic Morphological Analyses A robust and well-designed morphological analyser for Arabic text has to meet agreed design standards for Arabic morphological analyses. Many researchers have investigated the morphology of Arabic, and they built their morphological analysers according to specific application requirements. For instance, stemming involves morphological analyses for Arabic words where the outputs of the stemmers are the roots of the analysed words (Al-Sughaiyer and Al-Kharashi 2004). However, the complex morphology of Arabic requires more detailed analyses. Therefore, the morphological analyser for Arabic text should meet the following requirements (Al-Bawaab 2009; Hamada 2009b; Hamada 2010). 1. It can correctly divide the analysed word into morphemes such as proclitics, prefixes, stem or root, suffixes and enclitics. 2. It can generate the correct pattern of the word and specify whether the generated pattern is a noun pattern, verb pattern or both. 3. It can correctly specify the morphological features for each morpheme. 4. It can extract the correct root of the word whether it is triliteral or quadriliteral. - 194 5. It can deal with unambiguous words (inert or stop words), irregular words, rare words and borrowed words. 6. If an orthographic form is ambiguous, it should generate a set of plausible/possible analyses to be disambiguated at a subsequent processing stage taking context into account. 7. It allows the rules of transitive and intransitive verbs to be specified. 8. It allows the derivation rules of perfect verbs, imperfect verbs and imperative verbs to be specified. 9. 
It can deal with the orthographic features of words such as vowelizing, incorporation, substitution and the writing of hamzah. This helps in correcting spelling mistakes.

The most widely agreed and most recent specification and standard is the ALECSO/KACST initiative on morphological analysers for Arabic text; see section 2.3.4.7. The organization and the institution invited specialized researchers on morphological analysers for Arabic text to present their morphological analysers, to agree on the design and development specifications and standards, and to agree on an evaluation methodology for the different morphological analysers. This section will discuss the ALECSO/KACST initiative. The ALECSO/KACST design specifications and standards will be followed in the design of the SALMA – Tagger.

8.2.1 ALECSO/KACST Initiative on Morphological Analyzers for Arabic Text

This section discusses our experience in developing and evaluating morphological analysers for Arabic text. It analyses an exemplar of how the community should work together to advance the field. The exemplar is the initiative of the Arab League Educational, Cultural and Scientific Organization (ALECSO) and the King Abdul-Aziz City for Science and Technology (KACST) on morphological analysers of Arabic text (see footnote 56). The initiative aims to encourage research on developing open-source morphological analysers for Arabic text which are of high accuracy, easy to use and can be integrated into higher levels of applications for processing Arabic text. The ALECSO/KACST initiative contains recommendations and standards for designing morphological analysers. These recommendations are written as papers appearing in the workshop proceedings (Al-Bawaab 2009; Hamada 2009b; Zaied 2009). It also includes agreed specifications for developing morphological analysers represented by the participants’ papers and presentations.
Moreover, the initiative includes an evaluation methodology and criteria for evaluating the outputs of the morphological analysers. ALECSO/KACST organized a competition between the participants’ analyzers. The AlKhalil morphological analyzer (Boudlal et al. 2010) was announced as the winner of the competition. However, these design specifications and standards, the evaluation methodology and the results of the competition have not been widely publicized. Hamada (2010) reported the evaluation methodology in Arabic only. Another aim of this section is to publicize these important specifications, standards, methodology and the competition to the English-speaking Arabic NLP community.

Footnote 56: ALECSO/KACST initiative on morphological analyzers for Arabic text, http://www.alecso.org.tn/index.php?option=com_content&task=view&id=1234&Itemid=1002&lang=ar

8.2.2 ALECSO/KACST Prerequisites for a Good Morphological Analyser for Arabic Text

The ALECSO/KACST design specifications and standards state some essential prerequisites of robust morphological analysers for Arabic text. These prerequisites involve dealing with clitics, affixes, roots, patterns, non-inflected words, non-conjugated verbs and primitive nouns (Hamada 2009a). This requires the morphological analyser to have comprehensive lists that cover this information. Having these morphological lists stored in advance within the morphological analyser will meet the first five general requirements of the Arabic morphological analyser. These prerequisites, as described by Hamada (2009a), are:

• A list of all prefixes, such as definite article, subject prefix, etc.
• A list of all suffixes, such as feminine nūn, masculine sound plural letters, etc.
• A list of all patterns, such as fa‘ala, fa‘ūl, mafa‘īl, etc.
• A list of all triliteral and quadriliteral roots.
• A list of non-inflected words, non-conjugated verbs and primitive nouns.
Moreover, the lists of prefixes and suffixes need to be classified into noun affixes, verb affixes and affixes which are common between nouns and verbs.

8.2.3 ALECSO/KACST: Design Recommendations

The ALECSO/KACST initiative for morphological analysis for Arabic text has specified general design specifications and standards as recommendations for the developers of morphological analyzers for Arabic text. These cover the inputs of the morphological analyzer, the analysis process, and the outputs of the morphological analyzer. The following subsections discuss these design recommendations as described by Al-Bawaab (2009).

8.2.3.1 ALECSO/KACST: Design Recommendations of Inputs

A well-designed morphological analyzer for Arabic text can accept a single word, a sentence, or a text as input. The morphological analyser should provide analyses for each word of an input sentence or text. Moreover, the morphological analyser should accept input word(s) that are fully vowelized, partially vowelized or non-vowelized. In order to deal with the different word vowelization variations, the morphological analyzer should contain special functions that can generate the non-vowelized form of the input word(s), preserve the vowelization, and deal with the specific orthographic challenges of the Arabic word such as šaddah.

8.2.3.2 ALECSO/KACST: Design Recommendations of Analysis

An Arabic word form may be assigned several analyses due to the absence of vowelization and the treatment of the word out of its context. The number of analyses therefore differs from word to word. Because the morphological analyser analyzes the words out of their context, it should produce all possible analyses of each word form. Arabic words are classified into nouns, verbs and particles. Due to the absence of vowelization, words can share noun or verb properties. Thus wrd can be wardun “roses”, representing a noun, or warada “to come”, representing a verb.
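The ambiguity just illustrated appears as soon as the diacritics are dropped. As a minimal sketch of the kind of helper that section 8.2.3.1 asks for (generating the non-vowelized form of an input word), the following Python function removes the Arabic diacritics in the Unicode range U+064B to U+0652. The function name and the choice of range are assumptions made for illustration; this is not the code of the SALMA – Tagger.

```python
import re

# Arabic diacritics: tanwin (U+064B-U+064D), short vowels (U+064E-U+0650),
# shaddah (U+0651) and sukun (U+0652).
_DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(word: str) -> str:
    """Return the non-vowelized form of an Arabic word (hypothetical helper)."""
    return _DIACRITICS.sub("", word)


# wardun "roses" (noun) and warada "to come" (verb), written with escapes.
wardun = "\u0648\u064E\u0631\u0652\u062F\u064C"
warada = "\u0648\u064E\u0631\u064E\u062F\u064E"

# Both vowelized words reduce to the same non-vowelized form wrd.
print(strip_diacritics(wardun) == strip_diacritics(warada))
```

Keeping the original string alongside the stripped one preserves the input vowelization, which the recommendations also require.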
The word can be a noun or a particle. An example is rb, where rabbun “God” is a noun, while rubba “many” is a particle. The word can be a verb and a particle, as in ‘dā: ‘adā “ran” is a verb, while ‘adā “except” is a particle. The word can also be a noun, a verb and a particle, as in bl: ballun “moistening” is a noun; balla “to moisten, wet, make wet” is a verb; bal “nay, rather …, (and) even, but, however, yet” is a particle. Therefore, the analyser assumes in turn that the analyzed word is a noun, a verb and a particle, and follows the corresponding procedure for each assumption to extract the morphological features specified below.

A) Analyzing verbs

The morphological analyzer must extract the following information assuming the analyzed word is a verb.

1- Verb prefixes: a one-letter or two-letter prefix can be attached to the beginning of the verb. Thus wakataba “and he wrote”, segmented as wa+kataba, has a one-letter prefix wa “and” representing a conjunction particle; and wasayaktubu “and he will write”, segmented as wasa+yaktubu, has a two-letter prefix consisting of wa “and” representing a conjunction particle and sa “will” representing a particle of futurity. The equivalent feature-numbers in the SALMA – Tag Set are 4 and 5.

2- Verb suffixes: these are the subject-suffix pronouns and the object-suffix pronouns. The verb suffix can be one of the suffixed pronouns or a combination of both types of pronouns. For example, the verb qara’tu “I have read” has tu as a subject-suffix pronoun. The verb ‘allamahā “he taught her” has hā “her” as an object-suffix pronoun, and the word zawwağnākahā “we have let you marry her” has nā “we” as a subject-suffix pronoun, ka “you” as a first object-suffix pronoun, and hā “her” as a second object-suffix pronoun. The equivalent feature-number in the SALMA – Tag Set is 5.
3- Verb subcategory: the morphological analyser should specify the subcategory of the analyzed verb. The analyzed verb can be a perfect verb, imperfect verb or imperative verb. The analyzed verb can share properties of two or three verb subcategories, as in ’akrm. Here ’akrama “treated reverentially and hospitably” is a perfect verb; ’ukrimu “I treat reverentially and hospitably” is an imperfect verb; and ’akrim “You! Treat reverentially and hospitably” is an imperative verb. The equivalent feature-number in the SALMA – Tag Set is 3.

4- The pattern of the verb: the morphological analyser extracts the correct pattern of the verb. For example, the verb ’istaqāma “straighten” is an augmented triliteral verb which has the pattern ’istaf‘ala. Some verbs can have more than one pattern. Thus yuqāl has the pattern yaf‘ulu when it means “said”, and the pattern yuf‘il when it means “been sacked”.

5- The root of the verb: the morphological analyzer specifies the correct root for the analyzed verb. For example, yariṯu “he inherits” has the root w-r-ṯ, the imperative verb qul “You! Say” has the root q-w-l, and the imperative verb qi “You! Protect” has the root w-q-y.

6- Verb augmentation: the morphological analyser specifies whether the verb is unaugmented, augmented by one letter, augmented by two letters or augmented by three letters. It also specifies whether the verb has a triliteral root or a quadriliteral root. For instance, the verb ‘allama “he taught” is a triliteral verb augmented by one letter. The verb ’iṭma’anna “he reassured” is a quadriliteral verb augmented by two letters. The equivalent feature-number in the SALMA – Tag Set for verb augmentation is 20, and for number of root letters 21.
7- Person morphological feature: the morphological analyser determines whether the analyzed verb is first person, second person or third person depending on the subject-suffix pronouns and on whether the short vowels appear on the analyzed verb. The verb lāḥaẓtu “I have noticed” is a first person verb. The verb lāḥaẓta “You have noticed” is a second person verb. And the verb lāḥaẓat “She has noticed” is a third person verb. The equivalent feature-number in the SALMA – Tag Set is 10.

8- Voice morphological feature: the morphological analyser determines whether the analyzed verb is active voice or passive voice. For example, the verb yuṣāru “has become” is an imperfect passive verb. The equivalent feature-number in the SALMA – Tag Set is 15.

9- The mood marks: the morphological analyser determines the mood marks of the analyzed verb. The mood marks of the verb can be a short vowel (i.e. fatḥah, ḍammah, sukūn), a letter (i.e. nūn), or omission (i.e. omission of a vowel letter). The equivalent feature-number in the SALMA – Tag Set is 13.

10- Full vowelization: the morphological analyser adds the correct full vowelization to the analyzed verb whatever the original vowelization of the input verb.

B) Analyzing nouns

The morphological analyser should extract the following morphosyntactic information assuming the analyzed word is a noun.

1- Noun prefixes: the noun prefix consists of one to five letters. The prefix letters can be homographic with the original letters of the noun (i.e. the root radicals of the noun). E.g. biṭāqāt can be analyzed as bi+ṭāqāt “with the abilities”, where the first letter, the preposition bi “with”, is a prefix, or as biṭāqāt “cards” without any prefix. The equivalent feature-number in the SALMA – Tag Set is 4.

2- Noun suffixes: genitive suffixed pronouns are the most common suffixes of nouns. The suffix letters can be a suffix on the noun or an underlying letter of the noun. E.g.
the word fkh can be analyzed as fakkuhu “his jaw”, where hu is a suffix, or as fakihun “humorous”, which has the root f-k-h and lacks any suffix. The equivalent feature-number in the SALMA – Tag Set is 5.

3- The pattern of the noun: the morphological analyser specifies the pattern of the analyzed noun. E.g. the pattern of the noun binā’ “building” is fi‘āl, the pattern of the noun sayyid “master” is fay‘il, and the pattern of the word ’akuffun “hands” is ’af‘ulun.

4- The root of the noun: the morphological analyzer extracts the root of the analyzed noun. E.g. ’ism “name” has the root s-m-w, ḥaywān “animal” has the root ḥ-y-y, and mīnā’ “port” has the root w-n-y.

5- Noun sub-category: Arabic language scholars classified Arabic words into three main categories, namely noun, verb and particle. This classification is coarse-grained. More details are needed to distinguish the sub-categories of nouns, verbs and particles. The sub-categories of nouns include: common nouns, proper nouns, relative pronouns, demonstrative pronouns, nouns of time and place, adjectives, adverbs, etc. There is no agreement between part-of-speech tag sets of Arabic text on the sub-categories of nouns. The CATiB tag set groups nominals such as nouns, pronouns, adjectives and adverbs into one tag NOM, and gives proper nouns a specific tag PROP. The PATB Full tag set distinguishes between NOUN (common noun), ADJ (adjective), ADV (adverb) and NOUN_PROP (proper noun). The QAC tag set has four categories to tag nouns. These are nouns (N noun, PN proper noun, IMPN imperative verbal noun), pronouns (PRON personal pronoun, DEM demonstrative pronoun, REL relative pronoun), nominals (ADJ adjective, NUM number) and adverbs (T time adverb, LOC location adverb). (See section 5.3 for more details about part-of-speech tag sets of Arabic text).
The SALMA Tag Set classifies nouns into 34 sub-categories at position 2, which include more descriptions of inflected and non-inflected noun categories. See section 6.2.2 for the details of the part-of-speech subcategories of noun. The ALECSO/KACST design recommendations for morphological analysis for Arabic text distinguish between 18 noun subcategories. Table 8.1 shows the subcategories of nouns with examples.

Table 8.1 The 18 subcategories of nouns with examples

No.  Noun subcategory                      Arabic term (transliterated)   Example
1    Primitive noun                        ’ism ğāmid                     kitāb “book”
2    Active participle                     ’ism al-fā’il                  ḍārib “hitter”
3    Passive participle                    ’ism al-maf’ūl                 maḍrūb “struck”
4    Noun of place                         ’ism al-makān                  maktab “office”
5    Noun of time                          ’ism zamān                     maṭla‘ “start time”
6    Adjective                             aṣ-ṣifah al-mušabbahah         ṭawīl “tall”
7    Instrumental noun                     ’ism al-‘ālah                  minšār “saw”
8    Gerund / Verbal noun                  al-maṣdar al-aṣlī              ḍarb “hitting”
9    Gerund of profession                  al-maṣdar al-ṣinā‘ī            furūsiyyah “horsemanship”
10   Gerund of instance                    maṣdar al-marrah               naẓrah “one look”
11   Gerund of state                       maṣdar al-hay’ah               ğilsah “sitting position”
12   Proper noun                           ’ism al-‘alam                  fāṭimah “Fatima”
13   Gerund/verbal noun with initial mīm   al-maṣdar al-mīmī              maw‘id “date”
14   Elative noun                          ’ism tafḍīl                    ’afḍal “better”
15   Intensive active participle           mubālaḡat ’ism al-fā’il        ğarraḥ “surgeon”
16   Generic noun                          ’ism al-ğins                   hiṣān “horse”
17   Plural generic noun                   ’ism ğins ğam’ī                tuffāḥ “apple”
18   Collective noun                       ’ism ğam’                      qawm “folk”

6- The Morphological Features of Inflectional Morphology: most Arabic nouns are declined nouns. However, some nouns are non-declined because they are generated from certain patterns, or they satisfy certain conditions. For example, the noun madāris “schools” is non-declined because it has the pattern mafā‘il. And the noun ’ibrāhīm “Abraham” is non-declined because it is not an Arabic proper name. The equivalent feature-number in the SALMA – Tag Set is 11.

7- The Morphological Feature of Gender: the morphological analyser specifies the gender of the analyzed noun; for example qamar “moon” is masculine; šams “sun” is feminine; and ṭarīq “road” is of common gender. The equivalent feature-number in the SALMA – Tag Set is 7.

8- The Morphological Feature of Number: the morphological analyser recognizes whether the analyzed noun is singular, dual or plural. For example, the noun ‘aṣawān “two sticks” is dual and its singular is ‘aṣā “one stick”; the noun ’arḍūn “earths” is the plural form of the noun ’arḍ “earth”; and the noun ṣaḥrāwāt “deserts” is the plural of the noun ṣaḥrā’ “desert”. The equivalent feature-number in the SALMA – Tag Set is 8.

9- The Relative and Diminutive Nouns: the morphological analyser specifies the noun sub-categories of relative and diminutive nouns. For example, the noun ẖalawyy “cellular” is a relative noun of ẖalyyah “cell”; and the noun ‘uṣayya “small stick” is a diminutive of ‘aṣā “stick”. The equivalent feature-number in the SALMA – Tag Set is 2.

10- The Case Mark: the morphological analyzer specifies the case of the analyzed noun and the correct case mark. The case mark can be a short vowel (i.e. fatḥah, ḍammah, kasrah, sukūn) or a letter (i.e. ’alif, wāw, yā’). For example, ’abā “father” is an accusative noun which has ’alif as case mark; fallāḥūna “peasants” is a nominative noun which has wāw as case mark because it is a masculine sound plural; ḥaḏāri “beware” is an invariable verb-like noun marked by kasrah. The equivalent feature-number in the SALMA – Tag Set is 13.
11- Vowelization of nouns: the morphological analyser adds the full vowelization to the analyzed noun regardless of the original vowelization of the input noun. For example, some of the vowelized variations of the non-vowelized noun al-mdrst are: al-madrasat “the school”; al-mudarrisat “the female-teacher”; al-mudarrasat “the female-student”, etc.

C) Analyzing Particles

The morphological analyser assumes that the analyzed word is a particle and extracts the following information:

1- The Prefix of the Particle: the particle’s prefix consists of one letter, as in wa’iḏā “and if” where wa is a prefixed conjunction, or two letters, as in falarubbamā “and perhaps” where the two letters fala at the beginning of the particle represent the prefix.

2- The Suffix of the Particle: the suffixes are the genitive suffixed pronouns, as in ‘ankumā “about both of you”.

3- The Inflectional Morphology Mark: particles are always invariable. The result of analyzing particles shows the inflectional morphology mark of particles. For example, ḥayṯu “where (adv.)” has the mark ḍamma; bal “nay, rather …, (and) even, but, however, yet” has the mark sukūn; and sawfa “will” has the mark fatḥah.

8.2.3.3 ALECSO/KACST: Design Recommendations of Outputs

The output should include all possible analyses of the analyzed word, assuming in turn that the analyzed word is a verb, a noun and a particle. The recommended morphosyntactic information, discussed above, represents the core information that is displayed in the outputs of the morphological analyzer. As described by the ALECSO/KACST initiative, figure 8.1 shows examples of the output verb analyses; figure 8.2 shows examples of the output noun analyses; and figure 8.3 shows examples of the output particle analyses.
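One way to picture the recommended output is as a list of analysis records, one per candidate reading. The sketch below is only an illustration: the class name, the field names and the grouping helper are hypothetical, and an ASCII apostrophe stands in for ‘ayn in the transliterated strings. The two records correspond to two of the verb readings of w‘dt listed in figure 8.1.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Analysis:
    """One candidate analysis of a word form (field names are illustrative)."""
    surface: str              # input word form as given
    vowelized: str            # fully vowelized reading
    morphemes: List[str]      # segmentation into prefixes, stem and suffixes
    category: str             # "verb", "noun" or "particle"
    root: str = ""
    pattern: str = ""
    features: Dict[str, str] = field(default_factory=dict)


def group_by_category(analyses: List[Analysis]) -> Dict[str, List[Analysis]]:
    """Group all candidate analyses by main part of speech."""
    grouped: Dict[str, List[Analysis]] = defaultdict(list)
    for analysis in analyses:
        grouped[analysis.category].append(analysis)
    return dict(grouped)


# Two verb readings of w'dt: "I promised" and "she promised".
candidates = [
    Analysis("w'dt", "wa'adtu", ["wa'ad", "tu"], "verb", root="w-'-d",
             pattern="fa'ala",
             features={"subcategory": "perfect", "voice": "active"}),
    Analysis("w'dt", "wa'adat", ["wa'ada", "t"], "verb", root="w-'-d",
             pattern="fa'ala",
             features={"subcategory": "perfect", "voice": "active"}),
]
print(sorted(group_by_category(candidates)))
```

Returning every candidate, rather than a single best guess, matches the requirement that disambiguation is left to a later, context-aware stage.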
w‘dt = wa‘adtu = wa‘ad+tu “I promised”
    Perfect verb with active voice
    Unaugmented, has the pattern fa‘ala yaf‘ul and has the root (w-‘-d)
    Invariable verb, has sukūn as inflectional morphology mark
    First person verb which has a singular subject of common gender
    The suffix is the subject suffixed pronoun tā’
w‘dt = wa‘adta = wa‘ad+ta “You (masc.) promised”
w‘dt = wa‘adti = wa‘ad+ti “You (fem.) promised”
w‘dt = wa‘adat = wa‘ada+t “She promised”
w‘dt = wu‘idtu = wu‘id+tu “I have been promised”
w‘dt = wa‘udtu = wa+‘ud+tu “And I have returned”
w‘dt = wa‘addat = wa+‘adda+t “And she counted”

Figure 8.1 Examples of the output verb analyses

wmfṣlk = wamafṣiluka = wa+mafṣilu+ka “And your joint”
    Prefix wa “And”
    mafṣilu is a masculine noun, has the pattern (maf‘il) and the root (f-ṣ-l)
    Is in nominative case and has the ḍamma case mark
    Is connected to the genitive suffixed pronoun kāf
wmfṣlk = wamafṣiluki = wa+mafṣilu+ki “And your (fem.) joint”
wmfṣlk = wamifṣiluka = wa+mifṣilu+ka “And your (masc.) tongue”
wmfṣlk = wamufṣiluka = wa+mufṣilu+ka “And your (masc.) separator”
wmfṣlk = wamufṣṣiluka = wa+mufṣṣilu+ka “And your interpreter”

Figure 8.2 Examples of the output noun analyses

fmnkm = faminkum = fa+min+kum “and among you”
    The prefix is fa “and”
    min “among” is a preposition, an invariable particle, and sukūn is its inflectional morphology mark
    It is connected to the genitive suffix pronoun kum “you”

Figure 8.3 Examples of the output particle analyses

8.2.4 Discussion of ALECSO/KACST Recommendations

The ALECSO/KACST recommendations for designing an Arabic morphological analyzer are morphological descriptions of the analyzed words. These linguistic descriptions involve variant analyses of the analyzed word: the word is assumed in turn to be a noun, a verb and a particle, and is analyzed according to each assumption. The descriptions clarify the tokenization of the analyzed word into morphemes, where the prefix letters or suffix letters can be homographic with the original letters of the analyzed word. Therefore, different analyses can be produced by tokenizing the word into different morphemes. The recommendations provide information about the morphological features of the analyzed words. They provide 11 morphological features for nouns and 10 morphological features for verbs. They also provide information about the root, pattern, prefixes, suffixes and vowelization of the analyzed words.

On the other hand, the ALECSO/KACST recommendations do not describe how to encode the morphological features of the analyzed words in a machine-readable way. The recommendations are not specific to a morphosyntactic tag set, and they do not provide intermediate coding to enable mapping of different morphosyntactic tagging schemes.
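To make the idea of mapping between tagging schemes concrete, the sketch below maps the four noun-related PATB tags named in section 8.2.3.2 onto the CATiB tags as they are described there. It is an illustration under those assumptions only: the dictionary covers just the tags mentioned in this chapter, and the function names are hypothetical.

```python
from typing import Dict, List

# Noun-related PATB tags named in section 8.2.3.2, mapped to the CATiB tags
# described there. Only these four tags are covered; this is not a full mapping.
PATB_TO_CATIB: Dict[str, str] = {
    "NOUN": "NOM",
    "ADJ": "NOM",
    "ADV": "NOM",
    "NOUN_PROP": "PROP",
}


def to_catib(patb_tag: str) -> str:
    """Map a PATB tag to its coarser CATiB tag."""
    if patb_tag not in PATB_TO_CATIB:
        raise ValueError(f"no mapping defined for PATB tag {patb_tag!r}")
    return PATB_TO_CATIB[patb_tag]


def merged_tags(mapping: Dict[str, str]) -> Dict[str, List[str]]:
    """List, for each target tag, the source tags that collapse into it."""
    merged: Dict[str, List[str]] = {}
    for source, target in mapping.items():
        merged.setdefault(target, []).append(source)
    return {target: sorted(sources) for target, sources in merged.items()}


print(to_catib("ADJ"))
print(merged_tags(PATB_TO_CATIB))
```

Because several source tags collapse into NOM, the mapping cannot be inverted, which is one reason a fine-grained intermediate scheme is useful when comparing tag sets.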
The classification by linguists of the morphological features of nouns and verbs, and of other information such as root, pattern and affixes, does not prioritise these features in a way that would allow the order of presentation to be exploited as procedural steps in the development of the morphological analyzer.

8.3 The SALMA – Tagger Algorithm

The SALMA – Tagger algorithm involves several processing steps for Arabic text. These steps, described below, are executed sequentially, where each step depends on the previous one. Intermediate results can be obtained from each processing step. Figure 8.4 shows the steps and module components of the SALMA – Tagger.

The SALMA – Tagger was developed according to long-established Arabic grammar knowledge extracted from traditional Arabic grammar books. It also has the SALMA – ABCLexicon as a main component for extracting the root of the word, and for finding the different vowelization variations of the analyzed words. The SALMA – Tagger depends on the SALMA – Tag Set as a design standard. The SALMA design standard for morphological analysis of Arabic includes the ALECSO/KACST design recommendations and standards. However, the SALMA standards for designing fine-grained morphological analysis for Arabic text are more detailed, and adhere to standards of global computational linguistic knowledge and traditional Arabic grammar. The SALMA standards are not tied to a specific application, as user needs are not known yet. The standards are designed to be general purpose: they can be integrated into different levels of applications, and different tag sets can be mapped to this standard to allow reusability and comparability between these different morphosyntactic annotation schemes.

Following the convention of the ALECSO/KACST recommendations, inputs, analysis process and outputs are described in this section.
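The sequential organisation described above, where each step consumes the previous step's output and intermediate results remain available, can be sketched generically as follows. The step functions here are trivial placeholders invented for illustration; they are not the SALMA modules, and the names `run_pipeline`, `tokenize` and `segment` are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_pipeline(steps: List[Step], data: Any) -> Dict[str, Any]:
    """Run the steps in order and keep every intermediate result by name."""
    results: Dict[str, Any] = {}
    for name, step in steps:
        data = step(data)      # each step consumes the previous step's output
        results[name] = data
    return results


# Placeholder steps: real modules would do far more than this.
def tokenize(text: str) -> List[str]:
    return text.split()


def segment(tokens: List[str]) -> List[Dict[str, str]]:
    return [{"token": token, "stem": token} for token in tokens]


steps: List[Step] = [("tokenizer", tokenize), ("segmenter", segment)]
results = run_pipeline(steps, "ktb wrd")
print(results["tokenizer"])
```

Keeping the results keyed by step name is what lets a user stop after any module and reuse its output on its own.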
The morphological analyzer accepts a single Arabic word, a sentence or an Arabic text document, whether vowelized, partially vowelized or non-vowelized, as input to the system. The SALMA – Tagger is a morphological analyser that consists of five components. Each component can be a stand-alone text analytics application that performs a specific task, and the components work together to process the input text and provide all morphological information of each analysis of the analyzed words. Sections 8.3.1 to 8.3.5 will discuss the component modules of the SALMA – Tagger.

The outputs of the morphological analyser are the full analyses of the words from the analyzed text. Full analysis means all possible analyses of the word, such as all possible roots, clitics, affixes, stems, lemmas, patterns, different forms of vowelization, and the morphological features of each analysis represented by a morphological tag using the SALMA – Tag Set. The subsections of section 8.3 will discuss the outputs of each of the tagger’s components. Section 8.6 discusses the output formats of the SALMA Tagger.

Input: single word or document; vowelized, partially vowelized or non-vowelized
1. SALMA Tokenizer: tokenization; spelling errors detecting and correcting; clitics, affixes and stems
2. SALMA Lemmatizer & Stemmer: root extraction; lemmatizing
3. SALMA Pattern Generator: pattern matching algorithm 1; pattern matching algorithm 2
4. SALMA Vowelizer: vowelization
5. SALMA Tagger: morphological features tag assignment; colour coding words’ morphemes
Resources shown alongside the modules: clitics & affixes lists, SALMA ABCLexicon, function words list, patterns dictionary, proper nouns list, broken plurals list, SALMA Tag Set
Outputs: morphologically analyzed text (word morphemes, root, pattern, SALMA – Tag, vowelization and colour coded output)

Figure 8.4 The SALMA Tagger algorithm

8.3.1 Module 1: SALMA – Tokenizer

The first module of the SALMA – Tagger is the SALMA – Tokenizer.
The main task of this module is to split the input running text into tokens. Then, the tokens are decomposed into morphemes (Attia 2007; Attia 2008). The SALMA – Tokenizer has three main parts. Each part is important for analyzing Arabic text. The Tokenization part deals with the input text files, determines what is considered an Arabic word, and stores the Arabic word in a unified format that enables the other components to deal with the word whether it is fully vowelized, partially vowelized or non-vowelized. The Spelling Errors Detection and Correction part checks the spelling of the tokenized words and corrects the spelling if the word letters do not match certain patterns. The Word Segmentation part is responsible for generating all possible variant morpheme tokenizations of the analyzed word. This part mainly depends on matching the affixes and clitics of the analyzed word against comprehensive lists of affixes and clitics. The following sections discuss these parts in detail.

8.3.1.1 Step 1, Tokenization

In this section, Buckwalter’s transliteration scheme is used in the examples because it provides a one-to-one mapping between Arabic letters and diacritics and their equivalents in Roman letters.

The tokenizer program uses the NLTK regular expression tokenizer to tokenize the input text into Arabic words, punctuation marks, currency tokens, numbers, words written in Latin letters, and HTML/XML tags. The regular expression tokenizer uses regular expression patterns that suit Arabic text. Then the tokenizer processes the extracted Arabic words by resolving the doubled letters (al-ḥurūf al-muḍa‘‘afah) and the extensions (al-madd). The doubled letter marked by šadda is replaced by two letters similar to the original letter; the first is silent, marked by sukūn, and the second is vowelized by the same short vowel as appears on the original letter. For example the word
waṣṣā (waS~aY) has the doubled letter ṣ (S), and after processing it will be in the form waṣṣā (waSoSaY) “He enjoined”. The extension al-madd (Buckwalter |) is replaced by hamzah and ’alif, as in the word ’āmanū (|manuwA) “They believed”, which will be in the form ’āmanū ('AmanuwA).

Only one short vowel can be associated with any letter of the word. Based on this fact, a unified data structure to store Arabic words was designed. This data structure consists of a list of tuples of size two, where each tuple stores the letter in the first position and the short vowel (if it is present) in the second position, and so on for all letters and short vowels of the word. The data structure is represented as [(C,V), (C,V),…,(C,V)], where C represents a consonant and V represents a short vowel. Figure 8.5 shows the data structure storing the words waSoSaY and 'AmanuwA. This data structure is also used to match the word and the patterns.

Position    0       1       2       3       4       5
waSoSaY     (w, a)  (S, o)  (S, a)  (Y, -)
'AmanuwA    (', -)  (A, -)  (m, a)  (n, u)  (w, -)  (A, -)

Figure 8.5 The word data structure

Figure 8.6 shows a tokenized sentence of chapter 29 of the Qur’an. It shows the original fully vowelized word. Then the tokenizer module produces three variations of the analyzed word: the non-vowelized word, the processed word extracted from the unified word data structure, and the processed non-vowelized word.

Word          Gloss                 Non-vowelized   Processed vowelized   Processed non-vowelized
’am           Or                    >m              >amo                  >m
ḥasiba        Think                 Hsb             Hasiba                Hsb
al-lḏīna      those who             Al*yn           Alola*iyna            All*yn
ya‘malūna     do                    yEmlwn          yaEomaluwna           yEmlwn
as-sayyi’āt   evil deeds            Alsy}At         Alsayoyi}aAti         Alsyy}At
’an           that                  >n              >an                   >n
yasbiqūnā     they can outrun us    ysbqwnA         yasobiquwnaA          ysbqwnA
sā’a          Evil is               sA'             saA'                  sA'
mā            what                  mA              maA                   mA
yaḥkumūn      they judge            yHkmwn          yaHkumuwna            yHkmwn

Figure 8.6 A sample output of the tokenization module component after processing the Qur’an, chapter 29

8.3.1.2 Step 2, Spelling Errors Detection and Correction

A large number of potential spelling errors are to be expected because Arabic text is generated with a variety of word processing tools that follow different spelling conventions. Most word processing tools that support Arabic are not aware of which letter and diacritic combinations can appear on a letter in a given position of the word. Therefore, it is the responsibility of the person editing the text to check each word’s spelling while writing a document or authoring a web page. The absence of such a checking module in the word processing tools that support Arabic increases the potential for mis-spelling Arabic words. Such spelling errors include adding more than one short vowel to the same letter; starting the word with taṭwīl, a special character that is used to extend the Arabic word; and adding a diacritic to taṭwīl. Another type of constraint that word processing tools should deal with is whether a certain diacritic can appear on a letter in a given position in the word. This constraint has many rules; for example, a word cannot start with a ‘silent’ letter (i.e. sukūn cannot appear on the first letter of the word). A similar rule applies to tanwīn, which appears only on the last letter of the word.

The algorithm divides the Arabic word into three parts: the front part, consisting of the first letter and any diacritics appearing on it; the middle part, consisting of the letters from the second letter to the letter before the last, and their diacritics; and the rear part, which consists of the last letter and its diacritics. Each part has its own valid letter-diacritic combinations.
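The positional rules named in this section (no more than one short vowel per letter, no taṭwīl at the start of the word, no diacritic on taṭwīl, no sukūn on the first letter, tanwīn only on the last letter) can be checked with a short routine. The following is a minimal sketch that assumes Unicode input; the function names and error messages are hypothetical, and it covers only the rules listed here, not the full template matching of the SALMA – Tokenizer.

```python
from typing import List, Tuple

SHORT_VOWELS = {"\u064E", "\u064F", "\u0650"}   # fathah, dammah, kasrah
TANWIN = {"\u064B", "\u064C", "\u064D"}
SHADDAH = "\u0651"
SUKUN = "\u0652"
TATWIL = "\u0640"
DIACRITICS = SHORT_VOWELS | TANWIN | {SHADDAH, SUKUN}

Unit = Tuple[str, Tuple[str, ...]]


def to_units(word: str) -> List[Unit]:
    """Group every letter with the diacritics written on it."""
    units: List[Tuple[str, List[str]]] = []
    for char in word:
        if char in DIACRITICS:
            if not units:
                raise ValueError("word starts with a diacritic")
            units[-1][1].append(char)
        else:
            units.append((char, []))
    return [(letter, tuple(marks)) for letter, marks in units]


def spelling_errors(word: str) -> List[Tuple[int, str]]:
    """Return (letter index, message) pairs for the rules named in the text."""
    units = to_units(word)
    last = len(units) - 1
    errors: List[Tuple[int, str]] = []
    for index, (letter, marks) in enumerate(units):
        if sum(mark in SHORT_VOWELS for mark in marks) > 1:
            errors.append((index, "more than one short vowel"))
        if letter == TATWIL and index == 0:
            errors.append((index, "word starts with tatwil"))
        if letter == TATWIL and marks:
            errors.append((index, "diacritic on tatwil"))
        if index == 0 and SUKUN in marks:           # front part
            errors.append((index, "sukun on first letter"))
        if index != last and any(mark in TANWIN for mark in marks):
            errors.append((index, "tanwin before last letter"))
    return errors


# sayyaratun "car": every letter-diacritic combination is valid.
sayyaratun = "\u0633\u064E\u064A\u0651\u064E\u0627\u0631\u064E\u0629\u064C"
print(spelling_errors(sayyaratun))
```

The index comparisons against 0 and `last` play the role of the front and rear parts; every other index belongs to the middle part.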
The front part is checked against the following 3 valid letter-diacritic combinations: [(letter + šaddah + a short vowel [57]), (letter + a short vowel), (letter)]. Each letter-diacritic combination from the middle part is checked against the following 5 valid letter-diacritic combinations: [(letter + šaddah + a short vowel), (letter + a short vowel), (letter + sukūn), (letter), (taṭwīl)]. The rear part is checked against the following 6 valid letter-diacritic combinations: [(letter + šaddah + a short vowel), (letter + šaddah + tanwīn), (letter + a short vowel), (letter + sukūn), (letter + tanwīn), (letter)]. Figure 8.7 shows an example of applying the letter-vowelization templates to the analyzed word.

Figure 8.7 Example of applying letter-vowelization templates to the word sayyāratun “car”. The matching templates are highlighted in bold.

8.3.1.3 Step 3, Word Segmentation (Clitics, Affixes and Stems)

For each tokenized Arabic word, a special module divides the word into three parts: proclitics and prefixes; stem/root; and suffixes and enclitics. The first part is matched against a list of proclitics and prefixes consisting of 220 entries, and the third part is matched against a list of suffixes and enclitics consisting of 474 entries. Only the segmentations whose first and third parts both match these lists are taken as candidate analyses.

8.3.1.4 Which Segmentation to Use?

Several morphological systems exist for Arabic text.
These systems apply tokenization to the input text because tokenization is an essential prerequisite. However, these systems do not describe the tokenization decisions. Only Attia (2007, 2008) described the tokenization of Arabic as a challenge which needs more investigation.

57 Short vowels are fatḥah ( ◌َ ), ḍammah ( ◌ُ ) and kasrah ( ◌ِ ).

The SALMA Standard decomposes the token (word) into five parts: proclitics; prefixes; stem; suffixes; and enclitics. Each part other than the stem may consist of one or more clitics or affixes; a word has exactly one stem. This fine-grained decomposition is required by the SALMA – Tag Set. A SALMA – Tag is then assigned to each morpheme.

The distinction between affixes and clitics can be confusing. Clitics and affixes are defined as follows:

“…affixes carry morpho-syntactic features (such as tense, person, gender or number), while clitics serve syntactic functions (such as negation, definition, conjunction or preposition) that would otherwise be served by an independent lexical item.” (Attia, 2008 p. 59)

This definition distinguishes between the morphosyntactic features of affixes and the syntactic functions of clitics. The SALMA standard bases its definition of clitics and affixes on the patterns of the words, while preserving the morphosyntactic features of affixes and the syntactic functions of clitics as defined by Attia (2008). Affixes are the morphemes shared between the word and its pattern, and clitics are the word’s morphemes that do not match morphemes of the pattern. Therefore, suffixed pronouns are classified as suffixes if they are subject pronouns, and as enclitics if they are object-suffix pronouns or genitive-suffix pronouns. This classification is based on patterns, where subject-suffix pronouns are part of the pattern. Subject-suffix pronouns carry morphosyntactic features (i.e.
gender, number and person) of the verb, while object-suffix pronouns and genitive-suffix pronouns serve syntactic functions (e.g. object of the verb) that can be expressed by an independent lexical item. Figure 8.8 shows an example of tokenization of some words.

Word        Tokenization          Gloss
frmt        farmata               “he formatted”
            faram+ti              “you (2SF) chopped”
            fa+ram+t              “you (2SF) threw”
ḥsb         ḥasaba                “he computed”
tsrbl       ta+sarbala            “he dressed”
wirāṯa      wirāṯa                “inheritance”
zwğnākhā    zawwağ+nā+ka+hā       “we allowed you to marry her”
whm         wahm                  “delusive imagination”
            wa+hum                “and they”
’ms         ’ams                  “yesterday”
            ’a+massa              “did he touch?”
ysr         yasir                 “ease, prosperity”
            ya+sirru              “he tells a secret”

Figure 8.8 Example of tokenization of some words

8.3.1.5 Constructing the Clitics and Affixes Dictionaries

Using traditional Arabic language grammar books (Dahdah 1987; Dahdah 1993; Wright 1996; Al-Ghalayyni 2005; Ryding 2005), lists were constructed of proclitics (e.g. conjunctions, prepositions, vocative particles, interrogative particles, particle of futurity, definite article [58]), prefixes (e.g. imperfect prefix, imperative prefix), suffixes (e.g. relative yā’, emphatic nūn, nūn of protection, dual letters, masculine sound plural letters, feminine sound plural letters), and enclitics (e.g. suffixed pronouns, tā' marbūṭah, tā' of feminization, tanwῑn). These lists were provided to a generating program which generates all the possible combinations of proclitics with prefixes, and of suffixes with enclitics. Because every possible combination was produced, the generated lists were extremely large. They were therefore checked by analyzing words in four corpora: the Qur’an text corpus, the Corpus of Contemporary Arabic, the Penn Arabic Treebank, and the Corpus of Traditional Arabic Dictionaries.
Then, two lists were constructed: first, a list of proclitics and prefixes containing 220 entries, and second, a list of suffixes and enclitics containing 474 entries. By comparison, Khoja’s stemmer contains 11 prefixes and 28 suffixes (Khoja 2003). BAMA has a prefixes file containing 299 prefixes and a suffixes file containing 618 suffixes, and provides a morphological compatibility table containing 598 prefix-suffix combinations (Maamouri and Bies 2004; Maamouri et al. 2004). The Alkhalil morphological analyzer has 65 prefixes and 65 suffixes, stored in separate XML files (Boudlal et al. 2010).

The clitics and affixes dictionaries add more morphosyntactic features to each entry. An entry can be compound (i.e. it consists of one or more clitics or affixes representing distinct morphemes). Instead of one tag for the whole clitic and affix entry, multiple tags were added: each part (morpheme) is assigned a SALMA – Tag in which the morphological features of that part are encoded. The nature of each part is distinguished: proclitic (proc), prefix (pref), suffix (suf) or enclitic (enc). Whether the part belongs to a pattern or not is also recorded. This information is useful for tokenization and pattern matching. Prefix-stem-suffix agreement is captured by adding the main part-of-speech information for each part: n indicates that the part can be used with a noun stem and with other noun clitic and affix parts; v indicates a verb part; and x indicates that the part can be used with either nouns or verbs.

58 The definite article al- is classified as a proclitic because it does not appear in the patterns and it is not part of the underlying letters of the word. The definite article al- also differs from other proclitics such as prepositions and conjunctions because al- cannot appear as a stand-alone morpheme.

Figures 8.9 and 8.10 show samples of these lists with the morphosyntactic information added to each entry in the list.
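To make the role of the two lists and of the n/v/x marks concrete, the following is a minimal sketch of a three-way segmentation with an agreement check. It is an illustration under stated assumptions, not the SALMA – Tokenizer itself: the words are in non-vowelized Buckwalter transliteration, the dictionaries hold only a handful of entries (the `wA` suffix and its attributes are hypothetical), and the names `Morpheme`, `overall_pos`, `compatible` and `segment` are invented for the example.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    form: str
    kind: str         # "proc", "pref", "suf" or "enc"
    stem_pos: str     # "n" (noun), "v" (verb) or "x" (either)
    in_pattern: bool  # True if the morpheme is part of the word's pattern


# Tiny illustrative dictionaries; the real lists have 220 and 474 entries.
PROCLITICS_PREFIXES = {
    "": [],
    "f": [Morpheme("f", "proc", "x", False)],
    "fAst": [Morpheme("f", "proc", "x", False),
             Morpheme("Ast", "pref", "v", True)],
    "kAlmt": [Morpheme("k", "proc", "n", False),
              Morpheme("Al", "proc", "n", False),
              Morpheme("mt", "pref", "n", True)],
}
SUFFIXES_ENCLITICS = {
    "": [],
    "hm": [Morpheme("hm", "enc", "x", False)],
    "wA": [Morpheme("wA", "suf", "v", True)],  # hypothetical entry
}


def overall_pos(morphemes):
    """Combine the marks of an entry: 'n', 'v', 'x', or None on a clash."""
    marks = {m.stem_pos for m in morphemes} - {"x"}
    if len(marks) > 1:
        return None
    return marks.pop() if marks else "x"


def compatible(pos_a, pos_b):
    """Two sides agree if neither clashes and they do not mix n with v."""
    if pos_a is None or pos_b is None:
        return False
    return pos_a == "x" or pos_b == "x" or pos_a == pos_b


def segment(word):
    """Return all (first, stem, third) splits accepted by both lists."""
    results = []
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):  # stem is never empty
            first, stem, third = word[:i], word[i:j], word[j:]
            if first in PROCLITICS_PREFIXES and third in SUFFIXES_ENCLITICS:
                if compatible(overall_pos(PROCLITICS_PREFIXES[first]),
                              overall_pos(SUFFIXES_ENCLITICS[third])):
                    results.append((first, stem, third))
    return results
```

In this sketch `segment("fAstbqwA")` keeps the split `fAst + bq + wA`, whereas a noun-only first part such as `kAlmt` is never paired with the verb-only suffix `wA`.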
Entry    Example      Morpheme   Description              SALMA – Tag              Morpheme type   Stem POS   Part of pattern
mn       mnqlibp      mn         Prefix                   r---p-----------------   pref            n          y
fAst     fAstbqwA     f          Conjunction              p--c------------------   proc            x          n
                      Ast        Prefix                   r---p-----------------   pref            v          y
kAlmt    kAlmtEjb     k          Simile particle          p--l------------------   proc            n          n
                      Al         Definite article         r---d-----------------   proc            n          n
                      mt         Prefix                   r---p-----------------   pref            n          y
>fbAl    >fbAlbATl    >          Interrogative particle   p--i-----s------------   proc            x          n
                      f          Conjunction              p--c------------------   proc            x          n
                      b          Preposition              p--p------------------   proc            n          n
                      Al         Definite article         r---d-----------------   proc            n          n

Figure 8.9 Sample of the proclitics and prefixes with their morphological tags, attributes and descriptions

[Figure 8.10: sample of the suffixes and enclitics list, with the same columns as Figure 8.9. Recoverable entries: hm (example ktAbhm), ny (example Eallamany), tmAnAhA (example >ETytmAnAhA) and Anytk (example insAnytk). The morpheme descriptions include suffixed pronouns, nūn of protection, suffix and relative yā’.]

Words grouped to one root:
• ğāmi‘ “mosque”
• al-ğam‘u “addition”
• ğama‘a “collected”
• at-tağmῑ‘ “collection”
• ’ğtimā‘un “meeting”
• ’iğmā‘un “agreement”
• ğama‘iyyah “association”
• ğama‘iyya “association”
• tağma‘u “you are collecting”
• muğmma‘un “a complex”
• muğmū‘un “a summation”

Words grouped to the lemma ğāmi‘yyun:
• ğāmi‘yyun “university degree holder (masc.)”
• ğāmi‘yyūn “university degree holders”
• ğāmi‘yyah “university degree holder (fem.)”
• ğāmi‘yyāt “university degree holders”

Figure 8.12 Example set of words grouped to root and lemma

8.3.2.1 The Use of the SALMA ABCLexicon

The SALMA – ABCLexicon, as discussed in chapter 4, is a broad-coverage lexical resource which provides prior knowledge to support the development and improve the accuracy of morphological analysis. It was constructed by extracting information from disparate formats and merging 23 traditional Arabic lexicons, following agreed criteria for constructing morphological lexical resources from raw text. The SALMA – ABCLexicon contains 2,774,866 word-root pairs representing 509,506 different vowelized words and 261,125 different non-vowelized words.

The SALMA – ABCLexicon is stored in three alternative formats: XML files, a relational database, and tab-separated column files. The lexicon is provided with a search facility that looks up a lexical entry and returns a LexiconEntry object encapsulating the word and its root. A specialized interface enables the morphological analyzer to communicate with the lexicon file. The dictionary data structure of the lexicon has this format:

Lexicon = [nv_word:[LexiconEntry,...],...]

The Lexicon class interface represents the actual lexicon data and the communication facility between the lexicon and the morphological analyzer. It has procedures that check whether a given non-vowelized Arabic word is found in the lexicon and that return a list of LexiconEntry objects for the non-vowelized words found.
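The mapping from a non-vowelized word to its LexiconEntry objects can be sketched as follows. This is a minimal illustration, not the thesis's Lexicon class: only the names `Lexicon` and `LexiconEntry` come from the text, while the methods, the `strip_diacritics` helper, the use of Buckwalter transliteration and the three sample word-root pairs are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Buckwalter symbols for short vowels, sukun, tanwin and shaddah.
BW_DIACRITICS = set("aiuoFNK~")


def strip_diacritics(bw_word):
    """Return the non-vowelized form of a Buckwalter-transliterated word."""
    return "".join(ch for ch in bw_word if ch not in BW_DIACRITICS)


@dataclass(frozen=True)
class LexiconEntry:
    word: str  # vowelized word
    root: str


class Lexicon:
    """Index of word-root pairs keyed by the non-vowelized word."""

    def __init__(self, word_root_pairs):
        self._entries = defaultdict(list)
        for word, root in word_root_pairs:
            key = strip_diacritics(word)
            self._entries[key].append(LexiconEntry(word, root))

    def __contains__(self, nv_word):
        return nv_word in self._entries

    def lookup(self, nv_word):
        """Return all entries stored under a non-vowelized word."""
        return list(self._entries.get(nv_word, []))


# Illustrative pairs only; the real lexicon holds 2,774,866 pairs.
sample_lexicon = Lexicon([
    ("kataba", "ktb"),
    ("kutubN", "ktb"),
    ("maktabN", "ktb"),
])
```

Keying the index by the non-vowelized form means that one lookup of an undiacritized stem returns every vowelized word-root pair that could underlie it, which is what the analyzer needs when the input text carries few or no diacritics.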
Section 4.4.5 discussed the lexicon data structure and how the lexicon is searched to retrieve the lexicon objects.

8.3.2.2 Step 1, Root extraction

The system mainly depends on the SALMA – ABCLexicon to extract the root of the analyzed word. The SALMA – ABCLexicon contains 12 different biliteral roots, 8,585 different triliteral roots, 4,038 different quadriliteral roots, 63 different quinqueliteral roots, and 31 different sextiliteral roots.

The input to this step is the set of candidate analyses produced by the word segmentation step of the previous module, the SALMA – Tokenizer, that is, the segmentations whose first part matches the proclitics and prefixes list and whose third part matches the suffixes and enclitics list. For each candidate analysis, the second part of the segmented word (the stem/root) is searched for in the SALMA – ABCLexicon. If the non-vowelized stem/root is found in the lexicon, then all LexiconEntry objects representing its vowelized word-root pairs are retrieved and attached to that analysis, which is accepted as a candidate analysis. The common (i.e. most frequent) root for each analysis is specified, as is the common root across all of the word’s analyses. Figure 8.13 shows examples of extracting the root of the different segmentation candidate analyses, together with the common root of the word and the common root of each analysis.

[Figure 8.13: candidate segmentations of the word yaEomaluwna into first part, second part and third part, with second parts yEmlwn, yEml, Emlwn and Eml. The root E-m-l is returned for the segmentations found in the lexicon, one segmentation is marked “Root is not found”, and the common root of the word is E-m-l.]

Figure 8.13 Example of root extraction module

8.3.2.3 Step 2, Function Words

Function words are words with little semantic content.
They serve as important clues to the structure of sentences. They define the grammatical relationships with other words within a sentence, and they signal the structural relationships that words have to one another [60]. Function words include pronouns, prepositions, determiners, conjunctions, and auxiliary and modal verbs (Baker et al. 2006). A function word has a special morphological analysis wherever it appears in the text. The percentage of function words in any typical Arabic text is around 40%. The system contains a list of 523 function words collected from a traditional Arabic grammar book (Diwan 2004). The morphological analyzer searches for the word in the function words list and, if it is found, adds the morphological analysis associated with it to the set of analyses generated by the morphological analyzer. Then the analyzer processes the next word. Figure 8.14 shows a sample of function words.

>nA “me”            Al*y “who”          Hwl “about”
nHn “we”            ElY “on”            fy “in”
hy “she”            End “next to”       h&lA’ “they”
*lk “that”          En “about”          bDE “few”
bmA “although”      blY “yes”           byn “between”
mE “with”

Figure 8.14 Sample of the function words list

8.3.2.4 Step 3, Lemmatizing

In this step, the second part of each analysis, which represents the stem or root, is searched for in three other linguistic lists: a list of function words; a named entities list (Benajiba et al. 2008); and a list of broken plurals [61]. If the stem/root of any analysis matches one of these lists, then a new analysis entry along with its morphological analysis is added to the candidate analyses of the word. The function word list, as discussed in the previous section, consists of 523 function words. The named entity list is the ANERGazet (Benajiba et al.
2008), which consists of three gazetteers: a Locations gazetteer containing names of continents, countries, cities, etc.; a People gazetteer containing names of people collected manually from different Arabic websites; and an Organizations gazetteer containing names of organizations such as companies and football teams. The Locations gazetteer contains 1,543 names; the People gazetteer contains 2,099 names; and the Organizations gazetteer contains 316 names. Figure 8.15 shows examples of the three gazetteers.

60 Wikipedia: Function words http://en.wikipedia.org/wiki/Function_words
61 Khaled Elghamry (2007) Broken Plural List http://sites.google.com/site/elghamryk/arabiclanguageresources

Locations gazetteer
  ’iṯyūbiyā                                      Ethiopia
  al-qāhirah                                     Cairo
  ’abū hammād                                    Abu Hammad
  ’uksfurd                                       Oxford
  ğomhūryyat al-konḡū ad-dῑmoqrātiyyah           Democratic Republic of the Congo
People gazetteer
  ’ibrāhῑm                                       Abraham
  zahrah                                         Zahra
  ‘abdullāh                                      Abdullah
  ḡrāhām                                         Graham
Organizations gazetteer
  ’aẖbār al-ẖalῑğ                                Gulf News
  riyāl madrῑd                                   Real Madrid F.C
  wikalat ’anbā’ al-batrā’                       Petra News Agency

Figure 8.15 Examples of the three named entities gazetteers

The third list used is the broken plural list. It is compiled from the broken plural lists of Elghamry (2007), which were automatically extracted from three Arabic dictionaries: al-mutqan “The professional”, al-wasῑṭ “The median”, and al-ḡanῑ “The rich”. As a singular form is hard to guess from the broken plural form of the word, the lemmatizer is provided with a list of 11,367 Arabic broken plurals. Each broken plural entry in the list is provided with the root and the singular form of the broken plural, which represents the lemma. Figure 8.16 shows examples from the broken plural list.
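The list lookups of this step can be sketched as below. This is an illustrative assumption about how such a lookup might be organised, not the SALMA – Lemmatizer code: the stems are written in non-vowelized Buckwalter transliteration, each list holds only a few sample items, the roots given for the two broken plurals are supplied for the example, and the function name `list_analyses` and the dictionary layout of an analysis are hypothetical.

```python
# Sample items only; the real lists hold 523 function words,
# three ANERGazet gazetteers, and 11,367 broken plurals.
FUNCTION_WORDS = {"fy", "ElY", "mE", "byn"}

GAZETTEERS = {
    "locations": {"AlqAhrp", ">ksfwrd"},
    "people": {"zhrp"},
    "organizations": set(),
}

# broken plural -> (root, singular form used as the lemma)
BROKEN_PLURALS = {
    ">bwAq": ("bwq", "bwq"),   # "horns" -> "horn"
    "nsx": ("nsx", "nsxp"),    # "copies" -> "copy"
}


def list_analyses(stem):
    """Return one extra analysis per list that contains the stem."""
    analyses = []
    if stem in FUNCTION_WORDS:
        analyses.append({"stem": stem, "type": "function_word"})
    for name, entries in GAZETTEERS.items():
        if stem in entries:
            analyses.append(
                {"stem": stem, "type": "named_entity", "gazetteer": name})
    if stem in BROKEN_PLURALS:
        root, singular = BROKEN_PLURALS[stem]
        analyses.append({"stem": stem, "type": "broken_plural",
                         "root": root, "lemma": singular})
    return analyses
```

Because a stem may appear in more than one list, the function returns a list of analyses rather than stopping at the first hit; each result would be added to the word's candidate analyses alongside those from the lexicon.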
Broken plural   Gloss                                 Singular   Gloss
’abwāq          Horns                                 būq        Horn
ḥafaẓa          Ones who know Qur’an by heart         ḥāfaẓ      One who knows Qur’an by heart
ḥayārā          Confused people                       ḥayrān     To become confused
ẖayāšῑm         Noses; gills                          ẖayšūm     Nose
nusaẖ           Copies                                nusẖah     Copy

Figure 8.16 Examples of broken plurals

The SALMA – Lemmatizer and Stemmer has been applied to lemmatize a large and varied Arabic Internet Corpus consisting of 176 million words of documents collected from the web (Sawalha and Atwell 2010b). Chapter 10 discusses this application of the SALMA – Lemmatizer and Stemmer. See section 2.3.4.2 for the definition of lemma, lemmatizing and stem. For further distinctions between concatenative morphology and templatic morphology see Habash (2010).

8.3.3 Module 3: SALMA – Pattern Generator

The templatic morphology of Arabic words is based on three elements: root, pattern and vowelization (vocalism). Roots are the three, four or five underlying letters of words. Roots are classified according to the number of their radicals into triliteral, quadriliteral or quinqueliteral (Habash 2010). The previous section 8.3.2 defines roots and explains the methodology followed to extract the roots of the analyzed words.

Patterns are the templates of combinations of consonants and vowels. The consonants represent slots for the root radicals to be inserted and the vowels represent the vocalism. The pattern is represented by a sequence of Cs representing the consonants and Vs representing the vocalism. An example is the pattern mVC1C2VC3 with the vocalism V=a. Using this pattern and the root كتب (k-t-b) “to write”, the word مَكْتَب maktab “office” is derived. The CV approach for representing patterns is widely used across languages (McCarthy and Prince 1990b; McCarthy and Prince 1990a; Smrz 2007; Attia 2008; Habash 2010).
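The derivation of maktab from the pattern mVC1C2VC3 can be reproduced with a short sketch. It only illustrates the CV notation used in the paragraph above; the function name `apply_cv_pattern` is hypothetical, and a single vocalism value is assumed for every V slot, as in the example in the text.

```python
import re


def apply_cv_pattern(pattern, root, vocalism):
    """Fill a CV pattern: each 'C<n>' slot takes the n-th root radical,
    each 'V' slot takes the vocalism; other letters are kept as they are."""
    def fill(match):
        token = match.group(0)
        if token == "V":
            return vocalism
        return root[int(token[1:]) - 1]  # 'C2' -> second radical
    return re.sub(r"C\d+|V", fill, pattern)


# Root k-t-b with vocalism a in the pattern mVC1C2VC3 gives "maktab".
word = apply_cv_pattern("mVC1C2VC3", "ktb", "a")
```

Numbering the consonant slots, rather than filling them left to right, lets the same function handle patterns in which a radical is repeated, at the cost of requiring the pattern to name each slot explicitly.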
Hundreds of years ago, patterns were defined by Arabic grammarians as الميزان الصرفي al-mῑzān aṣ-ṣarfῑ “the morphological scale”. The root letters in the patterns are represented by three letters, ف fā’ f, ع ‘ain E and ل lām l, representing the first, second and third radicals of the word respectively. The purpose of using the patterns is to standardize the morphological description, including the root letters and the vocalism of the derived words. The patterns group derivations of different roots into a template that describes the derivation process, the vocalism and the changes that might happen to the word during derivation (Ali 1987; al-Saydawi 2006).

The patterns are templates that enable root letters to be slotted in. Therefore, there are patterns that have three slots to suit triliteral roots (e.g. the word لَهَب lahab “flame” has the pattern فَعَل fa‘al faEal, the word جِسْم ğism “body” has the pattern فِعْل fi‘l fiEl, and the word كُسُوف kusūf “eclipse” has the pattern فُعُول fu‘ūl fuEuwl). Two further conventions apply. First, if the root is quadriliteral, having four radicals, then the fourth radical is represented by ل lām l, which is a repetition of the third radical. For example, the word صُعْلُوك ṣu‘lūk “robber” has the quadriliteral root ص-ع-ل-ك (ṣ-‘-l-k) and the pattern فُعْلُول fu‘lūl fuEluwl. Second, if one of the triliteral root letters is doubled, then the symbol that represents that letter in the pattern is also doubled. For example, the word رَسَّام rassām “painter”, which is derived from the triliteral root ر-س-م r-s-m “to paint”, has the pattern فَعَّال fa‘‘āl faEEaAl. In general, if a letter is added or doubled in the word, then the same letter is added or the corresponding letter is doubled in the pattern (Ali 1987; al-Saydawi 2006).

The pattern not only has slots for root letters and vocalism to be inserted, it also captures morphosyntactic and semantic characteristics of the derived words.
These characteristics are the basis for grouping Arabic words into families of formally and semantically related forms (Ali 1987). These morphosyntactic features are inherited by the words derived from that pattern. The next section 8.3.3.1 describes the construction of the pattern dictionary. The pattern dictionary depends on the SALMA morphosyntactic standards to describe the morphosyntactic attributes of the patterns, which are propagated to the derived words. Therefore, knowing the analyzed word’s pattern results in knowing most of its morphological feature values.

Two pattern matching algorithms are used to extract the correct pattern of the analyzed word. These algorithms depend on the pattern dictionary to match the word with its possible patterns. Sections 8.3.3.2 and 8.3.3.3 discuss the pattern matching algorithms.

Pattern matching has been investigated by many researchers, and several algorithms have been proposed to match the word with its possible patterns. The Xerox Arabic morphological analyzer depends only on finite-state operations (Beesley 1996; Beesley 1998). Alkhalil depends on large morphophonemic patterns (Mazroui et al. 2009; Boudlal et al. 2010). ElixirFM uses morphophonemic patterns that pertain to the morphological stem and reflect its phonological qualities (Smrz 2007). The choice between morphosyntactic patterns and morphophonemic patterns depends on the ability of the pattern matching algorithm to deal with the three types of changes that might happen to the word during derivation. Matching a word with a morphophonemic pattern can be easier than matching it with a morphosyntactic pattern. However, the number of patterns in the patterns dictionary will then be very large, and it is hard to collect, encode and describe the features of each pattern. On the other hand, morphosyntactic patterns are easier to collect, encode and describe the features of each pattern entry.
However, the pattern matching algorithm must then deal with the three types of changes: incorporation (or assimilation), substitution, and deletion of vowel letters. Thus, a more sophisticated pattern matching algorithm needs to be developed.

Incorporation is a common phonological process by which the sound of one letter blends with the sound of the following letter. For example, the word آمَنَّا ’āmannā “we believe” has two incorporations: maddah, which represents incorporation of the letter hamzah and the following ’alif; and the doubled ن nūn, which involves incorporation of the nūn that is the last letter of آمَن ’āman and the following nūn that is the first letter of the subject suffixed pronoun نَا nā. The word آمَنَّا ’āmannā |Aman~aA will match the pattern فَاعَلْنَا fā‘alnā fAElnaA. After resolving the two incorporations, the word will be أامَنْنَا ’āmannā >AmanonaA. Incorporation appears in the written script of the word and it is marked by šaddah.

Substitution is the process of changing one of the root radicals into another letter during the derivation process. Substitution happens to weak root letters: و wāw and ي yā’ are changed into ’alif or hamzah. The ’alif in the word صَلاة ṣalā “a prayer” is underlyingly و wāw in its root ص-ل-و ṣ-l-w. Substitution also happens to other letters of the pattern, such as ت tā’ in the pattern اِفْتَعَلَ ’ifta‘ala >ifotaEala. Where the first radical is ز zāy or ص ṣād, the ت tā’ is changed into د dāl or ط ṭā’ respectively. This kind of substitution happens because it is hard to pronounce the /t/ sound after /z/ or /sˤ/. The word اِزْدِهَار ’izdihār >izodihaAr “prosperity” has the root ز-ه-ر z-h-r and the pattern اِفْتِعَال ’ifti‘āl >ifotiEaAl. Here the third letter of the word, د dāl, has changed from the letter ت tā’ in the pattern. The word اِصْطَدَمَ ’iṣṭadama >iSoTdama “clashed” has the root ص-د-م ṣ-d-m and the pattern اِفْتَعَلَ ’ifta‘ala >ifotaEala.
Here the third letter of the word, ط ṭā’, has changed from the letter ت tā’ in the pattern.

Deletion of vowel letters or nūn is a mood mark; section 6.2.12 discussed the case and mood marks including deletion. A vowel letter at the end of an indicative verb is deleted if the verb is in the imperative or jussive mood. For example, in لا تَنْسَ! lā tansa! ‘Don’t forget!’, the verb تَنْسَ tansa ‘forget’ is in the jussive mood, marked by deleting the vowel letter ى ’alif from the end of the original verb تَنْسَى tansā. The nūn at the end of indicative verbs which follow one of the five common verb patterns, الأفعال الخمسة al-’af‘āl al-ẖamsah, is deleted in the subjunctive or jussive mood. For example, in قُولُوا خَيْراً تَغْنَمُوا qūlū ẖayran taḡnamū ‘If you speak well, you will get benefits’, the verb تَغْنَمُوا taḡnamū “you will get benefits” is in the jussive mood. Therefore, the final letter nūn is deleted from the verb to indicate the jussive mood. The same verb in the indicative mood is تَغْنَمُونَ taḡnamūna.

8.3.3.1 Constructing the Patterns Dictionary

The construction of the pattern dictionary started by collecting the morphosyntactic patterns from traditional Arabic grammar books (Ya‘qūb 1996), which provided the vowelized patterns and the morphosyntactic description in Arabic for each pattern. The morphosyntactic attributes of each pattern were determined and encoded using the SALMA – Tag Set standards. The full vowelization (vocalism) of each pattern was also added. The dictionary of morphosyntactic patterns contains 2,730 verb patterns and 985 noun patterns. Figure 8.17 shows sample entries of the patterns dictionary.

We chose to construct a pattern dictionary that contains morphosyntactic patterns, rather than morphophonemic patterns or CV patterns and vocalisms, because the morphosyntactic patterns are easier to collect, encode and describe the features of each pattern entry.
The two words تَدَحْرَج tadaḥrağ tadaHraj “rolled” and تَدَحْرُج tadaḥruğ tadaHruj “rolling” have the same CV pattern CVCVCCVC. It is thus impossible by this means to distinguish between the third person singular perfect verb tadaḥrağ “rolled” and the gerund tadaḥruğ “rolling”. However, the two words have the morphosyntactic patterns تَفَعْلَل tafa‘lal tafaElal and تَفَعْلُل tafa‘lul tafaElul respectively. The two patterns match the previous words and distinguish between the morphosyntactic features of each word. Unaugmented triliteral perfect verbs have the morphosyntactic pattern فَعَلَ fa‘ala faEala, which also indicates a third person masculine singular subject, as in the verbs قَالَ qāla qaAla “he said” and كَتَبَ kataba kataba “he wrote”. However, they have two morphophonemic patterns, فَالَ fāla faAla and فَعَلَ fa‘ala faEala respectively.

A pattern matching algorithm matches the analyzed words with their morphosyntactic patterns in the pattern dictionary. The morphosyntactic attributes, represented as a SALMA – Tag, and the vowelization of the matched patterns are propagated to the analyzed words. Two pattern matching algorithms were developed. Both of them mainly depend on the pattern dictionary. The next sub-sections discuss the pattern matching algorithms.

A syllabified version of each pattern was stored alongside the pattern, to be used in a future Arabic prosody project (see chapter 11 for future work). Dashes were used to separate the syllables of the patterns.

Verb patterns    SALMA Tag
faEalotu         v-p---nsfs-s-an??dst?-
faEalonaA        v-p---npfs-s-an??dst?-
faEalota         v-p---msss-s-an??dst?-
faEaloti         v-p---fsss-s-an??dst?-
faEalotumaA      v-p---xdss-s-an??dst?-

Noun patterns    SALMA Tag
>ufoEulAwaY      n?----??-v???---?dqt-?
AifoEiylAl       ng----??-v???---?dtt-?
fAEuwlA’         n?----??-v???---?dqt-?
fuEuloEulAn      n?----??-v???---?dqt-?
fuE~ayolA’       n?----??-v???---?dqt-?

Figure 8.17 Sample of the patterns dictionary

8.3.3.2 Pattern Matching Algorithm 1

The first pattern matching algorithm takes the word itself and its root as inputs. The algorithm replaces the root letters in the word with the pattern letters ف fā’ f, ع ‘ain E, and ل lām l. Then it searches in the patterns dictionary for the generated pattern and returns the morphosyntactic attributes and the vowelization of the analyzed word. However, replacing the root letters with the letters ف fā’ f, ع ‘ain E, and ل lām l is not straightforward, as some root letters might have changed. The changes include incorporation, turnover, defection and replacement. The algorithm must deal with these changes and extract the correct pattern of the word. The algorithm follows these steps to match the pattern while dealing with the changes that happen to the word during derivation:

1. Determine the root letters in the word:
   a) Find the index or indices of each root letter in the word. If the root letter is ’alif, wāw, yā’ or hamzah, then add -1 to the indices list of that root letter. The -1 value indicates that the root radical has changed. See figure 8.18 step 1a.
   b) Construct the candidate root indices lists by generating all possible combinations of the indices of the root radicals (step 1a), selecting one index from each radical’s indices list into one combined list. See figure 8.18 step 1b.
   c) Select the candidate root indices lists that satisfy the linguistic rule of derivation, whereby root letters must appear in the same order in the derived words.
      This means that the index of the first root radical must be less than the index of the second root radical, and both must be less than the index of the third root radical. The -1 value in the list does not violate the rule. See figure 8.18 step 1c.
2. Replace the root letters in the word with the pattern letters ف fā’ f, ع ‘ain E, and ل lām l. The indices of the root letters in the word are determined in the previous step (1c). See figure 8.18 step 2.
3. Search for the candidate pattern in the patterns dictionary. If the pattern is found, the SALMA – Tag associated with the pattern in the dictionary is assigned to the analyzed word.
4. If the word is fully vowelized or partially vowelized, then match the vowelization of the word with the vowelization of the pattern. Select only the patterns whose vowelization best matches the vowelization of the word.

The algorithm is repeated for each of the candidate analyses produced by the previous analyzer module. The patterns and the morphosyntactic attributes are added to each analysis.

8.3.3.3 Pattern Matching Algorithm 2

The second method of extracting the pattern of the word is based on the Pattern Matching Algorithm (PMA) (Alqrainy, 2008). The PMA matches a partially vowelized word, carrying only its last diacritic mark, with a pattern lexicon, without doing any analysis of the clitics and affixes of the word.

Pattern matching algorithm 2 first searches the patterns list for patterns of the same size as the analyzed word after removing the clitics of the word. For example, the form كتب ktb has a size of 3 according to the data structure we used, whether the word is fully-vowelized, partially-vowelized or non-vowelized. It matches the following patterns: فَعْل faEol, فَعَل faEal, فَعُل faEul, فَعِل faEil, فُعْل fuEol, فُعَل fuEal, فُعُل fuEul, فُعِل fuEil, فِعْل fiEol. In the
In the second step, the algorithm replaces the letters of the word that correspond to the letters ف fā’ (f), ع ‘ayn (E) and ل lām (l) of the pattern. The generated patterns are then searched for in the pattern list. If a pattern is found in the pattern list, then it is a candidate pattern of the word, and the morphological tag associated with the pattern in the list is assigned to the analyzed word. Figure 8.19 shows an example of extracting the pattern of the word using this method. Figure 8.20 shows examples of matched patterns and their SALMA Tags. The steps of pattern matching algorithm 2 are the following:

1. Get the patterns, from the patterns list, which have the same size as the analyzed word after removing the clitics of the word.
2. Choose the patterns that share the maximum number of letters with the analyzed word. This reduces the number of patterns to be processed.
3. Replace the letters of the word that correspond to the letters ف fā’ (f), ع ‘ayn (E) and ل lām (l) of the pattern.
4. Search for the candidate generated patterns in the pattern list. If the pattern is found in the pattern list, then the SALMA – Tag associated with the pattern in the list is assigned to the analyzed word.
5. If the word is fully vowelized or partially vowelized, then match the vowelization of the word with the vowelization of the pattern. Select only the vowelization of the patterns that best match the vowelization of the word.

Both pattern matching algorithms are used by the SALMA – Pattern Generator to match the analyzed word with its pattern from the patterns dictionary. Pattern matching algorithm 1 requires the root information to be available, while pattern matching algorithm 2 depends only on the patterns dictionary. Pattern matching algorithm 1 was developed mainly to solve the problems of the incorporation, deletion and substitution of the root radicals during the derivation process. Pattern matching algorithm 2 is an improved version of the PMA of Alqrainy (2008).
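Steps 1a to 2 of pattern matching algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the SALMA implementation: it works on unvowelized Buckwalter-transliterated strings, it handles triliteral roots only, the set of weak letters is an assumed subset, a radical marked -1 is simply left unreplaced, and the function name `candidate_patterns` is hypothetical.

```python
from itertools import product

# Assumed subset of Buckwalter symbols for 'alif, waw, ya' and hamzah forms.
WEAK = set("AwyY'>")

def candidate_patterns(word, root):
    """Return the f/E/l patterns generated for a word and its triliteral root."""
    # Step 1a: indices of each radical; weak radicals also get -1 (changed).
    index_lists = []
    for radical in root:
        positions = [i for i, ch in enumerate(word) if ch == radical]
        if radical in WEAK:
            positions.append(-1)
        index_lists.append(positions)
    patterns = set()
    # Step 1b: pick one index from each radical's list.
    for combo in product(*index_lists):
        # Step 1c: radicals that are present must keep the root order.
        present = [i for i in combo if i != -1]
        if any(a >= b for a, b in zip(present, present[1:])):
            continue
        # Step 2: substitute the pattern letters f, E, l at those indices.
        letters = list(word)
        for i, slot in zip(combo, "fEl"):
            if i != -1:
                letters[i] = slot
        patterns.add("".join(letters))
    return patterns

print(candidate_patterns(">Hsn", "Hsn"))
```

For the word of figure 8.18 this yields the single candidate >fEl, which would then be looked up in the patterns dictionary (step 3). With a weak root such as q-w-l and the surface form qAl, the middle radical receives -1 and the sketch leaves the ’alif in place.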
The original PMA matches the word against a dictionary of 8,718 patterns, most of them verb patterns. The PMA does not deal with clitics and affixes, so it must be provided with a large pattern dictionary covering all possible combinations of clitics and affixes attached to the pattern types. The SALMA – Pattern Generator uses only the matching steps of the PMA to match the word against the patterns stored in our patterns dictionary, after removing the clitics and affixes that are marked as not being part of the pattern; see section 8.3.1.5 for the details of the clitics and affixes dictionaries. Removing the unwanted clitics and affixes generalizes the pattern matching algorithm to the finite set of patterns represented by the patterns dictionary that we have constructed.

Word: ’aḥsana >aHosana “better”; Root: ḥ-s-n H-s-n
Step 1   Determine the root letters in the word
Step 1a  Find the index or indices of each root letter in the word
         Word indices (short vowels are not shown): [(>)0, (H)1, (s)2, (n)3]
         Indices of 1st root radical (H): [1]
         Indices of 2nd root radical (s): [2]
         Indices of 3rd root radical (n): [3]
Step 1b  Construct the candidate root indices lists
         Candidate indices list: [1, 2, 3]
Step 1c  Select the candidate root indices lists that satisfy the linguistic rule
         Indices list: [1, 2, 3]
Step 2   Replace the root letters in the word with the pattern letters
         Word:    [(>)0, (H)1, (s)2, (n)3]
         Pattern: [(>)0, (f)1, (E)2, (l)3]  >fEl ’f‘l
Step 3   Search for the candidate pattern in the patterns dictionary
         Matched patterns include >afoEal (n@----m?-v???---?dat-?, nj----m?-v???---?dat-?), >afoEala, >afoEalu, >afoEalo, >afoEulu, >afoEulo, >afoEilu, >afoEila, >afoEilo, >ufoEila, >ufoEilo, >ufoEalu and >ufoEula, each with its associated verb SALMA Tags (v-c---...).
Step 4   Match the vowelization of the word with the vowelization of the pattern
         Retained tags: n@----m?-v???---?dat-?, v-c---xsfdaf-an??dst?-, nj----m?-v???---?dat-?
Figure 8.18 Example of extracting the pattern of the words using the first method (the word and its root)

Word: ya‘malūna yaEomaluwna “They work”; word length = 6
Step 1   Get the patterns, from the patterns list, which have a similar size to the analyzed word
         yaf‘alūna yafoEaluwna, yaf‘alāni yafoEalaAni, taf‘alῑna tafoEaliyna, taf‘alāni tafoEalaAni, yaf‘ulān yafoEulaAn, … etc.
Step 2   Choose the patterns that share the maximum number of letters with the analyzed word
         yafoEaluwna = 4, yafoEalaAni = 3, yafoEulaAn = 3, tafoEaliyna = 2, tafoEalaAni = 2
Step 3   Replace the letters of the word corresponding to the letters (fā’ f, ‘ayn E and lām l) of the pattern
         Word:              y0 E1 m2 l3 w4 n5  yEmlwn
         Pattern:           y0 f1 E2 l3 w4 n5  yfElwn
         Generated pattern: y0 f1 E2 l3 w4 n5  yfElwn
Step 4   Search for the candidate generated patterns in the pattern list
         yafoEuluwna  v-c---mptdnn-an??dst?-
         yafoEiluwna  v-c---mptdnn-an??dst?-
         yafoEaluwna  v-c---mptdnn-an??dst?-
         yufoEiluwna  v-c---mptdnn-an??dat?-
         yufoEaluwna  v-c---mptdnn-pn??dtt?-
Step 5   Match the vowelization of the word with the vowelization of the pattern
         Pattern: yafoEaluwna  v-c---mpt--ian?-st?
Figure 8.19 Example of Pattern Matching Algorithm 2 processing steps

Word  Pattern  SALMA Tag
ktb   faEala   v-p---msts-a-an??dst?-
ktb   faEila   v-p---msts-f-an??dst?-
ktb   faEula   v-p---msts-f-an??dst?-
ktb   fuEila   v-p---msts-f-pn??dtt?-
ktb   faEol    nj----m?-v???---?dst-?
ktb   faEal    ng----m?-v???---?dst-?
ktb   faEul    n?----??-v???---?dst-?
ktb   faEil    nx----??-v???---?dst-?
ktb   fuEol    ng----??-v???---?dst-?
ktb   fuEal    n?----??-v???---?dst-?
ktb   fuEul    n?----??-v???---?dst-?
ktb   fuEil    n?----??-v???---?dst-?
Figure 8.20 Example of using the Pattern Matching Algorithm 2

8.3.4 Module 4: SALMA – Vowelizer

Vowelization is an important characteristic of the Arabic word. Vowelization helps in determining some morphological features of the words. The short vowel on the last letter helps in determining the case or mood of the word. The vowel on the first letter determines whether the verb is active or passive. Other diacritics such as šaddah and maddah (extension) resolve some ambiguities of words. Once the analyzed word has been matched against the patterns in the previous step, and given that the patterns are fully vowelized, the analyzer adds the short vowels which appear on the patterns to the analyzed word, whether it is partially vowelized or non-vowelized.
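The transfer of short vowels from a matched pattern to the word can be sketched as below. This is only an illustration under assumptions: both strings are in Buckwalter transliteration, the word has already had its clitics removed, any diacritics already on the word are discarded rather than checked against the pattern, and the names `DIACRITICS` and `vowelize` are hypothetical.

```python
# Buckwalter symbols for short vowels, sukun, tanwin and shaddah.
DIACRITICS = set("aiuoFNK~")

def vowelize(word, pattern):
    """Copy the diacritics of a fully vowelized pattern onto a word."""
    bare = [ch for ch in word if ch not in DIACRITICS]
    slots = [ch for ch in pattern if ch not in DIACRITICS]
    if len(bare) != len(slots):
        raise ValueError("word and pattern differ in size")
    letters = iter(bare)
    # Keep every diacritic of the pattern; fill each letter slot from the word.
    return "".join(ch if ch in DIACRITICS else next(letters) for ch in pattern)

for pattern in ("faEal", "fuEil", "fiEol"):
    print(pattern, vowelize("ktb", pattern))
```

Applied to the word of figure 8.19, `vowelize("yEmlwn", "yafoEaluwna")` gives yaEomaluwna. Each matched pattern produces one vowelized candidate, so a word that matches several patterns yields several candidate vowelizations.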
The result is a list of correctly and fully vowelized words with their possible analyses. Figure 8.21 shows the process of adding vowels to non-vowelized words.

Analyzed word  Pattern  Vowelization
ktb            faEol    katob
ktb            faEal    katab
ktb            faEul    katub
ktb            faEil    katib
ktb            fuEol    kutob
ktb            fuEal    kutab
ktb            fuEul    kutub
ktb            fuEil    kutib
ktb            fiEol    kitob
ktb            fiEil    kitib
Figure 8.21 Vowelization process example

8.3.5 Module 5: SALMA – Tagger

The SALMA – Tagger is built on top of the previous modules: the SALMA – Tokenizer, the SALMA – Lemmatizer and Stemmer, the SALMA – Pattern Generator and the SALMA – Vowelizer. Each module processes input words and produces direct results such as the root, lemma and pattern, together with intermediate results which are passed to the next module. Each module needs the intermediate results of the preceding modules to perform its own task. For instance, the SALMA – Pattern Generator accepts as inputs the root from the SALMA – Stemmer and the tokenization of the input word from the SALMA – Tokenizer, and uses the patterns dictionary to provide the morphosyntactic information needed to find the pattern of the word. Figure 8.4 shows the complete SALMA – Tagger algorithm and the relations of its component modules.

The SALMA – Tagger module is the last module and is responsible for adding the SALMA Tags to the morphemes of the analyzed word. Each morpheme is assigned a single SALMA Tag. The initially-assigned SALMA – Tags are given to the word’s morphemes by matching each morpheme with its equivalent in the morphosyntactic dictionaries included in the system. The initial morphological features tag assignment is discussed in sub-section 8.3.5.1. A rule-based system was developed and integrated into the SALMA – Tagger to predict the values of the morphological features which are not assigned in the initial tag assignment process.
Sub-section 8.3.5.2 discusses the different kinds of rules that were used to predict the morphological features of the analyzed word, and gives examples of these rules. Section 8.4 gives three examples of the complete sets of linguistic rules, those used to predict the morphological features of Person, Rational and Noun Finals. Section 8.3.5.3 shows the colour-coded tags for the word’s morphemes.

8.3.5.1 Initially-assigned SALMA Tags

Most Arabic words are complex words consisting of multiple morphemes. Each morpheme carries morphological features and belongs to a specific part-of-speech category. The SALMA – Tagger assigns a tag to each morpheme of the word, given that every entry in the linguistic lists used by the morphological analyzer has its morphological feature tags assigned. The preceding SALMA – Tokenizer and SALMA – Pattern Generator modules assign an initial SALMA – Tag to each morpheme of the analyzed words.

As discussed before, words should be decomposed into five parts: proclitics, prefixes, stem or root, suffixes and postclitics. The morphological analyser should then add the appropriate linguistic information to each of these parts of the word; in effect, instead of a tag for a word, we need a subtag for each part (and possibly multiple subtags if there are multiple proclitics, prefixes, suffixes and enclitics) (Sawalha and Atwell 2009a).

The SALMA – Tokenizer implements the above definition and segments the analyzed word into five parts. It assigns a SALMA – Tag to each clitic or affix by searching the clitics and affixes dictionaries. Once the clitic or affix is found in these dictionaries, the SALMA Tag associated with that dictionary entry is assigned to the clitic or affix of the word. See section 8.3.1.6 for more details about matching the word segments with the clitics and affixes dictionary entries.
The SALMA Tags assigned to the clitics and affixes of the analyzed words represent the initial tag assignment.

The SALMA – Pattern Generator extracts the pattern of the word by applying two pattern matching algorithms that depend on a pattern dictionary. The pattern dictionary associates a SALMA – Tag with each pattern entry. This tag is assigned to the analyzed word as an initial tag, which represents the tag of the stem of the word.

The initially-assigned SALMA – Tags specify whether each morphological feature category is applicable to the morpheme; a category that is not applicable is represented by “-” in the tag string. If the feature is applicable, then its value is either determined and represented by a single letter, or cannot be predicted initially and is represented by “?”. Figure 8.22 shows an example of assigning the initial tags to a word. The example shows that the morphological features of Transitivity, Rational and Verb Roots cannot be predicted at this stage of analysis.

Word: walananağziyannahum walanajoziyan~ahum “And we will surely reward them”
SALMA – Tokenizer (proclitics and prefixes dictionary; suffixes and enclitics dictionary):
   wa    p--c------------------
   la    p--z-----s-f----------
   na    r---a-----------------
   nna   p--z-----s-f----------
   hum   r---r-mpts-s----------
Root: ğ-z-y; Long stem: nağziyanna
SALMA – Pattern Generator (patterns dictionary):
   Pattern: naf‘alanna nafoEalan~a  v-c---xpfs-f-an??vst?-
   Initial tag of nağziyanna: v-c---xpfs-f-an??vst?-
Figure 8.22 Example of assigning initial SALMA Tags to all word’s morphemes

8.3.5.2 Rule-Based System to Predict the Morphological Feature Values of the Word’s Morphemes

A rule-based system was developed to predict the values of the morphological features of the analyzed word. For each morphological feature category, a set of rules that predict its value was extracted from traditional Arabic grammar books.
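How such rules refine an initial tag can be sketched as follows, using three of the verb rules listed in figure 8.23 and the initial tag of nağziyanna from figure 8.22. This is a minimal sketch under assumptions rather than the SALMA – Tagger code: the helper names and the two boolean inputs are hypothetical, and positions are counted from 1 as in the (position, “value”) notation of the figures.

```python
def set_feature(tag, position, value):
    """Return a copy of the tag with the 1-based position set to value."""
    return tag[:position - 1] + value + tag[position:]

def apply_verb_rules(tag, has_emphatic_nun, has_object_pronoun):
    """Fill in features of a verb tag that the dictionaries left undetermined."""
    if tag[0] != "v":                      # rules below apply to verbs only
        return tag
    # Emphasized imperfect verb: invariable (10, "s") with fathah (12, "f").
    if tag[2] == "c" and has_emphatic_nun:
        tag = set_feature(tag, 10, "s")
        tag = set_feature(tag, 12, "f")
    # An object suffixed-pronoun makes the verb transitive to one object.
    if has_object_pronoun and tag[15] == "?":
        tag = set_feature(tag, 16, "o")
    # Rational (17, "h") is the default value for verbs.
    if tag[16] == "?":
        tag = set_feature(tag, 17, "h")
    return tag

print(apply_verb_rules("v-c---xpfs-f-an??vst?-", True, True))
```

The sketch leaves position 21 (Verb Roots) as “?”, because that value depends on the root template, which is not modelled here.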
The SALMA – Tagger validates the initially-predicted values of the morphological features and predicts the values of the morphological features which were not assigned in the previous step. Figure 8.23 shows examples of the linguistic rules applied to validate and predict the values of the morphological features assigned to two particular words in context. The examples show how the values of other morphological features help in determining a given morphological feature. Different rules apply to different words in context. Section 8.4 gives examples of three sets of rules, used to predict the morphological features of Person, Rational and Noun Finals.

Analyzed word: nağziyanna najoziyan~a “surely reward”
Initial SALMA Tag: v-c---xpfs-f-an??vst?-
Category: Inflectional Morphology, Case or Mood, Case and Mood Marks. Tag: s (Inflectional Morphology), f (Case and Mood Marks). Rule: if the imperfect verb (1, “v”), (3, “c”) is emphasized (15, “n”), i.e. it has the suffix n or nna, the emphatic nūn, as one of the word’s morphemes.
Category: Transitivity. Tag: o. Rule: if the verb (1, “v”) has an object suffixed-pronoun in its suffixes, then it is transitive to one object.
Category: Rational. Tag: h. Rule: Rational is set as the default value for verbs (1, “v”).
Category: Verb Roots. Tag: x. Rule: the root ğ-z-y has the template C1-C2-Y.
The analyzed word is assigned the following SALMA Tag: v-c---xpfs-f-anohvstx-

Analyzed word: naṣrun “victory”
Initial SALMA Tag: ng----??-v???---?dst-?
Category: Gender. Tag: m. Rule: masculine is the default value if the word does not include the feminine suffixes tā’ marbūṭah, ’alif maqṣūrā or madd extension.
Category: Number. Tag: s. Rule: if the word is a declined noun (1, “n”), (10, “v or p”), does not have any dual or plural suffixes and is not found in the broken plural list.
Category: Inflectional Morphology. Tag: v. Rule: if the word ends with tanwῑn, then the word is a triptote.
Category: Case or Mood, Case and Mood Marks. Tag: n (Case or Mood), d (Case and Mood Marks). Rule: if the word ends with tanwῑn al-ḍamm.
Category: Definiteness. Tag: i.
Category: Rational. Tag: n. Rule: irrational is the default value for a gerund (1, “n”), (2, “g”).
Category: Noun Finals. Tag: s. Rule: if the last letter of the word is a consonant and is not a hamzah, then the word is a sound noun.
The analyzed word is assigned the following SALMA Tag: ng----ms-vndi---ndst-s
Figure 8.23 Examples of the linguistic rules applied to validate and predict the values of the morphological features

8.3.5.3 Colour Coding the Analyzed Words

To visualize the analysis, the word’s morphemes can be colour-coded. The colour-coding scheme depends on the morphological information of the analyzed word. The SALMA – Tokenizer and the SALMA – Tagger modules specify each of the word’s morphemes, its class (i.e. proclitic, prefix, stem, suffix or enclitic) and its part-of-speech category. The part-of-speech category of the stem is used to colour the stem: if the stem is a verb, noun, particle, other (residual) or punctuation mark, then it is coloured green, purple, blue, dark grey or black respectively. The morpheme class is used to colour-code the morphemes of type proclitic, prefix, suffix and enclitic. Each part is coded in a different colour (and possibly multiple colours if there are multiple proclitics, prefixes, suffixes and enclitics). Four colours are used for prefixes and suffixes: SlateBlue, LightCoral, Violet and Gold. Four colours are used for proclitics and enclitics: MediumTurquoise, SteelBlue, PowderBlue and MediumAquaMarine. Figure 8.24 shows the different colours used to colour-code the word’s morphemes. Figure 8.25 shows an example of a colour-coded word from the Qur’an Gold Standard. Figure 8.29 shows a colour-coded visualization of a full text, Qur’an Chapter 29, and an MSA sample from the CCA, showing just the morphemes without full SALMA – Tags; this illustrates morpheme boundaries.
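The colour assignment described above amounts to two small lookups, which can be sketched as follows. The colour names are those given in the text; the function name `colour_of`, the class labels and the convention that `ordinal` is the number shown for the morpheme in figure 8.24 are assumptions made for illustration.

```python
STEM_COLOURS = {
    "verb": "Green", "noun": "Purple", "particle": "Blue",
    "other": "DarkGrey", "punctuation": "Black",
}
AFFIX_COLOURS = ["SlateBlue", "LightCoral", "Violet", "Gold"]
CLITIC_COLOURS = ["MediumTurquoise", "SteelBlue", "PowderBlue", "MediumAquaMarine"]

def colour_of(morpheme_class, ordinal=1, stem_pos=None):
    """Return the colour name for one morpheme of an analyzed word."""
    if morpheme_class == "stem":
        return STEM_COLOURS[stem_pos]          # stem: coloured by part of speech
    if morpheme_class in ("prefix", "suffix"):
        return AFFIX_COLOURS[ordinal - 1]      # up to four prefixes or suffixes
    if morpheme_class in ("proclitic", "enclitic"):
        return CLITIC_COLOURS[ordinal - 1]     # up to four proclitics or enclitics
    raise ValueError("unknown morpheme class: " + morpheme_class)

print(colour_of("stem", stem_pos="verb"), colour_of("prefix", 2), colour_of("enclitic", 4))
```

Because the colour names are standard web colour keywords, the returned strings could be used directly as style values when rendering the morphemes in HTML.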
Morpheme class  1                2           3           4
Proclitics      MediumTurquoise  SteelBlue   PowderBlue  MediumAquaMarine
Prefixes        SlateBlue        LightCoral  Violet      Gold
Suffixes        SlateBlue        LightCoral  Violet      Gold
Enclitics       MediumTurquoise  SteelBlue   PowderBlue  MediumAquaMarine
Stem            Verb - Green; Noun - Purple; Particle - Blue; Other (Residual) - DarkGrey; Punctuation - Black
Figure 8.24 Colour codes used to colour code the morphemes of the analyzed words

Word-by-word translation: and-allah-will-surely-make-
p--c------------------  Particle | Conjunction
p--z-----s-f----------  Particle | Emphatic particle | Invariable (v, n) | fatḥah
r---a-----------------  Other (Residual) | Imperfect prefix
v-c---msts-f-anohvtta-  Verb | Imperfect verb | Masculine | Singular | Third Person | Invariable (v, n) | fatḥah | Active voice | Emphatic verb | Singly transitive | Rational | Conjugated / fully conjugated verb | Augmented by three letters | Triliteral | Intact verb
r---z----s-f----------  Other (Residual) | Emphatic nūn | Invariable (v, n) | fatḥah
Figure 8.25 Colour-coded example of a word from the Qur’an gold standard

8.4 Rules for Predicting the Morphological Features of Arabic Word Morphemes

A rule-based system was designed to predict the morphological features of the analyzed word’s morphemes. It depends on linguistic knowledge extracted from traditional Arabic grammar books (Dahdah 1987; Wright 1996; Al-Ghalayyni 2005; Ryding 2005). For each morphological feature category of the SALMA – Tag Set, a set of rules was extracted and encoded in the SALMA – Tagger.
The SALMA – Tagger executes these rules to predict and validate the values of the morphological features of the initial tags assigned to the word’s morphemes. Sophisticated linguistic knowledge was encoded as a rule-based system within the SALMA – Tagger. The encoded rules represent a variety of linguistic knowledge types. In the following, SALMA – Tagger features are cross-referenced to the sub-sections defining them.

First come rules that depend on data lists or dictionaries. These rules search for the analyzed word in the data dictionaries to predict the value of a given feature. The rule-based system includes several data lists. The broken plural list contains 9,513 entries and is used in predicting the morphological feature of Number (section 6.2.8). The named entities list comprises a personal names list of 2,099 entries, a location names list of 1,715 entries and an organization names list of 384 entries; it is used to predict the proper name attribute and the morphological feature of Rational (section 6.2.17). The transitive verbs lists (i.e. the doubly transitive verb list of 2,889 verbs and the triply transitive verb list of 1,065 verbs) are used to predict the values of the morphological feature of Transitivity (section 6.2.16). The five nouns list contains 21 entries covering all the variations of the five nouns that can be found in a text; it is used to predict the five nouns attribute and some attributes of the morphological features of Case or Mood (section 6.2.11) and Case and Mood Marks (section 6.2.12). The non-conjugated and partially-conjugated verbs lists are used to predict some values of the morphological feature category of Declension and Conjugation (section 6.2.18).
These lists include: a partially-conjugated verb list which contains 13 entries; a non-conjugated/restricted to the perfect verb list containing 42 verbs; a non-conjugated/restricted to the imperfect verb list containing 4 verbs; and a non-conjugated/restricted to the imperative verb list containing 13 verbs.

Second come rules that depend on the affixes and clitics of the words. Rules for predicting the morphological features of Gender (section 6.2.7), Number (section 6.2.8) and Person (section 6.2.9) of verbs check the combinations of prefixes and suffixes in the analyzed word. The number of nouns is predicted from both the suffixes of the analyzed word and a search for the analyzed word in the broken plural list. The morphological feature of Emphasized and Non-emphasized (section 6.2.15) depends on the presence or absence of the emphatic nūn suffix in the analyzed word. An emphasized verb, which has the emphatic nūn as a suffix, is an invariable verb: the morphological feature of Case or Mood (section 6.2.11) is not applicable and the Case and Mood Mark (section 6.2.12) is always fatḥah. A definite noun has a definite article as a proclitic.

Third come rules which depend on the pattern of the analyzed word. Some rules for predicting intransitive verbs (section 6.2.16) depend on patterns such as ’ifta‘ala AfotaEala, tafā‘ala tafaAEala and tafa‘‘ala tafaEEala. Determining whether the verb has one of the five-verb patterns (al-’af‘āl al-ẖamsa) is essential to predict the values of the morphological features of Gender (section 6.2.7), Number (section 6.2.8), Person (section 6.2.9), Inflectional Morphology (section 6.2.10), Case or Mood (section 6.2.11) and Case and Mood Mark (section 6.2.12). The SALMA – Pattern Generator is used to extract the pattern of the analyzed word.

Fourth come rules that depend on the root and stem of the analyzed word.
The SALMA – Stemmer and Lemmatizer is used to extract the root of the analyzed word. The root is essential to predict the values of the morphological features of Number of Root Letters (section 6.2.20) and Verb Roots (section 6.2.21). The SALMA – Tokenizer defines the analyzed word’s morphemes, including the stem and the long stem of the word. The stem is the middle part of the analyzed word after removing both the clitic and affix morphemes, while the long stem is the middle part of the analyzed word after removing the clitics only. The long stem is used to predict the value of the morphological feature of Noun Finals (section 6.2.22). It is also used with the root to predict the morphological feature of Unaugmented and Augmented (section 6.2.19).

Finally come rules which depend on the vowelization of the word. The main Case and Mood Marks (section 6.2.12) attributes are specified by the short vowel appearing on the final letter of the word. A noun that has tanwῑn on its final letter is an indefinite noun. A passive voice verb has ḍammah on its first letter.

A default value was selected for each morphological feature category. The default value is used when the rules for predicting the attribute value of a certain morphological feature are not applicable. The default value was selected on the basis of linguistic knowledge of the attribute values of the morphological features, rather than statistical analysis of the most frequent attribute values in a tagged corpus. A corpus analysis approach is not applicable because no Arabic corpus tagged with the full SALMA – Tag Set exists. Examples of default values are: the default value of the verb mood (section 6.2.11) is indicative; the default value for Rational (section 6.2.17) is rational for verbs and irrational for nouns; and the default value of the Number of Root Letters (section 6.2.20) is triliteral, as most roots of Arabic words are triliteral.
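The interplay between a specific rule and the default value can be sketched as follows for the vowelization-based rules and the defaults named above. This is an illustrative sketch only: the feature names, the `predict` function and the use of Buckwalter transliteration are assumptions, and the passive test treats ḍammah on the first letter as sufficient evidence, which the text states only in the other direction.

```python
TANWIN = set("FNK")  # Buckwalter symbols for the three tanwin marks

# Default values named in the text, keyed by (feature, main part of speech).
DEFAULTS = {
    ("mood", "verb"): "indicative",
    ("rational", "verb"): "rational",
    ("rational", "noun"): "irrational",
    ("number_of_root_letters", None): "triliteral",
}

def predict(feature, word, main_pos):
    """Apply a vowelization rule if one fires, otherwise fall back to a default."""
    if feature == "definiteness" and main_pos == "noun" and word[-1:] in TANWIN:
        return "indefinite"                 # tanwin on the final letter
    if feature == "voice" and main_pos == "verb" and word[1:2] == "u":
        return "passive"                    # dammah on the first letter
    specific = DEFAULTS.get((feature, main_pos))
    # "?" mirrors the tag convention for a value that is still undetermined.
    return specific or DEFAULTS.get((feature, None), "?")

print(predict("definiteness", "naSorN", "noun"), predict("mood", "yakotubu", "verb"))
```

Keeping the defaults in a table separate from the rules means a default can be changed without touching the rule logic, which fits defaults chosen from linguistic knowledge rather than corpus statistics.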
In this section, three examples are presented to show the complexity of designing and implementing the rule-based system that predicts the values of the morphological features of the word’s morphemes. Section 8.4.1 shows the rules for predicting the values of the morphological feature of Person (section 6.2.9). It also shows other morphological features whose values can be predicted using these rules: the Gender (section 6.2.7) and Number (section 6.2.8) of verbs. Section 8.4.2 shows an example of a hard-to-predict morphological feature, Rational (section 6.2.17). This example highlights the need to construct comprehensive dictionaries and linguistic lists. It also gives a good example of selecting the default value for Rational. Section 8.4.3 discusses the rules for the morphological feature of Noun Finals (section 6.2.22). These rules depend on the long stem of the analyzed word.

8.4.1 Rules for Predicting the Morphological Feature of Person

An Arabic verb has three main person attribute values: first person (al-mutakallim), second person (al-muẖāṭab) and third person (al-ḡā’ib). First person refers to the person or people speaking. Second person refers to the person or people who are present and sharing the talk or speech. Third person refers to the person or people who are absent and do not participate in the talk or speech (Ryding 2005).

The rules for predicting the morphological feature of Person mainly depend on the combinations of prefixes and suffixed pronouns attached to the verb. Subject suffixed-pronouns and genitive suffixed-pronouns describe the reference person of the verb and agree with the number and gender of the doer of the verb. The subject suffixed-pronouns are part of the circumfix (long stem), as they are part of the verb pattern, while the genitive suffixed-pronouns are treated as enclitics.
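For imperfect verbs, the prefix and subject suffix combinations of Table 8.4 can be sketched as a lookup table. This is a minimal sketch under assumptions: prefixes and suffixes are written as unvowelized Buckwalter strings, each value lists Person, Number and Gender in that order (tag positions 9, 8 and 7), and the names `IMPERFECT_RULES` and `person_number_gender` are hypothetical.

```python
# (prefix, subject suffix) -> candidate Person/Number/Gender values.
# Person: f, s, t; Number: s, d, p; Gender: m, f, x (common).
IMPERFECT_RULES = {
    (">", ""): ["fsx"],
    ("n", ""): ["fpx"],
    ("t", ""): ["ssm", "tsf"],     # second masculine or third feminine singular
    ("t", "An"): ["sdx", "tdf"],   # second dual or third feminine dual
    ("t", "wn"): ["spm"],
    ("t", "yn"): ["ssf"],
    ("t", "n"): ["spf"],
    ("y", ""): ["tsm"],
    ("y", "An"): ["tdm"],
    ("y", "wn"): ["tpm"],
    ("y", "n"): ["tpf"],
}

def person_number_gender(prefix, suffix=""):
    """Return every Person/Number/Gender reading for an imperfect verb."""
    return IMPERFECT_RULES.get((prefix, suffix), [])

print(person_number_gender("y", "wn"))
```

Two combinations are ambiguous on form alone (the prefix ta with no suffix, and ta with āni), so the lookup returns a list of candidates, leaving the choice to context or to agreement with the subject.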
The values of the morphological features of Gender, Number and Person of the subject suffixed-pronouns agree with those of the doer of the verb (the subject), while the genitive suffixed-pronouns agree in Gender, Number and Person with the object of the sentence (i.e. the person or thing that receives the action done by the subject of the verb). Subject suffixed-pronouns and genitive suffixed-pronouns can appear together in the same verb, and agreement is maintained with both the subject and the object of the sentence. For instance, the word yaqra’ūnahā ‘they read it’ has the prefix yā’ and the subject suffixed-pronoun ūn. This combination of prefix and suffixed pronoun indicates third person, masculine gender and plural number of the verb, while the genitive suffixed-pronoun hā indicates a third person, feminine, singular object (it).

Tables 8.3-8.5 list the rules for predicting the values of the morphological feature of Person, and the values of the related morphological features of Number and Gender, for perfect, imperfect and imperative verbs respectively.

Table 8.3 Rules for predicting the values of the morphological features of Person, Number and Gender for perfect verbs
Perfect verb (1, “v”) (3, “p”); Position 9, Person (al-’isnād)
Person category               Subject suffixed-pronoun  Genitive suffixed-pronoun  Person (9)  Number (8)  Gender (7)
First Person (al-mutakallim)  tu                        nῑ                         f           s           x
                              nā                        nā                         f           p           x
Second Person (al-muẖāṭab)    ta                        ka                         s           s           m
                              tumā                      kumā                       s           d           x
                              tum                       kum                        s           p           m
                              ti                        ki                         s           s           f
                              tunna                     kunna                      s           p           f
Third Person (al-ḡā’ib)       -                         hu                         t           s           m
                              ā                         humā                       t           d           x
                              ū                         hum                        t           p           m
                              t                         hā                         t           s           f
                              na                        hunna                      t           p           f

Table 8.4 Rules for predicting the values of the morphological features of Person, Number and Gender for imperfect verbs
Imperfect verb (1, “v”) (3, “c”)
Person category               Prefix (aoristic letter)  Subject suffixed-pronoun  Person (9)  Number (8)  Gender (7)
First Person (al-mutakallim)  ’a                        -                         f           s           x
                              na                        -                         f           p           x
Second Person (al-muẖāṭab)    ta                        -                         s           s           m
                              ta                        āni                       s           d           x
                              ta                        ūna                       s           p           m
                              ta                        ῑna                       s           s           f
                              ta                        na                        s           p           f
Third Person (al-ḡā’ib)       ya                        -                         t           s           m
                              ya                        āni                       t           d           m
                              ya                        ūna                       t           p           m
                              ta                        -                         t           s           f
                              ta                        āni                       t           d           f
                              ya                        na                        t           p           f

Table 8.5 Rules for predicting the values of the morphological features of Person, Number and Gender for imperative verbs
Imperative verb (1, “v”) (3, “i”)
Person category               Prefix (imperative letter)  Subject suffixed-pronoun  Person (9)  Number (8)  Gender (7)
Second Person (al-muẖāṭab)    ’                           -                         s           s           m
                              ’                           ā                         s           d           x
                              ’                           ū                         s           p           m
                              ’                           ῑ                         s           s           f
                              ’                           na                        s           p           f

8.4.2 Rules for Predicting the Morphological Feature of Rational

The morphological feature of Rational (see section 6.2.17) is important in deriving the sound plural from rational or irrational nouns. For example, an adjective describing an irrational masculine word may form its feminine sound plural by adding āt to the end of the adjective: ğabalun šāhiqun “high mountain” has the plural ğibālun šāhiqātun “high mountains”. Rules for predicting the morphological feature of Rational depend on the main and sub part-of-speech categories of the analyzed word. Table 8.6 lists the set of rules used to predict the value of the morphological feature of Rational.

The morphological feature of Rational is hard to predict automatically from rules based on the main and sub part-of-speech of the word. Classifying words as rational or irrational depends on the semantics of the word itself and its context. For example, an adjective should agree in rationality with the person or thing being described. If the adjective describes a person, as in rağulun ṭawῑlun “a tall man”, then the adjective ṭawῑl “tall” is rational.
But if the adjective describes a thing, as in ṭarῑqun ṭawῑlun “a long road”, then the adjective ṭawῑl “tall, long” is irrational. Therefore, a comprehensive dictionary which includes Rational information for each dictionary entry is needed to determine the correct Rational attribute value of the described nouns. An agreement algorithm is also needed to match the Rational attributes of the adjective and the described noun. Other types of agreement, such as verb-subject agreement, can also be used to predict the value of Rational.

The set of rules designed to predict the value of the morphological feature of Rational assigns a default value of rational or irrational to words depending on their sub part-of-speech, especially for words that would need dictionary lookup to find their morphological features. The words of some closed sub part-of-speech categories, such as demonstrative pronouns, can be gathered and classified into rational and irrational. Table 8.6 shows some of these rules. If none of these rules applies, then a default value is assigned depending on the sub part-of-speech of the analyzed word. Table 8.7 shows the types of nouns that take rational as a default value, and the types of nouns that take irrational as a default value. The default value of Qur’an verbs is rational.
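The order in which these sources are consulted can be sketched as follows: named-entity lists first, then closed-class words, then a default based on part of speech. This is a minimal sketch under assumptions: the list entries are placeholders standing in for the real lists, the words are unvowelized Buckwalter strings, only the defaults stated explicitly in the text (verbs and gerunds) are included, and `predict_rational` is a hypothetical name.

```python
# Placeholder entries; the real lists hold 2,099 personal names and
# 1,715 location plus 384 organization names.
PERSONAL_NAMES = {"mHmd"}
PLACES_AND_ORGS = {"lndn"}

# Closed-class words from Table 8.6: "h" = rational, "n" = irrational.
CLOSED_CLASS = {"mn": "h", "tlk": "n", "mA": "n", "mhmA": "n"}

# Default by noun sub part of speech; only the gerund default is certain here.
SUB_POS_DEFAULT = {"g": "n"}

def predict_rational(word, main_pos, sub_pos=None):
    """Return "h", "n" or "?" for the Rational feature (tag position 17)."""
    if word in PERSONAL_NAMES:
        return "h"
    if word in PLACES_AND_ORGS:
        return "n"
    if word in CLOSED_CLASS:
        return CLOSED_CLASS[word]
    if main_pos == "v":
        return "h"                      # rational is the default for verbs
    return SUB_POS_DEFAULT.get(sub_pos, "?")

print(predict_rational("mn", "n", "r"), predict_rational("nSr", "n", "g"))
```

A fuller version would extend `SUB_POS_DEFAULT` with the remaining noun types of Table 8.7 and add the agreement step, so that an adjective takes its value from the noun it describes.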
Table 8.6 Rules for predicting the values of the morphological features of Rational
Position 17: Rational (al-‘āqil wa ḡayr al-‘āqil)
Category                   Rule                                                          Sub part-of-speech  Words or lists
Rational ‘āqil (h)         Singular proper nouns (personal names)                        n                   Personal names list
                           Some demonstrative pronouns                                   d                   ’ulā’ika “those”
                           Some conditional nouns                                        h                   man “who”
                           Some relative pronouns                                        r, c                man “who”
                           Some interrogative pronouns                                   b                   man, man ḏā “who?, who is?”
                           Allusive nouns                                                a
Irrational ḡayr ‘āqil (n)  Singular proper nouns (organization and location names)       n                   Organizations list and Locations list
                           Some demonstrative pronouns                                   d                   tilka “that”
                           Some conditional nouns                                        h                   mā, mahmā “what, whatever”
                           Some relative pronouns                                        r, c                mā “what”
                           Some interrogative pronouns                                   b                   māḏā, mā “what”
                           Allusive nouns                                                a

Table 8.7 Default value of Rational and Irrational for sub part-of-speech categories of nouns, with a tag symbol at position 2
Category: Rational, Irrational
Noun types: Pronoun (p); Active participle (u); Intensive active participle (w); Passive participle (k); Gerund / Verbal noun (g); Gerund with initial mῑm (m); Gerund of instance (o); Gerund of state (s); Gerund of emphasis (e); Gerund of profession (i); Allusive noun (a); Adverb (v); Adjective (j); Noun of place (l); Noun of time (t); Five nouns (f); Relative noun (*); Diminutive (y); Instrumental noun (z); Generic noun (q); Numeral (+); Verb-like noun (&); Form of exaggeration (x); Collective noun ($); Plural generic noun (#); Elative noun (@); Blend noun (%); Ideophonic interjection (!)

8.4.3 Rules for Predicting the Morphological Feature of Noun Finals

Nouns are classified into six categories according to their final letters. Nouns that end with a consonant letter are called sound nouns. Semi-sound nouns end with a vowel letter preceded by a silent letter. A noun with a shortened ending ends with ’alif or ’alif maqṣūrā, if the last letter of the root is wāw or yā’.
If the noun ends with an added ’alif and hamzah, then it is called a noun with extended ending. A noun with a curtailed ending ends with yā’ preceded by a letter that has the short vowel kasrah. Finally, a noun with a deleted ending has fewer letters than its root. See section 6.2.22. Table 8.8 shows the rules for predicting the morphological feature of Noun Finals and the related features.

The rules for predicting the value of the morphological feature of Noun Finals depend mainly on the long stem and the root of the analyzed word. The rules check the final letters of the long stem against a set of conditions that classify nouns into six categories. Knowing the value of the Noun Finals feature helps in specifying other features, such as the morphological features of Inflectional Morphology and Case and Mood Marks. Case marks cannot appear on the last letter of nouns with shortened ending, and only fatḥah, the mark of the accusative case, appears on the last letter of nouns with curtailed ending.

Table 8.8 Rules for predicting the values of the morphological feature of Noun Finals

Sound noun, الاسم صحيح الآخر al-’ism ṣaḥῑḥ al-’āẖir (tag s)
• Rule: The last letter of the long stem is a consonant and not hamzah.
• Other features: Inflectional Morphology: noun is triptote / fully declined (10, ‘v’). Case marks appear on the last letter of the long stem.

Semi-sound noun, الاسم شبه الصحيح al-’ism šibh aṣ-ṣaḥῑḥ (tag i)
• Rule: The last letter of the stem is a vowel and the previous letter is silent (i.e. has sukūn as short vowel).
• Other features: Inflectional Morphology: noun is triptote / fully declined (10, ‘v’). Case marks appear on the last letter of the long stem.

Noun with shortened ending, الاسم المقصور al-’ism al-maqṣūr (tag t)
• Rule: The last letter of the stem is either ’alif or ’alif maqṣūrā, and the last letter of the root is wāw or yā’.
• Other features: Inflectional Morphology: noun is triptote / fully declined (10, ‘v’). Case markers do not appear on the last letter of the stem.

Noun with extended ending, الاسم الممدود al-’ism al-mamdūd (tag e)
• Rule: The last letter of the stem is either added ’alif, or the last two letters of the stem are added ’alif followed by hamzah or added ’alif followed by wāw, and the last letter of the root is not wāw or yā’.
• Other features: Inflectional Morphology: noun is triptote / fully declined (10, ‘v’), except that if the root is quadriliteral or quinquiliteral, the noun is non-declinable (10, ‘p’). Case markers appear on the last letter of the stem.

Noun with curtailed ending, الاسم المنقوص al-’ism al-manqūṣ (tag c)
• Rule: The last letter of the stem is yā’ preceded by a letter that has the short vowel kasrah, and the last letter of the root is yā’.
• Other features: Inflectional Morphology: noun is triptote / fully declined (10, ‘v’), except that if the word is a broken plural (8, ‘b’), the noun is non-declinable (10, ‘p’). Only the accusative case marker appears on the last letter of the stem. Nominative and genitive case markers do not appear.

Noun with deleted ending, الاسم محذوف الآخر al-’ism maḥḏūf al-’āẖir (tag d)
• Rule: The stem consists of two letters, or the stem consists of three letters where the third letter is tā’ marbūṭah, and the word has a triliteral root where the last root letter is a vowel.
• Other features: Inflectional Morphology: noun is triptote / fully declined (10, ‘v’). Case marks appear on the last letter of the long stem.

8.5 Output Format

The final outputs of the SALMA – Tagger include the input word and all possible analyses. Each analysis includes information about the root, the lemma, the pattern, the full vowelized form, the tokenization of the word into morphemes, and the detailed description of the morphosyntactic information of each morpheme using a SALMA – Tag. The output of the SALMA – Tagger covers all types of information recommended by the ALECSO/KACST standards. Moreover, the SALMA – Tagger assigns a SALMA – Tag to each morpheme which captures the detailed and fine-grained morphosyntactic information of that morpheme, whether it is a proclitic, prefix, stem, suffix or enclitic.
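As an illustration of the analysis record just described, the sketch below models one analysis and its morphemes and reads individual features out of a tag by position. The class and field names are hypothetical, not the tagger's own; the only facts taken from the text are that a tag has one character for each of 22 categories and that positions 2, 8, 10 and 17 hold the noun sub part of speech, number, inflectional morphology and Rational respectively.

```python
from dataclasses import dataclass, field
from typing import List

TAG_LENGTH = 22  # one character per morphosyntactic category


@dataclass
class Morpheme:
    text: str  # morpheme string
    kind: str  # "proclitic", "prefix", "stem", "suffix" or "enclitic"
    tag: str   # SALMA Tag for this morpheme

    def __post_init__(self) -> None:
        # Reject tags that do not have exactly one symbol per category.
        if len(self.tag) != TAG_LENGTH:
            raise ValueError(
                f"expected {TAG_LENGTH} characters, got {len(self.tag)}"
            )

    def feature(self, position: int) -> str:
        """Return the symbol stored at a 1-based tag position."""
        return self.tag[position - 1]


@dataclass
class Analysis:
    root: str
    lemma: str
    pattern: str
    vowelized: str
    morphemes: List[Morpheme] = field(default_factory=list)
```

For a generic noun stem tagged `nq----ms-pafd---hdbt-s` (one of the sample tags in the tagger output), `feature(17)` returns `h` (rational) and `feature(10)` returns `p` (non-declinable).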
The ALECSO/KACST standards recommend the description of the morphosyntactic information of the whole word or main stem only. Intermediate results can also be obtained from the different modules of the SALMA – Tagger, such as the root, lemma, pattern and possible vowelized forms of the word.

Several formats are available for the analyses produced by the SALMA – Tagger. The results are output as a tab-separated file, as an XML file and/or as an HTML page. The alternative formats and file types are provided to ensure wider re-use of the results of the SALMA – Tagger in different text analytics applications for Arabic. Our aim is to tag Arabic corpora with fine-grained morphosyntactic information. Therefore, these formats were selected to be compatible with accepted standards for storing text corpora. These standard formats also allow the results to be easily integrated with corpus analysis software, where simple tokenization, concordancing and a corpus query language can be used to investigate the results of the SALMA – Tagger.

A widely-used format for storing text corpora is the tab-separated column text file. This format has been used since the first versions of the Brown and LOB corpora. The SALMA – Tagger formats its outputs in a tab-separated column file, which is compatible with this widely-used corpus format. The SALMA – Tagger follows the same format as the MorphoChallenge 2009 Qur’an gold standard; see chapter 9. This format stores a word and its analyses per line. The first column contains the input word, and then the analysis is broken down into three columns: the root, the pattern, and the morphemes. A SALMA – Tag is assigned to each morpheme, separated from it by a single space. The morphemes are comma separated. Figure 8.26 shows a sample of the SALMA – Tagger results formatted in a tab-separated column file.

[Figure 8.26 SALMA – Tagger output formatted in a tab-separated column file. Each line of the sample gives a word, its root, its pattern and its morphemes, each morpheme followed by its 22-character SALMA – Tag, for example p--c------------------ for a conjunction proclitic and v-p---mpfs-s-amohvtt&- for a perfect verb stem.]

The second format uses XML files to store the results of the SALMA – Tagger. XML has become a widely-used and accepted standard for storing text corpora when structure is added to the stored corpus. XML tags are used to provide the appropriate structure to the data stored in XML files. The format has a hierarchical structure where the word is at the top of the XML document object model. Several analyses are provided by the SALMA – Tagger for each word of the input text. Each analysis contains the root, the lemma, the long stem, the pattern and the morphemes of the word. For each morpheme the morphosyntactic information is stored: the morpheme string, the SALMA – Tag, and the Arabic and English descriptions of the morphological features encoded in the tag. If the morpheme is a clitic or affix, then information such as morpheme kind, part of pattern and type is stored with the morpheme structure. Figure 8.27 shows the format of a word’s analysis stored in an XML file.

[Figure 8.27 SALMA – Tagger output format stored in an XML file. The sample shows one word with its word_str, root, lemma and long_stem elements, followed by its morphemes (a PROC conjunction, a STEM perfect verb and a SUFF suffixed pronoun), each with its morph_str, SALMA – Tag, and Arabic and English descriptions.]

The third format uses HTML files to store and display the results of the SALMA – Tagger. HTML is used to display the results in a visual way that shows the analyses of the words directly to the end user. This type of formatting is needed when an online interface is used to run the SALMA – Tagger by end users. However, the end user still has the choice of storing the results in a tab-separated column file or XML file, to be downloaded directly after the analyzer finishes execution. The HTML format also allows the results to be hyperlinked with other online applications. For instance, the root of the analyzed word is linked with the web interface of the SALMA-ABCLexicon. The HTML output file contains the morphosyntactic information of the analyzed words, such as the root, the lemma, the long stem, the pattern, the word type and the word’s morphemes. The morpheme type, the SALMA Tag and the Arabic and English descriptions are shown for each morpheme. Figure 8.28 shows a sample HTML page displaying some results of the SALMA – Tagger.

[Figure 8.28 SALMA – Tagger output formatted in an HTML file. For each of four sample words the page shows a header row (Word, Root, Lemma, Long stem, Pattern, Word type) and, for each morpheme, its number, morpheme string, type (PROC, STEM, SUF or ENC), SALMA Tag, and Arabic and English descriptions.]

Finally, the colour-coding module is used to visualize morphosyntactic information, such as the word’s morphemes and its part of speech, coded in colours. This colour-coded output format visualizes the complexity of Arabic words, and the number and types of morphemes that form a single word. Each morpheme is coloured depending on its type and part of speech. The details of the colouring scheme were discussed in section 8.3.5.3. The coloured outputs are displayed to the end user through a web interface as colour-coded text. The hyperlinking properties of web applications allow us to show the detailed analyses of each word of the displayed text by following the link assigned to each word. Figure 8.25 in section 8.3.5.3 shows an example of the detailed analysis of a colour-coded word.
Figure 8.29 shows two samples of colour-coded text: the top text is a Qur’an text (chapter 29), and the second sample is an MSA text taken from the CCA.

[Figure 8.29 Colour-coded output of the analyzed text samples of the Qur’an and MSA.]

8.6 Chapter Summary

Morphological analysis and part-of-speech (PoS) tagging are very important and basic applications of Natural Language Processing. In this chapter we highlighted the importance of morphosyntactic analyses in a wide range of NLP applications. Arabic has many morphological and grammatical features, including sub-categories, person, number, gender, case, mood, etc.
More fine-grained tag sets are often considered more appropriate. The additional information may also help to disambiguate the (base) part of speech. The SALMA – Tagger is a morphological analyzer for Arabic text which depends on pre-stored lists of prefixes, suffixes, roots, patterns, function words, etc. These lists were extracted by referring to traditional grammar books. The affixes lists were verified by analyzing the Qur’an, the Corpus of Contemporary Arabic, the Penn Arabic Treebank and the text of the 23 traditional Arabic lexicons as a fourth corpus. The prefixes list contains 220 prefixes. The suffixes list contains 474 suffixes and the patterns list contains 2,730 verb patterns and 985 noun patterns.

The morphological analyzer was developed to analyze the word and specify its morphological features. The SALMA – Tag Set is used as a standard for the development of morphological analyzers. The morphological analyzer uses a tokenization scheme for Arabic words that distinguishes between five parts of a word’s morphemes (i.e. proclitics, prefixes, stem, suffixes and enclitics). Each part is given a fine-grained SALMA Tag that encodes 22 morphosyntactic categories of the morpheme (or possibly multiple tags if the part has multiple clitics or affixes). The morphological analyzer uses linguistic lists of function words, named entities and broken plurals. It also uses the broad-coverage lexical resource constructed by analyzing 23 traditional Arabic lexicons. Coverage testing of this lexical resource showed that, for about 85% of the words processed by the lemmatizer, the broad-coverage lexicon was referenced and correct analyses were retrieved for the analyzed words.
The SALMA – Tagger algorithm involves a pipeline of processing stages, as shown in figure 8.4: tokenization, spelling error detection and correction, clitics and affixes matching, root extraction, lemmatizing, pattern matching, vowelization, morphological features tag assignment, and colour-coding of the word’s morphemes. These processing stages are useful on their own, so that users can choose the tool that suits their applications.

The SALMA – Tagger is an open-source fine-grained morphological analyzer for Arabic text. It depends only on open-source materials: lexicons, word lists and linguistic knowledge. The SALMA – Tagger consists of several modules which can be used independently to perform a specific task such as root extraction, lemmatizing and pattern extraction, or together to produce full detailed analyses of the words.

Chapter 9
Evaluation for the SALMA – Tagger

This chapter is based on the following sections of published papers:
Section 4 is based on section 5 in Sawalha and Atwell (2009a) and section 5 in Sawalha and Atwell (2009)
Section 5.1 is based on section 3 in Sawalha and Atwell (2011) and section 5 in Sawalha and Atwell (Under review)

Chapter Summary

The evaluation of the SALMA – Tagger depends on developing proposed standards for evaluating morphological analyzers for Arabic text, based on our experiences and participation in two evaluation contests: the ALECSO/KACST initiative for developing and evaluating morphological analyzers, and the MorphoChallenge 2009 competition. A reusable general-purpose gold standard (the SALMA – Gold Standard) was constructed for evaluating the SALMA – Tagger. It can be reused to evaluate other morphological analyzers for Arabic text and to allow comparisons between the different analyzers.
The SALMA – Gold Standard adheres to standards; is enriched with fine-grained morphosyntactic information for each morpheme of the gold standard text samples; contains two text samples of about 1000 words each, representing two different text domains and genres of both vowelized and non-vowelized text, taken from the Qur’an (chapter 29) and the CCA; and is stored in several standard formats to allow wider reusability.

The SALMA – Gold Standard was used to evaluate the SALMA – Tagger. The evaluation focused on measuring the prediction accuracy of the 22 morphological features encoded in the SALMA – Tags for each morpheme of the gold standard’s text samples. The results show that 53.50% of the Qur’an text sample morphemes and 71.21% of the CCA text sample morphemes were correctly tagged using “exact match” with the gold standard’s morpheme tags. The evaluation reported the accuracy, recall, precision, F1-score and confusion matrix for each morphological feature category, so that users who use or reuse the SALMA – Tagger, or parts of it, know the prediction accuracy of the attributes of each morphological feature category. The prediction accuracy was high for 15 morphological feature categories, at 98.53% - 100% for the CCA test sample and 90.11% - 100% for the Qur’an test sample, while slightly lower accuracy was scored by the other 7 morphological feature categories, at 81.35% - 97.51% for the CCA test sample and 74.25% - 89.03% for the Qur’an test sample.

9.1 Introduction

Several morphological analyzers for different languages, and especially for English, are available online, such as: EMERGE, SProUT, FLEMM, FreeLing, POSTAG, ROSANA, TWOL, and XeLDA; see section 2.3.
The high accuracy achieved by these morphological analyzers is due to: the availability of standard tag sets used to encode the morphosyntactic features of the analyzed words; the availability of morphosyntactically annotated corpora for free use by the research community; and the availability of evaluation methodologies and standards for evaluating the results of morphological analyzers and allowing comparative evaluations between them (Hamada 2010). However, there are no evaluation prerequisites (i.e. standards and resources) available for Arabic, whether automatic or manual. Therefore, the evaluation of morphological analyzers for Arabic text is not an easy task. It needs more investigation of the specific morphosyntactic features of Arabic, the development of a morphosyntactically tagged representative corpus, and the proposal of agreed standards to encode the morphosyntactic features of the output analyses.

Two community-based experiences of evaluating morphological analyzers for Arabic text, with proposed guidelines for evaluation, are the ALECSO/KACST initiative [62] (Hamada 2010) and the MorphoChallenge [63] competition (Kurimo et al. 2009). The ALECSO/KACST initiative aimed to encourage the development of open-source morphological analyzers for Arabic text which are high-accuracy, easy to develop, can be integrated into higher-level text analytics applications, and adhere to agreed standard guidelines. The MorphoChallenge competition aims to develop unsupervised morphological analyzers to be used for different languages including English, French, German, Finnish, Turkish and Arabic. The competition evaluates the participant systems against previously prepared gold standards for each language. The unsupervised morphological analyzer that achieves the highest accuracy in its outputs across the 6 languages wins the competition. The two experiences are discussed in sections 9.2 and 9.3 respectively.
This chapter focuses on evaluation techniques for morphological analyzers for Arabic text. The chapter reflects our experiences in evaluating morphological analyzers as participants in the ALECSO/KACST initiative and the MorphoChallenge 2009 competition. The chapter develops and proposes applicable standard guidelines for evaluating morphological analyzers for Arabic text. These guidelines were applied to evaluate the SALMA – Tagger. The evaluation procedure and results are discussed in the chapter.

62 The workshop of morphological analyzers experts for Arabic language, 26-28 April 2009, Damascus, Syria http://www.alecso.org.tn/index.php?option=com_content&task=view&id=1234&Itemid=1002&lang=ar
63 MorphoChallenge 2009 http://research.ics.tkk.fi/events/morphochallenge2009/

9.2 ALECSO/KACST Initiative Guidelines for Evaluating Morphological Analyzers for Arabic Text

The ALECSO/KACST initiative aimed to encourage the development of open-source morphological analyzers for Arabic text which are high-accuracy, easy to develop, can be integrated into higher-level text analytics applications, and adhere to agreed standard guidelines. The organizers invited world-wide Arabic morphological analyzer experts from universities, research institutions, software companies, a private legal institution and a non-governmental research funding organization, along with Arabic language scholars, to a workshop held in the Arabic Language Academy of Damascus, Syria in April 2009. The participants presented the specifications of their morphological analyzers, the development methodologies, the initial results of evaluation, and demos of the developed systems. The ALECSO/KACST initiative evaluation committee presented the specifications of the required morphological analyzer for Arabic text (Al-Bawaab 2009; Hamada 2009a); see section 8.2. The evaluation committee also presented the evaluation methodology.
Then the participants discussed the proposed evaluation methodology and agreed on the evaluation guidelines and procedures that would be followed to fairly evaluate and compare the different morphological analyzers. The discussions were based on the proposed evaluation methodologies presented by the participants (Dichy 2009; Hamada 2009b; Sawalha and Atwell 2009b).

The ALECSO/KACST initiative agreed to organize a competition between the participants’ analyzers. The evaluation committee provided the output format of the morphological analyzer and a test dataset consisting of selected words to represent most morphological and inflectional cases of Arabic words. A period of two months was given to the researchers to format the output of their analyzers to match the recommended format. On the day of the competition, the evaluation committee provided the participants with the test dataset containing 15 words. The participants ran their morphological analyzers on this test list and returned the results of their systems one day after receiving the test list. Then the evaluation committee evaluated the results received and announced the winner of the competition. However, the procedure they followed to evaluate the morphological analyzers was not reported, and the comparative evaluation results from participants’ analyzers with respect to the agreed evaluation guidelines were not revealed.

This section describes in detail the ALECSO/KACST initiative standards and guidelines for evaluating morphological analyzers for Arabic text. The evaluation process involves analyzing the outputs of the analyzers given a test dataset consisting of selected words which represent most morphological and inflectional cases of Arabic words. The outputs of the morphological analyzers are evaluated according to two criteria: linguistic analyses and technical specifications (i.e.
the approach to implementation, the extent to which it is user-friendly, the database management, the copyright and licensing issues and the accuracy metrics of recall and precision) (Hamada 2009b).

9.2.1 Evaluation of the Linguistic Specifications

The evaluation according to linguistic specifications checks the ability of the morphological analyzer to specify the morphosyntactic features of the analyzed words. The evaluation criteria are mainly based on the recommended morphosyntactic requirements for developing robust morphological analyzers for Arabic text (Al-Bawaab 2009; Hamada 2009b; Zaied 2009) and the development standards agreed by the participants; see section 8.2. The evaluation criteria include (Hamada 2009b):
• The ability to analyze all forms of words (i.e. fully vowelized, partially vowelized and non-vowelized).
• The ability to tokenize the analyzed word and to specify the word’s morphemes (i.e. proclitics, prefixes, stem, suffixes and enclitics).
• The ability to extract all correct roots and patterns of the analyzed word.
• The ability to specify the main part of speech of the analyzed word.
• The ability to add the correct vowelization to the analyzed word.
• The ability to identify the morphological features of verbs such as: transitivity, augmented or unaugmented, number of root letters, person, voice and mood.
• The ability to identify the morphological features of nouns such as: gender, number, relative noun or noun of diminution, and variability and conjugation.

9.2.2 Evaluation of the Technical Specifications

The guidelines for evaluating the technical specifications contain five evaluation criteria. These criteria are: the approach to implementation, user friendliness, database management, copyright and licensing, and the accuracy metrics of recall and precision.

9.2.2.1 The Approach to Implementation
• The clarity and simplicity of the morphological analyzer algorithm and development approach.
• The novelty of the algorithm.
• The ability to integrate the morphological analyzer or parts of it into other Arabic text analytics applications.
• The availability of complete documentation that describes the morphological analyzer development approach and usage.

9.2.2.2 User Friendliness
• The user interface of the morphological analyzer.
• The speed performance when analyzing words (words/second).
• The programming language used to develop the morphological analyzer.

9.2.2.3 Database Management
• The independence of the database (dictionaries) from the actual programs of the morphological analyzer.
• The ability to update the database (insert/delete/update) by the user, without running the morphological analyzer, or during the execution.

9.2.2.4 Copyright and Licensing

This criterion checks whether the morphological analyzer depends on open-source resources or closed-source resources developed by others.

9.2.2.5 Evaluation Metrics of Recall and Precision

Recall and precision can be used to compute the accuracy of the results for each morphological analyzer. Then, the accuracy results can be ranked for comparative evaluation of morphological analyzers. Recall and precision are defined in formulas 9.1 and 9.2.

$$\text{Recall} = \frac{\text{number of correct analyses returned}}{\text{total number of correct analyses}} \tag{9.1}$$

$$\text{Precision} = \frac{\text{number of correct analyses returned}}{\text{total number of analyses returned}} \tag{9.2}$$

9.3 MorphoChallenge Guidelines for Evaluating Morphological Analyzers for Arabic Text

The MorphoChallenge task is to develop an unsupervised learning algorithm which can return the morpheme analyses of each word, given lists of words in a number of target languages. In 2009, these were Arabic, English, Finnish, German and Turkish. The algorithm should be as language-independent as possible. All words in the training corpus occur in sentences, so the algorithm might utilize information about word context (Kurimo et al. 2009).

The training corpora were 3 million sentences for English, Finnish and German, and 1 million sentences for Turkish, in plain unannotated text files.
The training corpus for Arabic was the Qur’an, which is a small corpus consisting of only 78K words. The text of the Qur’an corpus is available in both vowelized and non-vowelized formats. For Arabic, the participants could test their algorithms using the vowelized words, the non-vowelized words, or both. The algorithms were separately evaluated against the vowelized and the non-vowelized gold standard analyses. For all Arabic data, the Arabic script was provided as well as the Roman script (Buckwalter transliteration [64]). However, only morpheme analyses submitted in Roman script were evaluated (Kurimo et al. 2009).

MorphoChallenge 2009 established three competitions for evaluating the morpheme analyses. Competition 1 evaluated the proposed morpheme analyses against a linguistic gold standard. It included all five test languages. The winners were selected separately for each language according to the highest F-measure of accuracy. Competition 2 evaluated the proposed morpheme analyses in information retrieval (IR) experiments, where the search was based on morphemes instead of words. The words in the documents and queries were replaced by their proposed morpheme representations. This competition included three of the test languages (Finnish, German and English). Competition 3 evaluated the proposed morpheme analyses using a machine translation (MT) model where the translation was based on morphemes instead of words. The words in the source language document were replaced by their morpheme representation. This competition included two of the test languages (Finnish and German). Translation was done from the test language to English. The performance was measured with BLEU scores (Kurimo et al. 2009).

9.3.1 MorphoChallenge 2009 Competition 1: Evaluation using Gold Standard

In Competition 1 the proposed unsupervised morpheme analyses were compared to the correct grammatical morpheme analyses of the linguistic gold standard.
The gold standard morpheme analyses were prepared in the same format as the result file the participants were asked to submit, with alternative analyses separated by commas. The Qur’an gold standard lists each word on a separate line. Each line contains the word, the root, the pattern and then the morphological and part-of-speech analysis (Kurimo et al. 2009).

Footnote 64: Buckwalter transliteration http://www.qamus.org/transliteration.htm

Unsupervised learning algorithms for analyzing Arabic text were only evaluated in Competition 1.

“… The basis of the evaluation is, thus, to compare whether any two word forms that contain the same morpheme according to the participants’ algorithm also has a morpheme in common according to the gold standard and vice versa. In practice, the evaluation is performed by randomly sampling a large number of morpheme sharing word pairs from the compared analyses. Then the precision is calculated as the proportion of morpheme sharing word pairs in the participant’s sample that really has a morpheme in common according to the gold standard. Correspondingly, the recall is calculated as the proportion of morpheme sharing word pairs in the gold standard sample that also exist in the participant’s submission ...” (Kurimo et al. 2009)

The F-measure, which is the harmonic mean of precision and recall, was selected as the final evaluation measure:

\[ \text{F-measure} = \frac{2}{\dfrac{1}{\text{Precision}} + \dfrac{1}{\text{Recall}}} \tag{9.3} \]

9.3.2 MorphoChallenge 2009 Qur’an Gold Standard
We developed the gold standard of the Qur’an to be used to evaluate morphological analyzers in MorphoChallenge 2009 Competition 1 (see footnote 65), which aimed to develop an unsupervised morphological analyzer to be used for different languages including Arabic. The gold standard size is 78,004 words.
The Qur’an gold standard contains the full morphological analysis for each word, according to the morphological analysis of the Qur’an in the Tagged database of the Qur’an developed at the University of Haifa (Dror et al. 2004). Figure 9.1 shows a sample of the Qur’an gold standard.

Footnote 65: Qur’an dataset http://www.cis.hut.fi/morphochallenge2009/datasets.shtml

Vowelized Arabic script and non-vowelized Arabic script: the same four words and analyses as below, with word forms, roots, patterns and morphemes written in Arabic script.

Vowelized Romanized script using Buckwalter transliteration scheme
Word          Root   Pattern   Analysis
bisomi        sm     None      b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
All~hi        None   None      llaah+Noun+ProperName+Gen+Def
Alr~aHomani   rHm    faElaAn   raHmaan+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Alr~aHiymi    rHm    faEiyl    raHiim+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def

Non-vowelized Romanized script using Buckwalter transliteration scheme
Word      Root   Pattern   Analysis
bsm       sm     None      b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
Allh      None   None      llAh+Noun+ProperName+Gen+Def
AlrHmn    rHm    fElAn     rHmAn+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
AlrHym    rHm    fEyl      rHym+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def

Figure 9.1 A sample of the MorphoChallenge2009 Qur’an gold standard, in 4 alternate formats

9.4 Gold Standard for Evaluation
As with other NLP tasks, it is customary to use gold standards for evaluating morphological analyzers. This is discussed in section 2.3.2 of this thesis, along with the construction of gold standard data sets for the Qur’an and MSA in section 3.4. This section proposes guidelines for constructing and using a gold standard for the evaluation of a fine-grained morphological analyzer for Arabic text.
Gold standards are used to evaluate and measure the accuracy of automatic systems. The evaluation can be used to compare different systems or algorithms on the same problem domain. It shows the successes and failings of an algorithm. Gold standards can be used to compute the similarity between systems by highlighting the cases of agreed analyses and the cases when a tie resulted. Moreover, a gold standard can be used to determine the specifications of a morphological analyzer by specifying which morphological features it can or cannot handle. This is another way to evaluate morphological analyzers, by describing their specifications.

To construct a gold standard for evaluation, we need to determine the problem domain of the algorithms to be evaluated, the corpus to be used as a gold standard, the format of the gold standard, its size, the script and transliteration scheme used, and the phases of constructing the gold standard.

9.4.1 Problem domain
The gold standard will be used to evaluate morphological analyzers and part-of-speech taggers for Arabic text. The gold standard should have morphological information and part-of-speech tags for each word of the selected corpus.

9.4.2 The Corpora
Corpora are used to build gold standards. Many Arabic language corpora have been developed. However, to build a widely used general-purpose gold standard, corpora of different text domains, formats and genres of both vowelized and non-vowelized Arabic text are needed. Two open-source corpora are recommended. First, the Qur’an corpus can be used in the construction of the gold standard. The Qur’an text is Classical Arabic, representing a genre-specific corpus which is morphologically different from Modern Standard Arabic. It represents a challenge to morphological analyzers for Arabic text because of its complex morphosyntactic features. The Qur’an sample is fully vowelized text.
Second, the Corpus of Contemporary Arabic (CCA) is an open-source Arabic corpus representing Modern Standard Arabic (Al-Sulaiti and Atwell 2004; Al-Sulaiti and Atwell 2005; Al-Sulaiti and Atwell 2006). This corpus contains 1 million words taken from different genres collected from newspapers and magazines. It contains the following domains: Autobiography, Short Stories, Children's Stories, Economics, Education, Health and Medicine, Interviews, Politics, Recipes, Religion, Sociology, Science, Sports, and Tourist and Travel. The text in the CCA is non-vowelized.

9.4.3 Gold Standard Format
The gold standard will include detailed morphosyntactic information for each of its words. The analysis divides the words into their morphemes: proclitics, prefixes, stem, suffixes and enclitics. For each morpheme, fine-grained morphological feature information will be provided. The SALMA – Tag Set is recommended for encoding the morphological features of the word’s morphemes (Hamada 2010). Moreover, the gold standard will contain basic morphological information such as the root, the lemma and the pattern of each word.

The gold standard will be stored in different file formats to meet a wide range of user requirements. Both tab-separated column files and XML files are recommended. A visual representation of the gold standard, such as HTML tables, is also recommended. The visual representation allows the end user to view the morphosyntactic information of the gold standard. Unicode UTF-8 encoding is recommended for all files (Bird et al. 2009 p.93) to enable a unified representation of Arabic letters on different platforms.

9.4.4 Gold Standard Size
The gold standard should be large enough to cover most cases that morphological analyzers have to handle. The gold standard size is measured by the number of words it contains.
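The format recommendations of section 9.4.3 can be illustrated with a minimal sketch that writes one gold-standard entry both as a tab-separated row and as an XML element. The class names, the XML element and attribute names, and the example word form with its root, lemma and pattern values are hypothetical placeholders introduced only for illustration; they are not the actual layout or content of the gold standard files. The two 22-character tags follow the SALMA tag shape used in this chapter.

```python
import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Morpheme:
    form: str   # surface form of the morpheme
    kind: str   # proclitic, prefix, stem, suffix or enclitic
    tag: str    # 22-character SALMA tag


@dataclass
class GoldWord:
    word: str
    root: str
    lemma: str
    pattern: str
    morphemes: list = field(default_factory=list)


def to_tsv_row(entry):
    """One row: word, root, lemma, pattern, then 'form,tag' per morpheme."""
    cells = [entry.word, entry.root, entry.lemma, entry.pattern]
    cells += [f"{m.form},{m.tag}" for m in entry.morphemes]
    return cells


def to_xml(entry):
    """Hypothetical layout: <word ...><morpheme type=... tag=...>form</morpheme></word>."""
    node = ET.Element("word", form=entry.word, root=entry.root,
                      lemma=entry.lemma, pattern=entry.pattern)
    for m in entry.morphemes:
        child = ET.SubElement(node, "morpheme", type=m.kind, tag=m.tag)
        child.text = m.form
    return node


# Placeholder entry: word form, root, lemma and pattern are illustrative only.
example = GoldWord(
    word="AlnAs", root="ROOT", lemma="LEMMA", pattern="PATTERN",
    morphemes=[
        Morpheme("Al", "proclitic", "r---d" + "-" * 17),
        Morpheme("nAs", "stem", "n#----mj-vndd---htst-s"),
    ],
)

# Serialize to an in-memory tab-separated line and to an XML string.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerow(to_tsv_row(example))
tsv_line = buffer.getvalue()
xml_text = ET.tostring(to_xml(example), encoding="unicode")
```

When writing to disk, opening the file with `open(path, "w", encoding="utf-8", newline="")` keeps the output in the recommended UTF-8 encoding.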
9.5 Building the SALMA – Gold Standard
This section discusses the process of building the SALMA – Gold Standard for evaluating morphological analyzers for Arabic text. The proposed standards are based on the agreed standards and guidelines, and on our experiences of and contributions to the ALECSO/KACST initiative and the MorphoChallenge 2009 competition for developing and evaluating morphological analyzers for Arabic text.

The SALMA – Gold Standard is aimed at the wider research community, for evaluating morphological analyzers for Arabic text and for comparing their outputs. Therefore, it includes the detailed morphosyntactic information that can be produced by morphological analyzers, such as the input word, its root, lemma, pattern, word type and the word’s morphemes. For each of the word’s morphemes, the standard shows the morpheme type (proclitic, prefix, stem, suffix or enclitic) and a fine-grained SALMA – Tag which encodes 22 morphological feature categories of the morpheme. These morphological features are described in Arabic and English.

The format of the gold standard is an important issue. The proposed gold standard is provided in different formats to meet a range of user needs. XML allows storage of the gold standard in a machine-readable structured format that increases its reusability. Tab-separated column files are widely used by researchers. They are used following the MorphoChallenge 2009 recommendations for constructing gold standards. Other formats are used to display the information of the gold standard to end users. These formats include HTML files and a visual display of the gold standard in colour-coded format. The SALMA – Gold Standard for evaluating Arabic morphological analyzers is an open-source resource that is available to download.

Two text samples were selected to construct the SALMA – Gold Standard. The first text sample is Chapter 29 of the Qur’an, representing Classical Arabic.
Section 9.5.1 discusses the construction of the Qur’an gold standard. The second text sample is taken from the CCA, representing Modern Standard Arabic. Section 9.5.2 discusses the construction of the CCA gold standard. Both samples were selected to represent a wider range of text types, formats and genres.

9.5.1 The Qur’an Gold Standard
The SALMA – Gold Standard Qur’an text sample was constructed by mapping from an existing specific format and broad tag set to the standardized format and the fine-grained SALMA – Tag Set (see section 7.2). The Quranic Arabic Corpus sample text chosen was chapter 29, consisting of about 1000 words. An automated mapping algorithm was developed to map the Quranic Arabic Corpus script, morpheme tokenization and morphological tags to meet our proposed standards and guidelines. After that, the automatically mapped results, including the morphological feature tags, were manually verified and corrected, to provide a new fine-grained gold standard for evaluating Arabic morphological analyzers and part-of-speech taggers.

The mapping from the Quranic Arabic Corpus format and morphological tag set to the proposed standards and guidelines for constructing gold standards and the SALMA – Tag Set was done by the following six-step procedure:

1. Mapping classical to modern character set: the Quranic Arabic Corpus uses the classical Othmani script of the Qur’an (77,430 words), which was mapped to Modern Standard Arabic (MSA) script (77,797 words). This was achieved by applying a one-to-one mapping, except for some cases where one word in Othmani script is mapped to two words in MSA, such as the word يَٰمُوسَىٰ yāmūsā ‘O Musa “Moses”!’, which is one word in Othmani script but is written as two words in MSA script: يَا مُوسَى yā mūsā.

2. Splitting whole-word tags into morpheme tags: the morphological tag in the Quranic Arabic Corpus is a whole-word tag, composed by combining the prefix with the stem and suffix morphological tags, separated by (+) signs.
The words and their morphological tags were automatically divided into morphemes and morpheme tags.

3. Mapping of feature labels: the mnemonics of the Quranic Arabic Corpus tags were mapped to their equivalents in the SALMA – Tag Set. Then, SALMA – Tag Set templates were applied to specify the applicable and non-applicable morphological features of the analyzed morpheme.

4. Adjustments to morpheme tokenization: due to the differences between the underlying word tokenization model used in the Quranic Arabic Corpus and the one required for the SALMA – Tag Set, we replaced the mapped tags of the prefixes and suffixes with SALMA tags by matching them to the clitics and affixes lists used by the SALMA – Tagger.

5. Extrapolation of missing fine-grained features: for morphological features which are not included in the Quranic Arabic Corpus tag set, automatic “feature-prediction” procedures applied linguistic knowledge extracted from traditional Arabic grammar textbooks, encoded as a computational rule-based system, to automatically predict the values of the missing morphological features of the word.

6. Proofreading and correction: the mapped SALMA tags were manually proofread and corrected by an Arabic language expert. The result is a sample gold standard annotated corpus for evaluating morphological analyzers and part-of-speech taggers for Arabic text.

Sections 7.3 and 7.4 discuss the mapping process in detail. On the test sample, 53.5% of the whole morpheme tags were predicted with an exact match on all 22 features; some of the errors were very minor, such as replacing one ‘?’ by ‘-’. The error rate of individual features was 2.01% for main part of speech, between 3% and 15% for morphological features coded in the QAC tags, and between 2% and 24% for features which do not exist in the QAC tags but can be automatically predicted.
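The two kinds of figures reported above, the whole-tag exact-match rate and the error rate of each individual feature, can be computed along the following lines. This is a minimal sketch rather than the evaluation code used in the thesis: the function names are hypothetical, tags are assumed to be 22-character strings with one character per feature, and the toy data are invented for illustration.

```python
def exact_match_rate(gold_tags, predicted_tags):
    """Share of morphemes whose whole predicted tag equals the gold tag."""
    matches = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return matches / len(gold_tags)


def per_feature_error_rate(gold_tags, predicted_tags, n_features=22):
    """Error rate for each of the tag positions, computed independently."""
    errors = [0] * n_features
    for gold, predicted in zip(gold_tags, predicted_tags):
        for i in range(n_features):
            if gold[i] != predicted[i]:
                errors[i] += 1
    return [count / len(gold_tags) for count in errors]


# Toy data: two gold tags; the second prediction differs in the last position only.
gold = ["v-p---msts-f-amohvsta-", "n#----mj-vndd---htst-s"]
predicted = [gold[0], gold[1][:21] + "-"]

whole_tag_rate = exact_match_rate(gold, predicted)
feature_rates = per_feature_error_rate(gold, predicted)
```

A single wrong character makes the whole tag count as a mismatch, which is why the exact-match rate can be much lower than the accuracy of any individual feature.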
9.5.1.1 Specifications of the Qur’an part of the SALMA Gold Standard
The construction of the SALMA – Gold Standard applied the proposed guidelines and standards for constructing gold standards for evaluating morphological analyzers of Arabic text. This section shows their application to the Qur’an sample of the SALMA – Gold Standard.

1- Problem domain
The Qur’an part of the SALMA – Gold Standard was constructed to evaluate morphological analyzers and part-of-speech taggers on Classical Arabic. It contains a detailed analysis of each word, which includes the input word, root, lemma, pattern, and the appropriate segmentation of the word into its morphemes. The morphological features for each of the word’s morphemes were encoded using SALMA – Tags. The detailed and fine-grained morphosyntactic information was provided to enable the wider research community to evaluate their morphological systems using a unified standard that enables comparisons between the various evaluated systems.

2- The Corpus
This is a text sample of the Qur’an, chapter 29 سورة العنكبوت sūrat al-‘ankabūt. The Qur’an text represents a genre-specific corpus which is morphologically different from Modern Standard Arabic. It represents a challenge to morphological analyzers for Arabic text because of its complex morphosyntactic features. The Qur’an sample is fully vowelized text. A non-vowelized copy is provided to evaluate morphological analyzers which do not accept vowelized input text. Morphological analyzers of Arabic text are expected to perform better on Modern Standard Arabic text than on the Qur’an text.

3- Gold Standard Format
The SALMA – Gold Standard is stored using a variety of file formats. Firstly, XML files were used for storage. Suitable XML tags were added to describe the detailed information of the analyses for words and their morphemes. Figure 9.3 shows an example of the SALMA – Gold Standard, Qur’an part, stored using XML files.
Secondly, widely used tab-separated column files were used to store the gold standard, following the MorphoChallenge 2009 recommendations for constructing gold standards. Each word and its analysis were stored on one line, where the word occupied the first column, followed by the root, the pattern and the morphemes in separate columns. The last column contains the morphemes, each followed by its SALMA Tag and separated from it by a comma. Figure 9.2 shows an example of the SALMA – Gold Standard, Qur’an part, stored using a tab-separated column file. Other formats are used to display the information of the gold standard to end users. These formats include HTML files and a visual display of the gold standard in colour-coded format. The SALMA – Gold Standard for evaluating Arabic morphological analyzers is an open-source resource that is available to download. See section 8.5 for the output format of the SALMA – Tagger.

4- Gold Standard Size
The size of the gold standard is measured by the number of words it contains. The SALMA – Gold Standard, Qur’an part, contains 976 words of 603 word types. These words were generated from 243 different roots, 367 different lemmas and 175 different patterns. The number of morphemes in this part is 1,942, having 471 different SALMA – Tags.

Figure 9.2 A sample of the SALMA – Gold Standard, Qur’an part, stored using text file

Morpheme type, SALMA – Tag and English feature descriptions of the XML sample (Arabic-script word forms and Arabic descriptions not reproduced):
PROC  p--i-----s------------  Particle | Interrogative particle | Structured (v, n)
STEM  v-p---msts-f-amohvsta-  Verb | Past verb | Masculine | Singular | Third Person | Structured (v, n) | fatḥah | Active voice | Non-emphatic verb | Transitive to one object | Human | Derivable - complete derived verb | Unaugmented | Tri-literal | Sound
PROC  r---d-----------------  Residual | Definite article
STEM  n#----mj-vndd---htst-s  Noun of genus in plural form | Masculine | Major plural | Varied (n) | Nominative (n), Indicative (v) | ḍammah | Definite | Human | Inert / Concrete noun (n) | Unaugmented | Tri-literal | Sound noun

Figure 9.3 A sample of the SALMA – Gold Standard, Qur’an part, stored using XML file

9.5.2 The Corpus of Contemporary Arabic Gold Standard
The SALMA – Gold Standard CCA text sample was constructed by using the SALMA – Tagger, then manually selecting and correcting the analysis of each word according to its context. This semi-automatic approach was followed because limitations of time, funds and availability of professional annotators made fully manual annotation impractical. On balance, it was more practical to run the SALMA – Tagger, which produced the initial analyses necessary to construct the gold standard.
Mapping from non-open-source part-of-speech tagged corpora such as the PATB was avoided because it contradicted the aim of constructing the SALMA – Gold Standard as an open-source resource available to the wider research community.

A 1000-word text sample was selected from the CCA. This MSA text sample was selected from three genres of the CCA: politics, sport and economics, the three main genres of newspaper articles. The selected text sample is non-vowelized. The construction of the SALMA – Gold Standard from the CCA text sample was done by selecting and correcting the outputs of the SALMA – Tagger run on this text sample. The SALMA – Tagger provided the detailed morphosyntactic information required by the gold standard, such as root, lemma, long stem, pattern, vowelized word and the word’s morphemes. A SALMA Tag was provided for each morpheme as well.

The manual selection and correction was needed because the SALMA – Tagger generates all possible analyses for each word. Therefore, one analysis suitable for the context was selected as a candidate analysis. Then, manual correction was carried out. The correction process involved verifying and correcting the detailed information about the root, lemma, pattern, fully vowelized form of the word and the word’s morphemes. The SALMA – Tag for each morpheme was then proofread and corrected. On the test sample, 71.12% of the whole morpheme tags were predicted with an exact match on all 22 features; some of the errors were very minor, such as replacing one ‘?’ by ‘-’.

9.5.2.1 Specifications of the CCA part of the SALMA Gold Standard
A similar methodology was followed to construct the CCA part of the SALMA – Gold Standard, applying the proposed guidelines and standards for constructing gold standards for evaluating morphological analyzers of Arabic text. This section shows their application to the CCA sample of the SALMA – Gold Standard.
1- Problem domain
The CCA part of the SALMA – Gold Standard was constructed to evaluate morphological analyzers and part-of-speech taggers on MSA text. The SALMA – Gold Standard contains a detailed analysis of each of its words. This information includes the input word, root, lemma, pattern, fully vowelized form of the word, and the appropriate segmentation of the word into its morphemes. The morphological features for each of the word’s morphemes were encoded using SALMA – Tags. The detailed and fine-grained morphosyntactic information was provided to enable the wider research community to evaluate their morphological systems using a unified proposed standard that enables comparisons between the various evaluated systems.

2- The Corpora
A text sample of the CCA consisting of about 1,000 words was selected. The CCA is a 1-million-word open-source MSA corpus collected from newspapers and magazines, which contains 14 genres. The sample was selected from politics, sport and economics, the three main genres of newspaper articles. The words of the CCA are morphologically simpler than the Qur’an text. However, the CCA still represents a challenge to morphological analyzers for Arabic text. Possible challenges of the CCA text to morphological analyzers are borrowed words, named entities, new vocabulary, transliterated words and relative nouns. The CCA sample is non-vowelized text. Fully vowelized forms of the words are provided in the gold standard. The morphological analyzers for Arabic text are required to produce the fully vowelized form of the analyzed words.

3- Gold Standard Format
The SALMA – Gold Standard, CCA part, used the same unified file format that is used to store the Qur’an part of the gold standard.
Two file types were used to store the detailed information of the gold standard: XML files, with appropriate XML tags that describe the stored information, and tab-separated column files, where each column contains one piece of the stored information. Figure 9.4 shows an example of the SALMA – Gold Standard, CCA part, stored using XML files. Figure 9.5 shows an example of the SALMA – Gold Standard, CCA part, stored using a tab-separated column file. Other formats are used to display the information of the gold standard to end users. These formats include HTML files and a visual display of the gold standard in colour-coded format.

4- Gold Standard Size
The size of the gold standard is measured by the number of words it contains. The SALMA – Gold Standard, CCA part, contains 1,122 tokens distributed into 1,015 Arabic words, 99 punctuation marks and 8 numbers. The sample contains 775 token types distributed into 756 Arabic word types, 13 punctuation marks and 6 numbers. The Arabic words in the sample were generated from 421 different roots, 594 different lemmas and 215 different patterns. The number of morphemes in this part is 2,172, having 452 different SALMA – Tags.

Morpheme type, SALMA – Tag and English feature descriptions of the XML sample (Arabic-script word forms and Arabic descriptions not reproduced):
Arabic Word, Stop Word
STEM  nd----ms-s-si---nns---  Noun | Demonstrative pronoun | Masculine | Singular | Invariable (v, n) | sukūn (Silence) | Indefiniteness | Irrational | Non-Inflected (n, v) | Unaugmented
Arabic Word
PROC  r---d-----------------  Other (Residual) | Definite article
STEM  nq----fb-v??d---ntat-s  Noun | Generic noun | Feminine | Broken plural | Triptote / fully declined | Definiteness | Irrational | Primitive / Concrete noun | Augmented by one letter | Triliteral | Sound noun

Figure 9.4 A sample of the SALMA – Gold Standard, CCA part, stored using XML file

Figure 9.5 A sample of the SALMA – Gold Standard, CCA part, stored using text file

9.6 Deciding on Accuracy Measurements
The ALECSO/KACST initiative evaluated morphological analyzers for Arabic text according to both linguistic and technical specifications of the morphological analyzer and its outputs. However, no gold standard for evaluation was provided. They relied on linguists to assess the linguistic information produced by the morphological analyzers for examples of challenging words. The technical specifications were assessed by a computational linguist. Even though no evaluation results were reported by the ALECSO/KACST initiative, they recommended using recall and precision metrics to compute the accuracy of the morphological analyzers according to formulas 9.1 and 9.2.
Section 9.2 discusses the ALECSO/KACST initiative for evaluating morphological analyzers.

MorphoChallenge 2009 Competition 1 evaluates the proposed morpheme analyses against a linguistic gold standard. The results of the participants’ algorithms were compared with the gold standard by checking whether any two words have a morpheme in common. The best morphological analyzer was selected according to the highest F-measure of accuracy, calculated using formula 9.3. The F-measure score is the harmonic mean of recall and precision. Precision was defined as the proportion of morpheme-sharing word pairs in the participants’ results that also have a morpheme in common in the gold standard. Recall was defined as the proportion of morpheme-sharing word pairs in the gold standard that are also found in the participants’ results.

In general, morphological analyzers of Arabic text are required to produce all possible analyses of the word form out of context. The SALMA – Tagger produces all possible analyses of the analyzed word form. The absence of a gold standard that contains all possible and correct analyses and their morphosyntactic information (i.e. root, lemma, pattern, vowelization, the word’s morphemes and their morphological feature descriptions) makes such an evaluation of an Arabic morphological analyzer impractical.

On the other hand, the SALMA – Gold Standard contains one correct analysis for each word, suitable to its context. The evaluation of a morphological analyzer using the SALMA – Gold Standard will check whether the correct analysis of the gold standard is among the possible analyses of the morphological analyzer. One analysis produced by the morphological analyzer that matches the correct word segmentation into morphemes, and possibly the SALMA – Tags of each morpheme, is selected. Then the tags for each morpheme of the selected analysis are compared with their equivalents in the gold standard.
The percentage of the correct whole morpheme tags is computed and reported. In the following evaluation, scores are for the “best” analysis, chosen by hand from the set of possible analyses output by the SALMA – Tagger.

Accuracy, precision, recall and F-measure are applicable to measure the accuracy of the individual morphological categories of the morpheme tags. The computed accuracy metrics measure the capacity of a morphological analyzer to predict the detailed morphological feature information encapsulated within the analyzed word. They also show the interdependency and the interrelationships between the different morphological categories of the morphemes. The next section discusses the evaluation of the SALMA – Tagger using the gold standard, concentrating on the application of evaluation metrics to measure the accuracy of the individual morphological feature categories. Chapter 10 discusses the evaluation of the SALMA – Lemmatizer and Stemmer on the Qur’an and the Arabic Internet Corpus.

9.7 Evaluating the SALMA – Tagger Using Gold Standards
The focus in evaluating the outputs of the SALMA – Tagger is to evaluate the prediction accuracy of the 22 morphological feature categories of each morpheme using the SALMA – Gold Standard. Other intermediate outputs can be evaluated separately, e.g. the evaluation of the SALMA – Lemmatizer and Stemmer; see section 10.2. Therefore, for each word of the test samples (i.e. the Qur’an text sample and the CCA text sample) the analysis that best matches its equivalent in the SALMA – Gold Standard was chosen as a candidate analysis for evaluation. Then the evaluation metrics of accuracy, recall, precision and F-measure were computed.

Two aspects for measuring the accuracy of the SALMA – Tagger were investigated:
• Applicability: equates to whether or not a value is entered at the expected position in the tag string.
• Correctness: equates to the correct value for a feature, mapped to the correct position in the tag string.

These aspects were used to define the elements of the confusion matrix. One advantage of a confusion matrix is that it counts and visualizes the cases where the system confuses two classes (i.e. commonly gives one tag as another). Another advantage is that it allows the values of accuracy, recall, precision and F-measure of the SALMA – Tagger outputs to be computed. The confusion matrix elements are TP (True Positive), TN (True Negative), FP (False Positive) and FN (False Negative); see figure 9.6. These elements were defined according to the observations of the outputs as follows:
• TP (True Positive): True and applicable; the case was applicable and predicted correctly. Two conditions, applicability and correctness, are needed to classify the prediction as TP. First, the morphological feature is applicable. Second, the attribute value of that morphological feature is correctly predicted.
• TN (True Negative): True and not applicable; the case was not applicable and predicted as not applicable.
• FN (False Negative): False prediction of applicable cases; the case was applicable but predicted as not applicable.
• FP (False Positive): False prediction of not applicable cases; the case was not applicable but predicted as applicable by giving a tag in the expected position.

Aspects:
Applicable       FN   TP
Not Applicable   FP   TN

Confusion Matrix   Predicted positive   Predicted negative
Positive cases     TP                   FN
Negative cases     FP                   TN

Figure 9.6 The confusion matrix aspects and elements

The definition of the confusion matrix elements depends on two conditions: applicability and correctness. These conditions overlap in some cases where the positive cases are given a wrong tag other than “-”.
Using a confusion matrix, the analyses are classified into four categories, but the observations made from analysing the output data distinguish between 5 categories:

1- Applicable case and predicted correctly, which represents the TP category. E.g. predicting the number of a noun as singular ‘s’, which matches the number feature of the same noun in the gold standard, where it is tagged as singular ‘s’.
2- Not applicable case and predicted not applicable, which represents the TN category. E.g. the morphological feature category of person is not a feature of proper nouns. Hence, proper nouns have “-” in the ninth position of their tags. A case is classified as TN if the morphological analyzer predicts the value of the person feature as non-applicable and gives a “-” tag.
3- Applicable case and predicted not applicable (tagged by “-”), which can fit into the FN category. This case happens if the morphological analyzer gives a “-” tag for the morphological feature of number, which is an applicable feature for proper nouns. The gold standard has a valid tag for the number feature of proper nouns: ‘s’ (singular), ‘d’ (dual), ‘p’ (sound plural) or ‘b’ (broken plural).
4- Not applicable case (tagged in the gold standard by “-”) and predicted as applicable, which can fit into the FP category. Theoretically, this case should not occur. However, some morphological features such as Inflectional Morphology, Case or Mood, and Case and Mood Marks depend on each other. Predicting the value of inflectional morphology for a perfect verb as ‘d’ (conjugated) will affect the prediction of Case or Mood by giving a tag for a non-applicable morphological feature.
5- Applicable case and predicted wrongly by giving a tag other than “-”. E.g. predicting the number of a proper noun as singular by giving the tag ‘s’ while that proper noun is a broken plural, which is tagged by ‘b’ in the gold standard.
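The five observations listed above can be expressed as a small decision procedure over one tag position. The following is only a minimal sketch, not the evaluation code used for the SALMA – Tagger: the function name `classify_observation` and the labels "O1" to "O5" are hypothetical, and the sketch assumes that any character other than "-" (including "?") counts as an applicable value.

```python
NOT_APPLICABLE = "-"


def classify_observation(gold: str, predicted: str) -> str:
    """Classify one tag position into one of the five observations.

    gold and predicted are the single characters found at the same
    position of the gold standard tag and of the predicted tag.
    Assumption: every character other than "-" is an applicable value.
    """
    gold_applicable = gold != NOT_APPLICABLE
    predicted_applicable = predicted != NOT_APPLICABLE

    if gold_applicable and predicted_applicable:
        # Observation 1 (correct value) or observation 5 (wrong value).
        return "O1" if gold == predicted else "O5"
    if not gold_applicable and not predicted_applicable:
        return "O2"  # not applicable, predicted not applicable
    if gold_applicable:
        return "O3"  # applicable, but predicted "-"
    return "O4"      # not applicable, but a value was predicted


# Examples taken from the observations above (number feature of a noun).
print(classify_observation("s", "s"))  # O1
print(classify_observation("-", "-"))  # O2
print(classify_observation("b", "-"))  # O3
print(classify_observation("-", "v"))  # O4
print(classify_observation("b", "s"))  # O5
```

Keeping the fifth case separate, rather than forcing it into FP or FN, is what motivates the extended confusion matrix.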
The last observation (O5) can fit into the FP category, because it is part of the positive predictions made by the analyzer, and into the FN category, because it is counted among the positive cases in the gold standard. According to the definitions of precision and recall (see formulas 9.5 and 9.6), the fifth observation affects both the recall and the precision of the system. However, an ordinary confusion matrix only allows data to be classified into one of its four categories. An extended version of the confusion matrix, in which each of the five observations fits into exactly one category, was therefore developed.

The development of the extended confusion matrix required normalizing the tags of the gold standard and the outputs of the analyzer to three symbols: ‘C’ (character), ‘W’ (wrong) and ‘-’ (not applicable). According to the above observations, new tags for the gold standard and the outputs of the analyzer were generated by mapping the original tags into the three tags used for evaluation. These three evaluation tags are not shown in the outputs of the analyzer. They are only used to extend the confusion matrix so that it can fit 5 categories instead of the ordinary four. Figure 9.7 illustrates the rules for mapping the original tags into the three evaluation tags, depending on the above five observations. Figure 9.8 gives an example of the mapping process and the normalized evaluation tags for the word kuzmūbūlītān ‘cosmopolitan’, a borrowed word which represents a challenging example for the morphological analyzer when predicting morphological features. However, it is a good example because it contains all five observations and demonstrates the mapping process.
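The normalization step, and the way the five observation counts feed the metrics of formulas 9.4 to 9.7, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original evaluation script: the function names `normalize`, `count_observations` and `metrics` are hypothetical, any character other than "-" is treated as an applicable value, and a metric whose denominator is zero is returned as 0.0 (which mirrors the 0.00% reported for punctuation in the Qur’an sample, but is otherwise an assumption). The tag strings are those of Figure 9.8 and the counts in the last example are the Gender counts of the CCA sample in Figure 9.10.

```python
from collections import Counter


def normalize(gold_tag: str, predicted_tag: str) -> tuple:
    """Map a gold tag and a predicted tag onto the symbols C, W and -."""
    norm_gold, norm_pred = [], []
    for g, p in zip(gold_tag, predicted_tag):
        norm_gold.append("-" if g == "-" else "C")
        if p == "-":
            norm_pred.append("-")
        elif g == "-" or p == g:
            norm_pred.append("C")  # a value was given (O4) or it is correct (O1)
        else:
            norm_pred.append("W")  # a value was given but it is wrong (O5)
    return "".join(norm_gold), "".join(norm_pred)


# (reference symbol, test symbol) -> observation label
CELLS = {("C", "C"): "O1", ("-", "-"): "O2", ("C", "-"): "O3",
         ("-", "C"): "O4", ("C", "W"): "O5"}


def count_observations(norm_gold: str, norm_pred: str) -> Counter:
    """Count how often each of the five observations occurs."""
    return Counter(CELLS[pair] for pair in zip(norm_gold, norm_pred))


def _ratio(numerator: float, denominator: float) -> float:
    # Assumption: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def metrics(o1: int, o2: int, o3: int, o4: int, o5: int) -> dict:
    """Accuracy, recall, precision and F1 from the five observation counts."""
    accuracy = _ratio(o1 + o2, o1 + o2 + o3 + o4 + o5)
    recall = _ratio(o1, o1 + o3 + o5)      # O5 is counted against recall
    precision = _ratio(o1, o1 + o4 + o5)   # ... and against precision
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


gold, pred = normalize("nj--x-xb----i---hns--s", "nq----ms-v??i---nts--s")
print(gold)  # CC--C-CC----C---CCC--C
print(pred)  # CW----WW-CCCC---WWC--C
print(sorted(count_observations(gold, pred).items()))

# Gender, CCA sample: O1=1137, O2=970, O3=0, O4=10, O5=54
for name, value in metrics(1137, 970, 0, 10, 54).items():
    print(f"{name}: {value:.2%}")
```

For the Gender category of the CCA sample, the sketch reproduces the reported values of 97.05% accuracy, 95.47% recall, 94.67% precision and 95.07% F1-score.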
Observation                                              Original tags        Normalized tags
                                                         Gold    Predicted    Gold    Predicted
O1  Applicable case and predicted correctly              a       a            C       C
O2  Not applicable case and predicted not applicable     -       -            -       -
O3  Applicable case and predicted not applicable         b       -            C       -
O4  Not applicable case and predicted as applicable      -       c            -       C
O5  Applicable case and predicted wrongly                d       e            C       W

Figure 9.7 Normalizing the gold standard and predicted tags into (-, C, W) evaluation tags

Word: kuzmūbūlītān ‘cosmopolitan’

Original tags      Gold standard    nj--x-xb----i---hns--s
                   Predicted        nq----ms-v??i---nts--s
Normalized tags    Gold standard    CC--C-CC----C---CCC--C
                   Predicted        CW----WW-CCCC---WWC--C

Figure 9.8 Example of normalizing the gold standard and predicted tags into (-, C, W) evaluation tags

The new extended confusion matrix contains three rows and three columns marked by (-, C, W). The confusion matrix is then filled by comparing the normalized tags. The 5 observations fit directly into the confusion matrix. Figure 9.9 shows the skeleton of the confusion matrix and where the five observation values fit in the matrix. The five observations are marked by O1-O5, where the numbers 1-5 represent the observation number as listed above. The other entries in the confusion matrix are always zero, marked by ‘.’, because the output tags of the analyzer cannot be classified into these entries. The figure shows the entries of the confusion matrix that are used to compute the values of accuracy, precision and recall. Figures 9.10 and 9.11 show the confusion matrices generated for each morphological feature category of the morphemes’ SALMA – Tags.

Confusion matrix (row = reference; col = test)

     |   -     C     W
  ---+------------------
  -  |  <O2>   O4    .
  C  |   O3   <O1>   O5
  W  |   .     .    <.>

Entries used to compute ‘Accuracy’: O1 and O2, over all of O1-O5
Entries used to compute ‘Precision’: O1, O4 and O5
Entries used to compute ‘Recall’: O1, O3 and O5

Figure 9.9 The confusion matrix and the entries used to compute the evaluation metrics

Using the extended confusion matrix, the values of the accuracy metrics were computed and reported. The first accuracy metric computed is Accuracy. The accuracy is defined as the percentage of correct predictions made for a certain morphological feature category. Formula 9.4 is used for the computation of accuracy.

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{O_1 + O_2}{O_1 + O_2 + O_3 + O_4 + O_5} \tag{9.4}
\]

Recall is defined as the percentage of applicable cases that are correctly predicted out of the total number of actual positive cases in the gold standard. Formula 9.5 illustrates the computation of recall.

\[
\text{Recall} = \frac{\text{correctly predicted applicable cases}}{\text{actual positive cases in the gold standard}} = \frac{TP}{TP + FN} = \frac{O_1}{O_1 + (O_3 + O_5)} \tag{9.5}
\]

Precision is defined as the percentage of applicable cases which are correctly predicted out of the total number of positive predictions. Formula 9.6 illustrates the computation of precision.

\[
\text{Precision} = \frac{\text{correctly predicted applicable cases}}{\text{positive predictions}} = \frac{TP}{TP + FP} = \frac{O_1}{O_1 + (O_4 + O_5)} \tag{9.6}
\]

F-measure (F1 score) is the harmonic mean that combines precision and recall. It is interpreted as a weighted average of the precision and recall. The F1 score reaches its best value at 1 (100%) and its worst at 0 (0%). Formula 9.7 illustrates the computation of the F1 score.

\[
F_1\ \text{score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9.7}
\]

Results reported err on the side of caution by adding the number of cases of O5 to both the recall and the precision equations.

In the two figures below, each row gives the non-zero cells of one confusion matrix, written as (reference, test) and labelled with the corresponding observation; all other cells are ‘.’ (zero) in every matrix.

#   Category                         (-,-) O2   (-,C) O4   (C,-) O3   (C,C) O1   (C,W) O5
1   Main Part-of-Speech                     0          0          0       2170          1
2   Part-of-Speech: Noun                 1454          1          0        708          8
3   Part-of-Speech: Verb                 2057          0          0        112          2
4   Part-of-Speech: Particle             1798          0          1        372          0
5   Part-of-Speech: Other                1301          0          1        861          8
6   Punctuation marks                    2072          0          0         93          6
7   Gender                                970         10          0       1137         54
8   Number                                970         10          0       1122         69
9   Person                               1940          8          4        177         42
10  Inflectional Morphology               942          9          1       1205         14
11  Case or Mood                         1457         12          8        602         92
12  Case and Mood Marks                   987          9          1        779        395
13  Definiteness                         1425         18          0        725          3
14  Voice                                2049          8          0        105          9
15  Emphasized and Non-emphasized        2049          8          0        114          0
16  Transitivity                         2049          8          0        114          0
17  Rational                             1340          5          0        695        131
18  Declension and Conjugation           1085          1          1       1080          4
19  Unaugmented and Augmented            1344          8          3        795         21
20  Number of Root Letters               1398          3          4        765          1
21  Verb Root                            2058          0          0        112          1
22  Noun Finals                          1500          6          0        656          9

For all confusion matrices in this figure (row = reference; col = test)

Figure 9.10 Confusion matrices for the CCA test sample

#   Category                         (-,-) O2   (-,C) O4   (C,-) O3   (C,C) O1   (C,W) O5
1   Main Part-of-Speech                     0          0         11       1903         28
2   Part-of-Speech: Noun                 1438          2          2        235        265
3   Part-of-Speech: Verb                 1681          0          1        260          0
4   Part-of-Speech: Particle             1422          4          9        447         60
5   Part-of-Speech: Other                1270          9         27        573         63
6   Punctuation marks                    1942          0          0          0          0
7   Gender                                769         91         23        960         99
8   Number                                768         91         23        768        292
9   Person                               1312         47         21        519         43
10  Inflectional Morphology               522         41         59       1196        124
11  Case or Mood                         1094        370          2        454         22
12  Case and Mood Marks                   533         34         56        909        410
13  Definiteness                         1435         13          0        437         57
14  Voice                                1682          0          0        233         27
15  Emphasized and Non-emphasized        1682          0          0        259          1
16  Transitivity                         1682          2          0        254          4
17  Rational                             1175          9          0        657        101
18  Declension and Conjugation           1179          2          1        571        189
19  Unaugmented and Augmented            1300          5          8        549         80
20  Number of Root Letters               1298          5          0        639          0
21  Verb Root                            1687          0          0        255          0
22  Noun Finals                          1440        121          0        372          9

For all confusion matrices in this figure (row = reference; col = test)

Figure 9.11 Confusion matrices for the Qur’an – chapter 29 test sample

The SALMA – Tagger was evaluated using two samples of text documents: chapter 29 of the Qur’an and a sample from the CCA. The outputs of analysing the two samples were evaluated using the SALMA – Gold Standard. The confusion matrix of each morphological feature category was generated. Then the four accuracy metrics were computed. The confusion matrices of the morphological feature categories of the two test texts are shown in figures 9.10 and 9.11. The accuracy metrics are shown in tables 9.1 and 9.2. The figures of the evaluation metrics are shown in figures 9.12 and 9.13. The results are discussed in the next section 9.8. Found P represents the positive predictions made by the SALMA – Tagger where it gave a tag other than ‘-’ at the expected position. Actual P represents the positive cases in the gold standard.
Found N represents the non-applicable predictions made by the SALMA – Tagger where it gave the tag ‘-’. Actual N represents the non-applicable cases in the gold standard tagged by ‘-’.

Table 9.1 Accuracy metrics for evaluating the CCA test sample

#   Category                      Found (P)  Actual (P)  Found (N)  Actual (N)  Accuracy  Recall    Precision  F1-score
1   Main Part-of-Speech                2170        2171          0           0    99.95%   99.95%     99.95%    99.95%
2   Noun                                708         717       1454        1455    99.59%   98.88%     98.74%    98.81%
3   Verb                                112         114       2057        2057    99.91%   98.25%     98.25%    98.25%
4   Particle                            372         372       1798        1798    99.95%   99.73%    100.00%    99.87%
5   Other                               861         869       1301        1301    99.59%   98.97%     99.08%    99.02%
6   Punctuations                         93          99       2072        2072    99.72%   93.94%     93.94%    93.94%
7   Gender                             1137        1201        970         980    97.05%   95.47%     94.67%    95.07%
8   Number                             1122        1201        970         980    96.36%   94.21%     93.42%    93.81%
9   Person                              177         227       1940        1948    97.51%   79.37%     77.97%    78.67%
10  Inflectional Morphology            1205        1228        942         951    98.89%   98.77%     98.13%    98.45%
11  Case or Mood                        602         706       1457        1469    94.84%   85.76%     85.27%    85.51%
12  Case and Mood Marks                 779        1183        987         996    81.35%   66.30%     65.85%    66.07%
13  Definiteness                        725         746       1425        1443    99.03%   99.59%     97.19%    98.37%
14  Voice                               105         122       2049        2057    99.22%   92.11%     86.07%    88.98%
15  Emphasis                            114         122       2049        2057    99.63%  100.00%     93.44%    96.61%
16  Transitivity                        114         122       2049        2057    99.63%  100.00%     93.44%    96.61%
17  Rational                            695         831       1340        1345    93.74%   84.14%     83.63%    83.89%
18  Declension and Conjugation         1080        1085       1085        1086    99.72%   99.54%     99.54%    99.54%
19  Unaugmented and Augmented           795         824       1344        1352    98.53%   97.07%     96.48%    96.77%
20  Number of Root Letters              765         769       1398        1401    99.63%   99.35%     99.48%    99.42%
21  Verb Root                           112         113       2058        2058    99.95%   99.12%     99.12%    99.12%
22  Noun Finals                         656         671       1500        1506    99.31%   98.65%     97.76%    98.20%

Table 9.2 Accuracy metrics for evaluating the Qur’an – Chapter 29 test sample

#   Category                      Found (P)  Actual (P)  Found (N)  Actual (N)  Accuracy  Recall    Precision  F1-score
1   Main Part-of-Speech                1903        1931          0           0    97.99%   97.99%     98.55%    98.27%
2   Noun                                235         502       1438        1440    86.15%   46.81%     46.81%    46.81%
3   Verb                                260         260       1681        1681    99.95%   99.62%    100.00%    99.81%
4   Particle                            447         511       1422        1426    96.24%   86.63%     87.48%    87.05%
5   Other                               573         645       1270        1279    94.90%   86.43%     88.84%    87.61%
6   Punctuations                          0           0       1942        1942   100.00%    0.00%      0.00%     0.00%
7   Gender                              960        1150        769         860    89.03%   88.72%     83.48%    86.02%
8   Number                              768        1151        768         859    79.09%   70.91%     66.72%    68.76%
9   Person                              519         609       1312        1359    94.28%   89.02%     85.22%    87.08%
10  Inflectional Morphology            1196        1361        522         563    88.47%   86.73%     87.88%    87.30%
11  Case or Mood                        454         846       1094        1464    79.71%   94.98%     53.66%    68.58%
12  Case and Mood Marks                 909        1353        533         567    74.25%   66.11%     67.18%    66.64%
13  Definiteness                        437         507       1435        1448    96.40%   88.46%     86.19%    87.31%
14  Voice                               233         260       1682        1682    98.61%   89.62%     89.62%    89.62%
15  Emphasis                            259         260       1682        1682    99.95%   99.62%     99.62%    99.62%
16  Transitivity                        254         260       1682        1684    99.69%   98.45%     97.69%    98.07%
17  Rational                            657         767       1175        1184    94.34%   86.68%     85.66%    86.16%
18  Declension and Conjugation          571         762       1179        1181    90.11%   75.03%     74.93%    74.98%
19  Unaugmented and Augmented           549         634       1300        1305    95.21%   86.19%     86.59%    86.39%
20  Number of Root Letters              639         644       1298        1303    99.74%  100.00%     99.22%    99.61%
21  Verb Root                           255         255       1687        1687   100.00%  100.00%    100.00%   100.00%
22  Noun Finals                         372         502       1440        1561    93.31%   97.64%     74.10%    84.26%

Figure 9.12 Accuracy metrics for evaluating the CCA test sample (bar chart of the accuracy, recall, precision and F1-score values of Table 9.1, one group of bars per category, on a scale from 0% to 100%)

Figure 9.13 Accuracy metrics for evaluating the Qur’an – Chapter 29 test sample (bar chart of the accuracy, recall, precision and F1-score values of Table 9.2, one group of bars per category, on a scale from 0% to 100%)

9.8 Discussion of Results

The results of evaluating the SALMA – Tagger on two different text genres, the MSA text from the CCA and the Classical Arabic text from the Qur’an, showed that the SALMA – Tagger can process different text types, domains and genres of both vowelized and non-vowelized Arabic text. The SALMA – Tagger can be used to POS-tag Arabic text corpora and to provide detailed fine-grained analysis for each morpheme of the corpus words. The SALMA – Tagger divides the analyzed word into 5 parts (i.e.
proclitics, prefixes, stem, suffixes and enclitics) and gives each part a detailed morphological feature tag (SALMA – Tag), or possibly multiple tags if a part has multiple clitics or affixes. Each SALMA – Tag consists of 22 morphological feature categories that encode fine-grained morphological information about each morpheme of the analyzed words.

The evaluation of the SALMA – Tagger using MSA text showed better overall results than the evaluation using the Qur’an text. The measure of accuracy is “exact match”. The exact-match rate, i.e. the correct prediction of all 22 features of a morpheme’s whole tag, is 71.21% for the CCA test sample and 53.5% for the Qur’an – chapter 29 test sample, but some of the errors were very minor, such as replacing one ‘?’ by ‘-’. This shows that the Qur’an text has a more complex morphological structure than the MSA text. These complex morphological structures call for future work that investigates the differences between the two genres.

Since the SALMA – Tagger has no disambiguation facility, and the best-match analyses were selected manually for the purpose of evaluation, the accuracy results achieved represent the highest accuracy scores that the SALMA – Tagger can achieve in predicting the attribute values of the morphological feature categories. The accuracy scores of the part-of-speech tagging systems surveyed in section 2.4.1, as reported by their developers, range from 91% for the AMT tagger by Alqrainy (2008) to 97% for the HMM part-of-speech tagger for Arabic developed by Al-Shamsi and Guessoum (2006). Errors of a disambiguation tool, which will be added to the SALMA – Tagger as future work, will decrease the overall accuracy results by between 3% and 9%.

The focus of this evaluation is to show the applicability of the SALMA – Tagger in distinguishing the fine-grained morphological features of the Arabic text corpus words. The evaluation shows which morphological features the SALMA – Tagger can distinguish.
It also shows the accuracy rate for each morphological feature category. The purpose of this evaluation is to inform users of the SALMA – Tagger, or of parts of it, about its capability to distinguish the fine-grained morphological features of words. For instance, anaphora resolution applications can benefit from the main part of speech, gender, number, person and rational outputs of the SALMA – Tagger to maintain agreement of these features between verbs and pronouns in sentences. Limitations, examples of hard cases and methods for improvement are discussed for each morphological feature category.

9.8.1 Results of Predicting the Value of Main Part of Speech

The results show high accuracy in predicting the main part of speech of the analyzed morphemes: 99.95% of the CCA sample morphemes and 97.99% of the Qur’an sample morphemes were correctly predicted (tables 9.1 and 9.2). The prediction of the main part of speech of the morphemes depends on both: (i) maintaining agreement between the word’s affixes and clitics, where the clitics and affixes dictionaries contain the part-of-speech information that matches them, see section 8.3.1.5; and (ii) the patterns dictionaries, where the main part of speech information is encoded within the SALMA – Tag given to each pattern; see section 8.3.3.1. The clitics and affixes dictionaries are used in the prediction of the main part of speech for all morphemes of the analyzed word, while the patterns dictionary is mainly used to predict the main part of speech of the stem morpheme.

9.8.2 Results of Predicting the Value of the Part-of-Speech Subcategory of Noun

The prediction of the part-of-speech subcategory of noun scored an accuracy of 99.59% for the CCA text, while it scored a lower accuracy of 86.15% for the Qur’an test sample. The prediction of the part-of-speech subcategory of noun was not easy for the Qur’an text sample due to the nature of Quranic Arabic.
The Qur’an text sample involves repeated use of old personal names such as fir‘awn ‘firaun’ and place names such as ṯamūd ‘thamud’, while the list of proper nouns used by the SALMA – Tagger was constructed from an MSA newswire corpus; see section 8.3.2.4. The MSA text sample contains many relative nouns such as aṯ-ṯaqāfī ‘cultural’ and gerunds of profession such as al-waṭaniyyah ‘nationality’, which are repeated frequently in the CCA text sample. These two types of nouns are frequently used in MSA text. They are formed by adding the relative yā’ and tā’ marbūtah as suffixes. Therefore, the rule for predicting these attributes is simple. The Qur’an sample does not contain any examples of these two noun types.

9.8.3 Results of Predicting the Value of the Part-of-Speech Subcategories of Verb and Particle

High accuracy was scored for predicting the part-of-speech subcategory of verb: 99.91% for the CCA text sample and 99.95% for the Qur’an text sample. The prediction of verbs depends on the analysis of the prefixes and suffixes and the matching of the stem morpheme with a patterns dictionary entry. High accuracy was scored for the part-of-speech subcategory of particle as well: 99.95% for the CCA text sample and 96.24% for the Qur’an text sample. Most particles are stored in the function words list; see section 8.3.2.3. However, some particles in the Qur’an text sample are complex particles which consist of more than one morpheme, such as ’a-walam ‘and not’, which consists of three morphemes. Such complex particles need to be included in the function words list to improve the accuracy of predicting particles.

9.8.4 Results of Predicting the Value of the Part-of-Speech Subcategory of Others (Residuals)

The accuracy of predicting the part-of-speech subcategory of others (residuals) scored 99.59% for the CCA test sample and 94.90% for the Qur’an test sample.
The residuals are part of the clitics and affixes. The prediction of these affixes depends on matching the morphemes of the analyzed word with the entries of the clitics and affixes dictionaries. The errors made in the Qur’an sample are due to the use of ambiguous enclitics which can be classified into different categories, such as nna and n, which can be a feminine suffixed pronoun or emphatic nūn. The CCA text sample contains numbers, currency and Arabized words which belong to the ‘others’ category, but the SALMA – Tag Set does not include them yet. Section 9.10 (below) discusses the extension of the SALMA – Tag Set to include these attributes.

9.8.5 Results of Predicting the Value of Punctuations

The Qur’an test sample has no punctuation; therefore predicting that the punctuation category is not applicable for the morphemes of the analyzed words scored 100% accuracy. The CCA test sample contains punctuation. The accuracy of prediction was 99.72%. The prediction of punctuation is done in the tokenization step; see section 8.3.1. The MSA text uses special characters which cannot be classified as a word or a morpheme and are not part of the standard punctuation described in section 6.2.6. These special characters, such as the slash ‘/’, are given a new tag ‘o’ which represents other punctuation marks.

9.8.6 Results of Predicting the Value of the Morphological Features of Gender, Number and Person

The prediction of the morphological features of gender, number and person scored 97.05%, 96.36% and 97.51% respectively for the CCA test sample, and 89.03%, 79.09% and 94.28% respectively for the Qur’an test sample. The three morphological features are related to each other and share the same prediction methodology. Nouns have the morphological features of gender and number but not person, except for pronouns. Verbs have all three features. The prediction of the morphological features of gender and number for nouns depends on suffix analysis.
Feminine and singular words have the suffix ta’ marbutah. Dual words are marked by ān or ayn. Masculine sound plural words have the suffix wn or ayn, while feminine sound plural words have the suffix āt. Broken plural words are searched for in the broken plural list, and the gender feature is investigated on the retrieved singular form of the matched word. For example, the gender of ’anḥā’ “directions; regions”, which is a broken plural of nāḥiyat “direction; region”, is feminine because the feminine singular suffix ta’ marbutah appears on the singular form of the analyzed word. However, if the word is a broken plural that is not found in the broken plural list, then the assigned tags ‘ms-’ (masculine, singular and not applicable) are wrong.

The prediction of the three morphological features for verbs depends on the combinations of prefixes and suffixed pronouns attached to the end of the verbs. Subject suffix-pronouns and genitive suffix-pronouns describe the reference person of the verb and agree with the number and gender of the doer of the verb; see section 8.4.1. False predictions of the morphological features of gender, number and person of verbs occur because some verbs are ambiguous. Verbs such as tarbiṭu “you are tying / she is tying” can be masculine, singular and second person, or feminine, singular and third person. The SALMA – Tagger assigns the tags ‘xs?’ (of common gender, singular, applicable feature) to this kind of verb. The mismatch arises when comparing against the gold standard, where these features are resolved according to the context of the words. These wrong predictions can be solved by applying contextual rules that define the agreement between the verb and its doer (the subject of the sentence). Contextual rules are also needed to disambiguate the number of verbs where singular verb forms are followed by plural subjects, such as in the phrase wa yurawwiǧu hā’ulā’i “and those who are spreading”: the verb yurawwiǧu “spreading” is in singular form while the subject hā’ulā’i “those” is a plural demonstrative pronoun.

9.8.7 Results of Predicting the Value of the Morphological Features of Inflectional Morphology, Case or Mood, and Case and Mood Marks

The prediction accuracy of the morphological features of inflectional morphology, case or mood, and case and mood marks scored 98.89%, 94.84% and 81.35% respectively for the CCA test sample and 88.47%, 79.71% and 74.25% respectively for the Qur’an test sample. The prediction of the morphological feature of inflectional morphology for verbs depends on the part-of-speech subcategory of verbs and on the analysis of suffixes for imperfect verbs, to determine whether the verb is conjugated or invariable. The disambiguation of nouns into declined or invariable depends on applying many rules that deal with the part-of-speech subcategory of nouns, noun finals and patterns. These rules classify the declined nouns into fully declined or non-declined.

The prediction of the morphological feature of case or mood depends on the result of the prediction of the morphological feature of inflectional morphology, where a declined noun has case (i.e. nominative, accusative or genitive) and a conjugated verb has mood (i.e. indicative, subjunctive, or imperative/jussive), while case and mood are not applicable to invariable nouns and verbs. The prediction of a noun’s case investigates the proclitics attached to the beginning of the noun which might affect the case and its syntactic mark, such as prepositions and jurative particles. Prediction rules also investigate the dual and plural suffixes, which change according to the case of the noun. For example, wn is a masculine plural suffix of nominative case, while ayn is a masculine plural or dual suffix of accusative or genitive case.
The five nouns ’ab ‘father’, ’aẖ ‘brother’, ḥamun ‘father-in-law’, fū (fam) ‘mouth’, and ḏū ‘possessor; owner’ change their suffix according to the context: the suffix و waw indicates nominative case, ا ’alif indicates accusative case and ي yā’ indicates genitive case. Rules for predicting the case or mood, and case and mood marks, for singular and broken plural nouns depend on the short vowel (i.e. the syntactic mark) that appears at the end of the word. The absence of short vowels, and of contextual rules that deal with nouns according to their context (i.e. subject or object), increases the potential for wrong prediction, especially for singular and broken plural nouns. Moreover, determining the morpheme that carries the syntactic mark of the word is not an easy task. For example, the word bi-’aǧniḥatihi ‘by its wings’ has four morphemes: the preposition bi, the stem morpheme ’aǧniḥa, the feminine suffix ti, and the suffixed pronoun hi. The case mark, which is always considered by traditional Arabic grammar to be at the end of the word, is carried in this example by the third morpheme, the feminine suffix ti, rather than by the final morpheme, the suffixed pronoun hi.

The prediction of the morphological features of case or mood, and case and mood marks, for verbs depends on the previous prediction made for the morphological feature of inflectional morphology, which classifies verbs into conjugated or invariable. Only a conjugated verb has mood. The prediction rules for mood depend on the part-of-speech subcategory of verb, where mood is applicable to imperfect verbs and not applicable to perfect and imperative verbs. The rules also analyze the suffixes of the imperfect verb to determine the applicability of mood. Imperfect verbs that contain the third person feminine suffix pronoun ن nūn are invariable verbs which are marked by sukūn, such as yaktubna ‘they (fem.) write’.
Those containing the emphatic nūn suffix are invariable verbs which are marked by fatḥah, such as falaya‘lamanna ‘and Allāh will surely make evident’. The final rule of prediction depends on the short vowel which appears on the morpheme that carries the mood mark, where ḍammah indicates indicative mood, fatḥah indicates subjunctive mood, and sukūn indicates imperative or jussive mood. The absence of short vowels, and of contextual rules that deal with words according to their context, increases the potential for wrong prediction, especially for subjunctive and imperative or jussive verbs, which are always preceded by subjunctive-governing particles and jussive-governing particles respectively.

The results show the interdependency of these three morphological feature categories. The morphological feature category of case and mood marks depends on both case or mood, and inflectional morphology. Case or mood depends on inflectional morphology. The prediction errors for inflectional morphology are propagated to the case or mood category, and then to case and mood marks. Therefore, accuracy rates decreased in the direction of error propagation.

9.8.8 Results of Predicting the Value of the Morphological Feature of Definiteness

The accuracy of predicting the morphological feature of definiteness was high at 99.03% and 96.40% for the CCA test sample and the Qur’an test sample respectively. The prediction of the morphological feature of definiteness depends on the presence of the definite article al- as a proclitic of the analyzed noun. If the noun contains the definite article in its proclitics then the noun is definite; otherwise it is an indefinite noun. The morphological feature of definiteness is not applicable to verbs. Errors in classifying the word as noun or verb will be propagated to this category, especially for the indefinite prediction.
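The definiteness rule described above is simple enough to sketch in a few lines. The following is a minimal illustration only, not the SALMA – Tagger code: the `Analysis` container, its field names and the returned labels are hypothetical names introduced here, and the sketch assumes the proclitics have already been segmented by an earlier step.

```python
from dataclasses import dataclass, field
from typing import List

DEFINITE_ARTICLE = "\u0627\u0644"  # the Arabic definite article al- (alif + lam)


@dataclass
class Analysis:
    """Hypothetical container for one analyzed word (not the SALMA data model)."""
    main_pos: str                                   # e.g. "noun", "verb", "particle"
    proclitics: List[str] = field(default_factory=list)


def predict_definiteness(analysis: Analysis) -> str:
    """Apply the rule from the text: definite if the article is among the proclitics."""
    if analysis.main_pos != "noun":
        return "not applicable"      # the feature is only predicted for nouns here
    if DEFINITE_ARTICLE in analysis.proclitics:
        return "definite"
    return "indefinite"              # default when no article is found
```

Because "indefinite" is the fall-through value, a verb wrongly classified as a noun would be labelled indefinite, which is one way the part-of-speech errors mentioned above can propagate into this category.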
9.8.9 Results of Predicting the Value of the Morphological Feature of Voice

The prediction of the morphological feature of voice achieved a high accuracy score of 99.22% and 98.61% for the CCA test sample and the Qur’an test sample respectively. The morphological feature of voice is only applicable to verbs. The prediction rules classify verbs into active verbs or passive verbs depending on the short vowel appearing on the first letter of the verb after removing proclitics. If fatḥah appears on the verb’s first letter, then it is classified as an active voice verb. If ḍammah appears on the verb’s first letter, then it is classified as a passive voice verb. Errors can happen in cases where ḍammah appears on the first letter of an active voice verb, such as yurīdūna ‘they want’, which matches the pattern yuf‘ilūn. The passive verb form of this example is yurādūna ‘they are wanted to be’, which matches the pattern yuf‘alūn. The difference between the two patterns is the short vowel that appears on the second root radical: kasrah for active voice and fatḥah for passive voice, for all verbs generated from these patterns. The patterns dictionary used by the SALMA – Tagger distinguishes between active voice and passive voice patterns. Applying prediction rules for the morphological feature of voice that depend on patterns rather than on the short vowel of the first letter of the verb will increase the prediction accuracy.

9.8.10 Results of Predicting the Value of the Morphological Feature of Emphasized and Non-Emphasized

The prediction accuracy of the morphological feature of emphasized and non-emphasized was high at 99.63% and 99.95% for the CCA test sample and the Qur’an test sample respectively. The morphological feature of emphasized and non-emphasized is applicable only to verbs.
Prediction rules for classifying verbs into emphasized or non-emphasized depend on the part-of-speech subcategory of the verb. Perfect verbs are always non-emphasized while imperfect and imperative verbs can be emphasized. The prediction rules also investigate the suffixes of the verb. Emphasized verbs contain the emphatic nūn as a suffix.

9.8.11 Results of Predicting the Value of the Morphological Feature of Transitivity

The prediction accuracy of the morphological feature of transitivity was high at 99.63% and 99.69% for the CCA test sample and the Qur’an test sample respectively. The morphological feature of transitivity is applicable only to verbs. The prediction rules of the morphological feature of transitivity classify verbs into: intransitive verbs, which complete their meaning without the need for an object; singly transitive verbs, which need one object to complete their meaning; doubly transitive verbs, which need two objects to complete their meaning; or triply transitive verbs, which need three objects to complete their meaning. The prediction rules depend on matching the analyzed verb against the verbs stored in the lists of doubly transitive and triply transitive verbs. The singly transitive verb attribute is the default value of the morphological feature of transitivity. The absence of contextual rules for predicting the attributes of the morphological feature of transitivity increases the potential for making prediction mistakes. On the other hand, suffix pronoun analysis can capture some attributes of this morphological feature.

9.8.12 Results of Predicting the Value of the Morphological Feature of Rational

The prediction of the morphological feature of rational scored an accuracy of 93.74% for the CCA test sample and an accuracy of 94.34% for the Qur’an test sample. The morphological feature of rational is applicable to both nouns and verbs.
The rationality of the subject (or the doer) of the verb determines the rationality attribute of the analyzed verb. The prediction rules for the morphological feature of rational assign default values to the analyzed words depending on their part-of-speech subcategory; see section 8.4.2. Proper nouns are classified as rational if they are found in the personal proper nouns list, and as irrational if they are found in the locations or organizations proper nouns lists. Demonstrative pronouns are classified according to their use as rational or irrational. Qur’an verbs are assigned a default value of rational, as most of the Qur’an verbs represent dialogue between God and people. Classifying words into rational or irrational depends on the semantics of the word itself and its context, such that agreement is maintained between sentence parts, for example verb-subject agreement and adjective-descriptive noun agreement. A comprehensive dictionary which includes rationality information for each dictionary entry is needed to determine the correct attribute value of rational for nouns.

9.8.13 Results of Predicting the Value of the Morphological Feature of Declension and Conjugation

The prediction of the morphological feature of declension and conjugation was highly accurate at 99.72% for the CCA test sample and less accurate at 90.11% for the Qur’an test sample. The morphological feature of declension and conjugation is applicable to nouns, verbs and particles. The prediction rules of the values of declension and conjugation of nouns depend on the part-of-speech subcategories. The rules for predicting the values of declension and conjugation of verbs depend on searching four lists of verbs: the non-conjugated/restricted-to-the-perfect verb list; the non-conjugated/restricted-to-the-imperfect verb list; the non-conjugated/restricted-to-the-imperative verb list; and the partially conjugated verb list.
The default value of the morphological feature of declension and conjugation for verbs is fully conjugated verb. Including the declension and conjugation information in the Arabic dictionary will increase the correct prediction of attributes for this morphological feature.

9.8.14 Results of Predicting the Value of the Morphological Features of Unaugmented and Augmented, Number of Root Letters, and Verb Roots

The prediction accuracy of the morphological features of unaugmented and augmented, number of root letters, and verb roots was 98.53%, 99.63% and 99.95% for the CCA test sample and 95.21%, 99.74% and 100% for the Qur’an test sample respectively. The morphological features of unaugmented and augmented, and number of root letters are applicable to both nouns and verbs, while the morphological feature of verb roots only applies to verbs. The rules for predicting the three morphological features mainly depend on the root of the analyzed word. The prediction rule of unaugmented and augmented attributes subtracts the length of the root from the length of the analyzed word. The prediction rule of the attributes of the number of root letters depends on the length of the root. The prediction rules of the morphological feature of verb roots depend on the nature of the root letters: whether they are consonants, contain hamzah, or contain one or two vowels.

The prediction errors are higher for the morphological feature of unaugmented and augmented due to ambiguous word boundaries. In some cases of non-vowelized text, tanwīn fatḥ (اً) appears as ’alif, which will be counted as an augmented letter. In other cases, vowels might be deleted from the word. Therefore, the rules for counting the letters added to the word need to know whether a vowel has been deleted or not. For example, the verb yaǧidu ‘he finds’ has the root w-ǧ-d and is augmented by one letter, yā’, representing the imperfect prefix.
The first root letter wāw is a vowel and is deleted from the word.

9.8.15 Results of Predicting the Value of the Morphological Feature of Noun Finals

The prediction of the morphological feature of noun finals was highly accurate at 99.31% for the CCA test sample and lower at 93.31% for the Qur’an test sample. The rules for predicting the value of the morphological feature of noun finals mainly depend on the long stem and the root of the analyzed word. The rules check the final letters of the long stem against a set of conditions that classify nouns into 6 categories; see section 8.4.3. Knowing the value of the noun finals feature helps in specifying other features such as the morphological features of inflectional morphology and case and mood marks. Case marks cannot appear on the last letter of nouns with shortened ending, and only fatḥah, the mark of accusative case, appears on the last letter of nouns with curtailed ending.

9.8.16 More Conclusions

In conclusion, the SALMA – Tagger was evaluated on two text samples from different genres: chapter 29 of the Qur’an representing Classical Arabic, and a sample from the CCA representing Modern Standard Arabic. The focus of this evaluation was to report on the applicability of the SALMA – Tagger in distinguishing the fine-grained morphological features of the Arabic text corpus, by measuring the accuracy of each of the 22 morphological feature categories represented by the SALMA – Tag for each morpheme in the two samples. The evaluation used the SALMA – Gold Standard. One advantage of carrying out this type of evaluation is that it reports, for users who will use or reuse the SALMA – Tagger or parts of it, the accuracy of predicting the attributes of the fine-grained morphological features. Users can depend on this evaluation to decide which parts of the SALMA – Tagger can be used directly.
Another advantage directly addresses our interest in developing an Arabic morphological analyzer that is able to analyze Arabic text corpora by providing fine-grain analysis for each word. Fine-grain analysis of the Arabic word involves dividing the word into five parts and giving each part a detailed morphological features tag, or possibly multiple tags if the part has multiple clitics or affixes.

The prediction accuracy was high for 15 morphological features: the morphological features of main part-of-speech; part-of-speech subcategory of verb; part-of-speech subcategory of particle; part-of-speech subcategory of other (residual); part-of-speech subcategory of punctuation; definiteness; voice; emphasized and non-emphasized; transitivity; declension and conjugation; unaugmented and augmented; number of root letters; verb roots; and noun finals. The accuracy for predicting the attributes of these 15 morphological features was between 98.53% and 100% for the CCA test sample and between 90.11% and 100% for the Qur’an test sample. The morphological features of part-of-speech subcategory of noun, gender, number, person, inflectional morphology, case or mood, case and mood marks, and rational scored lower prediction accuracy, at 81.35% - 97.51% for the CCA test sample and 74.25% - 89.03% for the Qur’an test sample. The next section (9.9) discusses the limitations and the factors that affected the prediction accuracy of the morphological features, and suggests solutions that might improve this accuracy.

9.9 Limitations and improvements

The SALMA – Tagger achieved high prediction accuracy for 15 morphological features, and lower accuracy for 7 morphological features.
The high prediction accuracy was due to the detailed analysis of words into morphemes and the classification of these morphemes into distinctive classes, which helped in predicting the attributes of these morphological feature categories. The reuse of the predicted attributes of some categories helped in predicting the correct attribute value of other categories. Providing the SALMA – Tagger with lists of function words, broken plurals, named entities, doubly transitive and triply transitive verbs, and conjugated and non-conjugated verbs was the basis for predicting the attributes of many morphological feature categories. The SALMA – ABCLexicon is mainly used to extract the correct root of the analyzed words. The root information represents the basis for predicting the correct attribute of some morphological features. Finally, the patterns dictionary and the pattern matching algorithms were used in the prediction rules of most of the morphological feature categories.

The lower accuracy achieved with the other 7 morphological feature categories was due to the absence of contextual rules in the SALMA – Tagger, such that it treats words out of their context. The absence of short vowels in the text, especially in MSA text, makes the prediction of the attributes of some morphological features difficult. Moreover, the interdependency between some morphological features, such as inflectional morphology, case or mood, and case and mood marks, decreases the accuracy of the dependent features by propagating errors from one feature to another. Finally, prediction errors increase as the number of attributes of a morphological feature increases.

To improve the accuracy of predicting the attributes of the morphological feature categories, contextual rules can be implemented as a second pass.
The contextual rules will also help in reducing the number of candidate analyses of the analyzed words by excluding those analyses that do not satisfy certain contextual rules. Some morphological feature categories such as rational depend on the semantic nature of the analyzed word itself. Providing rationality information for Arabic dictionary entries and reusing this information in morphological analyzers will increase the accuracy of prediction. Moreover, updating the dictionaries which are used by the SALMA – Tagger by increasing their coverage will increase the prediction accuracy.

9.10 Extension of the SALMA – Tag Set

The SALMA – Tag Set is a general-purpose fine-grain tag set. The aim of developing this tag set is that it should be used as the standard for part-of-speech tagging software to annotate corpora with more detailed morphological information for each word. The SALMA – Tag Set was evaluated by applying it to two text samples of different genres: chapter 29 of the Qur’an representing Classical Arabic, and a sample of the CCA representing Modern Standard Arabic. Both samples and their annotations were used in the SALMA – Gold Standard. The application of the SALMA – Tag Set to the Qur’an text sample did not introduce any reason for extending the tag set. However, the CCA text sample introduced some examples of tokens that appear in MSA text. These examples include numbers (digits), currency, non-Arabic words, borrowed (foreign) words, dates and special characters.

Extensions of the SALMA – Tag Set were made to two morphological feature categories: others (residuals) and punctuation. The morphological feature of others (residuals) was extended to include new attributes for numbers (digits), currency, non-Arabic words, borrowed (foreign) words and dates. Table 9.3 shows the new attributes added to the part-of-speech subcategory of others (residuals).
The part-of-speech subcategory of punctuation marks was extended by adding an attribute for special characters that are used as punctuation marks. These special characters appear in MSA text because word-editing software makes it easy to type special characters within text, and because of a lack of knowledge about using standard punctuation in Arabic text. Table 9.4 shows the attribute added to the part-of-speech subcategory of punctuation marks.

Borrowed (foreign) words are words borrowed from other languages which have become part of the language because they are widely used by Arabic speakers. They also appear in text in transliterated form using Arabic letters. These words are used within the sentence like normal Arabic words. They accept inflectional affixes and change their form according to the context. Therefore, the SALMA – Tag Set treats them as Arabic words by classifying them within the main part-of-speech category attributes and assigning the morphological feature attributes that are applicable to them. They are given the tag ‘x’ in the fifth position of the tag string to distinguish them as borrowed (foreign) words. Figure 9.14 shows an example of tagging a borrowed (foreign) word.

Table 9.3 Extended attributes of the Part-of-speech subcategories of Other (Residuals) and their tags at position 5

Position 5. Feature name: Part-of-Speech: Other, ’aqsām al-kalām al-far’iyyat (’uẖrā)

Attribute | Transliteration | Examples | Tag
Number (digits) | raqam | (+325461) (-897,653) (0.986) (13x10^-3) (-1.2E2) (1.2e-2) | g
Currency | ‘umla | ($250) (£430) | c
Date | tārīẖ | (27/09/2011) (27.09.11) | e
Non-Arabic word | kalimat ḡayr ‘arabiyyah | windows, photoshop, games, download | w
Borrowed (foreign) word | kalima mu‘arrabah | kuzmūbūlītān ‘cosmopolitan’, stād ‘stadium’ | x

Table 9.4 Extended attributes of the Part-of-speech subcategories of Punctuation Marks and their tags at position 6

Position 6. Feature name: Punctuation Marks, ’aqsām al-kalām al-far’iyyat (‘alāmāt at-tarqīm)

Attribute | Transliteration | Tag
Other punctuations | ‘alāmāt ’uẖrā | o

Word: kuzmūbūlītān ‘cosmopolitan’
SALMA – Tag: nj--x-xb----i---hns--s

Figure 9.14 Example of tagging a borrowed (foreign) word

9.11 Chapter Summary

This chapter discussed the evaluation of the SALMA – Tagger. The evaluation methodologies for morphological analyzers are not standardized yet. The first part of the chapter discussed the development of agreed standards for evaluating morphological analyzers for Arabic text, based on our experiences and participation in two community-based evaluation contests: the ALECSO/KACST initiative for developing and evaluating morphological analyzers, and the MorphoChallenge 2009 competition. The guideline recommendations, evaluation specifications and procedures, and evaluation metrics were reused to generate a global standard for evaluating morphological analyzers for Arabic text. The developed standards were applied to evaluate the SALMA – Tagger.

The developed evaluation standards depend on using gold standards for evaluating morphological analyzers for Arabic text. A reusable general-purpose gold standard (the SALMA – Gold Standard) was constructed to evaluate various morphological analyzers for Arabic text and to allow comparisons between the different analyzers. The SALMA – Gold Standard adheres to standards, and is enriched with fine-grained morphological information for each morpheme of the gold standard text samples. The detailed information is: the input word, its root, lemma, pattern, word type and the word’s morphemes.
For each of the word’s morphemes, the morpheme type is classified as proclitic, prefix, stem, suffix or enclitic, and a fine-grain SALMA – Tag, which encodes 22 morphological feature categories of each morpheme, is included. The SALMA – Gold Standard contains two text samples of about 1000 words each, representing two different text domains and genres of both vowelized and non-vowelized text: one taken from the Qur’an (chapter 29), representing Classical Arabic, and one from the CCA, representing Modern Standard Arabic.

The SALMA – Gold Standard is stored using different standard formats to allow wider reusability. XML technology allows storage of the gold standard in a machine-readable structured format. Tab-separated column files are widely used by researchers. They are used to store the gold standard following the MorphoChallenge 2009 recommendations for constructing gold standards. Other formats are used to display the information of the gold standard for end users. These formats include HTML files and the visual display of the gold standard in colour-coded format.

The SALMA – Gold Standard was used to evaluate the SALMA – Tagger. The evaluation focused on measuring the prediction accuracy of the 22 morphological features encoded in the SALMA – Tags for each of the gold standard’s text sample morphemes. The results show that 53.50% of the Qur’an text sample morphemes and 71.21% of the CCA text sample morphemes were correctly tagged using “exact match” against the gold standard’s morpheme tags, but some of the errors were very minor, such as replacing ‘?’ by ‘-’. The evaluation reported accuracy, recall, precision, F1-score and the confusion matrix for each morphological feature category. The individual category accuracy results are useful for users who will use or reuse the SALMA – Tagger or parts of it, to know in advance the prediction accuracy of the attributes of each morphological feature category.
Accuracy scores are high for 15 morphological feature categories, at about 98.53% - 100% for the CCA test sample and 90.11% - 100% for the Qur’an test sample. These categories are: the morphological feature of main part-of-speech; part-of-speech subcategory of verb; part-of-speech subcategory of particle; part-of-speech subcategory of other (residual); part-of-speech subcategory of punctuation; definiteness; voice; emphasized and non-emphasized; transitivity; declension and conjugation; unaugmented and augmented; number of root letters; verb roots; and noun finals. The other 7 morphological feature categories: part-of-speech subcategory of noun; gender; number; person; inflectional morphology; case or mood; case and mood marks; and rational, were less accurately predicted: 81.35% - 97.51% for the CCA test sample and 74.25% - 89.03% for the Qur’an test sample.

The absence of contextual rules, the absence of short vowels, the interdependency between some morphological features, and the number of attributes of a certain morphological category increase the potential for prediction errors of some morphological feature categories. To improve the accuracy of predicting the attributes of the morphological feature categories, contextual rules can be implemented as a second pass. Some morphological feature categories such as rational depend on the semantic nature of the analyzed word itself. Providing rationality information for Arabic dictionary entries and reusing this information in morphological analyzers will increase the accuracy of prediction. Moreover, updating the dictionaries which are used by the SALMA – Tagger by increasing their coverage will increase the prediction accuracy.

The SALMA – Gold Standard for evaluating Arabic morphological analyzers is an open-source resource that is available to download, for reuse in evaluation of other Arabic morphological analyzers.
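The difference between the “exact match” figure and the per-category accuracies summarized above can be illustrated with a short sketch. This is an assumption-laden illustration, not the evaluation code used in the thesis: it assumes each SALMA – Tag is a 22-character string with one character per feature category, and the function names are hypothetical. The example tag is the one shown in Figure 9.14.

```python
from typing import List, Sequence

TAG_WIDTH = 22  # one character per morphological feature category (assumed layout)


def _check(gold: Sequence[str], predicted: Sequence[str]) -> None:
    """Reject empty or misaligned inputs and tags of the wrong width."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    for tag in list(gold) + list(predicted):
        if len(tag) != TAG_WIDTH:
            raise ValueError(f"tag {tag!r} does not have {TAG_WIDTH} positions")


def exact_match_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of morphemes whose whole predicted tag equals the gold tag."""
    _check(gold, predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_position_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> List[float]:
    """Accuracy of each tag position (feature category) taken on its own."""
    _check(gold, predicted)
    correct = [0] * TAG_WIDTH
    for g, p in zip(gold, predicted):
        for i in range(TAG_WIDTH):
            correct[i] += g[i] == p[i]
    return [c / len(gold) for c in correct]


GOLD_TAG = "nj--x-xb----i---hns--s"              # tag from Figure 9.14
NOISY_TAG = GOLD_TAG[:18] + "?" + GOLD_TAG[19:]  # one position mispredicted

gold_tags = [GOLD_TAG, GOLD_TAG]
predicted_tags = [GOLD_TAG, NOISY_TAG]
exact = exact_match_accuracy(gold_tags, predicted_tags)
by_position = per_position_accuracy(gold_tags, predicted_tags)
```

A single wrong character makes the whole tag count as an error under exact match, while 21 of the 22 positions remain fully correct; this is why the exact-match percentages are much lower than most per-category accuracies.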
Chapter 10
Practical Applications of the SALMA – Tagger

This chapter is based on the following sections of published papers:
Section 2 is based on section 4 in (Sawalha and Atwell 2010b) and section 1 in (Sawalha and Atwell 2011a)
Section 3 is based on section 1 in (Sawalha and Atwell 2011b)

Chapter Summary

The SALMA – Tagger has been used in two important applications of Arabic text analytics: first, lemmatizing the 176-million-word Arabic Internet Corpus, and second, as corpus linguistic resources and tools for Arabic lexicography. This chapter will illustrate how the tools (the SALMA – Tagger and the SALMA – Lemmatizer and Stemmer), the resources (the SALMA – ABCLexicon and the Corpus of Traditional Arabic Lexicons), and the proposed standards (the SALMA – Tag Set) have been useful for advancing Arabic computational linguistic technologies.

10.1 Introduction

In this research, resources (the SALMA – ABCLexicon, Chapter 4), standards (the SALMA – Tag Set, Chapters 5, 6 and 7), and tools (the SALMA – Tagger, Chapters 8 and 9) were developed and evaluated. The main purpose in developing the resources, standards and tools is the morphosyntactic annotation of Arabic text with fine-grain morphosyntactic information. This chapter will investigate two applications of these resources, standards and tools: lemmatizing the 176-million-word Arabic Internet Corpus (AIC; see footnote 66) (Sawalha and Atwell 2011a), and as language engineering resources to construct the Arabic dictionary (Sawalha and Atwell 2011b).

The resources, standards and tools were evaluated on samples of Arabic text to measure their accuracy and applicability to text analytics tasks. However, the performance aspects of the SALMA – Tagger such as speed, memory and ability to perform the desired analysis tasks were not evaluated previously.
Applying the SALMA – Lemmatizer and Stemmer to lemmatize the 176-million-word Arabic Internet Corpus is a practical application through which to evaluate performance and investigate the challenges of applying the resources, standards and tools to real, large-scale data. The second application is a proposal about how these resources, standards and tools can be used as a language engineering toolkit for Arabic lexicography. This study reviews the resources and tools which are used in modern lexicography, and shows that the developed resources and standards constitute a toolkit for constructing Arabic bilingual and monolingual dictionaries. Section 10.2 discusses the application of lemmatizing the 176-million-word AIC. Section 10.3 discusses the resources and tools for Arabic lexicography.

10.2 Lemmatizing the 176-million-word Arabic Internet Corpus

The Arabic Internet Corpus is one of several large corpora collected for Translation Studies research at http://corpus.leeds.ac.uk/internet.html alongside Internet corpora for English, Chinese, French, German, Greek, Italian, Japanese, Polish, Portuguese, Russian and Spanish (Sharoff 2006). The Arabic Internet Corpus consists of about 176 million words (see footnote 67). Initially it consisted of raw text, with no further processing such as lemmatization or part-of-speech tagging. This section shows how the lemma and root were added for each word of the AIC.

66 Querying Arabic Corpora: http://smlc09.leeds.ac.uk/query-ar.html
67 The frequency list of the Arabic Internet Corpus: http://corpus.leeds.ac.uk/frqc/i-ar-forms.num

Arabic is a morphologically rich and highly inflectional language. Hundreds of words can be derived from the same root, and a lemma can appear in the text in many different forms due to the agglutination of clitics at the front and end of the word. Therefore, lemmatization and root extraction are necessary for search applications, to enable inflected forms of a word to be grouped together.
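The grouping that lemmatization and root extraction make possible can be shown with a small sketch. The rows below are hand-written illustrative examples in transliteration, not output of the SALMA – Lemmatizer and Stemmer, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str, str]  # (word form, lemma, root), all in transliteration

# Hand-written illustrative rows (assumed analyses, not corpus output).
ROWS: List[Row] = [
    ("al-kitābu", "kitāb", "k-t-b"),
    ("wa-kitābuhu", "kitāb", "k-t-b"),
    ("kutubun", "kitāb", "k-t-b"),
    ("kataba", "kataba", "k-t-b"),
    ("yaktubna", "kataba", "k-t-b"),
    ("madrasatun", "madrasa", "d-r-s"),
]


def group_forms(rows: List[Row], column: int) -> Dict[str, List[str]]:
    """Group surface forms by the value in `column` (1 = lemma, 2 = root)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[0])
    return dict(groups)


by_lemma = group_forms(ROWS, 1)  # a search for a lemma retrieves its inflected forms
by_root = group_forms(ROWS, 2)   # a search for a root retrieves all derived words
```

Grouping by lemma collects the inflected and cliticized forms of one word, while grouping by root is coarser and also brings together different words derived from the same root.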
We used the lemmatizing part of the SALMA – Tagger (see section 8.3.2) to annotate the Arabic Internet Corpus words at two levels, the lemma and the root, as shown in Figure 10.1. The SALMA – Lemmatizer and Stemmer is relatively slow. In initial tests it processed 7 words per second, because it deals with orthographic issues, spell-checking of the word’s letters, short vowels and diacritics, and the large dictionaries it uses to perform its task. The estimated execution time for lemmatizing the full Arabic Internet Corpus was roughly 300 days on an ordinary uniprocessor machine.

To reduce the processing time of the whole task, we used the power of HPC (High Performance Computing). The NGS (National Grid Service; see footnote 68) aims to enable coherent electronic access for UK researchers to all computational and data-based resources and facilities required to carry out their research, independent of resource or researcher location. The computational power of the NGS was used to lemmatize the Arabic Internet Corpus. The Arabic Internet Corpus was divided into half-million-word files. Then a specialized program distributed copies of the SALMA – Lemmatizer and Stemmer to multiple CPUs and assigned different input files to each copy, so that the partitioned corpus files were lemmatized in parallel. The output files were combined into one lemmatized Arabic Internet Corpus, comprising 176 million word-tokens, 2,412,983 word-types, 322,464 lemma-types, and 87,068 root-types.

By using the NGS, the execution time for processing the 176-million-word corpus was reduced to only 5 days. It might have been a few hours if enough CPUs had been allocated to process all files strictly in parallel; NGS provides virtual parallel processing on a reduced set of CPUs.
Therefore, the half-million-word files were divided into three groups containing 100, 150 and 80 files respectively depending on the number of CPUs they were allocated. The average CPU time used to lemmatize a file of average 584,599 words was 91,102 seconds (25 hours, 18 minutes and 22 seconds) at an average of 6.4 words per second. The total CPU time used to lemmatize all the corpus files was 30,245,965 seconds (8401 hours, 39 minutes and 25 second – approximately one year). However, five days were enough to lemmatize the 176-million word Arabic Internet Corpus via parallel processing. 68 NGS (National Grid Services) http://www.ngs.ac.uk NGS case study: Accelerating the Processing of Large Corpora, http://www.ngs.ac.uk/accelerating-theprocessing-of-large-corpora-using-grid-computing-technologies-for-lemmatizing-176 - 293 After lemmatizing the three groups of corpus files, the lemmatized output files were combined into one lemmatized Arabic Internet Corpus. The lemmatized corpus was stored in one large tab-separated column file where the words occupy the first column, the lemmas occupy the second column, the roots occupy the third column, and special tags were added in the fourth column. These tags are: STOP_WORD to mark function words; N_BP to mark broken plural nouns; NE_PERS to mark personal named entities; NE_LOC to mark locational named entities and NE_ORG to mark organizational named entities. Figure 10.1 shows a one-sentence example of the lemmatized Arabic Internet Corpus. The sentence is: di%t !' ..!' Ÿ £ ”2@ - .   e S  12 ¯ n%4 ) Ÿ S-i! 2'"2 k'! k: %# .1'­ 3l)8 \m 1S( b "2(m 5e )2 `'  la‘allahu ’an yakūna kābūsan wa yastafῑqu minhu ‘alā al-’ašyā’i al-’alῑfati wa aṭ-ṭayyibati wa al-ḥabῑbati. wa imtadda aš-šāri‘u al-ḍayyiqu ṭawῑlan.. ṭawῑlan wa ğalasat al-buyūtu sākinatan, muṭriqatan, wa al-maṣābῑḥu aṣṣafrā’u al-maqrūratu tanzifu ḍaw’an ‘Perhaps it is a nightmare and he will wake up to the usual, good and beloved things. The narrow road is extend long. 
long. The homes sat silent, listening, speechless, and the yellow bubbled lamps bled light.’

Figure 10.1 Sample of lemmatized sentence from the Arabic Internet Corpus (four columns: word, lemma, root and tag; the tags appearing in the sample include STOP_WORD and N_BP)

The main challenge of lemmatizing the 176-million-word Arabic Internet Corpus was the long execution time, which would otherwise have been several months. This challenge was solved by using the high-performance computational power provided by the NGS: the time needed to lemmatize the AIC was reduced to 5 days.

The other challenge that appeared while lemmatizing the AIC was the large number of spelling errors. The AIC was collected automatically from web pages (Sharoff 2006). These web pages were constructed using different web authoring tools with integrated word processing modules. Most word processing tools that support Arabic do not check which letter and diacritic combinations can appear at a given position in the word, and the absence of such a check increases the potential for misspelling Arabic words. Many spelling errors are found in the AIC, such as: adding more than one short vowel to the same letter; starting or ending the word with taṭwīl; adding a diacritic to taṭwīl; starting the Arabic word with a silent letter by adding sukūn to the first letter; and adding tanwīn to any of the word’s letters other than the last letter. The SALMA – Tokenizer has a specialized procedure that checks whether the letter and diacritic combinations are correct; see section 8.3.1.
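The five error types listed above can be expressed as simple checks over letter and diacritic groups. The following is a minimal sketch under stated assumptions, not the SALMA – Tokenizer procedure of section 8.3.1: the function names and error messages are hypothetical, only the standard Unicode code points for taṭwīl and the Arabic diacritics are relied on, and the tanwīn rule is implemented exactly as worded in the text.

```python
TATWEEL = "\u0640"
SHORT_VOWELS = {"\u064E", "\u064F", "\u0650"}   # fatha, damma, kasra
TANWEEN = {"\u064B", "\u064C", "\u064D"}        # fathatan, dammatan, kasratan
SUKUN = "\u0652"
SHADDA = "\u0651"
DIACRITICS = SHORT_VOWELS | TANWEEN | {SUKUN, SHADDA}


def split_letters(word):
    """Group each base letter with the diacritics that follow it."""
    groups = []
    for ch in word:
        if ch in DIACRITICS:
            if not groups:              # diacritic with no preceding letter
                groups.append(("", []))
            groups[-1][1].append(ch)
        else:
            groups.append((ch, []))
    return groups


def find_diacritic_errors(word):
    """Return a list of rule violations found in one Arabic word."""
    errors = []
    groups = split_letters(word)
    if not groups:
        return errors
    if groups[0][0] == TATWEEL or groups[-1][0] == TATWEEL:
        errors.append("tatweel at word edge")
    last = len(groups) - 1
    for i, (base, marks) in enumerate(groups):
        if sum(m in SHORT_VOWELS for m in marks) > 1:
            errors.append("more than one short vowel on a letter")
        if base == TATWEEL and marks:
            errors.append("diacritic on tatweel")
        if i == 0 and SUKUN in marks:
            errors.append("sukun on first letter")
        # Rule as stated in the text. Caveat: tanween fath is commonly
        # written on the letter before a final alif, so a production
        # checker would need an exception for that spelling.
        if i != last and any(m in TANWEEN for m in marks):
            errors.append("tanween before last letter")
    return errors
```

A checker of this kind only detects violations; choosing the corrected form for each error is a separate step that this sketch does not attempt.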
The first step in lemmatization is tokenization of the corpus words, which classifies each token as an Arabic word or as another kind of token (i.e. number, currency, non-Arabic word or date). The Arabic words are passed to the spell-checking procedure, which detects spelling errors and replaces the misspelled words with the corrected words.

10.2.1 Evaluation of the Lemmatizer Accuracy

There was no gold standard for evaluating the accuracy of the AIC lemmas and roots. Therefore, small random samples were selected and the accuracy was computed for each sample. To evaluate the accuracy of the lemmatizer, in terms of lemma and root accuracies, 10 samples of 100 words each were randomly selected from the lemmatized AIC. For each sample, the lemma and root accuracies were computed as the percentage of words with correct lemma and root analyses. Tables 10.1 and 10.2 show the accuracy results for each sample. Accumulative averages of both the lemma and root accuracies were computed to track the accuracy changes from one sample to another. The accumulative averages remained steady across the selected samples, so no further samples were added. The accumulative accuracy averages were reported as the lemma and root accuracies of the AIC. Figure 10.2 shows the lemma accuracy and root accuracy for each sample, the accumulative average of the lemma accuracy, and the accumulative average of the root accuracy.

The results show that the accumulative average root accuracy is 81.20% and the accumulative average lemma accuracy is 80.80%.
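The accumulative averages can be reproduced from the per-sample counts of correct analyses. The sketch below assumes the counts reported in Tables 10.1 and 10.2 and a fixed sample size of 100 words; the function name is illustrative.

```python
from itertools import accumulate

# Correct analyses per 100-word sample, as reported in Tables 10.1 and 10.2.
LEMMA_CORRECT = [81, 76, 78, 80, 78, 85, 77, 80, 82, 91]
ROOT_CORRECT = [85, 72, 80, 82, 79, 85, 71, 85, 84, 89]
SAMPLE_SIZE = 100


def accumulative_averages(correct_counts, sample_size):
    """Running accuracy (%) after each additional sample."""
    running_totals = accumulate(correct_counts)
    return [
        100.0 * total / (sample_size * n)
        for n, total in enumerate(running_totals, start=1)
    ]


if __name__ == "__main__":
    for name, counts in (("lemma", LEMMA_CORRECT), ("root", ROOT_CORRECT)):
        averages = accumulative_averages(counts, SAMPLE_SIZE)
        print(name, [f"{a:.2f}" for a in averages])
```

The last element of each list is the figure reported for the whole evaluation, since all samples have the same size.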
Table 10.1 Lemma accuracy
Sample  Sample name    Start line  Tokens  Correct lemmas  Accuracy %  Average %
1       newdp_out.txt  111,435     100     81              81.00%      81.00%
2       newfo_out.txt  384,384     100     76              76.00%      78.50%
3       newih_out.txt  113,691     100     78              78.00%      78.33%
4       newca_out.txt  13,076      100     80              80.00%      78.75%
5       newfc_out.txt  59,313      100     78              78.00%      78.60%
6       newlg_out.txt  234,254     100     85              85.00%      79.67%
7       newdr_out.txt  570,807     100     77              77.00%      79.29%
8       newmi_out.txt  507,492     100     80              80.00%      79.38%
9       newir_out.txt  355,144     100     82              82.00%      79.67%
10      neweu_out.txt  149,057     100     91              91.00%      80.80%
Total                              1000    808             80.80%      80.80%

Table 10.2 Root accuracy
Sample  Sample name    Start line  Tokens  Correct roots  Accuracy %  Average %
1       newdp_out.txt  111,435     100     85             85.00%      85.00%
2       newfo_out.txt  384,384     100     72             72.00%      78.50%
3       newih_out.txt  113,691     100     80             80.00%      79.00%
4       newca_out.txt  13,076      100     82             82.00%      79.75%
5       newfc_out.txt  59,313      100     79             79.00%      79.60%
6       newlg_out.txt  234,254     100     85             85.00%      80.50%
7       newdr_out.txt  570,807     100     71             71.00%      79.14%
8       newmi_out.txt  507,492     100     85             85.00%      79.88%
9       newir_out.txt  355,144     100     84             84.00%      80.33%
10      neweu_out.txt  149,057     100     89             89.00%      81.20%
Total                              1000    812            81.20%      81.20%

Figure 10.2 Lemma and root accuracy of the lemmatized Arabic internet corpus (chart titled “Lemmatizer Accuracy”; x-axis: samples 1 to 10; y-axis: accuracy from 0% to 100%; series: Lemma Accuracy, Root Accuracy, Accum Lemma Average Accuracy, Accum Root Average Accuracy)

10.3 Corpus Linguistics Resources and Tools for Arabic Lexicography

Corpora have been used to construct dictionaries since the release of the Collins-Birmingham University International Database (COBUILD). Computer technology was used in the four stages of constructing COBUILD: data collection, entry selection, entry construction and entry arrangement (Ooi 1998). A large and representative corpus, made up of texts from many different domains, formats and genres, provides detailed information about all aspects of written language that can be studied.
Corpora and corpus analysis tools, e.g. Sketch Engine⁶⁹, have brought about a revolution in dictionary building. Corpus analysis tools are used to build a detailed statistical profile of every word in the corpus, which enables lexicographers to understand the words, their collocations, their behaviours, their usages and the connotations they may carry. Ways of producing new words and expressions, and the popularity of coinages, can be identified with the help of the corpus. Oxford dictionaries⁷⁰ are an exemplar of the use of corpora in constructing dictionaries.

The second, traditional source of information used to construct dictionaries is citations. Citations represent the objective evidence of language in use. They are a prerequisite for a reliable dictionary, but they have their limitations (Atkins and Rundell 2008).

⁶⁹ Corpus analysis tools such as Sketch Engine (www.sketchengine.co.uk)
⁷⁰ Oxford dictionaries http://www.oxforddictionaries.com

Arabic corpora have not been used to construct Arabic dictionaries⁷¹. Advances in corpus construction technologies and corpus analysis tools, together with the availability on the web of large quantities of Arabic text of different domains, formats and genres, allow us to build a large and representative lexicographic corpus of Arabic to be used in constructing new Arabic dictionaries. A lemmatizing tool is needed to group words that share the same lemma. It also helps in finding the collocations of a word. Figures 10.3 and 10.4 show examples of the word ğāmi‘at “University” and its collocations.

Figure 10.3 Example of the concordance line of the word ğāmi‘at “University” from the Arabic Internet Corpus

⁷¹ The last Arabic dictionary, mu‘jam al-wasīṭ “Al-Waseet Lexicon”, appeared in the 1960s, published by the Arabic Language Academy in Cairo.
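Collocation lists of the kind discussed here are usually ranked with association measures computed from lemma frequencies. The following is a minimal sketch assuming the standard textbook definitions of mutual information, t-score and the log-likelihood ratio over a 2×2 contingency table; the function names and argument order are illustrative, and this is neither the thesis implementation nor that of Sketch Engine.

```python
import math


def expected(f1, f2, n):
    """Expected co-occurrence count if the two lemmas were independent."""
    return f1 * f2 / n


def mutual_information(o, f1, f2, n):
    """Pointwise mutual information: log2(observed / expected)."""
    return math.log2(o / expected(f1, f2, n))


def t_score(o, f1, f2, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (o - expected(f1, f2, n)) / math.sqrt(o)


def log_likelihood(o, f1, f2, n):
    """Log-likelihood ratio (G2) over the 2x2 contingency table."""
    observed = [o, f1 - o, f2 - o, n - f1 - f2 + o]
    row_totals = [f1, f1, n - f1, n - f1]
    col_totals = [f2, n - f2, f2, n - f2]
    g2 = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        if obs > 0:                      # an empty cell contributes nothing
            g2 += obs * math.log(obs / (row * col / n))
    return 2.0 * g2
```

Here `o` is the number of times the two lemmas co-occur, `f1` and `f2` are their individual frequencies and `n` is the corpus size. Counting over lemmas rather than surface forms pools the evidence from all inflected forms of a word, which is the reason a lemmatized corpus is useful for this task.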
Figure 10.4 Example of the collocations of the word ğāmi‘at “University” from the Arabic Internet Corpus

The second important source of information needed to construct new Arabic dictionaries is the long-established traditional Arabic lexicons. Over the past 1200 years, many different kinds of Arabic lexicons have been constructed; these lexicons differ in ordering, size and goal of construction. The traditional Arabic lexicons followed four main methodologies for ordering their lexical entries. These methodologies use the root as the lexical entry. Their main disadvantage is that the words derived from the root are not arranged methodically within the lexical entry. Ordering of dictionary entries is the main challenge in constructing Arabic dictionaries.

Traditional Arabic lexicons represent a citation bank to be used in the construction of modern Arabic dictionaries. They include citations for each lexical entry from the Qur’an and from authentic poetry that represent the proper use of keywords. They provide information about the origin of words. They also include phrases, collocations, idioms, and well-known personal names and places derived from that root (lexical entry).

The corpus of traditional Arabic lexicons is a collection of 23 lexicons. It represents a different domain from existing Arabic corpora. It covers a period of more than 1200 years. It consists of a large number of words, about 14,369,570, and about 2,184,315 word types. The corpus of traditional Arabic lexicons has both types of Arabic text: vowelized and non-vowelized. Figure 10.5 shows the most frequent words of the Corpus of Traditional Arabic Lexicons (see section 4.6).
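Token counts, type counts and frequency lists of this kind can be produced with a few lines of code. The sketch below is illustrative only: it assumes the corpus is already tokenized, and it assumes that the non-vowelized list is obtained by stripping the Arabic diacritics (U+064B to U+0652) from each token, which may differ from the procedure used in section 4.6.

```python
import re
from collections import Counter

# Arabic diacritics: tanween, short vowels, shadda and sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(word):
    """Remove short vowels and other diacritics from one word."""
    return DIACRITICS.sub("", word)


def frequency_lists(tokens):
    """Return (as-written, non-vowelized) frequency counters."""
    as_written = Counter(tokens)
    non_vowelized = Counter(strip_diacritics(t) for t in tokens)
    return as_written, non_vowelized


def corpus_size(tokens):
    """Return (number of word tokens, number of word types)."""
    return len(tokens), len(set(tokens))
```

Calling `most_common(25)` on either counter gives a ranked list in the style of Figure 10.5.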
Figure 10.5 The Corpus of Traditional Arabic Lexicons frequency lists (two lists of the most frequent words with their frequencies: Partially-vowelized Word / Frequency and Non-vowelized Word / Frequency)

Figure 10.6 shows a proposed web interface for an Arabic dictionary that illustrates the adaptation of the resources, standards and tools developed in this research as language-engineering tools to construct Arabic dictionaries.

Figure 10.6 A proposed web interface for Arabic dictionary (panels: Input Word (1); Position in dictionary (2); Definitions, with the part of speech “(noun)” (3) and the gloss “Institution which provides a high level of education for somebody who has left school”; Pronunciation: /ğāmi‘a/; Related words (4); Lemma (5); Root (6); Pattern (7); Plural form; Examples (8); Phrases, Collocations, Idioms; Origin (9); Link to the Corpus of Traditional Arabic Lexicons; Morphological analysis of input words (10), with one SALMA tag per morpheme: p--c------------------ Conjunction; r---d----------------- Definite Article; np----fp-vndd---ncat-s Generic noun; r---l----------------- feminine plural suffix)

Each number label on the figure is mapped to one of the resources, standards and tools:
• Label number 1: This allows users to search for any word. The SALMA – Lemmatizer and Stemmer can be used to extract the lemma (lexical entry) related to the input word and retrieve the definitions stored in the dictionary.
• Label number 2: The SALMA – ABCLexicon can be used to retrieve a list of alphabetically ordered lexical entries that share the same root.
• Label number 3: The SALMA – Tagger can provide the main part-of-speech of the lexical entry.
• Label number 4: The lemmatized AIC can be used to retrieve related words by measuring the log-likelihood, t-score and mutual information to extract the collocations of the searched word.
• Labels number 5 and 6: The SALMA – Lemmatizer can be used to extract the lemma and the root of the entered word.
• Label number 7: The pattern information can be produced using the SALMA – Pattern Generator.
• Label number 8: Examples are selected from the lemmatized AIC concordance lines of the input word and its lemma.
• Label number 9: The origin of this word and the time line of the semantic development of the lexical entries can be investigated via a link to the Corpus of Traditional Arabic Lexicons.
• Label number 10: The morphological analysis of the input word, its morphemes and the morphological features of each morpheme are described using both the SALMA – Tag Set and the SALMA – Tagger.

10.4 Chapter Summary

Resources, standards and tools developed in this research have many potential applications as they work as fundamental prerequisites for most Arabic text analytics applications. The main purpose in developing the resources, standards and tools is to annotate an Arabic text corpus with fine-grain morphosyntactic information. This chapter investigated two applications of these resources, standards and tools: lemmatizing the 176-million-word Arabic Internet Corpus (AIC), and as language engineering resources to construct an Arabic dictionary.

The developed resources, standards and tools were evaluated on a sample of Arabic text to measure their accuracy and applicability for use to perform text analytics tasks. However, the performance aspects of the SALMA – Tagger such as speed, memory and ability to perform the desired analysis tasks were not evaluated previously. Applying the SALMA – Lemmatizer and Stemmer to lemmatize the 176-million-word Arabic Internet Corpus is a practical application that evaluated its performance and investigated the challenges of applying the resources, standards and tools on real and large-scale data. Two main challenges arose during the lemmatizing of the AIC: the speed and the spelling errors. NGS was used to lemmatize the divided parts of the AIC in parallel. A massive reduction in execution time was gained. The SALMA – Tokenizer was used to detect and correct the spelling errors that appear in the AIC due to poor word processing tools used in authoring web pages.
The second application is a proposal for how these resources, standards and tools can be used as a language engineering toolkit for Arabic lexicography. This study reviews the resources and tools used in modern lexicography, and shows that the developed resources and standards constitute a toolkit for constructing Arabic monolingual and bilingual dictionaries.

Part V: Conclusions and Future Work

Chapter 11 Conclusions and Future Work

11.1 Overview

Arabic morphological analyzers and stemming algorithms have become a popular area of research. This chapter reviews the main contributions of this thesis to this area. It discusses the conclusions drawn from experimental work, and connects these findings with related future work. Finally, the chapter summarises the impact, originality and contributions of this PhD to Arabic NLP.

Several computational linguists have designed and developed algorithms to address problems in automatic morphosyntactic annotation of Arabic text. This thesis has surveyed current Arabic morphological analyzers, and conducted experiments to discover the theoretical and practical challenges of morphological analysis for Arabic. Practical work includes the development of resources to enhance the accuracy of such systems, where these resources can also be reused in diverse Arabic text analytics applications. It also includes the proposal of linguistically informed standards for Arabic morphological analysis which draw on the long-established traditions of Arabic grammar. Finally, resources and proposed standards are brought together in the development of the SALMA – Tagger: a fine-grained morphological analyzer for Arabic text of different domains, formats and genres. Resources, proposed standards and tools are intended to be open-source.
The SALMA – Tagger was developed in the open-source programming language Python because it is intended for integration into the Natural Language Toolkit (NLTK⁷²), a set of open-source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics.

⁷² Natural Language Toolkit (NLTK) http://www.nltk.org

11.2 Thesis Achievements and Conclusions

This section summarises the main achievements of this thesis and the conclusions drawn from experimental work. Section 11.2.1 discusses the practical challenges of Arabic morphological analysis. Section 11.2.2 discusses the motivations and benefits of creating the SALMA – ABCLexicon as a lexical resource for improving Arabic morphological analyzers. Section 11.2.3 discusses standardization of morphosyntactic annotation for Arabic corpora. Section 11.2.4 covers the application of the proposed standards and developed resources in the SALMA – Tagger, a tool for fine-grain morphological analysis of Arabic text. Finally, section 11.2.5 discusses the evaluation of the SALMA – Tagger, focusing on the fine-grained morphological feature categories, and draws conclusions from this evaluation that suggest opportunities for future work to enhance the performance and accuracy of the SALMA – Tagger as a language-engineering toolkit for morphosyntactic analysis of Arabic text.

11.2.1 The Practical Challenge of Morphological Analysis for Arabic Text

Several stemming algorithms for Arabic already exist, but each researcher proposes an evaluation methodology based on different text corpora. Therefore, direct comparisons between these evaluations cannot be made. At the time of the experiment, only three stemming algorithms and morphological analyzers for Arabic text were readily accessible for assessing their implementation and/or performance results.
The three selected algorithms are Khoja’s stemmer (Khoja 2003), Buckwalter’s Morphological Analyzer (BAMA) (Buckwalter 2002) and the triliteral root extraction algorithm (Al-Shalabi et al. 2003). Four fair and precise evaluation experiments were conducted using a gold standard consisting of two 1000-word text documents, one from the Holy Qur’an and one from the Corpus of Contemporary Arabic. The four experiments on both text samples show the same accuracy ranking for the stemming algorithms: Khoja’s stemmer achieved the highest accuracy, then the triliteral root extraction algorithm, and finally BAMA. The results show that:
• The stemming algorithms used in the experiments work better on MSA text (i.e. newspaper text) than on Classical Arabic (i.e. Qur’an text), not unexpectedly as they were originally designed for stemming MSA text. The SALMA – Tagger is designed for wide coverage and so can deal with both genres.
• All stemming algorithms involved in the experiments agree and generate correct analyses for simple roots that do not require detailed analysis. So, more detailed analysis and enhancements are recommended as future work.
• Most stemming algorithms are designed for information retrieval systems, where accuracy of the stemmers is not such an important issue. On the other hand, accuracy is vital for natural language processing, and this is what the SALMA – Tagger is designed for.
• The accuracy rates surveyed show that even the best algorithm failed to achieve an accuracy rate of more than 75%. This shows that more research is required: part-of-speech tagging and then parsing cannot rely on such stemming algorithms, because errors from the stemming algorithms will propagate to such systems.

To give a clear picture of the stemming problem, an analytical study was conducted to compute the distribution of triliteral roots, words and word types over 22 categories of triliteral roots, as classified in sections 3.7 and 6.2.21.
The roots, words and word types of the Qur’an and the SALMA – ABCLexicon were analysed. The study clearly showed that about one third of Arabic text words have roots belonging to the defective or the defective and hamzated root categories (i.e. one or two root radicals are vowels or hamzah). Words belonging to these two root categories are hard to analyze, and the root extraction process for such words always has higher error rates than for words belonging to the intact root category. Existing stemmers and morphological analyzers are prone to mistakes when analysing words belonging to these two categories.

The evaluation methodology used in this thesis for stemming algorithms and morphological analyzers for Arabic text, based on the gold standard, has since been reused and referenced by Alotaiby, Alkharashi et al. (2009), Kurimo, Virpioja et al. (2009), Harrag, Hamdi-Cherif et al. (2010), Yusof, Zainuddin et al. (2010), Al-Jumaily, Martínez et al. (2011), and Hijjawi, Bandar et al. (2011).

11.2.2 Resources for improving Arabic Morphological Analysis

The previous section raises the following question: how can we improve stemming and morphological analysis for Arabic so that the algorithm can deal successfully with the hard cases, the 35% of words belonging to the defective and the defective and hamzated triliteral root categories? Two methodologies can be adopted: either to build a sophisticated algorithm that deals with the hard cases, or simply to provide the algorithm with a broad-coverage lexical resource of prior knowledge that contains most of the hard-case words and their triliteral roots and enables direct access to its contents. The stemming algorithm then looks up the word to be analysed in the lexicon and retrieves the correct analysis for that word. We chose to construct a broad-coverage lexical resource, the SALMA – ABCLexicon, to improve the accuracy of Arabic morphological analysis rather than developing a sophisticated stemming algorithm.
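The look-up strategy described above can be sketched as a dictionary access with an algorithmic fallback. This is a toy illustration, not the SALMA implementation: the lexicon is a one-entry dictionary keyed by a transliterated word, and `naive_stemmer` is a hypothetical placeholder for whatever root extraction algorithm would handle words missing from the lexicon.

```python
# Toy word-to-root lexicon. The single entry is a hard case: the root of
# "qāla" (he said) contains the weak radical w, which the surface form hides.
TOY_LEXICON = {
    "qāla": "q-w-l",
}


def naive_stemmer(word):
    """Hypothetical fallback: stands in for an algorithmic root extractor."""
    return "unknown"


def analyse_root(word, lexicon, fallback_stemmer):
    """Return (root, source), preferring the lexicon over the algorithm."""
    root = lexicon.get(word)
    if root is not None:
        return root, "lexicon"
    return fallback_stemmer(word), "algorithm"


if __name__ == "__main__":
    print(analyse_root("qāla", TOY_LEXICON, naive_stemmer))
    print(analyse_root("kitāb", TOY_LEXICON, naive_stemmer))
```

Returning the source alongside the root makes it easy to measure how often the lexicon answers directly, which is one way of reading the coverage figures reported for the lexicon.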
Our choice was influenced by our interest in Arabic lexicon development and by the advantages to be gained from developing the SALMA – ABCLexicon, such as:
• Improving Arabic morphological analysis by providing a broad-coverage lexical resource that can be integrated into different stemming algorithms and can reduce the series of complex analysis steps to a simpler look-up procedure.
• The broad-coverage lexical resource can be a stand-alone resource that can be integrated into different Arabic natural language processing systems, which benefit from such integration.
• It is easier to update the lexical resource, by adding new content to it and correcting it, than to update a sophisticated algorithm, which needs specialized developers.
• It can also be used as a material resource to assist in the teaching-learning process.

The SALMA – ABCLexicon was constructed by analysing the text of 23 traditional Arabic lexicons, all of which are freely available open-source documents, and by following an agreed standard for constructing a morphological lexicon from raw text. Three factors directed the selection of traditional Arabic lexicons as our raw text corpus: (i) the absence of an open-source, large, representative Arabic corpus; (ii) the absence of an open-source generation program; and (iii) the over-generation and under-generation problems of generation programs. The major advantages of using the text of the traditional Arabic lexicons as a corpus are that the corpus contains a large number of words (14,369,570) and word types (2,184,315), and that it offers the possibility of finding the different forms of the words derived from a given root. The SALMA – ABCLexicon is constructed by combining information extracted from disparate lexical resource formats and merging Arabic lexicons.

The coverage of the SALMA – ABCLexicon was computed via two methods. The first was to match the words of the test corpora to the words in the lexicon, which scored about 67%.
The second was to use a lemmatizer to compute the coverage, which scored about 82% for the Qur’an, the CCA, and a million-word sample of the AIC.

The SALMA – ABCLexicon contains 2,781,796 vowelized word-root pairs, which represent 509,506 different non-vowelized words. The lexicon is stored in three different formats: tab-separated column files, XML files, and a relational database. It is also provided with access and searching facilities, and with a web interface for searching for a given root and retrieving the original root definitions from the analyzed traditional Arabic lexicons. In addition, the Corpus of Traditional Arabic Lexicons (14,369,570 words and 2,184,315 word types) was created as a special corpus constructed from the text of 23 traditional Arabic lexicons.

11.2.3 Standards for Arabic Morphosyntactic Analysis

The initial evaluation of morphological analyzers and stemmers for Arabic text pointed out the lack of standardization and guidelines for morphosyntactic annotation of Arabic text. These standards and guidelines are prerequisites for the morphosyntactic annotation of corpora. Therefore, eight existing Arabic tag sets were surveyed and compared in terms of purpose of design, characteristics, tag-set size, and applications (section 5.3.7). The drawbacks of the existing tag sets for Arabic were found to be:
• Existing Arabic tag sets vary in size from 6 tags to 2000 or more tags.
• Some of these tag sets, such as the PATB tag sets, follow standards for tag set design for English, and these may not always be appropriate for Arabic.
• The tag sets share common morphological features such as gender, number, person, case, mood and definiteness, but the attributes of the morphological feature categories are not standardized.
• These tag sets lack standardization in defining a suitable scheme for tokenizing Arabic words into their morphemes, and they mix morpheme tagging with whole-word tagging.
• They also lack suitable documentation that illustrates the decision made for each design dimension of the tag set.
• The tags assigned to words in a corpus are not consistent in either the presentation of the tag itself or the morphological features which are encoded within the tag.

Moreover, the most widely used and important morphosyntactic annotation standards and guidelines, namely EAGLES, are designed for Indo-European languages. These guidelines are not entirely suitable for Arabic.

The comparative evaluation of Arabic tag sets described above, and the opportunity for making an original contribution, motivated the development of the SALMA – Tag Set as a proposed standard for the morphological annotation of Arabic text corpora. This constitutes a common standard to simplify and promote comparisons and sharing of resources. For a morphologically rich language like Arabic, the part-of-speech tag set should be defined in terms of morphological features characterizing word structure. The SALMA – Tag Set has the following characteristics:
• The SALMA – Tag Set captures long-established traditional morphological features of Arabic, in a notation format intended to be compact yet transparent.
• A detailed description of the SALMA – Tag Set explains and illustrates each feature and its possible values.
• A tag consists of 22 characters; each position represents a feature and the letter at that location represents a value or attribute of the morphological feature; the dash “-” represents a feature not relevant to a given word.
• The SALMA – Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between and reuse of Arabic taggers and tagged corpora.

The SALMA – Tag Set has been validated in two ways. First, it was validated by proposing it as a standard for the Arabic language computing community, and it has been adopted in Arabic language processing systems:
• It has been used in the SALMA – Tagger to encode the morphological features of each morpheme (Sawalha and Atwell 2009a; Sawalha and Atwell 2010b).
• Parts of the SALMA – Tag Set were also used in the Arabic morphological analyzer and part-of-speech tagger Qutuf (Altabbaa et al. 2010).
• It has been reported as a standard for evaluating morphological analyzers for Arabic text and for building a gold standard for evaluating morphological analyzers and part-of-speech taggers for Arabic text (Hamada 2010).

Second, an empirical approach to evaluating the SALMA – Tag Set showed that it can be applied to an Arabic text corpus, by mapping from an existing tag set to the more detailed SALMA – Tag Set. The morphological tags of a 1000-word test text, chapter 29 of the Quranic Arabic Corpus, were automatically mapped to SALMA tags. Then, the mapped tags were proofread and corrected. The result of mapping and correcting the SALMA tagging of this corpus is a new gold standard for evaluating Arabic morphological analyzers and part-of-speech taggers, with a detailed fine-grain description of the morphological features of each morpheme, encoded using SALMA tags.

11.2.4 Applications and Implementations

Morphosyntactic analysis is a very important and basic application of Natural Language Processing which can be integrated into a wide range of NLP applications. Arabic has many morphological and grammatical features, including sub-categories, person, number, gender, case, mood, etc. More fine-grained tag sets are often considered more appropriate. The additional information may also help to disambiguate the (base) part of speech.

The SALMA – Tagger is an open-source fine-grain morphological analyzer for Arabic text which puts together the developed resources (mainly the SALMA – ABCLexicon) and standards (the SALMA – Tag Set). It also depends on pre-stored lists (i.e. prefixes, suffixes, roots, patterns, function words, broken plurals, named entities, etc.)
which were extracted from traditional grammar books. The morphological analyzer was developed to analyze the word and specify its morphological features. It uses a tokenization scheme for Arabic words that distinguishes between five parts of a word’s morphemes, as defined by the SALMA – Tag Set. Each part is given a fine-grained SALMA tag that encodes 22 morphosyntactic categories of the morpheme (or possibly multiple tags if the part has multiple clitics or affixes). The SALMA – Tagger consists of several modules which can be used independently to perform a specific task such as root extraction, lemmatizing or pattern extraction, or together to produce full detailed analyses of the words.

The SALMA – Tagger was evaluated on a sample of Arabic text to measure its accuracy and applicability for use in text analytics tasks. It was also practically evaluated by applying the SALMA – Lemmatizer and Stemmer to lemmatize the 176-million-word Arabic Internet Corpus (AIC) (section 10.2). This application measured the performance aspects of the SALMA – Tagger such as speed, memory and ability to perform the desired analysis tasks. Two main challenges arose during the lemmatizing of the AIC:
• Speed: this was solved by using the NGS to lemmatize the divided parts of the AIC in parallel, giving a massive reduction in execution time.
• Spelling errors: these were solved by using the SALMA – Tokenizer to detect and correct the spelling errors that appear in the AIC due to poor word processing tools used in authoring web pages.

The second application is a proposal for how these resources, standards and tools can be used as a language engineering toolkit for Arabic lexicography. We reviewed the resources and tools which are used in modern lexicography, and we showed that the resources, proposed standards, and tools developed constitute a toolkit for constructing Arabic monolingual and bilingual dictionaries (section 10.3).
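The positional tag format lends itself to a small data structure. The following sketch is illustrative and is not taken from the SALMA – Tagger source: the class and method names are hypothetical, the five morpheme types are those named in section 11.2.5, and the example tag is the one labelled “Generic noun” in Figure 10.6. Positions are reported by number only, because the mapping from position to feature name is defined in the SALMA – Tag Set description rather than here.

```python
from dataclasses import dataclass

TAG_LENGTH = 22
MORPHEME_TYPES = ("proclitic", "prefix", "stem", "suffix", "enclitic")


@dataclass(frozen=True)
class Morpheme:
    form: str   # surface form of the morpheme
    kind: str   # one of MORPHEME_TYPES
    tag: str    # 22-character positional tag

    def __post_init__(self):
        if self.kind not in MORPHEME_TYPES:
            raise ValueError(f"unknown morpheme type: {self.kind}")
        if len(self.tag) != TAG_LENGTH:
            raise ValueError(f"tag must have {TAG_LENGTH} characters")

    def specified_features(self):
        """Map 1-based position to value, skipping '-' (not relevant)."""
        return {i: ch for i, ch in enumerate(self.tag, start=1) if ch != "-"}


if __name__ == "__main__":
    stem = Morpheme("stem-form", "stem", "np----fp-vndd---ncat-s")
    print(stem.specified_features())
```

Validating the tag length at construction time catches truncated or padded tags early, which matters when tags are compared position by position during evaluation.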
11.2.5 Evaluation

The evaluation of the SALMA – Tagger showed that evaluation methodologies for morphological analyzers are not standardized yet. Therefore, we developed agreed standards for evaluating morphological analyzers for Arabic text, based on our experiences and participation in two community-based evaluation contests: the ALECSO/KACST initiative for developing and evaluating morphological analyzers; and the MorphoChallenge 2009 competition. The guideline recommendations, evaluation specifications and procedures, and evaluation metrics were reused to generate a global standard for evaluating morphological analyzers for Arabic text. The developed standards were applied when evaluating the SALMA – Tagger.

The developed evaluation standards depend on using gold standards for evaluating morphological analyzers for Arabic text. A reusable general purpose gold standard (the SALMA – Gold Standard) was constructed to evaluate various morphological analyzers for Arabic text and to allow comparisons between the different analyzers. The SALMA – Gold Standard adheres to standards, and is enriched with fine-grained morphological information for each morpheme of the gold standard text samples. The detailed information is: the input word, its root, lemma, pattern, word type and the word’s morphemes. Each of the word’s morphemes is classified by type (proclitic, prefix, stem, suffix or enclitic) and is given a fine-grain SALMA Tag which encodes 22 morphological feature categories. The SALMA – Gold Standard contains two text samples of about 1000 words each, representing two different text domains and genres of both vowelized and non-vowelized text: chapter 29 of the Qur’an, representing Classical Arabic, and a sample from the CCA, representing Modern Standard Arabic. The SALMA – Gold Standard is stored using different standard formats (i.e.
XML files, tab-separated column files, HTML and colour-coded format) to allow wider reusability.

The evaluation using the SALMA – Gold Standard focused on measuring the prediction accuracy of the 22 morphological features encoded in the SALMA – Tags for each morpheme of the gold standard’s text samples. The evaluation aimed to answer the following questions:

• Is fine-grained morphological analysis for Arabic text practical?
• Can traditional Arabic grammar be leveraged to inform the knowledge-base for predicting the attribute values of the morphological feature categories?
• How can accuracy metrics be reported usefully for potential users who will use/reuse the SALMA – Tagger or parts of it?
• How are morphological feature categories related to each other (i.e. what interdependencies exist between the morphological feature categories)?

The results show that 53.50% of the Qur’an text sample morphemes and 71.21% of the CCA text sample morphemes were correctly tagged using “exact match” of the gold standard’s morpheme tags, but some of the errors were very minor such as replacing ‘?’ by ‘-’. These results of applying the SALMA – Tagger answer the first question and show that fine-grained morphological analysis for Arabic text is practical. The results show the applicability of the SALMA – Tagger to different text types, domains and genres of both vowelized and non-vowelized Arabic text. The SALMA – Tagger can be used to POS-tag Arabic text corpora and to provide detailed fine-grained analysis for each morpheme of the corpus words. Moreover, these general results and the individual accuracy rates reported for each morphological feature show that the linguistically-informed knowledge-based system for predicting the values of the morphological feature categories is applicable to Arabic morphological analysis.
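The difference between the "exact match" score and the accuracy of an individual feature can be made concrete with a small sketch. It assumes only what is stated above, namely that each morpheme tag is a string with one character for each of the 22 feature categories. The function names and the example tags are invented for illustration; they are not real SALMA tags and this is not the evaluation code used in the thesis. Only accuracy is shown, not recall, precision or the confusion matrix.

```python
TAG_LENGTH = 22  # one character per morphological feature category


def check_tags(gold_tags, predicted_tags):
    """Raise ValueError unless both lists align and every tag has 22 characters."""
    if not gold_tags or len(gold_tags) != len(predicted_tags):
        raise ValueError("tag lists must be non-empty and of equal length")
    for tag in list(gold_tags) + list(predicted_tags):
        if len(tag) != TAG_LENGTH:
            raise ValueError(f"expected {TAG_LENGTH} characters, got {len(tag)}: {tag!r}")


def exact_match_accuracy(gold_tags, predicted_tags):
    """Fraction of morphemes whose whole predicted tag equals the gold tag."""
    check_tags(gold_tags, predicted_tags)
    hits = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return hits / len(gold_tags)


def per_feature_accuracy(gold_tags, predicted_tags):
    """Accuracy at each of the 22 tag positions, as a list of 22 fractions."""
    check_tags(gold_tags, predicted_tags)
    total = len(gold_tags)
    return [
        sum(g[i] == p[i] for g, p in zip(gold_tags, predicted_tags)) / total
        for i in range(TAG_LENGTH)
    ]


# Invented placeholder tags (not real SALMA tags): the second prediction
# differs from the gold tag only by '-' in place of '?' at position 2.
gold = ["v" + "-" * 21, "n?" + "-" * 20]
predicted = ["v" + "-" * 21, "n-" + "-" * 20]
print(exact_match_accuracy(gold, predicted))      # 0.5
print(per_feature_accuracy(gold, predicted)[:3])  # [1.0, 0.5, 1.0]
```

In this toy case a single minor substitution halves the exact-match score while 21 of the 22 positions remain fully correct, which is why the per-feature figures are the more informative ones for users who reuse only part of the tagger.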
The traditional Arabic grammar rules are leveraged to inform and construct the knowledge-based system for predicting the attribute values of the morphological feature categories.

The evaluation reported the accuracy, recall, precision, f1-score and the confusion matrix for each morphological feature category. The individual category accuracy results are useful for users who will use/reuse the SALMA – Tagger or parts of it, to know in advance the prediction accuracy of the attributes of each morphological feature category. Prediction accuracy was high for 15 morphological feature categories: namely, 98.53%-100% for the CCA test sample and 90.11%-100% for the Qur’an test sample. These categories are: main part-of-speech; subcategory of verb; subcategory of particle; subcategory of other (residual); punctuation; definiteness; voice; emphasized and non-emphasized; transitivity; declension and conjugation; unaugmented and augmented; number of root letters; verb roots; and noun finals.

The remaining 7 morphological feature categories, namely: the subcategory of noun; gender; number; person; inflectional morphology; case or mood; case and mood marks; and the morphological feature of rational, achieved slightly lower prediction accuracy: 81.35%-97.51% for the CCA test sample and 74.25%-89.03% for the Qur’an test sample.

Insights gained from this evaluation process for the morphological feature categories of Arabic words have been investigated in terms of the main background knowledge used for prediction and are as follows:

• The prediction of the main part-of-speech of a word's morphemes depends on both maintaining agreement between the word’s affixes and clitics and the patterns dictionaries. Main part-of-speech information is provided in the clitics and affixes dictionaries and the patterns dictionary.
• The prediction of the part-of-speech subcategory of noun was not easy for the Qur’an text sample due to the nature of Quranic Arabic.
The Qur’an text sample has repeated examples of proper nouns of historical persons and places. One characteristic of MSA text is the frequent use of relative nouns such as الثقافي aṯṯaqāfī ‘cultural’ and gerunds of profession such as الوطنية al-waṭaniyyah ‘nationalism’, where the rule for predicting these attributes is simple.
• The prediction of verbs depends on the analysis of the prefixes and suffixes and the matching of the stem morpheme with a patterns dictionary entry.
• Most particles are stored in the function words list. However, some of the particles of the Qur’an text sample are complex particles which consist of more than one morpheme such as أولم ’a-wa-lam ‘and not’, which consists of three morphemes.
• The prediction of clitics and affixes depends on matching the morphemes of the analyzed word with the entries of the clitics and affixes dictionaries. Ambiguous clitics can be classified into different categories.
• The prediction of punctuation is done in the tokenization step. Special characters used in the MSA text which are not standard punctuation marks are given a special tag ‘o’ at position 6 of the tag string.
• The morphological features of gender, number and person are related to each other and share the same prediction methodology which depends on suffix analysis. Contextual rules that define agreement between the verb and its doer (the subject of the sentence) are needed to support the prediction of these features when the affixes are ambiguous and cannot provide enough prediction information.
• The prediction of the morphological feature of inflectional morphology for verbs depends on the part-of-speech subcategory of verbs and analysis of suffixes for imperfect verbs to determine whether the verb is conjugated or invariable.
• The disambiguation of nouns into declined and invariable depends on applying many rules that deal with the part-of-speech subcategory of nouns, noun finals and patterns.
These rules classify nouns into fully-declined or non-declined.
• The prediction of the morphological feature of case and mood depends on the result of the prediction of the morphological feature of inflectional morphology, such that a declined noun has case (i.e. nominative, accusative and genitive) and a conjugated verb has mood (i.e. indicative, subjunctive, and imperative or jussive), while case or mood is not applicable to invariable nouns and verbs.
• The prediction of a noun’s case investigates the proclitics attached to the beginning of the noun which might affect the case and its syntactic mark such as prepositions and jurative particles. Prediction rules also investigate the dual and plural suffixes which change according to the case of the noun.
• Rules for predicting the case or mood, and case and mood marks for singular and broken plural nouns depend on the short vowel (i.e. the syntactic mark) that appears on the end of the word. The absence of short vowels and contextual rules that deal with nouns according to their context (i.e. subject or object) increases the potential of wrong prediction especially for singular and broken plural nouns.
• Determining the morpheme that carries the syntactic mark of the word is not an easy task and needs more investigation and standardization. Defining the morpheme that carries the syntactic mark has an impact on the development of the syntactic parsers for Arabic text.
• Only a conjugated verb has mood. The prediction rules of mood depend on the part-of-speech subcategory of verb, such that mood is applicable to imperfect verbs and not applicable to perfect and imperative verbs. The rules also analyze the suffixes of the imperfect verb to determine the applicability of mood. The final rule of prediction depends on the short vowel.
• Interdependency is clear between the three morphological feature categories: inflectional morphology, case or mood, and case and mood marks.
• The prediction of the morphological feature of definiteness depends on the availability of the definite article ال as a proclitic for the analyzed noun.
• The prediction rules classify verbs into active verbs or passive verbs depending on the short vowel appearing on the first letter of the verb after removing proclitics. If a ḍammah does not appear on the verb’s first letter, then it is classified as an active voice verb. Errors can happen in some cases where ḍammah appears on the first letter of active voice verbs. Applying prediction rules for the morphological feature of voice that depend on the patterns rather than the short vowel of the first letter of the verb will increase the prediction accuracy.
• Prediction rules for classifying verbs into emphasized or non-emphasized depend on the part-of-speech subcategory of the verb. Perfect verbs are always non-emphasized while imperfect and imperative verbs can be emphasized. The prediction rules also investigate the suffixes of the verb. Emphasized verbs contain the emphatic nūn as a suffix.
• The prediction rules for the morphological feature of transitivity depend on matching the analyzed verb against the stored lists of doubly transitive and triply transitive verbs. The singly transitive verb attribute is the default value for the morphological feature of transitivity. The absence of contextual rules for predicting the attributes of the morphological feature of transitivity increases the potential for making prediction mistakes. On the other hand, suffix pronoun analysis can capture some attributes of this morphological feature.
• Classifying words into rational or irrational depends on the semantics of the word itself and its context, which determines agreements between sentence parts such as verb-subject agreement and adjective-noun agreement.
A comprehensive dictionary which includes rationality information for each dictionary entry is needed to determine the correct attribute value of rational for nouns.
• The morphological feature of declension and conjugation is applied to nouns, verbs and particles. The prediction rules of the values of declension and conjugation of nouns depend on the part-of-speech subcategories. Including declension and conjugation information in the Arabic dictionary will increase the correct prediction of attributes for this morphological feature.
• The prediction rule of unaugmented and augmented attributes subtracts the length of the root from the length of the analyzed word. The prediction rule of the attributes of the number of root letters depends on the length of the root. The prediction rules of the morphological feature of verb roots depend on the nature of the root letters: whether they are consonants, containing hamzah, or whether they contain one vowel or two.
• The rules for predicting the value of the morphological feature of Noun Finals mainly depend on the long stem and the root of the analysed word: they check the final letters of the long stem against a set of conditions that classify nouns into 6 subcategories. Knowing the value of the Noun Finals feature helps in specifying other features such as the morphological features of Inflectional Morphology and Case and Mood Marks.

To summarize, the absence of contextual rules, the absence of short vowels, the interdependency between some morphological features, and the number of attributes of a certain morphological feature increase the potential of prediction errors for some morphological feature categories. To improve the accuracy of predicting the attributes of the morphological feature categories, contextual rules can be implemented as a second pass. Some morphological feature categories such as rational depend on the semantic nature of the analyzed word itself.
Providing rationality information for Arabic dictionary entries and reusing this information in morphological analyzers will increase prediction accuracy. Moreover, updating the dictionaries which are used by the SALMA – Tagger by increasing their coverage will increase prediction accuracy.

11.3 Future work

This section explores four possible applications of the SALMA – Tagger, and the resources developed in this thesis to future work projects: improving the SALMA – Tagger; a syntactic parser; the international corpus of Arabic ICA; and as a tool for annotating phrase-breaks and other prosodic features in a corpus. The Tagger can also be integrated with similar level applications that combine two systems together to maximise the capabilities of both systems.

11.3.1 Improving the SALMA – Tagger

The evaluation of the SALMA – Tagger showed that the prediction rules for 7 morphological feature categories (namely: the subcategories of noun, gender, number, person, inflectional morphology, case or mood, case and mood marks, and the morphological feature of rational) achieved a slightly lower than expected prediction accuracy: 81.35%-97.51% for the CCA test sample and 74.25%-89.03% for the Qur’an test sample. The lower accuracy achieved with the 7 morphological feature categories was due to:

• The absence of contextual rules in the SALMA – Tagger, which treats words out of their context.
• The absence of short vowels in text, and especially MSA text. This makes the prediction of the attributes of some morphological features difficult.
• The interdependency between some morphological features such as the morphological features of inflectional morphology, case and mood, and case and mood marks. This decreases the accuracy of the dependent features by propagating errors from one feature to another.
• Prediction errors. These increase if the number of attributes of a certain morphological feature increases.
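The interdependency named in the list above can be sketched in a few lines. The sketch assumes only the dependency stated in this chapter: a declined noun has case, a conjugated imperfect verb has mood, and neither feature applies to invariable words or to perfect and imperative verbs. The function names and the string labels are hypothetical and chosen for readability; they are not SALMA tag values.

```python
def applicable_feature(main_pos, inflection, verb_subcategory=None):
    """Decide whether case, mood or neither applies to a word.

    The decision uses the predicted inflectional morphology, so a wrong
    value there is carried straight into the case or mood decision.
    """
    if main_pos == "noun" and inflection == "declined":
        return "case"
    if (main_pos == "verb" and inflection == "conjugated"
            and verb_subcategory == "imperfect"):
        return "mood"
    return "not applicable"


def count_propagated_errors(words):
    """Count words whose wrong inflection prediction changes the downstream decision.

    Each word is a tuple:
    (main_pos, verb_subcategory, gold_inflection, predicted_inflection).
    """
    propagated = 0
    for main_pos, verb_subcategory, gold, predicted in words:
        expected = applicable_feature(main_pos, gold, verb_subcategory)
        obtained = applicable_feature(main_pos, predicted, verb_subcategory)
        if expected != obtained:
            propagated += 1
    return propagated


sample = [
    ("noun", None, "declined", "invariable"),         # error reaches the case decision
    ("noun", None, "declined", "declined"),           # correct, nothing to propagate
    ("verb", "imperfect", "conjugated", "invariable"),  # error reaches the mood decision
    ("verb", "perfect", "invariable", "invariable"),  # mood never applies here
]
print(count_propagated_errors(sample))  # 2
```

Every upstream mistake in this toy sample that changes the applicability decision also makes the dependent feature wrong, which is the propagation effect the third bullet describes.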
To improve the accuracy of predicting the attributes of the morphological feature categories, three practical solutions can be implemented as a second phase of the development of the SALMA – Tagger. These solutions are:

• Contextual rules, which can be implemented as a second pass. The contextual rules will also help in reducing the number of candidate analyses of the analyzed words by excluding the analyses that do not satisfy certain contextual rules.
• Enriching Arabic dictionary entries with fine-grain morphological information such as gender, number, inflectional morphology, rationality, and transitivity and reusing this information in morphological analyzers. This will increase the accuracy of prediction.
• Updating the dictionaries and the linguistic lists which are used by the SALMA – Tagger by increasing their coverage. This will increase prediction accuracy.

Morphological feature categories such as rational depend on the semantic nature of the analyzed word itself. Therefore, the development of the morphological analyzer for Arabic text is an ongoing project, and the analyzer will be integrated with applications at different levels of analysis (i.e. phonology, syntax and semantics) on an information-sharing basis. The morphological analyzer integrated at these levels will provide detailed morphological information about words and at the same time will benefit from feedback from these levels of analysis.

11.3.2 A Syntactic Analyzer (parser) for Arabic Text

The SALMA – Tagger generates all possible analyses for the analyzed words out of their context. A disambiguation tool that selects a suitable analysis within a certain context is needed. A syntactic analyzer (parser) is required as a tool for automatically annotating the Arabic corpus with the correct syntactic information. It is also required to build the syntactic parse trees for Arabic corpus sentences.
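A contextual second pass of the kind proposed above, which discards candidate analyses that violate a contextual rule, might look like the following sketch. It is an illustration under stated assumptions, not a design from the thesis: the `Analysis` record, the rule function and the filtering strategy are all hypothetical. The single rule shown relies on the grammatical fact, also noted in the evaluation insights, that a preposition affects the case of the noun it governs (genitive).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate analysis of a word (hypothetical, heavily simplified)."""
    pos: str          # e.g. "noun", "verb", "preposition"
    case: str = "-"   # "nominative", "accusative", "genitive", or "-" if none


def genitive_after_preposition(previous, candidate):
    """Contextual rule: a noun that directly follows a preposition is genitive."""
    if (previous is not None and previous.pos == "preposition"
            and candidate.pos == "noun"):
        return candidate.case == "genitive"
    return True  # the rule has nothing to say about other contexts


def second_pass(candidates_per_word, rules):
    """Filter each word's candidates against the previous word's analysis.

    Simplification: context is used only when the previous word has been
    narrowed down to exactly one analysis.
    """
    filtered = []
    previous = None
    for candidates in candidates_per_word:
        kept = [c for c in candidates
                if all(rule(previous, c) for rule in rules)]
        if not kept:
            kept = list(candidates)  # never discard every analysis of a word
        filtered.append(kept)
        previous = kept[0] if len(kept) == 1 else None
    return filtered


sentence = [
    [Analysis("preposition")],
    [Analysis("noun", "nominative"),
     Analysis("noun", "accusative"),
     Analysis("noun", "genitive")],
]
result = second_pass(sentence, [genitive_after_preposition])
print(result[1])  # [Analysis(pos='noun', case='genitive')]
```

The filter only removes analyses, so it reduces ambiguity without replacing the first-pass output; a statistical disambiguator could then choose among whatever candidates remain.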
The aim of this project is to build a syntactic analyzer (parser) to annotate the Arabic corpus with the syntactic information for each word in the corpus. The aim of this corpus annotation is to create a Treebank corpus and a dependency Treebank of Arabic. These tools and standards will be tied into a specific corpus, but they can be reused to annotate any Arabic corpus to meet the needs of updating the contents of any Arabic corpus or building new Arabic corpora for specific purposes. The syntactic analyzer for Arabic text will depend on both the linguistic information extracted from traditional Arabic grammar books and the use of machine learning algorithms such as HMM and decision trees, to build the disambiguation tool that selects the appropriate morphosyntactic analysis of the word in its context.

The following resources and tools are needed to develop a syntactic analyzer (parser) for Arabic text:

• Morphological analysis tool and standard: The SALMA – Tagger and the SALMA – Tag Set are essential prerequisites for the syntactic parser, providing a detailed morphological analysis of all morphemes of words in the Arabic corpus.
• Linguistic model of Arabic sentence structure and the syntactic tag set: The methodology used to develop the fine-grain morphological features tag set, the SALMA – Tag Set, can be reused to develop a syntactic tag set that is based on traditional Arabic grammar. The syntactic tag set of Arabic will specify the types of Arabic sentences and phrases (i.e. verbal sentences, nominal sentences and phrases); the components of Arabic sentences and phrases (i.e. verb, subject, object and complement); the linguistic attributes (i.e. syntactic features) of each sentence component; and the forms of agreement between the sentence components.
• Representative Open Source Arabic Corpus: Very few open source Arabic corpora are available which can be used as seeds for the new representative open source Arabic corpus.
The available open source corpora are the Corpus of Contemporary Arabic (Al-Sulaiti and Atwell 2006), the Corpus of Traditional Arabic Dictionaries (Sawalha and Atwell 2010a), and the Quranic Arabic Corpus (Dukes et al. 2010). The first two corpora do not have any morphosyntactic annotation, but the Quranic Arabic Corpus is annotated with morphosyntactic analyses which can be reused by mapping the annotation to our standards.
• Evaluation Standards: The standard development methodology of the SALMA – Tagger can be reused to develop standards and guidelines to evaluate the syntactic parser. The evaluation standards will mainly depend on developing a gold standard for evaluation. The gold standard aims to be widely used by the Arabic NLP community and to be general purpose. It will be used as a standard for comparing different Arabic syntactic parsers. Therefore, the construction of the gold standard should follow specific guidelines for size, the corpora used in constructing it and its format. The gold standard should be large enough to cover most of the morphosyntactic phenomena that morphosyntactic analyzers have to handle. The corpus used to construct the gold standard should be representative, including text of different text domains, formats and genres, with both vowelized and non-vowelized Arabic text. The format of the gold standard will specify what information it has to include and in which format it has to be stored.

The Project Collaborators: this project is part of a future project that meets our interest in morphosyntactic analysis for Arabic text. Initial agreements have already been made between the project collaborators: Majdi Sawalha and Dr. Eric Atwell (Arabic Language Engineering team at the University of Leeds, UK); Professor Azzeddine Mazroui (Natural Language Processing team at the University of Mohammed I, Morocco); and Dr. AlMoutaz Bi-Allah Al-Sa’eed (Cairo University, Egypt).
11.3.3 Open Source Morphosyntactically Annotated Arabic Corpus

The main objective in developing the SALMA – Tagger and the syntactic parser (previous section) is to annotate the Arabic corpus with detailed morphosyntactic analyses of each word in the corpus. There is as yet no open source Arabic Corpus with full morphosyntactic annotation. The construction of such a corpus aims to advance Arabic NLP studies. The survey of Arabic corpora in section 2.2 showed that there are only two open source Arabic corpora eligible for morphosyntactic annotation. These existing corpora are the Corpus of Contemporary Arabic (Al-Sulaiti and Atwell 2006) and the Quranic Arabic Corpus (Dukes et al. 2010). The CCA is an MSA corpus of raw text, while the QAC represents Classical Arabic and has morphological and syntactic annotations. The Corpus of Traditional Arabic Dictionaries (Sawalha and Atwell 2010a) developed in this thesis is a special corpus of raw text which represents text from a period of 1,300 years.

A representative open-source Arabic corpus will be constructed by selecting text from different genres and formats including both vowelized and non-vowelized Arabic text. The previously mentioned open-source corpora can represent a seed for our corpus. Each document of the corpus will be described by adding information on date, author, country, topic/genre, vowelization information, source, etc. These descriptions can be used to train text classifiers.

An annotation tool and annotation guidelines are needed to achieve our objective. The annotation program should allow the annotator to annotate the corpus manually, to correct automatically tagged text by selecting the appropriate analysis from those produced by the morphological analyzer, and to correct the syntactic analysis generated automatically by the syntactic parser.
The annotation program should have capabilities for searching for morphosyntactic patterns in the annotated text, and for visualizing the sentences and the syntactic annotations as parse trees in a readable and representative way, with the added capacity to access parts of the parse tree and make corrections if necessary. The annotation program should also have an intelligent design that facilitates the annotation process. Some open source annotation tools already exist such as GATE (http://gate.ac.uk). Our annotation tools and analyzers can be integrated into GATE, which can help widen usage of the tools and standards that will be produced in this project.

The Morphosyntactic Analyses Training Corpus of Arabic is useful for developing machine learning algorithms, which require a training corpus of Arabic text annotated with the appropriate morphosyntactic analyses. Parts of the open source Arabic corpus can be manually/semi-automatically annotated using the developed tools to train the machine learning algorithms that will be used to build statistical models for morphosyntactic analyses of Arabic text corpora.

The project collaborators are: Majdi Sawalha and Dr. Eric Atwell (Arabic Language Engineering team at the University of Leeds, UK); Professor Azzeddine Mazroui (Natural Language Processing team at the University of Mohammed I, Morocco); and Dr. AlMoutaz Bi-Allah Al-Sa’eed (Cairo University, Egypt).

11.3.4 Arabic Phonetics and Phonology for Text Analytics and Natural Language Processing Applications

This research applies Text Analytics techniques honed on English for resource creation and corpus-based exploration of Arabic speech and language for Arabic Natural Language Processing (NLP) applications. Such techniques depend on a corpus or sample of naturally occurring language texts capturing empirical data on the phenomena being studied, for example prosodic-syntactic patterns in the vicinity of phrase breaks or perceived pauses in the speech stream.
Computational analysis of text also requires gold-standard (human) annotation of target phenomena and other linguistic knowledge inherent in text, such as part-of-speech (POS) categories. The approach is then to mine the annotations as well as plain text. Collaborators on this project have research interests and expertise in Corpus Linguistics, Artificial Intelligence, Text Analytics, and Lexicography for English and Arabic (Brierley and Atwell 2008; Dukes et al. 2010; Sawalha and Atwell 2010b).

One area to focus on is the prosody-syntax interface: this approach builds on previous work on English prosody and Text Analytics (Brierley and Atwell 2010) and involves mining rhythmic junctures to derive boundary templates and phrasing strategies from Arabic texts as diverse as transcribed speech recordings (e.g. Modern Standard Arabic newsreel), Classical Arabic poetry and Quranic Arabic. Some editions of the Quran have fine-grained prosodic-boundary annotations, inviting comparison with conventions for British and American English (e.g. ToBI (Beckman and Hirschberg 1994)). Collaborators will report on an essential pre-requisite for this approach: an Arabic pronunciation lexicon and automatic text annotation tool modelled on a similar tool for English (Brierley and Atwell 2008). The SALMA patterns dictionary enriched with syllable and primary stress information, and the SALMA Tagger and Vowelizer are required as part of the language-engineering toolkit for this project.

The project plans to represent significant boundary and phrasing patterns thus derived as categorical features for machine learning and to test these in phrase break models for Arabic Text-to-Speech Synthesis (TTS). Enhanced performance in TTS relates to the longer-term goal of achieving more realistic speech in virtual characters for both English and Arabic HCI (Human-Computer Interaction), with diverse applications in education, therapy and entertainment.
The collaborators on this project are: Majdi Sawalha, Claire Brierley and Eric Atwell (Arabic Language Engineering team at the University of Leeds, UK).

11.4 Summary: PhD impact, originality, and contributions to research field

Our research into morphosyntactic analysis of Arabic text corpora involves original scientific research, and focuses on the question of how to widen the scope of Arabic morphosyntactic analyses, to develop an NLP toolkit that can process Arabic text in a wide range of formats, domains, and genres, of both vowelized and non-vowelized Arabic text. This final section presents a brief summary of research contributions and achievements of this PhD.

11.4.1 Utilizing the Linguistic Wisdom and Knowledge in Arabic NLP

The inspiration behind this research is centuries-old linguistic wisdom and knowledge captured and readily available in traditional Arabic grammars and lexicons. The knowledge can be utilized in an Arabic NLP toolkit which can be accessed, standardized, reused and implemented in Arabic natural language processing. The detailed knowledge is applicable to both Classical and Modern Standard Arabic and can be used to restore orthographic (e.g. short vowels) and morphosyntactic features which signify important linguistic distinctions.

Fine-grained morphosyntactic analysis is possible, achievable and advantageous in processing Arabic text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications. We foresee the advantage of enriching the text with part-of-speech tags of very fine-grained grammatical distinctions, which reflect expert interest in syntax and morphology, but not specific needs of end-users, because end-user applications are not known in advance.

The objective of the thesis has been achieved through developing a novel language-engineering toolkit for morphosyntactic analysis of Arabic text, the SALMA – Tagger.
The SALMA – Tagger combines sophisticated modules that break down the complex morphological analysis problem into achievable tasks which each address a particular problem and also constitute stand-alone units. The novel language-engineering tool depends on two novel and original resources and standards: (i) the SALMA – Tag Set and (ii) the SALMA – ABCLexicon.

11.4.2 Dimensions of Contributions to Arabic NLP

This research has contributed to Arabic NLP in three dimensions: resources, standards and tools (i.e. practical software). The following is a list of the contributions classified into the three dimensions:

D. Resources
1. The SALMA – ABCLexicon: a novel broad-coverage lexical resource constructed by extracting information from many traditional Arabic lexicons of disparate formats, compiled over a period of 1,300 years.
2. The Corpus of Traditional Arabic Lexicons: a special corpus of Arabic which is compiled from the text of 23 traditional Arabic lexicons that cover a period of 1,300 years and shows the evolution of Arabic vocabulary. It contains about 14 million word tokens and about 2 million word types.
3. The morphological lists of the SALMA – Patterns Dictionary and the SALMA – Clitics and Affixes lists.
4. The several linguistic lists that are used by the SALMA – Tagger such as: function words list, named entities lists, broken plural list, conjugated and non-conjugated verbs list, and transitive verbs lists.
5. The Lemmatized version of the Arabic Internet Corpus.

E. Proposed standards
1. The SALMA – Tag Set: a morphological features tag set for Arabic text which captures long-established traditional morphological features of Arabic, in a compact yet transparent notation.
2. The SALMA – Gold Standard for evaluating morphological analyzers for Arabic text.
3. The MorphoChallenge 2009 Qur’an Gold Standard.
4. Proposed standards for developing morphological analyzers for Arabic text.
5.
Proposed standards for evaluating morphological analyzers for Arabic text.

F. Tools (practical software)
1. The SALMA – Tokenizer, which tokenizes the input text files and identifies the Arabic words, spell-checks and corrects the words, and identifies the words’ parts or morphemes.
2. The SALMA – Lemmatizer and Stemmer, which extracts the lemma and the root of the analysed word.
3. The SALMA – Pattern Generator, which is responsible for matching the word with its pattern.
4. The SALMA – Vowelizer, which is responsible for adding the short vowels to the analysed words.
5. The SALMA – Tagger module, which predicts the fine-grained morphological features for each of the analysed word’s morphemes.

Finally, a potential future application of these contributions is as a language-engineering toolkit for Arabic lexicography to construct Arabic monolingual and bilingual dictionaries (Section 10.3).

11.4.3 Impact

Journal and conference papers resulting from this thesis have addressed a range of research communities: computational linguistics, Arabic natural language processing, language resources and evaluation, linguistic studies (word structure analysis), and lexicography. These publications have already been cited by other researchers, such as Alotaiby, Alkharashi et al. (2009), Kurimo, Virpioja et al. (2009), Altabbaa, Al-Zaraee et al. (2010), Hamada (2010), Harrag, Hamdi-Cherif et al. (2010), Yusof, Zainuddin et al. (2010), Al-Jumaily, Martínez et al. (2011), and Hijjawi, Bandar et al. (2011).

References

Al-Bawaab, M. 2009. مواصفات نظام التحليل الصرفي في اللغة العربية Specifications of Arabic Morphological Analyzer. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria.
Al-Ghalayyni. 2005. جامع الدروس العربية "Jami' Al-Duroos Al-Arabia".
Saida - Lebanon: AlMaktaba Al-Asriyiah "المكتبة العصرية".
Al-Jumaily, H., Martínez, P., Martínez-Fernández, J. and Goot, E. v. d. 2011. A real time Named Entity Recognition system for Arabic text mining. Language Resources and Evaluation: 1-21.
al-Saydawi, Y. 2006. الكفاف: كتاب يعيد صوغ القواعد العربية Sufficiency: A Book Reformulating Arabic Grammar. Damascus, Syria: Dar Al-Fikr.
Al-Shalabi, R. 2005. Pattern-based Stemmer for Finding Arabic Roots. Information Technology Journal 4(1): 38-43.
Al-Shalabi, R., Kanaan, G. and Al-Serhan, H. 2003. New approach for extracting Arabic roots. ACIT '2003: Proceedings of The 2003 Arab conference on Information Technology, Alexandria, Egypt.
Al-Shammari, E. and Lin, J. 2008. A novel Arabic lemmatization algorithm. AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data, pp. 113-118. Singapore: ACM.
Al-Shamsi, F. and Guessoum, A. 2006. A Hidden Markov Model-Based POS Tagger for Arabic. 8es Journees internationales d'Analyse statistique des Donnees Textuelles.
Al-Sughaiyer, I. A. and Al-Kharashi, I. A. 2002. Rule Parser for Arabic Stemmer. Text, Speech and Dialogue, pp. 11-18. Springer Berlin / Heidelberg.
Al-Sughaiyer, I. A. and Al-Kharashi, I. A. 2004. Arabic morphological analysis techniques: A comprehensive survey. Journal of the American Society for Information Science and Technology 55(3): 189-213.
Al-Sulaiti, L. and Atwell, E. 2004. Designing and developing a corpus of contemporary Arabic. TALC 2004: Proceedings of the sixth Teaching And Language Corpora conference, pp. 92-93.
Al-Sulaiti, L. and Atwell, E. 2005. Extending the corpus of contemporary Arabic. Proceedings of Corpus Linguistics 2005.
Al-Sulaiti, L. and Atwell, E. 2006. The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics 11: 135-171.
ALECSO. 2008a. Arabic Derivation System.
ALECSO. 2008b.
Sarf - Arabic Morphology System. The Arab League Educational, Cultural and Scientific Organization (ALECSO).
Ali, A. S. M. 1987. A Linguistic Study of the development of Scientific Vocabulary in Standard Arabic. London and New York: Kegan Paul International.
Alotaiby, F., Alkharashi, I. A. and Foda, S. G. 2009. Processing Large Arabic Text Corpora: Preliminary Analysis and Results. Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt.
Alqrainy, S. 2008. A Morphological-Syntactical Analysis Approach For Arabic Textual Tagging. pp. 197. Leicester, UK: De Montfort University.
AlSerhan, H. and Ayesh, A. 2006. A Triliteral Word Roots Extraction Using Neural Network For Arabic. IEEE International Conference on Computer Engineering and Systems (ICCES06), pp. 436-440. Cairo, Egypt.
Altabbaa, M., Al-Zaraee, A. and Shukairy, M. A. 2010. An Arabic Morphological Analyzer and Part-Of-Speech Tagger Qutuf 'قُطُوف'. Faculty of Informatics Engineering, pp. 100. Damascus: Arab International University.
Atkins, B. T. S. and Rundell, M. 2008. The Oxford guide to practical lexicography. Oxford; New York: Oxford University Press.
Attia, M. A. 2007. Arabic Tokenization System. ACL-Workshop on Computational Approaches to Semitic Languages, Prague.
Attia, M. A. 2008. Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation. Faculty of Humanities, pp. 279. Manchester: University of Manchester.
Atwell, E. 2007. A cross-language methodology for corpus Part-of-Speech tag-set development. Proceedings of Corpus Linguistics 2007.
Atwell, E. 2008. Development of tag sets for part-of-speech tagging. In A. Ludeling and M. Kyto (eds.). Corpus Linguistics: An International Handbook, Volume 1, pp. 501-526. Mouton de Gruyter.
Atwell, E., Demetriou, G., Hughes, J., Schiffrin, A., Souter, C. and Wilcock, S. 2000.
A comparative evaluation of modern English corpus grammatical annotation schemes. ICAME Journal, International Computer Archive of Modern and medieval English, Bergen 24: 7-23.
Atwell, E. and Roberts, A. 2007. CHEAT: combinatory hybrid elementary analysis of text. Proceedings of CL'2007 Corpus Linguistics Conference.
Baayen, R. H., Piepenbrock, R. and Rijn, H. v. 1995. The CELEX Lexical Database. Release 2.
Baker, P., Hardie, A. and McEnery, T. 2006. A Glossary of Corpus Linguistics. Edinburgh, UK: Edinburgh University Press.
Bamman, D. and Crane, G. 2008. Building a Dynamic Lexicon from a Digital Library. Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2008), Pittsburgh.
Banko, M. and Brill, E. 2001. Scaling to Very Very Large Corpora for Natural Language Disambiguation. 39th annual meeting & 10th conference of the European Chapter, Toulouse, 9-11 July 2001. Morgan Kaufmann Publishers.
Banko, M. and Moore, R. C. 2004. Part of Speech Tagging in Context. 20th International Conference on Computational Linguistics (Coling 2004), pp. 556-561. Geneva, Switzerland: International Conference on Computational Linguistics.
Beckman, M. E. and Hirschberg, J. 1994. The ToBI Annotation Conventions.
Beesley, K. R. 1996. Arabic finite-state morphological analysis and generation. Proceedings of the 16th conference on Computational linguistics - Volume 1, Copenhagen, Denmark: Association for Computational Linguistics.
Beesley, K. R. 1998. Arabic morphology using only finite-state operations. Proceedings of the Workshop on Computational Approaches to Semitic Languages, Montreal, Quebec, Canada: Association for Computational Linguistics.
Benajiba, Y., Diab, M. T. and Rosso, P. 2008. Arabic named entity recognition using optimized feature sets. Proceedings of the Conference on Empirical Methods in Natural language Processing, EMNLP'08, pp. 248-293. Honolulu, Hawaii: Association for Computational Linguistics.
Benmamoun, E. 1999. Arabic morphology: The central role of the imperfective. Lingua 108: 175-201.
Bird, S., Klein, E. and Loper, E. 2009. Natural Language Processing with Python. 1st edn. O’Reilly Media, Inc.
Black, W. J. and El-Kateb, S. 2004. A Prototype English-Arabic Dictionary Based on WordNet. The Second Global Wordnet Conference 2004, Brno, Czech Republic, January 20-23, 2004, pp. 67-74.
Borin, L. 2000. Something Borrowed, Something Blue: Rule-Based Combination of POS Taggers. Proceedings of Second International Conference on Language Resources and Evaluation (LREC), pp. 21-26. Athens, Greece.
Boudlal, A., Belahbib, R., Lakhouaja, A., Mazroui, A., Meziane, A. and Bebah, M. O. A. O. 2011. A Markovian Approach for Arabic Root Extraction. The International Arab Journal of Information Technology 8(1): 91-98.
Boudlal, A., Lakhouaja, A., Mazroui, A., Meziane, A., Bebah, M. O. A. O. and Shoul, M. 2010. Alkhalil Morpho Sys: A Morphosyntactic analysis system for Arabic texts. IJCSI International Journal of Computer Science Issues.
Brierley, C. and Atwell, E. 2008. ProPOSEL: a human-oriented prosody and PoS English lexicon for machine learning and NLP. Proceedings of COLING 2008, CogALex Workshop on Cognitive Aspects of the Lexicon.
Brierley, C. and Atwell, E. 2010. Holy smoke: vocalic precursors of phrase breaks in Milton's Paradise Lost. Literary and Linguistic Computing Journal 25(2).
Buckwalter, T. 2002. Buckwalter Arabic Morphological Analyzer Version 1.0. Linguistic Data Consortium, catalog number LDC2002L49 and ISBN 1-58563-257-0.
Buckwalter, T. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. Linguistic Data Consortium, catalog number LDC2004L02 and ISBN 1-58563-324-0.
Cachia, P. 1973. The monitor: a dictionary of Arabic grammatical terms: Arabic-English, English-Arabic, compiled by Pierre Cachia. Beirut: Librairie du Liban.
Chan, P. K. and Stolfo, S. J. 1995. A Comparative Evaluation of Voting and Meta-learning on Partitioned Data.
Proceedings of International Conference on Machine Learning, pp. 90-98.
Clark, A. 2007. Supervised and Unsupervised Learning of Arabic Morphology. In A. Soudi, A. v. d. Bosch and G. Neumann (eds.). Arabic Computational Morphology, pp. 181-200. Springer.
Džeroski, S., Erjavec, T. and Zavrel, J. 2000. Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets. Proceedings of the Second International Conference on Language Resources and Evaluation. ELRA, pp. 1099-1104. Paris-Athens.
Dahdah, A. 1987. A Dictionary of Arabic Grammar in Charts and Tables "معجم قواعد اللغة العربيه – في جداول ولوحات". Beirut, Lebanon: Librairie du Liban publishers.
Dahdah, A. 1993. A dictionary of Arabic Grammatical nomenclature Arabic – English "معجم لغة النحو العربي عربي-انكليزي". Beirut, Lebanon: Librairie du Liban publishers.
Dejean, H. 2000. How to Evaluate and Compare Tagsets? A Proposal. Proceedings of the second international conference on Language Resources and Evaluation LREC 2000, Athens, Greece: European Language Resources Association (ELRA).
Diab, M. T., Hacioglu, K. and Jurafsky, D. 2004. Automatic Tagging of Arabic Text: From raw text to Base Phrase Chunks. Proceedings of HLT-NAACL 2004.
Diab, M. T. 2007. Towards an Optimal POS Tag Set for Arabic Processing. Proc RANLP.
Dichy, J. 2001. On lemmatization in Arabic, A formal definition of the Arabic entries of multilingual lexical databases. ACL/EACL 2001 Workshop on Arabic NLP, Toulouse, France, Friday 6 July 2001.
Dichy, J. 2009. A basic method for assessing Arabic morphological analysers: some crucial criteria. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria.
Dichy, J. and Farghaly, A. 2003. Roots & patterns vs.
stems plus grammar-lexis specifications: on what basis should a multilingual database centred on Arabic be built? MT Summit IX workshop: Machine translation for semitic languages, New Orleans, USA.
Dickinson, M. and Jochim, C. 2010. Evaluating Distributional Properties of Tagsets. Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), pp. 2522-2529. Valletta, Malta: European Language Resources Association (ELRA).
Dietterich, T. G. 2000. Ensemble Methods in Machine Learning. Lecture Notes in Computer Science, pp. 1-15.
Diwan, A.-H. 2004. المعجم النحوي لمفردات اللغة العربية The Syntactic Lexicon of Arabic Words. Aleppo, Syria: Fusselat Publishers.
Dror, J., Shaharabani, D., Talmon, R. and Wintner, S. 2004. Morphological Analysis of the Qur'an. Literary and Linguistic Computing 19(4): 431-452.
Duh, K. and Kirchhoff, K. 2005. POS Tagging of Dialectal Arabic: A Minimally Supervised Approach. ACL-05, Computational Approaches to Semitic Languages Workshop Proceedings, pp. 55-62. University of Michigan, Ann Arbor, Michigan, USA.
Dukes, K., Atwell, E. and Sharaf, A.-B. M. 2010. Syntactic Annotation Guidelines for the Quranic Arabic Dependency Treebank. Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta.
Dukes, K. and Habash, N. 2010. Morphological Annotation of Quranic Arabic. Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, 19-21 May 2010: European Language Resources Association (ELRA).
Dzeroski, S., Erjavec, T. and Zavrel, J. 2000. Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets. Proceedings of Second International Conference on Language Resources and Evaluation (LREC), pp. 1099-1104.
Elghamry, K. 2010. Broken Plurals. http://sites.google.com/site/elghamryk/arabiclanguageresources.
Elkateb, S., Black, W. and Farwell, D. 2006. Arabic WordNet and the Challenges of Arabic.
Proceedings of The Challenge of Arabic for NLP/MT International Conference at The British Computer Society (BCS), London.
Elkateb, S. and Black, W. J. 2001. Towards the Design of English-Arabic Terminological Knowledge Base. Proceedings of ACL 2000, Toulouse, France: 113-118.
Elliott, J. and Atwell, E. 2000. Is anybody out there?: the detection of intelligent and generic language-like features. JBIS: Journal of the British Interplanetary Society 53: 7-23.
Elworthy, D. 1995. Tagset design and inflected languages. In 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL), From Texts to Tags: Issues in Multilingual Language Analysis SIGDAT Workshop, pp. 1–10. Dublin.
Erjavec, T. 2010. MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias (eds.). Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), pp. 2544-2547. Valletta, Malta: European Language Resources Association (ELRA).
Escudero, G., Màrquez, L. and Rigau, G. 2000. A Comparison between Supervised Learning Algorithms for Word Sense Disambiguation. Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning, pp. 31-36. Lisbon, Portugal: Association for Computational Linguistics, Morristown, NJ, USA.
Eynde, V. E. and Gibbon, D. (eds.) 2000. Lexicon development for speech and language processing. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Freeman, A. 2001. Brill's POS Tagger and a Morphology Parser for Arabic. NAACL 2001 Student Research Workshop, Lancaster University.
Gasser, M. 2010. Expanding the Lexicon for a Resource-Poor Language Using a Morphological Analyzer and a Web Crawler. Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), pp. 342-347.
Valletta, Malta: European Language Resources Association (ELRA).
Glass, K. and Bangay, S. 2005. Evaluating Parts-of-Speech Taggers for Use in a Text-to-Scene Conversion System. SAICSIT '05: Proceedings of the 2005 annual research conference of the South African institute of computer scientists and information technologists on IT research in developing countries, pp. 20-28. White River, South Africa: South African Institute for Computer Scientists and Information Technologists.
Gopal, M., Mishra, D. and Singh, D. P. 2010. Evaluating Tagsets for Sanskrit. Sanskrit Computational Linguistics, Lecture Notes in Computer Science 6465/2010: 150-161.
Habash, N. 2004. Large Scale Lexeme Based Arabic Morphological Generation. JEPTALN 2004, Session Traitement Automatique de l’Arabe, Fès.
Habash, N., Faraj, R. and Roth, R. 2009. Syntactic Annotation in Columbia Arabic Treebank. 2nd International Conference on Arabic Language Resources & Tools MEDAR 2009, Cairo, Egypt.
Habash, N. and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, Michigan.
Habash, N. and Roth, R. M. 2009. CATiB: The Columbia Arabic Treebank. Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pp. 221–224. Suntec, Singapore: 2009 ACL and AFNLP.
Habash, N. 2010. Introduction to Arabic Natural Language Processing. Morgan & Claypool Publishers.
Hadrich, L. B. and Chaâben, N. 2006. Analyse et désambiguïsation morphologiques des textes arabes non voyellés. Actes de la 13ème édition de la conférence sur le Traitement Automatique des Langues Naturelles (TALN 2006), pp. 493-501. Belgique.
Hajič, J., Smrž, O., Zemánek, P., Šnaidauf, J. and Beška, E. 2004. Prague Arabic Dependency Treebank: Development in Data and Tools. Proceedings of NEMLAR International Conference on Arabic Language Resources and Tools, pp. 110–117. Cairo, Egypt.
Halteren, H. v., Zavrel, J. and Daelemans, W. 2001. Improving Accuracy in Word Class Tagging through the Combination of Machine Learning Systems. Computational Linguistics 27(2): 199-229.
Hamada, S. 2009a. المحللات الصرفية للغة العربية "Morphological Analyzers for Arabic". Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria.
Hamada, S. 2009b. مقترح لمعايير وضوابط تقييم المحللات الصرفية A proposal for evaluating morphological analyzers for Arabic text. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria. 26-28 April 2009.
Hamada, S. 2010. مقترح لمعايير وضوابط تقييم المحللات الصرفية Evaluation of the Arabic Morphological Analyzers. Proceedings of The Sixth International Computing science Conference ICCA, Hammamet, Tunisia.
Hamado, A.-M. B., Belghayth, L. and Sha’baan, N. 2009. الصرفي للغة العربية لمخبر "ميراكل" MORPH, morphological analyzer for Arabic text developed at MIRACL Labs. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria.
Hardie, A. 2003. Developing a tagset for automated part-of-speech tagging in Urdu. Proceedings of the Corpus Linguistics 2003 conference, ed. by D. Archer, P. Rayson, A. Wilson and T. McEnery. Department of Linguistics, Lancaster University: UCREL Technical Papers Volume 16.
Hardie, A. 2004. The computational analysis of morphosyntactic categories in Urdu. pp. 477.
Lancaster University.
Harmain, H. M. 2004. Arabic Part-of-Speech Tagging. Paper presented at the Fifth Annual U.A.E. University Research Conference, United Arab Emirates.
Harrag, F., Hamdi-Cherif, A. and Al-Salman, A. S. 2010. Comparative Study of Topic Segmentation Algorithms Based on Lexical Cohesion: Experimental Results on Arabic Language. The Arabian Journal for Science and Engineering 35: 138-202.
Haywood, J. A. and Nahmad, H. M. 1965. A New Arabic Grammar of the Written Language. London: Lund Humphries.
Hijjawi, M., Bandar, Z., Crockett, K. and Mclean, D. 2011. An Arabic Stemming Approach using Machine Learning with Arabic Dialogue System. ICGST International Conference on Artificial Intelligence and Machine Learning (AIML11), Dubai, UAE.
Hu, X. R. and Atwell, E. 2003. A survey of machine learning approaches to analysis of large corpora. In D. Archer, P. Rayson, A. Wilson and T. McEnery (eds.). Proceedings of SProLaC: Workshop on Shallow Processing of Large Corpora, pp. 657-661. Lancaster University.
Ingulfsen, T., Burrows, T. and Buchholz, S. 2005. Influence of Syntax on Prosodic Boundary Prediction. Proceedings, INTERSPEECH 2005, pp. 1817-1820.
Johansson, S., Atwell, E., Garside, R. and Leech, G. 1986. The Tagged LOB Corpus. Bergen, Norway: Norwegian Computing Centre for the Humanities.
Jurafsky, D. and Martin, J. H. 2008. Speech and Language Processing. New Jersey: Prentice Hall.
Kammoun, N. C., Belguith, L. H. and Hamadou, A. B. 2010. The MORPH2 new version: A robust morphological analyzer for Arabic text. JADT 2010: 10th International Conference on Statistical Analysis of Textual Data, SAPIENZA, Italy.
Khafaji, R. 2001. Punctuation Marks in original Arabic texts. Zeitschrift für Arabische Linguistik 40(2001): 7-24.
Khalil, H. 1998. Dirasat fi al-lughah wa al-ma'ajim "دراسات في اللغة والمعاجم" Studies of language and lexicons. Beirut, Lebanon: Dar al-nahdhah al-arabiah.
Khoja, S. 2001. APT: Arabic Part-of-Speech Tagger.
Student Workshop at the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL2001), Carnegie Mellon University, Pittsburgh, Pennsylvania.
Khoja, S. 2003. APT: An Automatic Arabic Part-of-Speech Tagger. Computing Department, pp. 157. Lancaster, UK: Lancaster University.
Khoja, S., Garside, P. and Knowles, G. 2001. A tagset for the morphosyntactic tagging of Arabic. Corpus Linguistics 2001, Lancaster University, Lancaster, UK.
Kiraz, G. A. 2001. Computational Nonlinear Morphology with Emphasis on Semitic Languages. Cambridge: Cambridge University Press.
Koskenniemi, K. 1983. Two-Level Morphology. University of Helsinki.
Kurimo, M., Virpioja, S. and Turunen, V. T. 2009. Overview and Results of Morpho Challenge 2009. Proceedings of the workshop of Unsupervised Morpheme Analysis MorphoChallenge at CLEF 2009 (Cross Language Evaluation Forum), Corfu, Greece.
Lane, E. W. 1968. An Arabic-English Lexicon. 7: 117-119.
Larkey, L. S. and Connell, M. E. 2001. Arabic Information Retrieval at UMass in TREC-10. The Tenth Text REtrieval Conference (TREC 2001). Gaithersburg: NIST.
Leech, G. and Wilson, A. 1996. EAGLES: Recommendations for the Morphosyntactic Annotation of Corpora.
Leech, G. and Wilson, A. 1999. Standards for Tagsets. In H. v. Halteren (ed.). Syntactic Wordclass Tagging, pp. 55-80. Kluwer Academic Publishers.
Liberman, M. Y. and Church, K. W. 1992. Text Analysis and Word Pronunciation in Text-to-Speech Synthesis. In S. Furui and M. M. Sondhi (eds.). Advances in Speech Signal Processing. New York: Marcel Dekker Inc.
Maamouri, M. and Bies, A. 2004. Developing an Arabic Treebank: Methods, Guidelines, Procedures, and Tools. Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004).
Maamouri, M., Bies, A., Buckwalter, T. and Mekki, W. 2004. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. NEMLAR Conference on Arabic Language Resources and Tools, Cairo, Egypt.
MacKinlay, A. 2005. The Effects of Part-of-Speech Tagsets on Tagger Performance. The Department of Computer Science and Software Engineering, pp. 44. Melbourne, Australia: University of Melbourne.
Marques, N. C. and Lopes, G. P. 2001. Tagging with Small Training Corpora. Advances in Intelligent Data Analysis, pp. 63-72. Springer Berlin / Heidelberg.
Marsi, E., Bosch, A. v. d. and Soudi, A. 2005. Memory-based morphological analysis generation and part-of-speech tagging of Arabic. Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pp. 1-8. Ann Arbor: Association for Computational Linguistics.
Mazroui, A. e., Meziane, A.-w., Lakhouaja, A.-H., Bebaha, M., Boudlal, A.-R. and Belhabeeb, R. 2009. محلل صرفي للكلمات العربية خارج النص وداخله Morphological analyzer for Arabic text in-context and out of context. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria.
McCarthy, J. and Prince, A. 1990a. Foot and word in prosodic morphology: The Arabic broken plurals. Natural Language & Linguistic Theory 8: 209–282.
McCarthy, J. and Prince, A. 1990b. Prosodic morphology and templatic morphology. In M. Eid and J. McCarthy (eds.). Perspectives on Arabic Linguistics: Papers from the Second Symposium, pp. 1–54. Amsterdam: Benjamins.
Melamed, D. and Resnik, P. 2000. Tagger Evaluation Given Hierarchical Tag Sets. Computers and the Humanities 34: 79-84.
Monachini, M. and Calzolari, N. 1996. Synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora. A common proposal and applications to European languages. Istituto di Linguistica Computazionale - CNR.
Mousser, J. 2010. A Large Coverage Verb Taxonomy For Arabic.
Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), pp. 2675-2681. Valletta, Malta: European Language Resources Association (ELRA).
Nicolas, L., Sagot, B., Farré, J. and Clergerie, É. d. L. 2008. Computer aided correction and extension of a syntactic wide-coverage lexicon. Proceedings of COLING 2008 22nd International Conference on Computational Linguistics, Manchester, UK.
Ooi, V. B. Y. 1998. Computer corpus lexicography. Edinburgh: Edinburgh University Press.
Paikens, P. 2007. Lexicon-Based Morphological Analysis of Latvian Language. Proceedings of the 3rd Baltic Conference on Human Language Technologies, pp. 235–240. Kaunas.
Pauw, G. D. and Schryver, G.-M. D. 2008. Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes. Lexikos 18 (AFRILEX-reeks/series 18: 2008): 303-318.
Petasis, G., Karkaletsis, V., Farmakiotou, D., Samaritakis, G., Androutsopoulos, I. and Spyropoulos, C. D. 2001. A Greek Morphological Lexicon and its Exploitation by Greek Controlled Language Checker. In Y. Manolopoulos and S. Evripidou (eds.). Proceedings of the 8th Panhellenic Conference in Informatics, pp. 80–89. Nicosia, Cyprus.
Porter, M. F. 1980. An algorithm for suffix stripping. Program 14(3): 130-137.
Roark, B. and Sproat, R. W. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press.
Rodríguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M. and Martí, M. A. 2008. Arabic WordNet: Semi-automatic Extensions using Bayesian Inference. The 6th Conference on Language Resources and Evaluation LREC2008, Marrakech, Morocco.
Russell, G. J., Pulman, S. G., Ritchie, G. D. and Black, A. W. 1986. A dictionary and morphological analyser for English. Proceedings of the 11th conference on Computational linguistics, Bonn, Germany: Association for Computational Linguistics.
Ryding, K. C. 2005. A Reference Grammar of Modern Standard Arabic. Cambridge University Press.
Sabir, M. and Abdul-Mun’im, A.-M. i. 2009. برنامج (مداد) للتحليل الصرفي للكلمات العربية MIDAD morphological analyzer for Arabic text. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria.
Sagot, B. 2005. Automatic acquisition of a Slovak Lexicon from a Raw Corpus. Lecture Notes in Artificial Intelligence 3658: 156-163. Springer-Verlag.
Sagot, B. 2010. The Lefff, a Freely Available and Large-coverage Morphological and Syntactic Lexicon for French. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias (eds.). Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), pp. 2744-2751. Valletta, Malta: European Language Resources Association (ELRA).
Sagot, B., Clement, L., Clergerie, E. V. d. L. and Boullier, P. 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. Proceedings of the fifth international conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy: European Language Resources Association (ELRA).
Sánchez León, F. and Nieto Serrano, A. F. 1997. Retargeting a tagger. Corpus Annotation, ed. by Garside, Leech & McEnery, 163-64. London: Longman.
Sawalha, M. and Atwell, E. 2008. Comparative evaluation of Arabic language morphological analysers and stemmers. Proceedings of COLING 2008 22nd International Conference on Computational Linguistics, Manchester, UK.
Sawalha, M. and Atwell, E. 2009a. Linguistically Informed and Corpus Informed Morphological Analysis of Arabic. Proceedings of the 5th International Corpus Linguistics Conference CL2009, Liverpool, UK.
Sawalha, M. and Atwell, E. 2009b.
توظيف قواعد النحو والصرف في بناء محلل صرفي للغة العربية (Adapting Language Grammar Rules for Building Morphological Analyzer for Arabic Language). Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria.
Sawalha, M. and Atwell, E. 2010a. Constructing and Using Broad-Coverage Lexical Resource for Enhancing Morphological Analysis of Arabic. Language Resource and Evaluation Conference LREC 2010, Valletta, Malta: European Language Resources Association (ELRA).
Sawalha, M. and Atwell, E. 2010b. Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. Language Resource and Evaluation Conference LREC 2010, Valletta, Malta: European Language Resources Association (ELRA).
Sawalha, M. and Atwell, E. 2011a. Accelerating the Processing of Large Corpora: Using Grid Computing Technologies for Lemmatizing 176 Million Words Arabic Internet Corpus. Advanced Research Computing Open Event, University of Leeds, Leeds, UK.
Sawalha, M. and Atwell, E. 2011b. Corpus Linguistics Resources and Tools for Arabic Lexicography. Workshop on Arabic Corpus Linguistics, Lancaster University, Lancaster, UK.
Sawalha, M. and Atwell, E. 2011c. التحليل الصرفي لنصوص اللغة العربية الحديثة والكلاسيكية "Morphological Analysis of Classical and Modern Standard Arabic Text". 7th International Computing Conference in Arabic (ICCA11), Imam Mohammed Ibn Saud University, Riyadh, KSA.
Sawalha, M. and Atwell, E. Under review. A Theory Standard Tag Set Expounding Traditional Morphological features for Arabic Language Part-of-Speech Tagging. Word structure journal, Edinburgh University Press.
Schmid, H. and Laws, F. 2008. Estimation of Conditional Probabilities with Decision Trees and an Application to Fine-Grained POS Tagging. COLING'08, Manchester, UK.
Sharoff, S. 2006.
Creating General-Purpose Corpus Using Automated Search Engine Queries. In M. Baroni and S. Bernardini (eds.). WaCky! Working papers on the Web as Corpus, pp. 63-98. Bologna: GEDIT.
Sharoff, S., Kopotev, M., Erjavec, T., Feldman, A. and Divjak, D. 2008. Designing and Evaluating a Russian Tagset. LREC 2008: Proceedings of the sixth international conference on Language Resources and Evaluation.
Smrž, O. 2007. Functional Arabic Morphology: Formal System and Implementation. Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, pp. 104. Prague: Charles University in Prague.
Smrž, O. 2009. ElixirFM Functional Arabic Morphology: Case Studies. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria. 26-28 April 2009.
Smrž, O., Bielický, V., Kouřilová, I., Kráčmar, J., Hajič, J. and Zemánek, P. 2008. Prague Arabic Dependency Treebank: A Word on the Million Words. Proceedings of the Workshop on Arabic and Local Languages (LREC 2008), pp. 16–23. Marrakech, Morocco.
Sonbul, R., Ghnaim, N. and Dusouqi, M. S. 2009. نظام تحليل صرفي موجّه بالتطبيقات An Application Oriented Arabic Morphological Analyzer. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology (KACST) and Arabic Language Academy, Damascus, Syria. 26-28 April 2009.
Soudi, A., Bosch, A. v. d. and Neumann, G. (eds.) 2007. Arabic Computational Morphology. Knowledge-based and Empirical Methods. Dordrecht, The Netherlands: Springer.
Soudi, A., Cavalli-Sforza, V. and Jamari, A. 2001. A Computational Lexeme-Based Treatment of Arabic Morphology. ACL/EACL 2001 Workshop on Arabic NLP, Toulouse, France, Friday 6 July 2001.
Tadić, M. and Fulgosi, S. 2003. Building the Croatian morphological lexicon. Proceedings of the 2003 EACL Workshop on Morphological Processing of Slavic Languages, Budapest, Hungary: Association for Computational Linguistics.
Talmon, R. and Wintner, S. 2003. Morphological Tagging of the Qur'an. In Proceedings of the Workshop on Finite-State Methods in Natural Language Processing, an EACL'03 Workshop, Budapest, Hungary.
Teahan, B. 1998. Modeling English Text. Department of Computer Science, New Zealand: University of Waikato.
Teufel, S., Schmid, H., Heid, U. and Schiller, A. 1996. Study of the relation between tagsets and taggers. Stuttgart, Germany: Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.
Thabet, N. 2004. Stemming the Qur’an. COLING 2004, Workshop on computational approaches to Arabic script-based languages, August 28, 2004, pp. 85-88.
Tlili-Guiassa, Y. 2006. Hybrid Method for Tagging Arabic Text. Journal of Computer Science 2(3): 245-248.
Taylor, P. and Black, A. W. 1998. Assigning Phrase-Breaks from Part-of-Speech Sequences. Computer Speech and Language 12(2): 99-117.
Voutilainen, A. 2003. Part-of-Speech Tagging. In R. Mitkov (ed.). The Oxford Handbook of Computational Linguistics, pp. 219-232. Oxford University Press.
Wald Abah, M. A. 2008. تاريخ النحو العربي في المشرق والمغرب History of Arabic Grammar in the East and the West. Beirut, Lebanon: Dar Al-Kutub Al-Alamyyah.
Wright, W. 1996. A Grammar of the Arabic Language, Translated from the German of Caspari, and Edited with Numerous Additions and Corrections. Beirut: Librairie du Liban.
Ya‘qūb, I. B. 1996. Mu‘jam al-awzān al-sarfiyah معجم الأوزان الصرفية. Beirut, Lebanon: ‘ālam al-Kutub.
Yonghui, G., Baomin, W., Changyuan, L. and Bingxi, W. 2006. Correlation Voting Fusion Strategy for Part of Speech Tagging. 8th International Conference on Signal Processing Proceedings, ICSP2006.
Yousfi, A. 2010. The morphological analysis of Arabic verbs by using the surface patterns.
IJCSI International Journal of Computer Science Issues 7(3(11)): 33-36. Yusof, R. J. R., Zainuddin, R. and Baba, M. S. 2010. Qur'anic Words Stemming. The Arabian Journal for Science and Engineering 35(2C): 37-49. Zaenen, A., Carletta, J., Garretson, G., Bresnan, J., Koontz-Garboden, A., Nikitina, T., O’Connor, M. C. and Wasow, T. 2004. Animacy encoding in English: Why and how. In Proceedings of the ACL-04 Workshop on Discourse Annotation. Zaied, M. 2009. ‫" تقرير في المحلالت الصرفية العربية‬Report on Arabic Morphological Analyzers". Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology ( KACST) and Arabic Language Academy., Damascus, Syria. Zarrouki, T. and Kebdani, M. 2009. ‫ تجربة وآفاق‬،‫مشروع أية –سبل القاموس العربي للتدقيق اإلمالئي مفتوح المصدر‬ Aya-Spell Project, An Open-source Arabic Spell Checker Dictionary, experience and Future Work. Proceedings of the workshop of morphological analyzer experts for Arabic language, organized by Arab League Educational, Cultural and Scientific Organization (ALECSO), King Abdul-Aziz City of Science and Technology ( KACST) and Arabic Language Academy., Damascus - Syria. Zeman, D. 2008. Reusable Tagsets Conversion Using Tagset Drivers. Proceedings of the Sixth conference on International Language Resources and Evaluation (LREC'08), pp. 213-218. Marrakech, Morocco: European Language Resources Association (ELRA). Zerrouki, T. and Balla, A. 2009. Implementation of infixes and circumfixes in the spellcheckers. 2nd International Conference on Arabic Language Resources and Tools, Cairo - Egypt. Zibri, C. B. O., Torjmen, A. and Ahmad, M. B. 2006. An Efficient Multi-agent system Combining POS-Taggers for Arabic Texts. CICLing 2006, LNCS 3878(pp.121131). Zolfagharifard, E. 2009. Anti-terror technology tool uses human logic. The Engineer. 
Appendix A
The SALMA Tag Set for Arabic text

The SALMA Morphological Features Tag Set (SALMA, Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of Arabic in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages, and then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in the analysis of grammar and morphology. A range of existing Arabic Part-of-Speech tag sets are illustrated and compared, and we review generic design criteria for corpus tag sets. For a morphologically rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and its possible values.

In our analysis, a tag consists of 22 characters; each position represents a feature, and the letter at that position represents a value or attribute of the morphological feature; the dash “-” marks a feature that is not relevant to a given word. The first character shows the main Part of Speech: noun, verb, particle, punctuation or Other (residual); the last two extend the traditional three classes to handle modern texts. Characters 2, 3 and 4 represent subcategories: traditional Arabic grammar recognizes 34 subclasses of noun (position 2), 3 subclasses of verb (position 3) and 21 subclasses of particle (position 4). Others (residuals) and punctuation marks are represented in positions 5 and 6 respectively. The next positions represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), and declension and conjugation (18). Finally, four characters represent morphological information which is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), and types of nouns according to their final letters (22).

The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard, to simplify and promote comparisons between, and reuse of, Arabic taggers and tagged corpora. Figure A.1 shows a sample tagged sentence from the Qur’an, and Figure A.2 shows the morphological categories and attributes of a selected word in more detail.

Word | Gloss | Morpheme | Morpheme gloss | Tag
wa waṣṣaynā | And We have enjoined | wa | And | p--c------------------
 | | waṣṣay | Have enjoined | v-p---mpfs-s-amohvtt&-
 | | nā | We | r---r-xpfs-s----hn----
al-’insāna | (on) man | al- | The | r---d-----------------
 | | ’insāna | man | nq----ms-pafd---htbt-s
bi-wālidayhi | His parents | bi | To | p--p------------------
 | | wālida | Parents | nu----md-vgki---htot-s
 | | y | Both | r---r-xdts-s----------
 | | hi | His | r---r-msts-k----------
ḥusnan | Kindness | ḥusn | kindness | ng----ms-vafi---ndst-s
 | | an | | r---k------f----------
Figure A.1 Sample of tagged document of vowelized Qur’an text using SALMA Tag Set

Position | Feature | Tag | Attribute
1 | Main Part-of-Speech | n | Noun
2 | Part-of-Speech: Noun | q | Generic noun
3 | Part-of-Speech: Verb | - | -
4 | Part-of-Speech: Particle | - | -
5 | Part-of-Speech: Other | - | -
6 | Punctuation marks | - | -
7 | Gender | m | Masculine
8 | Number | s | Singular
9 | Person | - | -
10 | Inflectional morphology | p | Non-declinable
11 | Case and Mood | a | Accusative
12 | Case and Mood marks | f | fatḥah
13 | Definiteness | d | Defined
14 | Voice | - | -
15 | Emphasized and non-emphasized | - | -
16 | Transitivity | - | -
17 | Rational | h | Rational
18 | Declension and Conjugation | t | Primitive / Concrete noun
19 | Unaugmented and Augmented | b | Augmented by two letters
20 | Number of root letters | t | Triliteral
21 | Verb root | - | -
22 | Noun Finals | s | Sound
Figure A.2 SALMA tag structure
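The positional notation can be illustrated with a minimal Python sketch that splits a 22-character SALMA tag into its filled positions. This is only an illustration under stated assumptions: the position names follow Table A.1 and the main part-of-speech letters (n, v, p, r, u) follow position 1 of the tag set, while the names `POSITION_NAMES`, `MAIN_POS` and `split_tag` are hypothetical helpers introduced here and are not part of the SALMA tools.

```python
# Hypothetical helper names; only the position labels and the
# position-1 letters are taken from the SALMA tag set description.
POSITION_NAMES = [
    "Main Part-of-Speech", "Part-of-Speech: Noun", "Part-of-Speech: Verb",
    "Part-of-Speech: Particle", "Part-of-Speech: Other", "Punctuation marks",
    "Gender", "Number", "Person", "Inflectional Morphology", "Case or Mood",
    "Case and Mood Marks", "Definiteness", "Voice",
    "Emphasized and Non-emphasized", "Transitivity", "Rational",
    "Declension and Conjugation", "Unaugmented and Augmented",
    "Number of Root Letters", "Verb Root", "Noun Finals",
]

MAIN_POS = {"n": "Noun", "v": "Verb", "p": "Particle",
            "r": "Other (Residual)", "u": "Punctuation"}

TAG_LENGTH = 22


def split_tag(tag):
    """Return {position: (feature name, tag letter)} for every position
    that is not marked with the dash '-' (feature not relevant)."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"a SALMA tag has {TAG_LENGTH} characters, got {len(tag)}")
    return {
        position: (name, letter)
        for position, (name, letter) in enumerate(zip(POSITION_NAMES, tag), start=1)
        if letter != "-"
    }


if __name__ == "__main__":
    # Tag of the selected word in Figure A.2.
    for position, (name, letter) in split_tag("nq----ms-pafd---htbt-s").items():
        print(position, name, letter)
```

Decoding the letters at positions 2 to 22 into attribute names would need one lookup table per position, built from the attribute tables that follow.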
Table A.1 SALMA Tag Set categories
Position | Morphological feature category | Transliteration
1 | Main Part-of-Speech | ’aqsām al-kalām ar-ra’īsiyya
2 | Part-of-Speech: Noun | ’aqsām al-kalām al-far‘iyya (al-’ism)
3 | Part-of-Speech: Verb | ’aqsām al-kalām al-far‘iyya (al-fi‘l)
4 | Part-of-Speech: Particle | ’aqsām al-kalām al-far‘iyya (al-ḥarf)
5 | Part-of-Speech: Other | ’aqsām al-kalām al-far‘iyya (’uẖrā)
6 | Punctuation marks | ’aqsām al-kalām al-far‘iyya (‘alāmāt at-tarqīm)
7 | Gender | al-muḏakkar wa al-mu’annaṯ
8 | Number | al-‘adad
9 | Person | al-’isnād
10 | Inflectional Morphology | aṣ-ṣarf
11 | Case or Mood | al-ḥāla al-’i‘rābiyya lil-’ism ’aw al-fi‘l
12 | Case and Mood Marks | ‘alāmāt al-’i‘rāb wa al-binā’
13 | Definiteness | al-ma‘rifati wa an-nakirati
14 | Voice | al-mabnī lil-ma‘lūm wa al-mabnī lil-mağhūl
15 | Emphasized and Non-emphasized | al-mu’akkad wa ḡayir al-mu’akkad
16 | Transitivity | al-lāzim wa al-muta‘addi
17 | Rational | al-‘āqil wa ḡayir al-‘āqil
18 | Declension and Conjugation | at-taṣrīf
19 | Unaugmented and Augmented | al-muğarrad wa al-mazīd
20 | Number of Root Letters | ‘adad ’aḥruf al-ğaḏr
21 | Verb Root | bunya al-fi‘l
22 | Noun Finals | ’aqsām al-’ismi tib‘an li-lafẓi ’āẖirhi

A.1 Position 1; Main part-of-speech
Table A.2 Main part-of-speech category attributes and tags at position 1
Position 1. Feature name: Main Part-of-Speech (’aqsām al-kalām ar-ra’īsiyya)
Attribute | Transliteration | Example | Tag
Noun | ’ism | kitāb ‘book’ | n
Verb | fi‘l | katab ‘wrote’ | v
Particle | ḥarf | ‘alā ‘on’ | p
Other (Residual) | ’uẖrā | kātiba ‘writer / Fem’ | r
Punctuation | ‘alāmat tarqīm | qāla : ’anā ḏāhib ‘he said: I am leaving’ | u

A.2 Position 2; Part-of-Speech Subcategories of Noun
Table A.3 Part-of-Speech subcategories of Noun attributes and their tags at position 2
Position 2. Feature name: Part-of-Speech: Noun (’aqsām al-kalām al-far‘iyya (al-’ism))
Attribute | Transliteration | Example | Tag
Gerund / Verbal noun | al-maṣdar | ḍarb ‘hitting’ | g
Gerund / verbal noun with initial mῑm | al-maṣdar al-mῑmῑ | maw‘id ‘date’ | m
Gerund of instance | maṣdar al-marrah | naẓrah ‘one look’ | o
Gerund of state | maṣdar al-hay’a / maṣdar al-naw’ | ğilsah ‘sitting position’ | s
Gerund of emphasis | maṣdar al-tawkῑd | ḥaṭṭamtu al-ẖizāna taḥṭīman ‘I completely destroyed the wardrobe’ | e
Gerund of profession | al-maṣdar al-ṣināῑ | furūsiyyah ‘Horsemanship’ | i
Pronoun | al-ḍamῑr | huwa ‘He’ | p
Demonstrative pronoun | ’ism al-’šārah | hāḏā ‘This’ | d
Specific relative pronoun | ’ism al-mawṣūl al-ẖāṣ | al-laḏī ‘Who’ | r
Non-specific relative pronoun | ’ism al-mawṣūl al-muštarak | man ‘Who’ | c
Interrogative pronoun | ’ism al-’istfhām | man ‘Who?’ | b
Conditional noun | ’ism al-šarṭ | aynamā ‘wherever’ | h
Allusive noun | al-kināyah | kaḏā ‘as well as’ | a
Adverb | aẓ-ẓarf | yawm ‘day’ | v
Active participle | ’ism al-fā‘il | ḍārib ‘hitter’ | u
Intensive active participle | mubālaḡat ’ism al-fā‘il | ğarraḥ ‘Surgeon’ | w
Passive participle | ’ism al-mf‘ūl | maḍrūb ‘Struck’ | k
Adjective | aṣ-ṣifa al-mušabbahah | ṭawīl ‘tall’ | j
Noun of place | ’ism al-mkān | maktab ‘office’ | l
Noun of time | ’ism zamᾱn | maṭla‘ ‘start time’ | t
Instrumental noun | ’ism al-’āla | minšār ‘Saw’ | z
Proper noun | ’ism al-‘alam | fāṭimah ‘Fatima’ | n
Generic noun | ’ism al-ğins | hiṣān ‘Horse’ | q
Numeral | ’ism al-‘adad | ṯalāṯa ‘Three’ | +
Verb-like noun | ’ism al-fi‘l | hayhāt ‘Wishing’ | &
Five nouns | al-’asmā’ al-ẖamsah | ’ab ‘Father’ | f
Relative noun | ’ism mansūb | ‘ilmiyyun ‘Scientific’ | *
Diminutive | ’ism taṣḡīr | šuğayrah ‘Bush’ | y
Form of exaggeration | ṣῑḡat mubālaḡah | ğabbār ‘Tremendous’ | x
Collective noun | ’ism ğam‘ | qawm ‘Folk’ | $
Plural generic noun | ’ism ğins ğam‘ī | tuffāḥ ‘Apple’ | #
Elative noun | ’ism tafḍῑl | ’afḍal ‘Better’ | @
Blend noun | ’ism manḥūt | basmalah ‘bismallah’ | %
Ideophonic interjection | ’ism ṣawt | ’āh ‘Ah’ | !

A.3 Position 3; Part-of-Speech Subcategories of Verb
Table A.4 Part-of-Speech subcategory of verb attributes and their tags at position 3
Position 3. Feature name: Part-of-Speech: Verb (’aqsām al-kalām al-far‘iyya (al-fi‘l))
Attribute | Transliteration | Example | Tag
Perfect verb | fi‘l māḍ | kataba ‘He wrote’ | p
Imperfect verb | fi‘l muḍāri‘ | yaktubu ‘He is writing’ | c
Imperative verb | fi‘l al-’amr | ’uktub ‘write’ | i

A.4 Position 4; Part-of-Speech Subcategories of Particle
Table A.5 Part-of-speech subcategories of Particles attributes and their tags at position 4
Position 4. Feature name: Part-of-Speech: Particle (’aqsām al-kalām al-far‘iyya (al-ḥarf))
Attribute | Transliteration | Example | Tag
Jussive-governing particle | ḥarf ğazim | lam ‘No’ | j
Subjunctive-governing particle | ḥarf naṣib | kay ‘So that’ | o
Partially subjunctive-governing particle | ḥarf naṣib far‘ῑ | ḥattā ‘till’ | u
Preposition | ḥarf ğarr | ’ilā ‘To’ | p
Annulling particle | ḥarf nāsiẖ | mā ‘No’ | a
Conjunction | ḥarf ‘aṭif | wa ‘And’ | c
Vocative particle | ḥarf nidā’ | yā ‘Oh’ | v
Exceptive particle | ḥarf ’stiṯnā’ | ’illā ‘Except’ | x
Interrogative particle | ḥarf ’stifhām | hal ‘Is?’ | i
Particle of futurity | ḥarf ’stiqbāl | sawfa ‘will’ | f
Causative particle | ḥarf ta‘lῑl | kay ‘To’ | s
Negative particle | ḥarf nafῑ | lam ‘No’ | n
Jurative particle | ḥarf qasam | bi ‘swear’ | q
Yes/No response particle | ḥarf ğawāb | na‘am ‘Yes’ | w
Jussive-governing conditional particle | ḥarf šart ğāzim | ’in ‘If’ | k
Particle of incitement | ḥarf taḥḍῑḍ | hallā ‘would’ | m
Gerund-equivalent particle | ḥarf maṣdarῑ | ’an ‘To’ | g
Particle of attention | ḥarf tanbῑh | ’alā ‘careful’ | t
Emphatic particle | ḥarf tawkῑd | ’inna ‘emphasis’ | z
Explanatory particle | ḥarf tafsῑr | ’ay ‘i.e.’ | d
Particle of comparison | ḥarf tašbῑh | ka’anna ‘similar’ | l
Non-governing particles | ḥarf ḡayr ‘āmil | qad ‘already or perhaps’ | b

A.5 Position 5; Part-of-Speech Subcategories of Other (Residuals)
Table A.6 Part-of-speech subcategories of Other (Residuals) attributes and their tags at position 5
Position 5. Feature name: Part-of-Speech: Other (’aqsām al-kalām al-far‘iyya (’uẖrā))
Attribute | Transliteration | Example | Tag
Prefix | ziyāda fῑ ’awal al-kalimah | ’istaktabanī ‘he employed me as a writer’ | p
Suffix | ziyāda fῑ ’āẖir al-kalimah | ’aṣdiqā’ ‘Friends’ | s
Suffixed pronoun | ḍamīr mutaṣil | kitabahu ‘his book’ | r
tā' marbūṭah | tā’ marbūṭa | kātibatun ‘she-writer’ | t
Relative yā' | yā’ an-nisbah | ‘arabiyy ‘Arabian’ | y
tanwῑn | tanwῑn | kitābun ‘a book’ | k
tā' of femininization | tā’ al-ta’nῑṯ | katabat ‘she wrote’ | f
nūn of protection | nūn al-wiqāyah | sa’alanī ‘he asked me’ | n
Emphatic nūn | nūn al-tawkῑd | yaḍribanna ‘They are hitting’ | z
Imperfect prefix | ḥarf muḍāra‘ah | yas’alu ‘He is asking’ | a
Definite article | ’adāt ta‘rῑf | al-kitāb ‘The book’ | d
Masculine sound plural letters | ḥurūf ğam‘ al-muḏakkar as-sālim | al-kātibūn ‘The writers (MAS)’ | m
Feminine sound plural letters | ḥurūf ğam‘ al-mu’nnaṯ as-sālim | al-kātibāt ‘The writers (FEM)’ | l
Dual letters | ḥurūf al-muṯannā | al-kātibān ‘The two writers’ | u
Imperative prefix | ḥurūf al-’amr | ’uktub ‘Write’ | i
Number (digits) | raqam | (+325461) (-897,653) (0.986) | g
Currency | ‘umlat | ($250) | c
Date | tārīẖ | (27/09/2011) | e
Non-Arabic word | kalima ḡayr ‘arabiyyah | windows, photoshop, games, download | w
Borrowed (foreign) word | kalimat mu‘arrabah | kuzmūbūlītān ‘cosmopolitan’ | x

A.6 Position 6; Part-of-Speech Subcategories of Punctuation Marks
Table A.7 Part-of-speech subcategories of Punctuation Marks attributes and their tags at position 6
Position 6. Feature name: Punctuation Marks (’aqsām al-kalām al-far‘iyya (‘alāmāt at-tarqīm))
Attribute | Transliteration | Mark | Tag
Full stop | nuqṭah | (.) | s
Comma | fāṣila | (،) | c
Colon | nuqṭatān | (:) | n
Semicolon | fāṣila manqūṭa | (؛) | l
Parentheses | qawsān | (()) | p
Square brackets | qawsān ḥāṣiratān | ([]) | b
Quotation mark | ‘alāma ’iqtibās | ("") | t
Dash | šarṭa mu‘tariḍa | (-) | d
Question mark | ‘alāma ’istifhām | (؟) | q
Exclamation mark | ‘alāma ta‘ağğub | (!) | e
Ellipsis mark | ‘alāma ḥaḏf | (...) | i
Continuation mark | ‘alāma at-tabi‘yya | (=) | f
Other punctuations | ‘alāmāt ’uẖrā | | o

A.7 Position 7; Morphological Feature of Gender
Table A.8 Morphological feature of Gender attributes and their tags at position 7
Position 7. Feature name: Gender (al-muḏakkar wa al-mu’annaṯ)
Attribute | Transliteration | Example | Tag
Masculine | muḏakkar | rağul ‘man’ | m
Feminine | mu’annaṯ | ’imra’ah ‘Woman’ | f
Common gender | muḏakkar ’aw mu’annaṯ | milḥ ‘Salt’; rūḥ ‘Soul’ | x

A.8 Position 8; Morphological Feature of Number
Table A.9 Morphological feature of Number attributes and their tags at position 8
Position 8. Feature name: Number (al-‘adad)
Attribute | Transliteration | Example | Tag
Singular | mufrad | qalam ‘A pen’; fallāḥ ‘Farmer’; manārah ‘A minaret’ | s
Dual | muṯannā | qalam: qalamān, qalamayn ‘A pen: two pens’; manārah: manāratān, manāratayn ‘A minaret: two minarets’ | d
Sound plural | ğam‘ sālim | fallāḥ: fallāḥūn, fallāḥīn ‘A farmer: farmers’; manārah: manārāt ‘A minaret: minarets’ | p
Broken plural | ğam‘ taksῑr | qalam: ’aqlām ‘A pen: pens’ | b
Plural of paucity | ğam‘ qillah | ḥarf: ’aḥruf ‘A letter: letters’ | m
Plural of multitude | ğam‘ kaṯrah | ḥarf: ḥurūf ‘A letter: letters’ | j
Ultimate plural | munthā al-ğumū‘ | masğid: masāğid ‘A mosque: mosques’ | u
Plural of plural | ğam‘ al-ğam‘ | bayt: buyūt, buyūtāt ‘A home: homes’ | l
Undefined | ḡayr mu‘arraf | kataba aṭ-ṭālibu ad-darsa ‘the student wrote the lesson’; kataba aṭ-ṭālibān ad-darsa ‘the two students wrote the lesson’; kataba aṭ-ṭulābu ad-darsa ‘the students wrote the lesson’ | x

A.9 Position 9; Morphological Feature of Person
Table A.10 Morphological feature of Person category attributes and their tags at position 9
Position 9. Feature name: Person (al-’isnād)
Attribute | Transliteration | Example | Tag
First Person | al-mutakallim | katabtu ‘I wrote’ | f
Second Person | al-muẖāṭab | katabtumā ‘You wrote’ | s
Third Person | al-ḡā’ib | katabna ‘They wrote’ | t

A.10 Position 10; Morphological Feature of Inflectional Morphology
Table A.11 The morphological feature category of Inflectional Morphology attributes and their tags at position 10
Position 10. Feature name: Inflectional Morphology (aṣ-ṣarf)
Attribute | Transliteration | Example | Tag
Declined (noun) / Conjugated (verb) | mu‘rab | yaḡību ‘Miss’ | d
Triptote / fully declined | mu‘rab - munṣarif | ḡā’ib ‘Absent’ | v
Non-declinable | mu‘rab - mamnū’ mina aṣ-ṣarf | ‘uṯmānu ‘Othman’ | p
Invariable (v, n) | mabnῑ | hā’ulā’i ‘Those’; fa‘ala ‘Did’; layta ‘Wish’ | s

A.11 Position 11; Morphological Feature Category of Case or Mood
Table A.12 The morphological feature of Case or Mood category attributes and their tags at position 11
Position 11. Feature name: Case or Mood (al-ḥālatu al-’i‘rābiyyatu lil-’ism ’aw al-fi‘l)
Attribute | Transliteration | Example | Tag
Nominative / Indicative | marfū‘ | yaktubu ‘He is writing’; al-kitābu ‘The Book’ | n
Accusative / Subjunctive | manṣūb | lan yaktuba ‘He will not write’; al-kitāba ‘The Book’ | a
Genitive | mağrūr | al-kitābi ‘The Book’ | g
Imperative or jussive | mağzūm | lam yaktub ‘He did not write’ | j

A.12 Position 12; The Morphological Feature of Case and Mood Marks
Table A.13 The morphological feature category of Case and Mood Marks attributes and tags at position 12
Position 12. Feature name: Case and Mood Marks (‘alāmāt al-’i‘rāb wa al-binā’)
Attribute | Transliteration | Example
ḍammah | al-ḍammah / al-ḍamm | qadima al-wazīru ‘The minister arrived’; yaṣūmu aḥmad ‘Ahmad fasts’
fatḥah | al-fatḥah / al-fatḥ | ’akrama ṣāliḥun al-wazīra ‘Salih honored the minister’; lan naṣbira ‘alā aḏ-ḏulli ‘We are not standing the humiliation’
kasrah | al-kasrah / al-kasr | ẖalaqa allahu as-samāwāti wa al-’arḍa ‘God created the skies and the earth’
sukūn (Silence) | as-sukūn | lam ’usāfir ’ilā al-madīnati ‘I did not travel to the city’
wāw | al-wāw | ’iḏā ğā’aka al-munāfiqūn ‘If the Hypocrites come to thee’
alif | al-’alif | ’iltaqā al-farīqān ‘The two teams have met’
yā’ | al-yā’ | ḏahbtu ’ilā ’aẖīka ‘I went to your brother’
Inflectional nūn | ṯubūt an-nūn |

Position 21. Feature name: Verb Root (bunyatu al-fi‘l)
Attribute | Transliteration | Tag
 | ’ağwaf yā’ī mahmūz al-fā’ | s
 | ’ağwaf yā’ī mahmūz al-lām | t
Defective with wāw verb | nāqiṣ wāwī | u
Defective with wāw and initially-hamzated verb | nāqiṣ wāwī mahmūz al-fā’ | v
Defective with wāw and medially-hamzated verb | nāqiṣ wāwī mahmūz al-‘ayn | w
Defective with yā' verb | nāqiṣ yā’ī | x
Defective with yā' and initially-hamzated verb | nāqiṣ yā’ī mahmūz al-fā’ | y
Defective with yā' and medially-hamzated verb | nāqiṣ yā’ī mahmūz al-‘ayn | z
Adjacent doubly-weak verb | lafῑf maqrūn | *
Adjacent doubly-weak and initially-hamzated verb | lafῑf maqrūn mahmūz al-fā’ | $
Separated doubly-weak verb | lafῑf mafrūq | &
Separated doubly-weak and medially-hamzated verb | lafῑf mafrūq mahmūz al-‘ayn | @

A.22 Position 22; The Morphological Feature of Noun Finals
Table A.23 The morphological feature of Noun Finals category attributes and their tags at position 22
Position 22. Feature name: Noun Finals (’aqsām al-’ismi tib‘an li-lafẓi ’āẖirhi)
Attribute | Transliteration | Example | Tag
Sound noun | al-’ism ṣaḥῑḥ al-’āẖir | ğabal ‘mountain’; nahr ‘river’; dirham ‘Dirham (currency)’ | s
Semi-sound noun | al-’ism šibh aṣ-ṣaḥῑḥ | dalw ‘bucket’; bahw ‘hall’ | i
Noun with shortened ending | al-’ism al-maqṣūr | bušrā ‘glad tidings’ | t
Noun with extended ending | al-’ism al-mamdūd | samā’ ‘sky’ | e
Noun with curtailed ending | al-’ism al-manqūṣ | al-qāḍῑ ‘the judge’ | c
Noun with deleted ending | al-’ism maḥḏūf al-’āẖir | yad ‘hand’; sanah ‘year’; luḡah ‘language’ | d

Appendix B
Summary of Arabic Part-of-Speech Tagging Systems

1- APT: Arabic Part-of-Speech tagger by Khoja
Corpus used:
• 59,040 words of the Saudi "Al-Jazirah" newspaper, dated 03/03/1999.
• 3,104 words of the Egyptian "Al-Ahram" newspaper, dated 25/01/2000.
• 5,811 words of the Qatari "Al-Bayan" newspaper, dated 25/01/2000.
• 17,204 words of Al-Mishkat, an Egyptian published paper in social science, April 1999.
Lexicon: 50,000 words extracted from the Jazirah newspaper were tagged and used to derive the lexicon, which contains 9,986 words.
Algorithm (Methodology): Statistical and rule-based techniques. The statistical tagger uses the Viterbi algorithm.
Tagset & tagset size: The tagset developed by Khoja contains 177 tags: 103 Nouns, 57 Verbs, 9 Particles, 7 Residual, 1 Punctuation.
Evaluation method: The stemmer was evaluated using a dictionary of 4,748 triliteral and quadriliteral roots.
Evaluation metrics: The test of the stemmer shows an accuracy of 97%. The statistical tagger achieved an accuracy of around 90%.

2- POS Tagging of Dialectal Arabic by Duh and Kirchhoff
Corpus used: 1- The CallHome Egyptian Colloquial Arabic (ECA) corpus; 2- The LDC Levantine Arabic (LCA) corpus; 3- The LDC MSA Treebank corpus. LDC-distributed Buckwalter stemmer.
Algorithm (Methodology): Internal stem lexicon combined with rules for affixation. The baseline tagger was a statistical trigram tagger in the form of a hidden Markov model (HMM).
Tagset & tagset size: They mapped both sets of tags, the LDC ECA annotation and the Buckwalter stemmer, to a unified, simpler tagset consisting only of the major POS categories (17 categories).
Evaluation method: ECA evaluation set. Systems: CombineData; CombineLex; Interpolate – λ; Interpolate – λ(ti); JointTrain(1:4); JointTrain(2:1); JointTrain(2:1) + affix w/ECA+LCA; w/ECA+MSA.
Evaluation metrics: Accuracy was 58.47%, improved to 66.61% using affix features and to 68.48% by joint training.

3- Memory-based morphological analysis and part-of-speech tagging of Arabic by Bosch, Marsi, and Soudi
Corpus used: Arabic Treebank version 3.0. Lexicon: they created a lexicon that maps every unvoweled written word to all analyses.
Algorithm (Methodology): Memory-based learning (k-nearest neighbor classification) morphologically analyzes and PoS tags Arabic, and analyzes it using Tim Buckwalter’s Arabic Morphological Analyser, which is rule-based. They employed the MBT memory-based tagger-generator and tagger. http://ilk.uvt.nl/
Tagset & tagset size: They used the same tagset as the Penn Arabic TreeBank.
Evaluation method: They evaluated on the complete correctness of all reconstructed analyses in terms of recall, precision and F-score.
Evaluation metrics: The accuracy of the tagger on the held-out corpus was 91.9%. On the 14155 known words it was 93.1%; on the 947 unknown words it was 73.6%.

4- Brill’s POS tagger and a Morphology parser for Arabic by Freeman
Corpus used: Large corpus of Modern Standard Arabic text. All input Arabic text was assumed to be Windows CP-1256 text using the transliteration scheme devised by Tim Buckwalter and Ken Beesley at Xerox.
Algorithm (Methodology): Brill’s “transformation-based” or “rule-based” tagger.
Tagset & tagset size: 119 tags.
Evaluation method: The system was not evaluated.
Evaluation metrics: The system was not evaluated.

5- Automatic Tagging of Arabic Text by Diab, Hacioglu and Jurafsky
Corpus used: The data in the Arabic TreeBank was transliterated into Latin-based ASCII characters using the Buckwalter transliteration scheme.
Algorithm (Methodology): Support Vector Machine (SVM) based approach.
Tagset & tagset size: 24 collapsed tags available in the Arabic TreeBank distribution. This collapsed tag set is a manually reduced form of the 135 morpho-syntactic tags created by AraMorph.
Evaluation method: A standard SVM with a polynomial kernel, of degree 2 and C=1.7. Standard metrics of Accuracy (Acc), Precision (Prec), Recall (Rec), and the F-measure, Fβ=1, on the test set are utilized.
Evaluation metrics: 95.49%

6- Part-of-Speech Tagging by Habash and Rambow
Corpus used: The data they used comes from the Penn Arabic Treebank. They used the first two releases of the ATB, ATB1 and ATB2, which are drawn from different news sources. They used the ALMORGEANA morphological analyzer, which uses the databases (i.e., lexicon) from the Buckwalter Arabic Morphological Analyzer.
Algorithm (Methodology): SVM-based Yamcha (which uses Viterbi decoding) rather than an exponential model.
Tagset & tagset size: They used a reduced POS tagset (15 tags) along with the other orthogonal linguistic features.
Evaluation method: They mapped their best solutions to the English tagset and assumed gold standard tokenization. They then evaluated against the gold standard POS tagging, which is mapped similarly.
Evaluation metrics: On their own reduced POS tagset, evaluating on TE1, they obtained an accuracy score of 98.1% on all tokens.

7- Arabic Part-of-Speech Tagging by Harmain
Corpus used: 42000 HTML documents (316 MB), mostly from the Al-Hayat Arabic newspaper. Dictionary: they used Buckwalter’s dictionary available from the Linguistic Data Consortium (LDC).
Algorithm (Methodology): Rule-based.
Tagset & tagset size: Tagset is unknown.
Evaluation method: He did not show any evaluation for his system.
Evaluation metrics: No evaluation done.

8- Hybrid Method for Tagging Arabic Text by Tlili-Guiassa
Corpus used: Texts extracted from educational books in the first stage and some Qur’anic text that was tagged using a small tag set.
Algorithm (Methodology): Hybrid of a rule-based method and a machine learning method.
Tagset & tagset size: The tag set used is the tag set derived from APT.
Evaluation method: All experiments are performed on texts extracted from educational books in the first stage and some Qur’anic text that was tagged using a small tag set and retagged with a more detailed tag set.
Evaluation metrics: 85%

9- A Hidden Markov Model-Based POS Tagger for Arabic by Al-Shamsi and Guessoum
Corpus used: A training corpus of Arabic news articles was first stemmed using the stemming component and then tagged manually with their proposed tag set. They examined LDC's Arabic TreeBank corpus (LDC, 2005), which consists of 734 news articles. They developed a 9.15 MB corpus of native Arabic articles, which were manually tagged using the developed tag set.
Algorithm (Methodology): They used Buckwalter's stemmer to stem the training data. They constructed trigram language models and used the trigram probabilities in building the HMM model.
Tagset & tagset size: 55 tags. They selected the tags that were rich enough to allow good training and good performance of the HMM-based POS tagger. At the same time, they tried carefully to make the tag set small enough to make the training of the POS tagger computationally feasible.
Evaluation method: They used the F-measure to evaluate POS tagger performance. They computed the F-measure as [2 x Precision x Recall] / [Precision + Recall], where Precision = Ncorrect / Nresponse and Recall = Ncorrect / Nkey.
Evaluation metrics: 97%
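
The F-measure quoted for the HMM-based tagger can be written out as a short Python sketch. Only the formulae come from the summary (Precision = Ncorrect / Nresponse, Recall = Ncorrect / Nkey, F = [2 x Precision x Recall] / [Precision + Recall]); the function names and the counts in the example are hypothetical and are not taken from the reported experiments.

```python
def precision(n_correct, n_response):
    """Precision = Ncorrect / Nresponse (share of the tagger's answers that are correct)."""
    return n_correct / n_response if n_response else 0.0


def recall(n_correct, n_key):
    """Recall = Ncorrect / Nkey (share of the gold-standard tags that were found)."""
    return n_correct / n_key if n_key else 0.0


def f_measure(p, r):
    """F = 2PR / (P + R); defined as 0.0 when both P and R are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p = precision(90, 100)
    r = recall(90, 120)
    print(f"P={p:.3f} R={r:.3f} F={f_measure(p, r):.3f}")  # prints P=0.900 R=0.750 F=0.818
```

When every token receives exactly one tag, Nresponse equals Nkey, so precision, recall and F-measure all reduce to plain accuracy; the three values differ only when the tagger can abstain or propose several tags.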