Comparative Analysis of Morphological Approaches for Low-Resource Indo-Aryan Languages: Case Studies in Angika, Maithili, and Hindi

Alok Kumar

doi:10.70454/IJMRE.2025.05037

Authors

Alok Kumar, Department of Computer Science Engineering, YBN University, Ranchi

Keywords:

Morphological Analysis, Low-Resource Indo-Aryan Languages, Finite-State Transducer, Rule-Based NLP, Hybrid Morphological Models

Abstract

Morphological analysis is an important and challenging subtask in Natural Language Processing (NLP), particularly for morphologically rich Indo-Aryan languages. Yet regional languages such as Angika and Maithili remain under-resourced because annotated corpora and computational tools are lacking. This work presents a comparative study of morphological analysis for Angika, Maithili, and Hindi, covering low- and little-resource scenarios. It compares rule-based, finite-state transducer, and hybrid methods, concentrating on the treatment of inflectional and derivational morphology. Linguistically motivated rules and small lexical resources are used for the low-resource languages, while Hindi serves as a reference language. Rule-based and hybrid models prove more robust and yield more interpretable results in low-resource settings than purely data-driven models. The study demonstrates the need for LingA (linguistic analyzers) to combine linguistic knowledge seamlessly with computational techniques in order to develop efficient morphological analysis tools for Indo-Aryan languages that are still under-represented typologically.
