
Comparative Study of Neural Machine Translation Approaches for Hindi–Malayalam: Bi-LSTM Baselines, Word2Vec and Attention Enhancements, and mBART Transfer Learning

Rajeev R R, Sneha S

Abstract


Hindi–Malayalam machine translation faces significant challenges due to structural differences between Indo-Aryan and Dravidian languages. Malayalam is highly agglutinative, whereas Hindi relies more on syntactic structure and postpositions to express grammatical relations. The limited availability of high-quality parallel corpora further complicates the development of robust translation systems. This study presents a comparative evaluation of neural machine translation architectures for Hindi–Malayalam translation. A curated parallel corpus of about 80,000 sentence pairs was created using automated translation followed by manual correction. Five models were implemented and evaluated: a Bi-LSTM sequence-to-sequence baseline, Bi-LSTM with Word2Vec embeddings, Bi-LSTM with Word2Vec and attention, inference with the pretrained multilingual transformer mBART-50, and fine-tuning of mBART-50 on the dataset. All models were trained using a unified preprocessing pipeline including Unicode filtering and Indic tokenization. Results show clear improvements across architectures, with fine-tuned mBART-50 achieving the highest translation quality, highlighting the effectiveness of multilingual transformer models for low-resource Hindi–Malayalam translation.
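The Unicode filtering step mentioned in the preprocessing pipeline can be sketched as a script-ratio check on each sentence pair: a pair is kept only if the Hindi side is predominantly Devanagari and the Malayalam side predominantly Malayalam script. The threshold value and function names below are illustrative assumptions, not details taken from the paper.

```python
# Unicode block ranges for the two scripts (per the Unicode code charts).
DEVANAGARI = (0x0900, 0x097F)   # Hindi
MALAYALAM = (0x0D00, 0x0D7F)    # Malayalam

def script_ratio(text, block):
    """Fraction of non-space characters that fall inside a Unicode block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    lo, hi = block
    return sum(lo <= ord(c) <= hi for c in chars) / len(chars)

def keep_pair(hi_sent, ml_sent, threshold=0.7):
    """Keep a sentence pair only if each side is mostly in its own script.

    The 0.7 threshold is an assumed value; the paper states only that
    Unicode filtering was applied, not the exact criterion.
    """
    return (script_ratio(hi_sent, DEVANAGARI) >= threshold
            and script_ratio(ml_sent, MALAYALAM) >= threshold)
```

A filter like this discards misaligned pairs, English-only noise, and sentences in the wrong script before Indic tokenization is applied.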

Keywords


Hindi–Malayalam Translation, Neural Machine Translation, Bi-LSTM, Word2Vec, Attention Mechanism, mBART, Informatics


References


Anand, G. G., et al. (2023). The Effect of Difference in Word Order in Hindi: An Experimental Characterization.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations (ICLR).

Cho, Kyunghyun, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio (2014). Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Dabre, Raj, et al. (2021). mBART Pre-training and In-Domain Fine-Tuning for Indic Languages.

Gala, Jay, et al. (2023). IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for All 22 Scheduled Indian Languages.

Gogineni, S., G. Suryanarayana, and S. K. Surendran (2020). An Effective Neural Machine Translation for English to Hindi Language. In Proceedings of the International Conference on Smart Electronics and Communication (ICOSEC).

Hochreiter, Sepp, and Jürgen Schmidhuber (1997). Long Short-Term Memory. Neural Computation 9 (8): 1735–1780.

Kudo, Taku, and John Richardson (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

Kumar, A., et al. (2017). Morphological Analysis of the Dravidian Language Family.

Laskar, S. R., A. Dutta, P. Pakray, and S. Bandyopadhyay (2019). Neural Machine Translation: English to Hindi. In IEEE Conference on Information and Communication Technology (CICT).

Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Liu, Yinhan, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer (2020). Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics 8: 726–742.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning (2015). Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Mikolov, Tomáš, Kai Chen, Greg Corrado, and Jeffrey Dean (2013). Efficient Estimation of Word Representations in Vector Space. In International Conference on Learning Representations (ICLR).

Moghe, Nikita, et al. (2023). Extrinsic Evaluation of Machine Translation Metrics. In Proceedings of the Association for Computational Linguistics.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

Popović, Maja (2015). chrF: Character n-gram F-score for Automatic MT Evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, 392–395.

Ram, V. S., and S. Lalitha Devi (2023). Hindi to Dravidian Language Neural Machine Translation Systems. In Recent Advances in Natural Language Processing (RANLP).

Ramesh, Gowtham, et al. (2022). Samanantar: The Largest Publicly Available Parallel Corpus for Indic Languages. Transactions of the Association for Computational Linguistics.

Rei, Ricardo, et al. (2020). COMET: A Neural Framework for Machine Translation Evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Schuster, Mike, and Kuldip K. Paliwal (1997). Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing 45 (11): 2673–2681.

Sebastian, M. P., et al. (2023). Malayalam Natural Language Processing: Challenges and Opportunities.

Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016a). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016b). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul (2006). A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the Association for Machine Translation in the Americas.

Tang, Yuqing, et al. (2020). Multilingual Translation with Extensible Multilingual Pretraining and Finetuning.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.

Wolf, Thomas, et al. (2020). Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations.




License URL: https://creativecommons.org/licenses/by/4.0/

Informatics Studies |  ISSN: 2583-8954 (Online), 2320-530X (Print)