Authors
Ullah Fida
Gelbukh Alexander
Zamir Muhammad Tayyab
Felipe Riverón Edgardo Manuel
Sidorov Grigori
Title Enhancement of Named Entity Recognition in Low-Resource Languages with Data Augmentation and BERT Models: A Case Study on Urdu
Type Journal
Sub-type CONACYT
Description Computers
Abstract Identifying and categorizing proper nouns in text, known as named entity recognition (NER), is crucial for various natural language processing tasks. However, developing effective NER techniques for low-resource languages like Urdu poses challenges due to limited training data, particularly in the Nastaliq script. To address this, our study introduces a novel data augmentation method, “contextual word embeddings augmentation” (CWEA), for Urdu, aiming to enrich existing datasets. The extended dataset, comprising 160,132 tokens and 114,912 labeled entities, significantly enhances the coverage of named entities compared to previous datasets. We evaluated several transformer models on this augmented dataset, including BERT-multilingual, RoBERTa-Urdu-small, BERT-base-cased, and BERT-large-cased. Notably, the BERT-multilingual model outperformed the others, achieving the highest macro F1 score of 0.982. This surpassed the macro F1 scores of the RoBERTa-Urdu-small (0.884), BERT-large-cased (0.916), and BERT-base-cased (0.908) models. Additionally, our neural network model achieved a micro F1 score of 96%, the RNN model 97%, and the BiLSTM model a macro F1 score of 96% on the augmented data. Our findings underscore the efficacy of data augmentation techniques in enhancing NER performance for low-resource languages like Urdu. © 2024 by the authors.
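As an illustration of the kind of contextual augmentation the abstract describes, the sketch below uses a masked-language-model fill-mask pipeline to propose in-context token substitutions for a labeled Urdu sentence. It assumes the Hugging Face transformers library and the bert-base-multilingual-cased checkpoint, and it is a generic approximation of contextual word-embedding augmentation, not the authors' exact CWEA procedure.

```python
# Minimal sketch of contextual word-embedding augmentation (generic, not the
# authors' exact CWEA pipeline): mask one non-entity token at a time and let a
# multilingual masked LM propose in-context replacements, keeping the original
# NER labels for the substituted positions (an assumption of this sketch).
from transformers import pipeline

# bert-base-multilingual-cased covers Urdu; an Urdu-specific checkpoint
# (e.g., RoBERTa-Urdu-small) could be swapped in instead.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def augment_sentence(tokens, labels, top_k=3):
    """Yield augmented (tokens, labels) pairs by replacing one non-entity
    token at a time with masked-LM suggestions."""
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab != "O":                      # leave named-entity tokens untouched
            continue
        masked = tokens.copy()
        masked[i] = fill_mask.tokenizer.mask_token
        for pred in fill_mask(" ".join(masked), top_k=top_k):
            candidate = pred["token_str"].strip()
            if candidate and candidate != tok:
                new_tokens = tokens.copy()
                new_tokens[i] = candidate
                yield new_tokens, labels

# Example usage with a toy labeled sentence (BIO labeling scheme).
tokens = ["احمد", "نے", "لاہور", "میں", "تقریر", "کی"]
labels = ["B-PER", "O", "B-LOC", "O", "O", "O"]
for new_toks, labs in augment_sentence(tokens, labels, top_k=2):
    print(new_toks, labs)
```

In the study, the augmented corpus is then used to train the transformer and neural models listed in the abstract; the sketch covers only the augmentation step.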
Notes DOI 10.3390/computers13100258
Place Basel
Country Switzerland
No. of pages Article number 258
Vol. / Chap. v. 13, no. 10
Start 2024-10-01
End
ISBN/ISSN