Resumen |
The increasing use of Arabic and Urdu on social media platforms, particularly Twitter, has created a growing need for robust Named Entity Recognition (NER) systems capable of handling noisy, informal, and code-mixed content. However, both languages remain significantly underrepresented in NER research, especially in social media contexts. To address this gap, this study makes four key contributions: (1) We introduced a manual entity consolidation step to enhance the consistency and accuracy of named entity annotations. In the original datasets, entities such as person names and organization names were often split into multiple tokens (e.g., first name and last name labeled separately). We manually refined the annotations to merge these segments into unified entities, ensuring improved coherence for both training and evaluation. (2) We selected two publicly available datasets from GitHub—one in Arabic and one in Urdu—and applied two novel strategies to tackle low-resource challenges: a joint multilingual approach and a translation-based approach. The joint approach involved merging both datasets to create a unified multilingual corpus, while the translation-based approach utilized automatic translation to generate cross-lingual datasets, enhancing linguistic diversity and model generalizability. (3) We presented a comprehensive and reproducible pseudocode-driven framework that integrates translation, manual refinement, dataset merging, preprocessing, and multilingual model fine-tuning. (4) We designed, implemented, and evaluated a customized XLM-RoBERTa model integrated with a novel attention mechanism, specifically optimized for the morphological and syntactic complexities of Arabic and Urdu. Based on the experiments, our proposed model (XLM-RoBERTa) achieves 0.98 accuracy across Arabic, Urdu, and multilingual datasets. While it shows a 7–8% improvement over traditional baselines (RF), it also achieves a 2.08% improvement over a deep learning (BiLSTM = 0.96), highlighting the effectiveness of our cross-lingual, resource-efficient approach for NER in low-resource, code-mixed social media text. © 2025 Elsevier B.V., All rights reserved. |