Resumen |
Internet searches are a daily occurrence, but we must be aware that more than one person searches the same topic with different words, this is called paraphrasing. Paraphrasing involves syntactic changes and the overlapping of words, linked to the rules of the language in which we work. The identification is a problem of great importance for natural language processing (NLP), especially paraphrasing questions with the same intention. In addition, it has been found that for the study of similarities, some features are not taken into account, which makes the identification yield lower results. In this paper, we address the problem of automatic paraphrase identification in the Quora Question Pair (QQP) dataset, paying special attention to data’s shape through exploratory data analysis (EDA). This is in order to obtain better results in the identification tasks, as well as to compare different classifiers in collaborative environments where resources are limited. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG. |