Authors
Sánchez Pérez Miguel Ángel
Sidorov Grigori
Gelbukh Alexander
Title The Winning Approach to Text Alignment for Text Reuse Detection at PAN 2014
Type Conference
Subtype Proceedings
Description Notebook for PAN at CLEF 2014. CLEF 2014. CLEF 2014 Working Notes
Abstract The task of (monolingual) text alignment consists of finding similar text fragments between two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask at the PAN 2014 plagiarism detection competition. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme that permits us to keep stopwords without increasing the false-positive rate. We introduce a recursive algorithm to extend the matching sentences to maximal-length passages. We also introduce a novel filtering method to resolve overlapping plagiarism cases. By the cumulative measure (Plagdet), our approach outperforms the best-performing system of the PAN 2013 competition, and was the best-performing (on the first corpus) and third best-performing (on the second corpus) system according to the official results of the PAN 2014 competition. Our system is publicly available in open-source form.
Notes Drive: The-winning-approach_2014
Place Sheffield
Country United Kingdom
Pages 1004-1011
Vol. / Ch. 1180
Start 2014-09-15
End
ISBN/ISSN