Authors
Gómez Adorno Helena Montserrat
Sidorov Grigori
Title Document embeddings learned on various types of n-grams for cross-topic authorship attribution
Type Journal
Subtype JCR
Description Computing
Abstract Recently, document embedding methods have been proposed that aim to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose to learn document vectors based on n-grams and not only on words, using the recently proposed Paragraph Vector method. These n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and conduct experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram-based linear models, which are among the most effective approaches for identifying the writing style of an author.
Notes DOI 10.1007/s00607-018-0587-8
Place Wien
Country Austria
Pages 741-756
Vol. / Ch. v. 100 no. 7
Start 2018-07-01
End
ISBN/ISSN