SABER

Authors
Sánchez Pérez Miguel Ángel
Markov Ilia
Gómez Adorno Helena Montserrat
Sidorov Grigori

Title	Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus
Type	Conference
Subtype	Undefined
Description	8th International Conference of the CLEF Association, CLEF 2017
Abstract	We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5–8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1–2 for words and n = 3–8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
Notes	DOI: 10.1007/978-3-319-65813-1_15 Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), v. 10456
Place	Dublin
Country	Ireland
Pages	145-151
Vol. / Chap.	10456 LNCS
Start	2017-09-11
End	2017-09-14
ISBN/ISSN	9783319658124
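
The feature sets named in the abstract (character n-grams with n = 3–8 and word unigrams and bigrams) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the whitespace tokenization, the lowercasing, and the use of raw counts are all assumptions made for illustration, and the named-entity replacement and the Liblinear SVM classifier are left out.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram of length n_min..n_max (illustrative helper)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[("char", text[i:i + n])] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams; splitting on whitespace and lowercasing are assumptions."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[("word", " ".join(tokens[i:i + n]))] += 1
    return counts


def combined_features(text):
    """Merge both feature families; the ("char"/"word", ...) keys keep them distinct."""
    return char_ngrams(text) + word_ngrams(text)


features = combined_features("el gato come")
```

The resulting counts would then be vectorized and passed to a linear classifier; how the paper weights or normalizes them is not stated in this record.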