SABER

Authors
Gómez Adorno Helena Montserrat
Martín del Campo Rodríguez Carolina
Sidorov Grigori

Title	Hierarchical clustering analysis: The best-performing approach at PAN 2017 author clustering task
Type	Conference
Sub-type	Proceedings
Description	9th International Conference of the CLEF Association, CLEF 2018
Abstract	The author clustering problem consists in grouping documents written by the same author so that each group corresponds to a different author. We describe our approach to the author clustering task at PAN 2017, which was the best-performing system at that task. Our method performs a hierarchical clustering analysis using document features such as typed and untyped character n-grams, word n-grams, and stylometric features. We experimented with two feature representation methods, the log-entropy model and TF-IDF, while tuning minimum frequency threshold values to reduce the feature dimensionality. We identified the optimal number of clusters (authors) dynamically for each collection using the Caliński-Harabasz score. The implementation of our system is available as open source (https://github.com/helenpy/clusterPAN2017). © Springer Nature Switzerland AG 2018.
Notes	DOI 10.1007/978-3-319-98932-7_20; Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), v. 11018
Place	Avignon
Country	France
Pages	216-223
Vol. / Ch.	11018 LNCS
Start	2018-09-10
End	2018-09-14
ISBN/ISSN	9783319989310
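
The pipeline summarized in the abstract (n-gram features, TF-IDF weighting, hierarchical clustering, and a Caliński-Harabasz search for the number of clusters) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation, which is in the repository linked above. Everything here is an assumption made for illustration: the function names, the use of untyped character 3-grams only, the smoothed IDF formula, the reading of the minimum frequency threshold as a document-frequency cutoff (`min_df`), and average linkage over Euclidean distance.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Untyped character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3, min_df=1):
    """L2-normalized TF-IDF vectors; n-grams seen in fewer than min_df docs are dropped."""
    counts = [Counter(char_ngrams(d.lower(), n)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency per n-gram
    vocab = sorted(g for g, f in df.items() if f >= min_df)
    idf = {g: math.log((1 + len(docs)) / (1 + df[g])) + 1.0 for g in vocab}
    vectors = []
    for c in counts:
        v = [c[g] * idf[g] for g in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerative(vectors, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(math.sqrt(sq_dist(vectors[a], vectors[b]))
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


def calinski_harabasz(vectors, labels):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    n, dim = len(vectors), len(vectors[0])
    groups = {}
    for v, label in zip(vectors, labels):
        groups.setdefault(label, []).append(v)
    k = len(groups)
    overall = [sum(v[d] for v in vectors) / n for d in range(dim)]
    between = within = 0.0
    for members in groups.values():
        centroid = [sum(v[d] for v in members) / len(members) for d in range(dim)]
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(v, centroid) for v in members)
    if within == 0.0:
        return float("inf")  # sketch-level choice for degenerate, zero-spread clusters
    return (between / (k - 1)) / (within / (n - k))


def cluster_documents(docs, n=3, min_df=1):
    """Try every k in [2, len(docs) - 1]; return (score, k, labels) with the best score."""
    vectors = tfidf_vectors(docs, n, min_df)
    best = None
    for k in range(2, len(docs)):
        labels = agglomerative(vectors, k)
        score = calinski_harabasz(vectors, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best
```

Calling `cluster_documents(docs)` on a list of at least three strings returns the best score, the selected number of clusters, and one cluster label per document. The score is only defined for 2 to n-1 clusters, which is why the search range excludes the trivial one-cluster and all-singletons solutions.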