Abstract |
Soft-cardinality spectra (SC spectra) is a new linear-time approximation method for text strings that divides them into character q-grams of different sizes. The method allows weighting at the term and q-gram levels simultaneously. Combined with resemblance coefficients, SC spectra yield a family of text similarity functions that use only the surface information of the texts and weights obtained from the same text collection. These similarity measures can serve in various natural language processing tasks as a baseline for other methods that exploit hidden syntactic and/or semantic structure using knowledge-based resources, inference, or large corpora. The proposed method was evaluated on 22 data sets covering the tasks of information retrieval, entity matching, paraphrase recognition, and textual entailment recognition. The results raise the baseline close to the best published results on these data sets. We claim that any method that uses a resource or information external to a particular data set should outperform our method. We found that our method is an effective and challenging baseline for the evaluated tasks. |
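As a rough illustration of the surface-level idea, the sketch below splits strings into character q-grams of several sizes and compares them with a resemblance coefficient. It is a simplification under stated assumptions: it uses the ordinary set cardinality and the Jaccard coefficient in place of soft cardinality, applies no term or q-gram weighting, and averages over q as an arbitrary combination rule. The function names and the default range of q are hypothetical and are not taken from the paper.

```python
def qgram_spectrum(text, q_min=2, q_max=3):
    """Map each q in [q_min, q_max] to the set of character q-grams of text."""
    return {
        q: {text[i:i + q] for i in range(len(text) - q + 1)}
        for q in range(q_min, q_max + 1)
    }


def jaccard(a, b):
    """Jaccard resemblance coefficient on plain sets (not soft cardinality)."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def spectrum_similarity(s, t, q_min=2, q_max=3):
    """Average the Jaccard coefficient over all q-gram sizes (an assumption)."""
    spec_s = qgram_spectrum(s, q_min, q_max)
    spec_t = qgram_spectrum(t, q_min, q_max)
    scores = [jaccard(spec_s[q], spec_t[q]) for q in range(q_min, q_max + 1)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(qgram_spectrum("abc"))
    print(spectrum_similarity("similarity", "similarly"))
```

Each string is scanned once per value of q, so building the spectrum is linear in the string length for a fixed range of q.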