Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish

Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura; Baldassarri, Sandra

dc.rights.license	https://creativecommons.org/licenses/by-nc-nd/2.5/ar/	es_ES
dc.creator	Tessore, Juan Pablo	es_ES
dc.creator	Esnaola, Leonardo Martín	es_ES
dc.creator	Lanzarini, Laura	es_ES
dc.creator	Baldassarri, Sandra	es_ES
dc.date.accessioned	2021-07-26T14:44:02Z
dc.date.available	info:eu-repo/date/embargoEnd/2022-01-17	es_ES
dc.date.available	2021-07-26T14:44:02Z
dc.date.issued	2021-01-18
dc.identifier.citation	Tessore, J.P., Esnaola, L.M., Lanzarini, L. et al. Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish. Cogn Comput (2021). https://doi.org/10.1007/s12559-020-09800-x	es_ES
dc.identifier.issn	1866-9964	es_ES
dc.identifier.issn	1866-9956	es_ES
dc.identifier.uri	https://repositorio.unnoba.edu.ar/xmlui/handle/23601/142
dc.description.abstract	Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming, and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus redundant classification is required to validate the assigned tag. Although the number of emotion-tagged text datasets in Spanish has been growing in recent years, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish include a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss' kappa interrater agreement measure. Error analysis is performed using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundred thousand tagged comments and does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone, and the dataset is therefore suitable for training classification algorithms in the field of sentiment analysis.	es_ES
dc.description.sponsorship	Affiliation: Tessore, Juan Pablo. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina	es_ES
dc.description.sponsorship	Affiliation: Tessore, Juan Pablo. Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Ciudad Autónoma de Buenos Aires, Argentina	es_ES
dc.description.sponsorship	Affiliation: Esnaola, Leonardo Martín. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Tecnología. Instituto de Investigación y Transferencia en Tecnología, Centro Asociado CIC; Argentina	es_ES
dc.description.sponsorship	Affiliation: Lanzarini, Laura. Facultad de Informática, Instituto de Investigación en Informática LIDI (Centro CICPBA), Universidad Nacional de La Plata, La Plata, Buenos Aires, Argentina	es_ES
dc.description.sponsorship	Affiliation: Baldassarri, Sandra. Departamento de Informática e Ingeniería de Sistemas, Universidad de Zaragoza, Zaragoza, Aragón, Spain	es_ES
dc.description.sponsorship	Affiliation: Baldassarri, Sandra. Instituto de Investigación en Ingeniería (I3A), Universidad de Zaragoza, Zaragoza, Aragón, Spain	es_ES
dc.format	application/pdf	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Springer Science+Business Media LLC	es_ES
dc.relation	info:eu-repo/grantAgreement/UNNOBA/SIB2017/EXP 195/2017/AR. Buenos Aires/Tecnología y Aplicaciones de Sistemas de Software: Calidad e Innovación en procesos, productos y servicios	es_ES
dc.rights	info:eu-repo/semantics/embargoedAccess	es_ES
dc.source	Cognitive Computation	es_ES
dc.subject	Sentiment analysis	es_ES
dc.subject	Dataset construction	es_ES
dc.subject	Dataset validation	es_ES
dc.subject	Facebook	es_ES
dc.subject	Text mining	es_ES
dc.title	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion‑Tagged Social Media Comments in Spanish	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.type	info:ar-repo/semantics/artículo	es_ES
dc.type	info:eu-repo/semantics/acceptedVersion	es_ES
dc.description.version	Peer reviewed	es_ES
dc.relation.publisherversion	https://link.springer.com/article/10.1007/s12559-020-09800-x	es_ES
dc.contributor.orcid	0000-0002-2111-0976	es_ES
dc.contributor.orcid	0000-0001-6298-9019	es_ES
dc.contributor.orcid	0000-0001-7027-7564	es_ES
dc.contributor.orcid	0000-0002-9315-6391	es_ES