Tessore, Juan Pablo; Esnaola, Leonardo Martín; Russo, Claudia Cecilia; Baldassarri, Sandra
Resumen:
One of the key aspects of the texts coming from social media is that they tend to be very noisy. This is mainly because of the usage of informal language and none standard grammatical structures. So in order to use these contents as input for a text
analysis process, it is highly recommended to previously clean and reduce the noise of the data. This work focuses on measuring the effectiveness that diverse cleaning and repairing tasks have on the data. The results obtained, indicate that the tasks of tokens with no letters removal, and stressed words correction are the most effective. In addition, some tasks like hashtags or usernames processing, which behave very well in other datasets, are not that
relevant in this one. This research is part of a more general one that pursues to build an automatic emotion classifier that makes use of the preprocessed comments as input.