- Title
Towards Utilizing Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set.
- Authors
Klein, Ari Z; Magge, Arjun; O'Connor, Karen; Amaro, Jesus Ivan Flores; Weissenbacher, Davy; Hernandez, Graciela Gonzalez; Flores, Ivan; Gonzalez-Hernandez, Graciela
- Abstract
Background: In the United States, the rapidly evolving outbreak of COVID-19, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.

Objective: The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the CDC.

Methods: Beginning January 23, 2020, we collected English tweets from the Twitter Streaming API that mention keywords related to COVID-19. We applied hand-written regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out "reported speech" (e.g., quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on pre-trained transformer models. Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1, 2020 and August 21, 2020.

Results: Inter-annotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen's kappa). A deep neural network classifier, based on a BERT model that was pre-trained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision = 0.76, recall = 0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have United States state-level geolocations.

Conclusions: We have made the 13,714 tweets identified in this study, along with each tweet's time stamp and United States state-level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
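
For readers unfamiliar with the two metrics reported in the abstract, the following is a minimal sketch of how Cohen's kappa and the F1-score are conventionally computed. It is not the authors' code. The function names and the ten toy annotation labels are assumptions made up for illustration; only the figures 3644, 8976, and precision = recall = 0.76 come from the abstract.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def f1_score(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (1 = self-reports a potential case, 0 = does not).
# These are NOT the study's annotations.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
toy_kappa = cohens_kappa(annotator_a, annotator_b)

# Figures taken from the abstract.
reported_f1 = f1_score(0.76, 0.76)
dual_annotated_percent = round(3644 / 8976 * 100)

print(f"toy kappa: {toy_kappa:.3f}")
print(f"F1 from reported precision and recall: {reported_f1:.2f}")
print(f"share of tweets with dual annotations: {dual_annotated_percent}%")
```

Because precision and recall are equal in the reported result, their harmonic mean equals the same value, which matches the F1-score of 0.76 stated in the abstract.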
- Publication
Journal of Medical Internet Research, 2021, Vol 23, Issue 1, pN.PAG
- ISSN
1439-4456
- Publication type
Article
- DOI
10.2196/25314