Hoffmann, Sebastian

doi:10.1093/llc/fqm002

Title: Processing Internet-derived Text—Creating a Corpus of Usenet Messages.
Authors: Hoffmann, Sebastian
Abstract: In recent years, linguists have become increasingly interested in the language of the Internet, both as an object of investigation and as a source of authentic data to complement traditional electronic corpora. However, Internet-derived data is typically very messy, and a conversion process is often required in order to enable researchers to carry out a reliable quantitative investigation of the patterns observed with the help of standard corpus tools. In this article, I discuss the technical and methodological aspects involved in creating a large corpus of asynchronous computer-mediated communication by downloading and post-processing hundreds of thousands of messages posted in twelve Usenet newsgroups. After describing how messages can be arranged into hierarchically structured discussion threads, I focus at some length on the strategies that are required to correctly assign authorship to the different textual elements in individual messages. My algorithms have a success rate of well over 90% for most newsgroups, and the resulting corpus can thus serve as a suitable basis for an investigation into the interactive strategies employed in this particular type of written communication.
Subjects: TELEMATICS; WRITTEN communication; CORPORA; LANGUAGE & languages; USENET (Computer network)
Publication: Literary & Linguistic Computing, 2007, Vol 22, Issue 2, p151
ISSN: 0268-1145
Publication type: Article
DOI: 10.1093/llc/fqm002
