- Title
Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution.
- Authors
FORCADA, Mikel L.; ESPLÀ-GOMIS, Miquel; PÉREZ-ORTIZ, Juan Antonio
- Abstract
Sentence-aligned web-crawled parallel text or bitext is frequently used to train statistical machine translation systems. To that end, web-crawled sentence-aligned bitext sets are sometimes made publicly available and distributed by translation technologies practitioners. Contrary to what may be commonly believed, distribution of web-crawled text is far from being free from legal implications, and may sometimes actually violate the usage restrictions. As the distribution and availability of sentence-aligned bitext is key to the development of statistical machine translation systems, this paper proposes an alternative: instead of copying and distributing copies of web content in the form of sentence-aligned bitext, one could distribute a legally safer stand-off annotation of web content, that is, files that identify where the aligned sentences are, so that end users can use this annotation to privately recrawl the bitexts. The paper describes and discusses the legal and technical aspects of this proposal, and outlines an implementation.
- Subjects
ANNOTATIONS; INTERNET content; MACHINE translating
- Publication
Baltic Journal of Modern Computing, 2016, Vol 4, Issue 2, p152
- ISSN
2255-8942
- Publication type
Article
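
The abstract's central idea, distributing files that say where the aligned sentences are rather than the sentences themselves, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the names `Span`, `AlignedPair`, `make_span`, `resolve` and `rebuild_bitext`, the use of character offsets into the extracted page text, and the SHA-1 check for detecting pages that changed since annotation. Fetching is left out; `pages` stands for text the end user has privately recrawled.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-off pointer: where a sentence is, not what it says."""
    url: str
    start: int   # character offset into the page text (assumed convention)
    end: int
    sha1: str    # digest of the sentence, used to detect changed pages


@dataclass(frozen=True)
class AlignedPair:
    source: Span
    target: Span


def digest(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def make_span(url: str, page_text: str, sentence: str) -> Span:
    """Annotator side: locate the first occurrence of a sentence in a page."""
    start = page_text.index(sentence)
    return Span(url, start, start + len(sentence), digest(sentence))


def resolve(span: Span, pages: Dict[str, str]) -> Optional[str]:
    """End-user side: recover the sentence from a locally recrawled page."""
    text = pages.get(span.url)
    if text is None:
        return None
    candidate = text[span.start:span.end]
    # Reject the span if the page no longer holds the annotated sentence.
    return candidate if digest(candidate) == span.sha1 else None


def rebuild_bitext(pairs: List[AlignedPair],
                   pages: Dict[str, str]) -> List[Tuple[str, str]]:
    """Keep only pairs whose two sides both still resolve."""
    bitext = []
    for pair in pairs:
        src = resolve(pair.source, pages)
        tgt = resolve(pair.target, pages)
        if src is not None and tgt is not None:
            bitext.append((src, tgt))
    return bitext
```

In this sketch the distributed file would hold only `AlignedPair` records (URLs, offsets and digests), and each end user would rebuild the sentence pairs from their own copy of the pages. Pairs whose pages have moved or changed are silently dropped.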