- Title
Maintaining very large random samples using the geometric file.
- Authors
Abhijit Pol; Christopher Jermaine; Subramanian Arumugam
- Abstract
Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a “sample” is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We also present algorithms to retrieve a small random sample from a large disk-based sample, which may be used for various purposes, including statistical analyses by a DBMS.
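The requirement stated in the abstract, that the sample is at every moment a uniform random sample without replacement of everything processed so far, can be illustrated with classic in-memory reservoir sampling. The sketch below is not the geometric file from the paper, which targets disk-resident samples. It only shows the invariant such a structure has to preserve. The names `Reservoir`, `add`, and `subsample` are hypothetical and chosen for illustration.

```python
import random
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


class Reservoir(Generic[T]):
    """In-memory reservoir sample (classic Algorithm R).

    Illustration only: this is NOT the paper's geometric file, which
    is designed for samples too large to fit in main memory.
    """

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.seen = 0                  # number of records processed so far
        self.sample: List[T] = []      # current sample, valid after every add()
        self._rng = random.Random(seed)

    def add(self, record: T) -> None:
        """Process one record from the stream in a single pass."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            # The first `capacity` records are always kept.
            self.sample.append(record)
            return
        # Pick a slot uniformly from [0, seen). The new record is kept
        # only if the slot falls inside the reservoir, which happens
        # with probability capacity / seen; it then evicts the record
        # currently stored in that slot.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.sample[slot] = record

    def subsample(self, k: int) -> List[T]:
        """Draw a smaller random sample (without replacement) from the sample."""
        return self._rng.sample(self.sample, min(k, len(self.sample)))


if __name__ == "__main__":
    reservoir: Reservoir[int] = Reservoir(capacity=5, seed=42)
    for value in range(1000):
        reservoir.add(value)
    print(reservoir.sample)        # five distinct values from 0..999
    print(reservoir.subsample(2))  # two of those five values
```

Every replacement in this sketch touches a random slot, which is cheap in memory but would mean one random disk write per accepted record if the sample lived on disk. That cost is the kind of problem the abstract points to when it says no existing work maintains very large, disk-based samples online.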
- Subjects
STATISTICAL sampling; GEOMETRIC modeling; ALGORITHMS; DATABASE management
- Publication
VLDB Journal: International Journal on Very Large Data Bases, 2008, Vol 17, Issue 5, p997
- ISSN
1066-8888
- Publication type
Article
- DOI
10.1007/s00778-007-0048-z