Abhijit Pol; Christopher Jermaine; Subramanian Arumugam

doi:10.1007/s00778-007-0048-z

Title: Maintaining very large random samples using the geometric file.
Authors: Abhijit Pol; Christopher Jermaine; Subramanian Arumugam
Abstract: Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a “sample” is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We also present algorithms to retrieve a small random sample from a large disk-based sample, which may be used for various purposes, including statistical analyses by a DBMS.
Subjects: STATISTICAL sampling; GEOMETRIC modeling; ALGORITHMS; DATABASE management
Publication: VLDB Journal International Journal on Very Large Data Bases, 2008, Vol 17, Issue 5, p997
ISSN: 1066-8888
Publication type: Article
DOI: 10.1007/s00778-007-0048-z
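
Illustrative note: the invariant stated in the abstract (after every record, the sample is a uniform random sample without replacement of all data seen so far, computed in a single pass) is the same one maintained by classic in-memory reservoir sampling. The sketch below shows that textbook technique only; it is not the geometric file described in the paper, which targets samples too large for memory. The function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Classic in-memory reservoir sampling (hypothetical helper, not the paper's method).

    After processing n items, the reservoir holds a uniform random sample
    without replacement of size min(k, n) drawn from those n items.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1);
            # if kept, it overwrites a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: draw 10 records from a stream of 1000 in a single pass.
sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Each replacement here touches a random slot, which is cheap in memory but would mean a random disk write per accepted record for an on-disk sample; that cost is the kind of problem the abstract says the paper addresses.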
