Stream sampling for variance-optimal estimation of subset sums

Cohen, Edith; Duffield, Nick; Kaplan, Haim; Lund, Carsten; Thorup, Mikkel


arXiv:0803.0473 (cs)

[Submitted on 4 Mar 2008 (v1), last revised 15 Nov 2010 (this version, v2)]

Abstract: From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $\mathrm{VarOpt}_k$, that dominates all previous schemes in terms of estimation quality.
$\mathrm{VarOpt}_k$ provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen $n$ items of the stream, then for any subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality is against any off-line scheme with $k$ samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in $O(\log k)$ time. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
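
To make the setting concrete, the following Python sketch shows one way a threshold-based weighted reservoir of size $k$ can be maintained and then used to estimate subset sums. It is an illustration only, not the implementation the abstract refers to: the abstract gives no algorithmic details, so the update rule (find a threshold `tau`, drop one below-threshold item at random, raise the remaining below-threshold items to `tau`) and the function names `varopt_step`, `varopt_sample`, and `estimate_subset_sum` are assumptions made here. For clarity each update sorts the reservoir, costing $O(k \log k)$ per item rather than the $O(\log k)$ stated in the abstract. Weights are assumed to be strictly positive.

```python
import random


def varopt_step(reservoir, item, k, rng=random):
    """Add one (key, weight) item to a reservoir holding at most k items.

    Reservoir entries are (key, adjusted_weight) pairs. Hypothetical helper:
    a simple O(k log k) update, not the O(log k) one from the abstract.
    """
    items = reservoir + [item]
    if len(items) <= k:
        return items  # still room: keep everything with its original weight

    # We hold k + 1 items and must drop exactly one.
    items.sort(key=lambda kw: kw[1], reverse=True)
    rest_sum = sum(w for _, w in items)
    num_large = 0
    # Find tau such that sum_i min(1, w_i / tau) == k. Items heavier than
    # tau ("large") are kept with certainty and keep their own weight.
    while num_large < k - 1 and items[num_large][1] * (k - num_large) > rest_sum:
        rest_sum -= items[num_large][1]
        num_large += 1
    tau = rest_sum / (k - num_large)

    # Drop one small item; item i is dropped with probability 1 - w_i / tau.
    # These probabilities sum to 1 over the small items.
    small = items[num_large:]
    u = rng.random()
    drop = len(small) - 1  # fallback guards against floating-point round-off
    acc = 0.0
    for i, (_, w) in enumerate(small):
        acc += 1.0 - w / tau
        if u < acc:
            drop = i
            break

    # Surviving small items get adjusted weight tau.
    kept_small = [(key, tau) for i, (key, _) in enumerate(small) if i != drop]
    return items[:num_large] + kept_small


def varopt_sample(stream, k, rng=random):
    """Run the update over an iterable of (key, weight) pairs."""
    reservoir = []
    for item in stream:
        reservoir = varopt_step(reservoir, item, k, rng)
    return reservoir


def estimate_subset_sum(reservoir, predicate):
    """Estimate the total weight of the items whose key satisfies predicate."""
    return sum(w for key, w in reservoir if predicate(key))
```

A usage example under the same assumptions: `sample = varopt_sample(((i, 1.0 + i % 5) for i in range(1000)), k=50)` followed by `estimate_subset_sum(sample, lambda key: key % 2 == 0)` estimates the total weight of the even-keyed items. In this sketch the adjusted weights of the sample always add up to the total weight of the stream seen so far, so the estimate for the full set is exact up to floating-point error.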

Comments:	31 pages. An extended abstract appeared in the proceedings of the 20th ACM-SIAM Symposium on Discrete Algorithms (SODA 2009)
Subjects:	Data Structures and Algorithms (cs.DS)
ACM classes:	C.2.3; E.1; F.2; G.3; H.3
Cite as:	arXiv:0803.0473 [cs.DS]
	(or arXiv:0803.0473v2 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.0803.0473

Submission history

From: Edith Cohen
[v1] Tue, 4 Mar 2008 15:12:24 UTC (21 KB)
[v2] Mon, 15 Nov 2010 16:43:54 UTC (63 KB)
