Finding Subcube Heavy Hitters in Analytics Data Streams.
Data streams typically have items of large number of dimensions. We study thefundamental heavy-hitters problem in this setting. Formally, the data streamconsists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensionalsubcube $T$ is a subset of distinct coordinates $\{ T_1,\cdots,T_k \} \subseteq[d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, $v \in [n]^k$, outputsYES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T$ is theratio of number of stream items whose coordinates $T$ have joint values $v$.The all subcube heavy hitters query ${\rm AllQuery}(T)$ outputs all jointvalues $v$ that return YES to ${\rm Query}(T,v)$. The one dimensional versionof this problem where $d=1$ was heavily studied in data stream theory,databases, networking and signal processing. The subcube heavy hitters problemis applicable in all these cases.
We present a simple reservoir sampling based one-pass streaming algorithm tosolve the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space. Thisis optimal up to poly-logarithmic factors given the established lower bound. Inthe worst case, this is $\Theta(d^2/\gamma)$ which is prohibitive for large$d$, and our goal is to circumvent this quadratic bottleneck.
Our main contribution is a model-based approach to the subcube heavy hittersproblem. In particular, we assume that the dimensions are related to each othervia the Naive Bayes model, with or without a latent dimension. Under thisassumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithmfor our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in$O(k/\gamma^2)$ time. Our work develops the direction of model-based datastream analysis, with much that remains to be explored.
Stay in the loop.
Subscribe to our newsletter for a weekly update on the latest podcast, news, events, and jobs postings.