Managing and Querying Multi-versioned Documents using a Distributed Key-Value Store.

Authors
Souvik Bhattacherjee, Amol Deshpande

We address the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over those, including retrieving full or partial versions, and evolution histories for specific keys. We motivate the increasing need for such a system in a variety of application domains, carefully explore the design space for building such a system and the various storage-computation-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. We propose a novel system architecture that satisfies the key desiderata for such a system, and offers simple tuning knobs that allow adapting to a specific data and query workload. Our system is intended to act as a layer on top of a distributed key-value store that houses the raw data as well as any indexes. We design novel off-line storage layout algorithms for efficiently partitioning the data to minimize the storage costs while keeping the retrieval costs low. We also present an online algorithm to handle new versions being added to the system. Using extensive experiments on large datasets, we demonstrate that our system operates at the scale required in most practical scenarios and often outperforms standard baselines, including a delta-based storage engine, by orders of magnitude.
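To make the three query types in the abstract concrete (full version, partial version, and evolution history of a key), here is a minimal illustrative sketch. It is not the authors' architecture or their layout algorithms: it is a naive scheme that stores each distinct record once under a content hash and keeps a per-version manifest, with a plain Python dict standing in for the distributed key-value store. The class name `VersionedStore`, its methods, and the `rec:`/`ver:` key prefixes are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple


class VersionedStore:
    """Toy multi-version document store layered on a dict (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.kv: Dict[str, Any] = {}   # stand-in for the distributed key-value store
        self.versions: List[str] = []  # version ids in commit order

    @staticmethod
    def _digest(record: Any) -> str:
        # Content hash, so identical records share one stored copy across versions.
        return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

    def commit(self, version_id: str, snapshot: Dict[str, Any]) -> None:
        """Store a full snapshot as a manifest of document key -> record hash."""
        manifest: Dict[str, str] = {}
        for key, record in snapshot.items():
            digest = self._digest(record)
            self.kv.setdefault(f"rec:{digest}", record)
            manifest[key] = digest
        self.kv[f"ver:{version_id}"] = manifest
        self.versions.append(version_id)

    def get_version(self, version_id: str,
                    keys: Optional[Iterable[str]] = None) -> Dict[str, Any]:
        """Return a full version, or a partial one restricted to `keys`."""
        manifest = self.kv[f"ver:{version_id}"]
        wanted = manifest if keys is None else [k for k in keys if k in manifest]
        return {k: self.kv[f"rec:{manifest[k]}"] for k in wanted}

    def history(self, key: str) -> List[Tuple[str, Any]]:
        """Return (version_id, record) at each version where `key` changed.

        A record of None marks a version in which the key was deleted.
        """
        changes: List[Tuple[str, Any]] = []
        last: Optional[str] = None
        for vid in self.versions:
            digest = self.kv[f"ver:{vid}"].get(key)
            if digest != last:
                record = None if digest is None else self.kv[f"rec:{digest}"]
                changes.append((vid, record))
                last = digest
        return changes


store = VersionedStore()
store.commit("v1", {"a": {"x": 1}, "b": {"y": 2}})
store.commit("v2", {"a": {"x": 1}, "b": {"y": 3}})
full_v1 = store.get_version("v1")
partial_v2 = store.get_version("v2", keys=["b"])
history_b = store.history("b")
```

In this sketch the unchanged record for `a` is stored only once, but every version still carries a full manifest and every record is fetched individually. That is one simple point in the storage-versus-retrieval trade-off space the abstract refers to, not the partitioning the paper proposes.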
