Before going into detail, let me explain what deduplication is all about and why we need it. It’s well known that the user demand for storage is increasing day by day. On the other hand, cloud storage vendors always look for optimization techniques that allow them to save storage space and minimize redundant data. In this context, deduplication is very popular and is a key feature of many storage solutions such as HP StoreOnce, IBM System Storage N series Deduplication and Symantec’s NetBackup and Backup Exec.
From a high-level point of view, deduplication is very simple: duplicate data is stored only once and we keep pointers to the actual data.
Deduplication is great but it doesn’t work on encrypted data because, somehow, it’s conflicting with encryption. Suppose we have two users who have the same data and want to store it remotely. Since they don’t trust the cloud provider, they will encrypt their data before uploading them to the cloud provider. Unfortunately, after the encryption, their data won’t be identical anymore and deduplication won’t possible! Indeed, the cloud provider cannot detect that those two data segments are identical.
There is a (simple) solution to this problem: Convergent Encryption! In convergent encryption, the encryption key is derived from the data (usually, it’s the hash of the data). This way, two different users will automatically use the same encryption key for the same data and thus generate the same ciphertext. At this point, the cloud storage provider can still perform deduplication since he can detect duplicate data.
Well, this means that we have found the solution to combine deduplication and encryption… NO!
Recently, Perttula has found two vulnerabilities affecting convergent encryption and defined two possible attacks: Confirmation of a File and Learn-the-remaining-information. Basically, the main weakness of convergent encryption comes from the fact that the key of a a given data segment can be generated by anyone in a deterministic way. This allows an attacker to easily check which ciphertext corresponds to which plaintext. Even though the data are stored encrypted, an attacker can generate a key from a plaintext, encrypt the plaintext and check if the resulting ciphertext is already stored or not. It’s clear that this weakness seriously compromises users’ confidentiality and therefore the whole system (and if you’re wondering… yes, Bitcasa is vulnerable to these attacks too!!).
Well, how can we achieve both deduplication and convergent encryption then?
We cope with this issue by adding one additional layer of deterministic and symmetric encryption on top of convergent encryption (of course, as you’ll see, it’s much more than that). This additional encryption can be added by a component placed between the user and the cloud storage provider such as a local server or a gateway. This component will take care of encrypting/decrypting data from/to users. In order to allow the cloud provider to detect duplicates, encryption and decryption are performed with one unique set of secret keys. This set of secret keys is securely stored by the component and won’t be shared with anyone for any reason! As we can see, one simple additional layer of encryption is sufficient to keep deduplication feasible and prevent the cloud provider from performing any of the above-mentioned attacks! Indeed, the cloud provider will never access these secret keys.
That’s great, but how can we decrypt our files? Where do we store the block keys?
The simplest solution would be to make users store their keys, but this would be unpractical since it would require a considerable amount of storage space.
We suggest to introduce a new component which we called metadata manager. The goal of this additional component is to store encrypted block keys (key management) and perform deduplication on encrypted blocks.
Thanks to this separation between data (encrypted blocks) and metadata (pointers to the actual data, block links, encrypted keys) we achieve the complete independence from the storage provider!
Putting all together, our system is structured as follows:
- We have a number of users who, before uploading to the Cloud, split data into blocks (the data chunking technique that best fits your data), encrypt blocks with convergent encryption and send to the server (or gateway) encrypted blocks together with their associated encrypted keys.
- A server/gateway that further encrypts blocks and keys with a set of unique and secret keys.
- A metadata manager that updates the metadata (in order to rebuild the structure of each file) , stores encrypted block keys and performs deduplication on encrypted blocks. Only those blocks that are not already stored are actually stored.
- A storage layer to store single blocks, which can be seen as files/objects of small size. Since our system is completely storage agnostic, we can implement the storage layer with any storage system/provider. For instance, we might use a cloud storage provider such as Amazon S3, a distributed storage, a local fileystem, etc.
The inverse process (download and decryption) is straightforward.
It’s important to point out that thanks to our design, no single component has enough information to decrypt blocks or keys. Indeed, blocks and keys are encrypted by users and the server/gateway.
While this solution might seem straightforward, it’s surprising to see how effective it’s and how well it fits for various use cases.
A typical use case might be the following: the employees of a big enterprise want to store their data in the Cloud while keeping data confidentiality and minimizing the amount of storage space consumed. In this context, the server/gateway can be deployed on the premises of the enterprise and the metadata manager can be deployed either locally or remotely (if we want to rely on a service provider for the deduplication and key management services).
Finally, we are currently working (halfway done) on the full implementation of this system (metadata manager is based on REDIS) and the results are very promising. The storage space required for our metadata is really minimal and doesn’t impact the gains of deduplication. Also, from a computational point of view, deduplication is very efficient (constant cost)!
What do you think of our solution? Any comments are welcome!