Thursday, September 15, 2011

Thoughts on Bitcasa

Bitcasa has been getting a lot of attention in my Google Plus circles the past week or so; I suspect this is because it was in the running for TechCrunch's Disrupt prize (but ultimately lost to something called "Shaker", which appears to me to be some sort of virtual bar simulation). Bitcasa claims to offer "infinite" storage on desktop computers for a fixed monthly fee. I've yet to see any solid technical information on how they're doing this, but it seems to me that they're planning to do this by storing your data in the cloud and using the local hard drive as a cache.

There's nothing earthshattering about this; I set up Tivoli Hierarchical Storage Manager to store infrequently used files on lower-priority storage media (either cheap NAS arrays or tape) four years ago with my last employer, and the technology was fairly mature then. Putting the HSM server, data stores, or both in the cloud was possible even then and should be even more so now that cloud services are far more mature than they were then. So while there are obviously issues to sort out, this isn't a big reach technically.

More interesting to me is how they plan to provide "infinite" storage. Given the promise of infinite storage, most users will never delete anything; my experience is that users don't delete files until they run out of space, and if they really do have "infinite" storage that won't ever happen. The rule of thumb I recall from storage planning 101 is that storage requirements double every 18 months. According to Matthew Komorowski, the cost of storage drops by half every 14 months, so their cost to provide that doubled storage should slowly decline over time, but that margin is fairly thin and may not be sufficient to cover the exponentially growing complexity of their storage infrastructure over time. They'll also have to cope with ever-increasing amounts of data transit, but I can't find good information just now on the trend there, in part because transit pricing is still very complicated.

More interesting to me is that Bitcasa appears to be claiming that they will use deduplication to reduce the amount of data transferred from clients to the server. This is, itself, not surprising. The surprising thing is that they also claim that they will be using interclient deduplication; that is, if you have a file that another client has already transferred, they won't actually transfer the file. I think they're overselling the savings from interclient deduplication, though. I may not be typical, but the bulk of the file data on my systems seems to fall into a few categories: camera raws for the photos I've taken; datasets I've compiled for various bulk data analyses (e.g. census data, topographic elevation maps, the FCC licensee database); virtual machine images; and savefiles from various games I play. The camera raw files (over 10,000 photographs at around 10 megs each) are obviously unique to me, and as I rarely share the raws their opportunity to leverage deduplication gain there is essentially nil. As to the datasets, they themselves are duplicative (most of them are downloaded from government sources), of course, but the derived files that I've created from the source datasets are unique to me and are often larger than the source data. So, again, only limited gain opportunity there. Most of my virtual machine images are unique to me, as I've built them from the bottom up myself. And obviously the saved games are unique to me. If I were to sign up my laptop and its 220 GB hard drive (actually larger than that but I haven't gotten around to resizing the main partition from when I reimaged the drive onto a newer, larger drive after a drive crash a couple months ago, so Windows still thinks it's a 220 GB drive) onto Bitcasa, they'd probably end up having to serve me somewhere around 170 to 200 GB of storage, depending mainly on how well it compresses. (Much of the data on my machine is already compressed.)

Even my music (what there is of it; I keep a fairly small collection of about 20 gigabytes) doesn't dedup well. I know, I've tried to dedup my music catalog several times over the past decade plus and my experience is that "identical" songs are often not identical at the bit level; the song might be the same but the metatags differ in some manner that makes them not bit-compatible. Or the songs might be compressed with different bit rates or even different algorithms; I have several albums that I've ripped as many as five times over the years with different rippers. Even if you rip the same song twice from the same disc with the same settings on the same ripper it still might end up with a different bitstream, if the disc has a wobbly bit on it.

If Bitcasa assumes that most of its clients will be "typical" computer users, most of whose data is "stuff downloaded from the Internet", then I suppose they can expect significant deduplication gain, especially for videos, music, and especially for executables and libraries (nearly everyone on a given version of Windows will have the same NTOSKRNL.EXE, for example. although in general OS files cannot be made nonresident anyway without affecting the ability of the computer to boot). The problem I think they're going to run into is that many of their early adopters are not going to be like that. Instead, they're going to be people like us: content creators far more than content consumers, whose computers are all filled to the brim with stuff we've created ourselves out of nothing, unlike anything else out there in the universe.

Then there's the whole issue of getting that 220 GB of data on my machine to their servers. It took my computer nearly 40 days to complete its initial Carbonite image, and that's without backing up any of the executables. I have some large files that I keep around for fairly infrequent use; if Bitcasa decides to offline one of those and I end up needing it, I might be facing a fairly long stall while it fetches the file from their server. Or, if I'm using the computer off the net (something I often do), then I'm hosed, and if I'm on a tether (which is also fairly frequent) then I could be facing a download of a gig file (or larger) over Verizon 3G at about 200 kbps. Good thing Verizon gives me unlimited 3G data!

I also wonder how Bitcasa will interact with applications that do random access to files, such as database apps and virtual machine hosts. I use both of these on a fairly regular basis, and I think it might get ugly if Bitcasa wants to offline one of my VHDs or MDFs (or even my Outlook OST or my Quickbooks company file). If they are planning to use reparse points on Windows the way Tivoli HSM does, files that will need to be accessed randomly, or which need advanced locking semantics, will have to be fully demigrated before they can be used at all.

In addition to all this, the use of cryptographic hashes to detect duplicates is risky. There's always the chance of hash collision. Yes, I know, the odds of that are very small with even a decent sized hash. But an event with even very low odds will happen with some regularity if there are enough chancves for it to occur, which is why we can detect the 21cm ultra-fine hydrogen spin transition at 1420.405752 MHz: this transition occurs with a probability of something like one in a billion, but we can detect it fairly easily because there are billions upon billions of hydrogen atoms in the universe. With enough clients, eventually there's going to be a hash collision, which will ultimately result in replacing one customer's file with some totally different file belonging to some other customer. Worse yet, this event is undetectible to the service provider. (Kudos to the folks at spideroak for pointing this out, along with other security and legal concerns that interclient deduplication presents.)

So while I think the idea is interesting, I think they're going to face some pretty serious issues in both the short term (customer experience not matching customer expectation, especially with respect to broadband speed limitations) and long term (storage costs growing faster than they anticipate). Ought to be interesting to see how it plays out, though. I think it's virtually certain that they'll drop the "infinite" before too very long.

(Disclaimers: Large portions of this post were previously posted by me on Google Plus. I am not affiliated with any of the entities named in this post.)