Peer-2-Peer Backup Ideas

At the Desktop Summit I spent almost two hours in the yard talking to Michael Bell, and among many other things we came up with an idea for solving the problem of desktop backups in large environments.

The basic idea is to move away from a centralized approach and solve the problem with a peer-2-peer solution. The main benefit is scale-out vs. scale-up: as the number of desktops that require a backup grows, so does the number of desktops that provide backup space.

With a certain amount of redundancy one should be able to handle desktop systems "going away" unplanned (stolen, moved, switched off during a vacation, ...).

The decentralized approach also makes a lot of sense from an economic point of view: when ordering hundreds of desktop computers, one can often get an upgrade to the next bigger hard disk size for free or for a ridiculously low price (e.g. add 10€ to go from 160GB to 500GB). Even with a 3-fold redundancy overhead, this cost is several orders of magnitude lower than the cost of data center hard disk capacity, let alone tape capacity!
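A quick back-of-the-envelope calculation, using only the illustrative numbers above (100 desktops, a 10€ upgrade from 160GB to 500GB, 3-fold redundancy), shows how cheap the usable capacity works out:

```python
# Back-of-the-envelope cost of usable backup capacity in the
# peer-to-peer scheme, using the illustrative numbers from the text.

desktops = 100            # size of a typical bulk desktop order
extra_gb = 500 - 160      # capacity gained per desktop by the upgrade
upgrade_cost_eur = 10     # price of the 160 GB -> 500 GB upgrade
redundancy = 3            # 3-fold redundancy overhead

raw_gb = desktops * extra_gb          # total extra raw capacity
usable_gb = raw_gb / redundancy       # capacity left after redundancy
total_cost = desktops * upgrade_cost_eur
cost_per_usable_gb = total_cost / usable_gb

print(f"{usable_gb:.0f} GB usable for {total_cost} EUR "
      f"({cost_per_usable_gb:.3f} EUR/GB)")
# -> 11333 GB usable for 1000 EUR (0.088 EUR/GB)
```

Roughly 11TB of redundant backup space for 1000€, i.e. under a tenth of a Euro cent... sorry, under 0.1€ per usable gigabyte.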

So, what do we need for this idea?
  1. A backup suite that can run on desktop computers and do backups
  2. A storage provider that runs on desktop computers and contributes part of the local storage to a "storage cloud"
  3. Some central management that glues all of this together
Now comes the fun part: apparently all software components for this idea already exist in the Open Source world:
  • duplicity is end-user backup software that creates full, incremental and differential backups and stores them on a large variety of storage back-ends. It also comes with integrated encryption support, so the privacy of the backup data is guaranteed at the backup source.
  • Tahoe LAFS is a distributed storage cloud with very interesting attributes for our purpose:
    • Full encryption of the data
    • Access to data only via secret keys
    • Ability to differentiate between read-only and read-write access
    • Scale-out architecture with automation for node management
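To give a flavor of these attributes, here is a hypothetical sketch of the Tahoe-LAFS command line (node setup and introducer configuration omitted; exact output varies by version):

```shell
# Each desktop runs a node that joins the storage grid:
tahoe create-client          # creates ~/.tahoe, pointed at an introducer
tahoe start

# Files are encrypted client-side; what comes back is a capability
# string that acts as both the location and the secret access key:
tahoe put secrets.txt        # prints a capability such as URI:CHK:...

# Directories have distinct read-write and read-only capabilities,
# so one can hand out read-only access while keeping the write cap:
tahoe mkdir                  # prints a read-write cap (URI:DIR2:...)
```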
These are starting points on a longer list of things one could evaluate for this purpose (e.g. Hadoop etc.). However, I believe one should be able to hack up a working proof of concept with these components; duplicity learned about Tahoe in 2008.
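To illustrate how little glue the proof of concept might need, here is a hypothetical sketch of wiring duplicity to a Tahoe-LAFS grid via duplicity's tahoe back-end; the "backup" alias and the paths are made-up examples:

```shell
# One-time: give the user's backup directory in the grid an alias
tahoe create-alias backup

# Encrypted, incremental backup of the home directory into the cloud:
duplicity /home/user tahoe://backup/desktop-of-user

# Restore works the same way in reverse:
duplicity restore tahoe://backup/desktop-of-user /home/user/restored
```

The missing glue would mostly be central management: rolling out the nodes, distributing introducer and key material, and monitoring redundancy.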

Please contact me if you like this idea and/or would like to participate in a new Open Source project to build the missing glue.

BTW, this concept might be also an alternative for data center backups...

Finally, you might ask: why reinvent the wheel? After all, a Google search for "peer 2 peer backup open source" already reveals a few interesting links:
  • Backup P2P has been dead for 20 months, is not obviously cross-platform, and is apparently meant for a community of private or home users.

    I was looking more for something that a central IT department could roll out on all managed desktops in order to provide a reliable internal backup solution to all end-users without burdening the data center.
  • Crashplan, Mozy, rsync.net and others are also apparently not made for a managed setup in a local area network, which does not suffer from typical DSL issues such as slow uplinks.