Why? Simply because at ImmobilienScout24 we invest our time in automating the setup of our servers instead of in the ability to automatically recover a manually configured system. This sounds simple, but it is actually a large amount of work that cannot be done in a few days. However, if you persist and reach the goal, the rewards are much bigger: there is no need to be afraid of trouble, because our automation lets us reinstall our servers in a very short time.
The following idea can help to bridge the gap if you cannot simply automate all your systems but still want to have a simplified backup and disaster recovery solution:
Inject a layer of automation under the running system.
The provisioning and configuration of the automation layer should of course be fully automated. The actual system stays manually configured, exactly as it was before, but runs inside a Linux container (LXC, Docker, plain chroot ...). The resource overhead introduced by the Linux container and an additional SSH daemon is negligible for most setups.
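To make the idea more concrete, here is a minimal sketch in Python of what the automation layer might do for the simplest variant, a plain chroot. It only builds the list of commands; the function name, the path /srv/legacy and the choice to start an SSH daemon on port 2222 are assumptions made for illustration, not part of any real setup.

```python
import posixpath


def chroot_start_plan(legacy_root, service_cmd):
    """Return the commands that start one service of the legacy system
    inside a plain chroot located at legacy_root (hypothetical layout)."""

    def inside(path):
        # Path of a directory inside the legacy root file system.
        return posixpath.join(legacy_root, path)

    return [
        # The legacy system expects the usual kernel file systems.
        ["mount", "-t", "proc", "proc", inside("proc")],
        ["mount", "-t", "sysfs", "sysfs", inside("sys")],
        ["mount", "--bind", "/dev", inside("dev")],
        # Start the service with the legacy root as its "/".
        ["chroot", legacy_root, *service_cmd],
    ]
```

Calling chroot_start_plan("/srv/legacy", ["/usr/sbin/sshd", "-D", "-p", "2222"]) yields four commands that the automation layer could run as root. With LXC or Docker the container runtime takes over these steps, but the principle stays the same: the automated layer owns the startup, the legacy system itself is left untouched.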
The problem of backup and disaster recovery for systems is converted to a problem of backup and restore for data, which is fundamentally simpler because one can always restore into the same environment of a Linux container. The ability to run the backup in the automation layer also allows using smarter backup technologies like LVM or file system snapshots with much less effort.
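A snapshot-based backup run from the automation layer could look roughly like the following Python sketch. It assumes that the data of the legacy system lives on its own LVM logical volume; the function names, the snapshot size, the mount point and the archive path are illustrative assumptions.

```python
import shlex
import subprocess


def snapshot_backup_plan(vg, lv, archive, snap_size="1G",
                         mountpoint="/mnt/backup-snap"):
    """Return the commands for a backup of logical volume vg/lv via a
    temporary LVM snapshot (all names here are illustrative)."""
    snap = f"{lv}-backup-snap"
    snap_dev = f"/dev/{vg}/{snap}"
    return [
        # Freeze a point-in-time view of the data volume.
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap, f"/dev/{vg}/{lv}"],
        ["mkdir", "-p", mountpoint],
        # Mount the snapshot read-only and archive its content.
        ["mount", "-o", "ro", snap_dev, mountpoint],
        ["tar", "-czf", archive, "-C", mountpoint, "."],
        # Clean up: unmount and drop the snapshot again.
        ["umount", mountpoint],
        ["lvremove", "-f", snap_dev],
    ]


def run_plan(plan, dry_run=True):
    """Print every command and execute it unless dry_run is set."""
    for cmd in plan:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Running run_plan(snapshot_backup_plan("vg0", "legacy", "/backup/legacy.tar.gz")) only prints the commands. A real implementation would also need cleanup when a step fails, and a database inside the container would still have to be brought into a consistent state before the snapshot is taken.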
I don't mean to belittle the effort that it takes to build a proper backup and restore solution, especially for servers that have a high change rate in their persistent data. This holds true for any database like MySQL and is even more difficult for distributed database systems like MongoDB. The challenge of creating a robust backup and restore solution stays the same regardless of the disaster recovery question. Disaster recovery is always an additional effort that complements the regular backup system.
The benefit of this suggestion is that the effort for disaster recovery can be replaced with an investment in systems automation. That approach will yield much more value: a typical admin will use systems automation much more often than disaster recovery. Another way to see this difference is that disaster recovery is optimizing the past while systems automation is optimizing the future.
The automation layer can also be based on one of the minimal operating systems like CoreOS, Snappy Ubuntu Core or Red Hat Atomic Host. In that case new services can be established with full automation as Docker images, which opens up a natural road to migrating the platform to full automation while gracefully handling the manually set up legacy systems without disturbing the idea of an automated platform.
If you already have a fully automated platform but suffer from a few manually operated legacy systems then this approach can also serve as a migration strategy to encapsulate those legacy systems in order to keep them running as-is.
Update 12.03.2015: Added short info about Relax and Recover and explained better why it pays more to invest in automation than in disaster recovery.