There is a common misconception that the principle known as Occam's Razor can be summed up as "all things being equal, the simplest answer is the right one." While that notion works wonders on a television show where important mysteries need to get resolved in 48 minutes or less, it's not really the point of this very important concept.
Occam's Razor in a modernized nutshell means "don't make things more complicated than they have to be." It's a powerful idea, and one that proves itself over and over again in the real world.
One example that comes to mind is Data Domain's Collection Replication. Collection replication is one of several forms of replication available when using Data Domain systems. The beauty of collection replication is the power of simplicity. Collection replication is an option when a customer wants to perform simple system-to-system mirroring, leveraging the efficiency of data deduplication to replicate vast quantities of data offsite for disaster recovery protection.
Data Domain takes advantage of that inherent simplicity by streamlining the replication process to the greatest extent possible. Because collection replication knows that every new, unique (i.e., non-duplicate) data segment written to the local filesystem must be transferred, it can send that new data across the wire immediately. Any data that is not new and unique does not need to be sent at all. In the case of moves or deletes, collection replication sends a minimal set of deduplication-aware 'housekeeping instructions' to the remote side. Once data in the form of a file or virtual tape arrives at the remote system, it is immediately visible and immediately available. This is true even while replication is still in progress and additional files or tapes are still inbound.
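To make that flow concrete, here is a toy sketch in Python. It is not Data Domain's implementation: the class names (Source, Replica), the SHA-256 fingerprints, and the tiny fixed-size segments are all assumptions chosen for illustration. It only shows the idea that new, unique segments cross the wire once, while duplicates, moves, and deletes cost little more than a small instruction.

```python
import hashlib

SEGMENT_SIZE = 4  # assumed tiny fixed-size segments, for illustration only


def fingerprint(segment: bytes) -> str:
    """Identify a segment by its content (SHA-256 is an assumption here)."""
    return hashlib.sha256(segment).hexdigest()


class Replica:
    """Hypothetical remote system: a segment store plus a file namespace."""

    def __init__(self):
        self.segments = {}        # fingerprint -> segment bytes
        self.files = {}           # file name -> ordered list of fingerprints
        self.bytes_received = 0   # segment payload that crossed the wire

    def receive_segment(self, fp, data):
        self.segments[fp] = data
        self.bytes_received += len(data)

    def receive_file(self, name, recipe):
        # The file becomes readable as soon as its recipe arrives.
        self.files[name] = list(recipe)

    def apply(self, instruction):
        # 'Housekeeping instructions': namespace changes only, no segment data.
        if instruction[0] == "move":
            _, old, new = instruction
            self.files[new] = self.files.pop(old)
        elif instruction[0] == "delete":
            del self.files[instruction[1]]

    def read(self, name):
        return b"".join(self.segments[fp] for fp in self.files[name])


class Source:
    """Hypothetical local system that deduplicates inline and mirrors to a replica."""

    def __init__(self, replica):
        self.replica = replica
        self.segments = {}
        self.files = {}

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), SEGMENT_SIZE):
            seg = data[i:i + SEGMENT_SIZE]
            fp = fingerprint(seg)
            if fp not in self.segments:
                # New and unique: store it and send it immediately.
                self.segments[fp] = seg
                self.replica.receive_segment(fp, seg)
            recipe.append(fp)  # duplicates contribute only a reference
        self.files[name] = recipe
        self.replica.receive_file(name, recipe)

    def move(self, old, new):
        self.files[new] = self.files.pop(old)
        self.replica.apply(("move", old, new))

    def delete(self, name):
        del self.files[name]
        self.replica.apply(("delete", name))


replica = Replica()
source = Source(replica)
source.write("a", b"AAAABBBBCCCC")   # three new segments are sent
source.write("b", b"AAAABBBBDDDD")   # only the DDDD segment is new
source.move("a", "c")                # no segment data is sent
```

After these calls the replica has received 16 bytes of segment data for 24 bytes of logical writes, and both files are readable on the remote side under their current names.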
If deduplication typically eliminates 98% of the data before it ever needs to be transferred, then collection replication ensures that the remaining 2% is moved as quickly as possible. And when compared to other Data Domain replication techniques, you can think of it this way: if multi-site replication functions like a network of highways, then collection replication is like a drag strip. If all you want to do is move deduplicated data from point A to point B, there isn't a faster way to do it. Period.
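A quick back-of-the-envelope calculation shows what that 2% means in practice. The 98% figure comes from the paragraph above; the 10 TB backup size and the 1 Gbps link are assumed values picked only to make the arithmetic easy, and the function names are hypothetical.

```python
def remaining_gb(logical_gb: float, eliminated_pct: float) -> float:
    """Data left to send after deduplication removes eliminated_pct percent."""
    return logical_gb * (100 - eliminated_pct) / 100


def transfer_seconds(gb: float, link_gbps: float) -> float:
    """Idealized transfer time: decimal gigabytes over a link rated in gigabits/s."""
    return gb * 8 / link_gbps


logical_gb = 10_000   # assumed: 10 TB of logical backup data
link_gbps = 1.0       # assumed: 1 Gbps replication link, no protocol overhead

to_send = remaining_gb(logical_gb, 98)             # 2% of 10 TB
full_time = transfer_seconds(logical_gb, link_gbps)
dedup_time = transfer_seconds(to_send, link_gbps)

print(f"to send: {to_send:.0f} GB")
print(f"without deduplication: {full_time / 3600:.1f} hours")
print(f"with deduplication: {dedup_time / 60:.1f} minutes")
```

Under these assumptions, 200 GB crosses the wire instead of 10,000 GB, which turns a transfer of roughly 22 hours into one of under half an hour.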
There are other powerful use-cases as well. Collection replication is a fantastic option for nearline or archival storage when you want to replicate millions of small files without incurring overhead for the metadata associated with each individual file. There isn't another deduplication-aware replication technology available that has this capability.
Of course, sometimes a simple environment grows more complicated over time. In that circumstance, Data Domain replication that started out as collection-based can be converted to any of our other replication topologies quickly and easily, and in most cases without having to resend data that's already been stored at the replication destination.
Collection replication is an enterprise-class capability, and it has a major advantage in the marketplace when it comes to serving the high-volume replication needs of the largest data centers. By comparison, neither IBM/Diligent nor Sepaton has deduplication-aware replication at all, although both have been promising it for some time. Other solutions, such as the joint EMC/Quantum product, the DL 3000 (formerly known as the DL3D 3000, a.k.a. the DXi7500), must first wait for their slower, post-process deduplication to complete, then for data replication to occur, and then for several other byzantine processes, including a 'namespace sync' and possibly a shell script, to execute before the data is available. Even then, the data on the remote side is painfully slow to read.
So why is collection replication unique to Data Domain? The answer has to do with the architecture. First, a system has to perform true inline deduplication to achieve the effective replication throughput of Data Domain's collection replication. All net-new data stored on disk must be immediately known with certainty to be unique. That factor alone disqualifies most of the competition. Second, and equally important, a system must have a log-structured file system in order to achieve the efficiency of Data Domain's collection replication. The log structure of the Data Domain file system effectively queues up data for replication and ensures write-order integrity by design, without the need for complex layers of additional replication code. Those two factors combined ensure that collection replication will remain unique to Data Domain for the foreseeable future.
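The second point, that a log doubles as a replication queue, can be illustrated with a minimal sketch. This is a generic append-only log in Python, not the Data Domain file system; AppendOnlyLog, LogReplica, and the batch parameter are hypothetical names introduced here. Because the replica only ever copies the records it has not yet seen, in the order they were written, its contents are always a prefix of the source log.

```python
class AppendOnlyLog:
    """Records are only ever appended, so position in the log is write order."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read_from(self, offset):
        return self._records[offset:]

    def records(self):
        return list(self._records)

    def __len__(self):
        return len(self._records)


class LogReplica:
    """Hypothetical remote copy that replays the source log in order."""

    def __init__(self):
        self.log = AppendOnlyLog()

    def catch_up(self, source_log, batch=None):
        # Everything past our current length is, by construction, the queue
        # of data still waiting to be replicated.
        pending = source_log.read_from(len(self.log))
        if batch is not None:
            pending = pending[:batch]   # simulate a partial transfer
        for record in pending:
            self.log.append(record)
        return len(pending)


source_log = AppendOnlyLog()
for record in ["seg-1", "seg-2", "file-a", "seg-3", "file-b"]:
    source_log.append(record)

replica_log = LogReplica()
first = replica_log.catch_up(source_log, batch=2)   # link drops after 2 records
partial = replica_log.log.records()
second = replica_log.catch_up(source_log)           # resume where we left off
```

Even after the interrupted first pass, the replica holds exactly the first two records the source wrote, never a later write without the earlier ones. No separate queue or ordering logic is needed; the log offset is the only replication state.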
