We're one month into it, and Spring is in full force here in Silicon Valley. This season is all about rebirth, both figurative and literal, and as such, it seems like the time for our competition to dust off their old FUD and try to put a new spin on it. Tony Asaro referred to one example in a recent post.
The latest example stems from a post by Sepaton's resident blogger, Jay Livens, on the heels of their announcement that they closed a sixth round of funding.
He states:
"One of the questions I often get asked is 'how do your products compare to Data Domain's?' In my opinion, we really don't compare because we play in different market segments. Data Domain's strength is in the low-end of the market, think SMB/SME while SEPATON plays in the enterprise segment. These two segments have very different needs, which are reflected in the fundamentally different architectures of the SEPATON and Data Domain products."
Let's examine the facts. Our systems are deployed in the datacenters of many large enterprises. For example, the major financial services firm where I did my last field installation is storing over 5PB across four geographically distributed datacenters. These large enterprise customers clearly see the value in Data Domain's well-designed inline approach that does not slow down backups or restores, enables immediate replication of deduplicated data, and makes the deduplication process transparent to users and applications. This is difficult to do in practice, since identifying duplicates inline is inherently very compute-intensive. Without careful thought about how to deduplicate inline, the resulting design makes inefficient use of CPU, memory and disk resources or just runs too slowly to be effective.
Vendors that cannot successfully navigate these challenges must go to market with post-process deduplication systems. While this may allow them to avoid the hard problems of delivering deduplication with speed and simplicity, the consequences are many. Directly stated, post-process deduplication systems are just like traditional storage systems. They have the same disk I/O bottlenecks as traditional VTLs - they require disproportionately high spindle counts just to land the native data to a disk cache. That data must then be read back from disk and deduplicated before any additional processes, such as replication or verification, can occur.
This last point is critical. In our experience, replication is the number one enterprise capability that our customers and prospects require in a deduplication storage solution, as it is key to enabling simple, reliable, and cost-effective disk-based DR. In this article from January 2006, announcing the ability to replicate the native data between their VTLs, Sepaton's president/CEO seems to share our belief in the importance of this capability:
"In recent years there have been numerous public accounts of lost or stolen tapes and the resulting corporate pressure to solve data loss. The risks are high - lost revenue and productivity, increase in customer dissatisfaction, fines and penalties and damaged corporate reputations," said Mike Worhach, president and CEO, SEPATON, Inc.
Customers looking for a deduplication and replication solution who do their research will learn that Sepaton still does not have any ability to replicate deduplicated data, while our ability to do this is proven across thousands of customers. Different definitions of "enterprise" indeed.
I completely agree that deduplicated replication is a very important feature, and I never cease to mention that when I talk to the folks at SEPATON. You are also correct that they don't have it TODAY, but I don't think you will be able to say that for very long. But feel free to say it as loud as you want for as long as you can.
The feature that THEY have that you are missing is global deduplication, which is also a very important feature for the enterprise, and I never cease to mention that when I talk to the folks at Data Domain. The lack of global deduplication will inevitably require a large enterprise to buy more dedupe systems and require them to do things that a customer using global dedupe would not have to do.
Let's look at your 5 PB customer. If we assume their data is evenly distributed across the 4 locations, that's 1.25 PB per location. Suppose they do a weekly full backup and daily incrementals, and store backups for 90 days (my default configuration unless customers need more or less). That's 13 full backups (1.25 * 13 = 16.25 PB), and 90 incrementals (1.25 * .10 * 90 = 11.25 PB), for a total of 27.5 PB of backups PER LOCATION. If we dedupe that at 20:1, you'll need 1.375 PB of disk to hold it. Since each DD690 can hold only 35 TB of usable capacity (48 TB raw), your customer will need 40 DD690s PER LOCATION to hold this many backups.
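For anyone who wants to check that sizing math, here is a minimal Python sketch of it. Every input is one of the assumptions stated above (even split across 4 locations, 13 fulls, 90 incrementals at 10%, 20:1 dedupe, 35 TB usable per DD690); the function and constant names are made up for illustration, and decimal units (1 PB = 1000 TB) are assumed.

```python
import math

# All inputs are the assumptions from the paragraph above, not measured data.
PRIMARY_PB_PER_SITE = 5 / 4      # 5 PB spread evenly over 4 locations
FULLS_RETAINED = 13              # weekly fulls kept for 90 days
INCREMENTALS_RETAINED = 90       # daily incrementals kept for 90 days
INCREMENTAL_FRACTION = 0.10      # each incremental assumed to be 10% of a full
DEDUPE_RATIO = 20                # assumed 20:1
USABLE_TB_PER_SYSTEM = 35        # usable capacity per DD690 quoted above
TB_PER_PB = 1000                 # decimal units (assumption)


def systems_per_site():
    """Return (fulls_pb, incrementals_pb, logical_pb, stored_pb, systems)."""
    fulls_pb = PRIMARY_PB_PER_SITE * FULLS_RETAINED
    incrementals_pb = (
        PRIMARY_PB_PER_SITE * INCREMENTAL_FRACTION * INCREMENTALS_RETAINED
    )
    logical_pb = fulls_pb + incrementals_pb        # backups before dedupe
    stored_pb = logical_pb / DEDUPE_RATIO          # disk needed after dedupe
    # Round up: a partially filled system still has to be bought.
    systems = math.ceil(stored_pb * TB_PER_PB / USABLE_TB_PER_SYSTEM)
    return fulls_pb, incrementals_pb, logical_pb, stored_pb, systems


if __name__ == "__main__":
    fulls, incs, logical, stored, systems = systems_per_site()
    print(f"fulls: {fulls:.2f} PB, incrementals: {incs:.2f} PB")
    print(f"logical: {logical:.2f} PB, stored after dedupe: {stored:.3f} PB")
    print(f"systems per site: {systems}")
```

Changing DEDUPE_RATIO is the quickest way to see how sensitive the system count is to the 20:1 assumption.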
Assuming a rolling weekly full backup and a nightly (10%) incremental of everything, they'll be backing up 178.5 TB of fulls (1250/7) and 125 TB of incrementals (1250/10) each night. That's 303.5 TB, or just over 7000 MB/s, assuming a 12 hour backup window. The good news is that this is only 175 MB/s when you divide it by 40, so you won't be throughput bound.
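The throughput estimate can be sketched the same way. The inputs are again the assumptions above (1250 TB per location, one seventh of the fulls per night, 10% incrementals, a 12 hour window, 40 systems); the names are hypothetical, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
# All inputs are the assumptions from the paragraph above, not measured data.
PRIMARY_TB_PER_SITE = 1250       # 1.25 PB per location
INCREMENTAL_FRACTION = 0.10      # nightly incremental assumed to be 10%
BACKUP_WINDOW_HOURS = 12
SYSTEMS_PER_SITE = 40            # from the capacity estimate
MB_PER_TB = 1_000_000            # decimal units (assumption)


def nightly_throughput():
    """Return (nightly_tb, aggregate_mb_per_s, per_system_mb_per_s)."""
    fulls_tb = PRIMARY_TB_PER_SITE / 7     # rolling weekly full: 1/7 per night
    incrementals_tb = PRIMARY_TB_PER_SITE * INCREMENTAL_FRACTION
    nightly_tb = fulls_tb + incrementals_tb
    window_seconds = BACKUP_WINDOW_HOURS * 3600
    aggregate = nightly_tb * MB_PER_TB / window_seconds
    per_system = aggregate / SYSTEMS_PER_SITE
    return nightly_tb, aggregate, per_system


if __name__ == "__main__":
    nightly, aggregate, per_system = nightly_throughput()
    print(f"nightly volume: {nightly:.1f} TB")
    print(f"aggregate: {aggregate:.0f} MB/s, per system: {per_system:.1f} MB/s")
```

The per-system figure assumes the load is spread perfectly evenly across all 40 systems, which is exactly the balancing problem discussed next.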
BUT, your 5 PB customer will need to take each location's 1.25 PB and divide it into 40 equally-sized backup sets to fully utilize the 40 systems that they bought. While they MAY (and I really do mean MAY) be able to do this when they first configure the system, things never stay the same, and they will need to constantly move backups between devices as the size of some backups grows. This means that in reality, they will probably need more than 40 systems to deal with all of the back-and-forth. (They will also need more than 40 if they don't get 20:1 dedupe.)
If you had global dedupe, the customer would still need to buy 40 systems, but they wouldn't have to care about which system backed up which backup. They could just send all backups to any of the 40 systems and you would do your magic. That's why global dedupe is important for the enterprise.
Now that the Data Domain folks are thoroughly seething, let me throw you a bone or two.
First, while other vendors (like SEPATON) may have global dedupe, none of them have a supported config that can back up AND DEDUPE at 7000 MB/s. Ignoring higher ingest rates and looking only at dedupe rates, as I would when comparing to an inline model, SEPATON would still need five "systems," each consisting of five nodes. If I go with FalconStor's four-node system (the biggest I believe they've actually deployed), they would also need five "systems," each consisting of four nodes. So your 5 PB customer would still need to sub-divide their backups into equally sized chunks. It's just that they would need to split them into 5 chunks, not 40.
Also, while your limitation increases the operational difficulty and management cost of enterprise customers, SEPATON's limitation of not having replication stops them dead-cold if they plan to replicate. That is, until they ship deduplicated replication. Then I'm sure you will (rightfully so) change your story to "Well, they HAVE it, but we have thousands of customers using ours. How many customers do they have?"
May you live in interesting times.
Posted by: W. Curtis Preston | 05/01/2009 at 04:19 PM