GreenBytes: GreenPower Blog


Pros and Cons of File-Based Deduplication vs. Block-Based

  
  
  
  

The question is often asked: What are the pros and cons of file-based deduplication vs. block-based? The reality is, it depends. It depends on the data and on how likely duplicates are to exist, whether at the file level or at the block level.

GreenBytes doesn't look at or understand file content in terms of file type. Instead, we identify changed blocks within a file and dedupe the data, whether it is served over SAN or NAS. Our technology can also determine whether those changes have already been sent across the WAN. Of course, this requires a dedupe engine on both sides of the replication scheme.

In simple terms, block-level deduplication will yield greater reduction as long as you have duplicate data. It will also incur greater performance overhead, relatively speaking, if all your data is unique. Having a system that can be customized to fit the types of data you have, and that drives toward the highest efficiency in both reduction and performance, is critical, and it is exactly what GreenBytes offers.



It can be said that ‘the devil is in the details,’ and given the wide variety of data, data types, archiving schemes and backup/DR policies I have seen over the years, I can make an argument for either approach, given the right conditions.



File-level deduplication is coarser and typically yields less overall reduction when many sub-file duplicates exist (two copies of a file that differ by a few words, for example). File-level dedupe may also have less metadata to handle and may appear to be faster, although that speed gap has been overcome by intelligent use of SSD.

Overall: file-level dedupe favors speed over data reduction. 

Block-level deduplication focuses on sub-file changes and typically results in lower storage consumption when sub-file duplicates exist in the data. It typically has a lot of metadata to handle, and the performance and scalability of an ever-growing metadata database is the bane of many dedupe appliances (GreenBytes and one or two others excepted).



Overall: block-level dedupe favors lower storage consumption. 
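To make the trade-off concrete, here is a minimal sketch in Python. It is not how GreenBytes or any particular appliance works; the fixed 4 KB block size, the SHA-256 fingerprints and every function name are assumptions chosen for illustration. It stores two copies of a file that differ by a single overwritten byte, once with whole-file dedupe and once with block-level dedupe, and reports how many bytes and how many index (metadata) entries each approach needs.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def digest(data: bytes) -> str:
    """Fingerprint used to detect duplicates (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def file_level_dedupe(files):
    """Keep each distinct whole file once; return (bytes stored, index entries)."""
    unique = {digest(f): len(f) for f in files}
    return sum(unique.values()), len(unique)


def block_level_dedupe(files, block_size=BLOCK_SIZE):
    """Keep each distinct fixed-size block once; return (bytes stored, index entries)."""
    unique = {}
    for f in files:
        for offset in range(0, len(f), block_size):
            block = f[offset:offset + block_size]
            unique[digest(block)] = len(block)
    return sum(unique.values()), len(unique)


def make_demo_files():
    """Two 40 KB files made of ten distinct blocks; the copy has one byte overwritten."""
    original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
    edited = bytearray(original)
    edited[5 * BLOCK_SIZE] ^= 0xFF  # in-place change inside block 5
    return [original, bytes(edited)]


if __name__ == "__main__":
    files = make_demo_files()
    print("file-level  (bytes, index entries):", file_level_dedupe(files))
    print("block-level (bytes, index entries):", block_level_dedupe(files))
```

With these two demo files, file-level dedupe keeps both copies in full (81,920 bytes, 2 index entries), while block-level dedupe keeps the ten original blocks plus the one changed block (45,056 bytes, 11 index entries). Less storage, more metadata: the same trade-off described above, in miniature.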



All this assumes localized storage. When dealing with remote replication over a WAN, block-level dedupe does the most to reduce both storage and WAN consumption and will typically yield greater benefits to an organization.
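The WAN case can be sketched the same way. The following is a hypothetical illustration only, not GreenBytes' replication protocol: `RemoteSite`, `replicate`, the 4 KB block size and the SHA-256 fingerprints are all assumed names and choices. The idea is that the sender asks the remote dedupe engine which block fingerprints it lacks and ships only those blocks, which is why an engine is needed on both sides.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


class RemoteSite:
    """Hypothetical receiving-side dedupe engine: a block store indexed by hash."""

    def __init__(self):
        self.blocks = {}

    def missing(self, hashes):
        """Return the subset of hashes this site does not hold yet."""
        return {h for h in hashes if h not in self.blocks}

    def receive(self, new_blocks):
        """Add newly shipped blocks (hash -> bytes) to the store."""
        self.blocks.update(new_blocks)

    def rebuild(self, recipe):
        """Reassemble data from an ordered list of block hashes."""
        return b"".join(self.blocks[h] for h in recipe)


def replicate(data, remote, block_size=BLOCK_SIZE):
    """Ship only blocks the remote lacks; return (recipe, block bytes sent).

    The byte count covers block payloads only; the small cost of
    exchanging the hashes themselves is ignored in this sketch.
    """
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    recipe = [hashlib.sha256(c).hexdigest() for c in chunks]
    by_hash = dict(zip(recipe, chunks))
    payload = {h: by_hash[h] for h in remote.missing(recipe)}
    remote.receive(payload)
    return recipe, sum(len(block) for block in payload.values())


if __name__ == "__main__":
    remote = RemoteSite()
    v1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
    v2 = bytearray(v1)
    v2[5 * BLOCK_SIZE] ^= 0xFF  # overwrite one byte in block 5
    _, first = replicate(v1, remote)
    recipe, second = replicate(bytes(v2), remote)
    print("first sync sent:", first, "bytes; second sync sent:", second, "bytes")
```

In this demo the first sync ships all ten blocks (40,960 bytes) and the second ships only the single changed block (4,096 bytes). A file-level scheme would have had to resend the whole 40 KB file for the same one-byte change.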


I keep coming back to "it depends." I suggest IT staff push their storage vendor to provide capability that fits within their IT services, rather than changing how they do business to accommodate the vendor. If a change is in fact made, it should deliver some tangible value for the organization: cost, efficiency, time, capabilities, etc.

As for SAN vs. NAS, I suggest you dedupe regardless of the storage service being provided, and GreenBytes allows you to do that.

 

