The luxury options available nowadays for custom storage solutions
In the last decade, open source software tools have been developed that implement data storage functions previously available only through very pricey proprietary storage devices.
Of course, at the same time, cloud storage offerings are becoming more and more compelling, whether in terms of price or of overall functionality, ease of use and integration flexibility. But if you still have reasons to deploy your own storage, you have at your disposal some brilliant software creations offering functions that used to be prohibitively expensive. Here are some examples:
The ZFS file system
The ZFS file system was created by Sun Microsystems and first released in 2006. It brings some groundbreaking functionality to cheap commodity hardware.
Because ZFS writes its blocks with a technique called "copy on write", it is able to create very "cheap" snapshots of the stored data. It can create thousands of snapshots in a file system with minimal overhead and at blazingly fast speed, so you can go back to a version of your files from every point in time at which you took a snapshot. And you could take a snapshot every minute or every second if you are compulsive enough.
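To make the idea of a "cheap" snapshot more concrete, here is a minimal sketch of copy on write in Python. It is a toy model and not how ZFS is implemented: the CowStore class and its methods are made up for illustration. The point it shows is that a snapshot only records references to existing blocks, and an overwrite puts new data in a fresh block instead of touching the old one.

```python
class CowStore:
    """Toy copy-on-write store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes, never modified in place
        self.live = []        # block ids that make up the current contents
        self.snapshots = {}   # snapshot name -> tuple of block ids
        self._next_id = 0

    def _alloc(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id

    def append(self, data):
        self.live.append(self._alloc(data))

    def overwrite(self, index, data):
        # Copy on write: new data goes to a fresh block, the old block stays.
        self.live[index] = self._alloc(data)

    def snapshot(self, name):
        # Only the block references are recorded, no data is copied.
        self.snapshots[name] = tuple(self.live)

    def read(self, snapshot=None):
        ids = self.live if snapshot is None else self.snapshots[snapshot]
        return b"".join(self.blocks[i] for i in ids)


store = CowStore()
store.append(b"hello ")
store.append(b"world")
store.snapshot("before-edit")
store.overwrite(1, b"ZFS")

current = store.read()            # contents after the edit
old = store.read("before-edit")   # contents as of the snapshot
print(current, old)
```

The snapshot costs one small tuple of references no matter how large the blocks are, which is why taking thousands of them is affordable in this model.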
ZFS implements caching mechanisms that give the sysadmin many options for improving performance when needed. The first level is the ARC (Adaptive Replacement Cache), which lives in RAM and speeds up read operations by caching as much data as fits in the system's memory. RAM is expensive, though, so a second-level cache called L2ARC can extend the ARC onto fast SSDs, which are slower than RAM but considerably faster than disk drives and cheaper than RAM. This lets you find a good balance between cost and read performance. Write operations can be sped up as well: the ZIL (ZFS Intent Log) records synchronous writes as log entries, and it can be placed on a dedicated fast SSD, so that a write is acknowledged quickly while the data is later committed to the pool's main disks as a transactional write.
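The read path through the two cache levels can be sketched roughly as follows. This is a simplified model that uses plain least-recently-used eviction rather than the actual ARC algorithm, and the TieredReadCache class, its slot counts and the dict standing in for the disks are all assumptions made for the example.

```python
from collections import OrderedDict


class TieredReadCache:
    """Toy two-level read cache: small fast tier, larger slower tier, then disk."""

    def __init__(self, ram_slots, ssd_slots, backing):
        self.ram = OrderedDict()    # stands in for the ARC
        self.ssd = OrderedDict()    # stands in for the L2ARC
        self.ram_slots = ram_slots
        self.ssd_slots = ssd_slots
        self.backing = backing      # dict standing in for the slow disks
        self.hits = {"ram": 0, "ssd": 0, "disk": 0}

    def _put_ram(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_slots:
            # Demote the least recently used entry to the SSD tier.
            old_key, old_value = self.ram.popitem(last=False)
            self.ssd[old_key] = old_value
            self.ssd.move_to_end(old_key)
            if len(self.ssd) > self.ssd_slots:
                self.ssd.popitem(last=False)

    def read(self, key):
        if key in self.ram:
            self.hits["ram"] += 1
            self.ram.move_to_end(key)
            return self.ram[key]
        if key in self.ssd:
            self.hits["ssd"] += 1
            value = self.ssd.pop(key)
        else:
            self.hits["disk"] += 1
            value = self.backing[key]
        self._put_ram(key, value)
        return value


disk = {n: f"block-{n}" for n in range(10)}
cache = TieredReadCache(ram_slots=2, ssd_slots=4, backing=disk)
for key in [0, 1, 2, 0, 2, 2]:
    cache.read(key)
print(cache.hits)
```

In this run the second read of block 0 is served from the SSD tier instead of the disk, which is the cost and performance trade-off described above.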
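ZFS implements in software the redundancy schemes found in expensive hardware RAID controllers: striping (comparable to RAID 0), mirroring (RAID 1, or RAID 10 when several mirrors are striped together), and the parity-based RAID-Z1 and RAID-Z2 (comparable to RAID 5 and RAID 6). Depending on the layout, you can target performance or data redundancy. It even offers a triple-parity mode (called RAID-Z3), where you may lose up to three storage devices and still be able to recover your data.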
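The principle behind parity-based redundancy can be shown with a few lines of Python. This sketch covers only the single-parity case with plain XOR, and it ignores how ZFS actually lays out its stripes; double and triple parity need additional parity blocks computed with more involved arithmetic. The xor_blocks helper and the sample data are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, each imagined to live on its own device.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)   # stored on a fourth device

# The device holding data[1] fails: rebuild its block from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt)
```

Because XOR is its own inverse, any single missing block equals the XOR of all the remaining blocks, parity included.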
End-to-end data corruption detection
ZFS stores a checksum for every block of data, so it is able to verify the integrity of each block whenever it is read. It automatically detects corrupted blocks and, if there is redundancy available, repairs them from a good copy, whether the damage was caused by the data buses, the storage controllers, the cabling or the storage device itself. Faulty hardware is therefore very unlikely to cause "silent" data corruption under ZFS.
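A rough sketch of the detect-and-repair step on a two-way mirror might look like this. SHA-256 is used here simply because it ships with Python, and the read_with_repair function and the list-of-copies model are assumptions for illustration, not ZFS internals.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def read_with_repair(copies, expected):
    """Return a copy that matches the checksum and overwrite the bad copies."""
    good = None
    for data in copies:
        if checksum(data) == expected:
            good = data
            break
    if good is None:
        raise IOError("all copies are corrupt, nothing to repair from")

    repaired = 0
    for index, data in enumerate(copies):
        if checksum(data) != expected:
            copies[index] = good   # self-heal from the good copy
            repaired += 1
    return good, repaired


original = b"important payload"
expected = checksum(original)            # stored separately from the data
mirror = [b"important pay1oad", original]  # first copy has a flipped character

data, repaired = read_with_repair(mirror, expected)
print(data, repaired)
```

Without redundancy the corruption would still be detected, because the checksum no longer matches, but there would be no good copy to repair from.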
Data deduplication and compression
ZFS is able to deduplicate and compress the data being written, which can yield significant space savings when the data deduplicates and compresses well. Deduplication works at the block level, so files with similar content use only a fraction of the storage space of their combined size: only the blocks that really differ are stored. You may store the same file as many times as you like and use little more than the storage space of a single copy.
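The following Python sketch illustrates block-level deduplication combined with compression. It is a toy model under stated assumptions: the DedupStore class is hypothetical, and SHA-256 and zlib are used because they are in the standard library, not because they mirror ZFS defaults.

```python
import hashlib
import zlib


class DedupStore:
    """Toy store that keeps each unique block once, in compressed form."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}   # digest -> compressed block
        self.files = {}    # file name -> list of digests

    def write(self, name, data):
        digests = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # Only blocks not seen before consume new space.
                self.blocks[digest] = zlib.compress(block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        return b"".join(zlib.decompress(self.blocks[d]) for d in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupStore()
payload = b"x" * 4096 + b"y" * 4096
for name in ("copy-1", "copy-2", "copy-3"):
    store.write(name, payload)

logical = 3 * len(payload)
print(logical, store.stored_bytes(), len(store.blocks))
```

Three copies of the same file end up as two unique blocks here, and each of those blocks is also compressed before it is stored.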
The Ceph storage cluster
Ceph is open source cluster storage software that offers tremendous features on commodity hardware. Ten years ago, something with these kinds of functions at this cost level was unheard of. Here are some points that stand out:
Ceph distributes data across multiple independent, self-monitoring daemons and nodes (OSDs, Object Storage Daemons), adding the requested redundancy and thus achieving fault tolerance with no single point of failure. It does not rely on a central metadata service to locate data; locations are computed instead, using Ceph's CRUSH algorithm. In case of node failure it can quickly and automatically heal and restore the redundancy level.
The cluster can scale as much as needed in order to meet the requirements by dynamically rebalancing its nodes.
Efficient use of the hardware
Ceph rebalances data across its nodes, so the load on the underlying hardware is distributed more evenly.
Ceph can provide Object storage, Block storage and Filesystems.
It can do thin provisioning, which saves space and makes it possible to clone virtual machines quickly.
It can provide "cheap" snapshots.
When a node goes down, Ceph will automatically start replicating the data in a new location in the background in order to self-heal and retain the configured redundancy level and performance.
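Two of the points above, locating data without a central metadata service and re-replicating after a node failure, can be sketched with a simple hash-based placement function. Note that this uses rendezvous hashing as a stand-in and is not Ceph's CRUSH algorithm; the place function, the OSD names and the object names are all invented for the example.

```python
import hashlib


def place(object_name, osds, replicas=3):
    """Pick `replicas` OSDs for an object by ranking per-OSD hash scores."""
    def score(osd):
        return hashlib.sha256(f"{osd}:{object_name}".encode()).hexdigest()
    return sorted(osds, key=score, reverse=True)[:replicas]


osds = [f"osd.{n}" for n in range(6)]
objects = [f"object-{n}" for n in range(100)]

# Any client can compute the placement on its own, no lookup service needed.
before = {name: place(name, osds) for name in objects}

# One OSD fails: recompute the placement with the surviving OSDs.
failed = "osd.2"
remaining = [osd for osd in osds if osd != failed]
after = {name: place(name, remaining) for name in objects}

# Objects that never lived on the failed OSD should not move at all.
moved = [n for n in objects if failed not in before[n] and before[n] != after[n]]
# Objects that lost a replica keep their surviving copies and gain a new one.
affected = [n for n in objects if failed in before[n]]
print(len(affected), len(moved))
```

Only the objects that had a replica on the failed OSD get a new location, so the background recovery traffic in this model is limited to restoring the missing copies.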
"Software is eating the world"?
It looks like the famous quote "Software Is Eating the World" by Marc Andreessen also describes what these brilliant pieces of software have done to the enterprise storage market. They are eating up a big chunk of this market, providing feature-rich solutions at a fraction of the cost.