W. Curtis Preston
Contributor

Data storage archive options: Batch, real-time and hierarchical storage management

Oct 02, 2023 · 6 mins
Data Center, Enterprise Storage

When choosing the appropriate archive option, enterprise storage professionals need to consider data preservation, accessibility, and resource optimization.

Broadly speaking, there are three approaches to archiving data. Selecting the right system hinges on technical capabilities as well as external factors such as budget constraints. Enterprise storage pros need to balance data preservation, accessibility, and resource optimization requirements as they weigh the various archive systems available in the market. Let’s take a deeper look into the different types of archive systems.

Traditional batch archive

With a traditional batch archive, data serves its purpose for a certain period before being tucked away in a safe repository, awaiting the possibility of being of some use in the future. The main idea behind this type of archive is to preserve data over an extended timeframe, while keeping costs at a minimum and ensuring that retrieval remains a breeze even years down the line. In this kind of archive system, each collection of data selected for archiving is given one or more identities, stored as metadata alongside the archived data. This metadata plays a pivotal role in locating and retrieving the archived information, with details such as project names, the tools used to create the data, the creator’s name, and the creation timeframe all forming part of this digital fingerprint. It’s noteworthy, though, that the servers where the data was stored do not typically make it into the metadata – a key distinction from backup.
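To make the role of that metadata concrete, here is a minimal Python sketch of a catalog entry and a lookup against it. The `ArchiveRecord` fields, the `find_records` helper, and the sample entries are all hypothetical, invented purely for illustration; a real archive product will have its own schema and search interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class ArchiveRecord:
    """Hypothetical metadata stored alongside one archived collection."""
    archive_id: str
    project: str
    creator: str
    tools: Tuple[str, ...]
    created_from: date
    created_to: date
    # Deliberately no hostname or server path: the collection is found by
    # what the data is, not by where it used to live.


def find_records(catalog, *, project=None, creator=None, tool=None):
    """Return the records that match every criterion that was supplied."""
    matches = []
    for rec in catalog:
        if project is not None and rec.project != project:
            continue
        if creator is not None and rec.creator != creator:
            continue
        if tool is not None and tool not in rec.tools:
            continue
        matches.append(rec)
    return matches


# Made-up sample entries, for illustration only.
catalog = [
    ArchiveRecord("a-001", "harbor-bridge-bid", "j.doe", ("cad",),
                  date(2019, 1, 7), date(2019, 6, 28)),
    ArchiveRecord("a-002", "city-hall-bid", "m.lee", ("cad", "spreadsheet"),
                  date(2021, 3, 1), date(2021, 9, 30)),
]

for rec in find_records(catalog, tool="spreadsheet"):
    print(rec.archive_id, rec.project)
```

Nothing in the record says which server held the data, so a search by project, creator, or tool still works years after that server has been retired.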

There are numerous scenarios where a traditional batch archive is the ideal choice. Consider a construction firm that assembles ad-hoc teams to bid on diverse projects. If the bid succeeds, project data remains on production storage for the project’s duration. However, in case of an unsuccessful bid, the data would transition to an archival system, serving as a reference point for future projects. The need to contain growth on production storage systems makes a traditional archive system the more pragmatic choice for housing historical bid data.

In a past engagement with a satellite company, I saw a similar archival approach in use. The company archived every satellite design after the satellite was built. This practice paid dividends when the government, which had ordered a satellite several years prior, returned with a request for more of the same. With a few keystrokes, the company accessed a treasure trove of designs in the archive – ranging from initial designs to final production blueprints.

Realtime archive

On the other end of the spectrum, we find realtime archives. In this type of archive, data created or stored in the production environment is instantaneously duplicated and sent to a secondary location for archiving purposes. Compliance and auditing are the primary use cases for realtime archives. Take, for instance, the classic example of journal email accounts in the era when on-premises email systems reigned supreme. As an email entered the mail system, an identical copy found its way into the journal mailbox, while the original landed in the recipient’s inbox.

This journal mailbox served as a reservoir accessible to auditors and managers seeking information for legal matters or fulfilling Freedom of Information Act (FOIA) requests. Access to realtime archives typically occurs through specialized portals equipped with granular search capabilities. It’s important to note that, unlike traditional archives, realtime archives don’t alleviate the pressure on production storage systems – unless, of course, they incorporate the features discussed later in this article regarding hierarchical storage management (HSM).
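The journaling behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular mail server implements journaling: the `JournalingMailSystem` class, its method names, and the dictionary message format are all hypothetical.

```python
import copy


class JournalingMailSystem:
    """Toy mail system that journals every message at delivery time."""

    def __init__(self):
        self.inboxes = {}   # recipient -> list of messages (production copy)
        self.journal = []   # separate archive copy of every message

    def deliver(self, message):
        # Copy to the journal first, so every accepted message is archived
        # no matter what later happens to the inbox copy.
        self.journal.append(copy.deepcopy(message))
        self.inboxes.setdefault(message["to"], []).append(message)

    def delete_from_inbox(self, recipient, index):
        # Users can delete their own copy; the journal is untouched.
        return self.inboxes[recipient].pop(index)

    def search_journal(self, keyword):
        # Stand-in for the auditor-facing search portal.
        keyword = keyword.lower()
        return [
            m for m in self.journal
            if keyword in m["subject"].lower() or keyword in m["body"].lower()
        ]
```

Note that the production inbox keeps its copy until the user deletes it, which is why this style of archive adds storage rather than freeing it.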

Now, with the rise of SaaS-based email systems and other cloud-based services, realtime archives have not become obsolete; rather, they’ve transitioned into the mainstream. Microsoft 365 and Google Workspace both offer realtime archiving features: Microsoft calls its version “retention policies,” while Google’s is Google Vault. If you have the proper access level, it’s a matter of a few clicks to instruct these systems to maintain an archival copy of every email and document generated, sent, or received via their platforms. Notably, Microsoft 365 even has a feature preventing any user, including administrators, from deleting this archive, rendering it truly immutable.

HSM-style archive

Among the diverse archive systems, the “HSM-style” archive is a standout. It leverages hierarchical storage management (HSM) to govern data storage – a term that has somewhat gone by the wayside, even though the concept remains. As data ages or sees reduced access, it makes financial sense to relocate it to more cost-effective storage options. When users no longer require daily access to data, or when data becomes dated but must be retained for compliance, organizations start exploring alternatives like storing this data on scalable object storage systems or dedicated cloud-based cold storage. Additionally, some solutions allow archive data migration to tape for off-site and offline storage, with the notion that tape provides enhanced security by being virtually inaccessible unless explicitly needed. Moreover, tape often offers a lower cost per gigabyte compared to most other storage systems. Tape also excels at long-term data retention.

One common implementation of this concept applied HSM to real-time email archives, a prevalent practice in the early 2000s. As user mailboxes swelled with HTML-formatted emails and hefty attachments, organizations were faced with burgeoning storage requirements. Admins could take a proactive stance, specifying that emails older than a certain age or exceeding a particular size be moved to the archive and deleted from the primary system.
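A policy of that kind boils down to a simple filter. The sketch below is illustrative only: the `split_by_policy` function, the default thresholds (one year, 10 MB), and the message dictionary layout are assumptions, not settings from any specific email archive product.

```python
from datetime import timedelta


def split_by_policy(messages, now, max_age_days=365,
                    max_size_bytes=10 * 1024 * 1024):
    """Split messages into (to_archive, to_keep) by age or size.

    A message is archived if it is older than max_age_days OR larger than
    max_size_bytes. Each message is a dict with "received" (datetime) and
    "size" (bytes) keys; that layout is hypothetical.
    """
    cutoff = now - timedelta(days=max_age_days)
    to_archive, to_keep = [], []
    for msg in messages:
        if msg["received"] < cutoff or msg["size"] > max_size_bytes:
            to_archive.append(msg)
        else:
            to_keep.append(msg)
    return to_archive, to_keep
```

In a real system, the messages in the first list would be written to the archive and verified there before being deleted from the primary mail store.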

In recent times, the focus has shifted from email to unstructured data stored on networked file servers. While analysts underscore the decreasing cost per gigabyte, the ever-expanding need for storage space is undeniable. Hence, any opportunity to move data off expensive, high-performance production storage becomes invaluable.

HSM-style archives typically relocate data based on age or the last access timestamp. As data migrates from the filesystem to the archive, it often leaves behind pointers or stubs in the source system, facilitating automated retrieval when required. Some systems, however, opt for a robust search engine instead of stubs. This approach enhances cross-system compatibility but occasionally falls short when users remember where they stored data but not its content, making searches less effective.
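The stub approach can be illustrated with a small Python sketch. It is a simplification under stated assumptions: real HSM products typically use filesystem-level stubs and recall files transparently, whereas this toy version writes a plain text file with a hypothetical `.stub` suffix and needs an explicit `recall` call. The function names and the suffix are invented for illustration.

```python
import shutil
import time
from pathlib import Path

STUB_SUFFIX = ".stub"  # hypothetical naming convention for pointer files


def migrate_cold_files(source_dir, archive_dir, max_idle_days, now=None):
    """Move files not accessed within max_idle_days; leave a stub behind."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    source_dir, archive_dir = Path(source_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    migrated = []
    for path in sorted(source_dir.iterdir()):
        if not path.is_file() or path.name.endswith(STUB_SUFFIX):
            continue
        if path.stat().st_atime >= cutoff:
            continue  # accessed recently enough; stays on production storage
        target = archive_dir / path.name
        shutil.move(str(path), str(target))
        # The stub records where the data went so it can be recalled later.
        stub = path.with_name(path.name + STUB_SUFFIX)
        stub.write_text(str(target))
        migrated.append(path.name)
    return migrated


def recall(stub_path):
    """Bring an archived file back to its original location."""
    stub_path = Path(stub_path)
    target = Path(stub_path.read_text())
    original = stub_path.with_name(stub_path.name[: -len(STUB_SUFFIX)])
    shutil.move(str(target), str(original))
    stub_path.unlink()
    return original
```

One caveat with this sketch: many filesystems are mounted with options that limit access-time updates, so a policy keyed on last access may need to fall back to the modification time.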

So, as you look into the world of archive systems, remember that each brings its unique strengths and considerations to the table. Whether it’s the traditional batch archive, the real-time archive, or the HSM-style archive, the choice ultimately hinges on your specific needs and the interplay of technical and non-technical factors within your organization. It’s an art and a science – a delicate balance of preservation, accessibility, and resource optimization.

W. Curtis Preston—known as Mr. Backup—is an expert in backup, storage, and recovery, having worked in the space since 1993. He has been an end-user, consultant, analyst, product manager, and technical evangelist.

He’s written four books on the subject, including Backup & Recovery, Using SANs and NAS, and Unix Backup & Recovery.

The opinions expressed in this blog are those of W. Curtis Preston and do not necessarily represent those of Foundry, its parent, subsidiary, or affiliated companies.
