LONGREAD: bringing out your dead data

Digital storage technology has a reputation for being permanent. Where a lot of analogue technologies like microfilm and magnetic videotape lose fidelity with every copy and degrade (leading to the characteristic 'ghosting' effect of old VHS tapes), information stored as ones and zeroes is theoretically reproducible and transmissible with perfect clarity.

Moreover, the price of computer storage has plummeted throughout the history of the computer era.

But the advent of the cloud also removed the final frontier to our perception of infinite storage: cost. Today paying for data storage isn't our direct concern.

The physical storage is a negligible proportion of the price of software as a service (SaaS) and platform as a service (PaaS) tools, so we're more tempted than ever to just keep everything - and disinclined to put formal migration or retention policies in place.

But the cloud has lulled us into a false sense of security. If you worked with computers during the 1980s or 1990s you might have stumbled across some of the limitations of digital storage, including its illusion of permanence.

Where words or paint on paper or canvas are still with us centuries or millennia later (given proper care and storage), if you saved a Lotus Notes or WordPerfect 1.0 document on a SyQuest or Zip disk back in the day and you now need it, it will cost real money and time to retrieve – if it's even possible.

Files for dummies

The first thing to understand when considering data lost in time because of old and unsupported storage media or file formats is the way a computer deals with computer files.

Step ahead

The metadata contained in a file header are usually stored at the start of the file but might be present in other areas too, often including the end, depending on the file format or the type of data contained.

Character-based (text) files usually have character-based headers, whereas binary formats usually have binary headers, although this is not a rule.

Text-based file headers usually take up more space, but being human-readable, they can easily be examined by using simple software such as a text editor or a hexadecimal editor.

As well as identifying the file format, file headers may contain metadata about the file and its contents. For example, most image files store information about image format, size, resolution and colour space, and optionally authoring information such as who made the image, when and where it was made, what camera model and photographic settings were used (Exif), and so on.
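
To make the idea of a header concrete, here is a minimal Python sketch that identifies a file's format from its first few bytes and, for PNG images, reads the width and height stored in the header. The signatures listed are the widely documented ones for these formats; the function names (identify_format, png_dimensions, sniff_file) are made up for this illustration, and real tools recognise far more formats than this.

```python
import struct

# Well-known signatures ("magic numbers") found at the start of some common formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive"),
]


def identify_format(header: bytes) -> str:
    """Return a best-guess format name based on the leading bytes."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def png_dimensions(header: bytes) -> tuple:
    """Read (width, height) from a PNG's IHDR chunk: bytes 16-23, big-endian."""
    if not header.startswith(b"\x89PNG\r\n\x1a\n") or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", header[16:24])
    return (width, height)


def sniff_file(path: str) -> str:
    """Read only the first 32 bytes of a file and identify its format."""
    with open(path, "rb") as f:
        return identify_format(f.read(32))
```

Note that only 32 bytes are read, however large the file is. That is the point of a header: the software learns what it is dealing with before it touches the bulk of the data.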

File metadata is commonly used by software reading or interpreting the file while it loads into memory, is saved, moved, etc.

It can also be used by an operating system to quickly gather information about a file without loading it all into memory, but doing so uses more of a computer's resources than reading directly from the directory information. For instance, when a graphical file manager has to display the contents of a folder, it must read the headers of many files before it can display the appropriate icons, and because those headers are located in different places on the storage medium, reading them takes longer.

A folder containing many files with complex metadata such as thumbnail information may require considerable time before it can be displayed.
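
The difference between the two approaches can be sketched in Python. This is an illustration only, with hypothetical function names: the first function relies on directory-level information (names and sizes), while the second also opens every file to read its first few bytes, which is the extra work a file manager does when it needs header metadata.

```python
import os
import tempfile


def list_from_directory(folder):
    """Use directory-level information only: names and sizes, no file contents read."""
    with os.scandir(folder) as entries:
        return sorted((e.name, e.stat().st_size) for e in entries if e.is_file())


def list_with_headers(folder, header_bytes=16):
    """Additionally open each file and read its leading bytes (one extra read per file)."""
    results = []
    with os.scandir(folder) as entries:
        for e in entries:
            if e.is_file():
                with open(e.path, "rb") as f:
                    results.append((e.name, f.read(header_bytes)))
    return sorted(results)


def make_demo_folder():
    """Create a temporary folder with two small files to try the functions on."""
    folder = tempfile.mkdtemp()
    with open(os.path.join(folder, "a.txt"), "wb") as f:
        f.write(b"hello world")
    with open(os.path.join(folder, "b.pdf"), "wb") as f:
        f.write(b"%PDF-1.4 demo")
    return folder
```

On a folder with thousands of files, the second function performs thousands of separate reads scattered across the disk, which is why thumbnail-heavy folders can be slow to display.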

The simplest case – plain text files, which usually carry little or no header at all – hasn't changed much in 40 years. There's an industry-standard way to represent characters as binary code (ASCII, since extended by Unicode), and because text is such a pivotal part of computing, almost every operating system, from your Windows desktop to high-end, industry-specific enterprise OSs, has the means to decode and display it.
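
As a small illustration, assuming the standard in question is ASCII, the following Python sketch shows the round trip between characters, their numeric codes and the bits actually stored. The helper names are invented for this example.

```python
def to_codes(text: str) -> list:
    """Encode text as ASCII and return the numeric code of each character."""
    return list(text.encode("ascii"))


def to_bits(text: str) -> list:
    """Return each ASCII character as an 8-digit binary string."""
    return [format(code, "08b") for code in text.encode("ascii")]


def from_codes(codes: list) -> str:
    """Turn a list of ASCII codes back into text."""
    return bytes(codes).decode("ascii")


print(to_codes("Data"))   # [68, 97, 116, 97]
print(to_bits("Data"))    # ['01000100', '01100001', '01110100', '01100001']
print(from_codes([68, 97, 116, 97]))  # Data
```

Because that mapping has been stable for decades, a plain text file written in the 1980s decodes the same way today.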

It gets trickier as the file headers contain information about more complicated functions like formatting of the text. The major consumer word processing applications all do it differently and there are differences between subsequent versions of the same products too.

So even if you can get hold of your old WordPerfect file, you might open it in your current word processor and get a bunch of gibberish along with the text, maybe not even all the text in the file.
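
When no software will open an old document cleanly, one crude fallback is to pull out whatever runs of readable characters survive among the formatting codes. The Python sketch below does that for plain ASCII text. It is not how any particular word processor or recovery service works, it loses all formatting, and the function name is hypothetical.

```python
import re


def extract_text_runs(data: bytes, min_length: int = 4) -> list:
    """Return runs of printable ASCII at least min_length characters long."""
    # \x20-\x7e covers space through tilde, i.e. the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [run.decode("ascii") for run in pattern.findall(data)]


# A made-up byte string standing in for an old binary document.
sample = b"\x00\x01Dear Sir,\xff\xfeab\x00Yours sincerely\x02"
print(extract_text_runs(sample))  # ['Dear Sir,', 'Yours sincerely']
```

The short run "ab" is dropped because it falls under the minimum length, which helps filter out stray bytes that merely happen to be printable.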

That's assuming you still have access to the file. If you had the foresight to migrate it from an old floppy disk to a CD or even an early cloud service it's much easier.

Even then, software companies come and go, and the file types they work with are sometimes proprietary, one-off and difficult to access. Compatibility with the files an application creates can be broken when operating systems move on, for instance, and the longer you wait, the bigger the problem becomes.

But more variables can stack up even after you clear that hurdle. Does your current OS know how to read the particular media? Do you still have the equipment to interface the two so a computer can load it? Did you accidentally overwrite or delete the file you want?

There are ways around almost any problem, but the time they take and the money they cost are commensurate with the extent of the loss or search.

Some data retrieval experts will tell you almost anything is recoverable given enough time and expertise and we never know what fixes technology might bring in the future.

After all, the US National Archives still hold the Nixon tapes in the hope the infamous missing 18 minutes (attributed to an error made by Nixon's secretary on September 29, 1973) might one day be recoverable.

The latest thing

We all do different things with different formats of data, so figuring out the technology mix to make sure old data is accessible is very much a case by case proposition.

But as 1plusX's Fiona Salmon warns, make sure correct data use is part of your culture before you even consider software or servers.

"One of the best ways to future-proof is to set up processes which mean the data is always in use," she says.

"Once businesses establish procedures that extract maximum value from their data, using it continually will future-proof it, even with relatively incremental changes over time."

TITUS's Mark Cassetta agrees.

"Although we tend to look at technology for the answer, the solution starts with people," he says. "One of the biggest opportunities organisations have today is to help people understand the different types of information they deal with. Information security is everyone's responsibility."

The answer, he thinks, is ensuring the data has been categorised in a way that makes it easy to find in the future, something he points out libraries have been doing for centuries.

"Books are filled with a ton of information but if they're not properly categorised it becomes difficult to find what you're looking for," Cassetta says. "In our data driven world we categorise information using metadata, and that's the key to future proofing it."

Once it's time to institute the systems to do so, what next? Do you have to resave documents every three years into a new format to keep them current? Strip out text in case a future word processor doesn't recognise the formatting?

File types like XML (Extensible Markup Language) define rules which encode documents so they're readable by both computers and people. There are literally hundreds of formats based on it, from online tools like RSS and SVG (which allows for interactive 2D graphics) to productivity tools like Microsoft Office and Apple iWork.
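
A short example shows why this matters for longevity. The record below is an invented XML document (the element names are made up for illustration): a person can read it in any text editor, and a few lines of Python using the standard library's XML parser can read it too.

```python
import xml.etree.ElementTree as ET

# A hypothetical document record; the tags describe the content they wrap.
RECORD = """<?xml version="1.0"?>
<document>
  <title>Quarterly sales</title>
  <author>J. Smith</author>
  <created>1998-03-31</created>
  <body>Units sold rose in the March quarter.</body>
</document>"""


def read_record(xml_text: str) -> dict:
    """Parse the XML and return each top-level element as a tag/text pair."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


record = read_record(RECORD)
print(record["title"])    # Quarterly sales
print(record["created"])  # 1998-03-31
```

Even if the program that wrote such a file disappeared, the labels inside it would still tell a future reader, human or machine, what each piece of data means.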

We're also deep into the open source era, which means applications and file types which welcome under-the-hood tinkering by software engineers are widespread.

So if your open source tool of choice – from your desktop copies of OpenOffice documents to your Linux web server – needs attention down the track, long after its original developers have closed up and moved on, it's theoretically still accessible with the right skillset.

Dying digitally

The internet has brought the concept of 'born digital' to popular consciousness. Just as plenty of businesses can be run completely in the cloud by far-flung colleagues on their phones or tablets who don't even need an office, a lot of data is created, stored, actioned and retrieved completely online, where file formats and 'reproducibility' are someone else's problem.

But the consumer PC era was in full swing for a decade or more before the internet went mainstream, which means a lot of data is sitting around on 5.25 inch floppy disks, 3.5 inch floppy disks, DAT tapes, early removable hard drives from companies like SyQuest and Iomega (Zip, Jaz) and more.

Following Apple's lead after the company dropped the floppy disk drive in 1998, most other PC manufacturers did the same. Late model Apple laptops now don't have an optical CD drive either, which – if history is any teacher – spells doom for another storage format (besides which, how often have you sent data to clients or suppliers on CD recently?).

Right now the equipment which lets you access those old disks and formats you have gathering dust is pretty widespread on eBay, and buying it to retrieve even a single file is cheaper than engaging a data-retrieval service. But time rolls on and one day it might be as hard to get a working CD drive as it is to find a 60s-era reel to reel tape drive now.

"Sure, vinyl records have seen a revival, but who has a Betamax or Laserdisc player anymore?" says Greg Andrzejewski, director of research and development at Madison, Wisconsin-based Gillware Data Recovery.

"For that matter, how many households still have their old collections of VHS tapes and a machine to play them? What about HD-DVD, the HD video format that lost the format wars to Blu-Ray? All sorts of media formats came and went or never caught on and line this ever-growing graveyard."

It was an era of widely dispersed and unregulated players all trying to make their own formats and systems the industry standard, and Dean Riach, Pacific Southeast lead for cloud provider Veritas, says data policies – which were a mainstay of the mainframe era – were disregarded when more distributed computing came about.

"The pace of technology and storage formats outpaced the ability to classify data," he says, "so a number of formats – particularly in the data protection space – were being adopted across tape formats.

"It was made worse by the fact organisations used backup engines from different vendors over the years and didn't have policies in place. The ability to recover data from long term retention became extremely difficult, even without taking tape media degradation into consideration."

Any kind of data retrieval process is going to take time and cost money and you have to ask yourself if it's worth the expense to your bottom line.

In some cases it's a catch-22 because you don't know if the data will give you the insight you want until after you go to the trouble and expense of getting it.

Sometimes the business case to do so is solid.

"An obvious example of gaining value from legacy content is using several years' worth of data to analyse trends in sales," Sue Clarke, senior analyst at market research firm Ovum says.

"It lets you more accurately predict spikes in demand for certain products so you can meet demand."

Today, of course, it's usually about more than getting old spreadsheets with sales or stock figures back. You might have years' worth of material, and under most circumstances you'd then spend even more time and money going through it all once you get it back.

Big data - a concept which didn't even have a name back when you created and saved those documents - might help.

UK predictive data management provider 1plusX is using it in several industries – letting recruitment advertisers better target candidates, for example. Where display recruitment ads are usually stored in individual data silos, 1plusX's system can combine them with site audience data and shine a light on the methods which work for various candidates.

"The intellectual capital of a business isn't just in its employees' brains, it's in the data the business has built up since it started running," Fiona Salmon, managing director of 1plusX says. "Running historical data through machine learning algorithms can generate what's called probabilistic data – accurate predictions which can be used to inform decisions."

As you'd expect, 1plusX says it's found the same thing in historical data – the problem is getting to it.

"Not only is there often too much data for humans to be able to analyse, we can be clouded by subjective factors and prejudices," Salmon says. "Using AI on historical data can add the equivalent of millions of data scientist man hours and make far better decisions faster."

The long arm of the law

As several companies in recent high-profile cases have found out, data retention may be a compliance or legal issue.

If your organisation is over a certain size the law is very specific about your behaviour – you have to make data breaches public knowledge and adhere to strict privacy laws, both of which can depend on access to historical data.

"A lot of enterprises retain business content because of a lack of understanding of the types of content that need to be kept for compliance purposes," Ovum's Sue Clarke says.

"Retention periods regularly run into decades, in certain circumstances over 100 years, so you need to carry out regular technology refreshes so the media your content is stored on is always supported and periodically update file formats to make sure you can access it.”

Even so, you need to be reasonably sure you can monetise the data – or that it can better inform the data you already have – before you go to the trouble of retrieving it.

Gillware Data Recovery generally finds data decreases in value the older it is, and Andrzejewski has found retrieving or converting data falls into one of three categories: documents needed for legal proceedings, scientific data obtained by old equipment and what's termed 'sentimental' data – photos, video, or documents someone wants for personal reasons, such as after losing a loved one.

You also need to consider the sample size of the data you're retrieving - especially if you want to apply AI or big data principles to it. In medical research, for instance, there are a huge number of ailments but often a small number of patients participating in research.

"The question then is whether it'd be cheaper to run the medical research and trials again or rescue the data," Salmon says.

"Is there something distinctive or at least rare which means the data's unavailable through any other method? Machine learning will only really be able to build upon it when there's sufficient volume to identify trends."

Out with the old?

There's yet another wrinkle to consider – you never know when or where old formats might still be used.

Even though Apple released OS X for the Mac in 2001, many publishers and design firms (one of the biggest markets for the Mac at the time) didn't upgrade because the most popular page layout application at the time – QuarkXPress – didn't release a Mac OS X compatible version until 2003.

A decade later most companies in creative design had long since dumped QuarkXPress for the cheaper, native Adobe product InDesign, but some holdouts were still using Mac OS 8 or 9, by then a ten-year-old OS, simply to stay with old versions of QuarkXPress.

Even weirder, floppy and Zip disks are still in use (though declining) in aviation.

Some companies distribute updates to navigation databases and terrain awareness and warning systems (TAWS) on Zip and floppy disks, from which they're uploaded into flight management systems through a device called an SSDTU (Solid State Data Transfer Unit).

All of which means you never quite know when your data is going to come in handy... or what it might be worth.

Future proofing

If you find yourself at the point where you have to pay money or spend time retrieving documents, the next logical question you'll probably ask is: how do I avoid this happening again?

You don't want to end up like finance company Morgan Stanley, which in 2006 paid a $US15 million fine for failing to retain emails after an earlier investigation because backup tapes had been overwritten.

"Over time, organisations can reduce the effort by engaging employees to help identify, categorise, and classify data as it's being created so the extraction and migration process down the road has more context and takes less time to manage," Clarke says.

And the best way to do that, according to Mark Cassetta, VP of product development at data management provider TITUS, is to have formal data policies.

"It's just good business," he says. "There are usually two opposing views about data – it's a valuable asset which can leverage historical data to support business decisions, but data can also be toxic, especially as it gets older.”

“The longer it sits, the more context changes, and it doesn't do any good to keep it around. The challenge many organisations face is not knowing if the data's valuable or toxic because they don't know what they have, where it is or why they're keeping it, all of which impact costs."

Cassetta adds if you have the right information governance processes, clear data retention guidelines and technologies in place to make data management easier, you can sleep easy knowing what should be kept and what can be safely deleted.
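
A retention guideline of that kind can be reduced to a simple rule: each category of data has a retention period, and anything past it becomes a candidate for deletion. The Python sketch below shows the idea. The categories and periods are invented for illustration and are not legal guidance, and the function name is hypothetical.

```python
from datetime import date

# Hypothetical retention periods in years; real ones depend on your industry and laws.
RETENTION_YEARS = {"financial": 7, "legal": 25, "marketing": 2}


def retention_decision(category: str, created: date, today: date) -> str:
    """Decide what to do with a record based on its category and age."""
    years = RETENTION_YEARS.get(category)
    if years is None:
        # Uncategorised data is the risky kind: nobody knows why it is being kept.
        return "review"
    try:
        expiry = created.replace(year=created.year + years)
    except ValueError:
        # Created on 29 February and the expiry year is not a leap year.
        expiry = created.replace(year=created.year + years, day=28)
    return "keep" if today < expiry else "eligible for deletion"


print(retention_decision("financial", date(2010, 1, 1), date(2018, 1, 1)))  # eligible for deletion
print(retention_decision("marketing", date(2017, 6, 1), date(2018, 1, 1)))  # keep
print(retention_decision("holiday photos", date(2015, 1, 1), date(2018, 1, 1)))  # review
```

The "review" outcome matters as much as the other two: data nobody has classified is exactly what can't be judged valuable or toxic.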

"After all, organisations are built on trust and reliability,” he says. “Securing the data created and protecting data collected is critical to maintaining trust with employees, customers, and partners."

Drew Turney is a freelance technology journalist

The views and opinions expressed in this communication are those of the author and may not necessarily state or reflect those of ANZ.
