Data storage: Most of it is junk

Computerworld reports that we added 281 exabytes of data to the world’s stored information in 2007.

An exabyte is a billion gigabytes, which means that in a year we added about 47GB of data (281 billion gigabytes divided by 6 billion people) for each person in the world. That is more data than a 30 metre high stack of books.
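As a quick check on the per-person figure, here is a back-of-the-envelope sketch. It assumes the decimal definition of an exabyte (10^18 bytes) and a round population of 6 billion; the function name is made up for illustration.

```python
EXABYTE = 10**18   # bytes, decimal (SI) definition
GIGABYTE = 10**9   # bytes, decimal (SI) definition


def per_person_gb(total_exabytes: float, population: int) -> float:
    """Return the average share of data per person, in gigabytes."""
    return total_exabytes * EXABYTE / population / GIGABYTE


# 281 exabytes spread across roughly 6 billion people
share = per_person_gb(281, 6_000_000_000)
print(f"{share:.1f} GB per person")
```

On those assumptions the share comes out at a little under 47GB per person.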

It’s a lot of information.

Or maybe not. Storage experts believe that anywhere from 80 to 90 percent of stored data is anything but valuable.

Worthless storage, junk information

In 2002 I spoke to Rob Nieboer, who at the time was StorageTek’s Australian and New Zealand storage strategist. He said the vast bulk of data stored on company systems is worthless.

He said, “I haven’t met one person in the last three years who routinely deletes data. However, as much as 90 percent of their stored data hasn’t been accessed in months or years. According to Strategic Research, when data isn’t accessed in the 30 days after it is first stored there’s only a two percent chance it will get used later.”
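To get a feel for how much of a file system falls into that untouched category, something like the following sketch could list files that have not been read for 30 days. The function name and the 30-day default are illustrative choices, not anything Nieboer described, and the result is only as good as the file system’s access-time records, which some systems update lazily or not at all.

```python
import time
from pathlib import Path
from typing import List, Optional


def stale_files(root: str, days: int = 30, now: Optional[float] = None) -> List[Path]:
    """List files under `root` whose last access time is more than `days` ago."""
    reference = time.time() if now is None else now
    cutoff = reference - days * 86400  # 86400 seconds in a day
    stale = []
    for path in Path(root).rglob("*"):
        # st_atime is the last access time reported by the file system
        if path.is_file() and path.stat().st_atime < cutoff:
            stale.append(path)
    return stale
```

Calling `stale_files("/some/share")` (a hypothetical path) returns the candidates for archiving or deletion; it does not touch the files themselves.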

At the same time, companies often store the same files many times over in a single file system. Nieboer said it’s not unusual for a single system to hold as many as 15 separate copies of the same file.
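One common way to spot this kind of duplication is to group files by a hash of their contents. The sketch below is a minimal illustration of that idea, not a description of any product mentioned here; the function name is hypothetical and it reads each file fully into memory, so it suits small experiments rather than production storage.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicates(root: str) -> Dict[str, List[Path]]:
    """Group files under `root` by SHA-256 digest and keep groups of two or more."""
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files with identical contents produce identical digests
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Each entry in the returned dictionary is a set of files with identical contents, so a group of 15 paths would be the situation Nieboer described.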

Data storage Parkinson’s Law

According to Rosemary Stark (also interviewed in 2002 when she was Dimension Data’s national business manager for data centre solutions), storage obeys a version of Parkinson’s Law.

She said, “It’s a case of if you build it, they will come. Put together a system with 2GB of storage and pretty quickly it will fill up with data. Buy a system with 200GB of storage and that will also fill up before too long.”

Like Nieboer, Stark said there’s a huge problem with multiple copies of the same information, but she estimated the volume of unused archive material to be closer to 80 percent. She added that this 80 percent isn’t all junk: “It’s like the paper you keep on your desk. You don’t want it all, there may be a lot you can safely throw away but sometimes there are things you need to keep just in case you need them again later.”

Needles and haystacks

Although many companies focus on the economic cost of storing vast amounts of junk information, there’s a tendency to overlook the performance overhead imposed by unnecessary data. In simple terms, computer systems burn resources ploughing through haystacks of trash to find valuable needles of real information.

There are other inefficiencies. Stark said she has seen applications, databases for example, that use 300 terabytes of storage even though the actual data might only be 50 terabytes. This happens when systems managers set aside capacity for anticipated needs. The situation is a little like a mother buying outsize clothes for a child on the grounds that the youngster will eventually grow into them.
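Stark’s example can be put into numbers with a small sketch. The function name is made up for illustration and the figures are simply the ones she quoted.

```python
def utilisation(used_tb: float, allocated_tb: float) -> float:
    """Return the fraction of allocated storage that holds actual data."""
    if allocated_tb <= 0:
        raise ValueError("allocated_tb must be positive")
    return used_tb / allocated_tb


used, allocated = 50, 300  # terabytes, from the example above
print(f"utilisation: {utilisation(used, allocated):.1%}")
print(f"idle capacity: {allocated - used} TB")
```

On those figures only about 17 percent of the allocated space holds real data, leaving 250 terabytes sitting idle.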

Nieboer said there are inherent inefficiencies in various systems.

Mainframe disks are typically only 50 percent full. On Unix systems, disks might be only 40 percent full, and on Windows this falls to 30 percent.