Guest Post by Alistair Croll: Gathering Moss, Data Gravity, and Context

Dave McCrory first proposed the idea of data gravity a few years ago. Since that time, he’s expanded and refined the concept, adding elements of Shannon’s law and the idea that data transmission is a form of “friction.” It can be complicated stuff. So I’m going to offer a couple of simple examples of what I believe to be one of the most important notions in computing today: why data wants to be together.

In computing, the von Neumann bottleneck is a basic limit on how fast computers can be. Essentially, it says that the speed with which information can get from memory (where data is stored) to the processor (where data is acted upon) is the limiting factor in computing speed.

This is the reason that when you buy a chip, there's a local Level 1 cache, and often a Level 2 cache, right on the chip. Even the time it takes for electrical signals to travel the tiny distance between the Central Processing Unit (CPU) and the computer's main memory imposes a delay on each operation. And that delay adds up quickly when a computer is performing billions of operations a second.

The bottleneck doesn’t just exist on a computer’s motherboard. Microsoft researcher Jim Gray spent much of his career looking at the economics of data. He concluded that, compared to the cost of moving bytes around, everything else is free. Getting information from, say, your hard drive to a cloud server takes time—as anyone who’s ever uploaded a video will agree. And within a data center, moving bytes from a shared hard drive or storage service to a computing service has a cost.

Having all the data in one place would mean the least amount of moving it around, and as a result, the least cost. This is the fundamental principle of data gravity as Dave first explained it. Just as two planets might compete, gravitationally, for a third planet between them, so two data centers or cloud providers “compete” to pull data towards themselves. If all else were equal, we’d wind up with one big data center.

But all else is seldom equal.

There are plenty of reasons you may not want all your data in one place—privacy, legislation, and the cost of a service provider’s storage fees. Dave searched for a model that could explain these complexities, and the friction that stops data from centralizing. In 2012, during a fascinating phone call I wish I’d recorded, Dave announced that he’d found such a model in the way nations negotiate trade tariffs and balance-of-trade agreements between countries and cities. This model actually borrows from gravitational theory.

Since that time, Dave has continued to refine his thinking on the subject, eventually realizing that it was in fact a form of information theory. He spoke about this at Cloud Connect earlier this year.

Let me offer, as Dave often does in presentations, a piece of raw data: 32.

On its own, this isn’t very useful. It takes context to make it relevant. When I tell you that the number refers to degrees Fahrenheit, you can now put it into context. That piece of metadata has made the data more useful. It’s now informative.

You’ve brought your own context to this, too. As a reader, you know that water is common on your planet; and that 32 degrees Fahrenheit is the temperature at which water freezes. You’ve brought all your memories about snow, and ice, and winter along with you.

If I now tell you that it is 32 degrees Fahrenheit in Montreal right now, that's even more context—and it's far more useful if you happen to be on a plane headed there. You've got useful knowledge. And if I also tell you it's July, then the knowledge is surprising and unusual.

The more context you have about information, the more useful and relevant it is. In networking and information theory, Shannon’s law (technically, the Shannon-Hartley Theorem) governs how much information can be squeezed down a wire in the form of bits and bytes. The way that we squeeze more throughput out of a network is by adding context to each end. If I tell you that someone on the other end of a phone line will tell you the current temperature, in degrees Fahrenheit, in Montreal, then all that person needs to say is “32.” I’ve already given you the context to make it useful. Of course, I’ve also reduced the utility of that connection—I can’t use it to tell you, say, the current price of a stock.
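(For the mathematically curious, here's a rough sketch of the Shannon-Hartley limit itself; the channel numbers are made up purely for illustration.)

```python
import math

def shannon_capacity_bps(bandwidth_hz: float, signal_to_noise: float) -> float:
    """Shannon-Hartley: the maximum error-free bit rate of a noisy channel."""
    return bandwidth_hz * math.log2(1 + signal_to_noise)

# A hypothetical 1 MHz channel with a signal-to-noise ratio of 1000 (30 dB)
# tops out at roughly 10 megabits per second, no matter how clever the endpoints are.
print(shannon_capacity_bps(1_000_000, 1_000))  # ~9,967,226 bits per second
```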

(As a sidenote, the Voyager spacecraft carried a gold disc that an alien might play. The disc also included a ton of information about what the disc was, how to play it, and so on—because an alien race would have zero context. It was the ultimate in uncompressed, inefficient data that had maximum utility for a recipient.)

So sending data down a wire, at a simple and rarefied level, is often about the tradeoff between efficiency (which we get through compressing data and caching context at either end) and utility (which we get by letting that connection handle many things at once, without context at the end.) This is how wide-area-network acceleration works; it’s also how a ZIP file stores information.
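Here's a toy illustration of that tradeoff, using Python's zlib with a preset dictionary; the weather-report phrasing is just an invented example. When both ends already share the context (the dictionary), far fewer bytes travel down the wire, but the channel is now only good for messages that resemble that context.

```python
import zlib

# Context agreed on ahead of time by both the sender and the receiver.
shared_context = b"The current temperature in Montreal in degrees Fahrenheit is "

message = b"The current temperature in Montreal in degrees Fahrenheit is 32"

# Without shared context: the message must be compressed on its own.
standalone = zlib.compress(message)

# With shared context: both ends preload the same dictionary.
compressor = zlib.compressobj(zdict=shared_context)
with_context = compressor.compress(message) + compressor.flush()

decompressor = zlib.decompressobj(zdict=shared_context)
assert decompressor.decompress(with_context) == message

print(len(message), len(standalone), len(with_context))
# The context-aware version is much smaller, but it's useless to any
# receiver that doesn't hold the same dictionary.
```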

Consider the following file.

[Image: a file with no name and no extension]

You don’t know anything about this file. There’s no file extension, and no useful information in the name. What if I change its name as follows?

[Image: the same file, renamed with a DSC prefix]

Now you might have some insight. The letters DSC stand for Digital Still Camera, and it’s likely that somewhere on your hard drive, you have a folder that looks like this:

[Image: a search of my hard drive turning up a folder full of DSC-numbered files]

If I open the file using the most generic tool I have on my computer—a text editor—and look at a random spot in the file, here’s what I see:

[Image: a stretch of unreadable binary characters in the text editor]

This isn’t very informative. Note that I’ve already skipped a number of steps here—I could view this as a series of ones and zeros; and it might not even be a file that works on my Mac OS computer. But the file can show us more. At the top, there’s some additional data I can read:

[Image: readable header text near the top of the file, mentioning a Canon PowerShot]

From this, I can conjecture that the file is, in fact, an image—since it appears to have been taken by a Canon PowerShot. Knowing this, I can tell my computer it is a JPEG file (which is the most common format for images) and try to get a preview of it.
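Here's a rough sketch of that kind of guesswork in code. The filename is a stand-in for our mystery file, but the JPEG magic number and the Exif and maker strings it looks for are real.

```python
# Inferring context from nothing but the raw bytes of an unnamed file.
with open("mystery_file", "rb") as f:           # "mystery_file" is a placeholder name
    header = f.read(64 * 1024)                  # the clues live near the start

if header.startswith(b"\xff\xd8\xff"):          # every JPEG begins with these bytes
    print("Probably a JPEG image")
    if b"Exif" in header:                       # an Exif metadata block is present
        print("It carries Exif metadata")
    if b"Canon" in header:                      # the maker string the camera wrote
        print("Shot on a Canon camera")
else:
    print("Not a JPEG; context still missing")
```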

[Image: a preview of the file, revealing that it's a photograph]

As it turns out, it’s a picture. I don’t know what, or where, or when, but it’s a picture. And now I can find more context. In this case, it’s not just a picture—it’s one I have published on Flickr. And Flickr has a considerable amount of metadata about the picture in question.

[Screenshot: the photo's page on Flickr, from a set titled "Israel 2009"]

That includes who took it (me.)

[Screenshot: the photo in my Flickr photostream]

It also includes what rights I claim over it (a Creative Commons Attribution/Share-Alike license.)

[Screenshot: the Creative Commons Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0) license]

What’s more, the picture appears in a photostream, alongside other pictures from the same trip. It has metadata about date and time (and were it taken with a smartphone, it would have location information too.)

[Screenshot: the photo alongside other pictures from the same trip in the "Israel 2009" set]

Note that we’re moving beyond what’s in the file itself—I can see information about how many people have seen it as well, which is strictly Flickr’s context, and not mine.

[Screenshot: Flickr's view counts and favorites for the photo]

Some of those people have commented, and even added it to their own lists.

[Screenshot: comments on the photo and the lists it has been added to]

Flickr also extracts the information the camera includes in the file, and mines it for their own purposes. Here’s what it knows about my picture:

Dates
  • Taken on December 2, 2009 at 12.45PM EDT (edit)
  • Posted to Flickr December 4, 2009 at 4.42PM EDT (edit)
Exif data
  • Camera Canon PowerShot SX10 IS
  • Exposure 0.001 sec (1/800)
  • Aperture f/5.0
  • Focal Length 53.8 mm
  • ISO Speed 80
  • Exposure Bias 0 EV
  • Flash Off, Did not fire
  • Orientation Horizontal (normal)
  • X-Resolution 72 dpi
  • Y-Resolution 72 dpi
  • Software QuickTime 7.6.3
  • Date and Time (Modified) 2009:12:04 10:36:54
  • Host Computer Mac OS X 10.6.1
  • YCbCr Positioning Centered
  • Date and Time (Original) 2009:12:02 12:45:07
  • Date and Time (Digitized) 2009:12:02 12:45:07
  • Max Aperture Value 5.0
  • Metering Mode Multi-segment
  • Color Space sRGB
  • Sensing Method One-chip color area
  • Compression JPEG (old-style)

This, in turn, helps them publish research on things like what kind of camera is most popular right now.

[Image: Flickr's rankings of the most popular cameras in its community]

The journey my photo has taken, from the raw ones and zeros of the data itself through the layers of context it has gained as it is labeled and moved into a shared environment, is what causes "friction" when I remove it. Were I to leave Flickr and switch to another photo sharing site, that picture would lose much of its appeal. It wouldn't be considered popular or "interesting," a rating Flickr assigns to photographs that people like. I wouldn't have a comment thread around it, or know who'd seen it. The picture would lose context.

Were I to repatriate the image to my hard drive, and then strip away filenames and metadata, I’d be making it increasingly less useful. And this is the resistance Dave’s talking about when we tease apart data that is centralized.

Consider, for example, a company whose sales pipeline resides in Salesforce.com. There, the data is wrapped in context—the software that’s used to edit and analyze it; a history of which employees have accessed which customers; which prospects are neglected; and so on.

If that company decides to leave Salesforce, they're welcome to extract their data. But they'll get it in the form of raw, comma-separated values (.csv). Much of the context is gone. The data, without the context, is far less useful. That's an important lesson: software is context. And removing context is, from the point of view of information, like fighting gravity.

There are things we can do to mitigate this removal of context. The smarter computers are at inferring context, the more they remove friction. Consider, for example, what happens when I ask Google’s image search what it thinks of my photograph:

[Screenshot: Google's reverse image search results for the photo]

Not only will Google accurately identify the image, and provide other images and an abundance of additional information, it will even show me where my particular image has been used on the web:

[Screenshot: matching images Google found elsewhere on the web]

With this data, I could (if I were so inclined) contact users and try to enforce my claim to the rights of the image.

So a theory of Data Gravity needs to consider several things:

  • Context comes from linking two pieces of data (such as the image contents, and the fact that it’s an image) together.
  • The more context we have, the more we turn raw bits into usable knowledge.
  • Often, context comes from somewhere else. On my computer, the fact that an image is an image is stored in the file system, not the file itself. With cloud computing, that file is likely to be far away.
  • As data is manipulated by software, it generates more data, which is a form of context. When the data is in a public place, it gathers more context as it interacts with other data. It “gathers moss.”
  • Data that is centralized can be compared, annotated, and tracked as others use it.
  • As software gets smarter it can often infer context usefully.

All of these observations affect the tendency of data to be centralized (for cost, efficiency, proximity to other sources of context, and utility) or the ease with which it can be moved around and repurposed.

This is important for sovereignty, because it means that countries might need to legislate against such agglomeration. It also conveys a strong first-mover advantage, similar to the network effects described in Metcalfe's Law (the more nodes there are of something, the more useful it becomes.) Amazon's S3 storage service, for example, has a huge lead over other providers; indeed, the reason Amazon's East Coast data center is so favored—despite chronic overcrowding—is simply that that's where everyone else is.

What Dave’s latest work does is incorporate these ideas into a workable explanation of Data Gravity. That name is sticky—unusually good branding, and a term that’s spread around the cloud community like wildfire. But it’s also a misleading term. Behind it all is the notion that data which is near other data is more useful, and the tendency of data to cling together comes from the usefulness of the resulting knowledge.

Welcome

Welcome to DataGravity.org.

Here you will find posts exploring Data Gravity and related Data Physics ideas, concepts, equations, formulas, and theories.

There are forums located here

The Twitter Handle @DataGravity and the Hashtag #datagravity are great places for discussion as well.

Please join the community and participate in the exploration and evolution of Data Gravity and Data Physics.

-Dave McCrory

A Formula for Data Gravity

Background

Before creating DataGravity.org, I first blogged about Data Gravity on my personal blog in December of 2010, and several times since then. I have watched the concept of Data Gravity grow beyond anything I ever expected. I have also watched as a startup company decided to name itself DataGravity. As I began to speak about Data Gravity to others and answer questions, I realized that maybe it was something more than simply a novel concept describing an effect. This began my quest for a formula that allows Data Gravity to be calculated.

The Search

I started out by doing what everyone does: I Googled "Gravity Formula" and "Data Gravity Formula," and something caught my eye. The first hit for "Data Gravity Formula" returned the Gravity model of trade on Wikipedia, which I found fascinating. It turns out that gravity formulas and models are used in many different industries to predict all sorts of different things, including the favorability of trade agreements (which is what the Gravity model of trade is all about). I then began trying to learn more about the properties of gravity (the physics kind) and vetting different thoughts and ideas with people over Skype, at conferences, and on Twitter. There is a long list of people who have contributed to the evolution of both my thinking and this formula. At the bottom of this post is a list of people who helped me along the way, not that this journey is complete yet, as I believe there is a long way to go.

The first thing I learned was that in order to have Gravity, you must calculate Mass. While this is trivial physics, applying it to an abstract concept is a bit more difficult. After a great deal of time and many versions, I have a current Mass formula for Data and a Mass formula for Applications (either or both of these could change at some point). Originally, Volume was treated as the actual volume of the Data or the size of the Application, which continues to be the thinking. Density, however, is an entirely different story. Density was originally going to be calculated as the number of requests per second against the Data; I arrived at this by looking at the aforementioned Gravity model of trade. This changed several times, but ultimately I settled on Density being the compression (or the entropy) of the Data. This is closer to the original thinking of the Data having different value, though compression certainly doesn't equate to value.

After settling on the calculation for Data Mass, I turned my sights on calculating Data Gravity itself, and began going down the rabbit hole of incredibly complex ways of calculating reads, writes, and many other aspects, models, and variables. I realized that this was getting too complex and too difficult to measure, comprehend, and calculate, so I threw it out and started over. This led to a change of approach, and it is ultimately how I ended up with the current formulas. I will write more about the additional discoveries I made along the way in future posts. Now, on to Data Gravity!

A Formula for Data Gravity (possibly)

First a few caveats:

  1. This formula needs PLENTY of REAL DATA run through it. I don't have access to enough data to begin to say it is validated; this is where I need the community's help.
  2. This formula COULD, AND LIKELY WILL, CHANGE. It is the best working formula I have, and I am hoping that the community helps improve it (or validate it).
  3. My hope is that additional formulas and changes to this formula will increase its accuracy and utility, making it and the accompanying models more and more valuable.

Calculating Data Mass

The formula for Mass in Physics is:

Mass = Volume times Density    or    M = V * D

Data Mass REQUIRES that the data be attached to a Network

A Network in this definition can be a SATA Interface on a PCI bus at a micro-scale or your Facebook Data being accessible over the Internet at a macro-scale.

Data Mass variables are defined as follows:

Volume = Total Size of the Data set measured in Megabytes

Density = Compression Ratio of the Data (Unless the Data is compressed, this will usually be 1)

An Example:

If a Database is 5GB in size, has compression turned on, and the compression ratio is 2:1, Data Mass would be calculated as follows.

Mass = 5,000MB * 2

So the Data Mass of the Database is 10,000 Megabytes
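A minimal sketch of that calculation in code (the function and argument names are mine, not part of the formula):

```python
def data_mass(volume_mb: float, compression_ratio: float = 1.0) -> float:
    """Data Mass = Volume (in MB) * Density (the compression ratio of the Data)."""
    return volume_mb * compression_ratio

# The 5GB database above, stored with 2:1 compression.
print(data_mass(volume_mb=5_000, compression_ratio=2))  # 10,000 MB of Data Mass
```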

Data Mass is easy to calculate, but beyond knowing how much Data you have stored, how useful is it? By itself, not very; but in the presence of the Network and an Application, things get more interesting.

Calculating Application Mass

Application Mass is a bit more difficult to calculate (and will likely change after input from the community).
Currently, Application Mass is calculated by first calculating Volume, then calculating Density, and finally multiplying the two together.

Application Volume is calculated as follows:
Application Volume = Amount of Memory Used added to the Amount of Disk Space Used in Megabytes
Application Volume = (Memory Used + Disk Used)

Application Density is calculated as follows:
Application Density = The Compression Ratio of the Memory Used (usually 1) added to the Compression Ratio of the Disk Space Used (usually 1) added to the Total CPU Utilization in GHz (across all cores)
AppDensity = (Memory Compression Ratio + Disk Compression Ratio + CPU Utilization in GHz)

Application Mass is calculated by using the results from the Application Volume and Application Density formulas above:
Application Mass = Application Volume times Application Density
AppMass = AppVolume * AppDensity
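The same calculation as a code sketch (the names and the example workload are illustrative only):

```python
def application_mass(memory_used_mb: float,
                     disk_used_mb: float,
                     memory_compression_ratio: float = 1.0,
                     disk_compression_ratio: float = 1.0,
                     cpu_utilization_ghz: float = 0.0) -> float:
    """Application Mass = Application Volume * Application Density."""
    app_volume = memory_used_mb + disk_used_mb
    app_density = memory_compression_ratio + disk_compression_ratio + cpu_utilization_ghz
    return app_volume * app_density

# A hypothetical app using 2GB of RAM, 8GB of disk, and 3.2GHz of CPU across all cores.
print(application_mass(2_000, 8_000, cpu_utilization_ghz=3.2))  # 52,000
```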


Why is Memory Volume Needed and CPU Utilization Important?
In most cases, Applications use memory for higher-performance storage (Data Gravity inception, anyone?). Measuring Memory Volume is important because in scenarios such as caching (e.g. Memcache), a great deal of requests and responses are served from an in-memory Data set. CPU Utilization must be measured because, from the viewpoint of an Application's Data, the CPU represents the Application's work on the Data (transformations against the Data set). This is common in many different types of Applications, and it forces a balance to be struck: do you move the Application to the Data, or the Data to the Application?

Calculating Data Gravity

In order for Data Gravity to exist the Data Mass and the Application Mass have to be within the same Networked Universe (basically the Data has to be reachable by the Application, otherwise there isn’t any gravity).

Now that we have two Masses (A Data Mass and an Application Mass) we can now finally calculate the Force in Megabytes per second squared that Data Gravity has between the Data and the Application. To do this, a few more variables have to be added and the information gathered:

Network Bandwidth:
This is the average useable bandwidth from the Application to the Data in Megabytes per second (Megabits must be converted to Megabytes)

Network Latency:
This is the average latency from the Application to the Data in seconds (milli, micro, and nano seconds all must be converted to seconds)

Number of Requests per second:
This is the average number of requests from the Application to the Data per second (taken over the same sample time as Latency and Bandwidth)

Average Size of Requests:
This is the average size of each request from the Application to the Data measured in Megabytes (bytes, Kilobytes, Gigabytes must be converted to Megabytes)

With the above four variables and the Data Mass and Application Mass, we can now calculate Data Gravity:
Data Gravity Force equals the product of the Data Mass, the Application Mass, and the Number of Requests per Second; this result is then divided by the square of the sum of the Network Latency and the quotient of the Average Size of Requests divided by the Network Bandwidth.

For the more mathematically inclined:

Data Gravity Force = (Data Mass * Application Mass * Requests per Second) / (Network Latency + (Average Request Size / Network Bandwidth))²

The Resulting Data Gravity Force is measured in Megabytes per second squared or (MB/s)²
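Putting it all together as a code sketch, under the reading of the formula spelled out above (treat this as something to be validated against real data, not a reference implementation):

```python
def data_gravity_force(data_mass_mb: float,
                       app_mass_mb: float,
                       requests_per_sec: float,
                       latency_sec: float,
                       avg_request_size_mb: float,
                       bandwidth_mb_per_sec: float) -> float:
    """Data Gravity Force = (Data Mass * App Mass * Requests per Second)
    divided by the square of (Latency + Average Request Size / Bandwidth)."""
    distance = latency_sec + (avg_request_size_mb / bandwidth_mb_per_sec)
    return (data_mass_mb * app_mass_mb * requests_per_sec) / distance ** 2

# The Data Mass and Application Mass from the earlier examples, over a
# hypothetical 1ms, 125MB/s link serving 500 requests per second of 0.1MB each.
print(data_gravity_force(10_000, 52_000, 500,
                         latency_sec=0.001,
                         avg_request_size_mb=0.1,
                         bandwidth_mb_per_sec=125))
```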

What can be done with Data Gravity Force?
(Please note that these are IDEAS and are therefore SPECULATIVE until any of this is proven)

Depending on your needs and goals, you may want to embrace Data Gravity or you may want to resist it.
Data Gravity (potentially) has many applications; a few are listed below:

Reasons to move toward a source of Data Gravity (Increase Data Gravity)

  • You need Lower Latency for your Application or Service
  • You need Higher Bandwidth for your Application or Service
  • You want to generate more Data Mass more quickly
  • You are doing HPC
  • You are doing Hadoop or Realtime Processing

Reasons to resist or move away from a source of Data Gravity (Decrease Data Gravity)

  • You want to avoid lock-in / keep escape velocity low
  • Application Portability
  • Resiliency to Increases in Latency or Decreases in Bandwidth (aka Availability)

Data Gravity and Data Mass may have other uses as well:

  • Making decisions about movement or location between two Data Masses
  • Projecting Growth of Data Mass
  • Projecting Increases of Data Gravity (Which could signal all sorts of things)

Interesting things occur when you overwhelm the network itself. This can happen by exceeding the bandwidth of the network, or by needing lower latency or higher bandwidth than the network currently attached to the data can offer. This may drive you to optimize different things in different ways. Caching is a great example of manipulating the request/response stream by creating a temporary or finite secondary Data Mass (the Cache) to increase the cumulative Data Gravity while decreasing the Data Gravity of the primary Data Mass (the source Data Mass); see the sketch below. Replication is another strategy that manipulates Data Mass and Gravity (and can also be modeled).
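To illustrate the caching point with the same hedged sketch from above: a small cache sitting much closer to the Application (lower latency, higher bandwidth) can exert a far larger pull than the source Data Mass, even though it holds only a sliver of the Data. All of the numbers below are invented.

```python
# Assumes the data_gravity_force() sketch defined earlier.
source = data_gravity_force(10_000, 52_000, 500,
                            latency_sec=0.050,       # the source is 50ms away
                            avg_request_size_mb=0.1,
                            bandwidth_mb_per_sec=125)

cache = data_gravity_force(500, 52_000, 500,         # the cache holds only 500MB of hot data
                           latency_sec=0.0005,       # but sits half a millisecond away
                           avg_request_size_mb=0.1,
                           bandwidth_mb_per_sec=1_250)

print(cache > source)  # True: the nearby cache dominates the gravitational pull
```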

Please share your thoughts in the comments below as to other uses of Data Gravity, Data Mass, Application Gravity, and Application Mass.

There are many other things that Data Gravity might be used for, so I'm looking for ideas and participation from anyone in the community. In future posts, I hope to cover usage, real-world scenarios, different network configurations, and many other topics, with select guest posts as well.

Acknowledgements

and many others that I have forgotten to mention. Also note that out of the 8 people above, half of them are named James!

-Dave McCrory
