A Formula for Data Gravity


Before creating DataGravity.org, I first blogged about Data Gravity on my personal blog in December of 2010, and I have written about it several times since then.  I have watched the concept of Data Gravity grow beyond anything that I ever expected.  I have also watched as a startup company decided to name itself DataGravity.  As I began to speak about Data Gravity to others and answer questions, I realized that maybe it was something more than simply a novel concept describing an effect.  This began my quest for a formula that allows Data Gravity to be calculated.

The Search

I started out by doing what everyone does: I Googled “Gravity Formula” and “Data Gravity Formula”. Something caught my eye: the first hit for “Data Gravity Formula” was the Gravity model of trade on Wikipedia, which I found fascinating. It turns out that Gravity Formulas and Models are used in many different industries to predict all sorts of different things, including the favorability of trade agreements (which is what the Gravity model of trade is all about).  I then began trying to learn more about the properties of Gravity (the Physics kind) and vetting different thoughts and ideas with people over Skype, at conferences, and on Twitter.  There is a long list of people who have contributed to the evolution of both my thinking and this formula.  At the bottom of this post is a list of people who helped me along the way. Not that this journey is complete yet; I believe there is a long way to go.

The first thing that I learned was that in order to have Gravity, you must calculate Mass. While this is trivial in Physics, applying it to an abstract concept is a bit more difficult. After a great deal of time and many versions, I have a current Mass formula for Data and a Mass formula for Applications (either or both of these could change at some point). Originally, I treated Volume as the actual volume of the Data or the Size of the Application, and that continues to be my thinking. Density, however, is an entirely different story. Density was originally going to be calculated as the number of Requests per second against the Data. I arrived at this by looking at the aforementioned Gravity model of trade. This changed several times, but ultimately I settled on Density being the compression (or the Entropy) of the Data. This is closer to the original thinking of Data having different value, although compression certainly doesn’t equate to value.

After settling on the calculation for Data Mass, I turned my sights on calculating Data Gravity itself and began going down the rabbit hole of incredibly complex ways of calculating reads and writes and many other aspects, models, and variables. I realized that this was becoming too complex to measure, comprehend, and calculate, so I threw it out and started over. This led to a change in approach and is ultimately how I ended up with the current formulas. I will write more about the additional discoveries that I made along the way in future posts. Now, on to Data Gravity!

A Formula for Data Gravity (possibly)

First a few caveats:

  1. This formula needs PLENTY of REAL DATA run through it.  I don’t have access to enough data to even begin to say it is validated. This is where I need the community’s help.
  2. This formula COULD OR LIKELY WILL CHANGE. It is the best working formula I have, and I am hoping that the community helps improve it (or validate it).
  3. My hope is that additional formulas and changes to this formula will increase the accuracy and utility of the formula and its models, making them more and more valuable.

Calculating Data Mass

The formula for Mass in Physics is:

Mass = Volume times Density    or    M = V * D

Data Mass REQUIRES that the data be attached to a Network

A Network in this definition can be a SATA Interface on a PCI bus at a micro-scale or your Facebook Data being accessible over the Internet at a macro-scale.

Data Mass variables are defined as follows:

Volume = Total Size of the Data set measured in Megabytes

Density = Compression Ratio of the Data (Unless the Data is compressed, this will usually be 1)

An Example:

If a Database is 5GB in size and has compression turned on and assuming the compression ratio is 2:1, Data Mass would be calculated as follows.

Mass = 5,000MB * 2

So the Data Mass of the Database is 10,000 Megabytes
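
The worked example above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `data_mass` and its parameter names are my own illustrative choices, and it assumes 1GB = 1,000MB, as the example does.

```python
def data_mass(volume_mb, compression_ratio=1.0):
    """Data Mass = Volume (in Megabytes) * Density (compression ratio).

    A compression ratio of 2:1 is passed as 2; uncompressed Data uses 1.
    """
    if volume_mb < 0 or compression_ratio <= 0:
        raise ValueError("volume must be >= 0 and compression ratio must be > 0")
    return volume_mb * compression_ratio


# The 5GB database with a 2:1 compression ratio from the example above.
database_mass = data_mass(volume_mb=5_000, compression_ratio=2)
print(database_mass)  # 10000
```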

Data Mass is easy to calculate, but beyond knowing how much Data you have stored, how useful is it? By itself, Data Mass is not very useful, but in the presence of the Network and an Application, things get more interesting.

Calculating Application Mass

Application Mass is a bit more difficult to calculate (and may well change after input from the community).
Currently, Application Mass is calculated by first calculating Volume, then calculating Density, and finally multiplying the Volume and Density together.

Application Volume is calculated as follows:
Application Volume = Amount of Memory Used added to the Amount of Disk Space Used in Megabytes
Application Volume = (Memory Used + Disk Used)

Application Density is calculated as follows:
Application Density = The Compression Ratio of the Memory Used (usually 1) added to the Compression Ratio of the Disk Space Used (usually 1) added to the Total Amount of CPU Utilization in GHz (across all cores)
Application Density = (Memory Compression Ratio + Disk Compression Ratio + CPU Utilization)

Application Mass is calculated by using the results from the Application Volume and Application Density formulas above:
Application Mass = Application Volume times Application Density
AppMass = AppVolume * AppDensity
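
The three Application formulas above can be sketched as follows. The function and parameter names are hypothetical, and the sample numbers (2,000MB of memory, 8,000MB of disk, 4.5GHz of CPU in use) are made up purely for illustration.

```python
def application_volume(memory_used_mb, disk_used_mb):
    """Application Volume = Memory Used + Disk Used, both in Megabytes."""
    return memory_used_mb + disk_used_mb


def application_density(memory_ratio=1.0, disk_ratio=1.0, cpu_ghz=0.0):
    """Application Density = memory compression ratio
    + disk compression ratio + CPU utilization in GHz (across all cores)."""
    return memory_ratio + disk_ratio + cpu_ghz


def application_mass(memory_used_mb, disk_used_mb,
                     memory_ratio=1.0, disk_ratio=1.0, cpu_ghz=0.0):
    """Application Mass = Application Volume * Application Density."""
    volume = application_volume(memory_used_mb, disk_used_mb)
    density = application_density(memory_ratio, disk_ratio, cpu_ghz)
    return volume * density


# Hypothetical application: no compression, 4.5 GHz of CPU in use.
# Volume = 2,000 + 8,000 = 10,000 and Density = 1 + 1 + 4.5 = 6.5
app_mass = application_mass(memory_used_mb=2_000, disk_used_mb=8_000, cpu_ghz=4.5)
print(app_mass)  # 65000.0
```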

Why is Memory Volume Needed and CPU Utilization Important?
In most cases, Applications use memory for higher performance storage (Data Gravity Inception, anyone?). Measuring Memory Volume is important because in scenarios such as caching (e.g. Memcached), a great many requests/responses are served against an in-memory Data set. CPU Utilization must be measured because, from the viewpoint of an Application’s Data, the CPU represents the Application’s work on the Data (Transformations against the Data set). This is common in many different types of Applications and forces a choice: do you move the Application to the Data, or the Data to the Application?

Calculating Data Gravity

In order for Data Gravity to exist the Data Mass and the Application Mass have to be within the same Networked Universe (basically the Data has to be reachable by the Application, otherwise there isn’t any gravity).

Now that we have two Masses (A Data Mass and an Application Mass) we can now finally calculate the Force in Megabytes per second squared that Data Gravity has between the Data and the Application. To do this, a few more variables have to be added and the information gathered:

Network Bandwidth:
This is the average useable bandwidth from the Application to the Data in Megabytes per second (Megabits must be converted to Megabytes)

Network Latency:
This is the average latency from the Application to the Data in seconds (milli, micro, and nano seconds all must be converted to seconds)

Number of Requests per second:
This is the average number of requests from the Application to the Data per second (taken over the same sample time as Latency and Bandwidth)

Average Size of Requests:
This is the average size of each request from the Application to the Data measured in Megabytes (bytes, Kilobytes, Gigabytes must be converted to Megabytes)
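
Since all four variables must be expressed in Megabytes and seconds, a few small conversion helpers can be useful before plugging measurements into the formula. The sketch below uses hypothetical function names and assumes decimal units (1 Megabyte = 1,000,000 bytes, 8 Megabits = 1 Megabyte), in line with the 5GB = 5,000MB example earlier.

```python
def megabits_to_megabytes(megabits):
    """Convert Megabits (or Megabits per second) to Megabytes (per second)."""
    return megabits / 8


def milliseconds_to_seconds(milliseconds):
    """Convert a latency measured in milliseconds to seconds."""
    return milliseconds / 1_000


def bytes_to_megabytes(num_bytes):
    """Convert a request size in bytes to Megabytes (decimal units assumed)."""
    return num_bytes / 1_000_000


# Hypothetical measurements: a 1,000 Megabit/s link, 2 ms latency,
# and an average request of 16,000 bytes.
bandwidth_mb_s = megabits_to_megabytes(1_000)   # 125.0 Megabytes per second
latency_s = milliseconds_to_seconds(2)          # 0.002 seconds
avg_request_mb = bytes_to_megabytes(16_000)     # 0.016 Megabytes
```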

With the above four variables and the Data Mass and Application Mass, we can now calculate Data Gravity:
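Data Gravity Force equals the product of Data Mass multiplied by Application Mass, then multiplied by the Number of Requests per Second. This result is then divided by the square of the sum of the Network Latency and the quotient of the Average Size of Requests divided by the Network Bandwidth.

The same thing as a formula:

Data Gravity Force = (Data Mass * Application Mass * Number of Requests per second) / (Network Latency + (Average Size of Requests / Network Bandwidth))²

And for the more Mathematically inclined:

F = (Md * Ma * N) / (L + (S / B))²

where Md is the Data Mass, Ma is the Application Mass, N is the Number of Requests per second, L is the Network Latency, S is the Average Size of Requests, and B is the Network Bandwidth.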

The Resulting Data Gravity Force is measured in Megabytes per second squared or (MB/s)²
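
The calculation described above can be sketched in Python. The function name and parameter names are hypothetical, and the sample values are chosen only to keep the arithmetic easy to follow; they are not real measurements. The sketch reads the denominator as the square of (latency + request size / bandwidth).

```python
def data_gravity_force(data_mass, app_mass, requests_per_second,
                       latency_s, avg_request_mb, bandwidth_mb_s):
    """Data Gravity Force =
    (Data Mass * Application Mass * Requests per second)
    / (Latency + Average Request Size / Bandwidth) ** 2
    """
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth must be greater than zero")
    # Time to move one average request across the network, in seconds.
    transfer_time_s = avg_request_mb / bandwidth_mb_s
    total_time_s = latency_s + transfer_time_s
    if total_time_s <= 0:
        raise ValueError("latency plus transfer time must be greater than zero")
    return (data_mass * app_mass * requests_per_second) / total_time_s ** 2


# Made-up inputs: Data Mass 10,000, Application Mass 1,000, 100 requests/s,
# 0.25 s latency, 0.25 MB average request, 1 MB/s usable bandwidth.
force = data_gravity_force(
    data_mass=10_000,
    app_mass=1_000,
    requests_per_second=100,
    latency_s=0.25,
    avg_request_mb=0.25,
    bandwidth_mb_s=1.0,
)
print(force)
```

With these inputs the denominator is (0.25 + 0.25)² = 0.25. Note how lowering the latency or raising the bandwidth shrinks the denominator and so increases the force.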

What can be done with Data Gravity Force?
(Please note that these are IDEAS and are therefore SPECULATIVE until any of this is proven)

Depending on your needs/goals, you may want to embrace Data Gravity or you may want to resist Data Gravity.
Data Gravity (potentially) has many applications, a few are listed below:

Reasons to move toward a source of Data Gravity (Increase Data Gravity)

  • You need Lower Latency for your Application or Service
  • You need Higher Bandwidth for your Application or Service
  • You want to generate more Data Mass more quickly
  • You are doing HPC
  • You are doing Hadoop or Realtime Processing

Reasons to resist or move away from a source of Data Gravity (Decrease Data Gravity)

  • You want to avoid lock-in / keep escape velocity low
  • Application Portability
  • Resiliency to Increases in Latency or Decreases in Bandwidth (aka Availability)

Data Gravity and Data Mass may have other uses as well:

  • Making decisions of movement or location between two Data Masses
  • Projecting Growth of Data Mass
  • Projecting Increases of Data Gravity (Which could signal all sorts of things)
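
As a purely speculative illustration of choosing between two Data Masses, the sketch below computes the Data Gravity Force between one Application and two candidate Data Masses and reports the stronger pull. The candidate names, the Application Mass, and every number here are invented for the example; whether the larger force is the "right" choice depends on whether you want to embrace or resist Data Gravity.

```python
def data_gravity_force(data_mass, app_mass, requests_per_second,
                       latency_s, avg_request_mb, bandwidth_mb_s):
    """Force between one Data Mass and one Application Mass."""
    total_time_s = latency_s + avg_request_mb / bandwidth_mb_s
    return (data_mass * app_mass * requests_per_second) / total_time_s ** 2


APP_MASS = 1_000  # hypothetical Application Mass

# Two hypothetical Data Masses reachable from the same Application.
candidates = {
    "local cluster": dict(data_mass=10_000, requests_per_second=100,
                          latency_s=0.25, avg_request_mb=0.25,
                          bandwidth_mb_s=1.0),
    "remote archive": dict(data_mass=20_000, requests_per_second=100,
                           latency_s=0.75, avg_request_mb=0.25,
                           bandwidth_mb_s=1.0),
}

# Compute the force toward each candidate and pick the strongest pull.
forces = {name: data_gravity_force(app_mass=APP_MASS, **params)
          for name, params in candidates.items()}
strongest = max(forces, key=forces.get)
print(strongest, forces[strongest])
```

In this made-up case the smaller but closer Data Mass exerts the stronger pull, because latency enters the formula squared.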

Interesting things occur when you overwhelm the network itself. This can happen when you exceed the bandwidth of the network, or when you need lower latency or higher bandwidth than the network currently attached to the Data can offer. This may drive you to optimize different things in different ways. Caching is a great example: it manipulates the request/response stream by creating a temporary or finite amount of secondary Data Mass (the Cache) to increase the Cumulative Data Gravity, while decreasing the Data Gravity of the primary Data Mass (the Source Data Mass). Replication is another strategy that manipulates Data Mass and Gravity (and it can also be modeled).

Please share your thoughts in the comments below as to other uses of Data Gravity, Data Mass, Application Gravity, and Application Mass.

There are many other things that Data Gravity might be used for, so I’m looking for ideas from the community and would like participation from anyone with ideas. Future posts will hopefully cover use cases, real-world scenarios, different network configurations, and many other topics, with select guest posts as well.


and many others that I have forgotten to mention. Also note that out of the 8 people above, half of them are named James!

-Dave McCrory

Posted on June 26, 2012, in Data Gravity. 21 Comments.

  1. Nice work. I wrote a quick little PowerShell script that can do the calculations for you if you have the values. I figured it might be useful for generating “real-time” data gravity calculations if you can gather the required values dynamically.

    # Read-Host returns strings, so cast each input to a number first.
    [double]$Md = Read-Host "Enter Data Mass (Md) (in Megabytes)"
    [double]$Ma = Read-Host "Enter Application Mass (Ma) (in Megabytes)"
    [double]$n = Read-Host "Enter Requests per Second"
    [double]$l = Read-Host "Enter Latency between Md and Ma (in seconds)"
    [double]$r = Read-Host "Enter Average Request Size (in Megabytes)"
    [double]$b = Read-Host "Enter Bandwidth (in Megabytes per second)"

    # Data Gravity = (Md * Ma * n) / (l + r/b)^2
    $DataGravity = ($Md * $Ma * $n) / [System.Math]::Pow($l + ($r / $b), 2)
    Write-Output $DataGravity

  2. Can the theory of data gravity be expanded to integrate the idea of human attention as an ultimate currency? It seems attention is central to self definition and self determination and is therefore relevant to lock-in. It seems that under sponsorship we have a system where attention is arrested, where time and attention are literally stolen. Sponsorship, which acts as a kind of censorship, leads to lobbying and other filters in a “medium as message” sense, which in turn blocks democratic representation. There is a cost per second of attention. It’s (total seconds of attention)/(GNP)

  3. I’ve been following this data gravity concept for a while now.. and I am thinking that there might still be some factors needed in the equation.
    As a matter of fact, I am building an application that automatically detects and visualizes the data gravity in a given network for my undergraduate study..

    If you are interested in my progress, just contact me… tnx tnx

  4. “Also note that out of the 8 people above, half of them are named James!”
    – James gravity? 🙂

  5. Good day dave, I have sent you a copy of my paper. . .Remember me? Now I will start the actual work on the project. . It would be nice if you could give me some advice and comments. . thanks. .

  6. Good day dave, I have a question…
    Why do you have to multiply the application mass and data mass by ‘number of request per second’?
    according to the general formula for gravity, you only need to multiply the two masses involved.. thanks. . . .

    • Because the requests per second represent the attraction to the Data itself. If I need the Data from a mass, I must request it, each request represents need or an amount of gravity.

      • I see your point… so the average request size, bandwidth, and latency would be the coming from all the instance of an application, right?
        because, each request may come from different locations but uses the same application… Therefore, the ave request size, bandwidth and latency should be the average of all values taken from each of those individual requests from different locations….

        • Yes, that’s correct, assuming that the application isn’t the end user’s desktop. If it is an end user’s desktop, I would recommend measuring the aggregate requests and response time at the webserver (just to make it easier).

          • Thanks..! Okay, so let’s just say that the application is a web application. What would you recommend to dynamically get the values for bandwidth, average request size and latency? I have tried some methods like using javascript files to get the results but I would like to know if you have better ideas on how to do that.

            • A proxy and/or a cache with counters, or use an Apache Mod and logs. Those would be two decent routes IMO.

              • hello again dave, I would like to ask something about the average request size..
                for example, I have an application connecting to a database server that uses mysql, is the average request size the size of mysql packet? thanks

                • No, it would be the size in bytes of the SQL Query and the response which would be the response to the SQL Query. You can try just averaging the queries or adding the queries and responses together, then averaging this size over time per second to get bytes per second.

                  Hopefully that helps. There is a deeper view of Data Gravity and its relationships that I intend to publish soon, FYI.

  7. Amazing work and kudos for the focus and drill down. This is something that I have been researching from another angle around the social nets for years. I will be following so keep us in the loop. This is a very important concept that will change the game on many things.

  8. Fascinating Dave! Definitely related to my/Gartner’s research on #infonomics. Let’s connect. DM me: @Doug_Laney. May invite you to share your ideas with our entire IM research team. Note, I added your blog and your Value of Information video links to the http://en.wikipedia.org/wiki/Infonomics page I curate. –Doug Laney, VP Research, Gartner, @doug_laney

  9. Can you tell me more about the difficulty you had calculating the number of reads on a database? (Where you wrote, “down the rabbit hole of incredibly complex ways of calculating reads and writes and many other aspects, models, and variable”)

  10. John O'Gorman

    Hi there – very interesting work here. I was wondering if, as in the general theory of relativity, an acceleration component is required. We have gravity because the universe is not only expanding but it is expanding at an ever-increasing rate. If acceleration approaches zero – even at a very high rate of speed – my understanding is that gravity also approaches zero.

    Just thinking out loud…

  11. Hi Dave,

    Very interesting thought process here. I would like to point out that it appears your units are off, at MB^2/s^3 rather than MB/s^2. I think what you have here is the beginning of a formula for Power rather than gravity. Power is defined as Force*distance/time–or better yet the Work which can be performed in a time frame–which may be a better proxy for your reasoning.

    Not sure what would be analogous to distance in data processing though…maybe there is no need.

    Not saying I don’t agree with your thoughts on data mass’s tendency to collect in one place and resistance to being divided! I’m just trying to apply my days as a physicist to what you have here 🙂 The applications remain the same

    Would love to discuss further to flesh out my understanding

