A Formula for Data Gravity
Before creating DataGravity.org, I first blogged about Data Gravity on my personal blog in December of 2010, and I have written about it several times since then. I have watched the concept of Data Gravity grow beyond anything that I ever expected. I have also watched as a startup company decided to name itself DataGravity. As I began to speak about Data Gravity to others and answer questions, I realized that maybe it was something more than simply a novel concept describing an effect. This began my quest for a formula that allows Data Gravity to be calculated.
I started out by doing what everyone does: I Googled Gravity Formula and I Googled Data Gravity Formula. Something caught my eye. The first hit for Data Gravity Formula was the Gravity model of trade on Wikipedia, which I found fascinating. It turns out that Gravity Formulas and Models are used in many different industries to predict all sorts of different things, including the favorability of trade agreements (which is what the Gravity model of trade is all about). I then began trying to learn more about the properties of Gravity (the Physics kind) and vetting different thoughts and ideas with people over Skype, at conferences, and on Twitter. There is a long list of people who have contributed to the evolution of both my thinking and this formula. At the bottom of this post is a list of people who helped me along the way. This journey is not complete yet, as I believe there is a long way to go.
The first thing that I learned was that in order to have Gravity, you must calculate Mass. While this is trivial Physics, applying it to an abstract concept is a bit more difficult. After a great deal of time and many versions, I have a current Mass formula for Data and a Mass formula for Applications (either or both of these could change at some point). Originally, the effort treated Volume as the actual volume of the Data or the Size of the Application, and that continues to be the thinking. However, Density is an entirely different story. Density was originally going to be calculated as the number of Requests per second against the Data. I arrived at this by looking at the aforementioned Gravity model of trade. This changed several times, but ultimately I settled on Density being the compression (or the Entropy) of the Data. This is closer to the original thinking of the Data having different value, but compression certainly doesn’t equate to value.
After settling on the calculation for Data Mass, I turned my sights on calculating Data Gravity itself and began going down the rabbit hole of incredibly complex ways of calculating reads and writes and many other aspects, models, and variables. I realized that this was getting incredibly complex and difficult to measure, comprehend, and calculate, so I threw it out and started over. This led to a change of approach, which is ultimately how I ended up with the current formulas. I will write more about the additional discoveries that I made along the way in future posts. Now on to Data Gravity!
A Formula for Data Gravity (possibly)
First a few caveats:
- This formula needs PLENTY of REAL DATA run through it. I don’t have access to enough data to begin to say it is validated. This is where I need the community’s help.
- This formula COULD OR LIKELY WILL CHANGE. It is the best working formula I have, and I am hoping that the community helps improve it (or validate it).
- My hope is that additional formulas and changes to this formula will increase the accuracy and utility of this formula and models to make them more and more valuable.
Calculating Data Mass
The formula for Mass in Physics is:
Mass = Volume times Density or M = V * D
Data Mass REQUIRES that the data be attached to a Network
A Network in this definition can be a SATA Interface on a PCI bus at a micro-scale or your Facebook Data being accessible over the Internet at a macro-scale.
Data Mass variables are defined as follows:
Volume = Total Size of the Data set measured in Megabytes
Density = Compression Ratio of the Data (Unless the Data is compressed, this will usually be 1)
If a Database is 5GB in size and has compression turned on and assuming the compression ratio is 2:1, Data Mass would be calculated as follows.
Mass = 5,000MB * 2
So the Data Mass of the Database is 10,000 Megabytes
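The Data Mass calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `data_mass` and its parameter names are assumptions made for the example, and the default Density of 1 reflects the "usually 1" note for uncompressed Data.

```python
def data_mass(volume_mb: float, compression_ratio: float = 1.0) -> float:
    """Data Mass = Volume (in Megabytes) * Density (compression ratio).

    Hypothetical helper: the name and signature are not from the post.
    """
    if volume_mb < 0:
        raise ValueError("volume_mb must not be negative")
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return volume_mb * compression_ratio


# The 5GB Database from the example, compressed at 2:1.
database_mass = data_mass(volume_mb=5_000, compression_ratio=2.0)
print(database_mass)  # 10000.0
```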
Data Mass is easy to calculate, but beyond knowing how much Data you have stored, how useful is it? By itself, Data Mass is not very useful, but in the presence of the Network and an Application, things get more interesting.
Calculating Application Mass
Application Mass is a bit more difficult to calculate (and is likely to change after input from the community).
Currently Application Mass is calculated by first calculating Volume, then calculating Density, and finally multiplying the Volume and Density together.
Application Volume is calculated as follows:
Application Volume = Amount of Memory Used added to the Amount of Disk Space Used in Megabytes
Application Volume = (Memory Used + Disk Used)
Application Density is calculated as follows:
Application Density = The Compression Ratio of the Memory Used (usually 1) added to the Compression Ratio of the Disk Space Used (usually 1) added to the Total Amount of CPU Utilization in GHz (across all cores)
Application Density = (Memory Compression Ratio + Disk Compression Ratio + CPU Utilization)
Application Mass is calculated by using the results from the Application Volume and Application Density formulas above:
Application Mass = Application Volume times Application Density
AppMass = AppVolume * AppDensity
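As a rough sketch of the Application Mass steps above, the following Python combines Volume and Density exactly as described. The function names, parameter names, and the sample figures (2,000 MB of memory, 8,000 MB of disk, 3 GHz of CPU in use) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def app_volume(memory_mb: float, disk_mb: float) -> float:
    """Application Volume = Memory Used + Disk Used, both in Megabytes."""
    return memory_mb + disk_mb


def app_density(memory_ratio: float = 1.0,
                disk_ratio: float = 1.0,
                cpu_ghz: float = 0.0) -> float:
    """Application Density = memory compression ratio
    + disk compression ratio + total CPU utilization in GHz."""
    return memory_ratio + disk_ratio + cpu_ghz


def app_mass(memory_mb: float, disk_mb: float,
             memory_ratio: float = 1.0, disk_ratio: float = 1.0,
             cpu_ghz: float = 0.0) -> float:
    """Application Mass = Application Volume * Application Density."""
    volume = app_volume(memory_mb, disk_mb)
    density = app_density(memory_ratio, disk_ratio, cpu_ghz)
    return volume * density


# Hypothetical application: no compression, 3 GHz of CPU in use across all cores.
mass = app_mass(memory_mb=2_000, disk_mb=8_000, cpu_ghz=3.0)
print(mass)  # 50000.0 (Volume 10,000 MB * Density 5.0)
```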
Why is Memory Volume Needed and CPU Utilization Important?
In most cases, Applications use memory for higher-performance storage (Data Gravity Inception, anyone?). Measuring Memory Volume is important because in scenarios such as caching (e.g. Memcache), a great many requests and responses are served against an in-memory Data set. CPU Utilization must be measured because, from an Application’s Data viewpoint, the CPU represents the Application’s work on the Data (Transformations against the Data set). This is common in many different types of Applications and forces a balance to be struck: do you move the Application to the Data, or the Data to the Application?
Calculating Data Gravity
In order for Data Gravity to exist the Data Mass and the Application Mass have to be within the same Networked Universe (basically the Data has to be reachable by the Application, otherwise there isn’t any gravity).
Now that we have two Masses (a Data Mass and an Application Mass), we can finally calculate the Force, in Megabytes per second squared, that Data Gravity exerts between the Data and the Application. To do this, a few more variables have to be defined and measured:
Network Bandwidth:
This is the average usable bandwidth from the Application to the Data in Megabytes per second (Megabits must be converted to Megabytes)
Network Latency:
This is the average latency from the Application to the Data in seconds (milliseconds, microseconds, and nanoseconds must all be converted to seconds)
Number of Requests per second:
This is the average number of requests from the Application to the Data per second (taken over the same sample time as Latency and Bandwidth)
Average Size of Requests:
This is the average size of each request from the Application to the Data measured in Megabytes (bytes, Kilobytes, Gigabytes must be converted to Megabytes)
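The unit conversions called for above are easy to get wrong, so here is a small sketch of them in Python. The helper names are hypothetical. The Kilobyte conversion assumes decimal units (1 MB = 1,000 KB), which matches the 5GB = 5,000MB convention used in the Data Mass example.

```python
def megabits_to_megabytes(megabits: float) -> float:
    """Bandwidth is often quoted in Megabits per second; there are 8 bits per byte."""
    return megabits / 8


def milliseconds_to_seconds(milliseconds: float) -> float:
    """Latency is often measured in milliseconds."""
    return milliseconds / 1_000


def kilobytes_to_megabytes(kilobytes: float) -> float:
    """Assumes decimal units: 1 MB = 1,000 KB."""
    return kilobytes / 1_000


# Hypothetical measurements: a 1,000 Megabit/s link, 2 ms latency, 64 KB requests.
bandwidth_mb_s = megabits_to_megabytes(1_000)
latency_s = milliseconds_to_seconds(2)
avg_request_mb = kilobytes_to_megabytes(64)
print(bandwidth_mb_s, latency_s, avg_request_mb)  # 125.0 0.002 0.064
```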
With the above four variables and the Data Mass and Application Mass, we can now calculate Data Gravity:
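Data Gravity Force equals the product of Data Mass multiplied by Application Mass, then multiplied by the Number of Requests per second. This result is then divided by the square of the sum of the Network Latency and the quotient of the Average Size of Requests divided by the Network Bandwidth.
The same thing as a formula:
Data Gravity Force = (Data Mass * Application Mass * Number of Requests per second) / (Network Latency + (Average Size of Requests / Network Bandwidth))²
And for the more Mathematically inclined, where M_d is Data Mass, M_a is Application Mass, R is the Number of Requests per second, L is Network Latency, S is the Average Size of Requests, and B is Network Bandwidth:
$$F = \frac{M_d \cdot M_a \cdot R}{\left(L + \frac{S}{B}\right)^2}$$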
The Resulting Data Gravity Force is measured in Megabytes per second squared or (MB/s)²
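To make the calculation concrete, here is a minimal Python sketch of the Data Gravity Force. It reads the description above as dividing by the square of (Network Latency + Average Size of Requests / Network Bandwidth). The function name, parameter names, and all sample values are assumptions for illustration only.

```python
def data_gravity_force(data_mass: float, app_mass: float,
                       requests_per_s: float, latency_s: float,
                       avg_request_mb: float, bandwidth_mb_s: float) -> float:
    """Force = (Data Mass * App Mass * Requests per second)
    / (Latency + Avg Request Size / Bandwidth) ** 2

    Inputs are assumed to be normalized already:
    Megabytes, seconds, and Megabytes per second.
    """
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth_mb_s must be positive")
    # Latency plus the time to transfer one average request.
    time_per_request_s = latency_s + avg_request_mb / bandwidth_mb_s
    if time_per_request_s <= 0:
        raise ValueError("latency plus transfer time must be positive")
    return data_mass * app_mass * requests_per_s / time_per_request_s ** 2


# Hypothetical inputs: the 10,000 MB Data Mass from the Database example,
# an Application Mass of 50,000, and made-up network measurements.
force = data_gravity_force(
    data_mass=10_000, app_mass=50_000, requests_per_s=100,
    latency_s=0.002, avg_request_mb=0.064, bandwidth_mb_s=125.0,
)
print(f"Data Gravity Force: {force:.3e}")
```

Because the denominator is squared, small changes in latency or bandwidth move the result far more than equal proportional changes in either Mass.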
What can be done with Data Gravity Force?
(Please note that these are IDEAS and are therefore SPECULATIVE until any of this is proven)
Depending on your needs and goals, you may want to embrace Data Gravity or you may want to resist it.
Data Gravity (potentially) has many applications, a few are listed below:
Reasons to move toward a source of Data Gravity (Increase Data Gravity)
- You need Lower Latency for your Application or Service
- You need Higher Bandwidth for your Application or Service
- You want to generate more Data Mass more quickly
- You are doing HPC
- You are doing Hadoop or Realtime Processing
Reasons to resist or move away from a source of Data Gravity (Decrease Data Gravity)
- You want to avoid lock-in / keep escape velocity low
- Application Portability
- Resiliency to Increases in Latency or Decreases in Bandwidth (aka Availability)
Data Gravity and Data Mass may have other uses as well:
- Making decisions of movement or location between two Data Masses
- Projecting Growth of Data Mass
- Projecting Increases of Data Gravity (Which could signal all sorts of things)
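As a speculative illustration of the location decision mentioned in the list above, the sketch below computes the Data Gravity Force for one Application against the same Data Mass reached over two different network paths. Everything here is hypothetical: the function name, the placement labels, and every number. It reads the formula as dividing by the square of (Latency + Average Request Size / Bandwidth).

```python
def data_gravity_force(data_mass, app_mass, requests_per_s,
                       latency_s, avg_request_mb, bandwidth_mb_s):
    """(Data Mass * App Mass * Requests/s) / (Latency + Size / Bandwidth) ** 2"""
    time_per_request_s = latency_s + avg_request_mb / bandwidth_mb_s
    return data_mass * app_mass * requests_per_s / time_per_request_s ** 2


# Hypothetical masses and workload, held constant across placements.
DATA_MASS = 10_000.0
APP_MASS = 50_000.0
REQUESTS_PER_S = 100.0
AVG_REQUEST_MB = 0.064

# Two made-up network paths between the Application and the Data.
placements = {
    "same data center": {"latency_s": 0.0005, "bandwidth_mb_s": 1_000.0},
    "remote region": {"latency_s": 0.050, "bandwidth_mb_s": 100.0},
}

forces = {
    name: data_gravity_force(DATA_MASS, APP_MASS, REQUESTS_PER_S,
                             net["latency_s"], AVG_REQUEST_MB,
                             net["bandwidth_mb_s"])
    for name, net in placements.items()
}

for name, force in forces.items():
    print(f"{name}: {force:.3e}")

strongest = max(forces, key=forces.get)
print("strongest pull:", strongest)  # strongest pull: same data center
```

With identical Masses and request rates, the lower-latency, higher-bandwidth path produces the larger Force, which is the pull toward the Data that the lists above describe.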
Interesting things occur when you overwhelm the network itself. This can happen when you exceed the bandwidth of the network, or when you need lower latency or higher bandwidth than the network currently attached to the data can offer. This may drive you to optimize different things in different ways. Caching is a great example of manipulating the request/response stream by creating a temporary or finite amount of secondary Data Mass (The Cache) to increase the Cumulative Data Gravity, while decreasing the Data Gravity of the primary Data Mass (The Source Data Mass). Replication is another strategy that manipulates Data Mass and Gravity (but can also be modeled).
Please share your thoughts in the comments below as to other uses of the Data Gravity, Data Mass, Application Gravity, and Application Mass.
There are many other things that Data Gravity might be able to be used for, so I’m looking for ideas from the community and would like participation from anyone with ideas. In future posts, topics on use, real world scenarios, different configurations in networks, and many other topics will hopefully be covered, with select guest posts as well.
The people who helped me along the way:
- James Urquhart
- Andrew Clay Shafer
- James Watters
- Adrian Cockcroft
- Simon Wardley
- Joe Weinman
- James Governor
- James Bayer
- Brian Katz
and many others that I have forgotten to mention. Also note that out of the 9 people above, 4 of them are named James!