
Big Data

Silently, steadily, the three-foot-long cylindrical sensor floats in the cold ocean current, 3,000 feet below the surface. Every 10 days it wakes up. A tiny on-board electric motor pumps oil stored inside the cylinder out into an external bladder and—now positively buoyant—the sensor begins to rise. The device breaks the surface top side up, its two-foot antenna extending high enough above the waves to beam a stream of data to an orbiting satellite far overhead: a precise location fix, along with information gathered during the ascent describing water temperature, salinity, and pressure. Once the data are delivered, the pump reverses direction, pulling the oil back within the cylinder, which descends and floats along for another 10 days before waking up again. Right at this moment there are more than 3,500 of these sensors dispersed through the global oceans, delivering by way of satellite more than 100,000 data sets a year.

Thousands of miles away, in a leafy suburban neighborhood north of Baltimore, an array of more than 50 devices connects a couple of football fields’ worth of sensors and buried probes. The devices—called motes—are about the size of a dollar bill and a half inch thick. Powered by two AA batteries, they measure soil temperature and moisture, ambient temperature, light, and the level of carbon dioxide held within the soil, taking new readings every 10 minutes, 24 hours a day, seven days a week. The motes are connected by radio signals to a central receiving node that transmits all the data to a remote computer on the Internet over ordinary phone wires.
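
To give a concrete sense of what the motes produce, here is a minimal Python sketch of a single reading record and the 10-minute sampling cadence described above. The field names and values are hypothetical illustrations, not the actual Cub Hill data format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MoteReading:
    """One hypothetical sample from a single mote (illustrative field names)."""
    mote_id: int
    timestamp: datetime
    soil_temp_c: float
    soil_moisture_pct: float
    ambient_temp_c: float
    light_lux: float
    soil_co2_ppm: float

# One reading every 10 minutes, around the clock:
READ_INTERVAL = timedelta(minutes=10)
readings_per_mote_per_day = timedelta(days=1) // READ_INTERVAL
print(readings_per_mote_per_day)   # 144 samples per mote, per day
```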

In an era when you can ask your phone to get you to the nearest pizza parlor and tell your car to parallel park itself, whole battalions of remote digital sensors hardly seem like news. Until you take all the millions of data points they are collecting and start finding ways of connecting them. Assembled randomly, they are nothing more than the visual equivalent of white noise. But ask the right questions—and use the right kind of computational approaches and equipment that are being pioneered at Johns Hopkins—and that field of static turns into a picture like a Seurat painting. None of the millions of data points carries meaning by itself, but read together, they do. This is the new approach to research that is beginning to permeate science, across every discipline. On the Homewood campus, Assistant Research Scientist Inga Koszalka and Associate Research Professor Katalin Szlavecz are bringing new and richer understanding to their disciplines primarily through the application and manipulation of data—lots and lots of data.

How Big is Big?

“A billion here, a billion there, and pretty soon you’re talking about real money,” said Illinois Republican Senator Everett Dirksen back in 1962, when the U.S. federal debt limit was raised to the seemingly staggering level of $300 billion. Today, the number is more than $15 trillion, and Dirksen’s wry observation has special resonance: For most people, numbers with that many zeros after them don’t seem real somehow. This is especially true in the realm of big data, where most of us find it hard to conceptualize the difference between a nest of petabytes and a whole flock of terabytes.

Luckily the inventors of the big data nomenclature seemed to share the late senator’s sense of humor and came up with a system of names for the scale of big data that uses first logic, then description (with perhaps a flight of poetry), and finally ordination to give an overall sense of just how big these numbers are in relation to one another. And it all starts with the lowly bit, a neologism created in 1946 by Princeton statistician John Tukey while working at Bell Labs on early computer design. He shortened binary digit into bit, the name of the basic on-or-off unit of information that is the foundation of computing. A decade later, IBM scientists at work on the first transistorized supercomputer grouped bits into sets of eight so they could assign a value to every letter in the alphabet. They called these groups of eight bits a byte.

Computers typically need a byte to encode a single character of text. The numbers begin to add up quickly. One page of typed text requires about 2,000 bytes, and already all those zeros are becoming a nuisance. So, following standard scientific procedure, the prefix kilo was introduced as a shortcut, making that typed page a handy 2 KB file—until you decide to encode War and Peace, at 1,400 paperback pages, an unwieldy file of a couple of thousand kilobytes, again a mess with all those zeros. So the term megabyte was created to mean 1,000 KB, from mega, the Greek word for large. The complete works of Shakespeare total 5 MB, and if computers handled only text, we probably wouldn’t have to go much further than that. But text is easy. A single pop song takes about 4 MB—so how about a whole CD? A thousand megabytes equals a single gigabyte, derived from the Greek word for giant. Computers with gigabyte storage once seemed unimaginable, but since a single Hollywood blockbuster, compressed, takes up between one and two gigabytes, it’s easy to see the need to continue. Next up is the terabyte, at 1,000 gigabytes, which some sources claim derives from the Greek word teras, a monster. More likely it is a shortening (with a wink and a nod) of tetra, the Greek word for four, as this and all subsequent data units are named from the Greek, reflecting in a series of ordinal numbers how far down the list they derive from kilobyte. The monster image is fitting for terabytes, though: all the cataloged books in the Library of Congress total an estimated 15 TB, giving some sense of the scale, which is in the trillions of bytes. From here that scale becomes ever more difficult to comprehend:

petabyte (5th) is 1,000 terabytes – 500 billion pages of standard printed text

exabyte (6th) is 1,000 petabytes – five of these would likely hold all the words ever spoken

zettabyte (7th) is 1,000 exabytes – all information in existence guesstimated to be about 1.2 ZB

yottabyte (8th) is 1,000 zettabytes – a septillion bytes, or a one followed by 24 zeros
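
To make the ladder concrete, here is a short Python sketch that walks through the same decimal (1,000-fold) naming scheme, using the article’s figure of roughly 2,000 bytes per typed page; the numbers are back-of-envelope, not exact measurements.

```python
# A minimal sketch of the decimal byte-scale ladder described above
# (1 kilobyte = 1,000 bytes; binary 1,024-based units are ignored here).

PREFIXES = ["kilobyte", "megabyte", "gigabyte", "terabyte",
            "petabyte", "exabyte", "zettabyte", "yottabyte"]

PAGE_BYTES = 2_000  # the article's rough figure of 2,000 bytes per typed page

for rank, name in enumerate(PREFIXES, start=1):
    size_in_bytes = 1_000 ** rank            # kilobyte = 10^3 ... yottabyte = 10^24
    pages = size_in_bytes / PAGE_BYTES
    print(f"1 {name:<9} = 10^{3 * rank:<2} bytes  ~ {pages:.1e} typed pages")
```

Run it and the petabyte line reproduces the 500 billion pages quoted in the list above.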

Koszalka is investigating the deep water circulation patterns of the Irminger Sea, a part of the Atlantic Ocean that lies to the east of Greenland. She is learning how—far beneath the waves—seawater travels not in a continuous current but in packets of denser water measuring 30 to 40 kilometers across, moving through in two-to-three-day cycles. The final destination of these waters is the North Atlantic, where they contribute to the large-scale ocean circulation driven by differences in seawater density. By creating simulations of thousands of numerical “floats” that sample fields of a numerical ocean model much as the oceanographic instruments in the real ocean do, Koszalka gains insight and makes observations without standing on the deck of a ship or even getting her feet wet.
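
The logic behind those numerical floats can be sketched in a few lines of Python: seed virtual particles in a velocity field and step them forward in time, recording positions the way a real drifting instrument would. The velocity field below is a toy stand-in invented for illustration, not output from Koszalka’s ocean model.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity(x, y, t):
    """Toy two-dimensional velocity field (m/s) standing in for a model's output."""
    u = 0.05 + 0.02 * np.sin(2 * np.pi * y / 40_000.0)   # eastward component
    v = 0.02 * np.cos(2 * np.pi * x / 40_000.0)          # northward component
    return u, v

# Seed a few thousand numerical "floats" at random positions in a 100 km square.
n_floats = 5_000
x = rng.uniform(0.0, 100_000.0, n_floats)   # metres
y = rng.uniform(0.0, 100_000.0, n_floats)

dt = 3_600.0                  # one-hour time step, in seconds
trajectories = [np.column_stack([x, y])]

for step in range(24 * 10):   # ten days of hourly steps
    u, v = velocity(x, y, step * dt)
    x = x + u * dt            # simple forward-Euler advection
    y = y + v * dt
    trajectories.append(np.column_stack([x, y]))

# Each float's track can now be analyzed like data from a real drifting instrument.
print(f"{n_floats} floats, {len(trajectories)} recorded positions each")
```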

One floor away in the Olin building, Szlavecz is studying how soil respiration—a naturally occurring phenomenon that puts carbon in the atmosphere at more than 10 times the rate that comes from burning fossil fuels each year—is affected by routine variations in temperature and rainfall. Every day she gathers her data from the soil without ever getting on her knees or dirtying her hands.

Although engaged in the study of radically different environments located many thousands of miles apart, Koszalka and Szlavecz are both power-computing enormous quantities of digital data to gain insight into how large natural systems work. Their research takes place within the Department of Earth and Planetary Sciences, but the same kind of large-data-set science is exploding in fields ranging from astronomy to genetics, protein folding to turbulence studies, neurobiology to hydrodynamics. Driven by ubiquitous Internet access, inexpensive remote-sensing technologies, ever more powerful computers, and the continuously falling price of data storage, a new realm of science is opening that promises to revolutionize how we understand the physical world. It is the science of Big Data, and Krieger School researchers are at the very forefront of this effort.

The New Calculus

Most people tend to think of science linearly: as an accelerating series of insights and discoveries that builds continuously upon itself, like a graph line moving upward across time, rising ever more steeply as it goes. Apostles of big data science say that’s not it at all; the history of science is better understood as a series of epochs defined by the tools available to study and understand the natural world. First, and for thousands of years, science was empirical and descriptive, carefully recording what could be seen by the naked eye. Advances in optics and the discovery of lenses introduced a whole new set of tools, and with them a revelatory understanding of the scale of the universe, the Earth’s place in the solar system, and the other, microscopic world invisible to the unaided eye. Then came Kepler, who used observed data to derive analytical expressions about the motion of planets. Kepler’s laws announced the era of analytic science, of Newton and Lavoisier and Maxwell, culminating in Einstein’s theory of general relativity.

At the midpoint of the 20th century, scientists at Los Alamos confronted a new challenge: Although the equations governing nuclear explosions were relatively simple to write down, they were immensely difficult and time consuming to solve. This led to the invention and use of first mechanical and then electronic computers, and the dawning of the age of computational science, which advances understanding through simulations made by solving equations in fields ranging from biology to physics to hydrodynamics.

The arrival of big data science in the last two decades constitutes another scientific revolution. It is perhaps best epitomized by the Human Genome Project, which was originally conceived as a wet-lab approach to sequencing the genome, expected to take 15 years to complete. But then along came the technique of “shotgun sequencing,” in which the strand of DNA is broken into millions of random small pieces that are sequenced and then reassembled by computers churning through huge volumes of data. This radically different approach allowed then-President Bill Clinton to announce the completion of the first “rough draft” of the human genome fully two years ahead of schedule. It represented a landmark success for big data science. For many, it was a pointed sign of things to come.
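
The reassembly step can be illustrated with a toy greedy algorithm: repeatedly merge the pair of fragments with the largest overlap until no overlaps remain. Real genome assemblers handle sequencing errors, repeats, and billions of reads with far more sophisticated data structures; this Python sketch, built around a made-up sequence, only shows the core idea.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b (>= min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:                      # no overlaps left; stop merging
            break
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Fragments of the made-up sequence "ATGGCGTGCAATGCCT", read in overlapping pieces.
fragments = ["ATGGCGTG", "GCGTGCAATG", "CAATGCCT"]
print(greedy_assemble(fragments))   # -> ['ATGGCGTGCAATGCCT']
```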

“Just simply having access to large data sets and more information doesn’t lead to improved knowledge. It needs to be digested in ways that are not very obvious. We need new creative ways to understand and analyze data sets. It’s a very important challenge.”

—Thomas Haine
Professor, Department of Earth and Planetary Sciences

“This is going to completely change the way we think about the nature of knowledge,” says Jonathan Bagger, Krieger-Eisenhower Professor of Physics and Astronomy and vice provost for graduate and postdoctoral programs and special projects. “It’s not by accident that the Web browser was invented at CERN [the European Center for Nuclear Research], which is the big data physics project of our generation. People who are asking the big questions in science understand that there has been a revolution in the tools. It’s just like calculus was invented to enable us to do physics; this is the new calculus for the next 500 years.”

A Lifeline for the Drowning

But if calculus was a system derived to support a science, big data may be better understood as a science in its own right—advancing both through computer hardware and through the increasingly sophisticated operations that hardware performs—and one uniquely suited to advancing understanding of entire systems. Thomas Haine, professor of physical oceanography, talks about the “grand challenge” of modeling the oceans, such as the Irminger Sea modeling being carried out by Koszalka, who is a postdoc in his group. Understanding a vast and complex system like an ocean is not only intrinsically interesting, he says, but will also provide crucial insights into the process of global climate change. In recent years, the numbers and richness of oceanic measurements and observations have increased dramatically, thanks to the global system of ocean sensors and a complementary program of space observation from satellites using radar altimeters (for sea surface height, which can vary by a meter or more), radiometers (to measure sea-surface temperature), and other instruments. “It raises new challenges in some ways because we’re sort of swamped with data volume,” says Haine, voicing a refrain common among scientists trying to learn how to work successfully in the realm of big data. “Just simply having access to large data sets and more information doesn’t lead to improved knowledge. It needs to be digested in ways that are not very obvious. We need new creative ways to understand and analyze data sets. It’s a very important challenge.”

Calculating the Right Path

Matthew Witten A&S ‘95
Director of CyberKnife Radiosurgery and Chief Physicist in Radiation Oncology, Winthrop University Hospital, Mineola, N.Y.

What if there is not just one solution to a problem but instead a range of solutions, and one of them is optimal?

This is the challenge of the ant colony as it sends its workers out to wander randomly in search of food. When food is discovered, an ant will make its way back to the nest, leaving a trail of pheromone markers drawing other ants. Over time, the shortest route to and from the food will be traversed by the most ants and receive the most pheromone markers, gradually causing the other, longer routes to disappear, a behavior that inspired the computational technique known as ant colony optimization. It demonstrates that though no individual ant has the cognitive ability to figure out the best path on its own, together an ant colony can eventually, by elimination, hit upon the optimal solution.
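
The same trail-laying trick translates almost directly into code. The Python sketch below runs artificial “ants” over a small, invented graph, evaporating and depositing pheromone each round until the shortest nest-to-food route dominates; the graph, parameters, and function names are illustrative, not drawn from any particular published implementation.

```python
import random

# A small, made-up weighted graph: edge weights are path lengths.
GRAPH = {
    "nest": {"a": 2.0, "b": 4.0},
    "a":    {"food": 7.0, "b": 1.0},
    "b":    {"food": 3.0},
    "food": {},
}

def walk(pheromone, alpha=1.0, beta=2.0):
    """One ant walks from the nest to the food, choosing edges probabilistically."""
    path, node = ["nest"], "nest"
    while node != "food":
        edges = GRAPH[node]
        # Attractiveness = pheromone^alpha * (1/distance)^beta
        weights = [pheromone[(node, nxt)] ** alpha * (1.0 / d) ** beta
                   for nxt, d in edges.items()]
        node = random.choices(list(edges), weights=weights)[0]
        path.append(node)
    return path

def path_length(path):
    return sum(GRAPH[u][v] for u, v in zip(path, path[1:]))

def ant_colony(n_ants=20, n_rounds=50, evaporation=0.5, deposit=1.0):
    pheromone = {(u, v): 1.0 for u in GRAPH for v in GRAPH[u]}
    best = None
    for _ in range(n_rounds):
        paths = [walk(pheromone) for _ in range(n_ants)]
        # Evaporate, then deposit pheromone in proportion to path quality.
        for edge in pheromone:
            pheromone[edge] *= (1.0 - evaporation)
        for p in paths:
            for edge in zip(p, p[1:]):
                pheromone[edge] += deposit / path_length(p)
        shortest = min(paths, key=path_length)
        if best is None or path_length(shortest) < path_length(best):
            best = shortest
    return best, path_length(best)

print(ant_colony())   # expected: (['nest', 'a', 'b', 'food'], 6.0)
```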

This approach stands in stark contrast to the way science typically deals with problems of enormous complexity, as for instance, the calculations necessary to deliver the right dose of radiotherapy to a small and precisely defined part of a human body that has a tumor. “What is traditionally done is, you make simplifying assumptions,” says Matthew Witten ’95, chief physicist in Radiation Oncology at Winthrop University Hospital in Mineola, N.Y. “You do this so you can write down a simple equation and get a simple answer.” By simplifying something as complex as irradiating a specific and precisely limited part of the human body, radiation oncologists often end up with a treatment plan that is within the range of right solutions, but not optimal. Big data married to space-age technology is changing all that.

Witten, who majored in physics at Johns Hopkins and is one of several physicists in his family, works with CyberKnife, a proprietary radiosurgery system that uses sophisticated guidance controls and advanced robotics to deliver radiation with a high degree of accuracy. Treatments last 30 to 90 minutes, depending on the complexity of the tumor, but owing to the much higher accuracy, fewer treatments are required. Since gaining FDA approval in 2001, the CyberKnife has been used in the treatment of more than 100,000 patients worldwide, according to the company website, including pancreatic and other cancer patients at the Johns Hopkins Kimmel Cancer Center.

Recently, Witten and research partner Owen Clancey have one-upped the CyberKnife control system, developing a nature-inspired heuristic optimization algorithm to improve the device’s clinical effectiveness, an approach that enhances the capacity for precision by making better use of its computer’s ability to manipulate large sets of data. “We take an evolutionary approach, recognizing there is a population of solutions, maybe considering 10,000 of them, to find a high-quality one,” he says, noting the greater the accuracy, the fewer the courses of treatment required. “We’ve gotten better treatment plans and better quality outcomes. As the accuracy improves, there is far less injury to other tissue and the number of treatments necessary declines. We can treat prostate cancers in just one week.”
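
Witten’s “population of solutions” is the hallmark of evolutionary search, an idea that can be boiled down to a few lines. The Python sketch below evolves candidate parameter vectors against a made-up scoring function; it is a generic illustration of the approach, not the actual CyberKnife planning algorithm, and every name and number in it is hypothetical.

```python
import random

def fitness(plan):
    """Stand-in objective: how close a candidate 'plan' (a list of numbers) comes
    to a made-up target. A real planner would instead score dose coverage of the
    tumor against dose delivered to surrounding healthy tissue."""
    target = [0.8, 0.2, 0.5, 0.9]
    return -sum((p - t) ** 2 for p, t in zip(plan, target))

def evolve(pop_size=10_000, generations=50, mutation=0.05):
    # Start from a random population of candidate plans.
    population = [[random.random() for _ in range(4)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 10]           # keep the best 10%
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            # Blend two good plans and add a small random mutation.
            child = [(x + y) / 2 + random.gauss(0, mutation) for x, y in zip(a, b)]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print([round(v, 2) for v in best])   # should land near the target values in fitness()
```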

Data today is an embarrassment of riches—there is so much of it, so detailed, and with such potential—that it seems to be turning science on its head. “It used to be that we had too little data. In field work, someone in hip boots would measure this and measure that, and it would all fit in a notebook,” says Jonathan Bagger, whose role in the Provost’s Office helps support and coordinate big-data initiatives. “Now we have cheap and ubiquitous sensors providing an unending stream of data. Rather than having too little data, now we have too much.” Bagger notes that in science, as in the business world, you increasingly hear people fretting about the data glut—so much information streaming through the Internet and residing in memory that things bog down. “It clogs up the pipes,” is how he puts it. Help is on the way in the form of new approaches to handling data and a new computational infrastructure that came about almost incidentally, in an effort to create a map of the stars.

In 1992, Johns Hopkins became one of a number of universities collaborating on a photographic record of the night sky using a dedicated 2.5-m wide-angle optical telescope at the Apache Point Observatory in New Mexico. Named the Sloan Digital Sky Survey in recognition of its lead funder, the Alfred P. Sloan Foundation, over the next eight years it obtained images covering more than a quarter of the sky and created three-dimensional maps containing more than 930,000 galaxies and more than 120,000 quasars. “Every participating institution had to pick a piece of the system that they would be responsible for, and Hopkins chose building some of the spectrographic instruments as well as managing the database,” recalls Professor Alex Szalay, of the Department of Physics and Astronomy, who served as a lead researcher on the project.

Since Szalay’s research interests involved using statistics to analyze cosmological data, managing the database seemed like a natural fit. “If we can’t properly aggregate the data, then I can’t do my work,” he says. Early on he recognized that it wasn’t enough just to assemble all the digital images in the sky survey; the real challenge was to find a way to make that huge quantity of data accessible, manageable, searchable—in a word, useful. As the project evolved, Szalay enlisted the help of Microsoft’s Jim Gray, widely recognized as one of the world’s foremost database experts. Szalay says the problems they faced in managing the expected 40 terabytes of image data (See “How Big Is Big?”) intrigued Gray, who saw it as the prototype of how science was to be conducted in the coming years.

“This Data-Scope instrument will be the best in the academic world, bar none.”

—Alex Szalay
Professor, Department of Physics and Astronomy

“We promised we would gather all this data with the derived lists and catalogs of galaxies and stars and make it available to the public,” says Szalay. “At that point, public databases were typically only hundreds or thousands of objects when we were talking of hundreds of millions. It was thousands to a million times larger data than anyone had really tackled.” Two decades later, external hard drives holding terabytes of data have become commonplace. But in 1992, storing, moving, and manipulating that much data presented a serious challenge. As it happened, though, Intel co-founder Gordon Moore’s famous formulation—that the number of transistors that can be placed inexpensively on an integrated circuit doubles every two years—held true. Computers grew increasingly powerful and memory increasingly cheap and plentiful as the Sloan project unfolded, with actual data collection beginning in 2000. During this time, Szalay, Gray, and their colleagues gradually became convinced they were working on a frontier of new understanding, not just of the universe through the images they were cataloging and storing but of a whole new way of doing science. And they saw the need for new kinds of tools to make the work possible.

This May, Szalay and four other Johns Hopkins co-investigators plan to power up their unique contribution to advancing the science of big data in a large room in Homewood’s Bloomberg Physics and Astronomy building that used to house mission control operations for the FUSE satellite. About 12 racks of off-the-shelf, high-end computer components have been assembled in a new and unique way to create what they believe will be an epoch-changing device for data research. With the ability to handle more than five petabytes of information, read at speeds of 500 gigabytes per second, and draw information from approximately 5,000 disk drives operating in parallel, they expect their creation to outpace the legendary Jaguar supercomputer at the Department of Energy’s Oak Ridge National Laboratory in accessing these petabytes, operating faster by a factor of two. “It will search for patterns and relationships by looking at huge quantities of data from afar, like a telescope,” says Szalay. “But it will also work as a microscope of data, to be able to see not only the big picture but also the tiny details.”
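
A quick back-of-envelope check shows how those aggregate numbers break down per drive, assuming the load is spread evenly; the per-drive figures are inferences from the totals quoted above, not published specifications.

```python
# Back-of-envelope check of the Data-Scope figures quoted above,
# assuming capacity and bandwidth are spread evenly across the drives.
total_capacity_tb = 5_000          # ~5 petabytes, expressed in terabytes
total_bandwidth_gb_s = 500         # aggregate sequential read rate
n_drives = 5_000

print(total_capacity_tb / n_drives, "TB per drive")                # ~1 TB each
print(total_bandwidth_gb_s / n_drives * 1_000, "MB/s per drive")   # ~100 MB/s each
```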

This system, supported by a grant from the National Science Foundation, is appropriately called the Data-Scope, and they confidently expect it will open a new era of computational research. The researchers have identified nearly two dozen research groups within Johns Hopkins alone that currently are struggling with data problems totaling three petabytes or more. Without Data-Scope, says Szalay, “they would have to wait years to analyze that amount of data.”

The Internet of Things

Homaira Akbari

President and CEO, SkyBitz
Member, Department of Physics and Astronomy Advisory Council at Johns Hopkins

Former Krieger School Postdoctoral Fellow at the European Center for Nuclear Research (CERN)

About two decades ago, the U.S. Defense Department faced an interesting dilemma: How could it track and monitor its assets used in warfare—things like trucks and Humvees, planes and tanks—in real time to produce useful information for commanders in the field and at home?

Within a few years, thanks to funding from the Defense Advanced Research Projects Agency (DARPA), the Air Force and other services were awarding commercial contracts to put devices in their planes and other equipment that could provide a near-continuous stream of information about location, speed, direction, and status via the military’s network of orbiting satellites. Since then, ongoing innovation and improvement have made the devices—commonly referred to as tags—smaller, smarter, and most of all cheaper, to the point where just about any machine or device can be equipped with one. “One day all the non-people things in our lives will have tags on them,” predicts Homaira Akbari, president of SkyBitz, an early pioneer and important player in the field of device telematics. “Everything will be connected.” Welcome to the Internet of Things.

Akbari, a trained nuclear physicist who was a Johns Hopkins postdoctoral fellow at the European Center for Nuclear Research (CERN) before entering the business world, says her company daily monitors hundreds of thousands of assets for clients, things like tractor-trailers, intermodal containers, rail cars, power generators, and heavy equipment. Not only can SkyBitz provide real-time snapshots of, say, the individual location, movement, and status of an entire fleet of trucks; it also keeps that information for many years to provide the ability to study and analyze the data for making future business decisions. All of which means one thing: “We are collecting dramatically larger data sets,” Akbari explains. “Right now, companies are using some of it, but they don’t know what to do with all of it. There is a lot of unstructured data that’s just sitting there.” She is convinced the future belongs to the companies that can best figure out how to analyze and make use of the big data that will be generated in the Internet of Things and, like any good businessperson, is determined to see her company play a leading role. Where some of her clients may feel overwhelmed by a flood of data, data, everywhere, Akbari sees something else. “They are sitting on a gold mine,” she says.

One of the first projects slated for investigation on Data-Scope is Inga Koszalka’s mathematical modeling of the North Atlantic Ocean currents. “Data-Scope will enable us not only to move all our data to one place but also to generate new data through simulations and to analyze them directly on the same machine,” she says, noting that Alex Szalay’s insight is that the future of big data lies in moving the analysis to the data, a paradigm shift in which computer-derived large-scale data analysis will begin to drive scientific discovery. It is a shift that comes perhaps not a moment too soon.

Katalin Szlavecz’s project monitoring soil and other environmental factors in the Cub Hill neighborhood north of Baltimore is, in one sense, a relatively modest research effort, investigating soil respiration in one kind of ecosystem and one particular microclimate. “I’m just one person interested in one aspect of this problem, but anyone who does environmental monitoring ends up creating a lot—a lot—of data,” she says. “Last year we recorded 160 million data points. We’re drowning, and it’s a major headache.”

But cracking the secrets of successfully managing and manipulating big data presents enormous opportunities as well; consider for a moment the names of the companies that have sprung up in the last decade or so that do it successfully: Google, YouTube, Facebook. Big data is in fact a cutting-edge research area in which the United States is an undisputed leader, says Szalay. And in the academic arena, he and his colleagues are racing to position Johns Hopkins at the forefront. “This Data-Scope instrument will be the best in the academic world, bar none,” he has said. “There is really nothing like this at any university right now.”