Much Information

Big data is the future. It’s challenging and overwhelming us right now. How UI researchers are trailblazing ways through the megasphere of petabytes.

Winter night, clear skies, deep cold. Orion rising.

Step outside. Overhead, most people will see the twinkling firmament, the Milky Way, a constellation or two. Maybe a rising moon. Jon Thaler sees big data.

The University of Illinois astrophysicist is a leading researcher in the Dark Energy Survey, a project to photograph the unfolding of the universe through the last 8 billion years.

“The universe is actually 13.7 billion years old,” says Thaler, a pleasant, grizzled man who has moved from particle physics to astrophysics and cosmology during his academic career. “But 8 billion years is as far back as we can get in our present project.” The survey’s 570-megapixel camera, set in an observatory high in the Andes, is taking a picture of the sky every two minutes, capturing recursive views of 300 million galaxies and 100,000 galaxy clusters, within which Thaler and his colleagues are even now searching for clues to the beginning of everything.

Supernovas are such clues—exploding stars that can be seen for just a few months. Light from supernovas helps researchers understand the speed at which dark energy is causing the universe to expand. Finding one among the camera’s images is like looking for a screw that fell off your mower somewhere on the 10 million square miles of lawn behind your house. Launched in August, the Dark Energy Survey is, over its five-year span, expected to generate a petabyte of data. That’s the equivalent of 58,292 Hollywood movies or 20 million four-drawer filing cabinets filled with text.

And that’s big data—enormous amounts of information and tiny, elusive, crucial revelations, which form a vast, ever-more mountainous realm of information embodying past, present and future. Take historical data of all kinds. Add the explosion of knowledge across the corporate world and scientific research. Throw in information generated online. The result? Huge data sets and gargantuan challenges of access, application, processing and storage.

Proton-proton collisions and the trouble with text
Hardly anyone has bigger big data than particle physicists. Like Mark Neubauer, one of a dozen UI particle physicists who research the Higgs boson, a subatomic particle thought to be the ultimate guiding organizer in the creation of matter. Called “a needle in a haystack of needles,” the Higgs boson is so elusive that approximately 1 quadrillion proton-proton collisions (and the work of thousands of physicists) have been required to discover it. The collisions are the business of the Large Hadron Collider at the CERN research facility in Switzerland, which generates a few petabytes of data per year—information that must be enhanced, refined and shared via a global network of computer clusters. Through Neubauer’s efforts, one such cluster is now housed at Illinois, allowing particle physicists worldwide to analyze the data.

On July 4, 2012—to the gratification of the Illinois team and their colleagues around the world—a Higgs-like particle was observed. Joy followed this year on Oct. 8, when Peter Higgs and François Englert received the Nobel Prize for their theory, posited almost 50 years ago, that the particle exists. Now, “we’re no longer searching for a particle,” Neubauer’s colleague, Tony Liss, told The [Champaign-Urbana] News-Gazette. “We’re now trying to measure its properties with some precision.”

But the world also is weighted down with mega-troves of information that do not readily enter into computer-friendly relationships. “We need to better understand what’s in the data,” says UI researcher Dan Roth, who has made it his life’s work to teach computers to process text and imagery. “Eighty-five percent or more of the information corporations and organizations have to deal with is unstructured text, including emails, websites and social media,” he observes. “It’s far harder to access than a database, where you know the information and can query it in a certain way.”

One effective model for querying text has emerged from the Cline Center for Democracy at Illinois. Led by center director Pete Nardulli, a UI professor with dual appointments in political science and law, students and researchers there have transformed an archive of 75 million news reports, dated 1946-2005, into a major informational asset. After initially reading some of the local and national news articles on civil strife, the team identified semantic structures and key terms to allow computing resources to take over.
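The center’s basic strategy—read a sample by hand, distill the recurring terms, then let software scan the rest—can be sketched in miniature. The snippet below is a hypothetical illustration, not the Cline Center’s actual pipeline; the event categories, keywords and the `tag_report` function are all invented for the example.

```python
import re

# Hypothetical lexicon of civil-strife terms, of the kind human readers
# might distill from a sample of news reports before automating the scan.
EVENT_TERMS = {
    "protest": ["protest", "demonstration", "march"],
    "armed_conflict": ["rebels", "insurgents", "clashes", "gunfire"],
    "strike": ["strike", "walkout", "work stoppage"],
}

def tag_report(text):
    """Return the set of event categories whose keywords appear in a report."""
    lowered = text.lower()
    found = set()
    for category, keywords in EVENT_TERMS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", lowered) for k in keywords):
            found.add(category)
    return found

report = "Thousands joined a march through the capital as clashes broke out."
print(sorted(tag_report(report)))  # ['armed_conflict', 'protest']
```

Once a lexicon like this is validated against hand-coded articles, the same scan runs unchanged over millions of reports—which is what turns an archive into a queryable asset.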

The center has gone on to create a worldwide database of conflicts big and small over that 60-year time period. Nardulli notes that data mining and tracking present an enormous breakthrough in political science.

“We’ve never had tools like this working in a truly interdisciplinary initiative,” he says. “These resources have greatly increased our capacity to study civil strife and to identify many more such events taking place in the world.”

Big data gets personal
We ourselves are fountains of data, spewing information through our mobile devices, our credit cards, our GPS use and our rewards memberships at CVS (which reward us in turn with novella-length receipts). Mike Shaw, a business professor who holds an endowed chair in management information systems at Illinois, believes big data—including images, geographic information, social media feedback, retail transactions and information collected through cellphones, sensors and other devices—will bring greater efficiency and profitability to the business world. Areas that can benefit range from facilities and supply chain management to predictive analytics, whereby companies forecast, and manipulate, consumer purchases and preferences.

Shaw is particularly interested in how companies can go online to monitor and enhance brand and customer experience. He cites Barack Obama’s innovative use of social media to marshal support through his website, where more than 2 million accounts were created during his 2008 presidential campaign by people eager to share information and get others involved. “Obama didn’t read a textbook on this—he developed it,” Shaw observes. “It was very ingenious.” Companies using big data should, Shaw says, “respond to what they find out about their customers and also seek new possibilities to act on.” These fresh insights can be subsequently refined using feedback from managers and customers.

Social media is the playground of big data, where celebrities flog their fame and old friends find each other and new products sally forth in search of buyers. The bulletin board of Facebook bulges with more than a billion users. Twitter’s 100 million users send 500 million tweets every day. This enormous, endlessly individualistic realm draws researchers in quest of what people think—and what they know. Miles Efron, a UI associate professor of library and information science, uses archived Twitter feeds—one of his sample sizes ran to 16 million tweets, which, he says, “seemed big at the time”—to search for key words and terms crucial to his research. (He also conducts similar research on the vast online encyclopedia Wikipedia.) He theorizes that, nestled amid the gazillions of 140-characters-or-less tweets of trivia, factoids and spam, awaits deep intelligence of unfolding events—cryptic messages about rising floodwaters, creeping fires, gunfire in the distance. His objective is to create algorithms that will enable news companies and journalists to search Twitter and identify major events as they break. It is easy to see how such capability would be alluring not only to Yahoo and CNN but to government agencies as well.
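One simple way to surface a breaking event in a tweet stream is burst detection: flag any term whose count in the current time window far exceeds its history. This is a generic sketch under assumed inputs (pre-tokenized tweets grouped into windows), not Efron’s actual algorithm; `bursty_terms` and its threshold are invented for the example.

```python
from collections import Counter
from statistics import mean, stdev

def bursty_terms(windows, threshold=3.0):
    """Flag terms whose count in the latest window exceeds their historical
    mean by `threshold` standard deviations. Requires at least one prior
    window; a floor of 1.0 on the deviation avoids flagging tiny wobbles."""
    counts = [Counter(w) for w in windows]
    history, current = counts[:-1], counts[-1]
    flagged = []
    for term in current:
        past = [c[term] for c in history]
        mu = mean(past)
        sigma = stdev(past) if len(past) > 1 else 0.0
        if current[term] > mu + threshold * max(sigma, 1.0):
            flagged.append(term)
    return sorted(flagged)

# Four hourly windows of (already tokenized) tweets; "flood" surges in the last.
windows = [
    ["traffic", "coffee", "game"],
    ["game", "coffee", "rain"],
    ["rain", "coffee", "traffic"],
    ["flood", "flood", "flood", "rain", "evacuate", "flood"],
]
print(bursty_terms(windows))  # ['flood']
```

A production system would add far more—spam filtering, geolocation, term co-occurrence—but the core signal is exactly this kind of statistical anomaly in word frequency over time.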

And when big data collides with Big Brother, individual privacy may be compromised—as made explicit by ongoing revelations that the U.S. National Security Agency is collecting huge amounts of information on the telephone and Internet habits of U.S. citizens and persons of interest worldwide. One danger is that access to data within the NSA itself is much less restricted than it used to be, raising serious questions of safeguards. “There’s a tremendous responsibility to make sure they’re using the data they’re collecting responsibly,” says UI computer scientist Roy Campbell, who is leading the Grainger Engineering Breakthroughs initiative to boost big data and bioengineering research at Illinois.

Most of us have online profiles profoundly scored with the tracks of our activities there. A great deal of information—right down to books purchased on Amazon and movies downloaded from Netflix—creates patterns that are “uniquely identifying—kind of like DNA,” according to Marianne Winslett, a UI computer scientist. Winslett, who just got back from four years in Singapore where she headed the University’s Advanced Digital Sciences Center, has developed code to deliberately introduce “noise” into data sets, scrambling the information so that it masks the identities of the individuals from whom it is taken but can still be used for research. While Winslett’s research pertains to genomic profiles and other biomedical information, she sees tremendous potential for applications of this technology in many kinds of statistical analyses.
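The article doesn’t detail Winslett’s method, but the general technique—deliberately adding calibrated random noise to aggregate statistics so no individual stands out, in the spirit of differential privacy—can be sketched as follows. The function names and the data are invented for illustration.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_counts(counts, scale=1.0, seed=None):
    """Return counts with Laplace noise added, masking individual
    contributions while keeping aggregates usable for analysis."""
    rng = random.Random(seed)
    return [c + laplace_noise(scale, rng) for c in counts]

# True counts of patients per genomic variant (hypothetical data).
true_counts = [120, 45, 7, 300]
protected = noisy_counts(true_counts, scale=2.0, seed=42)
print(protected)
```

The `scale` parameter is the privacy dial: more noise gives stronger masking but blurrier statistics, which is exactly the trade-off between protecting identities and keeping the data useful for research.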

Gigantic genomics
While relatively small scraps of intel may point to who we are, what we are entails a huge amount of information. The human genome consists of 3 billion nucleotides (base pairs of DNA) stored in 23 chromosomes, and drawing genetic connections to higher-order traits involves “huge spans of biology between levels and many, many computations to go from level to level,” according to Gene Robinson. Celebrated for his work linking the genes of honeybees to their behavior, Robinson leads the Institute for Genomic Biology on campus, where research pursues subjects as disparate as climate change, antibiotics and social behavior. With the mapping of genomic profiles about to become much less expensive, the world is, in his view, on the cusp of a genomic revolution. “There are an estimated 1 million species of plants and animals,” Robinson says. “Genomics will sequence them all.” He observes that clinics, such as Mayo, are considering making genomic sequences a standard part of treatment and that “there’s already talk in some countries of having all babies sequenced at birth.” The outcome will be hundreds of millions of genomic profiles linked to enormous amounts of information up the biological hierarchy—the better to understand the many, many long and tangled paths between genes and the beings they create.

With genomics data already doubling roughly every five months, the need for the technology to handle it is ever-more urgent. Spearheaded by the IGB and the UI Coordinated Science Lab, CompGen is an interdisciplinary venture to meet this challenge. At the forefront of the effort—and on the “bleeding edge” of information technology—is the creation of a new kind of computer that can store huge amounts of information and rapidly retrieve specific items for processing. UI computer engineer Steve Lumetta recently received $2.6 million in go-ahead funding from the National Science Foundation to lead a team that will design and build such a device. CompGen leaders at Illinois also include Ravi Iyer, past vice chancellor for research; Victor Jongeneel, director of High-Performance Biological Computing; and computer scientist Saurabh Sinha.
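A doubling time of five months compounds faster than intuition suggests. The short calculation below makes the arithmetic concrete; the starting volume of 1 petabyte is purely illustrative.

```python
DOUBLING_MONTHS = 5  # doubling time cited for genomics data

def projected_volume(start_pb, months):
    """Data volume in petabytes after `months`, doubling every 5 months."""
    return start_pb * 2 ** (months / DOUBLING_MONTHS)

for years in (1, 3, 5):
    pb = projected_volume(1.0, 12 * years)
    print(f"after {years} year(s): {pb:,.0f} PB")
```

At that rate, a single petabyte grows roughly fivefold each year and reaches 4,096 petabytes—about 4 exabytes—within five years, which is why exabyte-scale storage and retrieval sit at the heart of CompGen’s agenda.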

Faced with challenges that include huge volumes of information and data bottlenecks, Lumetta envisions new levels of miniaturization and processing speeds up to 100 times those of today. The payoff, he says, will be nothing less than better lives. Computational genomics will open the path for personalized medicine, allowing treatment to be tailored to individuals based on genetic predispositions as well as medical history. “The potential for helping people is quite large,” Lumetta says. Rather like the data itself, which will be measured in exabytes. One exabyte equals a thousand petabytes.

And five exabytes add up to all the words ever spoken.

Big data goes wild
Just as work—and traffic—seem to expand to fill the space available, so information is growing with the capacity to collect it. For UI library and information scientist Carole Palmer, PhD ’96 LIS, whose specialty is data curation, “there is a huge need for research libraries and computer services to be ready to maintain these assets.” With Yellowstone National Park as the unlikely but brilliant nexus, Palmer is leading a project to develop a framework for all the scientific data on features of the environment there, from microbes to wolves to weather. Her co-investigator is Bruce Fouke, a UI geobiologist who researches microorganisms in the park’s hot springs.

In the marvelous ecosystem of Yellowstone, “everything is linked,” Fouke says, “from DNA and RNA sequences of life forms to rock composition to climate models.” Fouke has assigned a team of Illinois students to seek out scientific information gathered in the park, information that dates back decades. Ultimately, access to this information will benefit future research and Yellowstone itself, informing park management decisions with scientific data.

This is one example of the evolution of big data into resources shared by all. In May, the White House issued an executive order to make government information open and machine-readable so as to “fuel entrepreneurship, innovation and scientific discovery.” In September, Acxiom, a leading data broker that has collected information on the majority of U.S. households, opened a website where people can view—and edit—some of the things the company knows about them.

UI astronomer Robert Brunner observes that big data has created a new paradigm of science. Traditionally given over to studies that collect data and test theories through observation and simulation, science now also has access to huge redoubts of information. Research will be replicable and, says Brunner, “fully transparent so that work will not just be verifiable—it will be a foundation for work by others.” He sees the issue as how to “pull new insights out of data” and even envisions a “datascope”—software that can look at information both from a long perspective (like a telescope) and up close (in the manner of a microscope), so both macro- and micro-data can be employed in the generation and testing of new hypotheses.

What such innovations can and will do is vastly speed up the scientific process. With the right data sets and computing resources, studies that might have taken months or years to run could be completed in weeks. “As this fourth paradigm grows, you can either be trying to catch up or trying to steer the conversation,” Brunner points out.

“I think the University of Illinois needs to steer the conversation.”

A conversation that will go from the genes that make us who we are to the outer edge of everything that has ever been in time, space and the universe.

Big data.

Really big.