Ilkay Altintas: Advancing Computational Research with Scientific Workflows
It was a lark that brought Ilkay Altintas to San Diego. The year was 2001. She had just finished her M.S. thesis and was working at the Middle East Technical University in Ankara, Turkey, when she discovered an open position at the San Diego Supercomputer Center (SDSC). That job, related to scientific data management, launched a career in which, in a relatively short time, she has carved out a particularly useful specialty and made an impressive impact helping computational scientists in a wide variety of disciplines.
“Pi” Person of the Year
Cut to 2013 when Altintas, now with a Ph.D. and serving as Director of a Center of Excellence (described below), received the first “Pi Person of the Year” award at SDSC. The pi stands not for principal investigator but for the mathematical constant π. In this case, it underscores that Altintas’ work spans both scientific applications (in fact many of them) and computer science (cyberinfrastructure). She literally has one “pi” leg in each camp.
The award also recognized her as a fitting symbol of the kind of interdisciplinary work being done at SDSC and her remarkable track record in landing research grants. More significantly, she has some of the most frequently cited peer-reviewed research papers related to scientific workflows. In a recent online search of the Web of Science, her papers ranked #2, #27, #42, and #49.
Furthermore, in an article in Procedia Computer Science titled “Exploring the e-Science Knowledge Base through Co-citation Analysis” (see full citation under “References,” below), Altintas is cited as one of the top-10 “turning-point” authors. The paper’s authors used the knowledge domain visualization software CiteSpace to analyze the e-Science knowledge base, as pertaining to grid, desktop grid, and cloud computing, to identify landmark articles and authors irrespective of the number of times their articles have been cited.
Focus on Scientific Workflows
It’s Altintas’ work as Director of SDSC’s Scientific Workflow Automation Technologies Laboratory that has earned her particular acclaim. A scientific workflow is software that chains together a series of computational and/or data-manipulation steps into an application that can be run, often on high-performance computers, to produce data for subsequent analysis or comparison with other data sets. These workflows are proving to be science accelerators because they reduce, in some cases dramatically, the time to results for scientists.
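The core idea can be sketched in a few lines of ordinary Python. This is a minimal illustration, not Kepler itself, and the step names below are hypothetical placeholders: a workflow is simply an ordered chain of processing steps, each consuming the previous step’s output.

```python
# Minimal sketch of a scientific workflow: an ordered chain of steps,
# each consuming the previous step's output. The step names are
# hypothetical placeholders, not actual Kepler components.

def acquire(raw):
    """Stand-in for a data-acquisition step: parse raw strings to numbers."""
    return [float(x) for x in raw]

def clean(values):
    """Stand-in for a filtering step: drop negative readings."""
    return [v for v in values if v >= 0]

def analyze(values):
    """Stand-in for an analysis step: here, a simple mean."""
    return sum(values) / len(values)

def run_workflow(raw, steps):
    """Thread the data through each step in order."""
    data = raw
    for step in steps:
        data = step(data)
    return data

result = run_workflow(["1.0", "-2.0", "3.0"], [acquire, clean, analyze])
print(result)  # mean of the non-negative values: 2.0
```

A system like Kepler generalizes this pattern: the steps become reusable graphical components, and the chaining, data transport, and execution on remote resources are handled by the framework rather than hand-written glue code.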
Altintas’ focus on workflows was a natural outgrowth of various developments in computational and computer science. Component-based systems, like those she worked on for her M.S. thesis research, were being used to modularize and model entire software systems. Against that backdrop, Altintas, through her work at SDSC, began building scripted workflows. But, she acknowledges, while they helped scientists focus on scientific questions rather than the technical details of their computations, the workflows seemed like “black boxes” to the researchers who used them without understanding exactly how they were producing the results.
At the time, in the late 1990s, grid computing was coming into its own and began to support middleware tools. Service-oriented computing was also becoming popular. With the emergence of distributed systems, software developers needed to integrate resources and pass data among them. Amid this computational complexity, Altintas became interested in how to program a string of processes in a more intuitive way. She also began to notice commonalities in user requirements across what seemed like very different application areas. This, to her, suggested the notion of re-use.
Putting these needs together, Altintas envisioned the way forward: a grassroots workflow effort based on an open-source platform.
The Kepler Workflow System
Altintas and colleagues built such a system on top of a modeling tool for engineering called Ptolemy II, named after the 2nd-century mathematician/astronomer. Following this naming tradition, Altintas and colleagues named their system Kepler after the revolutionary 17th-century scientist known for his laws of planetary motion. The name provided brand recognition and, in retrospect, anticipated the wide-ranging impact the system was to have.
The Kepler project was initiated in August 2003 with a first Beta release in 2004 and ongoing release cycles since (the latest is version 2.4), managed by SDSC. One of the keys to the system’s success is that it goes one step beyond open-source software: Anyone can become part of the Kepler community and offer core functionality or modules to be deployed on top of Kepler releases. Altintas says that “development is applications-driven, with all functionality suggested by the community using, or wanting to use, it.”
Workflows in NBCR
In addition to her other responsibilities, Altintas is co-principal investigator (with Philip Papadopoulos) of Core 4 in NBCR. In this role, she focuses on developing practical cyberinfrastructure for multi-scale refinement, modeling, and analysis workflows. According to her, “Kepler helps integrate and build NBCR applications so they can execute transparently on a variety of computing resources. The software modules can be mixed and matched depending on the scientist’s purpose and goals. I’m always listening for inputs and outputs as a mechanism to guide development of a particular workflow.”
NBCR, in fact, serves as the application hub for Kepler. “Kepler has reusable building blocks – we’ve used most of them a fair number of times,” says Altintas. “It’s easy to put them together in rapid application prototypes and, from there, scale up execution or publish them as software packages that others can use. We do all of that at NBCR.” Within NBCR, Kepler now supports everything from bioinformatics and drug design applications, to complex microscopy and imaging applications, to patient-specific cardiac modeling. Application of Kepler in such diverse biomedical environments pushed further development, resulting in bioKepler.
Like applications that push the boundaries of technology in computer science, NBCR provides the ideal scientific applications to give bioKepler a demanding workout. bioKepler provides a graphical user interface (GUI) to connect big data with domain-specific biomedical tools that analyze those data in a scalable fashion. The GUI can be used to link tools together into an application logic that can then be run in batch mode on high-performance computing systems and in cloud environments.
Kepler and Reproducing Scientific Results
Significantly, Kepler also helps address a hot topic in the science community: provenance, that is, the ability to accurately reproduce the scientific breadcrumb trail that produced the results. Given the occasional scandal surrounding scientific conclusions based on analysis of false data, scientists are paying increasing attention to this issue so they can reproduce others’ results and verify their integrity. Reproducibility is especially important—and challenging—for multi-scale modeling, which is NBCR’s niche. It’s a field in which a single computational experiment may require upwards of 200 steps. Kepler workflows not only support reproducibility but, in addition to final results, they promote sharing of accurate scientific methods.
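The provenance idea itself is simple to sketch. The toy recorder below uses hypothetical helper names and is not Kepler’s actual provenance API; it just shows the principle: log each step’s name, inputs, and outputs as the workflow runs, so the exact chain that produced a result can later be audited and replayed.

```python
# Toy provenance recorder: log each step's name, input, and output as a
# workflow runs, so the chain that produced a result can be audited and
# replayed. An illustration of the principle, not Kepler's provenance API.

provenance = []  # ordered trail of {step, input, output} records

def traced(step):
    """Wrap a step so every invocation is recorded in the trail."""
    def wrapper(data):
        result = step(data)
        provenance.append({"step": step.__name__,
                           "input": data,
                           "output": result})
        return result
    return wrapper

@traced
def normalize(values):
    """Scale values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]

@traced
def threshold(values):
    """Keep only values above a fixed cutoff."""
    return [v for v in values if v > 0.25]

out = threshold(normalize([1, 1, 2]))
for record in provenance:
    print(record["step"], "->", record["output"])
```

In a 200-step multi-scale experiment, a trail like this is what lets another scientist verify not just the final number but the method that produced it.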
Synergy between the Developers and Application Scientists
Kepler has always fueled synergy between its developers and the application scientists who use it. As scientists become trained in its use, they, in turn, bring more challenging scientific questions to the table, which, in turn, spur more development. “What if…?” is probably the most common question in Altintas’ lab.
“The scientists also advertise for us,” says Altintas. “As their applications grow in number, we are able to test the platform more. Everyone wins. The more scientists learn about and use this technology, the more useful, robust, and comprehensive the ecosystem we’re developing becomes.”
The Importance of Training
But for scientists to go beyond the “black box” issue that has prevented progress, they need to understand workflow components and how they are put together to ensure the validity of their results. Further, they need to understand how workflows work so they can begin developing their own to address more complicated scientific questions. The impact of Kepler will scale as more scientists gain this understanding.
Altintas and her team provide training in various ways. Sometimes it’s a formal “bootcamp” for informatics and computational data science in which they focus on end-to-end processes to achieve specific results. Training is also provided through academic projects and for industry on a recharge basis. NBCR just co-sponsored a scalable bioinformatics bootcamp in late May, which will return in the fall, and a “hackathon”—an event at which scientists will gather with Kepler experts to develop their workflows—is scheduled for later this month (July 2014).
One satisfied bootcamp participant recently reported that, after learning how to use Kepler, he was able to achieve, in two days, results that previously would have taken him two years. And his experience is hardly unique.
While UCSD doesn’t offer workflow classes per se, it has just approved a new M.S. degree program in Data Science, in which the study and application of workflows will be part of some project-based courses. Altintas expects to be part of the team that teaches these classes. In addition, she and NBCR Director Rommie Amaro are exploring the possibility of using online training, such as Massive Open Online Courses (MOOCs), to more broadly enable researchers to make effective use of tools, like Kepler, that NBCR develops and makes publicly available.
Workflows for Data Science: A Center of Excellence
Just in April of this year, Altintas inaugurated a Center of Excellence at SDSC, called Workflows for Data Science, or WorDS. Its goals include providing the ability to access and query data; scale computational analysis to higher-performance computers; increase software re-use and workflow reproducibility; save time, energy, and money; and formalize and standardize workflow processes. “Our focus is on use cases, not technology,” says Altintas. In addition to the eye-catching amount of funding Altintas has brought in (currently $8M), the center has published an impressive list of peer-reviewed papers.
Here the applications areas served are much broader than biomedical science and include environmental observatories, oceanography, geoinformatics, and computational chemistry. The areas of expertise represented by center staff include research on scientific workflow automation technologies, big data applications, workflows adapted for cloud systems, development and consulting services, and workforce development.
One of the most down-to-earth projects that Altintas and her colleagues are working on is WiFire, a project funded in 2013 by the National Science Foundation. Its goal is to be able to predict where a fire will head while it’s burning. The team is building a cyberinfrastructure that integrates cameras and other data sensors mounted on radio antennas throughout San Diego County, high-speed communications networks, high-resolution imagery, and high-performance computing. When a fire starts, data from the sensor network along with satellite and weather data will be fed into an SDSC supercomputer to generate a model of the fire’s behavior. The system will be able to compute the progress of the flames faster than real time, providing advance warning to help firefighters decide how to deploy their resources most effectively. The system got its first test in May 2014 when, over a period of a few days, 11 fires raged across northern San Diego County.
The WiFire team, representing various UCSD labs, SDSC, Calit2, the Computer Science and Engineering department, the Mechanical and Aerospace Engineering department, and the High Performance Wireless Research and Education Network, envisions this testbed as a precursor to a national, and ultimately global, firefighting cyberinfrastructure.
Underscoring her publishing record, Altintas pointed to several recent publications. (A full citation is listed below the description of each paper’s content.) The first three reflect work to promote progress in the other three NBCR cores, respectively:
- Atomic-to-subcellular Simulation and Discovery (called “Core 1”)
- Tools to Assemble Virtual Whole Cells for Multi-use, Multi-modal Simulations (Core 2)
- Multi-scale Modeling: Subcellular to Organ Biophysics (Core 3)
Her paper on automated Kepler scientific workflows discusses development of workflows to support NBCR computer-aided drug discovery and molecular dynamics simulations. The workflows aim to standardize simulation and analysis, and promote best practices within these communities. Each component is developed as a stand-alone workflow, making it easy to integrate and extend into larger frameworks based on user needs.
Pek U. Ieong, Jesper Sørensen, Prasantha L. Vemu, Celia W. Wong, Özlem Demir, Nadya P. Williams, Jianwu Wang, Daniel Crawl, Robert V. Swift, Robert D. Malmstrom, Ilkay Altintas, and Rommie Amaro, Progress towards Automated Kepler Scientific Workflows for Computer-aided Drug Discovery and Molecular Simulations, Procedia Computer Science, Vol. 29, 2014, pp. 1745-1755, doi: 10.1016/j.procs.2014.05.159.
A second paper describes a Kepler workflow developed for Electron Tomography (ET) programs called EPiK. ET provides high-resolution images of complex cellular structures, such as cytoskeletons, organelles, viruses, and chromosomes, which typically lead to very large data sets. EPiK embeds a tracking process (called IMOD) and uses filtered backprojection (from a software tool called TxBR) and iterative reconstruction methods. The group tested a 3D reconstruction process using EPiK on ET data. EPiK, which offers logical viewing, easy handling, convenient data sharing, and extensibility as the workflow is developed further, can serve as a toolkit for biology researchers.
Ruijuan Chen, Xiaohua Wan, Ilkay Altintas, Jianwu Wang, Daniel Crawl, Sébastien Phan, Albert Lawrence, and Mark Ellisman, EPiK: A Workflow for Electron Tomography in Kepler, Procedia Computer Science, Vol. 29, 2014, pp. 2295-2305, doi: 10.1016/j.procs.2014.05.214.
Another paper describes how Kepler can provide a solution for bioinformatics pipelines as emerging genomics technologies produce massive amounts of data requiring complex analysis, analysis tools proliferate, and integrating those tools becomes increasingly difficult. In this paper, the team describes how they used Kepler to integrate several external tools, including Bioconductor packages, AltAnalyze (a Python-based open-source tool), and an R-based comparison tool, to build an automated workflow for meta-analyzing online and local microarray data.
Zhuohui Gan, Jennifer C. Stowe, Ilkay Altintas, Andrew D. McCulloch, and Alexander C. Zambon, Using Kepler for Tool Integration in Microarray Analysis Workflows, Procedia Computer Science, Vol. 29, 2014, pp. 2162-2167, doi: 10.1016/j.procs.2014.05.201.
Another paper presents an easy-to-use, scalable approach to build and execute big data applications using Kepler’s actor-oriented modeling in data-parallel computing. The work is based on two bioinformatics use cases focused on next-generation sequencing data analysis to verify the feasibility of the team’s approach.
J. Wang, D. Crawl, I. Altintas, and W. Li, Big Data Applications Using Workflows for Data Parallel Computing, Computing in Science & Engineering, Vol. PP, Issue 99, April 16, 2014, doi: 10.1109/MCSE.2014.50.
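The data-parallel pattern that work builds on can be sketched with Python’s standard library. This is a generic illustration, not the Kepler implementation, and the GC-counting step is a made-up example: split a large input into partitions, apply the same analysis step to each partition in parallel, then merge the partial results.

```python
# Generic sketch of data-parallel workflow execution: partition the input,
# run the same step on each partition concurrently, merge the results.
# An illustration of the pattern, not the actual Kepler/bioKepler code.
from concurrent.futures import ProcessPoolExecutor

def count_gc(sequence):
    """Per-partition step: count G/C bases in a DNA fragment."""
    return sum(1 for base in sequence if base in "GC")

def partition(data, n):
    """Split data into at most n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def run_parallel(sequence, workers=4):
    """Map the step over partitions in parallel, then merge (sum)."""
    chunks = partition(sequence, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_gc, chunks))
    return sum(partials)  # merge step

if __name__ == "__main__":
    print(run_parallel("ACGTGGCCTA" * 1000))  # 6000
```

In next-generation sequencing analysis the partitions are typically files of reads and the per-partition step is an external tool, but the split/apply/merge shape is the same.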
A final paper Altintas offered discusses how to design a Workflow-as-a-Service (WFaaS) architecture with independent services. The team’s design addresses how to efficiently respond to continuous workflow requests from users and schedule their execution in the cloud. They propose four heuristic workflow-scheduling algorithms for the WFaaS architecture and analyze the differences among them and the best ways to apply them in terms of performance, cost, and price/performance ratio.
Jianwu Wang, Prakashan Korambath, Ilkay Altintas, Jim Davis, and Daniel Crawl, Workflow as a Service in the Cloud: Architecture and Scheduling Algorithms, Procedia Computer Science, Vol. 29, 2014, pp. 546-556, doi: 10.1016/j.procs.2014.05.049.
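To give a feel for the class of heuristic involved (this is a generic illustration, not one of the paper’s four algorithms), a greedy scheduler can assign each incoming workflow request to whichever virtual machine becomes free first:

```python
# Greedy earliest-available-VM scheduling of workflow requests onto cloud
# VMs. A generic illustration of this class of heuristic, not one of the
# four algorithms proposed in the WFaaS paper.
import heapq

def schedule(requests, num_vms):
    """Assign each (name, runtime) request to the VM that frees up first.

    Returns a mapping of request name -> (vm id, start time, finish time).
    """
    # Priority queue of (time the VM becomes free, vm id).
    vms = [(0.0, vm_id) for vm_id in range(num_vms)]
    heapq.heapify(vms)
    plan = {}
    for name, runtime in requests:
        free_at, vm_id = heapq.heappop(vms)
        start, finish = free_at, free_at + runtime
        plan[name] = (vm_id, start, finish)
        heapq.heappush(vms, (finish, vm_id))
    return plan

plan = schedule([("wf-a", 4.0), ("wf-b", 2.0), ("wf-c", 1.0)], num_vms=2)
for name, (vm, start, finish) in sorted(plan.items()):
    print(f"{name}: VM{vm} runs {start}-{finish}")
```

The trade-offs the paper analyzes arise because cloud VMs also cost money: a scheduler can finish sooner by leasing more VMs, or spend less by queueing requests on fewer, and different heuristics strike that balance differently.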
Advising Industry on Big Data Applications
What’s the next step for Altintas? “We’re beginning to do big data science,” she says, “especially applications in machine learning and business intelligence. This interest of course broadens our focus beyond academia to include industry. Marketing, for example, is a demanding user of big data. There’s a gap there that we are confident we can address. We’re already working with industrial partners in drug design, biotech, and energy consumption to provide services to scale up their big data analysis applications to high-performance computers.”
Navonil Mustafee, Nik Bessis, Simon J. E. Taylor, and Stelios Sotiriadis, Exploring the e-Science Knowledge Base through Co-citation Analysis, Procedia Computer Science, Vol. 19, 2013, pp. 586-593, doi: 10.1016/j.procs.2013.06.078.