A Distributed Data Archival, Analysis and Visualization Cyberinfrastructure for Data-intensive Collaborative Research.
PetaShare Storage is Online!
PetaShare storage is now online and accepting allocation proposals.
The PetaShare project is motivated by the needs of the research activities pursued by collaborating scientists from various disciplines. Outlined below are examples of these research activities and the data requirements that have driven the development of this project.
Research in Molecular Biology (Winters-Hilt, Bishop): The research group of Dr. Stephen Winters-Hilt at the University of New Orleans (UNO) will utilize PetaShare to foster its research in biophysics and molecular biology. Nanopore-based single-molecule detection has recently become established as a new tool in single molecule biophysics. This has been made possible by two recent advances: (i) low-noise measurement of ionic currents through single, stable, nanoscopic channels, and (ii) the development of modern cheminformatics methods for channel current feature extraction and pattern recognition. Using a nanopore detector, it is possible to obtain at least Angstrom-level resolution at the termini of duplex DNA molecules.
Preliminary results indicate similar analyses are possible with single antibody molecules. Furthermore, a clear transition in the channel current blockade signal is observed in the nanopore detector upon binding and dissociation of antigen to the captured antibody. It is hypothesized that nanopore-based detection can be used to study single-molecule dynamics of the antibody-antigen interaction and to analyze conformational changes that occur in the antibody upon binding to antigen. If confirmed, this hypothesis has numerous ramifications for understanding antibody interactions, rapid immunological screening, and nanopore/antibody-based biosensing in general. For biosensing and immunological screening, both protein-based and DNA-based molecules (aptamers) are being studied on the detector.
The UNO cheminformatics program alone generates several terabytes of channel current measurements per month; this estimate includes only raw data, not the computational overlay of annotation, and it continues to grow. Likewise, the UNO bioinformatics program focuses on gene structure analysis, and the data used include terabytes of genomic, EST, and other DNA coding information entries in GenBank (the NIH genetic sequence database), and protein entries in the Swiss-Prot protein knowledgebase. Rapid access to a high-capacity distributed storage system will allow access to and exchange of the large amounts of data that are critical to such efforts.
The UNO nanopore detector cheminformatics program generates and analyzes terabytes of data monthly, and as detector facilities are doubled in the upcoming months, the data management problems are expected to become more acute. The UNO cheminformatics program also performs channel current analysis for groups at the University of California - Santa Cruz (UCSC) and Harvard University that work only on the detector biophysics side of the problem. In the past year alone, UNO performed channel current analysis on terabytes of offsite data for these groups.
Given the sheer amount of data generated, data sharing is critical to the growth of research collaborations in the cheminformatics and bioinformatics fields. For the latter, many efforts are underway at UNO, one of which is to perform ab initio, purely statistically-based gene structure analysis -- a complete and up-to-date statistical analysis of genomic, EST, and other DNA coding information entries in GenBank, and of protein entries in Swiss-Prot, which likewise entails terabytes of data storage. Currently, this group does not have an efficient data sharing solution for the terascale datasets they are dealing with. PetaShare will greatly improve the efficiency of their research activities. It will make data sharing and collaboration with other sites easier, and it will automate the management of offline data, decreasing the need for human involvement in low-level data handling tasks.
Bishop: Dr. Bishop's molecular biology research group at Tulane University studies the structure and dynamics of nucleosomes using all atom molecular dynamics simulations. Nucleosomes are the molecular complexes that fold DNA into chromatin in eukaryotic cells and are therefore intimately related to the genetic processes of transcription, regulation, replication and repair. A typical all-atom simulation of the nucleosome includes approximately 150,000 atoms and represents 30-50 ns of molecular dynamics.
Dr. Bishop's group is developing theoretical and computational methods to complement the all-atom simulations by providing a coarse grain description of DNA. The complete tool set to be developed will allow researchers to A) analyze entire genomes for nucleosome stability, B) interactively investigate the 3D structure of millions of basepairs given nucleosome occupancy estimates determined from the stability analysis (Boltzmann probability distribution), and C) construct atomic models of individual nucleosomes. The goal is to provide a means of rapidly assessing nucleosome stability across entire genomes, since nucleosomes control DNA folding and accessibility within eukaryotic cells, and then to investigate sequences of interest in atomic detail. The genetic sequences currently under investigation include promoter and encoding sequences for the nuclear hormone receptors and transposable elements (ALUs), which constitute approximately 10% of the human genome.
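The stability analysis in step B rests on a Boltzmann probability distribution over candidate nucleosome positions. As a minimal illustrative sketch (the energy values, units, and function name here are hypothetical, not taken from the group's actual tool set), occupancy estimates can be derived from per-position formation energies like so:

```python
import math

def boltzmann_occupancy(energies, kT=1.0):
    """Convert per-position nucleosome formation energies into
    occupancy probabilities via a Boltzmann distribution.
    Lower energy -> more stable -> higher occupancy."""
    weights = [math.exp(-e / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical formation energies (arbitrary units) at four candidate positions
occ = boltzmann_occupancy([0.0, 1.0, 2.0, 0.5])
# The probabilities sum to 1; the lowest-energy position is the most occupied
```

In practice such a scan would run over every basepair offset of an entire genome, which is what makes fast storage and retrieval of the underlying sequence data important.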
The all-atom simulations performed by Dr. Bishop's group are large simulations by current standards and require significant computational and data storage resources, as well as coordinated data access and processing techniques. Each simulation will require approximately 3 weeks of run time on a 24-node Linux cluster and approximately 50-100 GB for trajectory storage and data analysis. A series of such simulations will be conducted during the next 5 years to investigate nucleosome positioning and stability as a function of DNA sequence and will yield around 5 TB of data.
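The figures above imply the scale of the planned simulation series. A back-of-the-envelope sketch, using only the per-simulation and total estimates stated in the text:

```python
# Storage estimates from the project description
per_sim_gb_low, per_sim_gb_high = 50, 100   # trajectory + analysis per simulation
total_tb = 5                                # planned total over 5 years
total_gb = total_tb * 1000

# Implied number of simulations in the series
n_sims_max = total_gb // per_sim_gb_low     # 100 simulations at 50 GB each
n_sims_min = total_gb // per_sim_gb_high    # 50 simulations at 100 GB each
```

At roughly 3 weeks of run time each, a series of this size keeps the cluster busy for most of the 5-year period, which is why the trajectories must remain accessible long after each run completes.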
PetaShare will provide Dr. Bishop's group with support for data storage and retrieval so that the data can be rapidly analyzed and disseminated. As with all molecular dynamics studies of biomolecular systems, visualization is an essential component of the analysis process and requires rapid, interactive access to the molecular dynamics trajectories. This LONI-powered distributed instrumentation will be an essential infrastructure for remote visualization of the large datasets used in these research activities.
Research in Coastal Studies (Walker, Levitan, Mashriqui, Twilley): The Earth Scan Laboratory led by Dr. Nan Walker is a fully operational satellite receiving station managed and operated by faculty, staff, and students of the Coastal Studies Institute and the Department of Oceanography and Coastal Sciences at Louisiana State University. With its three antennas, it captures approximately 40 GB of data from six unique satellite sensor systems in Earth orbit each day. The data captured have spatial resolutions ranging from 4 km (GOES GVAR) to 250 m (Terra/Aqua MODIS) to 12 m per pixel (Radarsat-1 SAR) and repeat coverage from minutes to days.
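The archival implication of this capture rate is easy to quantify. A sketch using only the 40 GB/day figure above (the multi-year horizon is an illustrative assumption):

```python
daily_gb = 40                        # approximate daily capture across six sensors
yearly_tb = daily_gb * 365 / 1000    # ~14.6 TB of raw satellite data per year
five_year_tb = yearly_tb * 5         # ~73 TB for a five-year time-series archive
```

Against roughly one week of online retention per sensor today, sustaining a multi-year, multi-sensor archive of this size is precisely the gap the proposed storage system targets.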
The GOES GVAR data are particularly useful for hurricane track and intensity research and forecasting, as they enable surveillance of hurricane movement, the Gulf of Mexico Loop Current, atmospheric moisture and temperature, and oceanic sea surface temperature changes on short time-scales. Data updates are obtained as frequently as every 3-5 minutes during hurricane events and every 15 minutes normally. These data are used in real-time for hurricane tracking (in collaboration with Louisiana state officials) and for LSU hurricane research led by Dr. Marc Levitan and Dr. Hassan Mashriqui; and for wetland and environmental studies led by Dr. Robert Twilley. Ready access to the GOES data in real-time has enabled the development of new techniques for enhancing the prediction of hurricane movement and along-track intensity changes. The LSU satellite measurements have also been used for the surveillance and study of hurricane-related flooding, coastal circulation, harmful algal blooms, river impacts and hypoxia.
While the satellite data from the Gulf Coast used in Dr. Walker's research is rich in information, its full utilization is often stymied by the sheer volume of data involved. Of the approximately 40 GB of new data acquired each day, only a portion is permanently archived and on-line storage is time limited to about 1 week for most satellite sensors. Current capabilities do not even allow for on-line access of GOES data from a single hurricane. The proposed new distributed storage system would provide for essential improvements in online storage of the satellite measurements from the GOES GVAR and other sensors, providing ready access for short-term process studies and for multi-year studies pertinent to hurricane research. Future research will require integrating the GOES data with data from other satellite sensors (e.g. altimeters, microwave radiometers), in-situ measurements, and selected measurements from atmospheric and oceanic models. The proposed system would facilitate this integration of data and exploitation of new simulation and visualization tools for embarking on new avenues of hurricane research. Other hurricane and coastal research projects would be similarly enhanced by the additional on-line storage and ready access to satellite measurements for time-series studies of environmental change.
The radar data from Radarsat-1 SAR and multi-spectral data from MODIS and SPOT have proven essential for mapping coastal and urban flooding during hurricane events and were particularly valuable in the case of Hurricane Katrina, when flooding was abnormally long-lived. Although the LSU Earth Scan Lab has the capability to capture SAR data, the lack of a distributed data archival, analysis, and visualization system like PetaShare has inhibited the development of tools to process and analyze these data in near real-time for emergency response needs. The group envisages that PetaShare will alleviate many of the existing barriers to effective utilization of these valuable and unique satellite datasets. The PetaShare system will be particularly important for research in storm surge modeling and hurricane track prediction. This will have a direct impact on the states threatened by hurricanes, especially Louisiana, since it will provide the necessary instrumentation for faster and more efficient hurricane track and storm surge simulations with less human interaction. Considering that an evacuation alone at the scale of recent hurricanes costs close to one billion dollars, the importance of higher accuracy and efficiency in these simulations is clear. PetaShare will help to improve forecast and evacuation accuracy by enabling efficient management of data and easy access to data resources.
Research in Petroleum Engineering (White, Allen, Kosar): Dr. Chris White, Dr. Gabrielle Allen and Dr. Tevfik Kosar, together with 15 other researchers at LSU, the University of Louisiana at Lafayette, and Southern University of Baton Rouge, aim to develop and deploy a Ubiquitous Computing and Monitoring System (UCoMS) for oil/gas exploitation and management. As a nationally unique research cluster in IT for energy, the UCoMS project addresses key research issues to arrive at appropriate technical solutions in the areas of wireless networked systems, grid computing, and application software. It includes three cohesive, interrelated project areas, which together enable the construction of a useful UCoMS prototype for discovery and management of energy resources. The UCoMS proof-of-concept prototype will be developed and deployed utilizing several existing (possibly decommissioned) well platforms in the Gulf of Mexico. The technical solutions will be generally applicable to sensor networks, wireless communications, and grid computing. These solutions will effectively facilitate drilling and operational data logging and processing, on-platform information distribution and display, infrastructure monitoring and intrusion detection, seismic processing and inversion, and management of complex surface facilities and pipelines. Decommissioned well platforms can be monitored and safeguarded using UCoMS, with the potential to foster new industries in the future.
Large-scale data storage and manipulation are involved in reservoir simulation and analysis. In a reservoir uncertainty analysis process considering eleven geologic factors and three engineering factors, the number of simulation runs for a full factorial design would be 4^6 x 3^8 = 26,873,856 if six of the 14 factors have four levels and the remaining eight have three levels. The average result dataset of a single reservoir simulation is up to 50 megabytes, so the total result data size would be approximately 1,343,693 gigabytes (about 1.3 petabytes). In order to obtain high performance computing capability, these simulations would be executed on a Grid, which means large-scale data transfer would be involved in a geographically distributed environment. In addition, the size of the geological and geophysical (G&G) data and well logging data, which are critical for reservoir modeling, runs to terabytes, even petabytes. The existing instrumentation cannot meet such large-scale data storage and manipulation requirements.
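The run count and total data volume follow directly from the factorial design. A sketch reproducing the arithmetic (MB to GB is taken as a factor of 1,000):

```python
# Full factorial design over 14 uncertainty factors:
# six factors at four levels, eight factors at three levels
runs = 4**6 * 3**8                       # 26,873,856 simulation runs
avg_result_mb = 50                       # average result dataset per run

total_gb = runs * avg_result_mb / 1000   # ~1.34 million GB, i.e. ~1.3 PB
```

Even heavily fractionated designs over these factors would still generate terabytes of results, so the storage requirement is robust to the exact experimental design chosen.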
Meanwhile, drilling operations and real-time monitoring are also data-intensive. During drilling, multiple gigabytes of data are produced even in a short period. Petroleum engineers view and analyze the drilling process through these data, and then make decisions in real time. Therefore, it is a challenging issue to determine how to store, share, retrieve, analyze, and visualize these streaming data. Existing data management systems have great difficulty handling this scenario, which requires large-scale data storage and high-performance data transfer. A system like PetaShare is urgently needed to further research in the area of petroleum engineering.
Research in Computational Cardiac Electrophysiology (Trayanova): Dr. Natalia Trayanova's research group at Johns Hopkins University is working on the development of cutting-edge computational tools and simulation-experiment approaches in order to advance the understanding of the fundamental mechanisms that underlie rhythm disorders in the heart and to uncover better strategies for prevention and treatment of these disorders. The research conducted by her group includes: cell and tissue responses to electric shocks; structural determinants of shock-induced electroporation in the ventricles; effects of decreasing LV postshock excitable gap on the upper limits of cardiac vulnerability; arrhythmogenesis in acute regional ischemia; and elucidating the role of mechano-electric feedback (MEF) in arrhythmogenesis.
Dr. Trayanova's cardiac electrophysiology research generates large amounts of digital data. While large hard drives have historically been sufficient for storage of these data, the rate of data production is beginning to outpace that of increases in available hard disk size. Two years ago, the lab built a large file server containing nine hard disks in an attempt to alleviate their storage problems. This server was quickly filled, and it is now necessary to offload data not needed for immediate analysis to removable hard drives. This is only the beginning of the lab's storage problems.
As the models become more realistic, the simulations produce geometrically larger data sets. There are three specific types of realism involved. The first, and most problematic, is three-dimensional representation of cardiac tissue. The trabeculae, thin strands of web-like muscle tissue inside the cavities of the heart, are suspected of causing arrhythmic activity, especially during strong shocks. Accurate and useful modeling of these structures requires high-resolution meshes. The data resulting from simulations with a given mesh increase in size as the cube of the linear resolution. Therefore, inclusion of these important structures produces a massive concomitant increase in data set size. The second type of realism is electrophysiological heterogeneity within the myocardial tissue. New developments in the study of regional disease, and of normal heterogeneities within the cardiac tissue, have led to more realistic computational models of the heart. A primary use of these models is for comparison - simulations with and without the newly-modeled heterogeneities. Such studies can double to quintuple the number of simulations that would normally be run, increasing the data set size in the same fashion. The third type of realism arises when numerical approximation is used to integrate differential equations. In this case, the accuracy of the simulations can depend on the size of the time steps used. This alone is not sufficient to require more storage space. However, in some cases, such as in the consideration of magnetic fields, it becomes necessary to output data at a higher temporal resolution. Typically this results in a data set size increase of an order of magnitude.
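These three effects compound multiplicatively. A sketch with hypothetical factors (the base numbers below are illustrative choices, not measurements from the lab):

```python
def dataset_scale(resolution_factor, heterogeneity_runs=1, temporal_factor=1):
    """Multiplicative growth of a simulation data set:
    - spatial data grow as the cube of the linear mesh resolution,
    - heterogeneity comparisons multiply the number of runs,
    - finer output intervals multiply the stored time steps."""
    return resolution_factor**3 * heterogeneity_runs * temporal_factor

# Doubling linear resolution alone gives an 8x larger data set:
spatial_only = dataset_scale(2)

# Doubling resolution, quintupling the runs for a heterogeneity comparison,
# and outputting 10x more time steps for magnetic-field studies:
combined = dataset_scale(2, heterogeneity_runs=5, temporal_factor=10)  # 400x
```

A few hundred-fold growth from a handful of modest refinements is exactly why adding local disks cannot keep pace with the group's modeling program.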
While the group is currently progressing well on these projects, they face an imminent ceiling of data storage and data retrieval, not easily surmountable by the simple addition of storage devices to their facilities. The data sets to be produced by their future studies range from 100 to 500 GB in size. Without access to a next generation data access, retrieval and management system like PetaShare, their research activities in cardiac electrophysiology face a substantial hindrance.
Research in Computational Fluid Dynamics (Acharya): The Computational Fluid Dynamics (CFD) group at LSU, led by Dr. Acharya, is focused on simulation of turbulent flows including Direct Numerical Simulations (DNS), Large Eddy Simulations (LES), and Reynolds-Averaged Navier Stokes Simulations (RANS). In DNS, the goal is to resolve all relevant scales in the flow down to the dissipative scales, and therefore no models are required. To resolve the small scales, mesh resolution requirements are extreme, and computations are very expensive. In LES, inertial-range scales are resolved and the small scales are modeled, while in RANS all scales are modeled and mesh resolution requirements are modest. As an example, for a simulation of a jet in crossflow at a modest Reynolds number of about 5,000, a mesh on the order of 100 million nodes and over a week of computation on a 256-processor cluster may be required. In terms of storage, the disk space requirements for DNS can be extensive. This is primarily a result of the fourth dimension (time) in these problems.
In order to understand the dynamic behavior of the turbulent flows modeled by DNS, many instances (on the order of ~1,000-10,000) of the flow field must be stored for subsequent analysis. Each instance of the flow field may contain 1.5x10^8 discrete variables, and storing even 2,000 instances (using single precision real numbers) requires 1.2 TB of storage. With the in-house code Tetra, great care is taken to ensure that the data are efficiently stored. The data are stored as binary with very little redundancy; the compression ratio using gzip is only a few percent. The 1.2 TB estimate is for a medium-size simulation only; currently work is in progress on a simulation eight times larger. Due to the limitations in the current instrumentation used for data storage and access in DNS, and due to the size of the data involved, the available datasets are not being processed efficiently, impeding the group's research activities. The proposed PetaShare system will therefore greatly alleviate the current constraints.
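The 1.2 TB figure follows directly from the numbers stated above (single-precision reals are 4 bytes):

```python
vars_per_instance = 1.5e8     # discrete variables per stored flow field
instances = 2000              # snapshots stored for dynamic analysis
bytes_per_var = 4             # single-precision real

total_bytes = vars_per_instance * instances * bytes_per_var
total_tb = total_bytes / 1e12   # 1.2 TB for a medium-size simulation
```

Scaling the same arithmetic by the factor of eight mentioned for the in-progress simulation gives roughly 10 TB per run, before any ensemble or parameter studies.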
Research in Synchrotron X-ray Microtomography (Willson, Butler): Flow and transport of immiscible fluids through porous media systems are directly related to the geometry and the topology of the pore space as well as the physical and chemical properties of the media and fluids. Dr. Clint Willson and Dr. Les Butler at LSU utilize synchrotron X-ray microtomography to obtain high-resolution, three-dimensional images of various porous media systems. Their early work focused on techniques for quantifying the pore network structure from tomography images of unconsolidated porous media systems; i.e., extracting a physically-representative mapping of the pore space. This work was then extended to include the mapping and characterization of one or more fluid phases within the pore space. This enabled them to correlate the properties and distribution of the fluids with the pore structure providing insight into the physical processes impacting multiphase flow.
More recently, the group's algorithms have been extended so that they can characterize the granular structure as well as the pore structure. The ability to non-destructively evaluate the packing properties of granular materials provides additional research opportunities in material science and other areas. These algorithms have been applied to additional multiphase porous media systems similar to those mentioned previously as well as to a series of sandstone cores collected from an outcrop in Wyoming. The group has also developed collaborative research with faculty from the University of Delaware who are interested in the properties and distribution of water in unsaturated, chemically-heterogeneous porous media. Many different research groups are beginning to use the upgraded tomography beamline. They need to remotely access these large datasets and visualize them at their local workstations, which increases the complexity of the problem.
The proposed PetaShare system will address several obstacles in this research area. It will enable archiving of raw data and hosting of reconstructed volumes for inspection and analysis by the users of the tomography beamline. It is crucial that reconstructed volumes be hosted on a fast-access, short-term storage system. PetaShare will provide dynamic and transparent placement of data from tertiary to secondary storage, and from there to very high performance RAM storage. When collaborators at remote sites wish to visualize the data remotely, it will be read directly from the distributed RAM storage instead of from disk. This will significantly increase visual analysis performance.
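The hierarchical placement described here (tape to disk to distributed RAM) behaves like a promote-on-access cache. A minimal sketch of the idea (the tier names and the API are hypothetical illustrations, not PetaShare's actual interface):

```python
# Storage tiers ordered from slowest to fastest
TIERS = ["tertiary", "secondary", "ram"]

class TieredStore:
    def __init__(self):
        # dataset name -> index into TIERS
        self.location = {}

    def archive(self, name):
        """Newly acquired raw data lands on tertiary (tape) storage."""
        self.location[name] = 0

    def access(self, name):
        """Each access promotes the dataset one tier toward RAM,
        so actively visualized data end up served from memory."""
        tier = self.location[name]
        if tier < len(TIERS) - 1:
            self.location[name] = tier + 1
        return TIERS[self.location[name]]

store = TieredStore()
store.archive("tomogram_042")          # hypothetical dataset name
store.access("tomogram_042")           # promoted to "secondary"
tier = store.access("tomogram_042")    # promoted to "ram" for visualization
```

A real system would also demote cold data and account for tier capacities, but the transparency to the user is the key property: the same access call works regardless of where the data currently reside.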
Research in High Energy Physics (Greenwood): Dr. Zeno Greenwood's research group at Louisiana Tech University is planning to further their research in particle physics using the proposed system. Particle physicists employ high-energy particle accelerators and complex detectors to probe ever-smaller distance scales in an attempt to understand the nature and origin of the universe. The Tevatron proton-antiproton collider, located at Fermi National Accelerator Laboratory in Batavia, Illinois, currently operates at the "energy frontier" and has the opportunity to provide new discoveries through the D0 and CDF high energy physics (HEP) experiments. Future experiments, such as ATLAS at the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) and a proposed electron-positron linear collider, have the prospect of either building on discoveries made at Fermilab, or making the discoveries themselves if nature has placed these new processes beyond the energy reach of the Tevatron. The objects detected in these experiments typically decay into complex final states and are very rarely produced. Selecting these few complicated interesting events from the enormous volume of data requires a detailed understanding of the detector and physics that can only be quantified by the development of sophisticated analysis algorithms and generation of large sets of detailed Monte Carlo (MC) simulations. Given the small production probabilities of these interesting events, expediting searches for such processes and other critical analyses will require fast, efficient, and transparent delivery of large data sets, together with an efficient resource management and software distribution system.
In addition to problems associated with the magnitude of the data sets, the situation is further complicated by the distribution throughout the world of the collaborating institutions in these experiments. To set the scale, the D0 experiment has 650 collaborators from 78 institutions in 18 countries. The broad geographical distribution of the collaborators, combined with several petabytes of data, poses serious difficulties in data sharing and optimal use of human and computing resources. These problems not only affect high energy physics experiments, but are also shared by other scientific disciplines that generate large volume data sets, such as biology (genomics), meteorology (forecasting), and medical research (3D-imaging). These challenges have been documented by the NSF sponsored GriPhyN initiative and its follow-up iVDGL project, and the DOE sponsored PPDG project. The DOE and NSF efforts have combined to study the computing needs of U.S. physics experiments through the Trillium Collaboration. Ideas and software from these efforts are being incorporated into experiments, such as D0 and ATLAS, to manage and provide access to their petabytes of data.
As part of the D0 experiment's effort to utilize grid technology for the expeditious analysis of data, nine universities have formed a regional virtual organization (VO), the Distributed Organization for Scientific and Academic Research (DOSAR). One of the founding members of DOSAR is Louisiana Tech University (LaTech). In order to pursue full utilization of grid concepts, the LaTech group, along with five other DOSAR universities, is establishing an operational regional grid called DOSAR-Grid using all available resources, including personal desktop computers and large dedicated computer clusters. They will initially construct and operate DOSAR-Grid utilizing a framework called SAM-Grid being developed at Fermilab. DOSAR-Grid will subsequently be made interoperable with other grid frameworks such as LCG, TeraGrid, Grid3 and the Open Science Grid.
Dr. Greenwood's group plans to develop and implement tools to support easy and efficient user access to the grid and to ensure its robustness. Tools to transfer binary executables and libraries efficiently across the grid for environment-independent execution will be developed. DOSAR will implement the grid for critical physics data analyses while at the same time stress-testing grid computing concepts at their true conceptual level, down to personal desktop computers. The LaTech group is planning to join the LHC ATLAS experiment and a future experiment under study by the American Linear Collider Physics Group, and will apply the experience gained from DOSAR to these projects.
While data storage facilities exist for DOSAR work in Texas and Oklahoma, sufficient data storage capacity, as well as the necessary instrumentation to efficiently access, retrieve, and analyze data, is sorely lacking in Louisiana. The proposed PetaShare system will greatly improve the capability of Louisiana institutions to contribute to the development of grid computing that is so necessary to present and future high energy physics.