A Distributed Data Archival, Analysis and Visualization Cyberinfrastructure for Data-intensive Collaborative Research.
PetaShare Storage is Online!
PetaShare storage is now online and accepting allocation proposals.
CS Research and Development
The PetaShare development team involves researchers with expertise in distributed data handling and storage, grid computing, high performance data mining and database systems, and visualization. The expertise of these researchers in their respective areas will help lead the PetaShare project to success. In turn, they will be able to use PetaShare to advance their own research.
Research in Distributed Data Handling (Kosar): The PI, Dr. Tevfik Kosar, and his group have extensive experience in developing the building blocks for solving the distributed data handling problem. They introduced the concept that data placement efforts, which have so far been carried out either manually or with simple scripts, should be regarded as first-class citizens, and that these tasks should be automated and standardized just like the systems that support and administer computational jobs. Data-intensive jobs need to be queued, scheduled, monitored, and managed in a fault-tolerant manner. The PI and his group have designed and implemented the first prototype batch scheduler specialized in data placement: Stork. Stork provides a level of abstraction between user applications and the underlying data transfer protocols, and it allows queuing, scheduling, and optimization of data placement jobs. The group has also introduced several possible scheduling strategies for data placement.
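The idea of queuing data placement jobs and managing them in a fault-tolerant manner can be illustrated with a minimal sketch. This is not Stork's interface or implementation: `PlacementJob`, `TransientError`, `run_queue`, and the `transfer` callback are hypothetical names chosen for illustration, and the policy shown (first-in, first-out with a fixed retry limit) is only one simple assumption among many possible scheduling strategies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PlacementJob:
    """A hypothetical data placement request: move data from src to dest."""
    src: str
    dest: str
    attempts: int = 0


class TransientError(Exception):
    """A failure that is worth retrying, such as a timeout."""


def run_queue(jobs, transfer, max_attempts=3):
    """Run jobs in FIFO order and requeue those that fail transiently.

    `transfer(src, dest)` is supplied by the caller and hides the actual
    transfer protocol from the queue logic.
    """
    queue = deque(jobs)
    done, failed = [], []
    while queue:
        job = queue.popleft()
        job.attempts += 1
        try:
            transfer(job.src, job.dest)
            done.append(job)
        except TransientError:
            if job.attempts < max_attempts:
                queue.append(job)   # retry later, behind the other jobs
            else:
                failed.append(job)  # give up after max_attempts tries
    return done, failed
```

Because the queue only sees the `transfer` callback, the same loop could drive different protocols, which loosely mirrors the abstraction layer described above.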
Kosar's group has designed and implemented a complete data placement subsystem for distributed computing systems, similar to the I/O subsystem in operating systems. This subsystem includes the specialized scheduler for data placement (Stork), a higher-level planner aware of data placement jobs, a knowledgebase that can extract useful information from history logs, a failure detection and classification mechanism, and some runtime optimization tools. The subsystem provides complete reliability, a level of abstraction between errors and users/applications, the ability to balance load on the storage servers, and the ability to control the traffic on network links. Through several case studies, the group has shown the applicability and contributions of the subsystem. The group has also shown a method to dynamically adapt data placement jobs to the environment at execution time. It has designed and developed a set of disk, memory, and network profiling, monitoring, and tuning tools that can provide optimal values for I/O block size, TCP buffer size, and the number of TCP streams for data transfers. These values are generated dynamically and provided to the higher-level data placement scheduler, which can use them to adapt the data transfers at run time to existing environmental conditions. The group has also provided dynamic protocol selection and alternate protocol fall-back capabilities to deliver superior performance and fault tolerance.
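As a rough illustration of how a TCP buffer size and a stream count might be derived from measured network conditions, the sketch below uses the standard bandwidth-delay product rule of thumb. It does not describe the group's actual tools: the function names are hypothetical, and deriving the stream count by dividing the bandwidth-delay product by a per-socket buffer cap is a simplifying assumption.

```python
def tcp_buffer_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bandwidth-delay product in bytes: the amount of data that must be
    in flight to keep a link of the given bandwidth and round-trip time full.
    Integer arithmetic: bits/s * ms / (8 bits per byte * 1000 ms per s)."""
    return bandwidth_bits_per_s * rtt_ms // 8000


def streams_needed(bdp_bytes, max_buffer_bytes):
    """Number of parallel streams needed when a single socket buffer is
    capped below the bandwidth-delay product (simplified heuristic)."""
    return max(1, -(-bdp_bytes // max_buffer_bytes))  # ceiling division


# Example: a 1 Gbit/s path with a 50 ms round-trip time.
bdp = tcp_buffer_bytes(1_000_000_000, 50)       # 6_250_000 bytes
streams = streams_needed(bdp, 4 * 1024 * 1024)  # 2 streams with a 4 MiB cap
```

In practice such values would be re-measured as conditions change, which is the point of generating them dynamically rather than fixing them in advance.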
Kosar and his group have introduced the concept of a Grid knowledgebase to keep track of job performance and failure characteristics on different resources, as observed from the client side. The Grid knowledgebase helps users classify and characterize jobs by collecting job execution time statistics. It also enables users to easily detect and avoid black holes in distributed systems, helps them identify misconfigured or faulty machines, and aids in tracking buggy applications. Kosar's group has also designed and implemented a fault-tolerant middleware layer that transparently makes data-intensive applications in an opportunistic environment fault-tolerant. Its distinctive features include detecting hung transfers and misbehaving machines, classifying failures as permanent or transient, and devising a suitable strategy for handling transient failures that takes user-specified policy into account. This middleware also addresses the information-loss problem associated with building error handling into lower layers, and it allows sophisticated applications to use this information for tuning.
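One way client-side job records could be used to flag suspected black holes (machines that accept jobs but fail them almost immediately) is sketched below. This is an assumption-laden illustration, not the Grid knowledgebase itself: the record layout `(machine, runtime_s, succeeded)`, the function name `find_black_holes`, and all thresholds are hypothetical.

```python
from collections import defaultdict


def find_black_holes(records, min_jobs=5, max_fail_runtime_s=60.0,
                     min_fail_rate=0.9):
    """Return machines on which nearly all jobs failed quickly.

    `records` is an iterable of (machine, runtime_s, succeeded) tuples
    as observed by the client. All thresholds are illustrative.
    """
    stats = defaultdict(lambda: [0, 0])  # machine -> [jobs, quick failures]
    for machine, runtime_s, succeeded in records:
        stats[machine][0] += 1
        if not succeeded and runtime_s <= max_fail_runtime_s:
            stats[machine][1] += 1
    return sorted(
        machine
        for machine, (jobs, quick_failures) in stats.items()
        if jobs >= min_jobs and quick_failures / jobs >= min_fail_rate
    )
```

Requiring a minimum number of observed jobs avoids flagging a machine on the basis of one or two unlucky runs.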
Research in Grid Computing (Allen, Kosar): Dr. Gabrielle Allen and Dr. Kosar are involved in many active Grid projects. The NSF-funded EnLIGHTened Computing project is developing advanced software and middleware that will enable a new generation of scientific applications to be aware of their network, Grid environment, and capabilities, and to make dynamic, adaptive, and optimized use of the networks connecting various high-end resources. The project is developing software that provides vertical integration from the application down to the optical control plane, and it is researching and developing software and middleware that enable dynamic and interactive applications to access and allocate network bandwidth. The EnLIGHTened project is building an optical testbed connecting LSU, Starlight, MCNC, and Caltech. This will enable the deployment and testing of services that provide on-demand, dynamic, end-to-end, high-speed optical networking connections for scientific applications. PetaShare will leverage the middleware and technologies developed by the EnLIGHTened project in order to make better use of high-speed networks such as LONI and NLR.
The NSF-supported Southeastern Coastal Ocean Observing and Prediction (SCOOP) project is being developed to archive and process coastal ocean observing and prediction data. As part of the SCOOP project, LSU operates a 1 TB archive, with an additional 7 TB of off-site storage at the San Diego Supercomputer Center (SDSC) DataCenter in San Diego. The goal of the archive is to store data from operationally run wind and surge models, as well as observational data from sources such as buoys in the Gulf of Mexico. The group is developing state-of-the-art co-scheduling techniques that bring computational, data, and network resources together for use by the coastal modeling community.
The group has also taken a leading role in other NSF supported projects such as GridChem which aims to build a computational chemistry grid, and DynaCode: a general Dynamic Data Driven Application Systems (DDDAS) framework with coast and environment modeling applications.
Research in Visualization (Karki): Dr. Bijaya Karki's research group at Louisiana State University has been working on the development of a high-end computational/visualization framework for the investigation of fundamental materials problems. His group performs large-scale simulations on massively parallel machines (the Superhelix and Supermike supercomputers at LSU) to study a wide range of physico-chemical properties of solid and liquid materials, primarily those with direct geophysical implications. These simulations result in massive multivariate and time-dependent data sets. Atomistic molecular dynamics (MD) modeling deals with systems comprising several thousand to several million atoms, resulting in gigabytes of data per MD step. The data are often collected over pico- to nanosecond periods, which means several thousand MD steps. Moreover, these data represent scalar (e.g., temperature, charge distribution), vector (atomic displacement), and tensor (stress and strain variables) fields. On the other hand, quantum mechanical calculations on systems containing a few tens to a few hundreds of atoms typically produce electronic charge densities, velocity gradients, and associated tensorial masses on a finite regular 3D mesh (e.g., a 500 x 500 x 500 grid).
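To give a sense of scale for the regular-mesh example, the short calculation below estimates the storage for one field on a 500 x 500 x 500 grid. It assumes 8-byte double-precision values, which the text does not specify; `grid_bytes` is a hypothetical helper.

```python
def grid_bytes(nx, ny, nz, components=1, bytes_per_value=8):
    """Storage in bytes for one field sampled on an nx x ny x nz mesh.

    `components` is 1 for a scalar, 3 for a vector, and 6 for a
    symmetric second-rank tensor. 8-byte values are an assumption.
    """
    return nx * ny * nz * components * bytes_per_value


scalar_field = grid_bytes(500, 500, 500)                # 1_000_000_000 bytes
tensor_field = grid_bytes(500, 500, 500, components=6)  # 6_000_000_000 bytes
```

Under these assumptions a single scalar field already occupies about 1 GB, and a symmetric tensor field about 6 GB, before any time dependence is considered.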
Karki's group deals with massive datasets that are time-dependent (dynamic), irregular, and multivariate in nature. Visualization provides an efficient and effective means of gaining insight into such datasets. The group adopts an application-based approach to visualization in order to meet domain-specific needs and to justify the effectiveness of the visualization methods for the particular application. Their current visualization activities are: a) Space-time multiresolution algorithm: Given a dataset of high accuracy, how can one extract as much information as possible about the spatio-temporal behavior of the data? They have proposed a scheme to support interactive visualization of atomistic simulation data at space-time multiresolution. They have adopted two perspectives: processing the complete or nearly complete data, and generating additional data on the fly for local details using a combined graph-theoretic and statistical approach. b) Multiple dataset visualization: Simultaneous display of multiple datasets, representing multiple samples, multiple conditions, or multiple simulation times, in the same visualization is required to explore important relationships or differences among the data. Such multiple dataset visualization (MDV) has to handle and render massive amounts of data concurrently.
Karki's group has implemented MDV using two widely used visualization techniques, namely isosurface extraction and texture-based rendering. To improve MDV interactivity, they have also proposed a technique based on a combination of hardware-assisted texture mapping and general clipping. They have applied clipping in combination with multiresolution rendering to visualize large-scale 3D data. The algorithm performs all clipping operations in low-resolution mode to achieve an interactive frame rate, and it renders the best-view position in high-resolution mode. c) Remote and collaborative visualization: Because both the data and the interested researchers are distributed, it is important to support remote and collaborative visualization over the network. The group has taken the initiative to remotely visualize and analyze large collections of elasticity data within a client-server framework. This allows them to interactively visualize multivariate elastic moduli (i.e., elastic constant tensors) and elastic wave propagation in an anisotropic crystal under the influence of pressure, temperature, and composition. An option for clients to deposit data online is supported in order to expand the elasticity database.
Simulations of ever larger atomic and electronic systems are now being performed using parallel and distributed computing environments. The resulting datasets pose tremendous challenges to the development of local and remote visualization tools. Representation and interaction are the two major issues. The first is how to render the various aspects of the data, for example, the simultaneous display of multiple variables in 3D and stereoscopic environments for specific services such as rapid navigation through the data, emphasis on data features, and realistic rendering of data. The second is how to interact with the system to explore the data from different perspectives during the visualization process, for example, through a real-time walk through the massive datasets. Achieving a real-time rendering speed of a few frames per second requires preprocessing the data before they are sent to the graphics hardware for rendering. Parallel and distributed schemes for visibility culling (removing data that are out of the scene), multi-resolution rendering (drawing at varying levels of detail), and re-sampling of unstructured data (to obtain regular data) can dramatically enhance the frame rate. Finally, the visualization system should allow on-the-fly processing and manipulation of data to interactively refine the information hidden in the data.
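The preprocessing steps named above can be sketched in a few lines for a simple point dataset. This is a deliberately simplified, serial illustration and not the group's system: an axis-aligned box stands in for the visible region, and keeping every 2^level-th point stands in for a level-of-detail scheme. The names `cull` and `decimate` are hypothetical.

```python
def cull(points, lo, hi):
    """Visibility culling against an axis-aligned box: keep only points
    whose coordinates all lie between the corners lo and hi (inclusive)."""
    return [
        p for p in points
        if all(l <= c <= h for c, l, h in zip(p, lo, hi))
    ]


def decimate(points, level):
    """Crude multi-resolution step: keep every 2**level-th point.
    Level 0 keeps everything; higher levels draw fewer points."""
    return points[:: 2 ** level]


points = [(0, 0, 0), (1, 1, 1), (5, 5, 5), (2, 2, 2), (0.5, 0.5, 0.5)]
visible = cull(points, lo=(0, 0, 0), hi=(2, 2, 2))  # drops (5, 5, 5)
coarse = decimate(visible, level=1)                 # half of the visible points
```

Culling first and decimating second means the level-of-detail step only touches data that could actually appear in the scene.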
Research in High Performance Data Mining and Database Systems (Triantaphyllou, Abdelguerfi): Research by Dr. Evangelos Triantaphyllou and his associates at Louisiana State University is centered on high performance data mining and knowledge discovery from databases. Triantaphyllou's group is investigating research questions such as how to improve the accuracy of the classification and prediction models determined during the data mining process. They are looking for ways to balance the overfitting and overgeneralization properties of these models. As there are many different data mining approaches, such as decision trees, neural networks, and support vector machines, there may be different ways of achieving this optimal balance. At the same time, a robust theory for achieving this balance between overgeneralization and overfitting should be based on some common ground. Other related challenges the group is addressing are determining ways to partition large databases and developing fast, but still effective, heuristics for analyzing large-scale data mining and knowledge discovery problems. Developing easily scalable algorithms is a key objective of Triantaphyllou's research activities in this area. For research in the above areas to be carried out successfully, it is paramount to have the means to access, process, and archive vast amounts of data. The data access and storage means proposed in PetaShare are highly synergistic with the computational goals of the data mining and knowledge discovery efforts currently under way at LSU.
Abdelguerfi: Dr. Mahdi Abdelguerfi's research activities at the University of New Orleans include spatio-temporal databases, i.e., the modeling of a dynamic world involving objects whose position, shape, and size change over time. Of particular interest are the implementation of spatio-temporal indexing schemes and proximity query processing in network spatial database systems. The huge increase in the size of spatio-temporal databases has brought with it an increased requirement for CPU and I/O resources to handle the querying, retrieval, and viewing of this data. To achieve the required performance levels, large spatio-temporal database systems have increasingly been required to make use of parallelism. The Geospatial Information Database (GIDB) Portal System is an outcome of joint research between UNO and the Naval Research Laboratory at Stennis Space Center in Mississippi.
Current joint research work with the U.S. Army Corps of Engineers (USACE), New Orleans District, aims to achieve interoperability through the integration of heterogeneous COTS-based spatio-temporal systems. This project is motivated by the fact that, as spatio-temporal data and applications keep growing and diversifying, their management becomes more and more challenging. This has given rise to the need for a globally integrated spatio-temporal management system.
In the aftermath of Hurricane Katrina, terabytes of data such as LIDAR and aerial photography have been collected by USACE (New Orleans District), quickly outpacing both its capacity and that of UNO to store this data and provide access to users. A far worse problem has emerged as a number of different offices and agencies have been contributing data through different channels, making it difficult to assimilate everything at once. UNO's Computer Science storage space is very limited, while USACE's storage is limited to a SAN setup that has become prohibitively expensive to expand. Abdelguerfi has been conducting joint research with USACE on a GIS search engine that finds and ranks spatial datasets in order to cope with the sudden influx of data. By crawling through file servers and looking at the maps that people make with the data, the planned indexing program is expected to determine popular datasets and build relationships between separate datasets. Ultimately, the goal is to build a system that can search this index by keyword and minimum bounding rectangle to determine potentially valuable data. A related project, which extends GIDB with a fully automated GIS engine to search, discover, bind, and portal OpenGIS web services, is being investigated jointly with NRL at Stennis Space Center. The utility of these two GIS search engines depends on the quantity of available data. However, data availability requires significant resources for accessing and managing storage. The PetaShare system is therefore essential in this regard.
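A search over such an index by keyword and minimum bounding rectangle (MBR), ranked by popularity, could look roughly like the sketch below. The planned system is not described in enough detail to reproduce, so everything here is an assumption: the `Dataset` record, the use of a usage count as the popularity measure, the linear scan instead of a spatial index, and the sample entries, which are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical index entry for one spatial dataset."""
    name: str
    keywords: frozenset
    mbr: tuple       # (min_x, min_y, max_x, max_y)
    popularity: int  # e.g. number of maps that use the dataset


def mbr_intersects(a, b):
    """True if two axis-aligned rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(index, keyword, query_mbr):
    """Names of datasets matching the keyword whose MBR intersects the
    query rectangle, most popular first (ties broken by name)."""
    hits = [
        d for d in index
        if keyword.lower() in d.keywords and mbr_intersects(d.mbr, query_mbr)
    ]
    return [d.name for d in sorted(hits, key=lambda d: (-d.popularity, d.name))]


# Invented sample entries (longitude/latitude rectangles).
index = [
    Dataset("lidar_2005", frozenset({"lidar", "elevation"}),
            (-90.3, 29.8, -89.9, 30.1), 12),
    Dataset("aerial_2005", frozenset({"aerial", "photography"}),
            (-90.2, 29.9, -90.0, 30.0), 30),
    Dataset("lidar_baton_rouge", frozenset({"lidar"}),
            (-91.3, 30.3, -91.0, 30.6), 50),
    Dataset("lidar_2006", frozenset({"lidar"}),
            (-90.1, 29.9, -90.0, 30.0), 20),
]
results = search(index, "LiDAR", (-90.15, 29.95, -90.05, 30.05))
```

A production system would likely replace the linear scan with a spatial index such as an R-tree, but the rectangle test itself stays the same.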