CSE 710 – Wide Area Distributed File Systems

Spring 2014 – Project Ideas

 

Project-1: FuseDLS: Design and Implementation of a FUSE-based File System Interface to a Cloud-hosted Directory Listing Service:

 

The Cloud-hosted Directory Listing Service (DLS) prefetches and caches remote directory metadata in the Cloud to minimize response time for thin clients (such as smartphones and Web clients) and to enable efficient directory traversal before a remote third-party data transfer request is issued. Conceptually, DLS is an intermediate layer between the thin clients and the remote servers (such as FTP, GridFTP, and SCP servers) that provides access to directory listings as well as other metadata. In that sense, DLS acts as a centralized metadata server hosted in the Cloud. When a thin client wants to list a directory or access file metadata on a remote server, it sends a request containing the necessary information (i.e., the URL of the top directory from which to start the traversal, along with the credentials required for authentication and authorization) to DLS, and DLS responds with the requested metadata.

During this process, DLS first checks whether the requested metadata is available in its disk cache. If it is (and the provided credentials match the credentials associated with the cached entry), DLS sends the cached information directly to the client without connecting to the remote server. Otherwise, it connects to the remote server, retrieves the requested metadata, and sends it to the client. Meanwhile, several levels of subdirectories are prefetched in the background in case the user wants to visit a subdirectory. All metadata cached on the DLS server is periodically checked against the remote server to ensure that it is fresh. Clients also have the option to refresh the DLS cache on demand, which bypasses the cached metadata and ensures that they see the current state of the remote server. DLS’s caching mechanism can be integrated with several optimization techniques in order to improve cache consistency and access performance.
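The lookup path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual DLS code: the class name `DirectoryListingCache`, the `fetch_remote` callable, the `prefetch_depth` parameter, and the `fake_remote` stand-in for a remote server are all hypothetical. The cache is an in-memory dictionary rather than a disk cache, prefetching runs inline rather than in the background, and the periodic freshness check is omitted.

```python
from dataclasses import dataclass

# Hypothetical remote directory tree standing in for an FTP/GridFTP/SCP server.
# Directory names end with "/" so the sketch can tell them apart from files.
REMOTE_TREE = {
    "ftp://example.org/": ["data/", "readme.txt"],
    "ftp://example.org/data/": ["2014/", "a.dat"],
    "ftp://example.org/data/2014/": ["b.dat"],
}


def fake_remote(url, credentials):
    """Pretend to contact the remote server and return a directory listing."""
    return REMOTE_TREE[url]


@dataclass
class CacheEntry:
    listing: list       # names found in the directory
    credentials: str    # credentials that were used to fetch the listing


class DirectoryListingCache:
    """Toy model of the DLS lookup path (assumed design, not the real service)."""

    def __init__(self, fetch_remote, prefetch_depth=2):
        self.fetch_remote = fetch_remote
        self.prefetch_depth = prefetch_depth
        self.entries = {}        # url -> CacheEntry
        self.remote_calls = 0    # how often the remote server was contacted

    def _fetch(self, url, credentials):
        # Contact the remote server and store the result in the cache.
        self.remote_calls += 1
        listing = self.fetch_remote(url, credentials)
        self.entries[url] = CacheEntry(listing, credentials)
        return listing

    def _prefetch(self, url, listing, credentials, depth):
        # Walk uncached subdirectories up to `depth` levels below `url`.
        if depth <= 0:
            return
        for name in listing:
            child = url + name
            if name.endswith("/") and child not in self.entries:
                child_listing = self._fetch(child, credentials)
                self._prefetch(child, child_listing, credentials, depth - 1)

    def list_dir(self, url, credentials, refresh=False):
        entry = self.entries.get(url)
        # Serve from the cache only if the credentials match and the client
        # did not ask for an on-demand refresh.
        if entry is not None and not refresh and entry.credentials == credentials:
            return entry.listing
        listing = self._fetch(url, credentials)
        self._prefetch(url, listing, credentials, self.prefetch_depth)
        return listing
```

With `prefetch_depth=1`, listing the top directory contacts the remote server twice (once for the directory itself and once for its prefetched subdirectory), and a later request for that subdirectory with the same credentials is answered from the cache. A request with different credentials, or with `refresh=True`, goes back to the remote server.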

FuseDLS will be a virtual file system interface that allows users to access the Cloud-hosted DLS service and browse the contents of remote storage servers as conveniently as they access the local file system. It will enable users to mount remote storage servers on their local hosts. Although mounting a file system normally requires root privileges, FuseDLS will allow non-root users to mount remote file systems locally. FuseDLS will be based on FUSE, a simple interface for exporting a virtual file system from user space to the Linux kernel. Whenever system I/O calls are made on a mounted FuseDLS resource, FUSE will capture these calls in the kernel and forward them to a user-space library called libfuse. Through this library, local system I/O calls will be mapped to remote storage I/O calls. The FUSE library is available in most Linux distributions today, and it is a very practical way of implementing a user-level file system. The students will be able to use this convenient tool to develop the client side of a wide area file system. Access to the Cloud-hosted DLS service and the necessary API will be provided to the students.
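As a rough illustration of this mapping, the sketch below answers two FUSE-style callbacks, `getattr` and `readdir`, from a directory listing service. It is not the FuseDLS design and it does not use libfuse: the class `DLSOperations`, the `list_dir` callable, and the `fake_list_dir` stand-in for the DLS API are hypothetical, and a real implementation would register equivalent callbacks with libfuse (in C or through a language binding) instead of calling them directly. The permission bits and the convention that directory names end with "/" are also assumptions.

```python
import errno
import posixpath
import stat

# Hypothetical listings that a DLS-like service might return.
# Directory names end with "/" in this sketch.
LISTINGS = {
    "ftp://example.org/": ["data/", "readme.txt"],
    "ftp://example.org/data/": ["2014/", "a.dat"],
}


def fake_list_dir(url):
    """Stand-in for a DLS request; raises KeyError for unknown directories."""
    return LISTINGS[url]


class DLSOperations:
    """FUSE-style callbacks answered from directory listings (sketch only)."""

    def __init__(self, base_url, list_dir):
        self.base_url = base_url.rstrip("/")
        self.list_dir = list_dir

    def _url(self, path):
        # Map a path below the mount point to a remote directory URL.
        path = posixpath.normpath(path)
        return self.base_url + ("/" if path == "/" else path + "/")

    def _entries(self, path):
        # Translate "unknown directory" into the errno a file system reports.
        try:
            return self.list_dir(self._url(path))
        except KeyError:
            raise OSError(errno.ENOENT, "no such directory", path) from None

    def readdir(self, path):
        # Called when a program lists a directory under the mount point.
        return [".", ".."] + [name.rstrip("/") for name in self._entries(path)]

    def getattr(self, path):
        # Called when a program stats a path under the mount point.
        path = posixpath.normpath(path)
        if path == "/":
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        parent, name = posixpath.split(path)
        entries = self._entries(parent)
        if name + "/" in entries:
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        if name in entries:
            return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1}
        raise OSError(errno.ENOENT, "no such file or directory", path)
```

For example, `DLSOperations("ftp://example.org/", fake_list_dir).readdir("/data")` asks the stand-in service for the listing of `ftp://example.org/data/` and returns the entry names in the form a directory listing call expects.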

Project-2: MDS: Design and Implementation of a Distributed Metadata Server for Global Name Space in a Wide-area File System:


 

One of the important features of a distributed storage or file system is a global unified name space across all participating sites/servers, which enables easy data sharing without knowledge of the actual physical location of the data. This feature depends on the “location metadata” of all files/datasets in the system being available to all participating sites. Designing and implementing a metadata service that provides high consistency, scalability, availability, and performance at the same time is a major challenge.

A central metadata server is generally easy to implement and ensures consistency, but it is also a single point of failure, which in many cases leads to low availability, low scalability, and low performance. Ensuring high availability requires replicating metadata servers at local sites. Synchronously replicated metadata servers provide high consistency but introduce a large synchronization overhead, which degrades the performance of metadata operations, especially writes. Asynchronously replicated metadata servers provide high performance but introduce conflicts and consistency issues across the replicas. Fully distributed approaches can be more scalable but may suffer from performance and consistency problems.
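The trade-off between synchronous and asynchronous replication can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not a proposed design: the classes `Replica` and `ReplicatedMDS`, the message counter used as a proxy for synchronization overhead, and the simple conflict rule (two replicas hold different locations for the same file) are all hypothetical, and failures, ordering, and networking are ignored.

```python
class Replica:
    """One metadata server holding a copy of the location metadata."""

    def __init__(self, name):
        self.name = name
        self.table = {}      # file name -> physical location
        self.pending = []    # local updates not yet sent to peers (async mode)


class ReplicatedMDS:
    """Toy replicated metadata service with a sync or async write path."""

    def __init__(self, names, synchronous):
        self.replicas = {name: Replica(name) for name in names}
        self.synchronous = synchronous
        self.messages = 0    # replica-to-replica messages sent so far

    def _peers(self, site):
        return [r for name, r in self.replicas.items() if name != site]

    def write(self, site, filename, location):
        local = self.replicas[site]
        local.table[filename] = location
        if self.synchronous:
            # Every write waits until all peers have applied the update.
            for peer in self._peers(site):
                peer.table[filename] = location
                self.messages += 1
        else:
            # The write returns immediately; peers are updated later.
            local.pending.append((filename, location))

    def lookup(self, site, filename):
        return self.replicas[site].table.get(filename)

    def sync(self):
        """Push pending updates to peers and return the conflicts found."""
        conflicts = []
        for replica in self.replicas.values():
            for filename, location in replica.pending:
                for peer in self._peers(replica.name):
                    self.messages += 1
                    if peer.table.get(filename, location) != location:
                        # The peer already holds a different location.
                        conflicts.append((filename, replica.name, peer.name))
                    else:
                        peer.table[filename] = location
            replica.pending.clear()
        return conflicts
```

In synchronous mode, each write costs one message per peer before it completes, and every site then returns the same location. In asynchronous mode, two sites can register different locations for the same file without exchanging any messages, and the disagreement only surfaces when `sync()` runs and reports a conflict that some policy then has to resolve.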

In this project, the students will study different metadata server (MDS) layouts in terms of availability, scalability, consistency, and performance. They will design a distributed, replicated, or hybrid metadata approach that achieves all four of these properties with minimal sacrifice. This approach will be implemented as part of the Ori and GlusterFS file systems.

 

Project-3: SmartFS: Design and Implementation of a Serverless Distributed File System for Smartphones: 


 

In this project, the students will develop a distributed file system (SmartFS) for file access and sharing across multiple Android smartphones. This will be a serverless file system, meaning it will require neither an external server component nor any of the participating phones to act as a server. In that sense, it will be a peer-to-peer (P2P) distributed file system with a POSIX interface. Each phone will be able to export certain portions of its local file system to other users (i.e., enable data sharing), and other phones will be able to locate and import/mount those remote files and directories into their local file systems. Performance and scalability will be the major design considerations. The authentication and authorization of remote clients will also be an important component of the project. The connectivity between phones participating in SmartFS can be through either Wi-Fi or 4G. Android phones will be provided to the students to test their implementation.

 

Project-4: PowerFS: Energy-Aware File System Design

 

Reducing power consumption has become a major design consideration across a spectrum of computing solutions, from supercomputers and datacenters to handhelds and mobile computers. The servers that run Google’s data centers have been estimated to consume millions of dollars' worth of electricity per year, and the total worldwide spending on power management for enterprises was a staggering $40 billion. There has been a large body of work on managing power and improving energy efficiency in computing at different levels, including computer architecture, operating systems, file systems, application-level tools, and schedulers. In this project, the students will design an energy-aware file system. They will take an existing file system such as HDFS or GlusterFS, analyze its current power consumption, and then modify it to reduce that consumption. The changes can target a single component, such as the CPU scheduler, the I/O scheduler, or the memory management unit, that can potentially have a great impact on power consumption.