With the explosive growth of Web 2.0 applications, the need for "infinite" storage arose. Web 2.0 applications like Flickr, Facebook, Wikipedia or YouTube have a high demand for storage space. The storage market faces new challenges because the existing solutions are too expensive to accommodate this growth. On one side, storage vendors are building highly available and scalable storage solutions on commodity hardware in order to offer more competitive products. On the other side, the services that demand these volumes cannot wait for the release of new products and are building their own storage solutions based on commodity hardware.
This project aims to build a highly available and scalable distributed storage system based on commodity hardware. For more information about how this project is being done, please read the Disclaimer section below.
This project was born as a course project for the Distributed Systems course at Carnegie Mellon University. The aim of the course project was to build a distributed system that exhibits the major properties of distributed systems: high availability, fault tolerance and scalability.
The group of students (we) that proposed this project aims to build something useful and fun, and to learn new technologies at the same time. We want to experience the difficulties faced when building a distributed storage system and to come up with solutions to at least some of these problems. Since this is a course project, the priorities are to build a highly available, fault-tolerant, scalable and distributed storage solution.
Performance is a mandatory requirement for a distributed storage solution, especially if you want to deploy it in your own business. While we will take performance into consideration as much as possible, it is not a priority of the project. However, at the end of the course project we will most likely have a clear idea of what can be improved and will leave such improvements for future work.
To better understand the architecture of dumpFS, please see the image below. The communication between the components is described in more detail in the following sections. Please note that this is a logical architecture; some of the components might be physically co-located (e.g. the Monitoring component).
The architecture is composed of three components: Cerebrum, Monitoring and Storage. These components will be implemented in Erlang. An explanation of each component is provided in the following sections.
Cerebrum is the main component of dumpFS. It will be the interface between the storage, monitoring and client applications through a set of RESTful web services. Cerebrum needs to store metadata in a database; Mnesia will be used as the database engine.
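As an illustration, the metadata could be kept in a Mnesia table along the lines of the sketch below. The record name, its fields and the table options are assumptions, not the final schema.
{{{
%% Hypothetical dumpFS metadata record -- field names are assumptions.
-record(file_meta, {
    file_id,        %% unique identifier of the file
    filename,       %% original name supplied by the client
    size,           %% file size in bytes
    md5,            %% MD5 checksum supplied by the client
    locations = []  %% storage nodes holding replicas of the file
}).

%% Create a disk-backed Mnesia table for the metadata, replicated on the
%% given Cerebrum nodes.
create_metadata_table(Nodes) ->
    mnesia:create_table(file_meta,
                        [{attributes, record_info(fields, file_meta)},
                         {disc_copies, Nodes},
                         {type, set}]).
}}}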
If the client wants to retrieve a file from dumpFS, it will contact Cerebrum (GET) to find out the best locations to get the file from. Cerebrum will then answer with a set of locations, and it is up to the client to fetch the file directly from the storage nodes.
If the client wants to store a new file (PUT), it will contact Cerebrum to find out where (on which storage nodes) the file should be stored. In the request, the client should provide the file size and an MD5 hash (checksum) of the file. Cerebrum will then choose a set of servers for storing the file, based on the status of each storage node (free disk space) and on the hash of the file (to spread the files among the servers in a uniform/weighted way). The client will then contact the primary storage node directly in order to store the file. Along with the file, the client should also send the message received from Cerebrum, which includes the set of storage nodes where the file will reside. The primary storage node will be responsible for replicating the file according to the replication policy (and the message received from the client).
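A minimal sketch of how Cerebrum might choose the set of storage nodes, assuming it keeps a list of {Node, FreeSpace} pairs; the function name, its arguments and the placement strategy are assumptions, not the final algorithm.
{{{
%% Hypothetical placement function: pick ReplicaCount storage nodes for a
%% file, keeping only nodes with enough free space and using the MD5 hash
%% to spread files across the candidates. The first node returned acts as
%% the primary storage node. Assumes at least one candidate has room.
choose_nodes(Md5Hash, FileSize, NodesWithFreeSpace, ReplicaCount) ->
    %% Keep only nodes that can actually hold the file.
    Candidates = [Node || {Node, Free} <- NodesWithFreeSpace, Free >= FileSize],
    %% Rotate the candidate list by a hash-derived offset so that different
    %% files start at different nodes (uniform/weighted spread).
    Offset = erlang:phash2(Md5Hash, length(Candidates)),
    {Prefix, Suffix} = lists:split(Offset, Candidates),
    lists:sublist(Suffix ++ Prefix, ReplicaCount).
}}}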
If the client wants to delete a file (DEL), it will contact Cerebrum to delete the file. Cerebrum will then contact the storage nodes where the file resides and instruct them to delete it.
The storage nodes, as the name implies, will be responsible for storing the files. The storage node (primary storage node) will receive the file directly from the client and will replicate it to two other storage nodes (based on the replication policy). The storage nodes will also provide files to clients.
For retrieving a file, the client will first contact Cerebrum to get a set of locations. Then, it will contact a storage node directly through a RESTful web service (GET) to get the file (based on the <fileID>).
For storing a file, the client will first contact Cerebrum to find out where the file will be stored, and then contact the primary storage node directly to store it. Along with the file, the client should also send the message received from Cerebrum, which tells the primary storage node where it should replicate the file to. The primary storage node should then replicate the file according to the replication policy and return OK to the client if the file was stored (and replicated) successfully, or ERROR otherwise.
Storage node Webservice API:
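A plausible sketch of what these endpoints could look like, derived from the flows described above; the exact paths and verbs are assumptions, not the final API.
{{{
GET    /files/<fileID>   retrieve the file identified by <fileID>
PUT    /files/<fileID>   store a file; the body carries the file plus the
                         placement message received from Cerebrum
DELETE /files/<fileID>   delete the local replica of <fileID>
}}}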
The Monitoring component will be responsible for the high-availability and fault-tolerance properties of dumpFS.
The Monitoring component will watch for storage node failures. If a storage node is not reachable for more than a certain period of time, the Monitoring component will start to replicate the files that were stored on that node onto other available storage nodes. For that purpose it will contact Cerebrum to find out which files were stored on the faulty storage node.
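A minimal sketch of such a watchdog loop, assuming the storage nodes are reachable as Erlang nodes and that re-replication is triggered through a placeholder handle_node_failure/1 function.
{{{
%% Hypothetical watchdog: ping every storage node periodically and count
%% consecutive misses; once a node reaches MaxMisses it is declared failed
%% and re-replication of its files is started (handle_node_failure/1 is a
%% placeholder for that step and fires only once per failure).
watch_nodes(Nodes, IntervalMs, MaxMisses) ->
    watch_loop([{Node, 0} || Node <- Nodes], IntervalMs, MaxMisses).

watch_loop(Misses, IntervalMs, MaxMisses) ->
    timer:sleep(IntervalMs),
    Updated = [case net_adm:ping(Node) of
                   pong -> {Node, 0};          %% node answered the ping
                   pang -> {Node, Count + 1}   %% node missed another ping
               end || {Node, Count} <- Misses],
    Failed = [Node || {Node, Count} <- Updated, Count =:= MaxMisses],
    lists:foreach(fun handle_node_failure/1, Failed),
    watch_loop(Updated, IntervalMs, MaxMisses).

handle_node_failure(Node) ->
    %% Placeholder: ask Cerebrum for the files stored on Node and schedule
    %% their replication onto other available storage nodes.
    error_logger:warning_msg("storage node ~p considered failed~n", [Node]).
}}}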
The Monitoring component will also keep metrics about each storage node, in order to help Cerebrum make a decision when a client wants to retrieve or store a file. Examples of such metrics are the free disk space and the load on each storage node.
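For illustration, the per-node metrics could be held in a record like the one below; the record and field names are assumptions.
{{{
%% Hypothetical per-node metrics kept by the Monitoring component and
%% consulted by Cerebrum when placing or serving files.
-record(node_metrics, {
    node,             %% storage node identifier
    free_disk_bytes,  %% remaining disk space on the node
    load,             %% e.g. number of transfers currently in progress
    last_seen         %% timestamp of the last report or successful ping
}).
}}}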
For more technical information about the Monitoring component please read the Monitoring page.
The demo will be a frontend or a service that uses the dumpFS API as its storage backend.
Flow for getting a file:
- The client API contacts the webservice and invokes the GET method.
- Cerebrum will consult its metadata in order to find out where the file is located. By default, the replication policy defines that each file is stored in three locations.
- The webservice will return the file's checksum and a group of locations to the client, ordered based on the availability and load of each storage node.
- The API should try to retrieve the file in the order given in the response, i.e., the first location should be the node with the least load (a client-side sketch of this flow follows the list).
- When transmission of the file is finished, the storage node sends related metrics to the Monitor.
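A client-side sketch of this flow, assuming Cerebrum answers the GET with a plain, comma-separated list of storage node URLs ordered from least to most loaded; the URL layout and response format are assumptions.
{{{
%% Hypothetical client helper: ask Cerebrum where a file lives, then try
%% the returned storage node URLs in order until one download succeeds.
get_file(CerebrumUrl, FileId) ->
    application:ensure_all_started(inets),
    MetaUrl = CerebrumUrl ++ "/files/" ++ FileId,
    {ok, {{_, 200, _}, _Headers, Body}} = httpc:request(get, {MetaUrl, []}, [], []),
    %% Assumption: the body is a comma-separated list of storage node URLs.
    Locations = string:tokens(Body, ","),
    try_locations(Locations).

try_locations([]) ->
    {error, no_location_available};
try_locations([Url | Rest]) ->
    case httpc:request(get, {Url, []}, [], []) of
        {ok, {{_, 200, _}, _Headers, FileBody}} -> {ok, FileBody};
        _Other                                  -> try_locations(Rest)
    end.
}}}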
Flow for putting a file:
- The client API contacts Cerebrum to find out which group of storage nodes will store the file. The information sent by the client API should contain: filename, size, checksum (MD5 hash).
- Cerebrum will reply with a group of storage nodes. Cerebrum determines the location of the file based on availability and free storage space of each storage node.
- The client API sends the file to the primary storage node (file upload) along with the message received from Cerebrum. The storage node will then replicate the file to the other storage nodes. When the replication is completed, it answers the client with OK, or ERROR otherwise (a client-side sketch of this flow follows the list).
- When finished, the storage node sends the related metrics to the Monitor.
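A client-side sketch of this flow, assuming Cerebrum accepts the metadata as a form-encoded PUT and replies with a comma-separated list of storage node URLs whose first entry is the primary node; the request and response formats, paths and header name are assumptions.
{{{
%% Hypothetical client helper: announce the file to Cerebrum, then upload
%% it to the primary storage node together with Cerebrum's placement reply.
put_file(CerebrumUrl, Filename, Path) ->
    application:ensure_all_started(inets),
    {ok, Data} = file:read_file(Path),
    Md5Hex = lists:flatten([io_lib:format("~2.16.0b", [B]) || <<B>> <= erlang:md5(Data)]),
    Meta = io_lib:format("filename=~s&size=~b&md5=~s", [Filename, byte_size(Data), Md5Hex]),
    {ok, {{_, 200, _}, _H, Placement}} =
        httpc:request(put, {CerebrumUrl ++ "/files", [],
                            "application/x-www-form-urlencoded", lists:flatten(Meta)},
                      [], []),
    %% Assumption: the first URL in Cerebrum's reply is the primary node.
    [PrimaryUrl | _Replicas] = string:tokens(Placement, ","),
    %% The placement message travels along in a header so that the primary
    %% node knows where to replicate the file (header name is hypothetical).
    {ok, {{_, 200, _}, _, _}} =
        httpc:request(put, {PrimaryUrl, [{"x-dumpfs-placement", Placement}],
                            "application/octet-stream", Data},
                      [], []),
    ok.
}}}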
Draft notes about Erlang and other technical bits are collected on a separate page.