|Version 19 (modified by nuno, 4 years ago)|
With the explosive growth of Web 2.0 applications, the need for "infinite" storage arose. Web 2.0 applications like Flickr, Facebook, Wikipedia or Youtube have a high demand for storage space. The storage market faces new challenges because the existing solutions are too expensive to accommodate this growth. On one side, storage vendors are building highly available and scalable storage solutions on commodity hardware so that they can offer more competitive products. On the other side, the services that demand these volumes cannot wait for the release of new products and are building their own storage solutions on commodity hardware.
This project aims to build a highly available and scalable distributed storage system based on commodity hardware. For more information about how this project is being done, please read the Disclaimer section below.
This project was born as a course project for Distributed Systems at Carnegie Mellon University. The aim of the course project was to build a distributed system that exhibits the major properties of distributed systems: high availability, fault tolerance and scalability.
We, the group of students who proposed this project, aim to build something useful and fun while learning new technologies. We want to experience the difficulties of building a distributed storage system and to come up with solutions that resolve at least some of them. Since this is a course project, the priorities are to build a highly available, fault-tolerant, scalable and distributed storage solution.
Performance is a mandatory keyword for a distributed storage solution, especially if you want to deploy it in your own business. While we will take performance into consideration as much as possible, it is not a priority of the project. However, by the end of the course project we will most likely have a clear idea of what can be improved and will leave such improvements for future work.
To better understand the architecture of dumpFS, please refer to the image below. The communication between the components is described in more detail in the following sections. Please note that this is a logical architecture: some of the components might be physically co-located (e.g. the Monitoring component).
The architecture is composed of three components: Cerebrum, Monitoring and Storage. Each component is explained in the following sections.
Cerebrum is the main component of dumpFS. It is the interface between the storage, the monitoring component and client applications, exposed through a set of RESTful Webservices.
If the client wants to retrieve a file from dumpFS, it contacts Cerebrum (GET) to learn the best locations to get the file from. Cerebrum answers with a set of locations, and it is up to the client to fetch the file directly from the storage nodes.
If the client wants to store a new file (PUT), it contacts Cerebrum to learn the location (which storage node) for storing the file. In the request, the client should provide the file size and an MD5 hash (checksum) of the file. Cerebrum then chooses a set of servers for storing the file based on the status of each storage node (free disk space) and on the hash of the file (to spread files among the servers in an arbitrary/weighted way). The client then contacts the chosen storage node directly in order to store the file. Along with the file, the client should also send the message received from Cerebrum, which includes the set of storage nodes where the file will reside. The primary storage node is responsible for replicating the file based on the replication policy (and the message received from the client).
If the client wants to delete a file (DEL), it contacts Cerebrum to delete the file. Cerebrum then contacts the storage nodes where the file resides and instructs them to delete it.
The storage nodes, as the name implies, will be responsible for storing the files. The storage node (primary storage node) will receive the file directly from the client and will replicate it to two other storage nodes (based on the replication policy). The storage nodes will also provide files to clients.
For retrieving a file, the client will first contact the Cerebrum to get a set of locations. Then, it will contact a storage node directly through a RESTful Webservice (GET) to get the file (based on the <fileID>).
For storing a file, the client will first contact Cerebrum to learn where the file will be stored, then it will contact the primary storage node directly to store the file. Along with the file, the client should also send the message received from Cerebrum, which tells the primary storage node where it should replicate the file to. The primary storage node should then replicate the file based on the replication policy and return OK to the client if the file was stored (and replicated) successfully, or ERROR otherwise.
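From the primary storage node's point of view, the store step above can be sketched as follows. The message format (a "replicas" key) and the `write_local`/`push` callables are assumptions for illustration; the page does not fix either:

```python
def store_and_replicate(data, placement_msg, write_local, push):
    """Handle a PUT on the primary storage node.

    `placement_msg` is the message the client received from Cerebrum;
    here it is assumed to carry the replica node list under "replicas".
    `write_local(data)` persists the file locally and `push(node, data)`
    sends it to one replica, returning True on success. The node answers
    OK only when the local write and every replica push succeed.
    """
    try:
        write_local(data)
    except OSError:
        return "ERROR"
    for node in placement_msg["replicas"]:
        if not push(node, data):
            return "ERROR"
    return "OK"
```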
Storage node Webservice API:
The Monitoring component is responsible for monitoring the storage nodes and for keeping measurements that are used to provide a better quality of service.
For more technical information about the Monitoring component please read the Monitoring page.
The demo is simply a frontend or a service that uses the dumpFS API as its storage backend.
Flow for getting a file:
- The client API contacts the Cerebrum webservice and invokes the GET method.
- The Cerebrum will consult its Metadata in order to find out where the file is located. By default the file is replicated in three locations, so this list will have three entries.
- The webservice returns the list of locations of the file to the client, ordered by the availability and load of each storage node. Each entry of the response should contain the following information: url, checksum.
- The API should try to retrieve the file based on the order of the response, i.e., the first entry of the response should be the node with the least load.
- When the transmission of the file finishes, the storage node sends related metrics to the Monitoring component.
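The retrieval steps above (try the locations in order, verify the checksum, fall back on failure) can be sketched like this. The `fetch` transport is injected so the HTTP details stay out of the sketch, and the `(url, checksum)` entry format follows the response described above:

```python
import hashlib

def fetch_with_fallback(locations, fetch):
    """Return the first replica payload whose MD5 matches its checksum.

    `locations` is the Cerebrum response, assumed sorted least-loaded
    first; `fetch(url)` performs the actual HTTP GET and raises OSError
    when a storage node is unreachable.
    """
    for url, checksum in locations:
        try:
            data = fetch(url)
        except OSError:
            continue  # node down: fall through to the next replica
        if hashlib.md5(data).hexdigest() == checksum:
            return data
    raise IOError("no replica returned a valid copy of the file")
```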
Flow for putting a file:
- The API contacts Cerebrum to find out which primary storage node will store the file. The information sent by the API should contain: filename, size, etc.
- Cerebrum replies with a URL for the file upload. Cerebrum determines the location of the file based on the availability and free storage space of each storage node. Two designs are being considered for triggering replication: either Cerebrum contacts the primary storage node to validate the upload and provide a replication trail, or the primary storage node contacts Cerebrum to ask for the replication nodes and the file size. In the latter scenario, each node has a queue directory into which all uploaded files are placed. A script checks each file in the queue directory: for each file, it contacts Cerebrum and invokes a special method (REP) with the filename as an argument, and Cerebrum replies with a set of locations for replication and the file size. The primary storage node replicates the file once the size of the file in the queue directory matches the value returned by Cerebrum, i.e. once the upload is complete.
- The API sends the file to the primary storage node.
- When finished, the storage node sends related metrics to the Monitoring component.