| Version 2 (modified by nuno, 4 years ago) |
|---|
Monitoring
This page contains more technical information about the Monitoring component. It is also a draft: a place to note ideas and questions as they arise, so some of the questions below are still unanswered.
Monitoring is composed of several modules, such as availability, replication in case of failure, load, and available space. The component is built using Erlang's gen_server_cluster module.
Availability
All nodes are part of a cluster (using gen_server_cluster). If a storage node fails, it is marked as offline, and whenever Cerebrum returns a list of storage nodes for a particular file (in response to a GET or PUT command), the offline node is not included among the locations. If the node is still offline after a pre-defined period of time, Monitoring starts the replication process (see the Replication section below).
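The availability rules above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the Erlang implementation: the `AvailabilityTracker` class, its method names, and the 300-second grace period are all assumptions, since the page only speaks of "a pre-defined period of time".

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical value: the page does not say how long the period is.
REPLICATION_GRACE_SECONDS = 300.0


@dataclass
class AvailabilityTracker:
    """Records when each storage node went offline (None means online)."""
    offline_since: Dict[str, Optional[float]] = field(default_factory=dict)

    def mark_online(self, node: str) -> None:
        self.offline_since[node] = None

    def mark_offline(self, node: str, now: Optional[float] = None) -> None:
        # Keep the first failure time if the node is already marked offline.
        if self.offline_since.get(node) is None:
            self.offline_since[node] = time.time() if now is None else now

    def filter_online(self, locations: List[str]) -> List[str]:
        """Drop offline nodes from a location list built for GET or PUT.

        Nodes the tracker has never seen are treated as online.
        """
        return [n for n in locations if self.offline_since.get(n) is None]

    def due_for_replication(self, now: Optional[float] = None) -> List[str]:
        """Nodes that have been offline for at least the grace period."""
        now = time.time() if now is None else now
        return [
            node for node, since in self.offline_since.items()
            if since is not None and now - since >= REPLICATION_GRACE_SECONDS
        ]
```

In this sketch, Cerebrum would pass every location list through `filter_online` before answering a client, and a periodic check would hand the result of `due_for_replication` to the replication module.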
Replication
If a storage node stays offline for the pre-defined period of time, Monitoring starts replicating the files held on that node onto another server:
- Get a list of all the files stored in the offline node.
- For each file:
  - Calculate a new location for the file.
  - Run the replication algorithm used by the primary storage nodes to replicate the file into the new location.
  - Update the metadata (change group) to reflect the changes.
- If the offline node recovers and comes back online, then delete all files from the node.
Q: There are many open questions about what to do when the faulty server comes back online.
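The replication steps listed above might look roughly like the sketch below. It is only an illustration under stated assumptions: the metadata is modelled as a mapping from file id to the set of nodes holding it, `copy_file` is a hypothetical callback standing in for the replication algorithm of the primary storage nodes, and choosing the first eligible candidate is a placeholder for the real location calculation. Cleaning up a recovered node is left out, since that part is still an open question.

```python
from typing import Callable, Dict, List, Set


def replicate_offline_node(
    offline_node: str,
    metadata: Dict[str, Set[str]],               # file id -> nodes holding it
    candidates: List[str],                       # online nodes, in preference order
    copy_file: Callable[[str, str, str], None],  # (file id, source, target)
) -> Dict[str, str]:
    """Re-replicate each file stored on offline_node; return file id -> new node."""
    moved: Dict[str, str] = {}
    # Step 1: list all files stored on the offline node.
    files = sorted(f for f, nodes in metadata.items() if offline_node in nodes)
    for file_id in files:
        holders = metadata[file_id]
        survivors = sorted(holders - {offline_node})
        # Step 2: calculate a new location (placeholder: first node without a copy).
        targets = [n for n in candidates if n not in holders]
        if not survivors or not targets:
            continue  # no live copy to read from, or nowhere to put it
        target = targets[0]
        # Step 3: copy the file from a surviving replica to the new location.
        copy_file(file_id, survivors[0], target)
        # Step 4: update the metadata so the group no longer lists the offline node.
        metadata[file_id] = (holders - {offline_node}) | {target}
        moved[file_id] = target
    return moved
```

Files whose only copy lived on the offline node are skipped here, because there is no surviving replica to copy from; how to treat them is not covered by the notes above.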
Load
This module should calculate a score for each node. The score is based on the load of each node, as well as on bandwidth metrics provided by each storage node. Specifically, after a client finishes retrieving a file, the storage node contacts the Monitoring component to report how much data was retrieved and how many seconds the transfer took.
When the client API calls Cerebrum's GET method, this score is used to sort the returned list of storage nodes that hold the file.
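One possible shape for such a score is sketched below. The page does not define the formula, so everything here is an assumption: throughput is smoothed with an exponential moving average, load is taken as a fraction between 0.0 (idle) and 1.0 (saturated), and the score is the smoothed throughput scaled by the remaining capacity. The class and method names are hypothetical.

```python
from typing import Dict, List


class LoadScores:
    """Keeps a per-node score from reported transfers and load figures."""

    def __init__(self, smoothing: float = 0.2) -> None:
        self.smoothing = smoothing              # weight given to the newest sample
        self.throughput: Dict[str, float] = {}  # smoothed bytes per second
        self.load: Dict[str, float] = {}        # 0.0 idle .. 1.0 saturated

    def report_transfer(self, node: str, num_bytes: int, seconds: float) -> None:
        """Called by a storage node after it finishes serving a file."""
        if seconds <= 0:
            return  # ignore unusable samples
        sample = num_bytes / seconds
        previous = self.throughput.get(node)
        if previous is None:
            self.throughput[node] = sample
        else:
            self.throughput[node] = (
                (1 - self.smoothing) * previous + self.smoothing * sample
            )

    def report_load(self, node: str, load: float) -> None:
        self.load[node] = min(max(load, 0.0), 1.0)

    def score(self, node: str) -> float:
        # Assumed formula: fast and lightly loaded nodes score highest.
        return self.throughput.get(node, 0.0) * (1.0 - self.load.get(node, 0.0))

    def sort_locations(self, locations: List[str]) -> List[str]:
        """Order the nodes holding a file from best to worst score."""
        return sorted(locations, key=self.score, reverse=True)
```

With this sketch, a node that has never reported a transfer scores 0.0 and ends up last in the list; a real implementation might prefer to give unknown nodes a neutral default instead.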
Available space
This module keeps track of the available space on each node. Cerebrum uses this information when the client calls the PUT command, to determine which storage nodes will hold the file.
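As a rough illustration of how Cerebrum could use this information on a PUT, the sketch below picks the nodes with the most free space that can still fit the file. The function name, the `copies` parameter, and the "most free space first" policy are assumptions; the page does not say how the space figures are weighed.

```python
from typing import Dict, List


def pick_storage_nodes(
    free_bytes: Dict[str, int],  # node -> available space in bytes
    file_size: int,
    copies: int,
) -> List[str]:
    """Return up to `copies` nodes with room for the file, most free space first."""
    eligible = [node for node, free in free_bytes.items() if free >= file_size]
    # Sort by descending free space; break ties by node name for a stable result.
    eligible.sort(key=lambda node: (-free_bytes[node], node))
    return eligible[:copies]
```

If fewer than `copies` nodes have enough room, the function returns a shorter list, and the caller has to decide whether to reject the PUT or store fewer replicas.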