Here is the problem:
"Agents" installed on many different servers send "heartbeat" signals up to a central server every 5 seconds. How can I find the ones that have missed their开发者_StackOverflow中文版 heartbeat for more than 10 seconds actively and raise an alert?
The problem is simple if you don't think about scalablity. In the simplest form, you can record the timestamp of the latest heartbeat received from each agent in a database table and run a regular query to find the ones older than the threshold.
This solution is however not scalable to millions of agents.
I am looking for algorithms or techologies that make this possible.
- Use a map: AgentId --> LastHearbeatTime
- Use 11 sets (assuming a resolution of 1 second is enough), each holds Ids of Agents reported in a 1-second window.
Every time an agent report a hearthbeat: 1. Find it in the map 2. Delete it from the relevant set 3. Update it in the map 4. Add it to the relevant set
Define a thread: Once per second, the oldest set expires. It should be empty. If it doesn't - it contains Ids of agents which did not report. Once a set expires, you can reuse it (cyclic array of sets).
I believe it can be implemented without locks (maybe you'll need 12 sets).
Without knowing language and platform it's somewhat hard to advise you on a detailed implementation, however my advice is somewhat similar to Lior Kogan's. In my opinion, however, you only need two sets and no map is involved:
Say you have two variables representing sets, A and B.
Every heartbeat removes the agent id from set A. Every 5 seconds, a different thread raises an alert for every agent id in B, then sets B = A, and last but not least creates a set with all of the agent ids and sets A to equal that (if the number of agent ids is really large, you can prepare the new set between one check and the other and only sleep for the remaining time). Locking would only be needed while changing the variables pointing to each set, provided you use a lock-free set collection. Performance will largely depend on the algorithmic complexity of said implementation, and if you go down this way, you should privilege the one with best performance (not necessarily best big-O, for instance if wost-case latency matters to you) for removals.
As a side note, if memory is not an issue or failures are relatively few, when you check whether you need to raise alerts and do so, you can do that on a thread of its own and getting possibly interesting performance speedups (again, the platform and runtime matter, for in erlang that would be a breeze but in Windows the cost of creating a full-blown new thread might exceed the performance benefit if the failures are few) at the cost of keeping the old B set in memory.
MongoDB is great for this type of use. While not exactly an algorithm, it does fit the bill for a fundamental technology that is needed to create this service. We use it here at CopperEgg for our RevealCloud product to do exactly what you say - we send an alert when the system has gone away for some bit of time - sampling every 5 seconds. I'd love to hear more about your thoughts and use case. Can you provide more details?
精彩评论