
Restarting agent program after it crashes

Developer (开发者) https://www.devze.com 2023-02-16 02:12 Source: Internet

Consider a distributed bank application in which distributed agent machines modify the value of a global variable, say "balance".

The agents' requests are queued. Each request adds a value to the global variable on behalf of a particular agent, so the code for an agent has this form:

    agent
    {
        look_queue();   // take a look at the leftmost request on the queue without dequeuing

        lock_global_variable(balance, agent_machine_id);
        // POINT A
        modify(balance, value);
        unlock_global_variable(balance, agent_machine_id);
        // POINT B
        dequeue();      // once the transaction is complete, the request can be dequeued
    }

Now, if an agent's code crashes at POINT B, the request obviously should not be processed again; otherwise the variable would be modified twice for the same request. To avoid this, we can make the code atomic:

    agent
    {
        look_queue();   // take a look at the leftmost request on the queue without dequeuing

        atomic
        {
            lock_global_variable(balance, agent_machine_id);
            modify(balance, value);
            unlock_global_variable(balance, agent_machine_id);
            dequeue();      // once the transaction is complete, the request can be dequeued
        }
    }

I am looking for answers to these questions:

  1. How can points in code that need to be executed atomically be identified "automatically"?
  2. If the code crashes during execution, how much will "logging the transaction and variable values" help? Are there other approaches to the problem of crashed agents?
  3. Again, logging does not scale to big applications with a large number of variables. What can we do in those cases instead of restarting execution from scratch?
  4. In general, how can such atomic blocks be identified when agents work together? If one agent fails, do the others have to wait for it to restart? How can software testing help us identify cases in which a crashed agent leaves the program in an inconsistent state?
  5. How can the atomic blocks be made more fine-grained, to reduce performance bottlenecks?


Q> How can points in code that need to be executed atomically be identified "automatically"?
A> Whenever stateful data is shared across different contexts. Not every party needs to be a mutator; a single one is enough. In your case, balance is shared between the agents.

Q> If the code crashes during execution, how much will "logging the transaction and variable values" help? Are there other approaches to the problem of crashed agents?
A> It can help, but it has high costs attached: you need to roll back X entries, replay the scenario, and so on. A better approach is either to make everything transactional or to have an effective automatic rollback mechanism.

Q> Again, logging does not scale to big applications with a large number of variables. What can we do in those cases instead of restarting execution from scratch?
A> In some cases you can relax consistency. For example, CopyOnWriteArrayList performs each write on a fresh copy of the data and switches new readers over to that copy once it is ready. If the write fails, the copy can safely be discarded. There is also compare-and-swap. Also see the link for the previous question.
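
To make the compare-and-swap idea concrete, here is a small sketch. Python has no built-in atomic integer, so the AtomicInt class below is a hypothetical helper that emulates compare-and-swap with a short internal lock; in Java you would more likely reach for AtomicInteger. The names add and run_agents are also invented for the example.

```python
import threading

class AtomicInt:
    """Hypothetical helper: emulates compare-and-swap with a short internal lock."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def get(self):
        with self._guard:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`; report success."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True

def add(balance, amount):
    """Optimistic update: recompute and retry if another agent got in first."""
    while True:
        seen = balance.get()
        if balance.compare_and_swap(seen, seen + amount):
            return seen + amount

def run_agents(n_agents=8, deposits=500):
    """Start several agent threads that each deposit 1 repeatedly; return the total."""
    balance = AtomicInt(0)

    def agent():
        for _ in range(deposits):
            add(balance, 1)

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balance.get()
```

An agent that dies between get and compare_and_swap has not changed anything yet, so there is nothing to roll back. This is the same property as discarding a failed copy in the copy-on-write case.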

Q> In general, how can such atomic blocks be identified when agents work together?
A> See your first question.

Q> If one agent fails, do the others have to wait for it to restart?
A> Most policies and APIs define a maximum timeout for critical-section execution; otherwise the system risks ending up in a perpetual deadlock.
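
A minimal sketch of a bounded wait, using threading.Lock from the Python standard library (its acquire method accepts a timeout in seconds). The function name try_update and the state dictionary are assumptions made for this example.

```python
import threading

def try_update(lock, state, amount, timeout_s=0.5):
    """Return True if the update ran, False if the lock was not obtained in time."""
    if not lock.acquire(timeout=timeout_s):
        # Give up instead of blocking forever; the caller can retry or raise an alert.
        return False
    try:
        state["balance"] += amount
        return True
    finally:
        lock.release()
```

This only covers waiters inside one process. For agents on separate machines, the comparable idea is usually a lock or lease that expires on its own, which is beyond what this sketch shows.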

Q> How can software testing help us identify cases in which a crashed agent leaves the program in an inconsistent state?
A> It can, to a fair degree. However, testing concurrent code requires as much skill as writing the code itself, if not more.
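
One way such a test could look is simple fault injection: crash the agent at a chosen point, restart it, and check the final state. The sketch below mirrors the first listing in the question, with the locking left out and the crash simulated by an exception. The names Crash, naive_agent and run_with_crash are invented for the example.

```python
from collections import deque

class Crash(Exception):
    """Stands in for the agent process dying."""

def naive_agent(queue, state, crash_at=None):
    """Mirrors the question's first listing: modify first, dequeue afterwards."""
    amount = queue[0]               # look_queue(): peek without dequeuing
    if crash_at == "A":
        raise Crash("POINT A")
    state["balance"] += amount      # modify(balance, value)
    if crash_at == "B":
        raise Crash("POINT B")
    queue.popleft()                 # dequeue()

def run_with_crash(crash_at):
    """Crash once at the given point, restart the agent, return the final balance."""
    queue, state = deque([50]), {"balance": 0}
    try:
        naive_agent(queue, state, crash_at)
    except Crash:
        pass                        # pretend a supervisor restarts the agent
    while queue:
        naive_agent(queue, state)
    return state["balance"]
```

With a single request of 50, a crash at POINT A still ends with a balance of 50, while a crash at POINT B ends with 100, which exposes the double application described in the question. Real crash testing of concurrent agents is considerably harder than this single-threaded sketch suggests.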

Q> How can the atomic blocks be made more fine-grained, to reduce performance bottlenecks?
A> You have answered the question yourself :) If one atomic operation needs to modify 10 different shared state variables, there is not much you can do apart from trying to push the external contract down so that it needs to modify fewer. This is pretty much the reason why relational databases are not as scalable as NoSQL stores: they might need to modify dependent foreign keys, execute triggers, and so on. Alternatively, try to promote immutability.
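
As an illustration of finer granularity, the sketch below gives each shared variable its own lock instead of one global lock, so operations on different accounts do not block each other. The Accounts class and its methods are hypothetical and go beyond the single balance variable in the question.

```python
import threading

class Accounts:
    """Hypothetical store with one lock per account rather than one global lock."""

    def __init__(self, names):
        self._balances = {name: 0 for name in names}
        self._locks = {name: threading.Lock() for name in names}

    def deposit(self, name, amount):
        # Touches one variable, so it takes only that variable's lock.
        with self._locks[name]:
            self._balances[name] += amount

    def transfer(self, src, dst, amount):
        if src == dst:
            return  # nothing to do, and taking the same lock twice would deadlock
        # Always acquire locks in a fixed (sorted) order to avoid deadlock.
        first, second = sorted((src, dst))
        with self._locks[first], self._locks[second]:
            self._balances[src] -= amount
            self._balances[dst] += amount

    def balance(self, name):
        with self._locks[name]:
            return self._balances[name]
```

The trade-off matches the answer above: an operation that touches many variables must hold many locks, so the gain comes from keeping each operation's footprint small.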

If you are a Java programmer, I would definitely recommend reading this book. I'm sure there are good counterparts for other languages, too.
