Creating a formula for calculating device "health" based on uptime/reboots

Developer https://www.devze.com 2022-12-19 00:26 Source: Internet

I have a few hundred network devices that check in to our server every 10 minutes. Each device has an embedded clock that counts seconds and reports the elapsed seconds on every check-in to the server. So a sample data set looks like:

CheckinTime               Runtime
2010-01-01 02:15:00.000   101500
2010-01-01 02:25:00.000   102100
2010-01-01 02:35:00.000   102700

etc.

If the device reboots, when it checks back into the server, it reports a runtime of 0.
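
For reference, reboots can be recovered from such a log by looking for check-ins where the runtime is lower than the previous one. A minimal Python sketch, assuming check-ins are available as (timestamp, runtime) pairs sorted by time; the name `find_reboots` and the fourth and fifth sample rows are hypothetical:

```python
from datetime import datetime

def find_reboots(checkins):
    """Return the check-in times at which a reboot was detected.

    checkins: list of (datetime, runtime_seconds) tuples, sorted by time.
    A reboot is assumed whenever the runtime is lower than the previous one.
    """
    reboots = []
    previous_runtime = None
    for checkin_time, runtime in checkins:
        if previous_runtime is not None and runtime < previous_runtime:
            reboots.append(checkin_time)
        previous_runtime = runtime
    return reboots

sample = [
    (datetime(2010, 1, 1, 2, 15), 101500),
    (datetime(2010, 1, 1, 2, 25), 102100),
    (datetime(2010, 1, 1, 2, 35), 102700),
    (datetime(2010, 1, 1, 2, 45), 0),    # hypothetical reboot, not in the sample data
    (datetime(2010, 1, 1, 2, 55), 600),  # hypothetical follow-up check-in
]
print(find_reboots(sample))
```

With a 10-minute check-in interval, several reboots between two check-ins would show up as a single drop, so this counts them as one.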

What I'm trying to determine is some sort of quantifiable metric for the device's "health".

If a device has rebooted a lot in the past but has not rebooted in the last xx days, then it is considered healthy, compared to a device that has a big uptime except for the last xx days where it has repeatedly rebooted. Also, a device that has been up for 30 days and just rebooted shouldn't be considered "distressed", compared to a device that has continually rebooted every 24 hrs or so for the last xx days.

I've tried multiple ways of calculating the health, using a variety of metrics:

  1. average # of reboots
  2. max(uptime)
  3. avg(uptime)
  4. # of reboots in last 24 hrs
  5. # of reboots in last 3 days
  6. # of reboots in last 7 days
  7. # of reboots in last 30 days

Each individual metric only accounts for one aspect of the device health, but doesn't take into account the overall health compared to other devices or to its current state of health.
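
To make the windowed metrics concrete, here is a rough Python sketch of counting reboots in the trailing 1, 3, 7 and 30 days. It assumes the reboot timestamps have already been extracted from the check-in log; `reboots_in_window` and the sample dates are made up for illustration:

```python
from datetime import datetime, timedelta

def reboots_in_window(reboot_times, now, days):
    """Count reboots that happened within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return sum(1 for t in reboot_times if cutoff < t <= now)

now = datetime(2010, 1, 31)
# Hypothetical reboot history for one device.
reboot_times = [
    datetime(2010, 1, 2),
    datetime(2010, 1, 28, 6),
    datetime(2010, 1, 30, 12),
]
counts = {days: reboots_in_window(reboot_times, now, days) for days in (1, 3, 7, 30)}
print(counts)  # {1: 1, 3: 2, 7: 2, 30: 3}
```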

Any ideas would be GREATLY appreciated.


You could do something like Windows 7's reliability metric: start out at full health (say 10). Every hour / day / check-in cycle, increment the health by (10 - currentHealth) * incrementFactor. Every time the device goes down, subtract a certain percentage.

So, given a crashfactor of 20%/crash and an incrementfactor of 10%/day:

  • A device that rebooted a lot in the past but has not rebooted in the last 20 days will have a health of 8.6

  • A device with a big uptime, except for the last 2 days where it rebooted 5 times, will have a health of 4.1

  • A device that has been up for 30 days and just rebooted will have a health of 8

  • A device that has continually rebooted every 24 hrs or so for the last 10 days will have a health of 3.9

To run through an example:

Starting at 10
Day 1: no crash, new health = CurrentHealth + (10 - CurrentHealth)*.1 = 10
Day 2: One crash, new health = CurrentHealth - CurrentHealth*.2 = 8. The daily increment still applies, so new health = 8 + (10 - 8)*.1 = 8.2
Day 3: No crash, new health = 8.4
Day 4: Two crashes, new health = 5.8
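
The four-day walk-through can be reproduced with a short Python sketch. It assumes each crash is applied first (20% off the current health per crash) and the daily increment (10% of the gap to 10) is applied afterwards, which is the order the example uses; the function and parameter names are illustrative:

```python
def update_health(health, crashes, crash_factor=0.2, increment_factor=0.1, max_health=10.0):
    """Apply one day's update: every crash removes a percentage of the
    current health, then the daily recovery toward max_health is added."""
    for _ in range(crashes):
        health -= health * crash_factor
    return health + (max_health - health) * increment_factor

health = 10.0
for day, crashes in enumerate([0, 1, 0, 2], start=1):
    health = update_health(health, crashes)
    print(f"Day {day}: {crashes} crash(es), health = {health:.1f}")
```

Rounded to one decimal this gives 10.0, 8.2, 8.4 and 5.8, matching the example. Tuning crash_factor and increment_factor changes how quickly a device is forgiven for past reboots.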


You might take the reboot count / t of a particular machine and compare that to the mean and standard deviation of the entire population. Those that fall, say, three standard deviations above the mean (that is, they reboot more often) could be flagged.
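
As a rough illustration, the comparison could look like this in Python. The sketch assumes a reboot rate (reboots per fixed period) has already been computed per device; `flag_outliers`, the device ids and the rates are hypothetical:

```python
from statistics import mean, pstdev

def flag_outliers(reboot_rates, threshold=3.0):
    """Return device ids whose reboot rate lies more than `threshold`
    standard deviations above the population mean."""
    rates = list(reboot_rates.values())
    mu = mean(rates)
    sigma = pstdev(rates)
    if sigma == 0:
        return []  # every device reboots at the same rate, nothing stands out
    return [dev for dev, rate in reboot_rates.items() if (rate - mu) / sigma > threshold]

# Hypothetical fleet: reboots per 30 days for each device.
rates = {f"dev{i:02d}": 1.0 for i in range(20)}
rates["dev20"] = 50.0
print(flag_outliers(rates))  # ['dev20']
```

Only deviations above the mean are flagged, since a device that reboots less often than its peers is not a concern here.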


You could use weighted average uptime and include the current uptime only when it would make the average higher.

The weight would be how recent the uptime is, so that most recent uptimes have the biggest weight.
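
One possible reading of this, sketched in Python: weights shrink geometrically with age, and the still-running uptime is included only if doing so raises the average. The `decay` factor and the function name are assumptions, not something the answer specifies:

```python
def weighted_uptime(completed_uptimes, current_uptime, decay=0.5):
    """Recency-weighted average uptime.

    completed_uptimes: finished uptimes, oldest first.
    current_uptime: the still-running uptime, counted only if it raises the average.
    decay: hypothetical factor; each step back in time multiplies the weight by it.
    """
    def weighted_average(uptimes):
        # Newest entry gets weight 1, the one before it `decay`, then decay**2, ...
        weights = [decay ** age for age in range(len(uptimes) - 1, -1, -1)]
        return sum(w * u for w, u in zip(weights, uptimes)) / sum(weights)

    if not completed_uptimes:
        return float(current_uptime)
    without_current = weighted_average(completed_uptimes)
    with_current = weighted_average(list(completed_uptimes) + [current_uptime])
    return max(without_current, with_current)

print(weighted_uptime([10, 10], 1))    # fresh reboot does not drag the score down
print(weighted_uptime([10, 10], 100))  # long current uptime lifts the score
```

Ignoring a short current uptime means a device that has just rebooted after a long run is not penalised immediately.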


Are you able to break the devices out into groups of similar devices? Then you could compare an individual device to its peers.

Another suggestion is to look into various moving average algorithms. These are meant to smooth out time-series data as well as highlight trends.
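
As one example of such an algorithm, here is a small Python sketch of an exponential moving average over daily reboot counts. The choice of this particular moving average, the `alpha` value and the sample series are assumptions made for illustration:

```python
def exponential_moving_average(values, alpha=0.3):
    """Smooth a series; alpha (between 0 and 1) is the weight of the newest value."""
    if not values:
        return []
    smoothed = [float(values[0])]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

daily_reboots = [0, 0, 1, 0, 3, 4, 5]  # hypothetical reboots per day, oldest first
print(exponential_moving_average(daily_reboots))
```

A rising smoothed value suggests a device is getting worse, while an isolated reboot only nudges it briefly.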


Does it always report a runtime of 0 on reboot? Or something close to zero (less than the previous runtime, anyway)?

You could calculate this two ways: 1. The lower the number, the fewer troubles the device had. 2. The higher the number, the longer its uptime periods were.

I guess you need to account for the fact that health can vary, so it can worsen over time. The latest values should therefore carry a higher weight than the older ones. This could indicate exponential growth in the weight.

The more reboots a device had in the last period, the more broken it could be. But the intervals between reboots matter too: 5 reboots in a day means something very different from 10 reboots in 2 weeks. So I guess time should be a metric in this formula, as well as the number of reboots.

I guess you need to calculate the density of reboots in the last period.

You can apply the weight to the density by simply dividing: the larger the number you divide by, the lower the result, and so the lower the weight becomes.

Pseudo code:

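function calcHealth(machine) {
    float value = 0;
    float threshold = 800;

    for each (reboot in machine.reboots) {
        // age of the reboot in days, assuming time() and reboot.time
        // are timestamps in seconds
        reboot.daysPast = (time() - reboot.time) / 86400;

        // the more days past, the lower the value, so the lower the weight
        // (assumes daysPast > 0)
        value += (100 / reboot.daysPast);
    }

    // a value of 0 means no reboots were recorded
    return (value == 0) ? 0 : (threshold / value);
}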

You could extend this function by, for example, filtering on a maxDaysPast and playing with the threshold.

This formula is based on the plot of f(x) = 100/x. For low x the value is higher than for large x, and that is how the formula weights daysPast: lower daysPast == lower x == higher weight.

The value += part counts the reboots, and the 100/x part gives each reboot a weight, where the weight depends on time.

At the return, the threshold is divided by the value, because the higher the reboot score, the lower the result must be.

You can use a plotting program or calculator to see the curve of the plot, which is also the curve of the weight of daysPast.
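
A runnable Python version of the pseudo code might look like the sketch below. Two details are assumptions rather than part of the original: `min_days` floors a reboot's age so a reboot that just happened does not divide by zero, and a device with no reboots gets an infinite score instead of 0 so that it ranks as the healthiest.

```python
from datetime import datetime, timedelta

def calc_health(reboot_times, now, threshold=800.0, min_days=1 / 24):
    """Score a device from its reboot history; higher means healthier.

    Each reboot contributes 100 / days_past, so recent reboots weigh more.
    min_days (an assumption) floors the age to avoid dividing by zero.
    """
    value = 0.0
    for reboot_time in reboot_times:
        days_past = (now - reboot_time).total_seconds() / 86400
        value += 100 / max(days_past, min_days)
    # No reboots at all: treated here as unbounded health (an assumption).
    return float("inf") if value == 0 else threshold / value

now = datetime(2010, 1, 31)
recent = [now - timedelta(days=d) for d in (1, 2)]   # hypothetical reboots 1 and 2 days ago
old = [now - timedelta(days=d) for d in (20, 25)]    # hypothetical reboots 20 and 25 days ago
print(calc_health(recent, now))
print(calc_health(old, now))
```

The device with only old reboots scores much higher than the one that rebooted in the last two days, which is the behaviour the question asks for.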
