I have a SQL Server table named samples, defined like this:
sampledate (datetime, 24 records per day per parameter)
parameterID (int)
value (decimal)
valid (bit, 1=valid data, 0=invalid data)
The pair (sampledate, parameterID) is unique.
Each sampledate falls on the hour (e.g. 02/02/2011 12:00), so there are at most 24 rows per parameterID per day (a probe can fail or be under maintenance, for example, and output fewer than 24 samples).
I have to calculate the average daily values per parameter. The average is valid for a given day only if
- at least 18 valid values are present
- no more than 5 consecutive invalid values are present
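Outside SQL, the two rules are easy to state. Here is a minimal Python sketch (the function name and the (value, valid) tuple shape are my own; I follow the worked example later in the thread, where a run of 5 consecutive invalid samples invalidates the day):

```python
def day_is_valid(samples, min_valid=18, max_run=5):
    """samples: one day's readings as (value, valid) tuples, in time order."""
    valid_count = sum(1 for _, valid in samples if valid)
    run = longest = 0
    for _, valid in samples:
        run = 0 if valid else run + 1   # reset the run on valid data
        longest = max(longest, run)     # track the longest invalid run
    return valid_count >= min_valid and longest < max_run
```

Every SQL solution below computes these two quantities (valid count and longest invalid run) per parameter per day.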
Condition 1) is pretty simple to achieve, for a given @parameter:
SELECT CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, sampledate))) AS avgdate,
AVG(value) AS avgvalue, parameterID,
isValid = CASE
WHEN COUNT(value) >= 18 THEN 1
ELSE 0
END
FROM samples
WHERE parameterID = @parameter
GROUP BY parameterID, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, sampledate))), valid
HAVING valid = 1
ORDER BY avgdate
How can I add condition 2, which boils down to counting consecutive 0s within a 24-hour span, ideally with the best possible performance?
We have millions of samples, and cursors are slow.
And here is my recursive CTE solution; it is parameterisable:
WITH
seq_samples AS (
SELECT
sampledate, parameterID, value, valid,
avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime),
rownum = ROW_NUMBER() OVER (
PARTITION BY parameterID, CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
ORDER BY sampledate)
FROM samples
),
rec_samples AS (
SELECT
sampledate, parameterID, value, valid, avgdate, rownum,
inv_seq_num = 1 - valid
FROM seq_samples
WHERE rownum = 1
UNION ALL
SELECT
ss.sampledate, ss.parameterID, ss.value, ss.valid, ss.avgdate, ss.rownum,
inv_seq_num = CASE ss.valid WHEN 1 THEN 0 ELSE rs.inv_seq_num + 1 END
FROM seq_samples ss
INNER JOIN rec_samples rs ON ss.avgdate = rs.avgdate
AND ss.parameterID = rs.parameterID AND ss.rownum = rs.rownum + 1
)
SELECT
avgdate,
parameterID,
avgvalue = AVG(CASE WHEN valid = 1 THEN value END)
FROM rec_samples
GROUP BY avgdate, parameterID
HAVING SUM(CAST(valid AS int)) >= @minimal_valid_count
AND MAX(inv_seq_num) <= @critical_invalid_count
Your idea is basically implemented here. An additional numbering is used, applied only to invalid rows and broken only by date transitions and by occurrences of valid values. In the end MAX is applied to the numbering column to verify that the longest run has not exceeded @critical_invalid_count. And for the other condition it is obviously enough to check the sum of the valid attributes.
So, there you are.
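For reference, the numbering the recursion builds can be reproduced in plain Python for one (parameterID, avgdate) partition (a hypothetical sketch, not part of the query; the anchor row gets 1 - valid, and each recursive step either resets to 0 on valid data or increments):

```python
def inv_seq_nums(valid_flags):
    """Per-row counter of consecutive invalid samples: 0 on valid rows,
    otherwise the previous counter + 1 -- what inv_seq_num holds per row."""
    nums, run = [], 0
    for valid in valid_flags:
        run = 0 if valid else run + 1
        nums.append(run)
    return nums
```

The MAX of this list over a day is exactly what the HAVING clause compares against @critical_invalid_count.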
EDIT for the seq_samples CTE (to be applied to your adapted version of the original query).
seq_samples AS (
SELECT
*,
rownum = ROW_NUMBER() OVER (
PARTITION BY parameterID, avgdate
ORDER BY sampledate)
FROM (
SELECT
sampledate, parameterID, value, valid,
avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
FROM samples
) s
),
SSMS showed me a significant, practically unbelievable difference in performance between my original query and the modified one (based only on the figures from the estimated execution plan). I don't know what adaptations you had to make to my original solution, but I hope the improvement I've witnessed won't be completely lost because of them.
This is a solution that uses basically the same approach as the one used by Dems. I think the logic in my solution is a bit different. (Or maybe it is just differently structured...)
WITH sortedsamples AS (
SELECT
sampledate,
parameterID,
value,
valid,
avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime),
rownum = ROW_NUMBER() OVER (
PARTITION BY parameterID, CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
ORDER BY sampledate
)
FROM samples
)
SELECT
ss1.parameterID,
ss1.avgdate,
avg = AVG(ss1.value),
isValid = CAST(CASE
WHEN SUM(CAST(ss1.valid AS int)) < 18 THEN 0
ELSE MIN(CAST(ss1.valid | ISNULL(ss2.valid, 1) | ISNULL(ss3.valid, 1)
| ISNULL(ss4.valid, 1) | ISNULL(ss5.valid, 1) AS int))
END AS bit)
FROM sortedsamples ss1
LEFT JOIN sortedsamples ss2 ON ss1.avgdate = ss2.avgdate
AND ss1.parameterID = ss2.parameterID AND ss1.rownum = ss2.rownum + 1
LEFT JOIN sortedsamples ss3 ON ss1.avgdate = ss3.avgdate
AND ss1.parameterID = ss3.parameterID AND ss1.rownum = ss3.rownum + 2
LEFT JOIN sortedsamples ss4 ON ss1.avgdate = ss4.avgdate
AND ss1.parameterID = ss4.parameterID AND ss1.rownum = ss4.rownum + 3
LEFT JOIN sortedsamples ss5 ON ss1.avgdate = ss5.avgdate
AND ss1.parameterID = ss5.parameterID AND ss1.rownum = ss5.rownum + 4
GROUP BY ss1.parameterID, ss1.avgdate
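The bitwise-OR trick above can be paraphrased in Python to see why MIN of the OR'd flags detects a run of five (a sketch with my own names; ss2..ss5 are the four preceding rows, and missing predecessors count as valid, like the ISNULL(..., 1) defaults):

```python
def min_window_or(valid_flags):
    """OR each flag with its four predecessors (missing ones default to 1),
    then take the MIN over the day: 0 means five consecutive invalid rows."""
    padded = [1, 1, 1, 1] + [int(v) for v in valid_flags]
    ors = [
        padded[i] | padded[i + 1] | padded[i + 2] | padded[i + 3] | padded[i + 4]
        for i in range(len(valid_flags))
    ]
    return min(ors) if ors else 1
```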
A couple of thoughts and comments...
There are many ways of turning a datetime into a date-only value. I use DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0). But for the following code I suggest we just pretend there is a justDate field, to make things shorter :)
Sequence is important, and the table doesn't have a "sequence id". ROW_NUMBER() can give this to you...
ROW_NUMBER() OVER (PARTITION BY parameter_id, justDate ORDER BY sampledate) AS "sequence_id"
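As a sanity check of what that window function produces, here is a hypothetical Python equivalent over (parameter_id, sampledate) pairs already sorted by parameter and time (groupby requires sorted input):

```python
from datetime import datetime
from itertools import groupby

def sequence_ids(samples):
    """Emulates ROW_NUMBER() OVER (PARTITION BY parameter_id, justDate
    ORDER BY sampledate): a 1-based counter restarting per parameter per day."""
    out = []
    for _, day in groupby(samples, key=lambda s: (s[0], s[1].date())):
        out.extend((pid, ts, n) for n, (pid, ts) in enumerate(day, start=1))
    return out
```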
There seem to be a couple of ways of doing this.
For each sample, join on the next four samples. I don't like it, but it's probably the simplest option...
WITH
sequenced_samples
AS
(
SELECT
parameterID AS "parameter_id",
sampledate AS "sample_date_time",
DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
ROW_NUMBER() OVER (PARTITION BY parameter_id, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) ORDER BY sampledate) AS "sequence_id",
CASE WHEN valid = 1 THEN value ELSE NULL END as "value",
-(valid - 1) AS "invalid" -- turns 1 to 0, and 0 to 1
FROM
samples
)
SELECT
"sample_1".parameter_id,
"sample_1".sample_date,
AVG("sample_1".value) AS average_value
FROM
sequenced_samples "sample_1"
LEFT JOIN
sequenced_samples "sample_2"
ON "sample_2".parameter_id = "sample_1".parameter_id
AND "sample_2".sample_date = "sample_1".sample_date
AND "sample_2".sequence_id = "sample_1".sequence_id + 1
LEFT JOIN
sequenced_samples "sample_3"
ON "sample_3".parameter_id = "sample_1".parameter_id
AND "sample_3".sample_date = "sample_1".sample_date
AND "sample_3".sequence_id = "sample_1".sequence_id + 2
LEFT JOIN
sequenced_samples "sample_4"
ON "sample_4".parameter_id = "sample_1".parameter_id
AND "sample_4".sample_date = "sample_1".sample_date
AND "sample_4".sequence_id = "sample_1".sequence_id + 3
LEFT JOIN
sequenced_samples "sample_5"
ON "sample_5".parameter_id = "sample_1".parameter_id
AND "sample_5".sample_date = "sample_1".sample_date
AND "sample_5".sequence_id = "sample_1".sequence_id + 4
GROUP BY
"sample_1".parameter_id,
"sample_1".sample_date
HAVING
5 > MAX("sample_1".invalid + ISNULL("sample_2".invalid, 0) + ISNULL("sample_3".invalid, 0) + ISNULL("sample_4".invalid, 0) + ISNULL("sample_5".invalid, 0))
AND 17 < (COUNT(*) - SUM("sample_1".invalid))
The next is slightly more intelligent (but only slightly), but I'm not sitting anywhere with access to MS SQL Server, so I can't tell if it's any more performant.
Instead of four joins, join just once, matching the next four sequential samples, then do two levels of grouping.
WITH
sequenced_samples
AS
(
SELECT
parameterID AS "parameter_id",
sampledate AS "sample_date_time",
DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
ROW_NUMBER() OVER (PARTITION BY parameter_id, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) ORDER BY sampledate) AS "sequence_id",
CASE WHEN valid = 1 THEN value ELSE NULL END AS "value",
-(valid - 1) AS "invalid" -- Turns 0 to 1, and 1 to 0
FROM
samples
)
,
checked_samples
AS
(
SELECT
"sample".parameter_id,
"sample".sample_date,
"sample".value,
"sample".invalid,
"sample".invalid + ISNULL(SUM("next_5_samples".invalid), 0) AS "sequence_invalidity"
FROM
sequenced_samples "sample"
LEFT JOIN
sequenced_samples "next_5_samples"
ON "next_5_samples".parameter_id = "sample".parameter_id
AND "next_5_samples".sample_date = "sample".sample_date
AND "next_5_samples".sequence_id >= "sample".sequence_id + 1
AND "next_5_samples".sequence_id <= "sample".sequence_id + 4
GROUP BY
"sample".parameter_id,
"sample".sample_date_time,
"sample".sample_date,
"sample".invalid,
"sample".value
)
SELECT
parameter_id,
sample_date,
AVG(value)
FROM
checked_samples
GROUP BY
parameter_id,
sample_date
HAVING
5 > MAX(sequence_invalidity)
AND 17 < (COUNT(*) - SUM(invalid))
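The windowed check behind checked_samples amounts to summing invalid flags over each five-sample window; note the window must include the row's own flag, otherwise the sum tops out at 4 and the 5 > MAX test can never fire. A Python sketch with invented names:

```python
def sequence_invalidities(invalid_flags):
    """Per-row sums of invalid flags over the window [i, i+4] (the row itself
    plus up to the next four samples); a value of 5 marks a fully invalid run."""
    return [sum(invalid_flags[i:i + 5]) for i in range(len(invalid_flags))]
```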
The final option is to use a recursive Common Table Expression to loop through the records one by one. This is more complicated to code than a cursor, but is (in my experience) much faster.
EDIT: The following query had a left join in the recursive-CTE, and now it doesn't.
WITH
sequenced_samples
AS
(
SELECT
parameterID AS "parameter_id",
sampledate AS "sample_date_time",
DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
ROW_NUMBER() OVER (PARTITION BY parameter_id, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) ORDER BY sampledate) AS "sequence_id",
value,
valid
FROM
samples
)
,
recursed_totals
AS
(
SELECT
parameter_id,
sample_date,
sequence_id - 1 AS "next_sequence_id",
CASE WHEN valid = 1 THEN value ELSE 0 END AS "cumulative_value",
CAST(valid AS int) AS "cumulative_count",
CASE WHEN valid = 1 THEN 0 ELSE 1 END AS "cumulative_invalid",
CASE WHEN valid = 1 THEN 0 ELSE 1 END AS "max_cumulative_invalid"
FROM
sequenced_samples
WHERE
sequence_id = (
SELECT
COUNT(*)
FROM
sequenced_samples "look_up"
WHERE
"look_up".parameter_id = sequenced_samples.parameter_id
AND "look_up".sample_date = sequenced_samples.sample_date
)
UNION ALL
SELECT
"cumulative_samples".parameter_id,
"cumulative_samples".sample_date,
"next_sample".sequence_id - 1,
"cumulative_samples".cumulative_value + CASE WHEN "next_sample".valid = 1 THEN "next_sample".value ELSE 0 END,
"cumulative_samples".cumulative_count + CAST("next_sample".valid AS int),
CASE
WHEN "next_sample".valid = 0 THEN "cumulative_samples".cumulative_invalid + 1
ELSE 0
END,
CASE
WHEN "next_sample".valid = 1 THEN "cumulative_samples".max_cumulative_invalid
WHEN "cumulative_samples".cumulative_invalid = "cumulative_samples".max_cumulative_invalid THEN "cumulative_samples".max_cumulative_invalid + 1
ELSE "cumulative_samples".max_cumulative_invalid
END
FROM
recursed_totals AS "cumulative_samples"
INNER JOIN
sequenced_samples AS "next_sample"
ON "next_sample".parameter_id = "cumulative_samples".parameter_id
AND "next_sample".sample_date = "cumulative_samples".sample_date
AND "next_sample".sequence_id = "cumulative_samples".next_sequence_id
)
SELECT
parameter_id,
sample_date,
CAST(cumulative_value AS FLOAT) / NULLIF(CAST(cumulative_count AS FLOAT), 0) AS "average",
cumulative_count AS "valid_samples",
max_cumulative_invalid AS "max_consecutive_invalid_samples"
FROM
recursed_totals
WHERE
parameter_id = @parameter_id
AND next_sequence_id = 0
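What the recursion accumulates per day can be mirrored by a single fold in Python (a sketch with my own names; the recursive CTE effectively performs this loop one row at a time):

```python
def accumulate_day(samples):
    """samples: one day's (value, valid) pairs in time order. Returns
    (average over valid values or None, valid count, longest invalid run)."""
    total = count = run = max_run = 0
    for value, valid in samples:
        if valid:
            total, count, run = total + value, count + 1, 0
        else:
            run += 1
            max_run = max(max_run, run)
    return (total / count if count else None), count, max_run
```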
Out of your millions of samples, what percentage have fewer than 5 invalid values per day? If it's a high enough percentage, you're OK, because you can easily exclude those days from cursor processing.
If the number of samples with 5 or more invalid values per day is still in the millions, you may be in for a long wait.
Your solutions are quite interesting (and I've learnt a lot from them), but I wonder if they can be improved.
For instance, all the solutions (maybe except the one using the recursive CTE) are not parametric in the number (N) of consecutive invalid samples. I can imagine that in the future I could have different N for different years or parameters.
I was wondering if a solution could be devised starting from the ROW_NUMBER() solutions: if I could find a way to reset the count at each break in the valid column, I could simply invalidate a day where I find a row having rownum >= N and valid = 0. That would be super simple, fast and versatile.
Let me try to explain this idea better: let's say I could use ROW_NUMBER() or a similar function to obtain this:
date par value valid rownum
2010-01-26 00:00:00.000 25 14.0000000000 1 1
2010-01-26 01:00:00.000 25 15.3000001907 1 2
2010-01-26 02:00:00.000 25 16.8999996185 1 3
2010-01-26 03:00:00.000 25 13.6000003815 1 4
2010-01-26 04:00:00.000 25 16.2000007629 1 5
2010-01-26 05:00:00.000 25 12.1999998093 -1 1
2010-01-26 06:00:00.000 25 17.2000007629 -1 2
2010-01-26 07:00:00.000 25 16.2999992371 1 1
2010-01-26 08:00:00.000 25 18.2999992371 1 2
2010-01-26 09:00:00.000 25 15.0000000000 1 3
2010-01-26 10:00:00.000 25 17.7000007629 1 4
2010-01-26 11:00:00.000 25 16.5000000000 1 5
2010-01-26 12:00:00.000 25 17.3999996185 1 6
2010-01-26 13:00:00.000 25 17.7000007629 1 7
2010-01-26 14:00:00.000 25 18.2999992371 1 8
2010-01-26 15:00:00.000 25 15.1000003815 -1 1
2010-01-26 16:00:00.000 25 16.5000000000 -1 2
2010-01-26 17:00:00.000 25 10.3999996185 -1 3
2010-01-26 18:00:00.000 25 10.8999996185 -1 4
2010-01-26 19:00:00.000 25 10.1000003815 -1 5 <-----!!!!
2010-01-26 20:00:00.000 25 13.6999998093 1 1
2010-01-26 21:00:00.000 25 12.6999998093 1 2
2010-01-26 22:00:00.000 25 15.3999996185 -1 1
2010-01-26 23:00:00.000 25 8.6000003815 -1 2
if N=5, the existence of the row
2010-01-26 19:00:00.000 25 10.1000003815 -1 5
would indicate that the whole day is not valid (regardless of the total number of invalid values)
What do you think about this idea?
(I don't know if this should have been an edit or a distinct answer)
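The counter described above (restarting at every change of the valid flag) is easy to prototype; in T-SQL it is the classic gaps-and-islands pattern, which can usually be obtained without recursion from the difference of two ROW_NUMBER() calls. A hypothetical Python sketch of the target numbering:

```python
def run_numbers(valid_flags):
    """1-based counter that restarts whenever the valid flag changes value,
    reproducing the rownum column in the sample output above."""
    nums, prev = [], object()   # sentinel so the first row always restarts
    for v in valid_flags:
        nums.append(nums[-1] + 1 if v == prev else 1)
        prev = v
    return nums
```

With this numbering, a day is invalidated by the mere existence of a row with rownum >= N on an invalid flag.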
Here is another answer using CROSS APPLY and avoiding using ROW_NUMBER().
If, however, more than 5 samples happen at exactly the same time, for the same parameterID, this won't work correctly. If that's the case, you need ROW_NUMBER() again.
SELECT
parameterID AS "parameter_id",
DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
SUM(value) AS "total",
SUM(CASE WHEN valid = 1 THEN value ELSE 0 END) AS "total_valid",
COUNT(*) AS "count",
SUM(CAST(valid AS int)) AS "count_valid",
MAX(invalid) AS "date_invalidated"
FROM
samples
CROSS APPLY
(
SELECT
CASE WHEN SUM(CAST(valid AS int)) = 0 AND COUNT(*) = 5 THEN 1 ELSE 0 END AS "invalid"
FROM
(
SELECT TOP 5
valid
FROM
samples AS "5_samples"
WHERE
"5_samples".parameterID = "samples".parameterID
AND "5_samples".sampledate >= "samples".sampledate
AND "5_samples".sampledate < DATEADD(DAY, DATEDIFF(DAY, 0, "samples".sampledate), 1)
ORDER BY
"5_samples".sampledate
)
AS "data"
)
AS "check"
WHERE
parameterID = @parameterID
GROUP BY
parameterID,
DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)
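The lookahead in that CROSS APPLY can be paraphrased in Python (a sketch with invented names; note the guard for windows shorter than five near the end of the day, which a bare TOP 5 subquery does not have by itself):

```python
def date_invalidated(valid_flags):
    """1 if any sample starts a window of five consecutive invalid samples
    within the same day, else 0 -- the MAX(invalid) of the outer query."""
    for i in range(len(valid_flags)):
        window = valid_flags[i:i + 5]
        if len(window) == 5 and not any(window):
            return 1
    return 0
```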