Counting values, consecutive and not, on a grouped table


I have an SQL Table named samples, defined like this:

sampledate (datetime, 24 records per day per parameter)
parameterID (int)
value (decimal)
valid (bit, 1=valid data, 0=invalid data)

The pair (sampledate, parameterID) is unique.

Each sampledate is on the hour (e.g. 02/02/2011 12:00), so there are at most 24 rows per parameterID per day (a probe can fail or be in maintenance, for example, and output fewer than 24 samples).
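
For concreteness, the description boils down to a table along these lines (a sketch; the decimal precision and the constraint name are assumptions):

CREATE TABLE samples (
    sampledate  DATETIME        NOT NULL,  -- on the hour; at most 24 rows per day
    parameterID INT             NOT NULL,
    value       DECIMAL(18, 10) NOT NULL,
    valid       BIT             NOT NULL,  -- 1 = valid data, 0 = invalid data
    CONSTRAINT PK_samples PRIMARY KEY (parameterID, sampledate)
);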

I have to calculate the average daily value per parameter. The average for a given day is valid only if:

  1. at least 18 valid values are present
  2. no more than 5 consecutive invalid values are present

Condition 1) is pretty simple to achieve, for a given @parameter:

  SELECT CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, sampledate))) as avgdate,
         AVG(value) as avg, parameterID, 
         isValid = CASE  
           WHEN COUNT(value) > 17 THEN 1
           ELSE 0
         END 
    FROM samples
   WHERE parameterId = @parameter
GROUP BY parameterId, CONVERT(DATETIME, FLOOR(CONVERT(FLOAT, sampledate))), valid
  HAVING valid = 1  
ORDER BY avgdate

How can I add condition 2, which boils down to counting consecutive 0s within a 24-hour span, ideally with the best possible performance?

We have millions of samples, and cursors are slow.


Here is my recursive CTE solution, and it is parameterisable:

WITH
seq_samples AS (   -- number the samples 1..24 within each (parameterID, day)
  SELECT
    sampledate, parameterID, value, valid,
    avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime),
    rownum = ROW_NUMBER() OVER (
      PARTITION BY parameterID, CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
      ORDER BY sampledate)
  FROM samples
),
rec_samples AS (   -- walk each day's samples in order, counting consecutive invalid rows
  SELECT
    sampledate, parameterID, value, valid, avgdate, rownum,
    inv_seq_num = 1 - valid   -- 1 if the day's first sample is invalid, else 0
  FROM seq_samples
  WHERE rownum = 1
  UNION ALL
  SELECT
    ss.sampledate, ss.parameterID, ss.value, ss.valid, ss.avgdate, ss.rownum,
    inv_seq_num = CASE ss.valid WHEN 1 THEN 0 ELSE rs.inv_seq_num + 1 END  -- resets on every valid row
  FROM seq_samples ss
    INNER JOIN rec_samples rs ON ss.avgdate = rs.avgdate
      AND ss.parameterID = rs.parameterID AND ss.rownum = rs.rownum + 1
)
SELECT
  avgdate,
  parameterID,
  avgvalue = AVG(value)
FROM rec_samples
GROUP BY avgdate, parameterID
HAVING SUM(CAST(valid AS int)) >= @minimal_valid_count
   AND MAX(inv_seq_num)        <= @critical_invalid_count

Your idea is basically implemented here. An additional numbering is used, applied only to invalid rows and broken only by date transitions and occurrences of valid values. In the end, MAX is applied to the numbering column to check that the longest invalid run does not exceed @critical_invalid_count. For the other condition it is obviously enough to check the sum of the valid attributes.
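
To see the numbering at work, here is a tiny self-contained demo (the sample values are made up); inv_seq_num grows through each invalid run and resets to 0 at the next valid row:

DECLARE @t TABLE (sampledate DATETIME, valid BIT);
INSERT INTO @t VALUES ('2010-01-26 00:00', 1);
INSERT INTO @t VALUES ('2010-01-26 01:00', 0);
INSERT INTO @t VALUES ('2010-01-26 02:00', 0);
INSERT INTO @t VALUES ('2010-01-26 03:00', 1);
INSERT INTO @t VALUES ('2010-01-26 04:00', 0);

WITH seq AS (
  SELECT sampledate, valid,
         rownum = ROW_NUMBER() OVER (ORDER BY sampledate)
  FROM @t
),
rec AS (
  SELECT sampledate, valid, rownum, inv_seq_num = 1 - valid
  FROM seq WHERE rownum = 1
  UNION ALL
  SELECT s.sampledate, s.valid, s.rownum,
         CASE s.valid WHEN 1 THEN 0 ELSE r.inv_seq_num + 1 END
  FROM seq s INNER JOIN rec r ON s.rownum = r.rownum + 1
)
SELECT sampledate, valid, inv_seq_num FROM rec ORDER BY rownum;
-- inv_seq_num comes out as 0, 1, 2, 0, 1: MAX = 2, the longest invalid run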

So, there you are.


EDIT: a revised seq_samples CTE (to be applied to your adapted version of the original query).

seq_samples AS (
  SELECT
    *,
    rownum = ROW_NUMBER() OVER (
      PARTITION BY parameterID, avgdate
      ORDER BY sampledate)
  FROM (
    SELECT
      sampledate, parameterID, value, valid,
      avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
    FROM samples
  ) s
),

SSMS showed me a significant, almost unbelievable difference in performance between my original query and the modified one (based only on the figures from the estimated execution plans). I don't know what adaptations you had to make to my original solution, but I hope the improvement I've witnessed won't be completely lost because of them.
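
For what it's worth, every variant in this thread groups and joins on (parameterID, day), so if the table doesn't already have one, a covering index along these lines should help any of them (the index name is an assumption):

CREATE INDEX IX_samples_parameter_date
  ON samples (parameterID, sampledate)
  INCLUDE (value, valid);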


This solution uses basically the same approach as the one used by Dems, but I think the logic is a bit different. (Or maybe it is just structured differently...)

WITH sortedsamples AS (
  SELECT
    sampledate,
    parameterID,
    value,
    valid,
    avgdate = CAST(FLOOR(CAST(sampledate AS float)) AS datetime),
    rownum = ROW_NUMBER() OVER (
      PARTITION BY parameterID, CAST(FLOOR(CAST(sampledate AS float)) AS datetime)
      ORDER BY sampledate
    )
  FROM samples
)
SELECT
  ss1.parameterID,
  ss1.avgdate,
  avg = AVG(ss1.value),
  isValid = CAST(CASE
    WHEN SUM(CAST(ss1.valid AS int)) < 18 THEN 0
    -- each row is ORed with the four preceding samples; the result is 0 only
    -- when all five samples in the window are invalid, so MIN = 0 flags the day
    ELSE MIN(CAST(ss1.valid | ISNULL(ss2.valid, 1) | ISNULL(ss3.valid, 1)
                            | ISNULL(ss4.valid, 1) | ISNULL(ss5.valid, 1) AS int))
  END AS bit)
FROM sortedsamples ss1
  LEFT JOIN sortedsamples ss2 ON ss1.avgdate = ss2.avgdate
    AND ss1.parameterID = ss2.parameterID AND ss1.rownum = ss2.rownum + 1
  LEFT JOIN sortedsamples ss3 ON ss1.avgdate = ss3.avgdate
    AND ss1.parameterID = ss3.parameterID AND ss1.rownum = ss3.rownum + 2
  LEFT JOIN sortedsamples ss4 ON ss1.avgdate = ss4.avgdate
    AND ss1.parameterID = ss4.parameterID AND ss1.rownum = ss4.rownum + 3
  LEFT JOIN sortedsamples ss5 ON ss1.avgdate = ss5.avgdate
    AND ss1.parameterID = ss5.parameterID AND ss1.rownum = ss5.rownum + 4
GROUP BY ss1.parameterID, ss1.avgdate


A couple of thoughts and comments...

There are many ways of making a date-time into a date-only value. I use DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0). But for the following code I suggest we just pretend there is a justDate field, to make things shorter :)
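
For reference, a few interchangeable ways to do the truncation (the CAST(... AS date) form needs SQL Server 2008 or later):

SELECT
  DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)      AS justDate1, -- whole days since day zero, added back to day zero
  CAST(FLOOR(CAST(sampledate AS FLOAT)) AS DATETIME) AS justDate2, -- the float trick used elsewhere in this thread
  CAST(CAST(sampledate AS DATE) AS DATETIME)         AS justDate3  -- SQL Server 2008 and later only
FROM samples;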


Sequence is important, and the table doesn't have a "sequence id". ROW_NUMBER() can give this to you...

ROW_NUMBER() OVER (PARTITION BY parameter_id, justDate ORDER BY sampledate) AS "sequence_id"


There seem to be a couple of ways of doing this.

For each sample, join on to the next four samples. I don't like it, but it's probably the simplest option...

WITH
  sequenced_samples
AS
(
  SELECT
    parameterID AS "parameter_id",
    sampledate AS "sample_date_time",
    DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
    ROW_NUMBER() OVER (PARTITION BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) ORDER BY sampledate) AS "sequence_id",  -- the "parameter_id" alias isn't visible here
    CASE WHEN valid = 1 THEN value ELSE NULL END as "value",
    -(valid - 1) AS "invalid" -- turns 1 to 0, and 0 to 1
  FROM
    samples
)

SELECT
  "sample_1".parameter_id,
  "sample_1".sample_date,
  AVG("sample_1".value) AS average_value  -- invalid rows carry NULL values, so they don't skew the average
FROM
  sequenced_samples      "sample_1"
LEFT JOIN
  sequenced_samples      "sample_2"
    ON  "sample_2".parameter_id = "sample_1".parameter_id
    AND "sample_2".sample_date  = "sample_1".sample_date
    AND "sample_2".sequence_id  = "sample_1".sequence_id + 1
LEFT JOIN
  sequenced_samples      "sample_3"
    ON  "sample_3".parameter_id = "sample_1".parameter_id
    AND "sample_3".sample_date  = "sample_1".sample_date
    AND "sample_3".sequence_id  = "sample_1".sequence_id + 2
LEFT JOIN
  sequenced_samples      "sample_4"
    ON  "sample_4".parameter_id = "sample_1".parameter_id
    AND "sample_4".sample_date  = "sample_1".sample_date
    AND "sample_4".sequence_id  = "sample_1".sequence_id + 3
LEFT JOIN
  sequenced_samples      "sample_5"
    ON  "sample_5".parameter_id = "sample_1".parameter_id
    AND "sample_5".sample_date  = "sample_1".sample_date
    AND "sample_5".sequence_id  = "sample_1".sequence_id + 4
GROUP BY
  "sample_1".parameter_id,
  "sample_1".sample_date
HAVING
  -- windows that run past the end of the day sum to NULL, which MAX ignores
  5 > MAX("sample_1".invalid + "sample_2".invalid + "sample_3".invalid + "sample_4".invalid + "sample_5".invalid)
  AND 17 < (COUNT(*) - SUM("sample_1".invalid))


The next is slightly more intelligent (but only slightly), but I'm not sat anywhere with access to MS SQL Server, so I can't tell whether it's any more performant.

Instead of four joins, join just once, matching each sample together with the next four sequential samples, then do two levels of grouping.

WITH
  sequenced_samples
AS
(
  SELECT
    parameterID AS "parameter_id",
    sampledate AS "sample_date_time",
    DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
    ROW_NUMBER() OVER (PARTITION BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) ORDER BY sampledate) AS "sequence_id",  -- the "parameter_id" alias isn't visible here
    CASE WHEN valid = 1 THEN value ELSE NULL END AS "value",
    -(valid - 1) AS "invalid"  -- Turns 0 to 1, and 1 to 0
  FROM
    samples
)
,
  checked_samples
AS
(
SELECT
  "sample".parameter_id,
  "sample".sample_date,
  "sample".value,
  "sample".invalid,
  SUM("next_5_samples".invalid) AS "sequence_invalidity"
FROM
  sequenced_samples      "sample"
INNER JOIN
  sequenced_samples      "next_5_samples"
    ON  "next_5_samples".parameter_id  = "sample".parameter_id
    AND "next_5_samples".sample_date   = "sample".sample_date
    AND "next_5_samples".sequence_id  >= "sample".sequence_id + 1
    AND "next_5_samples".sequence_id  <= "sample".sequence_id + 4
GROUP BY
  "sample".parameter_id,
  "sample".sample_date,
  "sample".valid,
  "sample".value
)
SELECT
  parameter_id,
  sample_date,
  AVG(value) AS "average_value"
FROM
  checked_samples
GROUP BY
  parameter_id,
  sample_date
HAVING
  5 > MAX(sequence_invalidity)
  AND 17 < (COUNT(*) - SUM(invalid))


The final option is to use recursive Common Table Expressions to loop through the records one by one. This is more complicated to code than a Cursor, but is (in my experience) much faster.


EDIT: The following query had a left join in the recursive-CTE, and now it doesn't.

WITH
  sequenced_samples
AS
(
  SELECT
    parameterID AS "parameter_id",
    sampledate AS "sample_date_time",
    DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS "sample_date",
    ROW_NUMBER() OVER (PARTITION BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) ORDER BY sampledate) AS "sequence_id",  -- the "parameter_id" alias isn't visible here
    value,
    valid
  FROM
    samples
)
,
  recursed_totals
AS
(
SELECT
  parameter_id,
  sample_date,
  sequence_id - 1                           AS "next_sequence_id",
  -- the CASTs keep column types identical between the anchor and the recursive member
  CAST(CASE WHEN valid = 1
            THEN value ELSE 0 END AS FLOAT) AS "cumulative_value",
  CAST(valid AS INT)                        AS "cumulative_count",
  CASE WHEN valid = 1 THEN 0     ELSE 1 END AS "cumulative_invalid",
  CASE WHEN valid = 1 THEN 0     ELSE 1 END AS "max_cumulative_invalid"
FROM
  sequenced_samples
WHERE
  sequence_id = (
                 SELECT
                   COUNT(*)
                 FROM
                   sequenced_samples "look_up"
                 WHERE
                       "look_up".parameter_id = sequenced_samples.parameter_id
                   AND "look_up".sample_date  = sequenced_samples.sample_date
                )

UNION ALL

SELECT
  "cumulative_samples".parameter_id,
  "cumulative_samples".sample_date,
  "next_sample".sequence_id - 1,
  "cumulative_samples".cumuatlive_value + CASE WHEN "next_sample".valid = 1 THEN "next_sample".value ELSE 0 END,
  "cumulative_samples".valid + ISNULL("next_sample".valid, 0),
  CASE
    WHEN "next_sample".valid = 0 THEN "cumulative_samples".cumulative_invalid + 1
    WHEN "cumulative_samples".cumulative_invalid = 5 THEN 5
    ELSE 0
  END,
  CASE
    WHEN "next_sample".valid = 1 THEN "cumulative_samples".max_cumulative_invalid
    WHEN "cumulative_samples".cumulative_invalid = "cumulative_samples".max_cumulative_invalid THEN "cumulative_samples".max_cumulative_invalid + 1
    ELSE "cumulative_samples".max_cumulative_invalid
  END
FROM
  recursed_totals   AS "cumulative_samples"
INNER JOIN
  sequenced_samples AS "next_sample"
    ON  "next_sample".parameter_id = "cumulative_samples".parameter_id
    AND "next_sample".sample_date  = "cumulative_samples".sample_date
    AND "next_sample".sequence_id  = "cumulative_samples".next_sequence_id
)
SELECT
  parameter_id,
  sample_date,
  CAST(cumulative_value AS FLOAT) / NULLIF(CAST(cumulative_count AS FLOAT), 0) AS "average",
  cumulative_count AS "valid_samples",
  max_cumulative_invalid AS "max_consecutive_invalid_samples"
FROM
  recursed_totals
WHERE
  parameter_id = @parameter_id
  AND next_sequence_id = 0  -- only the row where the whole day has been accumulated


Out of your millions of samples, what percentage have fewer than 5 invalid values per day? If it's a high enough percentage, you're OK, because you can easily exclude those days from cursor processing.

If the number of samples with 5 or more invalid values per day still runs into the millions, you may be in for a long wait.
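
A sketch of that pre-filter (the temp table name is an assumption): a day with fewer than 5 invalid samples cannot contain 5 consecutive ones, so only the days collected here need cursor processing.

SELECT
  parameterID,
  DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0) AS avgdate
INTO #suspect_days
FROM samples
GROUP BY
  parameterID,
  DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)
HAVING SUM(1 - CAST(valid AS INT)) >= 5;  -- 5 or more invalid samples that day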


Your solutions are quite interesting (and I've learnt a lot from them), but I wonder if they can be improved.

For instance, none of the solutions (except maybe the one using the recursive CTE) is parametric in the number (N) of consecutive invalid samples. I can imagine that in the future I could have a different N for different years or parameters.

I was wondering if a solution could be devised starting from the ROW_NUMBER() solutions: if I could find a way to reset the count at each break in the valid column, I could simply invalidate any day where I find a row having rownum > N and valid = 0. That would be super simple, fast and versatile.

I try to explain this idea better:

Let's say I could use ROW_NUMBER() or a similar function to obtain this:

date                       par    value        valid     rownum

2010-01-26 00:00:00.000 25  14.0000000000       1      1
2010-01-26 01:00:00.000 25  15.3000001907       1      2
2010-01-26 02:00:00.000 25  16.8999996185       1      3
2010-01-26 03:00:00.000 25  13.6000003815       1      4
2010-01-26 04:00:00.000 25  16.2000007629       1      5
2010-01-26 05:00:00.000 25  12.1999998093      -1      1
2010-01-26 06:00:00.000 25  17.2000007629      -1      2
2010-01-26 07:00:00.000 25  16.2999992371       1      1
2010-01-26 08:00:00.000 25  18.2999992371       1      2
2010-01-26 09:00:00.000 25  15.0000000000       1      3
2010-01-26 10:00:00.000 25  17.7000007629       1      4
2010-01-26 11:00:00.000 25  16.5000000000       1      5
2010-01-26 12:00:00.000 25  17.3999996185       1      6
2010-01-26 13:00:00.000 25  17.7000007629       1      7
2010-01-26 14:00:00.000 25  18.2999992371       1      8
2010-01-26 15:00:00.000 25  15.1000003815      -1      1
2010-01-26 16:00:00.000 25  16.5000000000      -1      2
2010-01-26 17:00:00.000 25  10.3999996185      -1      3
2010-01-26 18:00:00.000 25  10.8999996185      -1      4
2010-01-26 19:00:00.000 25  10.1000003815      -1      5  <-----!!!!  
2010-01-26 20:00:00.000 25  13.6999998093       1      1
2010-01-26 21:00:00.000 25  12.6999998093       1      2
2010-01-26 22:00:00.000 25  15.3999996185      -1      1
2010-01-26 23:00:00.000 25  8.6000003815       -1      2

if N=5, the existence of the row

2010-01-26 19:00:00.000 25  10.1000003815      -1      5  

would indicate that the whole day is not valid (regardless of the total number of invalid values).

What do you think about this idea?

(I don't know if this should have been an edit or a distinct answer)
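
For what it's worth, the per-run numbering shown above can be produced without recursion using the classic gaps-and-islands trick: the difference between an overall ROW_NUMBER() and one additionally partitioned by valid is constant within each run. A sketch using the question's column names (@N is the hypothetical threshold parameter):

WITH runs AS (
  SELECT
    sampledate, parameterID, value, valid,
    avgdate = DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0),
    -- the difference of the two numberings is constant within each run of equal valid values
    grp = ROW_NUMBER() OVER (PARTITION BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)
                             ORDER BY sampledate)
        - ROW_NUMBER() OVER (PARTITION BY parameterID, DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0), valid
                             ORDER BY sampledate)
  FROM samples
)
SELECT
  sampledate, parameterID, value, valid,
  -- restarts at 1 at every break in the valid column, exactly as in the table above
  rownum = ROW_NUMBER() OVER (PARTITION BY parameterID, avgdate, valid, grp ORDER BY sampledate)
FROM runs
ORDER BY parameterID, sampledate;
-- with @N = 5, any row having valid = 0 AND rownum >= @N invalidates its whole day (the "!!!!" row above)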


Here is another answer, using CROSS APPLY and avoiding ROW_NUMBER().

If, however, more than 5 samples can happen at exactly the same time for the same parameterID, this won't work correctly. In that case you need ROW_NUMBER() again.

SELECT
  parameterID                                              AS "parameter_id",
  DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)            AS "sample_date",
  SUM(value)                                               AS "total",
  SUM(CASE WHEN valid = 1 THEN value ELSE 0 END)           AS "total_valid",
  COUNT(*)                                                 AS "count",
  SUM(CAST(valid AS INT))                                  AS "count_valid",  -- bit cannot be SUMmed directly
  MAX(invalid)                                             AS "date_invalidated"
FROM
  samples
CROSS APPLY
(
  SELECT
    -- the window only invalidates the day if it truly spans 5 samples, all of them invalid
    CASE WHEN COUNT(*) = 5 AND SUM(CAST(valid AS INT)) = 0 THEN 1 ELSE 0 END AS "invalid"
  FROM
  (
    SELECT TOP 5
      valid
    FROM
      samples AS "5_samples"
    WHERE
          "5_samples".parameterID  = "samples".parameterID
      AND "5_samples".sampledate  >= "samples".sampledate
      AND "5_samples".sampledate  <  DATEADD(DAY, DATEDIFF(DAY, 0, "samples".sampledate), 1)
    ORDER BY
      sampledate
  )
    AS "data"
)
  AS "check"
WHERE
  parameterID = @parameterID
GROUP BY
  parameterID,  -- the SELECT-list alias "parameter_id" cannot be used in GROUP BY
  DATEADD(DAY, DATEDIFF(DAY, 0, sampledate), 0)
