Process for comparing two datasets

Developer https://www.devze.com 2023-03-14 05:35 Source: Internet

I have two datasets at a time (in the form of vectors), and I plot them on the same axes to see how they relate to each other. I specifically look for places where both graphs have a similar shape (i.e. places where both have a seemingly positive/negative gradient over approximately the same intervals). Example:

[Image: Process for comparing two datasets]

So far I have been working through the data graphically, but the amount of data is so large that plotting two sets every time I want to check how they correlate will take far too much time.

Are there any ideas, scripts or functions that might help to automate this process somewhat?

The first thing you have to think about is the nature of the criteria you want to apply to establish similarity. There is a wide variety of ways to measure similarity, and the more precisely you can describe what you want "similar" to mean in your problem, the easier it will be to implement, regardless of the programming language.

Having said that, here are some things you could look at:

correlation of the two datasets
difference of the derivative of the datasets (but I don't think it would be robust enough)
spectral analysis, as mentioned by @thron of three
etc. ...
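
As a rough illustration of the first two items, here is a minimal sketch in plain Python (chosen only to keep the example dependency-free; the thread itself is about Matlab). The function names `pearson`, `windowed_correlation` and `first_difference`, the window length and the sample data are all assumptions made for this example, not something from the question. The idea is to score each stretch of the two vectors with a correlation coefficient instead of inspecting a plot.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    den = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if den == 0:
        return 0.0  # a constant window has no defined correlation; report 0 here
    return sum(p * q for p, q in zip(dx, dy)) / den

def windowed_correlation(a, b, window):
    """Correlation of a and b over each run of `window` consecutive samples."""
    return [pearson(a[i:i + window], b[i:i + window])
            for i in range(len(a) - window + 1)]

def first_difference(x):
    """Discrete derivative: x[i+1] - x[i] for each adjacent pair."""
    return [q - p for p, q in zip(x, x[1:])]

# Made-up data: a rises then falls, b rises throughout.
a = [0, 1, 2, 3, 2, 1, 0, 1, 2]
b = [5, 6, 7, 8, 9, 10, 11, 12, 13]
scores = windowed_correlation(a, b, window=4)
```

Windows where the score is close to +1 are stretches where both vectors move together; scores near -1 mark stretches where they move in opposite directions. The same functions can be applied to `first_difference(a)` and `first_difference(b)` if the gradients, rather than the raw values, are what should be compared.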

Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.

Sure. Call your two vectors A and B.

1) (Optional) Smooth your data, either with a simple averaging filter (Matlab 'smooth') or with the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace).

2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').

3) Add the two differentiated vectors together (element-wise). Call this C.

4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.

5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
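
Steps 1 to 5 could be sketched roughly as follows, again in plain Python rather than Matlab. This is one possible reading of the steps, not a definitive implementation: `moving_average` is only a simple stand-in for Matlab's 'smooth', and the names `similar_segments`, `threshold` and `width` are invented for this example. Step 5 is interpreted here as "two consecutive above-threshold points of C with opposite signs".

```python
def moving_average(x, width):
    """Step 1: centred moving average; the window shrinks at the ends."""
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def diff(x):
    """Step 2: first difference, x[i+1] - x[i]."""
    return [q - p for p, q in zip(x, x[1:])]

def similar_segments(a, b, threshold, width=1):
    """Steps 1-5: return (start, end) index pairs into C.

    Index i of C refers to the interval between samples i and i+1.
    width=1 leaves the data unsmoothed.
    """
    a_s, b_s = moving_average(a, width), moving_average(b, width)
    # Step 3: element-wise sum of the two differentiated vectors.
    c = [p + q for p, q in zip(diff(a_s), diff(b_s))]
    # Step 4: keep the points where |C| exceeds the threshold, with their sign.
    marks = [(i, 1 if v > 0 else -1) for i, v in enumerate(c) if abs(v) > threshold]
    # Step 5: a sign flip between consecutive marked points brackets a segment.
    return [(i, j) for (i, si), (j, sj) in zip(marks, marks[1:]) if si != sj]

# Made-up data: both vectors contain a plateau-topped hill at the same place.
a = [0, 0, 2, 4, 4, 4, 2, 0, 0]
b = [1, 1, 4, 7, 7, 7, 4, 1, 1]
segments = similar_segments(a, b, threshold=3)
```

For this made-up data the result is a single pair bracketing the top of the hill. The threshold still has to be chosen by looking at the real data, as step 4 says.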

Note: a) You could do the smoothing after step 3 rather than after step 1. b) Regarding step 5, you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending into the next hill. Step 5 would then misidentify the stretch between the initial descent and the subsequent ascent as a hill. To avoid this, you could also require that the points in A and B in between the two points of velocity similarity have high absolute values.
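
The extra check in note b) might look something like the sketch below. It assumes the baseline of both vectors is near zero, so that "high absolute value" can be tested against a single cut-off; the function name `is_hill` and the parameter `level` are hypothetical.

```python
def is_hill(a, b, start, end, level):
    """True if every sample of both vectors from index start to end
    (inclusive) has an absolute value of at least `level`."""
    return all(abs(v) >= level
               for seq in (a, b)
               for v in seq[start:end + 1])

# Made-up data. A genuine hill: the samples between the flanks are high.
hill_a = [0, 2, 4, 4, 2, 0]
hill_b = [0, 3, 5, 5, 3, 0]

# A valley between two cut-off hills: the samples in between sit at baseline.
valley_a = [4, 2, 0, 0, 2, 4]
valley_b = [5, 3, 0, 0, 3, 5]

keep_hill = is_hill(hill_a, hill_b, 2, 3, level=1)
keep_valley = is_hill(valley_a, valley_b, 2, 3, level=1)
```

A candidate segment from step 5 would be kept only when this check passes, which discards the descent-then-ascent case described in the note.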