Correlation analysis

From Cybis Wiki

Correlation analysis is a scoring method in which you calculate a value for the covariance between two curves when they lie over each other at a certain position. The calculated value, the correlation coefficient, is a measure of how well the two curves match each other at that position. When Curve sliding is used for the analysis during crossdating, the correlation coefficient is calculated at every possible overlapping position of the curves. Ideally, the highest value found then corresponds to the correct crossdating position.

The correlation coefficient

The correlation coefficient is cumbersome to calculate by hand, so in practice you need a computer for this type of scoring.

Within CDendro the correlation coefficient used is the Pearson product-moment correlation coefficient.[1]

A coefficient value of 1 means that both curves follow each other exactly. A value of -1 means that the curves behave exactly contrary to each other, e.g. when one curve goes up, the other goes down. Correlation coefficient values are always within the limits -1 to +1!

Note that the statistical theory behind the correlation coefficient is defined for relations between random variables. Ring width values, however, are not independent random values: when a ring is thick, there is a high probability that the next year's ring will also be thick. The use of any correlation coefficient within dendrochronology is therefore best justified by practical observations of how efficiently it finds correct crossdatings and how efficiently it sorts out incorrect matches.[2]

Note: When comparing ring width curves, we do the correlation coefficient mathematics on the normalized curves! When you document a best value from such a correlation calculation, you should also document the normalization method used, because the coefficient level required to ascertain a dating differs somewhat with the normalization method.[2][3]

Definition of the correlation coefficient

Define X and Y as paired curve values. There is one X and one Y for each year when the curves lie at a certain position. Define Mx and My as the mean values (or expected values) of each curve, i.e.:

$M_x = E(X)$ and $M_y = E(Y)$

Calculate the standard deviations as:

$\sigma_x = \sqrt{E\left[(X - M_x)^2\right]}$ and $\sigma_y = \sqrt{E\left[(Y - M_y)^2\right]}$

(The standard deviation is a measure of a "normal" (typical) distance from a point on a curve to the mean value of that curve.)

Calculate the correlation coefficient as:

$r = \frac{E\left[(X - M_x)(Y - M_y)\right]}{\sigma_x \, \sigma_y}$
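The definition above can be followed step by step in a few lines of Python. This is only a minimal sketch: the function name `pearson_r` is an illustrative choice, not something taken from CDendro, and the expected values are taken as plain averages over the overlapping years.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient of paired curve values, following the definition above."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least two values")
    n = len(x)
    # Mean values Mx and My
    mx = sum(x) / n
    my = sum(y) / n
    # Standard deviations sigma_x and sigma_y
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    if sx == 0 or sy == 0:
        raise ValueError("a constant curve has no defined correlation")
    # Covariance E((X - Mx)(Y - My)) divided by the product of the deviations
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return cov / (sx * sy)
```

Two curves that rise and fall together in exact proportion give a value of 1 (up to rounding), and mirroring one of them gives -1.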

Overlapping

If we slide the curve of one sample so that it hangs out a bit on either side of the other curve, only a part of the first curve overlaps the other curve. It is usually not meaningful to test the curve fitting when the overlap is shorter than 30 years. For proper crossdating, overlaps shorter than 50-70 years should not be considered.
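Sliding with a minimum overlap might be sketched as follows. The names `slide` and `min_overlap`, and the convention that `offset` is the index in the reference where the first sample value lands, are assumptions made for this illustration and do not describe how CDendro does it internally. Both curves are assumed to be normalized already.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient of paired values; None if either curve is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)

def slide(sample, reference, min_overlap=50):
    """Return (offset, overlap, r) for every position with enough overlap.

    offset is the index in `reference` where sample[0] lands; a negative
    offset means the sample hangs out before the start of the reference.
    """
    results = []
    for offset in range(-(len(sample) - 1), len(reference)):
        start = max(offset, 0)                          # first overlapping reference index
        end = min(offset + len(sample), len(reference))  # one past the last one
        overlap = end - start
        if overlap < min_overlap:
            continue  # too short to be meaningful
        r = pearson_r(sample[start - offset:end - offset], reference[start:end])
        if r is not None:
            results.append((offset, overlap, r))
    return results
```

The candidate position is then the entry with the highest coefficient, e.g. `max(slide(sample, reference), key=lambda item: item[2])`. The overlap is kept alongside each score so that it can be reported together with the value.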

TTest value

The TTest value, also called T-score or T-value, is based on the correlation value, but it also takes into account that a match with a short overlap is worth less than a match with a longer overlap when the correlation values are the same.

TTest values are calculated according to the formula below, where n is the number of overlapping years and r is the correlation coefficient value.

$TTest = r \sqrt{\frac{n - 2}{1 - r^2}}$
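As a small worked sketch of the formula, the function below computes the TTest value from r and n. The name `ttest_value` is only an illustrative choice.

```python
import math

def ttest_value(r, n):
    """TTest value for correlation coefficient r over n overlapping years."""
    if n <= 2:
        raise ValueError("the formula needs more than two overlapping years")
    if abs(r) >= 1:
        raise ValueError("the formula is undefined for r = +1 or r = -1")
    return r * math.sqrt((n - 2) / (1 - r * r))

# Same correlation, different overlap:
# r = 0.5, n = 50  gives 0.5 * sqrt(48 / 0.75)  = 0.5 * 8  = 4.0
# r = 0.5, n = 194 gives 0.5 * sqrt(192 / 0.75) = 0.5 * 16 = 8.0
print(ttest_value(0.5, 50), ttest_value(0.5, 194))
```

With the same correlation coefficient, roughly quadrupling the overlap doubles the TTest value.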


What is a good TTest-value?

In general discussions on T-values, values in the range 2.5-3.5 are often mentioned as significant, with the confidence level set to e.g. 95% or 99%. That may be reasonable in many applications, but not in dendrochronology. A confidence of 95% actually means that you expect one randomly generated (i.e. false) match for every twenty years of sliding, or one for every 100 years with 99% confidence. When you slide a sample along a reference you will normally get hundreds or thousands of scores, all but one of which are random. Therefore the confidence level has to be set far above what is common in other contexts.
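The argument above can be put into numbers with a short sketch. It assumes, as a simplification, that every sliding position acts as an independent random test; the function names are hypothetical.

```python
def expected_false_matches(positions, confidence):
    """Expected number of random scores that pass the threshold anyway."""
    return positions * (1.0 - confidence)

def chance_of_any_false_match(positions, confidence):
    """Probability that at least one random score passes the threshold."""
    return 1.0 - confidence ** positions

# Sliding a sample over 1000 positions of a reference:
for confidence in (0.95, 0.99):
    print(confidence,
          round(expected_false_matches(1000, confidence), 1),
          round(chance_of_any_false_match(1000, confidence), 4))
```

Under these assumptions, 1000 positions give about 50 expected false matches at 95% and about 10 at 99%, so a threshold chosen that way cannot single out the one true position.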

A practical way to find out which T-values are reasonable for a certain normalization method is to test a large number of samples of a specified length (blocks) against a long, unrelated reference. Such a test has shown that a reasonable minimum level of the T-value is often in the range 5.5-6.5, depending on sample length and normalization method.[2]


Use the correlation coefficient to describe the similarity between curves - never just the TTest value!

A high enough TTest value certifies that two ring width curves are properly crossdated, provided that the overlap between the curves is also sufficient. A TTest value given without the corresponding overlap length says nothing about the similarity between the curves. The worst case is to specify two TTest values as a demonstration of similarity without mentioning that the values were obtained with different overlap lengths.
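To see why a bare TTest value says little about similarity, the TTest formula can be solved for r, which gives r = T / sqrt(T^2 + n - 2). The sketch below uses that rearrangement; the name `r_from_ttest` is hypothetical.

```python
import math

def r_from_ttest(t, n):
    """Correlation coefficient that yields TTest value t over n overlapping years."""
    if n <= 2:
        raise ValueError("the formula needs more than two overlapping years")
    return t / math.sqrt(t * t + n - 2)

# The same TTest value of 6 at two different overlap lengths:
print(round(r_from_ttest(6, 50), 3))   # overlap of 50 years
print(round(r_from_ttest(6, 300), 3))  # overlap of 300 years
```

A TTest value of 6 corresponds to a correlation coefficient of about 0.65 with a 50-year overlap, but only about 0.33 with a 300-year overlap.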

When you want to describe the similarity between curves from two different regions, use the correlation coefficient value as a measure and then also specify the normalization method used.

Notes

  1. See the Wikipedia (English) article on the Pearson product-moment correlation coefficient.
  2. Torbjörn Axelson and Lars-Åke Larsson: What is a good TTest value
  3. There seems to be a custom (especially in Europe?) just to note the t-value. It is in those cases usually the t-value according to Baillie/Pilcher normalization which is referred to.