How to do a linear analysis of data:
The first step is to see if indeed this process makes any sense at all.
That is, do the data appear to be linearly related? This step is
usually accomplished by doing a "scatter plot" of the data. This means
plotting them as x-y points on a graph. The choice of ranges for the
axes matters here a great deal. Mischoice of one of the axes could lead
to all the points falling over a very small portion of the graph,
making it impossible to tell whether there is any potential
relationship in the data set. Thus, choose each axis so that its range
is approximately the same as the range of the corresponding data. For
example, if you have a set of body weights ranging from 5 grams to 160
grams, it would be reasonable to choose the axis for body weight to run
from 0 to 160 grams or so.
Which is the x (horizontal) axis and which is the y (vertical)? If you
have some reason to expect that one of the measurements is the
dependent one (e.g. body weight depends on age, rather than the
reverse), then choose the one that is dependent (weight) as the y axis
and the independent one (age) as the horizontal axis. If you do not
have any reason to expect that one of the measurements is caused by the
other, then the choice of which measurement is plotted on which axis
doesn't matter. Just choose one and go with it.
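As a sketch of this step (shown in Python with matplotlib rather than Matlab; the ages and weights below are made-up numbers for illustration, not real measurements):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical data: ages (days) on x, body weights (grams) on y,
# since weight plausibly depends on age.
ages = [9, 15, 20, 22, 34, 44, 49]
weights = [12, 25, 38, 40, 65, 84, 95]

fig, ax = plt.subplots()
ax.scatter(ages, weights)
# Choose axis ranges comparable to the data ranges, starting at 0,
# so the points do not all crowd into one corner of the plot.
ax.set_xlim(0, max(ages) + 5)
ax.set_ylim(0, max(weights) + 10)
ax.set_xlabel("Age (days)")
ax.set_ylabel("Body weight (grams)")
fig.savefig("scatter.png")
```

The key point is the explicit choice of axis limits: roughly the span of the data, rather than whatever default the plotting tool picks.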
The next step is to eyeball the data and see if there appears to be any
relationship. If the scatter plot looks like the points might be
described at least approximately by a line (e.g. don't worry if there
is a lot of scatter about any line you might draw on the graph), then
it is reasonable to proceed with fitting a line. The next step then is
eyeballing the data and making a guess at a line without doing any
calculations or using any program. This will not give you an exact
answer, but it will be useful later in checking that the line obtained
from a computer or other methods makes sense. To eyeball, just quickly
draw a line through the data, estimate a rough slope, and then write
down the formula for the line using the point-slope form, that is
y - y1 = m * (x - x1), where m is the slope and the point you have
chosen is (x1, y1). Alternatively, use the slope-intercept form
y = m * x + b, where b is the y-intercept, if you can easily estimate
where the line crosses the y-axis. Be sure in doing this that you keep
your units straight, so that you know what units each measurement, and
thus the slope, is in.
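An eyeballed line gives predictions directly; a small Python sketch (the slope and point here are made-up values of the sort you would read off your hand-drawn line):

```python
# Suppose the eyeballed line passes through (x1, y1) = (20, 40)
# (age 20 days, weight 40 grams) with an estimated slope of about
# 2 grams per day. These numbers are illustrative, not from real data.
m = 2.0            # slope, in grams per day
x1, y1 = 20.0, 40.0

def eyeball_line(x):
    """Point-slope form y - y1 = m * (x - x1), solved for y."""
    return y1 + m * (x - x1)

print(eyeball_line(30))  # predicted weight (grams) at age 30 days -> 60.0
```

Keeping the units attached to m, x1, and y1, as in the comments, is exactly the bookkeeping the text asks for.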
What if the data do not appear to be linearly related? Then don't try
to fit a straight line. Later we'll discuss ways to transform the data
to see if a different relationship might be a better choice.
If it looks like a linear relationship is a reasonable assumption, the
next step is to find the "least squares fit". This can be done
automatically within Matlab, and many calculators can also do this. The
idea is to choose a line that "best fits" the data in that the data
points have the minimum sum of their vertical distances from this
particular line. One way that you might code a computer to do this is:
(i) Guess at the equation for the line, y = a*x + b.
(ii) Measure the vertical distance from each data point to the line
you have chosen.
(iii) Sum up the distances chosen in (ii).
(iv) Change the slope a and the intercept b to see if you can reduce
the sum obtained in (iii).
(v) Continue this until you get tired or you've done the
best you can.
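The steps (i)-(v) above can be sketched in code (Python here; the data are made up, and the search is a crude shrinking-step pattern search, not what a real fitting routine does, but it follows the recipe literally, using squared vertical distances so they cannot cancel):

```python
# Made-up data: ages (x, days) and weights (y, grams).
xs = [9, 15, 20, 22, 34, 44, 49]
ys = [12, 25, 38, 40, 65, 84, 95]

def total_squared_distance(a, b):
    # Steps (ii)-(iii): vertical distance from each point to y = a*x + b,
    # squared and summed.
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Step (i): an initial guess at slope and intercept.
a, b = 1.0, 0.0
step = 1.0
# Steps (iv)-(v): nudge a and b while that reduces the sum; when no
# nudge of the current size helps, shrink the step and keep going.
while step > 1e-4:
    improved = True
    while improved:
        improved = False
        for da in (-step, 0.0, step):
            for db in (-step, 0.0, step):
                if total_squared_distance(a + da, b + db) < total_squared_distance(a, b):
                    a, b = a + da, b + db
                    improved = True
    step /= 2

print(a, b)  # should end up close to the least squares slope and intercept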
One of the potential problems with the above is what "distance" to use
in (ii). If you allow some distances to be positive and some negative
(e.g. for a point above the line and a point below the line), then the
distances can cancel, which is not appropriate. So we want all
distances to be positive and the standard way to do this is the just
square each vertical distance found in (ii), and produce the sum of the
square distances to get (iii). The line we would get by going through
step (v) would then be called the "least squares fit" since it chooses
the line so as to minimize the sum of the square deviations of points
from the line.
It turns out that it is not necessary to go through the steps (i)-(v)
above at all. It can be proven that the "best" values of the slope a
and the intercept b can be obtained from a relatively simple formula
that just uses the x and y values for all the data points. Matlab does
this easily for you using the command "C=polyfit(A,W,1)" which will
produce a vector C in which the first value is the best fit for the
slope a and the second is the best fit for the intercept b for the
least squares fit of the vector of data W (on the vertical axis) to the
vector of data A (on the horizontal axis). Think of A as giving a
vector of ages (in days) of bats and the vector W giving the weights of
these bats (in grams). Note that the units of the components of C
depend upon the units the data are measured in. The first component of
C is a slope, so it has units of grams per day for the bat example, and the
second component of C (the y-intercept) has the same units as the
measurement on the y-axis (grams in the case of the bats).
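A sketch of this step, here in Python with NumPy, whose np.polyfit mirrors Matlab's polyfit (the weights below are made-up illustrative numbers):

```python
import numpy as np

# Hypothetical bat data: ages in days, weights in grams.
A = np.array([9, 15, 20, 22, 34, 44, 49])   # horizontal axis
W = np.array([12, 25, 38, 40, 65, 84, 95])  # vertical axis

# Degree-1 (linear) least squares fit, as in Matlab's C = polyfit(A, W, 1).
C = np.polyfit(A, W, 1)
slope, intercept = C[0], C[1]  # grams per day, grams
print(slope, intercept)
```

As in Matlab, the first component of C is the slope and the second is the intercept, with units inherited from the data.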
Once you have a linear least squares fitted line, you can proceed to
use it to interpolate (find the y-value predicted by the linear fit for
an x-value that falls in the range of the x-values in your data set),
or to extrapolate (find the y-value predicted by the linear fit for an
x-value that falls outside the range of the x-values in your data set).
Thus if you have body weights W (in grams) for bats of ages A (in
days) 9, 15, 20, 22, 34, 44 and 49, and it appears that a linear fit
to the data is reasonable, you can interpolate to find the weight of a
bat of age 30 days, or you can extrapolate to find the weight of a bat
of age 60 days. All you do to find these weights is to plug the age
into the equation of the line and calculate the associated y-value
(weight in grams). Matlab makes this easy by using the command
"Y=polyval(C,30)" which will give the best guess according to the
linear fit for the weight of a bat of age 30 days.
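The evaluation step can be sketched the same way in Python with NumPy, whose np.polyval mirrors Matlab's polyval (the weights are the same made-up numbers as before):

```python
import numpy as np

A = np.array([9, 15, 20, 22, 34, 44, 49])   # ages in days
W = np.array([12, 25, 38, 40, 65, 84, 95])  # made-up weights in grams
C = np.polyfit(A, W, 1)                      # linear least squares fit

w30 = np.polyval(C, 30)  # interpolation: 30 days lies inside the data range
w60 = np.polyval(C, 60)  # extrapolation: 60 days lies outside the data range
print(w30, w60)
```

Both calls just evaluate the fitted line at the given age; only the position of the age relative to the data range distinguishes interpolation from extrapolation.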
How can we tell if a linear fit is any good?
This is where we make use of the notion of a correlation. In common
parlance, we say two measurements are "correlated" if there appears to
be some relationship between them, though this relationship need not be
causal. Thus leaf length and width might be related to each other, but
neither is caused by the other. They might simply be related due to the
age of the leaf or the environmental conditions under which the leaf
developed (e.g. better nutrients and water could lead to a larger
leaf).
There is a formal definition of correlation that we will use which
essentially tells how close to linearly related two measurements are.
Note that this is restricted to being a measure for linear
relationships. If two measurements are related, but not linearly, then
the correlation we estimate may be small even though the measurements
are actually closely related. For example, human body weight is
certainly related to age as an individual grows, but growth is not
linear at all, and so a correlation may not be the best way of saying
that these two variables are related.
Correlation is measured by the "correlation coefficient" for which the
small Greek letter rho is typically used (I'll call it r here). The
calculation of r follows easily from the x and y values of the data
set. The coefficient r is a measure of the strength of the relationship
between the two measurements, scaled in such a way that if two
measurements fell exactly on a straight line with positive slope, then
r = 1, while if they fell exactly on a line with negative slope, r =
-1. In these cases we say the data are "perfectly positively
correlated" or "perfectly negatively correlated". If r = 0 we say the
data are "uncorrelated" but again this doesn't mean the data are not
related - for example if the data fell on a parabola y = (x-1)^2 for
values of x between 0 and 2, the correlation would be near zero but the
data are certainly closely related.
Again Matlab makes it easy to calculate the correlation coefficient of
two vectors using "corrcoef(A,W)" which computes the correlation
coefficient of the data given in the two vectors A and W. Many
calculators will compute this as well. Note that the correlation
coefficient is a dimensionless number - in calculating it the
dimensions of the measurements cancel out.
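A quick numerical check of these points in Python with NumPy (np.corrcoef plays the role of Matlab's corrcoef), using the perfectly correlated case and the parabola example above:

```python
import numpy as np

# Points exactly on a line with positive slope: r = 1.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
r_line = np.corrcoef(x, 3 * x + 1)[0, 1]

# Points on the parabola y = (x - 1)^2 for x between 0 and 2:
# r is essentially zero even though y is completely determined by x.
y = (x - 1) ** 2
r_parab = np.corrcoef(x, y)[0, 1]

print(r_line, r_parab)
```

Both results are dimensionless, and the second shows why r near zero means "no linear relationship", not "no relationship".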