Constructing Bar Charts and Histograms:
Bar charts are easy to construct if the underlying variable is ordinal
(e.g. the number of female or male meerkats babysitting) since you
simply make the height of each bar be the number of observations
associated with each nominal value. There is no issue in this case with
defining the width of the range covered by each bar. If the underlying
variable is ordinal but discrete (e.g. the number of tomatoes produced
per plant must be an integer), then the only decision to be made is how
to lump together the possible outcomes in deciding the range covered by
each bar. In this case, generally you should assign equal ranges to
each bar and use 10-20 bars, assuming the underlying data set has
sufficient points. For example, if the number of tomatoes per plant
varied from 1 to 50, you could use 10 bars each of width 5 (1-5, 6-10,
... , 46-50). Difficulties arise here if the number of bars doesn't
divide the range of the data evenly - there is no one solution to this,
because one bar would either have to be wider or narrower than the
others, or else you must extend the range of the data beyond the
observations.
Constructing a histogram for a continuous data set involves:
(1) Decide on the number of classes (bars) for the histogram. Typically
choose this to be 10-20, but if there is a small dataset, choose this
smaller. A rule of thumb might be that the number of observations
divided by the number of classes should be at least 4.
(2) Choose the width of the classes (all widths will be the same) by
dividing the range of the data (Maximum value observed minus minimum
value observed) by the number of classes from step (1). Round this
value up so that the class width has the same precision (number of
significant digits) as the measurements.
(3) Select the smallest observed value as the lower limit of the first
class (e.g. the left hand value for the first class), and add multiples
of the class width from step (2) to obtain the lower limits for each
class.
(4) Find the upper class limits (the right hand value for each class)
by adding the class width to the lower class limits and subtracting
from this the smallest significant digit in the observations. So if the
class width is 0.3 and the lower limit of the first class is 3.7 (the
finest precision of the original data was .1 unit), then the upper
class limit for the first class is 3.7 + .3 -.1 = 3.9 and the next
class would have lower class limit 4.0 (so that there is a gap between
the upper class limit of one class and the lower class limit of the
next class, and this gap is the precision of the original data.
(5) For the boundaries between classes, use the value which is halfway
between the upper class limit of one class and the lower class limit of
the next class. So in the above example, the class boundary would be
3.95 and the lowest class would be 3.7-3.95 with the next class being
3.95-4.25 and so on.
(6) Count up the number of observations in each class and make the
height of the bar for each class equal to these numbers. You can also
constuct a "frequency histogram" by dividing the height of each bar of
the histogram (the number of observations falling in that class) by the
total number of observations (the sample size).
Linear Regression:
One objective in using descriptive statistics is to aid in
understanding whether there is a relationship between two measurements
associated with an observation (e.g. are leaf length and leaf width
related, are weights of bats related to their age since birth, is
respiration rate of an individual related to the temperature of the
environment around that individual). The simplest type of relationship
that would exist between two measurements is that they are proportional
(so that doubling one value leads to a doubling in the other). A slight
extension of this is that they are linearly related (e.g. if the
measurements are L and W, then L= a*W + b where a and b are
constants).
A regression is a formula that provides the "best fit" of a particular
mathematical equation to a set of data. A simple regression occurs if
there are two variables being considered only, and a linear regression
is one in which the relationship is assumed to be linear and the "best
line" which goes through the data on a standard x-y graph is obtained.
This is not meant to imply that one of the variables necessarily
"determines" the other one, but rather that any complex dependence
between them may be adequately summarized as a linear one. Strictly
speaking, when there is reason to believe that there is a causal
relationship between the measurements (e.g. variation in weights of
bats is due to differences in their ages, but variation in ages of bats
is in no way "caused" by their weights) then the term regression is
used, otherwise we say there is a "correlation". Thus variation in
leaf length is correlated with leaf width, but not "caused" by it.
The standard method to find the "best fit" line through a data set is
called least squares - it is the line which minimizes the sum of the
squares of the differences between the height of the line (y-value) and
height (y-value) of each data point, where this sum is taken over all
data points. This best fit line minimizes the sum of the square
residuals. There are formulas for the slope and y-intercept of the best
fit line (the regression line) that can be calculated using the
original data points.