A Math Primer
Basic Concepts of Statistics
Note: This section is not intended to provide full coverage of statistics; a formal book on statistical methods and applications is more appropriate for that. Instead, it intends to provide a quick overview of simple statistical approaches used to establish relationships between data and how these can be used in solving some environmental problems.
Introduction
What is Statistics?
Statistics is the discipline concerned with the
collection, organization, and interpretation of
numerical data, especially as it relates to the
analysis of population characteristics by inference
from sampling. It addresses all elements of
numerical analysis, from study planning to the
presentation of final results. Statistics, therefore, is
more than a compilation of computational
techniques. It is a means of learning from data, a
way of viewing information, and a servant of all
science.
Put simply, Statistics boils down to two approaches: exploration and adjudication. The purpose of exploration is to uncover patterns and clues within data sets. Adjudication, on the other hand, serves to determine whether the uncovered patterns are valid and can be generalized. Both approaches are equally important, and neither can be minimized in the statistical process of data analysis. Statistics is a great quantitative tool for making any method of enquiry more meaningful and, in particular, as objective as possible. However, one must avoid falling into the trap of the “black hole of empiricism”, whereby data are analyzed in the hope of discovering the fundamental “laws” responsible for observed outcomes. One must first establish an explanatory protocol of what these laws/processes might be and then use Statistics (among other tools) to test the appropriateness, and sometimes exactness, of such explanations. This pre-formulation of plausible explanations is at the core of the “scientific method” and is called “hypothesis formulation”. Hypotheses are established as educated hunches to explain observed facts or findings and should be constructed in ways that lead to anticipatory deductions (also called predictions). Such predictions should of course be verifiable through data collection and analysis. This is probably where Statistics comes in most handy, in helping judge the extent to which the collected data agree with the established predictions (although Statistics also contributes substantially to the formulation of test protocols and how data might be collected to verify hypotheses).
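As a purely illustrative sketch of this last point, the short Python fragment below checks a hypothetical prediction against made-up measurements with a one-sample t-test; the variable, the numbers, and the 0.05 threshold are assumptions chosen for the example, not taken from this text.

# Illustrative sketch only: a hypothetical prediction that mean stream
# temperature is 15 degrees C, judged against made-up field measurements.
from scipy import stats

observed = [14.2, 15.1, 13.8, 15.6, 14.9, 14.4, 15.3, 14.7]  # hypothetical data
t_stat, p_value = stats.ttest_1samp(observed, popmean=15.0)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:        # conventional (assumed) significance threshold
    print("Data disagree with the predicted mean of 15 C.")
else:
    print("Data are consistent with the predicted mean of 15 C.")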
Statistics thus seeks to make each process of the
scientific method (observation, hypothesis
formulation, prediction, verification) more objective
(so that things are observed as they are, without
falsification according to some preconceived view)
and reproducible (so that we might judge things in
terms of the degree to which observations might be
repeated).
It is beyond the scope of this short introduction to go over the full range of possible statistical analyses. In fact, this text explores only selected issues related to statistics, leaving room for a true course in statistics (applied or theoretical) to develop all concepts more fully. Below we talk succinctly about variables, summary statistics, and the evaluation of linear relationships between two variables.
A. Measurement
To perform statistical operations we need an object of analysis. For this, numbers (or codes) are used as the quantitative representation of any specific observation. The assignment of numbers or codes to describe a pre-set subject is called measurement. Measurements that can be expressed by more than one value during a study are called variables. Examples of variables are AGE of individuals, WEIGHT of objects, or NAME of species. Variables only represent the subject of the measurement, not any intrinsic value or code. Variables can be classified according to the way in which they are encoded (i.e. numeric, text, date) or according to the scale on which they are measured. Although there exist many ways to classify measurement scales, three will be considered here:
Nominal (qualitative, categorical)
Ordinal (semi-quantitative, “ranked”)
Scale (quantitative, “continuous”, interval/ratio)
Nominal variables are categorical attributes that have no inherent order. For example, SEX (male or female) is a nominal variable, as are NAME and EYECOLOR.
Ordinal variables are rank-ordered characteristics and responses. An example is an opinion graded on a 1-5 scale (5 = strongly agree; 4 = agree; 3 = undecided; 2 = disagree; 1 = strongly disagree). Although the categories can be put in ascending (or descending) order, the distances (“differences”) between possible responses are uneven (i.e. the distance between “strongly agree” and “agree” is not the same as the distance between “agree” and “undecided”). This makes the measurement ordinal, and not scale.
Scale variables represent quantitative measurements in which differences between possible responses are uniform (or continuous). For example, LENGTH (measured in centimeters) is a scale measurement. No matter how far you subdivide the measurement into smaller fractions (e.g. a tenth of a centimeter), the difference between one measurement and the next remains the same (i.e. the difference between 3 centimeters and 2 centimeters, or 3 millimeters and 2 millimeters, is the same as that between 2 cm and 1 cm, or 2 mm and 1 mm).
Notice that each step up the measurement scale hierarchy takes on the assumptions of the step below it and then adds another restriction. That is, nominal variables are named categories. Ordinal variables are named categories that can be put into logical order. Scale variables are ordinal variables that have equal distances between possible responses.
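To make the hierarchy concrete, here is a minimal Python sketch (with hypothetical data) of how the three scales might be encoded using the pandas library: nominal as an unordered category, ordinal as an ordered category, and scale as a plain numeric column.

# Minimal sketch, hypothetical values: encoding nominal, ordinal, and scale
# variables with pandas.
import pandas as pd

df = pd.DataFrame({
    "EYECOLOR": pd.Categorical(["brown", "blue", "brown"]),       # nominal: no order
    "OPINION": pd.Categorical([5, 3, 4],
                              categories=[1, 2, 3, 4, 5],
                              ordered=True),                       # ordinal: ranked 1-5
    "LENGTH_CM": [12.4, 8.1, 15.0],                                # scale: equal intervals
})

print(df.dtypes)
print(df["OPINION"].min())      # ordering is meaningful for ordinal data
print(df["LENGTH_CM"].mean())   # arithmetic is meaningful only for scale data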
Data Quality
Something must be said about the quality of the data used. A statistical analysis is only as good as its data, and interpretative limitations may be imposed by the quality of the data rather than by the analysis. In addressing data quality, we must make a distinction between measurement error and processing error. Measurement error is the difference between the “true” quality of the object observed (e.g. the true length of a fish) and what is recorded during data collection (the actual scale measurement collected during the study). Processing errors are errors that occur during data handling (e.g. wrong data reporting, erroneous rounding or transformation). One must realize that errors are inherent to any measurement and that avoiding them entirely is virtually impossible. What must be done is to characterize these errors and try to minimize them as best as possible.
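As a toy illustration (all numbers below are assumed, not drawn from any real study), the following Python fragment contrasts the two error types: a measurement error introduced when the value is read, and a processing error introduced afterwards by an erroneous rounding step.

# Toy simulation, assumed values: measurement error vs. processing error.
import random

random.seed(1)
true_length = 23.47                              # "true" length of a fish, cm
measured = true_length + random.gauss(0, 0.2)    # measurement error: reading noise
processed = round(measured)                      # processing error: overly coarse rounding

print(f"true = {true_length}, measured = {measured:.2f}, processed = {processed}")
print(f"measurement error = {measured - true_length:+.2f} cm")
print(f"processing error  = {processed - measured:+.2f} cm")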
Population and Sample
Most statistical analyses are done to learn about a specific population (the total number of trout in a specific river, the concentration of a contaminant in a lake’s total sediment bed). The population is thus the universe of all possible measurements in a defined unit. When the population is real, it is sometimes possible to obtain information on the entire population. This type of study is called a census. However, performing a census is usually impractical, expensive, and time-consuming, if not downright impossible. Therefore, nearly all statistical studies are based on a subset of the population, which is called a sample. Whenever possible, a probability sample should be used. A probability sample is a sample in which (a) every population member (item) has a known probability of being sampled, (b) the sample is drawn by some method of chance consistent with these probabilities, and (c) selection probabilities are taken into account when making estimates from the sample.
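As a minimal sketch (using synthetic data rather than a real population), the Python fragment below draws a simple random sample, the simplest kind of probability sample, in which every member has the same known chance of selection, and uses the sample mean to estimate the population mean.

# Minimal sketch, synthetic data: a simple random sample as a probability sample.
import random

random.seed(42)
population = [random.gauss(30, 5) for _ in range(10_000)]  # e.g. lengths of trout, cm (assumed)
n = 100
sample = random.sample(population, n)                       # every item has probability n/N

estimate = sum(sample) / n
true_mean = sum(population) / len(population)
print(f"inclusion probability for each item = {n / len(population):.4f}")
print(f"sample estimate = {estimate:.2f}, population mean = {true_mean:.2f}")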