Chapter 3 (Technical English for Statistics)
Data Description
Definitions
Statistic
Characteristic or measure obtained from a sample
Parameter
Characteristic or measure obtained from a population
Mean
Sum of all the values divided by the number of values. This can either be a population
mean (denoted by mu) or a sample mean (denoted by x bar)
Median
The midpoint of the data after being ranked (sorted in ascending order). There are as
many numbers below the median as above the median.
Mode
The most frequent number
Skewed Distribution
The majority of the values lie together on one side with a very few values (the tail) to
the other side. In a positively skewed distribution, the tail is to the right and the mean
is larger than the median. In a negatively skewed distribution, the tail is to the left and
the mean is smaller than the median.
Symmetric Distribution
The data values are evenly distributed on both sides of the mean. In a symmetric
distribution, the mean is the median.
Weighted Mean
The mean when each value is multiplied by its weight and summed. This sum is
divided by the total of the weights.
Midrange
The mean of the highest and lowest values. (Max + Min) / 2
Range
The difference between the highest and lowest values. Max - Min
Population Variance
The average of the squares of the distances from the population mean. It is the sum of
the squares of the deviations from the mean divided by the population size.
Sample Variance
Unbiased estimator of a population variance. Instead of dividing by the population
size, the sum of the squares of the deviations from the sample mean is divided by one
less than the sample size.
Standard Deviation
The square root of the variance. The population standard deviation is the square root
of the population variance and the sample standard deviation is the square root of the
sample variance. The sample standard deviation is not an unbiased estimator of the
population standard deviation.
Coefficient of Variation
Standard deviation divided by the mean, expressed as a percentage.
Chebyshev's Theorem
The proportion of the values that fall within k standard deviations of the mean is at
least 1 - 1/k^2, where k is any number greater than 1.
Average could mean one of four things: the arithmetic mean, the median, the midrange, or the
mode. For this reason, it is better to specify which average you're talking about.
Mean
Population Mean: mu = (sum of x) / N
Sample Mean: x bar = (sum of x) / n
Median
The data must be ranked (sorted in ascending order) first. The median is the number in the
middle.
Several formulas could be used to find the depth of the median. The one that we will use is:
Depth of median = 0.5 * (n + 1)
Raw Data
The median is the number in the "depth of the median" position. If the sample size is even, the
depth of the median will be a decimal -- you need to find the midpoint between the numbers
on either side of the depth of the median.
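The depth rule for raw data can be sketched in a few lines of Python. This is only an illustration: the function name median_by_depth and the sample values are made up for the example.

```python
def median_by_depth(values):
    """Median of raw data using depth = 0.5 * (n + 1)."""
    ranked = sorted(values)          # the data must be ranked first
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of empty data is undefined")
    depth = 0.5 * (n + 1)            # 1-based position of the median
    lower = int(depth)               # position at or just below the depth
    if depth == lower:               # odd n: the depth is a whole number
        return ranked[lower - 1]
    # even n: the depth ends in .5, so take the midpoint of the two neighbours
    return (ranked[lower - 1] + ranked[lower]) / 2


print(median_by_depth([7, 1, 5, 3, 9]))      # depth 3.0 -> 5
print(median_by_depth([7, 1, 5, 3, 9, 11]))  # depth 3.5 -> 6.0
```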
Ungrouped Frequency Distribution
Find the cumulative frequencies for the data. The first value with a cumulative frequency
greater than the depth of the median is the median. If the depth of the median is exactly 0.5
more than the cumulative frequency of the previous class, then the median is the midpoint
between the two classes.
Grouped Frequency Distribution
Since the data is grouped, the original values are no longer available, so the median can only
be estimated. Some textbooks have you simply take the midpoint of the median class. This is an
over-simplification which isn't the true value (but much easier to do). The correct process is
to interpolate.
Find out what proportion of the distance into the median class the median lies by dividing the
sample size by 2, subtracting the cumulative frequency of the previous class, and then
dividing all that by the frequency of the median class.
Multiply this proportion by the class width and add it to the lower boundary of the median
class.
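The interpolation can be written out as a short Python sketch. It assumes equal class widths and takes the median class to be the first class whose cumulative frequency reaches n / 2. The function name and the example classes are hypothetical.

```python
def grouped_median(lower_boundaries, class_width, frequencies):
    """Interpolated median for a grouped frequency distribution."""
    n = sum(frequencies)
    half = n / 2
    cumulative = 0                   # cumulative frequency of the previous classes
    for lower, freq in zip(lower_boundaries, frequencies):
        if freq > 0 and cumulative + freq >= half:
            # proportion of the way into the median class
            proportion = (half - cumulative) / freq
            return lower + proportion * class_width
        cumulative += freq
    raise ValueError("the distribution contains no data")


# Classes 9.5-19.5, 19.5-29.5, 29.5-39.5, 39.5-49.5 with frequencies 3, 5, 8, 4.
# n = 20, so n / 2 = 10; the third class is the median class.
# proportion = (10 - 8) / 8 = 0.25, median = 29.5 + 0.25 * 10
print(grouped_median([9.5, 19.5, 29.5, 39.5], 10, [3, 5, 8, 4]))  # 32.0
```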
Mode
The mode is the most frequent data value. There may be no mode if no one value appears
more than any other. There may also be two modes (bimodal), three modes (trimodal), or
more than three modes (multi-modal).
For grouped frequency distributions, the modal class is the class with the largest frequency.
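A small Python sketch of finding the mode or modes of raw data follows. It treats "no one value appears more than any other" as meaning that several distinct values all tie. That reading is an assumption, and the function name modes is made up for the example.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of the modes; an empty list means no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumption: when several distinct values all tie, report "no mode".
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 3, 3, 5, 7, 7, 9]))             # bimodal: [3, 7]
print(modes([1, 2, 3]))                         # no mode: []
print(modes(["red", "blue", "red", "green"]))   # nominal data: ['red']
```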
Midrange
The midrange is simply the midpoint between the highest and lowest values.
Summary
The Mean is used in computing other statistics (such as the variance) and does not exist for
open ended grouped frequency distributions. It is often not appropriate for skewed
distributions such as salary information.
The Median is the center number and is good for skewed distributions because it is resistant to
change.
The Mode is used to describe the most typical case. The mode can be used with nominal data
whereas the others can't. The mode may or may not exist and there may be more than one
value for the mode.
The Midrange is not used very often. It is a very rough estimate of the average and is greatly
affected by extreme values.
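To see how differently the four averages react to one extreme value, here is a short Python comparison using the standard statistics module. The salary figures are invented for illustration.

```python
import statistics

# Hypothetical salaries in thousands of dollars; one value is far above the rest.
salaries = [30, 32, 35, 35, 40, 250]

mean = statistics.mean(salaries)                 # pulled upward by the 250
median = statistics.median(salaries)             # midpoint of the two middle values
mode = statistics.mode(salaries)                 # most frequent value
midrange = (max(salaries) + min(salaries)) / 2   # depends only on the extremes

print(f"mean     = {mean:.2f}")      # 70.33
print(f"median   = {median}")        # 35.0
print(f"mode     = {mode}")          # 35
print(f"midrange = {midrange}")      # 140.0
```

The median and mode stay with the bulk of the data, while the mean and especially the midrange are dragged toward the extreme value.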
Measures of Variation
Range
The range is the simplest measure of variation to find. It is simply the highest value minus the
lowest value.
RANGE = MAXIMUM - MINIMUM
Since the range only uses the largest and smallest values, it is greatly affected by extreme
values. That is, it is not resistant to change.
Variance
"Average Deviation"
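The range only involves the smallest and largest numbers, and it would be desirable to have a
statistic which involved all of the data values. The first attempt is the average of the
deviations from the mean: (sum of (x - mu)) / N.
The problem is that this summation is always zero, because the positive and negative
deviations cancel exactly. So, the average deviation will always be zero. That is why the
average deviation is never used.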
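A quick numerical check of this claim is sketched below. It uses exact fractions so that rounding cannot hide or create a nonzero total. The data values are made up.

```python
from fractions import Fraction

data = [2, 4, 4, 5, 10]
mean = Fraction(sum(data), len(data))       # 25 / 5 = 5
deviations = [x - mean for x in data]       # -3, -1, -1, 0, 5
total = sum(deviations)

print(total)   # 0
```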
Population Variance
So, to keep it from being zero, each deviation from the mean is squared and called the
"squared deviation from the mean". The average of these squared deviations from the mean is
called the variance.
Population variance: sigma^2 = (sum of (x - mu)^2) / N
One would expect the sample variance to simply be the population variance with the
population mean replaced by the sample mean. However, one of the major uses of a statistic is
to estimate the corresponding parameter. That formula has the problem that, on average, it
underestimates the population variance. To counteract this, the sum of the squares of the
deviations is divided by one less than the sample size.
Sample variance: s^2 = (sum of (x - x bar)^2) / (n - 1)
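The two divisors can be compared side by side in a short Python sketch. The function names and data are illustrative only.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 5, 10]             # mean 5, squared deviations sum to 36
print(population_variance(data))    # 36 / 5 = 7.2
print(sample_variance(data))        # 36 / 4 = 9.0
```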
Standard Deviation
There is a problem with variances. Recall that the deviations were squared. That means that
the units were also squared. To get the units back the same as the original data values, the
square root must be taken.
The sample standard deviation is not an unbiased estimator of the population standard
deviation.
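As a brief illustration of taking the square root to return to the original units, the sketch below uses Python's standard statistics module on made-up data.

```python
import math
import statistics

data = [2, 4, 4, 5, 10]

# Variances are in squared units; the square roots are in the original units.
pop_sd = math.sqrt(statistics.pvariance(data))     # square root of 7.2
sample_sd = math.sqrt(statistics.variance(data))   # square root of 9

print(round(pop_sd, 3))   # 2.683
print(sample_sd)          # 3.0
```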
Sum of Squares
The sum of the squares of the deviations from the mean is given a shortcut notation and
several alternative formulas.
SS(x) = sum of (x - x bar)^2 = (sum of x^2) - (sum of x)^2 / n
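One widely used shortcut computes the sum of squares from the sum of x and the sum of x^2 without finding each deviation. Whether this is the exact alternative the notes intend is an assumption. The sketch below checks that it agrees with the definition on made-up data.

```python
def ss_definition(values):
    """Sum of squared deviations from the mean, straight from the definition."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values)


def ss_shortcut(values):
    """Shortcut form: sum of x^2 minus (sum of x)^2 / n."""
    n = len(values)
    return sum(x * x for x in values) - sum(values) ** 2 / n


data = [2, 4, 4, 5, 10]
print(ss_definition(data))   # 36.0
print(ss_shortcut(data))     # 161 - 625 / 5 = 36.0
```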
Chebyshev's Theorem
The proportion of the values that fall within k standard deviations of the mean will be at least
1 - 1/k^2, where k is any number greater than 1.
Chebyshev's Theorem is true for any data set, no matter what the distribution.
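The Chebyshev bound, 1 - 1/k^2 for k greater than 1, can be compared with the proportion actually observed in a data set. The following Python sketch does this for made-up data. The function names are hypothetical, and the population standard deviation is used.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)


data = [2, 4, 4, 5, 10]
for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 4), proportion_within(data, k))
```

The observed proportion is never below the bound, though it is often well above it.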
Empirical Rule
The empirical rule is only valid for bell-shaped (normal) distributions. The following
statements are true.
Approximately 68% of the data values fall within one standard deviation of the mean.
Approximately 95% of the data values fall within two standard deviations of the mean.
Approximately 99.7% of the data values fall within three standard deviations of the
mean.
The empirical rule will be revisited later in the chapter on normal probabilities.
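For an exactly normal distribution, the three percentages can be recovered from the error function in Python's math module. This is a sketch of where the rounded figures come from, not part of the rule itself. The function name is made up.

```python
import math


def normal_proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_proportion_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```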
Measures of Position
The standard score is obtained by subtracting the mean and dividing the difference by the
standard deviation. The symbol is z, which is why it's also called a z-score.
The mean of the standard scores is zero and the standard deviation is 1. This is the nice
feature of the standard score -- no matter what the original scale was, when the data is
converted to its standard score, the mean is zero and the standard deviation is 1.
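The sketch below converts a small made-up data set to standard scores and checks the two properties. It assumes the population standard deviation is used for both the conversion and the check. With the sample standard deviation in both places the same result holds.

```python
import statistics

data = [2, 4, 4, 5, 10]
mu = statistics.mean(data)          # 5
sigma = statistics.pstdev(data)     # population standard deviation

# z = (x - mean) / standard deviation
z_scores = [(x - mu) / sigma for x in data]

z_mean = statistics.mean(z_scores)
z_sd = statistics.pstdev(z_scores)
print(round(z_mean, 10), round(z_sd, 10))
```

Up to floating-point rounding, the mean of the z-scores is 0 and their standard deviation is 1.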
The kth percentile is the number which has k% of the values below it. The data must be
ranked.
It is sometimes easier to count from the high end rather than counting from the low end. For
example, the 80th percentile is the number which has 80% below it and 20% above it. Rather
than counting 80% from the bottom, count 20% from the top.
If you wish to find the percentile for a number (rather than locating the kth percentile), then
The percentiles divide the data into 100 equal regions. The deciles divide the data into 10
equal regions. The instructions are the same for finding a percentile, except instead of
dividing by 100 in step 2, divide by 10.
Quartiles (4 regions)
The quartiles divide the data into 4 equal regions. Instead of dividing by 100 in step 2, divide
by 4.
Note: The 2nd quartile is the same as the median. The 1st quartile is the 25th percentile, the 3rd
quartile is the 75th percentile.
The quartiles are commonly used (much more so than the percentiles or deciles).
Five Number Summary
The five number summary consists of the minimum value, lower fourth, median, upper fourth,
and maximum value.
A box-and-whisker plot is a graphical representation of the five number summary. A box is
drawn between the lower and upper fourths with a line at the median. Whiskers (a single line,
not a box) extend from the fourths to lines at the minimum and maximum values.
The interquartile range is the difference between the third and first quartiles.
IQR = Q3 - Q1
Outliers
Outliers are extreme values. There are mild outliers and extreme outliers.
Extreme Outliers
Extreme outliers are any data values which lie more than 3.0 times the interquartile range
below the first quartile or above the third quartile. x is an extreme outlier if ...
x < Q1 - 3 * IQR
or
x > Q3 + 3 * IQR
Mild Outliers
Mild outliers are any data values which lie between 1.5 times and 3.0 times the interquartile
range below the first quartile or above the third quartile. x is a mild outlier if ...
Q1 - 3 * IQR <= x < Q1 - 1.5 * IQR
or
Q3 + 1.5 * IQR < x <= Q3 + 3 * IQR
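The outlier rules can be sketched as a small Python function. The quartiles are passed in as given values, since the method for locating them is a separate matter. The function name and the numbers are made up for illustration.

```python
def classify_outlier(x, q1, q3):
    """Label x as an extreme outlier, a mild outlier, or neither."""
    iqr = q3 - q1
    # Check the extreme rule first so the mild rule covers only what is left.
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


# With Q1 = 10 and Q3 = 20, IQR = 10.
# Mild fences are -5 and 35; extreme fences are -20 and 50.
for value in (20, 36, 50, 51, -6, -21):
    print(value, classify_outlier(value, 10, 20))
```

A value exactly on the extreme fence, such as 50 here, counts as mild because the extreme rule uses strict inequalities.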