Chapter 3 (Technical English for Statistics)

Chapter 3 provides definitions and explanations of key statistical concepts such as mean, median, mode, variance, and standard deviation, as well as various measures of central tendency and variation. It discusses the importance of understanding different types of distributions, including skewed and symmetric distributions, and introduces methods for calculating percentiles, quartiles, and identifying outliers. The chapter emphasizes the significance of these statistical measures in analyzing and interpreting data effectively.

Uploaded by

kirilyakov96

Chapter 3

Data Description
Definitions

Statistic
Characteristic or measure obtained from a sample
Parameter
Characteristic or measure obtained from a population
Mean
Sum of all the values divided by the number of values. This can either be a population
mean (denoted by mu, $\mu$) or a sample mean (denoted by x bar, $\bar{x}$).
Median
The midpoint of the data after being ranked (sorted in ascending order). There are as
many numbers below the median as above the median.
Mode
The most frequent number
Skewed Distribution
The majority of the values lie together on one side with a very few values (the tail) to
the other side. In a positively skewed distribution, the tail is to the right and the mean
is larger than the median. In a negatively skewed distribution, the tail is to the left and
the mean is smaller than the median.
Symmetric Distribution
The data values are evenly distributed on both sides of the mean. In a symmetric
distribution, the mean equals the median.
Weighted Mean
The mean when each value is multiplied by its weight and summed. This sum is
divided by the total of the weights.
Midrange
The mean of the highest and lowest values. (Max + Min) / 2
Range
The difference between the highest and lowest values. Max - Min
Population Variance
The average of the squares of the distances from the population mean. It is the sum of
the squares of the deviations from the mean divided by the population size.
Sample Variance
Unbiased estimator of a population variance. Instead of dividing by the population
size, the sum of the squares of the deviations from the sample mean is divided by one
less than the sample size.
Standard Deviation
The square root of the variance. The population standard deviation is the square root
of the population variance and the sample standard deviation is the square root of the
sample variance. The sample standard deviation is not the unbiased estimator for the
population standard deviation.
Coefficient of Variation
Standard deviation divided by the mean, expressed as a percentage.
Chebyshev's Theorem
The proportion of the values that fall within k standard deviations of the mean is at
least $1 - \frac{1}{k^2}$, where k > 1. Chebyshev's theorem can be applied to any distribution
regardless of its shape.
Empirical or Normal Rule
Only valid when a distribution is bell-shaped (normal). Approximately 68% of the
data lies within 1 standard deviation of the mean; 95% of the data lies within 2
standard deviations; and 99.7% of the data lies within 3 standard deviations of the
mean.
Standard Score or Z-Score
The value obtained by subtracting the mean and dividing by the standard deviation.
When all values are transformed to their standard scores, the new mean (for Z) will be
zero and the standard deviation will be one.
Percentile
The percent of the population which lies below that value. The data must be ranked to
find percentiles.
Quartile
Either the 25th, 50th, or 75th percentiles. The 50th percentile is also called the median.
Decile
Either the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, or 90th percentiles.
Box and Whiskers Plot (Box Plot)
A graphical representation of the minimum value, lower fourth (hinge), median, upper
fourth, and maximum. Some textbooks define the five values as the minimum, first
Quartile, median, third Quartile, and maximum.
Five Number Summary
Minimum value, lower fourth, median, upper fourth, and maximum.
InterQuartile Range (IQR)
The difference between the 3rd and 1st Quartiles.
Outlier
An extremely high or low value when compared to the rest of the values.
Mild Outliers
Values which lie between 1.5 and 3.0 times the InterQuartile Range below the 1st
Quartile or above the 3rd Quartile.
Extreme Outliers
Values which lie more than 3.0 times the InterQuartile Range below the 1st Quartile or
above the 3rd Quartile.

Measures of Central Tendency


The term "Average" is vague

Average could mean one of four things: the arithmetic mean, the median, the midrange, or the
mode. For this reason, it is better to specify which average you're talking about.

Mean

This is what people usually intend when they say "average"


Population Mean: $\mu = \frac{\sum x}{N}$

Sample Mean: $\bar{x} = \frac{\sum x}{n}$

Sample Mean for Frequency Distribution: $\bar{x} = \frac{\sum x f}{\sum f}$


The mean of a frequency distribution is also the weighted mean.
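As a small illustration of that equivalence, the sketch below computes the mean of some raw data and the mean of the same data summarized as a frequency distribution. The function names and the sample numbers are made up for this example.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum of x*f divided by sum of f.

    This is a weighted mean in which each frequency acts as the weight.
    """
    weighted_total = sum(x * f for x, f in zip(values, freqs))
    return weighted_total / sum(freqs)


raw = [2, 2, 3, 3, 3, 5]          # hypothetical raw data
distinct = [2, 3, 5]              # the same data as distinct values ...
counts = [2, 3, 1]                # ... with their frequencies

print(mean(raw))                        # 3.0
print(frequency_mean(distinct, counts)) # 3.0
```

Both calls give the same result because the frequency table is just a compact way of writing the raw list.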

Median

The data must be ranked (sorted in ascending order) first. The median is the number in the
middle.

Several formulas could be used to find the depth of the median. The one that we will use is:
Depth of median = 0.5 * (n + 1)

Raw Data

The median is the number in the "depth of the median" position. If the sample size is even, the
depth of the median will be a decimal -- you need to find the midpoint between the numbers
on either side of the depth of the median.
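The raw-data procedure can be sketched in a few lines of Python. This is only an illustration of the depth formula used in these notes; the function name is made up, and the input is assumed to be a non-empty list of numbers.

```python
def median_by_depth(values):
    """Median of raw data using depth = 0.5 * (n + 1) (1-based position)."""
    ranked = sorted(values)           # the data must be ranked first
    n = len(ranked)
    depth = 0.5 * (n + 1)
    lower = int(depth)                # whole-number part of the depth
    if depth == lower:
        # Odd sample size: the depth is a whole number, take that position.
        return ranked[lower - 1]
    # Even sample size: the depth ends in .5, take the midpoint of the
    # values on either side of it.
    return (ranked[lower - 1] + ranked[lower]) / 2


print(median_by_depth([7, 1, 3, 9, 5]))  # 5   (depth 3)
print(median_by_depth([4, 1, 3, 2]))     # 2.5 (depth 2.5)
```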

Ungrouped Frequency Distribution

Find the cumulative frequencies for the data. The first value with a cumulative frequency
greater than or equal to the depth of the median is the median. If the depth of the median is
exactly 0.5 more than the cumulative frequency of the previous value, then the median is the
midpoint between the two values.
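The cumulative-frequency procedure might look like the following sketch. It assumes the distinct values are listed in ascending order and that every frequency is a positive whole number; the function name and the data are hypothetical.

```python
def median_from_frequencies(values, freqs):
    """Median of an ungrouped frequency distribution.

    values: distinct data values in ascending order
    freqs:  matching positive frequencies
    """
    n = sum(freqs)
    depth = 0.5 * (n + 1)
    cumulative = 0
    for i, (x, f) in enumerate(zip(values, freqs)):
        cumulative += f
        if cumulative >= depth:
            # The depth falls inside this value's run of repeats.
            return x
        if cumulative + 0.5 == depth:
            # The depth sits exactly between this value and the next one.
            return (x + values[i + 1]) / 2


# 1, 1, 2, 2, 2, 3 -> depth 3.5 lies between two 2s
print(median_from_frequencies([1, 2, 3], [2, 3, 1]))  # 2
# 1, 1, 1, 2, 2, 3 -> depth 3.5 lies between a 1 and a 2
print(median_from_frequencies([1, 2, 3], [3, 2, 1]))  # 1.5
```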

Grouped Frequency Distribution

Since the data is grouped, the original values are no longer available. Some textbooks have
you simply take the midpoint of the median class. This is an oversimplification that does not
give the true value (though it is much easier to do). The correct process is to interpolate.

Find what proportion of the distance into the median class the median lies by dividing the
sample size by 2, subtracting the cumulative frequency of the previous class, and then
dividing the result by the frequency of the median class.

Multiply this proportion by the class width and add it to the lower boundary of the median
class.
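The interpolation can be written out as a short sketch. It assumes equal class widths, lower class boundaries given in ascending order, and a non-empty distribution; the function name, parameter names, and data are invented for the example.

```python
def grouped_median(lower_boundaries, class_width, freqs):
    """Interpolated median of a grouped frequency distribution.

    median = lower boundary + ((n/2 - previous cumulative freq) / class freq) * width
    """
    n = sum(freqs)
    half = n / 2
    cumulative = 0                      # cumulative frequency of previous classes
    for lower, f in zip(lower_boundaries, freqs):
        if cumulative + f >= half:      # this is the median class
            proportion = (half - cumulative) / f
            return lower + proportion * class_width
        cumulative += f


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with frequencies 4, 6, 10, 5.
# n = 25, n/2 = 12.5, median class is 20-30, proportion = (12.5 - 10) / 10 = 0.25
print(grouped_median([0, 10, 20, 30], 10, [4, 6, 10, 5]))  # 22.5
```

Taking the class midpoint instead would give 25 here, which shows how far the shortcut can drift from the interpolated value.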
Mode

The mode is the most frequent data value. There may be no mode if no one value appears
more than any other. There may also be two modes (bimodal), three modes (trimodal), or
more than three modes (multi-modal).

For grouped frequency distributions, the modal class is the class with the largest frequency.
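A minimal sketch of finding the mode(s) of raw data follows. It treats "no mode" as the case where several distinct values all occur equally often, which is one reading of the description above; the function name is made up, and the input is assumed to be non-empty.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of the modes; an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all share the same frequency,
    # no value appears more often than any other, so there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 2, 3, 3]))             # [2, 3]  (bimodal)
print(modes([1, 2, 3]))                   # []      (no mode)
print(modes(["red", "blue", "red"]))      # ['red'] (nominal data works too)
```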

Midrange

The midrange is simply the midpoint between the highest and lowest values.

Summary

The Mean is used in computing other statistics (such as the variance) and does not exist for
open ended grouped frequency distributions. It is often not appropriate for skewed
distributions such as salary information.

The Median is the center number and is good for skewed distributions because it is resistant to
change.

The Mode is used to describe the most typical case. The mode can be used with nominal data
whereas the others can't. The mode may or may not exist and there may be more than one
value for the mode.

The Midrange is not used very often. It is a very rough estimate of the average and is greatly
affected by extreme values.

Property                      Mean    Median    Mode    Midrange
Always exists                 No      Yes       No      Yes
Uses all data values          Yes     No        No      No
Affected by extreme values    Yes     No        No      Yes

Measures of Variation
Range

The range is the simplest measure of variation to find. It is simply the highest value minus the
lowest value.
RANGE = MAXIMUM - MINIMUM

Since the range only uses the largest and smallest values, it is greatly affected by extreme
values. That is, it is not resistant to change.

Variance

"Average Deviation"

The range only involves the smallest and largest numbers, and it would be desirable to have a
statistic which involved all of the data values.

The average deviation is defined as:

$\frac{\sum (x - \mu)}{N}$

The problem is that this summation is always zero, because the positive and negative
deviations cancel. So, the average deviation will always be zero. That is why the average
deviation is never used.
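A short derivation shows why the sum of the deviations from the mean is always zero. The notation is assumed here: $\mu$ is the population mean and $N$ the population size, so that $\mu = \frac{\sum x}{N}$.

```latex
\sum (x - \mu)
  = \sum x - \sum \mu
  = \sum x - N\mu
  = \sum x - N \cdot \frac{\sum x}{N}
  = 0
```

The same cancellation happens for a sample when $\mu$ and $N$ are replaced by $\bar{x}$ and $n$.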

Population Variance

So, to keep it from being zero, each deviation from the mean is squared, giving the "squared
deviation from the mean". The average of these squared deviations from the mean is called the
variance:

$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$

Unbiased Estimate of the Population Variance

One would expect the sample variance to simply be the population variance with the
population mean replaced by the sample mean. However, one of the major uses of a statistic is
to estimate the corresponding parameter. Dividing by the sample size gives a biased estimate:
on average it is smaller than the population variance. To counteract this, the sum of the
squares of the deviations is divided by one less than the sample size:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Standard Deviation

There is a problem with variances. Recall that the deviations were squared. That means that
the units were also squared. To get the units back to those of the original data values, the
square root must be taken: $\sigma = \sqrt{\sigma^2}$ for a population and $s = \sqrt{s^2}$
for a sample.

The sample standard deviation is not an unbiased estimator of the population standard
deviation.
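The two variances and their square roots can be compared side by side in a short sketch. The function names and the data set are made up for illustration; the only difference between the two functions is the divisor.

```python
import math


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, sum of squared deviations 32

print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0
print(sample_variance(data))                 # 32 / 7, about 4.571
print(math.sqrt(sample_variance(data)))      # about 2.138
```

Dividing by n - 1 always gives the larger of the two results, which is what compensates for the underestimate described above.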

Sum of Squares

The sum of the squares of the deviations from the mean is given a shortcut notation and
several alternative formulas.

$SS(x) = \sum (x - \bar{x})^2$

A little algebraic simplification returns:

$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n}$
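The simplification can be spelled out step by step. The notation $SS(x)$ for the sum of squares is assumed here, along with $\bar{x} = \frac{\sum x}{n}$ for the sample mean.

```latex
SS(x) = \sum (x - \bar{x})^2
      = \sum x^2 - 2\bar{x} \sum x + n\bar{x}^2
      = \sum x^2 - 2\,\frac{(\sum x)^2}{n} + \frac{(\sum x)^2}{n}
      = \sum x^2 - \frac{(\sum x)^2}{n}
```

With this form, the sample variance is $s^2 = \frac{SS(x)}{n - 1}$, and only $\sum x$ and $\sum x^2$ need to be accumulated.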

Chebyshev's Theorem

The proportion of the values that fall within k standard deviations of the mean will be at least
$1 - \frac{1}{k^2}$, where k is any number greater than 1.

"Within k standard deviations" is interpreted as the interval from $\bar{x} - ks$ to $\bar{x} + ks$.

Chebyshev's Theorem is true for any sample set, no matter what the distribution.
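The bound can be checked numerically on a deliberately skewed data set. This is a sketch only: the function names and data are made up, and the population form of the standard deviation (dividing by n) is an assumption made for the example.

```python
import math


def chebyshev_bound(k):
    """Minimum proportion within k standard deviations: 1 - 1/k^2, for k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)  # assumed: divide by n
    return sum(1 for x in values if abs(x - mean) <= k * sd) / n


skewed = [1, 2, 2, 3, 3, 3, 4, 4, 50]   # hypothetical data with one large value

for k in (1.5, 2, 3):
    # The observed proportion should never fall below the bound.
    print(k, proportion_within(skewed, k), ">=", chebyshev_bound(k))
```

The bound is a guaranteed minimum, so the observed proportions are usually well above it.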

Empirical Rule

The empirical rule is only valid for bell-shaped (normal) distributions. The following
statements are true.

• Approximately 68% of the data values fall within one standard deviation of the mean.
• Approximately 95% of the data values fall within two standard deviations of the mean.
• Approximately 99.7% of the data values fall within three standard deviations of the
mean.

The empirical rule will be revisited later in the chapter on normal probabilities.
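The three percentages can be reproduced from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the helper name is made up for the example.

```python
from statistics import NormalDist


def normal_proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(normal_proportion_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The rule's 68%, 95%, and 99.7% are rounded versions of these values.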

Measures of Position

Standard Scores (z-scores)

The standard score is obtained by subtracting the mean and dividing the difference by the
standard deviation: $z = \frac{x - \mu}{\sigma}$ for a population, or $z = \frac{x - \bar{x}}{s}$
for a sample. The symbol is z, which is why it's also called a z-score.

The nice feature of the standard score is that, no matter what the original scale was, when the
data is converted to standard scores, the mean is zero and the standard deviation is 1.
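A quick sketch makes the "mean zero, standard deviation one" property concrete. The function name and data are made up, and the population standard deviation (dividing by n) is assumed; using the sample standard deviation throughout would give the same property for the sample version.

```python
import math


def z_scores(values):
    """Convert each value to its standard score: (x - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)  # assumed: divide by n
    return [(x - mean) / sd for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]          # mean 5, standard deviation 2
z = z_scores(data)
print(z)                                  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

z_mean = sum(z) / len(z)
z_sd = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))
print(z_mean, z_sd)                       # 0.0 1.0
```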

Percentiles, Deciles, Quartiles

Percentiles (100 regions)

The kth percentile is the number which has k% of the values below it. The data must be
ranked.

1. Rank the data.
2. Find k% (k / 100) of the sample size, n.
3. If this is an integer, add 0.5. If it isn't an integer, round up.
4. Find the number in this position. If your depth ends in 0.5, then take the midpoint
between the two numbers on either side.
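The four steps can be sketched as follows. The function name is made up, k is assumed to be a whole number from 1 to 99, and integer arithmetic is used for step 2 so that the "is it an integer?" test in step 3 is not affected by floating-point rounding.

```python
def kth_percentile(values, k):
    """Locate the kth percentile using the four-step procedure above."""
    ranked = sorted(values)                  # step 1: rank the data
    n = len(ranked)
    whole, remainder = divmod(k * n, 100)    # step 2: k% of n, kept exact
    if remainder == 0:
        # step 3: an integer, so the depth is whole + 0.5;
        # step 4: take the midpoint of the values on either side.
        return (ranked[whole - 1] + ranked[whole]) / 2
    # step 3: not an integer, so round up to position whole + 1 (1-based)
    return ranked[whole]                     # step 4: the value in that position


data = list(range(1, 11))                    # hypothetical data: 1, 2, ..., 10

print(kth_percentile(data, 25))   # 3   (2.5 rounds up to position 3)
print(kth_percentile(data, 50))   # 5.5 (depth 5.5, same as the median)
print(kth_percentile(data, 80))   # 8.5 (depth 8.5)
```

Other textbooks and software use slightly different conventions, so results for small data sets may not match exactly.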

It is sometimes easier to count from the high end rather than counting from the low end. For
example, the 80th percentile is the number which has 80% below it and 20% above it. Rather
than counting 80% from the bottom, count 20% from the top.

Note: The 50th percentile is the median.

If you wish to find the percentile for a number (rather than locating the kth percentile), then

1. Take the number of values below the number.
2. Add 0.5.
3. Divide by the total number of values.
4. Convert it to a percent.
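The reverse lookup is a one-line formula, sketched below with a made-up function name and data set.

```python
def percentile_of(values, x):
    """Percentile for the number x: (count below x + 0.5) / n, as a percent."""
    below = sum(1 for v in values if v < x)    # step 1: values below the number
    return 100 * (below + 0.5) / len(values)   # steps 2-4


data = list(range(1, 11))        # hypothetical data: 1, 2, ..., 10

print(percentile_of(data, 7))    # 65.0 (6 values lie below 7)
print(percentile_of(data, 1))    # 5.0  (no values lie below 1)
```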

Deciles (10 regions)

The percentiles divide the data into 100 equal regions. The deciles divide the data into 10
equal regions. The instructions are the same for finding a percentile, except instead of
dividing by 100 in step 2, divide by 10.

Quartiles (4 regions)

The quartiles divide the data into 4 equal regions. Instead of dividing by 100 in step 2, divide
by 4.

Note: The 2nd quartile is the same as the median. The 1st quartile is the 25th percentile, the 3rd
quartile is the 75th percentile.

The quartiles are commonly used (much more so than the percentiles or deciles).
Five Number Summary

The five number summary consists of the minimum value, lower fourth, median, upper fourth,
and maximum value.

Box and Whiskers Plot

A graphical representation of the five number summary. A box is drawn between the lower
and upper fourths with a line at the median. Whiskers (a single line, not a box) extend from
the fourths to lines at the minimum and maximum values.

Interquartile Range (IQR)

The interquartile range is the difference between the third and first quartiles. That's it:
IQR = Q3 - Q1

Outliers

Outliers are extreme values. There are mild outliers and extreme outliers.

Extreme Outliers

Extreme outliers are any data values which lie more than 3.0 times the interquartile range
below the first quartile or above the third quartile. x is an extreme outlier if ...

x < Q1 - 3 * IQR

or

x > Q3 + 3 * IQR

Mild Outliers

Mild outliers are any data values which lie between 1.5 times and 3.0 times the interquartile
range below the first quartile or above the third quartile. x is a mild outlier if ...

Q1 - 3 * IQR <= x < Q1 - 1.5 * IQR

or

Q3 + 1.5 * IQR < x <= Q3 + 3 * IQR
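The outlier rules can be collected into one small classifier. In this sketch the quartiles are passed in directly, since the method for finding them varies between textbooks; the function name, labels, and numbers are made up for the example.

```python
def classify_outlier(x, q1, q3):
    """Label x as 'extreme', 'mild', or 'none' using the 1.5 and 3.0 IQR rules."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme"      # beyond 3.0 * IQR from the nearer quartile
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild"         # between 1.5 and 3.0 * IQR from the nearer quartile
    return "none"


# Hypothetical quartiles Q1 = 10 and Q3 = 20, so IQR = 10.
# Mild outliers lie in [-20, -5) or (35, 50]; extreme outliers lie beyond -20 or 50.
for value in (-21, -6, 35, 36, 50, 51):
    print(value, classify_outlier(value, 10, 20))
# -21 extreme
# -6 mild
# 35 none
# 36 mild
# 50 mild
# 51 extreme
```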
