Chapter 2 Lab 2: Descriptive Statistics
Describing comic sensibility is near impossible. It’s sort of an abstract silliness, that sometimes the joke isn’t the star. —Dana Carvey
The purpose of this lab is to show you how to compute basic descriptive statistics, including measures of central tendency (mean, mode, median) and variation (range, variance, standard deviation).
2.1 General Goals
- Compute measures of central tendency using software
- Compute measures of variation using software
- Ask some questions of a data set using descriptive statistics
2.1.1 Important info
We will be using data from the gapminder project. You can download a small snippet of the data in .csv format from this link (note this dataset was copied from the gapminder library for R) gapminder.csv. If you are using R, then you can install the gapminder package. This method is described later in the R section.
2.2 R
2.2.1 Descriptives basics in R
We learned in lecture and from the textbook that data we want to use ask and answer questions often comes with loads of numbers. Too many numbers to look at all at once. That’s one reason we use descriptive statistics. To reduce the big set of numbers to one or two summary numbers that tell use something about all of the numbers. R can produce descriptive statistics for you in many ways. There are base functions for most of the ones that you want. We’ll go over some R basics for descriptive statistics, and then use our new found skills to ask some questions about real data.
2.2.1.1 Making numbers in R
In order to do descriptive statistics we need to put some numbers in a variable. You can also do this using the c()
command, which stands for combine
<- c(1,2,3,4) my_numbers
There a few other handy ways to make numbers. We can use seq()
to make a sequence. Here’s making the numbers from 1 to 100
<- seq(1,100,1) one_to_one_hundred
We can repeat things, using rep. Here’s making 10 5s, and 25 1s:
rep(10,5)
## [1] 10 10 10 10 10
rep(1,25)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
<- c(rep(10,5),rep(1,25)) all_together_now
2.2.2 Central Tendency
2.2.2.1 Mean
Remember the mean of some numbers is their sum, divided by the number of numbers. We can compute the mean like this:
sum(one_to_one_hundred)/length(one_to_one_hundred)
## [1] 50.5
Or, we could just use the mean()
function like this:
mean(one_to_one_hundred)
## [1] 50.5
2.2.2.2 Median
The median is the number in the exact middle of the numbers ordered from smallest to largest. If there are an even number of numbers (no number in the middle), then we take the number in between the two (decimal .5). Use the median
function. There’s only 3 numbers here. The middle one is 2, that should be the median
median(c(1,2,3))
## [1] 2
2.2.2.3 Mode
R does not a base function for the Mode. You would have to write one for yourself. Here is an example of writing your own mode function, and then using it. Note I searched how to do this on Google, and am using the mode defined by this answer on stack overflow
Remember, the mode is the most frequently occurring number in the set. Below 1 occurs the most, so the mode will be one.
<- function(x) {
my_mode <- unique(x)
ux which.max(tabulate(match(x, ux)))]
ux[
}
my_mode(c(1,1,1,1,1,1,1,2,3,4))
## [1] 1
2.2.3 Variation
We often want to know how variable the numbers are. We are going to look at descriptive statistics to describe this such as the range, variance, the standard deviation, and a few others.
First, let’s remind ourselves what variation looks like (it’s when the numbers are different). We will sample 100 numbers from a normal distribution (don’t worry about this yet), with a mean of 10, and a standard deviation of 5, and then make a histogram so we can see the variation around 10..
<- rnorm(100,10,5)
sample_numbers hist(sample_numbers)
2.2.3.1 range
The range is the minimum and maximum values in the set, we use the range
function.
range(sample_numbers)
## [1] -9.33375 24.65729
2.2.3.2 var = variance
We can find the sample variance using var
. Note, divides by (n-1)
var(sample_numbers)
## [1] 30.17962
2.2.3.3 sd = standard deviation
We find the sample standard deviation us SD. Note, divides by (n-1)
sd(sample_numbers)
## [1] 5.493599
Remember that the standard deviation is just the square root of the variance, see:
sqrt(var(sample_numbers))
## [1] 5.493599
2.2.3.4 All Descriptives
Let’s put all of the descriptives and other functions so far in one place:
<- rnorm(100,10,5)
sample_numbers
sum(sample_numbers)
## [1] 920.7419
length(sample_numbers)
## [1] 100
mean(sample_numbers)
## [1] 9.207419
median(sample_numbers)
## [1] 9.495701
my_mode(sample_numbers)
## [1] 0.8349043
range(sample_numbers)
## [1] -2.196914 23.122972
var(sample_numbers)
## [1] 26.48327
sd(sample_numbers)
## [1] 5.14619
2.2.4 Descriptives by conditions
Sometimes you will have a single variable with some numbers, and you can use the above functions to find the descriptives for that variable. Other times (most often in this course), you will have a big data frame of numbers, with different numbers in different conditions. You will want to find descriptive statistics for each the sets of numbers inside each of the conditions. Fortunately, this is where R really shines, it does it all for you in one go.
Let’s illustrate the problem. Here I make a date frame with 10 numbers in each condition. There are 10 conditions, each labelled, A, B, C, D, E, F, G, H, I, J.
<- rnorm(100,10,5)
scores <- rep(c("A","B","C","D","E","F","G","H","I","J"), each =10)
conditions <- data.frame(conditions,scores) my_df
If you look at the my_df
data frame, you will see it has 100 rows, there are 10 rows for each condition with a label in the conditions
column, and 10 scores for each condition in the scores
column. What if you wanted to know the mean of the scores in each condition? You would want to find 10 means.
The slow way to do it would be like this:
mean(my_df[my_df$conditions=="A",]$scores)
## [1] 8.853652
mean(my_df[my_df$conditions=="B",]$scores)
## [1] 9.803153
mean(my_df[my_df$conditions=="C",]$scores)
## [1] 11.06139
# and then keep going
Nobody wants to do that! Not, me I stopped doing it that way, you should to.
2.2.4.1 group_by and summarise
We can easily do everything all at once using the group_by
and summarise
function from the dplyr
package. Just watch
library(dplyr)
%>%
my_df group_by(conditions) %>%
summarise(means = mean(scores))
## # A tibble: 10 × 2
## conditions means
## <chr> <dbl>
## 1 A 8.85
## 2 B 9.80
## 3 C 11.1
## 4 D 11.3
## 5 E 9.01
## 6 F 6.79
## 7 G 6.52
## 8 H 8.93
## 9 I 9.15
## 10 J 11.0
A couple things now. First, the print out of this was ugly. Let’s fix that. we put the results of our code into a new variable, then we use knitr::kable
to print it out nicely when we knit
the document
<- my_df %>%
summary_df group_by(conditions) %>%
summarise(means = mean(scores))
::kable(summary_df) knitr
conditions | means |
---|---|
A | 8.853652 |
B | 9.803153 |
C | 11.061394 |
D | 11.294845 |
E | 9.012564 |
F | 6.785061 |
G | 6.515950 |
H | 8.927163 |
I | 9.151269 |
J | 11.036461 |
2.2.4.2 multiple descriptives
The best thing about the dplyr
method, is that we can add more than one function, and we’ll get more than one summary returned, all in a nice format, let’s add the standard deviation:
<- my_df %>%
summary_df group_by(conditions) %>%
summarise(means = mean(scores),
sds = sd(scores))
::kable(summary_df) knitr
conditions | means | sds |
---|---|---|
A | 8.853652 | 3.530194 |
B | 9.803153 | 6.492638 |
C | 11.061394 | 5.800476 |
D | 11.294845 | 5.993258 |
E | 9.012564 | 6.015641 |
F | 6.785061 | 7.686094 |
G | 6.515950 | 5.563919 |
H | 8.927163 | 6.636507 |
I | 9.151269 | 4.601282 |
J | 11.036461 | 2.234042 |
We’ll add the min and the max too:
<- my_df %>%
summary_df group_by(conditions) %>%
summarise(means = mean(scores),
sds = sd(scores),
min = min(scores),
max = max(scores))
::kable(summary_df) knitr
conditions | means | sds | min | max |
---|---|---|---|---|
A | 8.853652 | 3.530194 | 4.221080 | 15.93309 |
B | 9.803153 | 6.492638 | 2.860014 | 21.02449 |
C | 11.061394 | 5.800476 | 3.925344 | 22.87423 |
D | 11.294845 | 5.993258 | 2.656215 | 23.54761 |
E | 9.012564 | 6.015641 | 2.044558 | 17.53280 |
F | 6.785061 | 7.686094 | -3.753145 | 17.20397 |
G | 6.515950 | 5.563919 | -5.966392 | 12.98101 |
H | 8.927163 | 6.636507 | -4.687276 | 16.96385 |
I | 9.151269 | 4.601282 | 2.755487 | 16.03131 |
J | 11.036461 | 2.234042 | 7.836105 | 14.48993 |
2.2.5 Describing gapminder
Now that we know how to get descriptive statistics from R, we cam do this will some real data. Let’s quickly ask a few question about the gapminder data.
library(gapminder)
<- gapminder gapminder_df
Note: The above code will only work if you have installed the gapminder package. Make sure you are connected to the internet, then choose the Packages tab from the bottom right panel, and choose install. Thens search for gapminder, choose it, and install it.
2.2.5.1 What are some descriptive for Life expectancy by continent?
Copy the code from the last part of descriptives using dplyr
, then change the names like this:
<- gapminder_df %>%
summary_df group_by(continent) %>%
summarise(means = mean(lifeExp),
sds = sd(lifeExp),
min = min(lifeExp),
max = max(lifeExp))
::kable(summary_df) knitr
continent | means | sds | min | max |
---|---|---|---|---|
Africa | 48.86533 | 9.150210 | 23.599 | 76.442 |
Americas | 64.65874 | 9.345088 | 37.579 | 80.653 |
Asia | 60.06490 | 11.864532 | 28.801 | 82.603 |
Europe | 71.90369 | 5.433178 | 43.585 | 81.757 |
Oceania | 74.32621 | 3.795611 | 69.120 | 81.235 |
2.2.6 Generalization Exercise
(1 point - Pass/Fail)
Complete the generalization exercise described in your R Markdown document for this lab.
What is the mean, standard deviation, minimum and maximum life expectancy for all the gapminder data (across all the years and countries). Hint: do not use
group_by
What is the mean, standard deviation, minimum and maximum life expectancy for all of the continents in 2007, the most recent year in the dataset. Hint: add another pipe using
filter(year==2007) %>%
2.2.7 Writing assignment
(2 points - Graded)
Complete the writing assignment described in your R Markdown document for this lab. When you have finished everything. Knit the document and hand in your stuff (you can submit your .RMD file to blackboard if it does not knit.)
Your writing assignment is to answer these questions in full sentences using simple plain langauge:
- Define the mode.
- Explain what would need to happen in order for a set of numbers to have two modes
- Define the median
- Define the mean
- Define the range
- When calculating the standard deviation, explain what the difference scores represent
- Explain why the difference scores are squared when calculating the standard deviation
- If one set of numbers had a standard deviation of 5, and another had a standard deviation of 10, which set of numbers would have greater variance, explain why.
Rubric
General grading.
- You will receive 0 points for missing answers (say, if you do not answer question c, then you will receive 0 out .25 points for that question)
- You must write in complete sentences. Point form sentences will be given 0 points.
- Completely incorrect answers will receive 0 points.
- If your answer is generally correct but very difficult to understand and unclear you may receive half points for the question
2.4 SPSS
In this lab, we will use SPSS to calculate a variety of descriptive statistics. SPSS allows us to specify which statistics we would like calculated and produce them all in one output table. Here, we will learn to:
- Calculate descriptive statistics
- Graph data using a histogram
- Editing graphs
Let’s begin with a short data set {x= 1, 1, 4, 1, 2, 5, 7}
Suppose we want to calculate the measures of central tendency (mean, median, and mode) as well as variability (range, standard deviation, and variance). First, we will have to enter our data into the SPSS spreadsheet. There are 7 measurements, so we will need 7 rows of data (I have also changed the name of this variable in Variable View to x
:
2.4.1 Calculating Descriptive Statistics
From here, go to the top menu and choose Analyze, then Descriptive Statistics and then Frequencies:
A new window will ask you to specify for which variables you want descriptives statistics calculated. Use the arrow to move the x
variable from the left-hand to the right-hand field.
Now, click the Statistics button. This will open a new window containing a list of statistics. You can choose as many as you want to be calculated. We will choose mean, median mode, range, standard deviation, and variance.
Then click Continue, and then OK. SPSS will produce a set of two output tables: one containing the descriptive statistics we have chosen, and the other a frequency table–a list of all the possible values in our data set and their corresponding frequencies.
2.4.2 Descriptive Statistics and Histograms
Now let’s use a real dataset to calculate the same measures of central tendency and variability as in the last example, but with the addition of a histogram to visualize a distribution to relate back to the descriptive statistics. Here is a link to the life expectancy dataset we used for our graphing tutorial. It is named life_expectancy.sav.
Suppose we wanted to know about life expectancy (around the world) in 2018. This will include calculating descriptive statistics, as well as graphing a histogram to examine the distribution of our data. SPSS often has neat shortcuts to graph data within other menu options. For example, in the process of asking SPSS to produce a table of descriptive statistics, we can also add a histogram to the output without having to go to the Graphs menu.
First, we go to Analyze, then Descriptive Statistics, then Frequencies.
A window will appear asking us to indicate which variable to use. We will scroll all the way down in the list of variables on the left, choose 2018 [V220]
, and then move it into the field on the right using the arrow.
As in the previous example, we will click the Statistics button next, and choose our measures of central tendency (mean, median, mode) as well as variability (range, standard deviation, variance).
Once we click continue, we are back to the previous window. Here, we can conveniently ask SPSS to insert a histogram. Let’s click on the Charts button:
This will open a new window that asks us to specify the type of chart we would like. Let’s select Histogram, and check off Show normal curve on histogram:
Click Continue, and then OK.
SPSS will produce a table of the requested descriptive statistics, a frequency table (we will ignore this for now because it is long and not relevant to this exercise), and finally, a histogram showing the distribution of life expectancy (in years) with a normal distribution superimposed for comparison.
Something to think about: What do the mean, median, and mode indicate about the shape of the distribution? Is it confirmed when we look at the histogram? How does the shape of this distribution compare to the symmetrical normal distribution which is superimposed over it?
2.4.3 Editing graphs
For future reference, we will take this opportunity to make edits to the appearance of this graph. It is helpful to know how to manipulate not just the superficial, cosmetic appearance of data visualizations, but also the components of graphs that communicate important information. One feature to consider is the scale of our axes.
To modify the scale in a graph, first double-click on the graph itself in the output window. A Chart Editor should pop up.
Now, hover your mouse over one of the values on the x-axis, and click down once. You will see all the x-axis values suddenly highlight:
Now that you know how to select the entire axis, double click on it. A properties window will pop up:
Notice that in the default tab, you can change the fill color of components of your graph. For now, we will select the Scale tab instead:
Notice in this tab, we can change the minimum, maximum, and increments of our scale. Lets change the major increment to 5:
Click Apply. Then Close. Now, close your Chart Editor window by simply x-ing it out. The SPSS output window will now contain your edited graph:
2.4.4 Practice Problems
Using the life expectancy data set, produce a table of output showing the descriptive statistics (measures of central tendency and variability) for both years 1800 and 1934 (during the Great Depression).
Plot histograms of life expectancy for both years. How are these distributions different? (Hint: Plot these on the same axes so that they are comparable).
2.6 CogStat
We already realized, that the data we want to ask questions about often comes with loads of numbers, which is too many to look at all at once. That is one reason we use descriptive statistics, to reduce the big set of numbers to only a few summary numbers, which tell us something about all the original values. Descriptive statistics is a powerful tool, that helps us summarize vast amounts of information to understand our data and its traits better.
In this section data from the gapminder project will be used, so first of all, load the data “Gapminder” into CogStat, as we did it before in the first Chapter.
2.6.2 Statistical measurements to describe the data
Measures of Central Tendency (mean, median, mode)
Mean: The numbers’ mean is their sum, divided by the number of numbers counted.
Median: Median is the middle number of all your numbers ordered from smallest to largest.
Mode: Mode is the number that occurs most frequently in your dataset.
Measures of Variability (minimum, maximum, range, standard deviation, quartiles): We often want to know how variable the numbers are. Descriptive statistics can describe this such as the range, variance, standard deviation, and a few others.
Minimum: The smallest one between your numbers.
Maximum: The largest one between your numbers.
Range: The difference between your dataset’s smallest and largest numbers. (for more: https://en.wikipedia.org/wiki/Range_(statistics))
Standard deviation: The square root of the variance (the average of squared deviations from the mean). (for more: https://doc.cogstat.org/Differences-in-calculations-between-CogStat-and-other-programs#standard-deviation-of-the-sample-and-the-population)
Quartiles: It divides the data into four somewhat equal portions.
2.6.3 What are some descriptives of life expectancy in this dataset?
In this task, we only want to take a look at one variable, life expectancy, and its descriptives. First, load the data “Gapminder”. After loading the data, select “Explore variables”. In the dialogue that appears, choose “LifeExp” from “Available variables” and click “OK”. In the “Descriptives for the variable” table of the Sample Properties section, you find the descriptive statistics of your variable. As you can see, with this task, we got all the descriptives we usually look for and need. According to our output, the mean life expectancy in this dataset is almost sixty years.
It is also very important to look at the minimum and maximum values. In this case, the maximum life expectancy is almost eighty-three years and the minimum is about twenty-four years. It sounds strange, that the minimum life expectancy is barely twenty-four years, right? This is why it is useful to take a look at your dataset and raw data first.
This dataset contains more countries and their life expectancy throughout decades. As data from several countries can be found here, we can find well-developed and under-developed ones, as well as there are years of war, prosperity, or crisis. After this piece of information, it is easier to look at and understand this dataset and the results coming from the analysis.
2.6.4 What are some descriptives of life expectancy by continent?
Sometimes, you will have a single variable with some numbers, and you can use the steps explained above to find the descriptives for that variable. Other times, you will have a big set of numbers grouped into different conditions or groups. You might want to find descriptive statistics for each set of numbers inside each condition or group.
In this task, we would like to find out the descriptives of the life expectancy for all the continents. What we do here is compare different variables by a single one. In CogStat, choose the “Compare groups” (Ctrl + G) analysis. (Note that, to solve this task you could also choose the “Compare repeated measures variable and groups” analysis [Ctrl + P] as well, as it gives you the same output, only the dialogues differ in these tasks. We choose “Compare groups” here as according to the task it comes more naturally and logically.)
or
After this dialogue appears, select “lifeExp” as Dependent variable(s) and “continent” as “Group(s)”, as we want to know all the continents’ life expectancies separately.
In the Sample Properties part of the output, all the descriptives for the groups are provided, which lets us make comparisons about the data.
First, we see the mean life expectancy for continents, and we can say that it is highest in Oceania and Europe, followed by America and Asia, and the lowest is in Africa. As we take a look at the different standard deviations, we see that it is highest in Asia and lowest in Oceania. It means that in Asia there is a bigger difference from the mean, than in Oceania. The table also shows the minimum and maximum life expectancy for each continent, which is quite interesting to look at, since this shows us some really big differences. The median and the upper and lower quartiles also show some important points. For these values, the dataset is in order. The median shows the middle values, for each continent. For example, in Europe, the median is around 72 years, and there is as much higher life expectancy as much lower than this one (as the median is the middle value). Quartiles are values that divide the data points into equal parts of four, with the same number of values. 25% of the data points are under the value of the lower quartile and 75% of the data points are under the value of the upper quartile. For Europeans, the upper quartile is about 75 years, which means that 75% of people cannot expect a life expectancy of 75 years. The lower quartile is around 69 years, which means that only 25% of people can expect a lower life expectancy, but most will probably live longer than that.
Additionally, CogStat also provides us with boxplots to visualize some of the information mentioned before.
To understand the boxplot, take a look at its parts, regarding the data of Africa. Vertically, all the datapoints are represented with little blue dots.