Chapter 4 Lab 4: Normal Distribution & Central Limit Theorem

By a small sample, we may judge of the whole piece. —Miguel de Cervantes from Don Quixote

4.1 General Goals

  1. Distributions
  2. Sampling from distributions
  3. Sampling distribution of the mean
  4. Sampling statistics (statistics of many samples)
  5. Central limit theorem
  6. Normal Distribution
  7. z-scores

4.2 R

This is one of two special labs where we don’t use too much real data. We will mostly fake everything. Yes, you will learn how to fake data in this course. Be a superhero, and only use these skills for good and not for evil.

As we progress through the course, you will learn that generating simulated data can be very useful to help you understand real data. In fact, I will say this right now. If you can’t simulate the data you expect to find, then you probably can’t understand the data that you do find very well. That’s a bold statement. It’s probably partly true.

4.2.1 Generating Numbers in R

There are many ways to make R generate numbers for you. In all cases you define how the numbers are generated. We’ll go through a few of the many ways.

4.2.1.1 sample

The sample function is like an endless gumball machine. You put the gumballs inside with different properties, say As and Bs, and then you let sample endlessly take gumballs out. Check it out:

gumballs <- c("A","B")
sample_of_gumballs <-sample(gumballs, 10, replace=TRUE)
sample_of_gumballs
##  [1] "A" "B" "B" "A" "B" "A" "A" "B" "B" "A"

Here the sample function randomly picks A or B each time. We set it do this 10 times, so our sample has 10 things in it. We set replace=TRUE so that after each sample, we put the item back into the gumball machine and start again. Here’s another example with numbers

some_numbers <- c(1,2,3,4,5,5,5,5)
sample_of_numbers <-sample(some_numbers, 20, replace=TRUE)
sample_of_numbers
##  [1] 1 5 5 4 4 2 5 5 5 4 5 5 4 1 5 5 4 5 4 5

Let’s do one more thing with sample. Let’s sample 1000 times from our some_numbers variable, and then look at the histogram

library(ggplot2)
some_numbers <- c(1,2,3,4,5,5,5,5)
sample_of_numbers <-sample(some_numbers, 1000, replace=TRUE)
hist(sample_of_numbers)

We are looking at lots of samples from our little gumball machine of numbers. We put more 5s in, and voila, more 5s come out of in our big sample of 1000.

4.2.1.2 runif uniform distribution

We can sample random numbers between any range using the runif(n, min=0, max = 1) function for the uniform distribution. We discussed this in the textbook. A uniform distribution is flat, and all the numbers between the min and max should occur roughly equally frequently. Let’s take 1000 random numbers between 0 and 1 and plot the histogram. We’ll just do it all in one line for speed.

hist(runif(1000,0,1))

This is histogram is flattish. Not perfectly flat, after all we only took 1000 samples. What if we took many more, say 10,000 total samples? Now it looks more flat, each bin is occurring about 500 times each, which is pretty close to the same amount.

hist(runif(10000,0,1))

4.2.1.3 rbinom the binomial distribution

The binomial distribution sounds like a scary word… binomial (AAGGGGHHHHH, stay away!). The binomial can be a coin flipping distribution. You use rbinom(n, size, prob). n gives the number of samples you want to take. We’ll keep size = 1 for now, it’s the number of trials (forget this for now, it’s more useful for more complicated things than what we are doing, if you want to know what it does, try it out, and see if you figure it out). prob is a little list you make of probabilities, that define how often certain things happen.

For example, consider flipping a coin. It will be heads or tails, and the coin, if it is fair, should have a 50% chance of being heads or tails. Here’s how we flip a coin 10 times using rbinom.

coin_flips <- rbinom(10,1,.5)
coin_flips
##  [1] 0 1 0 1 0 1 1 0 0 0

We get a bunch of 0s, and 1s. We can pretend 0 = tails, and 1 = heads. Great, now we can do coin flipping if we want. For example, if you flip 10 coins, how many heads do you get? We can can do the above again, and then sum(coin_flips). All the 1s are heads, so it will work out.

coin_flips <- rbinom(10,1,.5)
sum(coin_flips)
## [1] 5

Alright, so we get the sum, which tells us the number of heads. But, should we always get that number of heads if we flipped a coin 10 times? If you keep redoing the above, you’ll get different answers. 5 heads will be the most frequent answer, but you will get lots of other answers too.

Hold on to your seats for this next one. With R, we can simulate the flipping of a coin 10 times (you already know that, you just did it), and we can do that over and over as many times as we want. For example, we could do it 100 times over, saving the number of heads for each set of 10 flips. Then we could look at the distribution of those sums. That would tell us about the range of things that can happen when we flip a coin 10 times. We can do that in loop like this:

save_number_of_heads<-length(1000) # make an empty variable to save things in

for(i in 1:1000){
  save_number_of_heads[i] <- sum(rbinom(10,1,.5))
}

hist(save_number_of_heads)

See, that wasn’t too painful. Now we see another histogram. The histogram shows us the frequency observing different numbers of heads (for 10 flips) across the 1000 simulations. 5 happens the most, but 2 happens sometimes, and so does 8. All of the possibilities seem to happen sometimes, some more than others.

4.2.1.4 rnorm the normal distribution

We’ll quickly show how to use rnorm(n, mean=0, sd=1) to sample numbers from a normal distribution. And, then we’ll come back to the normal distribution later, because it is so important.

hist(rnorm(10000,0,1))

There it is, a bell-shaped normal distribution with a mean of 0, and a standard deviation of 1. You’ve probably seen things like this before. Now you can sample numbers from normal distributions with any mean or standard deviation, just by changing those parts of the rnorm function.

4.2.1.5 mixing it up

The r functions are like Legos, you can put them together and come up with different things. What if wanted to sample from a distribution that looked like a two-humped camel’s back? Just sample from rnorm twice like this… mix away.

hist( c( rnorm(100,25,5), rnorm(100,50,5)) )

4.2.1.6 summary

You can generate as many numbers as your computer can handle with R. PSA: Don’t ask R to generate a bajillion numbers or it will explode (or more likely just crash, probably won’t explode, that’s a metaphor).

4.2.2 sampling distribution of the mean.

Remember the sampling distribution of the sample means from the textbook? Now, you will see the R code that made the graphs from before. As we’ve seen, we can take samples from distributions in R. We can take as many as we want. We can set our sample-size to be anything we want. And, we can take multiple samples of the same size as many times as we want.

4.2.2.1 Taking multiple samples of the same size

Let’s take 10 samples from a normal distribution (mean = 0, and SD = 1). Let’s set the sample-size for each to be 20. Then, we’ll put them all in a data frame and look at 10 different histograms, one for each sample.

scores <- rnorm(10*20,0,1)
samples <- rep(1:10,each=20)
my_df <- data.frame(samples,scores)

First, look at the new my_df data frame. You can see there is a column with numbers 1 to 10, these are the sample names. There are also 20 scores for each in the scores column. Let’s make histograms for each sample, so we can see what all of the samples look like:

ggplot(my_df, aes(x=scores))+
  geom_histogram(color="white")+
  facet_wrap(~samples)+
  theme_classic()

Notice, all of the samples do not have the same looking histogram. This is because of random sampling error. All of the samples are coming from the same normal distributions, but random chance makes each sample a little bit different (e.g., you don’t always get 5 heads and 5 tails when you flip a coin right)

4.2.2.2 Getting the means of the samples

Now, let’s look at the means of the samples, we will use dplyr to get the means for each sample, and put them in a table:

library(dplyr)

sample_means <- my_df %>%
                group_by(samples) %>%
                summarise(means=mean(scores))

knitr::kable(sample_means)
samples means
1 -0.1286469
2 0.3367034
3 0.0576323
4 0.2521191
5 -0.2608940
6 -0.1039758
7 -0.0072817
8 -0.1982368
9 -0.1308235
10 -0.0581564

So, those are the means of our samples. What should the means be? Well, we would hope they are estimating the mean of the distribution they came from, which was 0. Notice, the numbers are all not 0, but they are kind of close to 0.

4.2.2.3 histogram for the means of the samples

What if we now plot these 10 means (of each of the samples) in their own distribution?

 ggplot(sample_means, aes(x=means))+
  geom_histogram(color="white")+
  theme_classic()

That is the distribution of the sample means. It doesn’t look like much eh? That’s because we only took 10 samples right.

Notice one more thing…What is the mean of our 10 sample means? This is a mean of means. Remember that.

mean(sample_means$means)
## [1] -0.02415602

Well, that’s pretty close to zero. Which is good. When we average over our samples, they better estimate the mean of the distribution they came from.

4.2.2.4 simulating the distribution of sample means

Our histogram with 10 sample means looked kind of sad. Let’s give it some more friends. How about we repeat our little sampling experiment 1000 times.

Explain…We take 1000 samples. Each sample takes 20 scores from a normal distribution (mean=0, SD=1). Then we find the means of each sample (giving us 1000 sample means). Then, we plot that distribution.

# get 1000 samples with 20 scores each

scores <- rnorm(1000*20,0,1)
samples <- rep(1:1000,each=20)
my_df <- data.frame(samples,scores)

# get the means of the samples

sample_means <- my_df %>%
                group_by(samples) %>%
                summarise(means=mean(scores))

# make a histogram

 ggplot(sample_means, aes(x=means))+
  geom_histogram(color="white")+
  theme_classic()

There, that looks more like a sampling distribution of the sample means. Notice it’s properties. It is centered on 0, which tells us that sample means are mostly around zero. It is also bell-shaped, like the normal distribution it came from. It is also quite narrow. The numbers on the x-axis don’t go much past -.5 to +.5.

We will use things like the sampling distribution of the mean to make inferences about what chance can do in your data later on in this course.

4.2.3 Sampling distributions for any statistic

Just for fun here are some different sampling distributions for different statistics. We will take a normal distribution with mean = 100, and standard deviation =20. Then, we’ll take lots of samples with n = 50 (50 observations per sample). We’ll save all of the sample statistics, then plot their histograms. We do the sample means, standard deviations, maximum values, and medians. Let’s do it.

all_df<-data.frame()
for(i in 1:1000){
  sample<-rnorm(50,100,20)
  sample_mean<-mean(sample)
  sample_sd<-sd(sample)
  sample_max<-max(sample)
  sample_median<-median(sample)
  t_df<-data.frame(i,sample_mean,sample_sd,sample_max,sample_median)
  all_df<-rbind(all_df,t_df)
}

library(ggpubr)
a<-ggplot(all_df,aes(x=sample_mean))+
  geom_histogram(color="white")+
  theme_classic()
b<-ggplot(all_df,aes(x=sample_sd))+
  geom_histogram(color="white")+
  theme_classic()
c<-ggplot(all_df,aes(x=sample_max))+
  geom_histogram(color="white")+
  theme_classic()
d<-ggplot(all_df,aes(x=sample_median))+
  geom_histogram(color="white")+
  theme_classic()

ggarrange(a,b,c,d,
          ncol = 2, nrow = 2)

From reading the textbook and attending lecture, you should be able to start thinking about why these sampling statistic distributions might be useful…For now, just know that you can make a sampling statistic for pretty much anything in R, just by simulating the process of sampling, measuring the statistic, doing it over a bunch, and then plotting the histogram. This gives you a pretty good estimate of the distribution for that sampling statistic.

4.2.4 Central limit theorem

We have been building you up for the central limit theorem, described in the textbook and in class. The central limit theorem is basically that the distribution of sample means will be a normal curve. We already saw that before. But, the interesting thing about it, is that the distribution of your sample means will be normal, even if the distribution the samples came from is not normal. Huh what?

To demonstrate this the next bit of code is modified from what we did earlier. We create 100 samples. Each sample has 1000 observations. All of them come from a uniform distribution between 0 to 1. This means all of the numbers between 0 and 1 should occur equally frequently. Below I plot histograms for the first 10 samples (out of the 100 total, 100 is too many to look at). Notice the histograms are not normal, they are roughly flat.

scores <- runif(100*1000,0,1)
samples <- rep(1:100,each=1000)
my_df <- data.frame(samples,scores)

ggplot(my_df[1:(10*1000),], aes(x=scores))+
  geom_histogram(color="white", bins=10)+
  facet_wrap(~samples)+
  theme_classic()+
  ylim(0,200)

We took samples from a flat uniform distribution, and the samples themselves look like that same flat distribution.

HOWEVER, if we now do the next step, and compute the means of each of our 100 samples, we could then look at the sampling distribution of the sample means. Let’s do that:

sample_means <- my_df %>%
                group_by(samples) %>%
                summarise(means=mean(scores))

# make a histogram

 ggplot(sample_means, aes(x=means))+
  geom_histogram(color="white", bins=15)+
  theme_classic()

As you can see, the sampling distribution of the sample means is not flat. It’s shaped kind of normal-ish. If we had taken many more samples, found their means, and then looked at a histogram, it would become even more normal looking. Because that’s what happens according to the central limit theorem.

4.2.5 The normal distribution

“Why does any of this matter, why are we doing this, can we stop now!!!!!! PLEEEEAASSEE, somebody HELP”.

We are basically just repeating what was said in the textbook, and the lecture, so that you get the concept explained in a bunch of different ways. It will sink in.

The reason the central limit theorem is important, is because researchers often take many samples, then analyse the means of their samples. That’s what they do.

An experiment might have 20 people. You might take 20 measurements from each person. That’s taking 20 samples. Then, because we know that samples are noisy. We take the means of the samples.

So, what researchers are often looking at, (and you too, very soon) are means of samples. Not just the samples. And, now we know that means of samples (if we had a lot of samples), look like they are distributed normally (the central limit theorem says the should be).

We can use this knowledge. If we learn a little bit more about normal distributions, and how they behave and work, we can take that and use it to understand our sample means better. This will become more clear as head into the topic of statistical inference next week. This is all a build up for that.

To continue the build-up we now look at some more properties of the normal distribution.

4.2.5.1 Graphing the normal distribution

“Wait, I thought we already did that”. We sort of did. We sampled numbers and made histograms that looked like normal distributions. But, a “normal distribution” is more of an abstract idea. It looks like this in the abstract:

normal_dist <- dnorm(seq(-4,4,.1), 0, 1)
values <-seq(-4,4,.1)
normal_df <-data.frame(values,normal_dist)

ggplot(normal_df, aes(x=values,y=normal_dist))+
  geom_line()+
  theme_classic()

A really nice shaped bell-like thing. This normal distribution has a mean of 0, and standard deviation of 1. The heights of the lines tell you roughly how likely each value is. Notice, it is centered on 0 (most likely that numbers from this distribution will be near 0), and it goes down as numbers get bigger or smaller (so bigger or smaller numbers get less likely). There is a range to it. Notice the values don’t go much beyond -4 and +4. This because those values don’t happen very often. Theoretically any value could happen, but really big or small values have really low probabilities.

4.2.5.2 calculating the probability of specific ranges.

We can use R to tell us about the probability of getting numbers in a certain range. For example, when you think about. It should be obvious that you have a 50% probability of getting the number 0 or greater. Half of the distribution is 0 or greater, so you have a 50% probability.

We can use the pnorm function to confirm this:

pnorm(0, mean = 0, sd= 1, lower.tail=FALSE)
## [1] 0.5

Agreed, pnorm tells us the probability of getting 0 or greater is .5.

Well, what is the probability of getting a 2 or greater? That’s a bit harder to judge, obviously less than 50%. Use R like this to find out:

pnorm(2, mean = 0, sd= 1, lower.tail=FALSE)
## [1] 0.02275013

The probability of getting a 2 or greater is .0227 (not very probable)

What is the probability of getting a score between -1 and 1?

ps<-pnorm(c(-1,1), mean = 0, sd= 1, lower.tail=FALSE)
ps[1]-ps[2]
## [1] 0.6826895

About 68%. About 68% of all the numbers would be between -1 and 1. So naturally, about 34% of the numbers would be between 0 and 1. Notice, we are just getting a feeling for this, you’ll see why in a bit when we do z-scores (some of you may realize we are already doing that…)

What about the numbers between 1 and 2?

ps<-pnorm(c(1,2), mean = 0, sd= 1, lower.tail=FALSE)
ps[1]-ps[2]
## [1] 0.1359051

About 13.5% of numbers fall in that range, not much.

How about between 2 and 3?

ps<-pnorm(c(2,3), mean = 0, sd= 1, lower.tail=FALSE)
ps[1]-ps[2]
## [1] 0.02140023

Again a very small amount, only 2.1 % of the numbers, not a a lot.

4.2.5.3 summary pnorm

You can always use pnorm to figure how the probabilities of getting certain values from any normal distribution. That’s great.

4.2.6 z-scores

We just spent a bunch of time looking at a very special normal distribution, the one where the mean = 0, and the standard deviation = 1. Then we got a little bit comfortable with what those numbers mean. 0 happens a lot. Numbers between -1 and 1 happen a lot. Numbers bigger or smaller than 1 also happen fairly often, but less often. Number bigger than 2 don’t happen a lot, numbers bigger than 3 don’t happen hardly at all.

We can use this knowledge for our convenience. Often, we are not dealing with numbers exactly like these. For example, someone might say, I got a number, it’s 550. It came from a distribution with mean = 600, and standard deviation = 25. So, does 545 happen a lot or not? The numbers don’t tell you right away.

If we were talking about our handy distribution with mean = 0 and standard deviation = 1, and I told I got a number 4.5 from that distribution. You would automatically know that 4.5 doesn’t happen a lot. Right? Right!

z-scores are a way of transforming one set of numbers into our neato normal distribution, with mean = 0 and standard deviation = 1.

Here’s a simple example, like what we said in the textbook. If you have a normal distribution with mean = 550, and standard deviation 25, then how far from the mean is the number 575? It’s a whole 25 away (550+25 = 575). How many standard deviations is that? It’s 1 whole standard deviation. So does a number like 575 happen a lot? Well, based on what you know about normal distributions, 1 standard deviation of the mean isn’t that far, and it does happen fairly often. This is what we are doing here.

4.2.6.1 Calculating z-scores

  1. get some numbers
some_numbers <- rnorm(20,50,25)
  1. Calculate the mean and standard deviation
my_mean <- mean(some_numbers)
my_sd <-sd(some_numbers)

print(my_mean)
## [1] 47.45589
print(my_sd)
## [1] 25.35837
  1. subtract the mean from your numbers
differences<-some_numbers-my_mean
print(differences)
##  [1]  31.373480 -58.384659  38.706068  14.058252  19.923460  31.389225
##  [7]   7.068542  -8.836754 -40.910669  -6.171537  17.398077  23.856998
## [13] -13.742795   9.220430 -29.682059 -14.521087   5.864580 -20.057430
## [19] -13.552452   7.000330
  1. divide by the standard deviation
z_scores<-differences/my_sd
print(z_scores)
##  [1]  1.2372041 -2.3023821  1.5263625  0.5543831  0.7856759  1.2378250
##  [7]  0.2787459 -0.3484748 -1.6133003 -0.2433728  0.6860881  0.9407938
## [13] -0.5419431  0.3636050 -1.1705034 -0.5726348  0.2312680 -0.7909589
## [19] -0.5344370  0.2760560

Done. Now you have converted your original numbers into what we call standardized scores. They are standardized to have the same properties (assumed properties) as a normal distribution with mean = 0, and SD = 1.

You could look at each of your original scores, and try to figure out if they are likely or unlikely numbers. But, if you make them into z-scores, then you can tell right away. Numbers close to 0 happen a lot, bigger numbers closer to 1 happen less often, but still fairly often, and numbers bigger than 2 or 3 hardly happen at all.

4.2.7 Generalization Exercise

(1 point - Pass/Fail)

Complete the generalization exercise described in your R Markdown document for this lab.

  1. Simulate the sampling distribution of the mean for sample-size =10, from a normal distribution with mean =100, and standard deviation = 25. Run the simulation 1000 times, taking 1000 samples, and computing the sample mean each time.
  1. Plot the sampling distribution of the mean in a histogram
  2. Report the mean of the sampling distribution of the mean
  3. Report the standard deviation of the sampling distribution of the mean
  1. Repeat all of the above, except change the sample-size to 100 for all simulations
  1. Plot the sampling distribution of the mean in a histogram
  2. Report the mean of the sampling distribution of the mean
  3. Report the standard deviation of the sampling distribution of the mean

4.2.8 Writing assignment

(2 points - Graded)

Complete the writing assignment described in your R Markdown document for this lab. When you have finished everything. Knit the document and hand in your stuff (you can submit your .RMD file to blackboard if it does not knit.)

  1. Explain the concept of sampling error (1 point)

  2. Explain why the standard deviation of the sampling distribution of mean gets smaller as sample-size increases (1 point)

General grading.

  • You must write in complete sentences. Point form sentences will be given 0 points.
  • Completely incorrect answers will receive 0 points.
  • If your answer is generally correct but very difficult to understand and unclear you may receive half points for the question

4.3 Excel

How to do it in Excel

4.4 SPSS

In this lab, we will use SPSS to transform a set of raw data into z-scores using two methods:

  1. Saving data as standardized values
  2. Computing a new variable manually

Let’s first begin with a short data set we will enter into a new SPSS data spreadsheet. Let’s use the following data set: {x= 1, 3, 2, 5, 4, 6, 5, 8, 9}. Enter these into SPSS and name them appropriately.

4.4.1 Saving data as standardized values.


This method is a quick and simple way to turn a set of scores in raw format into z-scores. First, go to Analyze, then Descriptive Statistics, then Descriptives:

A window will appear, asking for you to specify for which variable you would like descriptive statistics. Move the x variable into the right field using the arrow. Then, make sure to check off Save standardized values as variables:

Now, click Ok.

SPSS will produce an output table containing some basic descriptive statistics for the x variable:

However, you can ignore this window for now. Go back to data view in your spreadsheet and you will notice a new column of data (a new variable) as been created called Zx. This variable contains the z-scores that correspond to the values in the x columm:

4.4.2 Computing variables


If you use the method shown above to convert a set of measurements to z-scores, you don’t need to use this method to do the same thing. However, computing new variables may be useful for more than just converting scores to z-scores, so it’s worth illustrating as well. The take-away message here is that you can use SPSS to create new variables based on existing ones.

To illustrate how this is done, let’s work with the data sheet we used in the previous example. Currently, it includes a column for our x variable, and the new Zx variable we created using the Analyze > Descriptive Statistics > Descriptives menu.

This time, we are going to turn our x variable measurements into z-scores manually. In order to do this, we’re going to need the necessary components to calculate z-scores: We will need the mean and standard deviation of the x variable. Remember, in the previous section a table of descriptive statistics was generated:

We know that the mean of variable x is 4.78 and the standard deviation is 2.635. In order to turn every measurement in variable x into a z-score, then, we must subtract 4.78 from each x, then divide that difference by 2.635. We are going to ask SPSS to do this by going to the Transform menu, and then selecting Compute Variable.

This will open the Compute window:

Here, let’s first enter a name for the new variable we’re about to create. In the field labeled “Target Variable”, enter a name for your new variable. Let’s try new_z as the name.

Next, in the field labeled “Numeric Expression”, we will specify the calculation to be made. Because I want SPSS to take the variable x, subtract 4.78 from it, and then divide this difference by 2.635, I am going to type in this formula: 1. First, type in an open parenthesis 2. Second, select x from the field on the left and move it to the “Numeric Expression” field using the arrow. 3. Type a minus sign, followed by 4.78 4. Type in a closed parenthesis. 5. Type in a backslash “/” (this specifies division) 6. Type in the standard deviation, 2.635

Your Compute window should now look like this:

Now, click OK. SPSS will generate a blank output window that confirms you have computed a variable. You may minimize or close this window and go back to your SPSS spreadsheet. It should now have a new variable listed, new_z, and the computed z-scores should look very similar to the previously calculated z-scores (any differences are due to using rounded figures):

You can use this Compute Variable window to calculate z-scores or make any calculations using existing variables.

4.4.3 Practice Problems


“Professor, do you grade on a curve?”

This is probably the most commonly-asked question in Statistics class. Everyone assumes that grading on a curve will benefit them. But does it?

Here is a link to an SPSS file containing 50 students’ exam grades (let’s say it’s the final exam for a Statistics class).

  1. Create a table containing the mean and standard deviation for this sample of scores. Now, produce a frequency histogram of the score data. Describe the distribution.

  2. Transform each student’s score into a Z-score (you can use either method shown in this tutorial). Now, plot the frequency histogram of this Z-score distribution. Compare it to the raw score distribution. How are they the same? How are they different?

  3. Imagine you are a student in this class who received a 90 on this exam. However, the Professor has decided to GRADE ON A CURVE (shock! awe!), such that only the top 10% of the class receives an A (this professor only gives whole grades, no minuses or pluses). Calculate the z-score that corresponds to a raw score of 90 on this exam. Will you get an A with this grade? Why or why not?

4.5 JAMOVI

How to do it in JAMOVI

4.6 CogStat

4.6.1 Goals

  • Overviewing distributions

  • Sampling from distributions

  • Sampling distribution of the mean

  • Sampling statistics (statistics of many samples)

  • Normal Distribution

4.6.2 Sample

In the manual describing other software packages, there are some chapters where generated data are used. This is easy to implement in R, but since CogStat cannot be used for modifying the data, it cannot generate data either. However, similar to data modification, we may generate these data with the help of a spreadsheet.

To solve the task you can load the data file “4.2_10randomnumber” from the demo data or you can generate the data for yourself. For this, open Excel. There is an easy way to generate data using the function “=RANDBETWEEN()”, which will generate a random number between two given parameters if you write it into a cell or the formula bar. The formula can be pasted to other cells, so more random data can be generated.

This function acts like an endless gumball machine. You put the gumballs inside with different properties, say numbers between 1 and 10, and then you let it take them out. Here the function randomly picks numbers between 1 and 10 each time. We did this 10 times, so our sample has 10 data points in it, but we could do as many times as we would like to. The more random number we would generate the more it would show us a smooth normal distribution.

Let’s take a look at our data after opening it with CogStat. We can run an “Explore variable” analysis on it to get the frequency details. Note, that the generated data in Excel is regenerating itself with every change made to the spreadsheet, so you might see slightly different results each time.

We can also generate a variable with values where an exact number has more chance to appear. This way the generated data still remains random, but its distribution will change. In Excel, this can be done by generating random data as before in cell A1. After that, in B1 another formula must be used, write =IF(A1>5;5; A1), this way more 5s should appear in the dataset (row B) than before when we were just generating random numbers. This should also be visible in the distribution.

You can also open this file from demo data, open “4.2_morefives and choose the second column “rand_changed” for analysis. We should choose “Explore variable” to take a look at the second, changed variables’ frequency traits.

As you see more than half of the generated numbers turned out to be 5 (54% of all).

4.6.3 Uniform distribution

A uniform distribution is flat, and all the numbers between the minimum and maximum should occur roughly equally. In the theorem, each value has the same likelihood to appear. For this distribution, two parameters of the data are fundamental: the minimum and the maximum values. We can sample random numbers between any range for the uniform distribution.

Let’s take 100 random numbers between 0 and 5 and plot the histogram. You can either open the file 4.3_uniform from the demo data or generate your own data. To do this, we will use the “=RANDBETWEEN()” command in the spreadsheet. Write “=RANDBETWEEN(0;5)” into the formula bar or in a given cell.

To make the hundred numbers take the lower right corner of the square and drag it down until the 101th row. Copy and paste this variable into CogStat and choose “Explore variable” to see the descriptives and histogram.

These histograms look flattish but not perfectly flat. After all, we only took 100 samples. What if we took many more, say 10,000 samples? Now it looks flatter, each bin is occurring about 500 times each, which is pretty close to the same amount.

4.6.4 Binomial distribution

In the case of a binomial distribution, there are only two possible outcomes, and the chance for each is the same. Every repetition is independent of the others and can result in the desired outcome. Two parameters are fundamental here: the number of trials taken and the likelihood of success. For example, consider flipping a coin. It will be heads or tails, and the coin, if it is fair, should have a 50% chance of being heads or tails. So we would like to see how many times will the desired outcome happen from X trials, we can investigate how many times will the coinflip results heads from ten coinflips. If we make several trials from this, we can get a binomial distribution.

To demonstrate this, in a spreadsheet, you can generate numbers in the following way, shown in the picture below.

In the first row from A to J generate a random number between 0 and 1, with =RANDBETWEEN(0;1). After that in the cell K1 use =SUMIF(A2:J2;1). This will count how many ones are in the cells between A2 and J2 and this is the result of ten continuous coinflip. We can pretend that one means heads and zero means tails. So, here from ten coinflips, we had 4 heads. We can take more cases and create a variable from the coinflip sums.

This way we have in row K the sums of all trials when we flipped a coin ten times and got heads. This variable contains 100 trials, but we could do this as many times as needed. This file is available in the demo data under the name “4.4_binomial”. The variable, named “sum” can be analysed in CogStat, so load this dataset and choose the “Explore variable” analysis. We can see the frequency traits of the variable in a histogram and frequency tables in Raw data and in Sample properties.

The histogram shows us the frequency of observing different numbers of heads (for 10 flips) across a hundred simulations. Five happens the most, but two happens sometimes, and so does eight. All of the possibilities seem to happen sometimes, some more than others.

4.6.5 Normal distribution

Normal distribution might be the most important one of all, that is what we usually see or assume in our statistical analysis when examining our sample. It has the shape of a bell, is symmetric, and has only one peak. Note, that not every distribution that has these traits are normal. Its necessary parameters are the mean and standard deviation. The picture below shows how normally distributed data might look like on a histogram. This one in the example has a mean of zero and a standard deviation of one in a sample of ten thousand participants.

Unlike the distributions mentioned earlier, the bell curve is continuous, which means that values are not interrupted with gaps and a variable can take on any values between a given range. The bell curve is commonly used in statistics and is especially important when testing a hypothesis. An example of this is the curve that IQ follows across people.

Picture from: (Wikipedia- https://en.wikipedia.org/wiki/Intelligence_quotient)

Here it can be seen that most people are located around the mean, which is 100, and fewer and fewer people have higher or lower scores. The bins with the highest and lowest scores contain only a tiny percentage of people.

If you open the data “4.5_normal” with CogStat and choose “Explore variables” on the variable, you can see the normal distribution of this generated data, as well as the data parameters. You can also generate the data yourself with the “NORMINV(RAND();100;20) function in a spreadsheet. After loading the dataset first, take a look at the raw data.

If you see the scatterplot or you take your time to take a look at the dataset, you might realise what we already talked about, the data points can take any values in the range given and are quite symmetrical.

4.6.6 Sampling distribution of the mean

The sampling distribution of the mean refers to a distribution obtained from multiple other samples’ mean. Which means a variable which contains several means. We will work with the data shown below. Here we see these close-to-zero numbers, each number represents the mean of one random sample. For example, sample 1’s mean is 0.4492527, sample 2’s mean is -0.0918667, and so on. These means taken as new datapoints will be used as a variable for our analysis in this section.

The dataset “4.7_sdm” contains means of generated, normally distributed samples of ten, with a mean of zero and a standard deviation of one in the second coloumn. All these means come from the same normal distribution, but many of them are quite different from each other. This is because of random sampling error, random chance makes each sample a little bit different (e.g., you don’t always get five heads and five tails when you flip a coin). Open from demo data the file “4.7_sdm” and run an “Explore variable” analysis on the “sample_means” variable.

This is the distribution of the sample means in the CogStat output. As you see it does not look like a normal distribution at all, and that is because we had only ten samples. But if we calculate the mean of the means we get around 0.037, which is pretty close to zero. This is good since when we average over our samples the mean of the means better estimates the mean of the population distribution they came from.

If we repeat this sampling experiment a thousand times the distribution of the histogram will look more like normal. So what we do here is take a look at the descriptives of the means of thousand sample means. This would look similar to the picture shown below.

Notice the properties of the histogram, the curve seems to be bell-shaped and the sample means are mostly around zero. It is also quite narrow. The numbers on the x-axis do not go much past -0,4 to +0,4.

If we take a look at the descriptives we can see that the mean is much closer to zero than in the previous task when we only used ten samples.

Probably if we would repeat this task, for example, 100000 times or even more, we would get a mean closer and closer to zero. The importance of all this is, that we investigate the distribution of the sample means. This is what we call the sampling distribution of the mean. As we saw, with the changes in the number of samples taken the curve of the distribution also changes a bit as it becomes less noisy.

It also shows that the more data you have, the more accurate the mean’s mean will be in this task. With only ten number of samples, the mean does not represent properly the whole dataset. As you already saw, if we calculate the mean of a sample of ten, again and again, the means can be quite different from one another. With a dataset of a hundred or thousand, it is more likely to be a very similar mean after repeating the test.