# Statistical Computing Assignment Paper

Statistical Computing

# Problem A. Random number generation and power.

**Definition:** The **power** of a statistical test is the probability the test correctly rejects the null hypothesis when it is indeed false.
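In symbols, one common way to write this definition is shown below. The symbol $\beta$ for the probability of a Type II error is standard textbook notation and is an assumption here, since this assignment does not define it.

$$\text{power} = P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta$$

Here $\beta$ is the probability of failing to reject a null hypothesis that is false.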

1. Let’s explore the rnorm function. The rnorm() function in R randomly generates data from a normal distribution with a specified mean and a specified standard deviation. Recall that we saw the exact equation of the probability density function in the notes. The r in rnorm stands for “random.” The data are drawn at random according to the probabilities dictated by that density function.

a. Use the set.seed() function with a seed number of your choosing. Then complete the following: Use rnorm(500) and draw a histogram using ggplot2 tools of the resulting data values. Enter your code in the space provided below so that when you **knit** this document it will show the code and the histogram.

```{r problem1a}

```
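Because rnorm() draws at random, set.seed() is what makes a knitted document give the same output each time. The following is a minimal sketch of that behavior, not part of the required answer; the seed 42 and the object names are arbitrary choices made for illustration.

```r
# Illustration only: the seed 42 and the names below are arbitrary assumptions.
set.seed(42)
first_draw <- rnorm(3)

set.seed(42)                 # resetting the seed restarts the random stream
second_draw <- rnorm(3)

third_draw <- rnorm(3)       # no reset, so the stream simply continues

identical(first_draw, second_draw)   # TRUE: same seed, same values
identical(first_draw, third_draw)    # FALSE: the stream has moved on
```

This is why a later call to rnorm() in the same session produces new values even though the seed was set only once.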

b. Use rnorm(500) again and draw a histogram using base tools of the resulting data values. Note: You do not need to use the set.seed() function again. Enter the code in the space provided below.

```{r problem1b}

```

c. Write a few sentences to describe your histograms in parts a) and b). Also, explain what you think the 500 in the code represents. Type your answers below in plain text:

d. Based on a) and b), what do you think the default mean is for the rnorm function? How did the graphs inform your answer? Type your answers below.

e. Do you know what the default standard deviation is? How did you determine this?

f. Use rnorm(500,100,5) and draw a histogram (base or ggplot2) of the resulting data values. Report your code in the space provided below.

```{r problem1f}

```

g. Use rnorm(500,100,5) again and draw a histogram (base or ggplot2) of the resulting data values. Report your code in the space provided below.

```{r problem1g}

```

h. Based on f) and g), what does the second argument (the 100) of the rnorm function do? How did the graphs provide evidence of this for you?

i. What does the third argument (the 5) of the rnorm function do?

j. Generate 1000 observations from the F distribution with 5 numerator degrees of freedom and 10 denominator degrees of freedom. Draw a histogram (base or ggplot2) of the data values.

```{r problem1j}

```

k. Write a few sentences to describe the histogram you generated in part j.

*Now that you have an understanding of how the rnorm function works, complete the following problems. These items go together to investigate the statistical concept of power.*

2. Create two scalars named rows and samplesize with the values of 10 and 3, respectively. (We will later change these values to 1000 and 30, but while you are getting your code to run, these smaller values let you print objects and inspect what is going on.)

```{r problem2}

```

3. Use the set.seed(1000) function so that we all are randomly generating from the same starting point.

```{r problem3}

```

4. Use the rnorm() function in R in order to randomly generate data from a normal distribution with a mean of 100 and a standard deviation of 5. You should generate enough values to fill in a matrix with the number of rows and the number of columns given by the objects rows and samplesize that you created in #2. You should not retype their values. Instead, reference the objects so that later you can make the change once in the code to explore other rows and sample size options.

```{r problem4}

```

5. Create a matrix of the data values you randomly generated in #4 that has the number of rows given by the scalar you created in #2. Also include code to print the matrix you created.

```{r problem5}

```
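Items 4 and 5 rely on two general ideas: computing a length from stored objects instead of retyping numbers, and knowing how matrix() fills its cells. The sketch below shows both on toy data; the names n_groups and group_size and the values 4 and 2 are hypothetical and deliberately differ from the assignment's objects.

```r
# Hypothetical names and values, chosen only for illustration.
n_groups   <- 4
group_size <- 2

# The total count is computed from the stored objects, never retyped.
toy_values <- seq_len(n_groups * group_size)     # 1, 2, ..., 8

by_column <- matrix(toy_values, nrow = n_groups)                 # default: fills down columns
by_row    <- matrix(toy_values, nrow = n_groups, byrow = TRUE)   # fills across rows

dim(by_column)    # 4 2
by_column[1, ]    # 1 5
by_row[1, ]       # 1 2
```

For independent random draws the fill order does not change the statistical properties of the rows, but it does change which values land in which row, which matters when you check a row by hand.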

6. Calculate the mean of each row of your matrix and store this information in an object called mymeans. Print the output. Answer the question that follows.

```{r problem6}

```

What do you notice about the sample means from your samples and the mean of the normal distribution from which they were drawn?

7. Calculate the standard deviation of each row of your matrix and store this information in an object called mysd. Print the output. Answer the question that follows.

```{r problem7}

```

What do you notice about the sample standard deviations from your samples and the standard deviation of the normal distribution from which they were drawn?
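Items 6 and 7 both summarize a matrix one row at a time. One general tool for this is apply(), sketched below on a toy matrix. The object names and the choice of median() as the summary function are assumptions made purely for illustration.

```r
# Toy matrix; the name toy and the choice of median() are illustrative assumptions.
toy <- matrix(c(1, 4, 2, 8, 3, 6), nrow = 2)   # rows are (1, 2, 3) and (4, 8, 6)

# MARGIN = 1 applies the function to each row; MARGIN = 2 applies it to each column.
row_medians <- apply(toy, 1, median)
col_medians <- apply(toy, 2, median)

row_medians          # 2 6
length(row_medians)  # 2, one value per row
length(col_medians)  # 3, one value per column
```

A quick check that the result has one value per row is a cheap way to confirm that the margin was chosen correctly.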

8. Suppose you planned to test the hypotheses of $H_0: \mu = 107$ vs. $H_a: \mu \neq 107$ in order to determine if the mean of the population from which your data is drawn is different from 107.

a. Question: What is the true mean of the population from which this data is drawn? Type your answer below in plain text.

b. Question: Therefore, what do you expect your p-value would look like (small or large)? Type your answer below in plain text.

c. Question: Therefore, what is the correct outcome of this test (reject the null or do not reject the null)? Type your answer below in plain text.

d. For this test, you would calculate the test statistic as $$t=\frac{\bar{x}-107}{s/\sqrt{n}}$$

Using the objects you created in #6 and #7 and knowing that you created an object in #2 called samplesize, which indicates the size of your sample, calculate this test statistic for every row of your data using R. You do not need the apply function; this is just a calculation involving vectors! Save your vector of test statistics in an object called test.stat. Be very careful with parentheses!! I would calculate the numerator and denominator separately and then divide if I were you! Your code should ultimately print the object test.stat.

```{r problem8d}

```
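The warning about parentheses comes down to R's element-wise vector arithmetic and operator precedence. The following sketch uses made-up vectors (a, b, and k are hypothetical and unrelated to the assignment's objects) to show how building the numerator and denominator separately avoids a silent precedence error.

```r
# Toy vectors; every name and number here is made up for illustration.
a <- c(10, 20, 30)
b <- c(2, 4, 5)
k <- 4

top    <- a - 5          # 5 15 25
bottom <- b / sqrt(k)    # 1.0 2.0 2.5
good   <- top / bottom   # 5.0 7.5 10.0

# Dropping the parentheses changes the answer: division binds tighter than subtraction.
bad <- a - 5 / b / sqrt(k)   # 8.750 19.375 29.500
```

Both lines run without any error or warning, so the only way to catch the second form is to check one element by hand.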

e. Report the data values (should be 3 values) from the first row of your matrix of data values by using R code to print them. Report the mean and standard deviation of these three data values by using R code to calculate them. Use R as a calculator to calculate the test statistic for these three data values only. Then print the first element of the object test.stat to verify that it was calculated correctly. Beware of order of operations (PEMDAS)!

```{r problem8e}

```

9. We can use the following code to calculate the two-sided p-value:

pvals <- 2*pt(abs(test.stat),lower.tail=_____________,df=________)

Note that the abs() function in R calculates absolute values. I did this because I wanted to only work with the positive version of the test statistics. Since the t-distribution is symmetric, I can use mirror images to calculate p-values more efficiently. Additionally, I can multiply by 2 in order to find the two-sided p-value. Complete the two blanks in the code and run it below. Your outcome should be a vector with one p-value for each row of the data. Print your vector pvals also.

```{r problem9}

```
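The "mirror image" argument in item 9 can be checked numerically. The sketch below uses illustrative values (t0 = 1.5 and df0 = 4 are assumptions, not taken from the assignment) and only the default behavior of pt(), which returns the area to the left of its first argument.

```r
# Illustrative values only: t0 and df0 are not taken from the assignment.
t0  <- 1.5
df0 <- 4

left_tail  <- pt(-t0, df = df0)       # area to the left of -t0
right_tail <- 1 - pt(t0, df = df0)    # area to the right of +t0

all.equal(left_tail, right_tail)      # TRUE, because the t-distribution is symmetric about 0

# So the area in both tails together is twice the area in either one.
both_tails <- left_tail + right_tail
all.equal(both_tails, 2 * right_tail) # TRUE
```

This is the reason a two-sided p-value can be obtained by computing a single tail area and doubling it.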

10. Find the proportion of times the p-values were less than 0.05. Note that you should use a logical comparison to get TRUE/FALSE values for each p-value based on if it is less than 0.05. Then fill in the blanks in the two interpretations below.

```{r problem10}

```

We can interpret the proportion or percentage you calculated in the previous problem as follows:

Interpretation 1: ___% of the time we correctly rejected the null hypothesis that $H_0: \mu = 107$ when the true population mean is $\mu = 100$ based on samples of size 3.

OR

Interpretation 2: The probability of correctly rejecting the null hypothesis of $H_0: \mu = 107$ when the true population mean is $\mu = 100$ is about ___ based on samples of size 3.
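The logical comparison in item 10 works because R treats TRUE as 1 and FALSE as 0 in arithmetic, so the mean of a logical vector is the proportion of TRUE values. Here is a minimal sketch with made-up p-values and a made-up cutoff of 0.10; none of these numbers come from the assignment.

```r
# Made-up p-values and a made-up cutoff of 0.10, for illustration only.
toy_p  <- c(0.20, 0.01, 0.50, 0.03)
cutoff <- 0.10

is_small <- toy_p < cutoff   # FALSE TRUE FALSE TRUE
sum(is_small)                # 2: TRUE counts as 1, FALSE as 0
prop_small <- mean(is_small) # 0.5: the proportion of TRUE values
```

Multiplying the proportion by 100 gives the percentage form used in the first interpretation.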

### The remaining problems ask you to re-run your code from #2-10 making changes each time. Copy and paste your code and re-run it to answer the following questions. Do not edit your original code (so that I can grade questions #2-10). Instead, copy and paste all of the individual lines of code from #2-#10 and make the necessary changes. HOWEVER, REMOVE ANY PRINTING OF OBJECTS. I DON’T WANT TO SEE ANYTHING THAT IS 1000 LINES LONG!

11. Our answer in #10 was based on seeing the process repeated a mere 10 times. That’s not enough to see the long-term patterns! Copy your code from #2-10 and paste it below. **Remove any lines of code that would print out objects.** For this problem, I want you to change the number of rows to 1000 so that we can look at the long-term proportion of times the null hypothesis is correctly rejected. Your code should ultimately print out the proportion of times you obtained a p-value of less than 0.05 and therefore rejected the null hypothesis. Also, write a sentence to summarize the proportion you find in context as indicated in the space after your R code.

```{r problem11}

```

Type your sentence in plain text here:

12. Our answer in #10 was based on seeing the process repeated a mere 10 times, but it was also based on samples of size 3. That’s not very interesting! Copy your code from #11. For this problem, I want you to keep the number of rows at 1000 and change the samplesize to 30 so that we can look at the long-term proportion of times the null hypothesis is correctly rejected for a larger sample size. Your code should ultimately print out the proportion of times you obtained a p-value of less than 0.05 and therefore rejected the null hypothesis. Type your code below. Also, write a sentence to summarize the proportion you find in context as indicated in the space after your R code and answer the questions that follow by typing in plain text.

```{r problem12}

```

Type your sentence in plain text here:

Is this what you would expect? (YES/NO)