Monday, December 13, 2010

Preparing for the 1st semester final

Topics on the AP Stat 1st Semester Final Exam

Types of graphs, their advantages and disadvantages, their interpretations
Measures of center and spread, their calculation, their different meanings, and their uses
Measures of position, converting back and forth among different measures (i.e. percentile, observed value, and z-value)
Probabilities associated with continuous random variables (area under the curve, normalcdf, empirical rule, Chebyshev’s theorem)
Probabilities associated with discrete random variables, multiplication property, addition property, independence, conditional probabilities
Special discrete random variables—binomial and geometric distributions
Relationships in two variables – linear regression, residuals, interpreting regression output, correlation coefficient, coefficient of determination
Means and standard deviations of combinations and transformations of random variables
Design of surveys, types of bias, types of sampling
Design of experiments, methods of randomizing, methods of control, matched pairs design, blocking, causation
Vocabulary: for instance, outliers, clusters, gaps, population, sample, variance, influential observations

Answer the following two questions on two separate sheets of paper. Your response to question one will be graded as a small test grade. Your response to question two will be the free response part (take-home portion) of your final exam. Your answers may be hand-written or typed, but must be legible and complete. Computed numbers that are unsupported by their calculations will be given no credit. You may NOT work together on this assignment.

Question One: Using an example from the second half of your selected book (Bringing Down the House, Freakonomics, etc.), explain a specific connection to one of the topics in the list of exam topics above. You may get creative with your product for this question. It may be in the form of a Powerpoint, a 9”x12” poster, or other appropriate written or mixed media form. Interpretive dance is not appropriate.

Question Two: Answer the problem handed out in class on a separate sheet of paper. You must work alone on this problem.

Monday, December 06, 2010

Distributions of Random Variables

11/19/2010
We're combining parts of chapters 6-8 to build on students' prior understanding of probability.

First up: Geometric and binomial probabilities
geometric probabilities:
Know the 4 characteristics that define a geometric distribution
Know how to find the expected value of x
Know how to find probabilities for values of x (both individual probabilities and cumulative probabilities)

binomial probabilities:
Know the 4 characteristics that define a binomial distribution
Know how to find the expected value of x
Know how to find probabilities for values of x (both individual probabilities and cumulative probabilities)

Be able to identify a binomial or geometric distribution when you read a problem.
Define the random variable x.
Solve problems related to probabilities for these distributions.
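If you want to check your calculator answers, the formulas behind binompdf/binomcdf and their geometric cousins can be sketched in a few lines of Python (the n and p values here are just examples):

```python
from math import comb

# Binomial: P(X = k) = C(n, k) * p^k * (1-p)^(n-k); E[X] = n*p
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Geometric: P(X = k) = (1-p)^(k-1) * p for k = 1, 2, 3, ...; E[X] = 1/p
def geom_pmf(k, p):
    return (1 - p)**(k - 1) * p

# Cumulative probabilities are running sums of the individual ones
def binom_cdf(k, n, p):
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def geom_cdf(k, p):
    return 1 - (1 - p)**k   # P(X <= k): a success within the first k trials

# Example: 10 coin flips with p = 0.5
print(binom_pmf(5, 10, 0.5))   # about 0.2461, matches binompdf(10, .5, 5)
print(geom_pmf(3, 0.5))        # first head on flip 3: 0.125
```

The cumulative functions mirror binomcdf and geometcdf on the calculator.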


Next up: any other discrete distributions
Work with valid probability distributions (individual and cumulative)

Apply the concept of independent events to joint probability problems.

Apply the concepts of disjoint sets and complements to find probabilities.

Find the means and standard deviations of transformations of a random variable and combinations of independent random variables. You'll have to bookmark these pages and study them A LOT!
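The bookmarked rules boil down to two facts: a linear transformation aX + b scales the mean and the SD, and variances (never standard deviations) add for combinations of independent variables. A small Python sketch with made-up means and SDs:

```python
import math

# Transformation Y = aX + b: mean becomes a*mean + b, SD becomes |a|*SD
def transform(mean_x, sd_x, a, b):
    return a * mean_x + b, abs(a) * sd_x

# Combination of INDEPENDENT X and Y; variances add even for X - Y
def combine(mean_x, sd_x, mean_y, sd_y, sign=1):
    mean = mean_x + sign * mean_y
    sd = math.sqrt(sd_x**2 + sd_y**2)
    return mean, sd

print(transform(10, 2, 3, 5))    # (35, 6)
print(combine(10, 3, 6, 4, -1))  # (4, 5.0) -- note the SDs do NOT subtract
```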


Monday December 6, 2010
Testing on Thursday on distributions of random variables. We will be learning new material through Wednesday. Be here!

Thursday, November 04, 2010

Producing Data

So we've moved into survey design and experimental design. (Chapter 5 in the text.)

Important vocabulary:
Population
sample
census
bias
nonresponse bias
undercoverage
response bias
convenience sample
stratified random sample
systematic random sample (like The Lottery by Shirley Jackson)
simple random sample (SRS)


We used the Table of Random Digits to (1) pick a sample, (2) Simulate a random event, and (3) randomly allocate participants to experimental treatments.
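For students who want to experiment at home, here's a rough Python stand-in for use (1), picking an SRS from a string of random digits. The seed, digit count, and population labels are all made up:

```python
import random

# A stand-in for one line of the Table of Random Digits
# (the seed is arbitrary, just to make the sketch reproducible)
random.seed(2010)
digits = "".join(str(random.randrange(10)) for _ in range(200))

# Pick an SRS of 5 from a population labeled 01-30: read two digits
# at a time, skipping out-of-range labels and repeats
labels, i = [], 0
while len(labels) < 5 and i + 2 <= len(digits):
    pair = int(digits[i:i+2])
    i += 2
    if 1 <= pair <= 30 and pair not in labels:
        labels.append(pair)

print(labels)
```

This is the same procedure you follow with the printed table: ignore labels outside the population and ignore repeats.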

We have looked at a few experimental design/survey design questions from previous exams.

HW due Thursday 11/4: Problems 5.2, 5.3, 5.10, 5.11
HW due Friday 11/5: Written answers on your own paper to the 1998 and 2002 experimental design questions handed out in class.

Friday, October 22, 2010

Bivariate distributions

We've moved from investigations using single variables to the world of two variables. The first type of bivariate relationship we study is the relationship between two numerical variables.

We collected data on Monday and crunched numbers again on Tuesday to find the least squares regression line and the coefficient of determination, R^2.

On Wednesday we studied the correlation coefficient, the slope and intercept of the LSRL, and the patterns in residuals. We saw that the point (x-bar, y-bar) lies on the LSRL and that R^2 may not be an indicator of a good model.

Based on your new knowledge of these concepts, please expand your 4 inch summary of section 3.1 to 5 inches of strong content.

Do the problems in section 3.1 that relate to the manatees and to the archaeopteryx.

Quiz answers:
scatterplot - points have a strong positive linear pattern with no outliers. Graph should have labels and scale.
LSRL y-hat = -10.64 + 4.117x
Residuals - y - y-hat graphed against x. To compute residuals, use L2 - Y1(L1). The graph shows that the residuals fluctuate above and below the axis with varying distances.
Interpretation - Because the residuals can be interpreted as randomly scattered about the residual = 0 line, the linear model is good.
Caveat- Because the residuals seem to be getting further from residuals = 0 as x gets larger, we might be concerned about our error increasing as length increases. Beware telescoping residuals.
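The same workflow--LSRL, residuals, R^2--can be sketched in Python. The data here are hypothetical, not the quiz data:

```python
from statistics import mean, stdev

# Hypothetical data, just to show the mechanics of the quiz workflow
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 5.9, 8.2, 9.8]

xbar, ybar = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)

# correlation coefficient r, then slope and intercept of the LSRL
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((len(x) - 1) * sx * sy)
b = r * sy / sx       # slope
a = ybar - b * xbar   # intercept: (x-bar, y-bar) always lies on the LSRL

# residual = observed y minus predicted y-hat
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(round(b, 3), round(a, 3), round(r**2, 3))  # slope, intercept, R^2
print([round(e, 2) for e in residuals])          # should scatter around 0
```

On the calculator this is LinReg(a+bx) plus the L2 - Y1(L1) trick for residuals.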


October 8, 2010

This week students should have completed problems 3.35, 3.36, & 3.37 from the text. For HW due Monday, they need to complete problems 3.39, 3.40, & 3.48.

What have we done so far? Collected bivariate data. Looked at them. Computed the LSRL. Computed residuals. Interpreted the residuals and the slope of the LSRL. Used the LSRL to predict a value of y. Performed a Linear regression t-test to determine the significance of the slope.

What do we have to do? Practice and interpret outputs.

ALSO, pick a book. Suggestions: A Civil Action, Freakonomics, Bringing Down the House, Moneyball, And the Band Played On, The Lady Tasting Tea. Get your parents' permission to read your book. You should have it finished by the end of Thanksgiving break.

Monday, October 11, 2010

We collected data that we expected to have no correlation. In 13 of the 15 cases, we got what we expected. We graphed the ordered pairs, computed the LSRL, checked the residuals, performed a linear regression t-test, and interpreted the results.
Small p-value>>> reject the null hypothesis--that there is no linear relationship between x and y. Instead, we have evidence indicating that there is a linear relationship.
Large p-value>>> fail to reject the null hypothesis. We do not have compelling evidence that there is a linear relationship.
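A sketch of the arithmetic behind the linear regression t-test, on made-up data. Compare |t| to a t table with n - 2 degrees of freedom to find the p-value:

```python
from math import sqrt

# Linear regression t-test on the slope (H0: no linear relationship,
# i.e. beta = 0), worked by hand on small hypothetical data
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.1, 2.8, 4.1, 4.9, 6.2]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar)**2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

# standard error of the slope uses the residual standard deviation
sse = sum((yi - (a + b * xi))**2 for xi, yi in zip(x, y))
se_b = sqrt(sse / (n - 2)) / sqrt(sxx)

t = b / se_b
print(round(t, 2))  # large |t| ==> small p-value ==> reject H0
```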

HW problems 3.6 and 3.61

Have your paperback on Friday. You will be given a reading day.
_________________
As we continue through bivariate distributions, please take care to clearly identify the transformation you have performed on your lists in the calculator. For instance, log L2 may make sense to you, but you may be better off by renaming the list log life expectancy.

Typically, students have problems when they graph the curves through data. The linear regression graph only works with straightened (transformed) data. The curves go through the original data.

Your test on bivariate data will be Thursday, Oct 28.

Monday, August 30, 2010

Measurements of position

Hmmm. Z values. Percentile ranks. Proportions between two x-values.
How are these connected for Normal distributions?

The percentile for a particular z-value is the value in the body of the Z table at the intersection of the row (ones and tenths) and column (hundredths) titles. To form the z, just append (attach) the hundredths place digit to the row value; this works for negative z-values too. For instance. . .
row 1.3, column .04 ==> z = 1.34. This is the 90.99th percentile.
For row -2.3, column .04 ==> z = -2.34. With a table value of 0.0096, this is just a hair under the 1st percentile.

The percentile is the proportion of data that lies to the left of the x value or is equal to it. If you took a test and scored at the 99th percentile, 99% of all other test scores should be equal to your score or below it.

Another way to find the percentile is to use the NormalCDF function on the calculator. Use NormalCDF(lower bound, upper bound) where the boundary values are z scores. To find the percentile for a Normally-distributed z value, we use the lower bound of negative infinity and the upper bound of the z under consideration.

We can use -999999 for negative infinity. NormalCDF(-999999,1) = the proportion of the population of Normally distributed z values that fall equal to or below 1.

To find the Z value for a particular percentile, use the inverse of the NormalCDF function-- INVNorm. To find the 95th percentile, enter InvNorm(.95). Approximately 95% of all z-values in a Normal distribution will fall below this value.

To find the X value that corresponds to the desired Z value, take the mean and add Z standard deviations.
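All three conversions--z to percentile, percentile to z, and z to x--can be mimicked in Python using the error function (the mean and SD at the end are arbitrary):

```python
from math import erf, sqrt

# Normal CDF: same as NormalCDF(-999999, z) on the calculator
def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# Invert by bisection: the z whose percentile is p (like InvNorm(p))
def inv_norm(p, lo=-10.0, hi=10.0):
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(normal_cdf(1.34), 4))  # 0.9099, the 90.99th percentile
z95 = inv_norm(0.95)
print(round(z95, 3))               # about 1.645

# z back to x: take the mean and add z standard deviations
mu, sigma = 100, 15                # hypothetical distribution
print(round(mu + z95 * sigma, 1))  # the x at the 95th percentile
```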

Practice converting X values into z values and percentiles into X values. Do the problems on page 147.

Friday, August 20, 2010

New data!!!! Haircut costs

As we learn to represent and interpret our data, we collected the following data:
boys' haircut prices
12, 18, 22, 0, 0, 0, 15, 0, 17, 16, 17.95, 10, 12
girls' haircut prices
35, 55, 50, 30, 18, 25, 0, 50, 40, 45, 45, 140, 40, 8, 25, 30, 22, 28

Represent each of these as a boxplot on the same axes AND
using the information starting on page 42 in the text, represent it also as a back to back stemplot.
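To check your boxplot's five-number summary, here's a Python sketch using the medians-of-halves method for the quartiles (your calculator may define quartiles slightly differently), run on the boys' data:

```python
from statistics import median

boys = [12, 18, 22, 0, 0, 0, 15, 0, 17, 16, 17.95, 10, 12]

# Five-number summary: min, Q1, median, Q3, max, where the quartiles
# are the medians of the lower and upper halves of the sorted data
def five_number(data):
    s = sorted(data)
    n = len(s)
    lower, upper = s[:n // 2], s[(n + 1) // 2:]
    return min(s), median(lower), median(s), median(upper), max(s)

print(five_number(boys))
```

The string of $0 haircuts pulls the minimum and Q1 right down to zero, which you will see again when you draw the boxplot.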

We will interpret your results on Tuesday.

Be safe.
_________________________

8/20
We've used histograms, boxplots, and stemplots to represent univariate (one-dimensional) data. We've worked many problems from previous AP exams.

You're probably ready to close out this chapter (1). Let's focus on the parts we haven't covered so far and test on Thursday, 8/26.

We will start the CiCi's Sundays on August 29, unless you do not need help yet.

Be safe. Play hard. Go Trojans.

Monday, May 10, 2010

Gearing up for the final exam

So far we've discussed good answers for problems 1-4 of the operational exam. The take-home part of your final will be one of the 6 questions, randomly selected two days before your exam. Get to work developing your best answers to these questions.

The in-class part of the final exam will be cumulative, with an emphasis on second semester topics.

Random assignment of problems for take-home portion:

6th period seniors: #3 YOU MAY NOT COLLABORATE ON THIS PROBLEM NOW THAT IT HAS BEEN ASSIGNED. THE WORK YOU TURN IN MUST BE YOUR OWN.

Tuesday, March 30, 2010

Chi-square procedures

http://lassiterstatistics.wikispaces.com/

Send your summary documents (in pdf format if possible) to jhl2881 at
students dot kennesaw dot edu


Suggested problems from Chapter 14:
goodness of fit: 3, 4, 5 (this is an example of how biologists use 2x2 tables to do goodness of fit tests), 9 (simulation)
2-way tables: 13, 19, 20, 14, 16, 18, 12, 17, 22, 24, 32, 33, 34.



We're beginning our last new topic (since we already did linear regression inference once).

We will use two different types of chi-square procedures and three different names for the procedures.

First, if some higher power determined what the proportions of the sample should have been associated with different values of the categorical variable. . .
like what portion of your M&Ms should have been red, brown, green, etc., then you will use a Goodness of Fit test to compare your experience (the sample) with what the higher powers suggested. This is also the test we use when the higher power might suggest that the distribution should have been "fair" or equal across all the values.


If the sample itself is going to suggest a distribution, then we use the test of independence or the test of homogeneity. These two tests are performed the same way; we just have different inputs and hypotheses associated with the two forms.

When we have one sample from one population and want to know if characteristics are associated, like red hair and green eyes, we might use the test of independence.

If we have two populations, like smokers and non-smokers, and want to know if the two populations had the same propensity for speeding tickets, we could use a test of homogeneity with cleverly-selected data.
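For all three names, the chi-square statistic itself is computed the same way: sum (observed - expected)^2 / expected over every cell, with expected = row total * column total / grand total. A Python sketch on a hypothetical 2x2 table:

```python
# Hypothetical two-way table of counts, just to show the mechanics
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand  # expected count
        chi_sq += (obs - exp)**2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_sq, 2), df)  # compare to a chi-square table with df = 1
```

A goodness of fit test is the same sum, except the expected counts come from the hypothesized distribution instead of the table margins.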

Methods will be discussed in class.

Have the printed draft of your assignment in class on Wednesday!

Friday, March 26, 2010

T-tests for means

http://www.nytimes.com/2010/02/28/weekinreview/28sussman.html?ref=weekinreview
Time magazine article on the complications of the race and ethnicity entries on the census: http://www.time.com/time/nation/article/0,8599,1975883,00.html?hpt=T2
What inference procedure do I perform? applet (http://www.ltcconline.net/greenl/java/Statistics/StatsMatch/StatsMatch.htm?)

Welcome to the new stuff. Same as the old stuff. . .almost.

EQ: Why do you use inference?
Under what conditions can we use inference?

As you've seen, t-tests for means are quite similar to the z-tests for proportions. We still follow the same pattern of Setup-Check Assumptions-calculations/arithmetic-decision in the context of the problem.

Now, to use t-methods we have to show that the distribution of x-bars is probably mound-shaped and symmetric enough to invoke the CLT. If given the data, sketch the histogram of the observations or the Normal probability plot. If either sample observation graph looks severely non-normal (with gaps or outliers), then we cannot assume that the means of the samples drawn from that type of population would be close to Normal.

Our quiz on 3/2 will be a lot like the lab we did in class on Monday. You will have data to analyze (SCAD).

3/4/10 Now you've taken two quizzes and shown great improvement.

Things you can do to improve your communication:

Label the graph of the observations or the Normal Probability plot. The x axis of the histogram should be named with the thing you're measuring, like blood pressure. The scale should be added at about 5 places (no need to mark off every little bit). The y axis is the frequency. Help the reader out by labeling the tallest bar.

Label your Normal Probability plot with the definition of the x at the bottom. No scales are necessary if you label the NPP as a Normal Probability Plot.


Sketch the Normal curve before you start the calculations. Put the hypothesized mean in the middle. Then mark your x-bar to the right or the left (as appropriate). This will help you put the less than or greater than sign into your calculations correctly. It also reminds you that a probability cannot be less than 0 or more than 1!

Refine your decisions. Compare, conclude, contextualize, convince.

HW due Friday: Finish the AP exam problem begun in class that deals with a paired t-test. A paired t-test is performed exactly the same way as a regular t-test, but on the third set of data--the difference between the two sets of DEPENDENT data.

PAIRED T-TESTS
This is used when the two "samples" are not independent, but two measures from each of the experimental units or participants. Examples: pre-tests and post-tests on the same students, the e.coli problem, the pharmacy problem, and the hand span problem.

Align the samples so you can subtract the "pre-test" value from the "post-test" value or perform a similar subtraction to generate one list of differences. Use this list as your input for the one-sample t-test.
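A sketch of that subtraction step in Python, with hypothetical pre-test/post-test scores--the t statistic comes from the one list of differences:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical pre/post scores for the same six students (dependent data)
pre  = [72, 80, 65, 90, 75, 82]
post = [78, 84, 70, 91, 80, 85]

# One list of differences, then a one-sample t-test on it (H0: mean diff = 0)
diffs = [b - a for a, b in zip(pre, post)]
n = len(diffs)
t = mean(diffs) / (stdev(diffs) / sqrt(n))

print(round(t, 2))  # compare to a t table with n - 1 = 5 df
```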



T-TESTS FOR TWO INDEPENDENT SAMPLES
For this, Ho is mu1 - mu2 = some number.

The test statistic is (x-bar 1 - x-bar 2)/std error of the difference of the means.

Use tcdf to find the area in the tail. The procedure is much like the regular t-test.

CONFIDENCE INTERVALS: estimate +- t* (std error of the estimate).

Use the x-bar or the difference of the x-bars as the estimate.
Use sx/sqrt(n) or sqrt((Sx1)^2/n1 + (Sx2)^2/n2) for the std error.
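Putting the test statistic and interval together in Python, on made-up samples (the t* value here is hypothetical; look up the real one for your degrees of freedom):

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical independent samples
x1 = [5.1, 4.9, 5.6, 5.0, 5.4]
x2 = [4.2, 4.6, 4.1, 4.8, 4.3]

# std error of the difference of the means: sqrt(s1^2/n1 + s2^2/n2)
se = sqrt(stdev(x1)**2 / len(x1) + stdev(x2)**2 / len(x2))

# test statistic for H0: mu1 - mu2 = 0
t = (mean(x1) - mean(x2)) / se

# confidence interval: estimate +- t* (std error of the estimate)
t_star = 2.306  # hypothetical t* -- look up the value for your df
diff = mean(x1) - mean(x2)
ci = (diff - t_star * se, diff + t_star * se)

print(round(t, 2), tuple(round(v, 2) for v in ci))
```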

Work problems from the chapters for homework.



TEST REWORK/TEST "RETAKE"

There will be a test "retake" available on Thursday, March 18 in class for anyone who turns in their completed reworked problems from the Two-proportion test.

The retake will include one-proportion and two proportion tests and intervals, Type I and Type II error, and finding the sufficient sample size for a margin of error.



EXAM REVIEW
Have the draft of your first review page (assigned individually in class) with you in class on March 31. We will peer-review and edit the pages before they go into the class webpage. Go to this page to see examples of how the class of 2009 handled a similar assignment. Your product may be a webpage or a document.

HW for every night: Work problems from chapters 11-13. You should be able to do all these problems.

Essential questions that you need to write responses for:

Why is a t-test used instead of a z-test when we do not have the population standard deviation?

Why is it inappropriate to perform repeated tests instead of relying on one test?

The mean of x for sample 1 - the mean of x for sample 2 = the mean of the paired differences. So why does it matter whether we perform a two-sample test of independent means or a paired t-test? (This is incredibly important!!!)

WRITE AT LEAST A PAGE ABOUT THESE TOPICS. Hint: Consider the effect that your choice of test has on the power of the test.

TEST on inference for means: March 30, 2010

Friday, February 19, 2010

Inference for Proportions

Congratulations, AP STAT GROUP! 100% participation in the AP exam!

Standards: IV A3, A4, A5, B3, B4, and now B1.
INFERENCES FOR ONE PROPORTION and the DIFFERENCE BETWEEN TWO PROPORTIONS.

We're combining aspects of Chapters 10, 12, and 13 in the text to understand inferences about proportions.

Please print out a copy of the complete hypothesis test and confidence interval examples from my Lassiter blog (http://lhsblogs.typepad.com/linner)

Some of the basics:
Every complete inference problem will have four parts: setup, assumptions, calculations, and decision in the context of the problem.

The set-up of a one-proportion z test will include the definition of the parameter of interest, the hypotheses, and any other information you will use to perform the test.

The assumptions portion includes checking all assumptions and conditions necessary to use the z test: that the data are randomly selected, that the observations are independent and the sample is a small enough fraction of the population to allow the simple standard deviation formula, and that the counts are large enough for the sampling distribution of p-hat to be approximately Normal.

The calculations include the name or formula for the test, the calculations of the z-statistic and the p-value. A correctly-drawn graph helps.

The decision part must link the decision to the reason for that decision, citing the statistics and including the actual language of the problem. This means that you have to answer the question asked using the words provided in the prompt (the context). To make your answer *shine* include a well-worded statement that demonstrates to the reader that you really understand what the p-value means.
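The calculation step of a one-proportion z test can be sketched in Python; the counts and hypothesized proportion below are made up:

```python
from math import sqrt, erf

# Hypothetical counts: 116 successes in 200 trials, testing H0: p = 0.5
n, successes, p0 = 200, 116, 0.5
p_hat = successes / n

# The "simple standard deviation" uses the hypothesized p0
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Two-sided p-value from the Normal CDF
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

print(round(z, 2), round(p_value, 4))
```

A small p-value here means a sample proportion like p-hat would rarely occur if H0 were true, which is exactly the statement your decision paragraph should make in context.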



HW due Monday 2/8/2010 12.3, 12.4, 12.6, 12.13, and 12.14.

If you can't find time to do the homework, I will hold afterschool detentions to help you with the scheduling.


Standards IV A3, A5, and B3

Go to the AP Statistics documents page to download an example of both a 2 proportion CI and a 2 proportion HT for the data collected in 6th period.

HW due 2/17/10: Problems 13.7-13.10 from the text.



Standard IV B1

Type I and Type II error, alpha, beta, and the power of the test.

Type I error: rejecting the null hypothesis when the null is actually true.
Type II error: failing to reject the null when it is false.

P(Type I error) = alpha. We have the privilege of selecting this value.
P(Type II error) = beta. We calculate this using the rejection region boundaries and the true distribution. This requires a new theoretical parameter.
Power = the probability that the test will be able to detect a difference between the hypothesized value and the new, theoretical value.

Power = 1 - beta

Beta = 1 - power

No formulas combine both alpha and beta.

Calculator method for computing beta: normalcdf(lower critical value, upper critical value, new theoretical mean or proportion, standard error).

For instance, the lower and upper boundaries of the "fail to reject" region if the hypothesized p is 40% and n = 200 are .3321 and .4679. What is the likelihood that we fail to reject when the true proportion is 48% (meaning that we can't distinguish between the 40% and 48%)?

std error = .035

normalcdf(.3321, .4679, .48, .035) = .3648. About 36.48% of samples drawn from the distribution with proportion = .48 will not make us reject the null hypothesis.

Error warning: If you get 95% when you make this type of calculation, you are probably using the original hypothesized parameter, and not the new theoretical one. Try again using the theoretical value.

And the really good news is that this method works for inferences for means and differences of means, too, so we don't have to learn another new procedure.
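The beta calculation above, as a Python sketch of the calculator method (same numbers as the worked example):

```python
from math import erf, sqrt

# normalcdf(lower, upper, mu, sigma), like the calculator function
def normalcdf(lo, hi, mu, sigma):
    phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
    return phi((hi - mu) / sigma) - phi((lo - mu) / sigma)

# Beta: probability the sample lands in the fail-to-reject region
# (.3321, .4679) when the TRUE proportion is .48, std error .035
beta = normalcdf(0.3321, 0.4679, 0.48, 0.035)

print(round(beta, 4))      # about 0.3648
print(round(1 - beta, 4))  # power of the test
```

Note that the mean plugged in is the new theoretical value, not the hypothesized one--exactly the mistake the error warning above describes.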

HW due 2/19/2010: Finish problem 13.30 a-d. This will take more than one page.

Answers for 13.30 should include the following elements.
A. Two treatment groups, random assignment(not random sample), first group of 1/2 people took only aspirin, the other both drugs.
B. Test statistic = 2.73. Complete answer requires all the rest of the HT work, including computation of combined p-hat.
C. (-.0232, .0197) with supporting work and interpretation in context.
D. Explanation of each type of error is required. II is more serious because of potential harm to patients.


HW due 2/22/10: Using part c of the 2009 AP exam question #5 as a guide, re-consider at least 5 of the inference problems we've already worked. "Based on your conclusion . . . which type of error, Type I or Type II, could have been made? What is one potential consequence of this error?" Write complete responses. Please pick problems that have each of the responses, reject and fail to reject, so you can get practice answering the problem both ways. Also, be ready to discuss the effects of Type I and Type II errors on HIV testing, pharmaceutical studies, and court cases.

Monday, January 11, 2010

The Central Limit Theorem and Sampling Distributions

Standards: IIID 1, 2, and 3
Why is the Central Limit Theorem so important, even when the distribution of x is not Normal?

Key understandings:
When does the CLT "kick in?"
Why is the standard deviation of the sample averages smaller than the standard deviation of the population?
How do you apply the CLT to compute probabilities related to the sample average?
What does sample size have to do with your certainty about the distribution?
How do these methods apply to sample proportions?

Stay on top of the material by reading the chapter and taking notes on the key concepts. Work problems. We will test on Thursday, January 21st.


Here's a version of the sampling program that works!

-> means "store"
Lbl, For, IF, End, and Goto are all programming words, so they are found under the programming menu. Just hit PRGM for access to the menu.
DON'T TRY TO PROGRAM THE ITALICIZED EXPLANATIONS



:1 -> C                        *Initializes the value of C, the counter
:rand(150) -> L1               *Puts 150 uniform random numbers in L1
:Lbl 10                        *Labels this line for use later
:0 -> B                        *Resets the value of B, the partial sum of the numbers, to 0
:rand(150) -> L2               *Puts 150 more random numbers in L2
:SortA(L2,L1)                  *Sorts your original numbers by the second column
:For(J,1,10):L1(J)+B -> B:End  *Takes the top 10 numbers and adds them together
:(B/10) -> L3(C)               *Calculates the average and puts it in the next row of L3
:C + 1 -> C                    *Changes the counter to the next number
:If C < 101:Goto 10            *Starts over to select another sample of 10 from the population
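If you'd rather experiment off the calculator, here's a simplified Python analogue: draw 100 samples of size 10 from a uniform population and look at the distribution of the sample means. The seed is arbitrary, just for reproducibility:

```python
import random
from statistics import mean, stdev

# Same idea as the calculator program: collect 100 sample means,
# each from a sample of size 10 drawn from a uniform(0, 1) population
random.seed(1)
sample_means = [mean(random.random() for _ in range(10)) for _ in range(100)]

# CLT in action: the means cluster near the population mean (0.5)
# with spread near sigma/sqrt(n) = 0.2887/sqrt(10), about 0.09
print(round(mean(sample_means), 2))
print(round(stdev(sample_means), 2))
```

Make a histogram of sample_means and you should see the mound shape the CLT promises, even though the population itself is flat.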



Please answer these questions every time you work a problem that requires a calculation.

What is the population of interest?
What is the sample?
How can you justify using the Normal distribution?
Why are you allowed to use that simplified standard deviation?
Is your sample large enough?
Is your sample an insignificant part of the population?
Was your sample selected randomly?
Are the observations independent?

What is the distribution of the sample statistic?
Have you drawn the normal distribution graph and labeled it?
Have you shaded the appropriate part of the graph?
Have you checked your answer to see if it looks reasonable?

Work a bunch of problems from the text. Cici's 2-4 Sunday. Test Thursday.