Tuesday, December 11, 2007

Chapter 7 Random Variables

INTERESTING NEWS:
http://news.yahoo.com/s/ap/20071213/ap_on_re_us/hiv_lawsuit


The Chapter 7 test will be on Thursday, December 13, 2007.

You will need to create a chart to remind yourself about the formulas for this chapter until you have practiced enough to know them by heart. A sticky-note at pages 396 and 400 would also be helpful!

Mu = the POPULATION average. This is a parameter.
Sigma squared = the POPULATION variance. This is also a parameter.
Standard deviation is the square root of the variance.

X-bar is the sample average, the unbiased estimator of the population average. It is a statistic.
S-squared is the sample variance, the unbiased estimator of the population variance. It is also a statistic.
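
If you want to check these definitions numerically, here is a minimal Python sketch (the data list is made up for illustration). Note that statistics.variance divides by n - 1, so it is exactly the unbiased s-squared described above.

import statistics

sample = [4, 8, 6, 5, 7]                 # made-up sample data
x_bar = statistics.mean(sample)          # x-bar, the estimator of mu
s_squared = statistics.variance(sample)  # s-squared: divides by n-1, unbiased for sigma squared
s = statistics.stdev(sample)             # standard deviation = square root of the variance
print(x_bar, s_squared, s)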

HW due Wednesday, December 12th: 7.13, 7.15, 7.17, 7.34, 7.42
HW due Tuesday, December 11th: 7.24, 7.28. Use the formulas and the examples in the text.

HW due Monday, December 10th : 7.2, 7.4, 7.7.

Monday, December 03, 2007

Chapter 6 Probability

The Chapter 6 test will be December 6th.
Previous tests will be returned to students as soon as they are graded.

Prepare for the test. Work problems from each section. Read the chapter and section summaries. Write down what you are doing. Draw the Venn diagram or the tree diagram for complicated situations. Ask questions on the blog. Take the practice test.

Here are some answers to even HW problems:
Problem 6.10 (a) S = {all numbers between 0 and 24}
(b) S = {any whole number up to and including 11,000}
(c) S = {0, 1, . . . , 12}
(d) S = {any dollar and cents amount up to [insert your maximum guess here]}
(e) S = {any positive or negative number}

Problem 6.12 Four outcomes for two coins: {HH, HT, TH, TT}, eight for three coins: {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, and sixteen for four coins (do that one yourself!).

Problem 6.16 (a) YYY-0000 through YYY-9999 = 10,000 numbers. (b) YYY-ZXX-XXXX, with each X having 10 possible digits. The local number can't start with a 0 or a 1, so Z has only 8 possible values. That gives 8 * 10^6 numbers, LESS the restricted numbers (911-xxxx, 411-xxxx, etc.)

Problem 6.20 P(moves to another class) = 1 - P(stays) = 1 - .46 = .54.

Problem 6.24 P(wins large battle) = .6, P(wins three small battles) = P(wins individual small battle)^3 = .8^3 = .512. Choose the strategy with the larger probability of success.

Problem 6.40
Venn diagram has two circles representing getting job A and getting job B.
Both jobs: intersection of the two circles, the overlapped part, the biscuit.
First but not second: the part of circle A that is not within circle B
Second but not first: the part of circle B that is not within circle A
Neither: the part that is in the background, in NEITHER circle.

Problem 6.44
P(W) = 856/1626
P(W given prof degree) = 30/74
These are not the same, so gender and professional degree are not independent.

Problem 6.56
P(y < 1/2 and y > x) = 1/8,
P(y > x) = 1/2,
P(y < 1/2 GIVEN y > x) = P(y < 1/2 and y > x) / P(y > x) = 1/4.
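
If you want to convince yourself of those 6.56 numbers, here is a quick Monte Carlo sketch. It assumes my reading of the problem--x and y are two independent random numbers between 0 and 1--so treat it as a check, not the official solution.

import random

trials = 1_000_000
joint = marginal = 0
for _ in range(trials):
    x, y = random.random(), random.random()
    if y > x:
        marginal += 1              # the marginal event: y > x
        if y < 0.5:
            joint += 1             # the joint event: y < 1/2 AND y > x
print(joint / trials)              # close to 1/8
print(marginal / trials)           # close to 1/2
print(joint / marginal)            # the conditional, close to 1/4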

Problem 6.48: P(W) * P(Manager given W) = P(Woman AND Manager)

One pattern that shows up a lot is Marginal * Conditional = Joint

If you divide both sides by Marginal you get
Conditional = Joint / Marginal.

IFF means IF AND ONLY IF.

A and B are independent IFF P(A) * P(B) = P(A and B).

A and B are independent IFF P(A) = P(A GIVEN B).

HW for Tuesday night: DO problems 6.33 and 6.48. Read 6.66 and be prepared to work the problem. Essential question: How do mathematical independence and our regular understanding of independence relate?

The chapter 4 tests were returned today.

HW for Monday night: 6.39, .40, .53, and .56. The problem we worked today in class was problem .65. You would be wise to work through this problem and problem .66.

Notes from Friday (11/30) are embedded in the purple sections below.

Conditional probability rules:

PLEASE NOTE THE CORRECTION! BLOGGER WON'T ACCEPT THE VERTICAL LINE SYMBOL!!!
P(A GIVEN B) = P(A and B)/P(B)
P(B GIVEN A) = P(A and B)/P(A)


so of course

P(B) P(A GIVEN B) = P(A and B), the joint probability of A and B. It may be helpful to think of it like cancelling factors in the numerator and denominator of a fraction EXCEPT that the result is the JOINT probability. Be careful.

P(A) P(B GIVEN A) = P(A and B), again, the joint probability of A and B.


These relationships can be represented in two-way tables, Venn diagrams, and tree diagrams. The count within a cell of a two-way table divided by the marginal total is a conditional probability. Likewise, the joint probability for that cell divided by the marginal probability is also the conditional probability.

Tree diagrams can be useful when you are trying to work the problems backwards.

I don't think that I made this clear in class today:
P(A) = P(A and B) + P(A and not B) = P(B)P(A given B) + P(not B)P(A given not B).
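
Here is a tiny numeric check of that identity, with made-up probabilities:

p_B = 0.4                  # made-up P(B)
p_A_given_B = 0.75         # made-up P(A given B)
p_A_given_notB = 0.5       # made-up P(A given not B)

p_A = p_B * p_A_given_B + (1 - p_B) * p_A_given_notB   # total probability
print(p_A)                 # 0.4*0.75 + 0.6*0.5 = 0.6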

HW for the weekend is problems 6.44 and 45.

If A and B are independent, then P(A) = P(A GIVEN B) and P(B) = P(B GIVEN A).
Interpretation: If A and B are independent, then whether or not B happened has no relationship with whether A happened.
Likewise,
if A and B are independent, then whether or not A happened has no relationship with whether B happened

Today we used a Venn Diagram, a two-way table, and a tree diagram to represent the outcomes and probabilities associated with throwing two strangely-marked dice. All of the methods yielded the same answer.

HW for Tuesday (11/27) night: Re-work the weird dice problem from the AP exam (the one with two dice, one has only 9s and 0s, the other has 11s and 3s.) This time, instead of using simulation, use formal probability rules and a tree diagram, table, or Venn diagram. Answer the question in complete sentences. For part B, reconcile the answer with the joint probabilities you found in part A. Figure out the guidelines in your own words that tell you whether a price/reward is fair.

Get the reading done! What are the big concepts?
----------------------------------------------------------


Sorry for the delay: just got home from KSU.
HW for Monday night, Nov 26: 6.19, 20, 21
plus. . .finish State of Fear.
-----------------------------------------------------------
Add problems 6.24 and 6.25, due Monday, November 26.

Don't forget to read State of Fear.
------------------------------------------------------------
The complete list of HW problems due Tuesday: 6.9, 10, 12, 13, 16 (plus any others you feel like doing).

Don't forget to read State of Fear.

------------------------------------------------------------

Events that are mutually exclusive (and have nonzero probabilities) ARE NOT independent.

Addition principle: P(A or B) = P(A) + P(B) - P(A and B)

Multiplication principle: P(A) P(B GIVEN A) = P(A and B), the joint probability.

When B and A are independent, P(B GIVEN A) = P(B); A happening or not has no relationship to B happening, so P(A) P(B) = P(A and B). THAT IS ONLY WHEN THE EVENTS ARE INDEPENDENT.

Key vocabulary
parameter
sample space
event
probability
joint probability
independent
---------------------------------------------------------------------------
This won't be so bad. The test will be Thursday, Dec. 6.

What are YOU doing to maximize your understanding of the material?

  • Are you creating an outline of the chapter?
  • Have you developed a glossary for the vocabulary and formulas?
  • Have you worked all of the homework problems when assigned?
  • Do you read the sections that relate to the homework?
  • Are you part of a study group?
  • Do you ask questions?
  • Have you worked problems from a study guide?
  • Have you worked the online quiz (see the link on the right panel of this blog)?
Do you try to see the big picture?

  • Do you look for the similarities and differences in the ways data are processed?
  • Do you work with problems long enough to understand why the formulas work the way they do?
  • Have you made connections between current concepts and prior knowledge?
  • Have you gone online to review concepts that you have forgotten?

Just "going through the motions" does not lead to the success that you desire in Advanced Placement courses. Take control of your learning.

Be safe.

Wednesday, November 14, 2007

Chapter 5 -- Producing Data

Be sure to check the comments for this post. Your fellow classmates ask the best questions! Have you taken the online quiz? Have you worked through a study guide? Have you prepared for Thursday's test? Have you finished the reading for Tuesday? Why would I ask you to read that book in the middle of Chapter 5???? [There must be a reason.] Why would you use simulation instead of actually testing the real thing in an experiment? What are the three essential principles of good experimental design? Why is each one important? What does bias have to do with all of this? What IS bias? What IS confounding? When do you block? What is the difference between stratifying and blocking?


The two sweet diagrams of experimental design are on page 272 (Completely randomized) and page 280 (block design).

Blocking is a form of control. When a large number of your experimental units share some pre-existing condition that may make their responses to the treatment vary tremendously WITHIN the treatment groups, you will have a hard time differentiating between the results of the treatment groups. You would really prefer the differences in results BETWEEN groups to be big enough so you can make a decision about your comparison. To reduce this variability, you may choose to BLOCK by the nuisance variable (the pre-existing condition). Then you RANDOMLY allocate the experimental units in each block to the different treatments. If there are two treatments, then each block is randomly broken into two treatment groups. You proceed by running the experiment on each of the blocks individually.







Thursday and Friday (11/8-11/9) Finish writing up the experiment described below, using all the concepts of section 5.2. ALSO, answer the free response (FR) problems from 2001 and 2003 handed out in class. You definitely NEED to read the section of the book. These are NOT opinion problems.

The Chapter 5 test is Thursday, November 15. Freakonomics must be read by 11/13.


Wednesday night's (11/7) HW: 5.65 PLUS design an experiment to answer this question (at least 6-7 sentences!!).

Does the choice of presentation technology make a difference in student achievement in a geometry class?

Conditions: Geometry classes at Lassiter
four teachers teach geometry
some teachers have students write HW on the board
some teachers have students write HW answers on the overhead projector
some teachers put their official answer transparencies on the O/H.

Two document cameras are available to use (Google document camera if you haven't seen one!)

Students are already assigned to the classes.

How could we design this experiment to answer the question? What questions or clarifications do you have? Bring at least seven complete sentences of helpful guidelines for performing this study.



Friday night's HW: 5.63 and 5.64. Complete most of your Freakonomics assignment this weekend. When you get to the part where the authors belabor their unique name theory, you can consider your assignment completed. What was your favorite part? What connections did the authors make that you agree with? That you don't agree with?

Thursday night's HW: 5.60 and 5.61
Wednesday night's HW: 5.54, 5.55, 5.56 Be safe.

Tuesday night's HW: Complete both of the problems from the 2001 exam.

Example of using the TORD to simulate a bag of M&Ms with the OLD color distribution:

Old Distribution:
Brown 30%
Red 20%
Yellow 20%
Green 10%
Blue 10%
Orange 10%

Let's try this two ways. First, let's use two-digit numbers to simulate candies according to the following schedule.
01-30 Brown
31-50 Red
51-70 Yellow
71-80 Green
81-90 Blue
91-00 Orange

There are no excluded numbers. If we draw the same number twice, use it again!

Using the following line from a table of random digits, simulate drawing 5 candies.

63996 32914

63>>>>Yellow
99>>>>Orange
63>>>>Yellow
29>>>>Brown
14>>>>Brown

The second way requires only one digit. Let 1-3 represent Brown, 4-5 for Red, 6-7 for Yellow, 8 for Green, 9 for Blue, and 0 for Orange.

21833 70905
Using the TORD above, you would get
2 Brown
1 Brown
8 Green
3 Brown
3 Brown
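
If you want the computer to play the role of the TORD, here is a sketch of the one-digit scheme in Python (random.randint stands in for reading digits off the table):

import random

def color(d):
    # the one-digit schedule above: 1-3 Brown, 4-5 Red, 6-7 Yellow, 8 Green, 9 Blue, 0 Orange
    if d in (1, 2, 3):
        return "Brown"
    if d in (4, 5):
        return "Red"
    if d in (6, 7):
        return "Yellow"
    if d == 8:
        return "Green"
    if d == 9:
        return "Blue"
    return "Orange"                                  # d == 0

digits = [random.randint(0, 9) for _ in range(5)]    # stand-in for a line of the TORD
for d in digits:
    print(d, ">>>>", color(d))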

Link to interesting site about the Dewey-Truman polling error. Did you know who the third party candidate was who threw the wrench into the process? Strom Thurmond. Your parents will be impressed that you know this.

http://www.hannibal.net/stories/101998/Pollstersrecall.html

Interesting historical link about Tukey. Scroll to the middle to see his influence in predicting outcomes of elections.

http://www.amstat.org/about/statisticians/index.cfm?fuseaction=biosinfo&BioID=14

The two books I assigned for November are Freakonomics and State of Fear. Freakonomics discusses a lot of associations/correlations that promote critical thinking. State of Fear makes you enlightened consumers of research (even though it IS fiction). Many parents have probably already read one or both of these books. Last year's students (generally) loved them.


Wednesday, October 24
Take notes on the first section of the new chapter, especially new vocabulary.

For Monday and Tuesday of next week:

5.1-5.5, 5.8, 5.11, 5.17-5.18, 5.22, 5.23

Key concepts covered in class (alliteration, anyone??) today included

undercoverage
non-response bias
response bias
convenience sampling
voluntary response sample

and examples like the C-SPAN and American Idol calls, surveying the people sitting around you, the Dewey Defeats Truman mistake, answering with un-truths, failure to respond to surveys.

Can you match the concept to the example? Can you think of another example of each concept in action? Why does each of these result in data we cannot rely on?

Monday, October 08, 2007

Non-linear relationships

Assignment for Monday: Create an outline of the key points in the chapter, including all vocabulary words.

Also, do problems 4.54, 4.56, 4.62 (this study just celebrated its 20th anniversary!!!), and 4.66

So, how about those marginal and conditional distributions for two-way tables, huh?

The marginal distributions are the percents that each column or row represents in the entire table. For instance, if the total of one row was 250 and the total for the table was 1000, the marginal distribution for that row is 25%. You would continue to calculate percents for all of the other rows or the other columns--whatever the question asked for.

For the conditional distributions, you only consider a portion of your population, for instance only a specific row or column. Then, what portion of the observations recorded in that small group shared the desired characteristic?

If there were 15 sophomores taking AP Chinese and 500 sophomores in a school of 2000 students, GIVEN THAT a student is a sophomore, the percent who are taking AP Chinese is 100*15/500.
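
Here is the same sophomore/AP Chinese example as a Python sketch. The non-sophomore row is invented just to fill out the table to 2000 students:

# Two-way table of counts; the "everyone else" row is hypothetical
table = {
    "sophomore":     {"AP Chinese": 15, "not AP Chinese": 485},
    "everyone else": {"AP Chinese": 45, "not AP Chinese": 1455},
}

grand_total = sum(sum(row.values()) for row in table.values())      # 2000
soph_total = sum(table["sophomore"].values())                       # 500

print(100 * soph_total / grand_total)                        # marginal: 25% of students are sophomores
print(100 * table["sophomore"]["AP Chinese"] / soph_total)   # conditional: 3%, GIVEN sophomore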

Tuesday, October 16

Assignment for Friday: 4.34, .36, .38, .39, .40, .42, and .43

How did you like the Simpson's Paradox activity today?

When breaking data into two or more divisions by a lurking variable reverses the "decision" for EVERY ONE of the sub-groups (compared to the decision you would make from the combined data), the result is a Simpson's Paradox. For instance, the example today presented no clear, justifiable answer about whether we should fund Bolgg's Panacea or not.

The example in the book about the hospitals is instructive.

Good luck on the PSAT.

Monday, October 15

4.22-4.24, 4.27, and 4.28

Review the topics and procedures on the notes handed out today.

Have you ever heard of Simpson's Paradox???

Thursday, October 11
We've transformed non-linear data to a linear form, found the LSRL through the data, re-written the equation reflecting the nature of the lists used to develop the LSRL, and re-transformed the equation to model the original data.

You should have done problems 4.6 and 4.9. For tonight, DO problem 13 and READ ACTIVITY 4 and problem 4.15. If you feel excited about the investigation, read problem 16 also.



Monday, October 8

We graphed some relationships between x and y to determine whether we were allowed to run the LSRL on the data. Of course, we ONLY run the least squares regression on data that look like they have a linear pattern.

When the pattern in L1 and L2 looked like an exponential growth or decay model, we took the log of y in order to un-do the exponential. Putting the log y into L3, we proceeded to verify that the graph of L1 and L3 was approximately linear. We then ran the LSRL through that set of points.

The equation we found by using the LSRL will not run through our curve-y data, so we have to un-transform the equation. For the exponential case, we had used the log of y instead of y itself when finding the LSRL (but the original x values!), so we re-write the equation as log y-hat = a + bx.

We solve for y by taking the antilog of both sides ("ten-to-the" or 10^stuff). The resulting equation for y can be graphed with the original x and y data and should match the pattern pretty well.

If the model looks like a quadratic, square root, or other power function, you'll need to perform mostly the same steps, but on the logs of both x and y. The linear equation that passes through the straightened data will look like this: log y-hat = a + b times log x, and the un-transformed result will be y-hat = 10^a times x to the b power.
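
Here is a sketch of the whole exponential-case pipeline in Python, with made-up data that grow roughly like y = 2 * 3^x (scipy.stats.linregress plays the role of the calculator's LinReg):

import math
from scipy.stats import linregress

xs = [1, 2, 3, 4, 5]                     # made-up data
ys = [6.1, 17.8, 54.5, 161.0, 489.0]

log_ys = [math.log10(y) for y in ys]     # the "L3" step: log y straightens the pattern
fit = linregress(xs, log_ys)             # LSRL through (x, log y)
a, b = fit.intercept, fit.slope          # log y-hat = a + b x

# Un-transform by taking "ten-to-the" of both sides: y-hat = 10^a * (10^b)^x
print(f"y-hat = {10**a:.2f} * {10**b:.2f}^x")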

HW: Problem 4.1. The answer is in the back of the book, but there are a lot of sections to this problem.

Thursday, September 20, 2007

Chapter 3 Linear Relationships

Tuesday, October 2
Work AT LEAST 5 problems to prepare for Thursday's test.





Monday, October 1
3.50 and 3.52

Test on Thursday

Friday,September 28
Yo, this is David T.

Well today we broke the deviation y minus y-bar into a prediction part, y-hat minus y-bar, and an error part, y minus y-hat.

We also explained the significance of r^2 which equals the portion of the variation in y which could have been predicted using the regression relation.

Remember that if r^2 is close to zero, then the points on the graph are crazy and scattered. If r^2 is close to one, then the graph and points are predictable and are linear.


The HW is 3.46 and 3.49. (YES, this means YOU!) You = David V.



Thursday, September 27
KEY FORMULAS

b = r * Sy/Sx
a = y-bar minus b* x-bar

y-hat = a + b* x

residual = actual minus expected = y - y-hat

If residuals are small and scattered, then the linear model is a good model. If there is a distinct pattern (if you could predict what the residual would be for a particular x-value), then the linear model is not appropriate.

Be sure to WRITE what you see in the residuals ("The residuals are small and scattered, so a linear model is appropriate" or not) and what effect that observation has on your model.

Also be sure to write out the description of the y-hat equation in words: "The predicted value of [insert y variable here] is approximately [insert y-intercept here] plus [insert slope here] times [insert the x variable here]."
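
Here is a minimal Python sketch of those key formulas in action, with made-up data. It builds r from the z-scores, then the slope and intercept, then the residuals:

import statistics

xs = [1, 2, 3, 4, 5]                     # made-up paired data
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
n = len(xs)
r = sum((x - x_bar) / s_x * (y - y_bar) / s_y
        for x, y in zip(xs, ys)) / (n - 1)           # the correlation coefficient

b = r * s_y / s_x                                    # b = r * Sy/Sx
a = y_bar - b * x_bar                                # a = y-bar minus b * x-bar
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # actual minus predicted
print(f"y-hat = {a:.2f} + {b:.2f}x")
print(residuals)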

COMMON ERRORS:
Failure to use LinReg(a+bx) L1, L2, Y1

Failure to check that the observed y values are close to the predicted y values.

Failure to use the same x and y in your stat plot that you used in your linreg equation. (causes graphs to not show up!)


HW problem 3.39.



Wednesday, September 26
Problems 23 and 31 PLUS find the Least squares regression line for the Archaeopteryx data.

Tuesday, September 25
We re-worked the HW from last night and extended the concept by investigating what happens when you calculate the correlation coefficient for non-linear data (Anarchy! Riots! Dogs and cats living together!). Although you CAN calculate a correlation coefficient for non-linear data, the result tells you NOTHING.

Key points to remember:



  • -1<= r <= 1. Always. No getting around it.
  • r is dimensionless. If you change units or perform a linear transformation (with a positive slope) on all of the values of x, or y, or both, your r will not change!!! In fact, what happens when you switch the order of the variables and calculate r for L2 and L1????
  • r is affected by outliers. They inflate the standard deviations, which makes the denominator larger, which pulls r closer to 0.
  • r only gives you information about linear relationships. If it isn't linear, then this linear modeling is inappropriate.
If you haven't already tried it, calculate r for some small sets of non-linear data and see what I mean.
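
Here is one way to try it in Python (statistics.correlation needs Python 3.10 or later; the data are made up):

import statistics

xs = [-3, -2, -1, 0, 1, 2, 3]
parabola = [x**2 for x in xs]                   # a perfect, but non-linear, relationship
print(statistics.correlation(xs, parabola))     # 0.0 -- r sees NOTHING here!

line = [2 * x + 1 for x in xs]
fahrenheit = [9 * x / 5 + 32 for x in xs]       # a linear change of units, like C to F
print(statistics.correlation(xs, line))             # 1.0
print(statistics.correlation(fahrenheit, line))     # still 1.0: unit changes don't move r
print(statistics.correlation(line, xs))             # same again: order doesn't matter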

HW 3.13 and 3.19. All about the archaeopteryx.

See you tomorrow.



Monday, September 24
Do problem 3.18. This is just like what we did in class.

Friday, September 21
Problems 3.1-3.3 and 3.5.

Fifth and sixth periods: You did a great job with all the distractions today. Thanks for trying to stay on task.

Good job, Trojans! You make us proud.

CiCi's on Sunday? 2-4.

Be safe.


A new Chapter!!!

Thursday, September 20

Copy the formulas and definitions from Chapter 3 into your notes.

Friday, September 07, 2007

Chapter 2 Probability Distributions

Summaries of current topics can be found below the homework details.

Tuesday, 9/18
Work three of the problems that you set up last night and in class today. ALSO, do problems 2.41, .42, and .43 completely.

What more do you need in order to be successful with this? Make a list! Outline the important concepts from the chapter!

Why is the normal distribution such a big deal?


Monday, 9/17
Select 15 problems from Chapter 2. Split your paper in half (hotdog style). Write the details of the GIVENS on the left and the items that the problem asks you to find on the right side. You do not have to solve the problems. Watch carefully for those cases where there are multiple requirements.

Your test is Thursday.

Wednesday, 9/12

UPDATE: I have posted a WORD document with hints on the homework site: classhomework.com. You'll have to enter the password and then click on the file name.

Post to the blog and let me know when you get it--but don't ruin the fun for the other students!
-------------------------------------------------------------------------------------------
You KNOW that the area under the curve in a probability distribution is always one (That's why you get 1 when you use normalcdf(-infinity, +infinity)). Go back to the first parts of Chapter 2 and review the characteristics of a probability distribution.

Then, for HW, find the values of x which represent the Q1, median, and Q3 of a triangular probability distribution that starts at the origin and ends at (4, ???). Yes, you have to figure out what the value of ??? is so the area under the curve is 1.

You will use the formula for the area of a triangle: A = (1/2) base * height.

There are many different ways of attacking this problem. How many can you find???




Tuesday 9/11
HW 2.24, 2.25, 2.30

If you don't understand something, ask a question on the blog. Coming to class unprepared is not an option.


Monday 9/10
HW 2.15, 2.16, 2.22, 2.23



Friday 9/7
You developed equations today to standardize observed values.

Your formula for the z-score was (observed x minus the average)/(standard deviation).

You also used a formula to find the value of x that has a certain z-value:
x = average value + (z-score)(standard deviation)

but you probably noticed that the second formula is just the first one solved for x.
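
A two-line sketch of both formulas, with made-up numbers:

mu, sigma = 80, 5               # made-up: scores averaging 80 with standard deviation 5

x = 92
z = (x - mu) / sigma            # (observed x minus the average) / (standard deviation)
print(z)                        # 2.4

z_wanted = -1.5
print(mu + z_wanted * sigma)    # x = average + (z-score)(standard deviation) = 72.5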

You also learned about the Empirical Rule.
http://www.stat.tamu.edu/~west/applets/empiricalrule.html

This ONLY WORKS with normal distributions and it is only an approximation. We will learn more precise methods next week.

If you're interested in a neat relationship that works for other types of distributions, check this out: http://www.stat.tamu.edu/stat30x/notes/node33.html

Another neat website:
http://people.hofstra.edu/stefan_waner/Realworld/Summary7.html

Now, is anyone out there planning to CiCi's this Sunday? I won't go unless there is interest, so post your plans!

HW: 2.6, 2.7, 2.8 and read up to that point in Chapter 2. Be safe.

Wednesday, August 29, 2007

Graphs and standard deviation & CiCi's

Preparing for the test
Your test is Thursday. Begin to prepare now by working problems and creating an outline.

Homework: For those in class today - rewrite your responses to the FR questions, plus work problem 1.4 from the text.

For those absent today: Problem 1.4 PLUS ALL OF 1.48-1.52. Pick up your original responses to the FR upon your return to complete overnight. If you were participating in Senior Skip Day, your absence is unexcused.

CiCI's
YES!!!! There is a request for CiCi's this Sunday, so I will be there from 2 to 4. That is the one by the Walmart at Trickum and 92 (close to Arby's).

Test is Thursday, 9/6!

If you are going on the marketing fieldtrip, stay after school to take the test in room 214 at 3:30. Don't forget to bring your calculator.

For Tuesday (9/4), select one odd and one even problem from the set 1.48-1.52 and work them completely. Become an expert on one of the problems.

For Friday (8/31), complete problems 1.41 and 1.43. The answers are in the back of the book, but that is not sufficient! You must show all work and explain your actions.



By Thursday (8/30), you should have both the graph from the Internet, a newspaper, or magazine and problems 1.35 and 1.36 from the text.

RE: the graph
You will identify the variable(s) represented in the graph and the type of graph you brought. Are the data numerical or categorical? Are numerical data discrete or continuous? Does your graph represent one variable or two? Is a trendline appropriate for your data?

RE: Standard deviation

The standard deviation of a sample of data is like an average deviation from the sample mean. It is the square root of the sample variance, which is an unbiased estimator of the population variance.

If we just found the sum of the deviations, we would get a sum of zero because some data are above the mean and some are below. Because of the definition of the mean, the positives and the negatives cancel each other out.

Instead, we square each deviation so the numbers we add together are all positive. We "average" these numbers by dividing by (n-1). You remember that n is the number of observations. We subtract one because we are using an estimate derived from the data themselves for x-bar. This gives us the sample variance or s-squared. To get the value of s just take the square root.

In formula form, s = sqrt(sum of all the squared deviations/(n-1)). The formula for the first squared deviation is (x minus x-bar)^2. Again, x-bar is the average of the x values.

The same relationship holds between sigma and sigma squared, the population standard deviation and the population variance: you take the square root of the variance to get the standard deviation.
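
Here is the whole recipe as a Python sketch, using made-up observations, so you can watch the deviations cancel and then see s emerge:

import math

data = [2, 4, 4, 4, 5, 5, 7, 9]             # made-up observations
n = len(data)
x_bar = sum(data) / n                       # the sample mean

deviations = [x - x_bar for x in data]
print(sum(deviations))                      # 0: the positives and negatives cancel

s_squared = sum(d**2 for d in deviations) / (n - 1)   # "average" of squared deviations
s = math.sqrt(s_squared)                    # the sample standard deviation
print(s_squared, s)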

Monday, August 27, 2007

Cumulative and relative frequency histograms

We created cumulative frequency histograms and relative frequency histograms today. For the cumulative frequency histograms, find the cumulative sum up to and including each line of the frequency table, for instance,

x       freq   cumulative freq
1-5       4        4
6-10      5        9
11-15     6       15


For the relative frequency distribution, divide the count for an interval by the total number of observations, n. What do you observe about the graphs of frequency and relative frequency????
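
Here is a quick Python sketch using the counts from the table above; itertools.accumulate does the cumulative sums for you:

from itertools import accumulate

freqs = [4, 5, 6]                    # counts for the intervals 1-5, 6-10, 11-15
n = sum(freqs)

print(list(accumulate(freqs)))       # cumulative frequencies: [4, 9, 15]
print([f / n for f in freqs])        # relative frequencies -- same shape, new scale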

The HW is attached to the site at classhomework.com.

Friday, August 24, 2007

Histograms - a beginning

Find a nice set of data to graph.

Create a frequency table for the data, breaking the data into 5, 7, or 9 intervals of equal length. That does NOT mean that there will be an equal number of observations in each bin or interval!

Create a histogram to represent the data.


http://www.ncsu.edu/labwrite/res/gt/gt-bar-home.html in Excel

http://jwilson.coe.uga.edu/EMT668/EMAT6680.F99/Estes/graphicaldisplays1page.html on the TI-83, but you need to reset the window!!!


http://facstaff.colstate.edu/henning_cindy/Calculator%20Assistance_files/Creating%20Histograms%20on%20TI83.htm

Thursday, August 23, 2007

Stem and leaf plots

Create a stem-and-leaf graph (a stemplot) of your data. If your data don't go nicely into a stemplot, find some fun data to use instead.

More later. . .


Have you found good resources on the web?


Do these help?
http://regentsprep.org/regents/math/data/stemleaf.htm

http://en.wikipedia.org/wiki/Stemplot

http://www.sjsu.edu/faculty/gerstman/StatPrimer/freq.pdf

Wednesday, August 22, 2007

More box and whisker

Using the data from problem 1 of the 2001 exam, answer the questions posed in class. Part C reads, "The news media reported that in a particular year, there were only 10 inches of rainfall. Use the information provided to comment on this reported statement."

Keep in mind all the errors that students might have made under test conditions. What do you suppose that a student under extreme time pressure might have done wrong on this problem?

Tuesday, August 21, 2007

Box and whisker plots

You collected data today in class and we talked through the processes of finding the five number summary and constructing a box and whisker plot for univariate quantitative data.

5 number summary: Min, Q1, Med, Q3, Max

Use the 5 number summary to construct the box and whisker plot. Use the interquartile range (the length of the box containing the middle 50% of the distribution) to determine whether observations are outliers. The fences are Q1 - 1.5(IQR) and Q3 + 1.5(IQR); observations outside them are flagged as outliers.
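
Here is a Python sketch of the fences, with made-up data. (statistics.quantiles may compute the quartiles slightly differently from your calculator or the book, so small differences are normal.)

import statistics

data = [2, 5, 6, 7, 8, 9, 10, 12, 13, 30]        # made-up data with a suspicious max
q1, med, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(min(data), q1, med, q3, max(data))          # the five number summary
print([x for x in data if x < low_fence or x > high_fence])   # flagged outliers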
**********************
For homework, construct the modified box and whisker plot for your data.
**********************
There are many websites that explain how to perform this task.
http://www.statcan.ca/english/edu/power/ch12/plots.htm

Can you find one that you like better? Please share the site with the rest of us.

Monday, August 20, 2007

Types of variables

You further analyzed the data you collected last week and made a claim about whether we can use the x-value to predict the y-value for your experience.

We discussed many types of variables in class today, focusing on quantitative (discrete and continuous) and categorical.

Create a list of 20 clever variables and identify whether each is quantitative or categorical. Include both types in your list. If the variable is quantitative, determine whether it is discrete or continuous.

Examples:

The number of students in math classes at Lassiter: quantitative and discrete.
The number of minutes of studying/homework done by students each night: quantitative and continuous.
The math course taken by students at Lassiter: categorical. [The values that the variable can take are the different courses.]

Friday, August 17, 2007

The linear labs

Find the least-squares regression line through the data you collected in class today. Describe trends in your data, the direction, linearity, strength of the relationship, and presence of outliers. Present a comment about the nature of the connection between the x and y values from your lab.

First period:

These may not be your actual data. . . the postits got a little mixed up. Anyway, you can use these for your HW.


Other classes:
If you lost your data, post a message asking for your particular set of data. NO LAST NAMES PLEASE!!!!


Arm/foot
12, 10.5
11, 10
11, 9.5
11, 10
11.5, 10
11, 10
11, 10
11.5, 10.25
12, 10.5
11.5, 10
12, 11.5
11.5, 10
11, 9
10, 8.5
9.5, 9.25
11, 9
8.5, 9
8.5, 9


Days
15, 16
1, 30
3, 28
9, 21
8, 22
12, 18
11, 17
14, 17
15, 16
17, 14
19, 12
7, 23
26, 5
24, 7
25, 6
24, 6
30, 1
30, 1

Ball toss
5, 6
3, 3
3, 3
2, 4
1, 2
2, 1
1, 0
0, 0

Thursday, August 16, 2007

The ball-measurement lab

You did a good job using the center, shape, and spread of data to match the graphs to the summary statistics for last night's assignment.

The data you collected will be posted here. Please post comments and questions attached to this entry. Our technical team is standing by, ready to answer your questions!

Your assignment is to generate a least-squares regression line for the data. Your technical team will post instructions.

Do not use your full name when you create your account.

Here are data you collected:

Circumference Diameter
25 3.5
24.75 3.5
19.5 2.5
19.875 3.25
19.75 3
19.75 2.5
20.5 3
16.5 2.25
16 2
15.25 2
12.25 1.5
15.25 2
8 1
8.5 1
12 1.5
12.5 1.75
8.25 1.25
12 1.75
13 1.75
5.5 0.75
3.5 0.5
5 0.5
3.5 0.5
5.5 0.75
4.75 0.5
5.5 0.75
9 1.25
12.5 1.75
15.25 2.25
20 3
19.75 2.5
25 3.5
9 1.25
12.2 1.75
8 1
5.5 0.75
4.75 0.75
8.25 1.25
20 5.5
33 8.5
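
If you want to check your LSRL, here is a sketch using a handful of the (circumference, diameter) pairs from the table above. For a true circle measured in a single unit the slope would be near pi; your class measurements may tell their own story.

from scipy.stats import linregress

# a few of the measurements from the table above
circumference = [25, 24.75, 19.5, 16.5, 12.25, 8, 5.5, 3.5]
diameter = [3.5, 3.5, 2.5, 2.25, 1.5, 1, 0.75, 0.5]

fit = linregress(diameter, circumference)    # x = diameter, y = circumference
print(f"y-hat = {fit.intercept:.2f} + {fit.slope:.2f}x, r = {fit.rvalue:.3f}")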

Tuesday, August 14, 2007

Welcome to the new school year!

You did a marvelous job of spinning pennies today. Can you think of ways to minimize the outside influences on the results of the spin? How could you make the results dependent only on the fairness of the coin--well, as much as possible?

Your homework can be found on Classhomework.com.

Some of our classmates have reported that the composition notebooks are sold out of local stores. Don't panic. We won't do our first write-up for at least another week.

Also, don't rush out to buy a study guide or a new calculator. The new edition of the Barron's guide will be released in September. TI will be releasing its new calculator, the TI-Nspire, in September as well. There is no sense in spending money on last-year's model.

We will issue textbooks when we finally need them. We will do some more investigations (labs) in class before we use the text.

When you (eventually) need to retrieve a document from the HW website, the password that you will use is lassitermath.

Leave a comment if you want to share thoughts or questions.

Sunday, May 13, 2007

The final weeks

Clarification: The AP exam exemption policy says that you must be passing the class and have fewer than 6 absences to exempt the final exam. Anyone who is not passing must take the exam.

If you are part of the crowd who has to take the final, then take those practice exams that you used to prepare for the real exam. Your exam will be multiple choice.

Sunday, May 06, 2007

Getting ready for the exam

I have sent out emails to 3rd period teachers asking for you to be released in time to eat (except for JROTC and pers fitness. Please let them know!!!).

Meet at the outdoor classroom between 11 and 11:30.
Bring pencils, calculator, pen, smile, sweater, mittens, scarf, hand warmers. . .

Just kidding a little about the stuff to keep you warm. Reports from Monday's exams were that it was FREEZING in the gym. Bundle up in layers.

Some sites to visit:
for fun
http://www.youtube.com/watch?v=Ooa8nHKPZ5k
for real
http://tinyurl.com/gwcmq


Some things to remember:
You CANNOT discuss the MC problems at all--not in person, not on the phone, not on the web.

You can discuss the free response problems after 4:00 on Thursday.

Bring pencils and a pen to the test. Do not bring your cell phones, i-pods, etc. You can leave them in my room. I will lock them up.

Know your assumptions/conditions.

What questions do you have?

Monday, April 30, 2007

Chapter 13 Inferences for regression

Here are the answers from today's activities:
Please forgive formatting. Note that the values of SECoef, T, and P for the constant are not used in our inference calculations.


Regression Analysis: C6 versus C5
The regression equation is
C6 = 26.7 + 57.0 C5

Predictor Coef SECoef T P
Constant 26.75 19.44 1.38 0.263
C5 57.00 42.17 1.35 0.269

S = 9.42956 R-Sq = 37.8% R-Sq(adj) = 17.1%

Regression Analysis: C9 versus C8
The regression equation is
C9 = 32.4 + 37.5 C8

Predictor Coef SECoef T P
Constant 32.38 25.87 1.25 0.299
C8 37.50 60.19 0.62 0.577

S = 14.1877 R-Sq = 11.5% R-Sq(adj) = 0.0%


Regression Analysis: C12 versus C11
The regression equation is
C12 = 17.5 + 90.3 C11

Predictor Coef SECoef T P
Constant 17.50 10.45 1.68 0.192
C11 90.31 14.96 6.04 0.009

S = 8.52974 R-Sq = 92.4% R-Sq(adj) = 89.9%
--------------------------------------------
Regression Analysis: C15 versus C14
The regression equation is
C15 = 52.7 - 14.0 C14

Predictor Coef SECoef T P
Constant 52.67 18.74 2.81 0.067
C14 -14.00 42.50 -0.33 0.764

S = 7.76030 R-Sq = 3.5% R-Sq(adj) = 0.0%
-------------------------------------------

Descriptive Statistics: C19, C20
Variable N N* Mean SE Mean StDev
C19 5 0 0.5000 0.0791 0.1768
C20 5 0 61.00 8.34 18.64


Regression Analysis: C20 versus C19
The regression equation is
C20 = 9.00 + 104 C19

Predictor Coef SECoef T P
Constant 9.000 5.279 1.70 0.187
C19 104.00 10.07 10.33 0.002

S = 3.55903 R-Sq = 97.3% R-Sq(adj) = 96.4%

--------------------------------------------

Now, can you generate a confidence interval for the slope of the REAL regression line from one of your estimates? What does your (large) interval tell you about the strength of the relationship between x and y?
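
As a sketch of what that interval looks like, take the C12-versus-C11 output above: b = 90.31 and SE(b) = 14.96, and the printed t and p values are consistent with 5 data points, so df = n - 2 = 3. In Python:

from scipy.stats import t

b, se_b, df = 90.31, 14.96, 3                 # read from the output above
t_star = t.ppf(0.975, df)                     # critical value for 95% confidence
print(b - t_star * se_b, b + t_star * se_b)   # roughly 42.7 to 137.9 -- wide!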

What were the three types of evidence you used to answer the questions about the model? Match the evidence to the question online at the quizplace. http://www.proprofs.com/quiz-school/quizview.php?id=968





Friday, April 13, 2007

Chapter 13 Inferences using Chi-square procedures

Hang in there; we're getting close to the end of the race.

This chapter introduces us to chi-square procedures. These methods are generally used to analyze tables of counts from samples which are separated into CELLS based on one or more categorical variables. The advantage of these methods is that you can perform many comparisons at once, instead of just two as in our previous procedures using z and t. Most students like chi-square procedures better than z and t procedures because we will be using counts rather than continuous data and our tests are automatically two-tailed.

For instance, we might analyze the COUNTS of M&Ms of each of the six usual colors in a bag or the distribution of the COUNTS of teachers at each combination of YEARS OF EXPERIENCE and HIGHEST DEGREE ATTAINED. Each element counted must be placed in exactly one CELL. We will compare the OBSERVED counts from a sample or samples to the EXPECTED counts in a way that will quantify the likelihood of this size error so we can make inferences.

There are two common versions of this test, one for situations where there is a set of guidelines or percentages that your sample data should match and one where the observations themselves determine the expected counts using the independence principle. In order to make an inference about the population or populations involved, the samples used must be SRS.

Another condition that must be met is that each expected count must be at least one. Furthermore, at least 80% of the expected counts must be at least 5. Although the observed counts must be integer values, the expected counts (just like expected values) do not need to be integers. [Error alert: Many students INCORRECTLY use the observed counts instead of the expected counts to determine whether the test is appropriate.]

Depending on the type of test we are performing there will be one of two different methods for calculating the expected counts. For both types of tests, once you have found the expected counts, you calculate chi-square components for each pair of observed and expected counts:

chi-square component = (observed - expected)^2/expected. (Of course, these are all non-negative.)


You add up all of the chi-square components to get the chi-square statistic, X^2.

You compare this X^2 value to the chi-square distribution with the appropriate number of degrees of freedom to find the p-value, or probability that you could get a X^2 value at least this large, randomly, when the null hypothesis is true. If this p-value is small, we reject the null hypothesis. If the p-value is large, we do not have sufficient evidence to conclude that the alternative is preferred.

So, I haven't addressed the hypotheses. . ..

Chi-square Goodness of Fit Test (GOF)

This is the test you use to compare a sample set of observed counts to a model that is defined somewhere else by a higher authority. Some examples:

Comparing your bag of M&Ms to the distribution of colors posted on the M&M/Mars website.

Comparing your bag of M&Ms to a uniform distribution by color (1/6th of the bag / color).

Comparing the age distribution of your town to the U.S. Census proportions.

Comparing the number of students at your school making 1, 2, 3, 4, or 5 on the AP exam compared to the global distribution.


Your null hypothesis states that the distribution matches the expected distribution. The alternative is that the distributions do not match. It is important that you write the first statement in context.


If the null hypothesis says something like p1 = p2 = p3, the alternative hypothesis SHOULD NOT be "p1 is not equal to . . . " because some of the pairs of proportions could still be equal yet the numbers do not match the distribution you wanted. Instead, use verbal descriptions like the distribution does not match the model.


To find each expected count, you take the proportions from the higher authority and multiply them by the total of all observations. You will generally get non-integer values.

Check the expected counts to make sure that all of them are at least one and check a second time to make sure that at least 80% are 5 or more.

Perform the calculations described, computing the chi-square components, adding them up to get the chi-square statistic, using that statistic to find the p-value, and making a decision in the context of the problem. If you choose to reject the null hypothesis, go back through the components to find the greatest contributor to the high chi-square statistic and cite that in your decision.
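
Here is the whole GOF recipe as a Python sketch, using a hypothetical bag of candies tested against a uniform (1/6 per color) model:

from scipy.stats import chi2

observed = [12, 9, 10, 15, 8, 6]                # hypothetical counts from one bag
n = sum(observed)
expected = [n / 6] * 6                          # higher-authority proportion times the total

# condition checks: every expected count at least 1, at least 80% of them 5 or more
assert all(e >= 1 for e in expected)
assert sum(e >= 5 for e in expected) >= 0.8 * len(expected)

components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(components)                    # the chi-square statistic
p_value = chi2.sf(chi_square, df=len(observed) - 1)   # df = cells - 1
print(chi_square, p_value)
print(max(components))                          # the biggest contributor, worth citing if you reject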


Chi-square Tests of Association: Independence and Homogeneity

When you have two or more samples from one population or two or more samples from two or more populations that you are comparing against each other with respect to categorical variables you will generally perform a chi-square test of association on the two-way table that you create to summarize the samples.


Use the words of the problem to generate the Ho and Ha for this test. The null hypothesis will customarily follow the pattern there is no association between [characteristic one] and [characteristic two].


The method for finding the expected values is different from the method described for the goodness of fit test. Otherwise, the tests are virtually the same.

To find each expected value for the cells of the two-way table, multiply the row total by the column total for that cell and divide by the grand total. Again, you will likely get non-integer numbers. Check the expecteds to see if they are at least 1 and at least 5 as described above. Calculate the components and chi-square statistic using the same formulas as in the goodness of fit test and evaluate the statistic in the same way.
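
A minimal sketch of that expected-count calculation, on a hypothetical 2x2 table:

table = [[30, 20],
         [45, 55]]                 # hypothetical two-way table of counts

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

expected = [[r * c / grand for c in col_totals] for r in row_totals]
print(expected)                    # each cell: row total * column total / grand total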

Setting up the hypotheses

One of the hardest problems for students seems to be figuring out what the null and alternative hypotheses should be. Consider the test itself. Whenever the observed count matches the expected count you get a chi-square component equal to zero--something that does not contribute anything to our chi-square statistic. If ALL of the numbers matched, then our statistic would be zero and it would be graphed on the far left end of our distribution, leaving 100% of the probability to the right--a p-value of 1. (Fail to reject the null!!!!)


On the other hand, if our observed values are far from the expecteds, then the chi-square components will contribute to a larger statistic and, ultimately, a smaller p-value. (If p is small enough, reject the null!!!!!)

How does this help us to generate our hypotheses? For our null hypothesis, our observeds must be close to our expecteds. When does that happen? When our idea of what should have happened actually DID happen, for instance, when we expected the distribution to be practically uniform and it was.


This is just a little trickier when we are talking about association. The null hypothesis is that the characteristics listed along the top of the two-way table have nothing to do with the characteristics listed on the side of the table. If we proposed that video-gaming and gender were independent, then we would expect the same proportion of boys to be gamers as the girl gamers. Even though the wording of the problem may be ambiguous (Are gaming and gender independent? vs Is there a relationship between gaming and gender?), the test is still the same. The comparison that you make is between the observed counts and what the counts should be if the two characteristics are independent.

Thursday, March 29, 2007

Reviewing concepts

The problem for Thursday's HW:
P-hat is 0.3, Ho: p = .25, Ha: p > .25

Your rival thinks that the sample indicates over 25% support for his program. He found 12/40 customers liked the idea. Write an email to the boss to enlighten him/her.


**********
The card problem:
Three cards are in a hat. One is white on both sides, one is red on both sides, and one has one white face and one red face. The cards are mixed and one is drawn from the hat and placed flat on the table without showing the underside. If the face showing is red, what is the probability that the other face is also red?

Monday, March 12, 2007

Chapter 12 Inference for proportions

Statistics in action. . .

Here's the basketball video.
http://viscog.beckman.uiuc.edu/grafs/demos/15.html

NCAA Brian's out in front with no way for anyone to catch up (I think). Pretty amazing. http://linnerstats.mayhem.sportsline.com/e
You'll need the password, which tells you who I think will win: gogators


Please try out this quiz and let me know how it works for you.
http://www.proprofs.com/quiz-school/quizview.php?id=567 :Basic stuff quiz

http://www.proprofs.com/quiz-school/quizview.php?id=585 :Which test do we do?

Cool sites for playing with proportions:
http://www.ltcconline.net/greenl/java/Statistics/HypTestProp/HypTestProp.htm

http://www.math.csusb.edu/faculty/stanton/m262/proportions/proportions.html
List of top engineering schools for recruiting as discussed in class (not in any particular order):
Cal Poly, Penn State, Penn, MIT, Florida A&M, Florida, RPI, Morgan State, Maryland, UCLA, Virginia, VA Tech, Iowa State, GA Tech, Howard, Colorado, Arizona, Cal – Berkeley, North Carolina A&T, Puerto Rico, Michigan, Carnegie Mellon, Ohio State, Purdue, Illinois, Cornell, Texas, Texas A&M, Stanford, USC

This chapter is more of the same methods we saw in the last two chapters. You perform hypothesis tests and confidence intervals for proportions and for differences between proportions.

The tricky bits: (1) you have to keep track of which version of the proportion you will use for testing assumptions and for calculating standard deviations/std errors. Simply use the "best" information available. (2) Recognize when the inference is about proportions and when it is about measurements (chapter 11 methods). If you use X when you should have used p you let the reader know that you are confused.

When you have a 1 proportion hypothesis test, you have a hypothesized value for p that you use for both checking assumptions (conditions) and calculating the std dev.

When you are constructing a 1 proportion confidence interval, use the best info you have--the sample proportion. This is the lucky case where you just record the number of successes and the number of failures when you are checking the conditions. Because the estimator is used, we call the sqrt(p-hat(1- p-hat)/n) the standard error. Estimate------>>>>std error.

When you have a 2 proportion hypothesis test and you are testing to see if the two proportions are the same, well, doesn't that mean that the two proportions that you use in the std error calculation should be the same? In this case you generate a "pooled" estimator (pooled sample proportion = sum of x / sum of n) to use for condition checking and for std error calculations. When checking conditions, use the pooled proportion * each value of n and (1 - the pooled proportion) * each value of n and make sure that each product is 5 or more.

On the other hand, when you are creating a 2 proportion confidence interval for the difference, you are not assuming that the proportions are the same, so the proportions must be checked separately and the formula for the std error resembles the formulas for two-sample conf interval std errors from Ch 11 a little bit. Checking conditions: for each sample check p-hat for that sample * sample size and (1-p-hat for that sample) * sample size.
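
Here is a Python sketch of both calculations side by side, with hypothetical counts, so you can see exactly where the pooled proportion is (and is not) used:

import math

x1, n1 = 48, 120                  # hypothetical successes / sample size, sample 1
x2, n2 = 30, 100                  # hypothetical successes / sample size, sample 2
p1, p2 = x1 / n1, x2 / n2

# Hypothesis test: Ho says the proportions are equal, so POOL
p_pool = (x1 + x2) / (n1 + n2)
se_test = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se_test

# Confidence interval: no pooling -- each sample keeps its own p-hat
se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(z)
print((p1 - p2) - 1.96 * se_ci, (p1 - p2) + 1.96 * se_ci)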

Wednesday, February 21, 2007

Chapter 11 Inference for Distributions

To understand this chapter you have to understand the processes of Chapter 10.


The t-distribution is a lot like the normal (z) distribution. It is much more forgiving (look for the references in the book to robustness) than the normal and we use it mostly when we have only a sample to work from--no population standard deviation.

The formulas involving t start out a lot like the z formulas.

t-statistic = (x-bar - mean)/(sample std dev/sqrt n)

and t-interval boundaries are x-bar +/- t* (sample std dev/sqrt n)

We use n-1 degrees of freedom because we "lost" one when we used x-bar to create the estimator s.

The sample std dev / sqrt n is called the standard error of the mean.


The value we use for t*, in fact the line of the table we use when considering probabilities, is based on the number of degrees of freedom (df). You can't use a line with a df = some number if you don't have at least that number of degrees of freedom. It's kind of like buying stuff. If you don't have the money, you can't buy the product. Do you realize what this means??? If you have 990 degrees of freedom and the closest choices in the text are 100 and 1000, you are supposed to select the conservative number, the one you can afford, 100 df. Now, if you can get a closer number from your calculator, use it.

How can you get the value from your calculator? (1) Use the Inv T program or function. TI-84s with system 2.41 have it. If you have an '84, upgrade your system. If you have something else, get the program.
(2) Use the trick we demonstrated in class: Use T-INT with x-bar = 0, sx = sqrt of n, and n = n. The upper bound of the interval you generate is the estimate for t*.
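
If you have Python around instead, scipy gives t* directly:

from scipy.stats import t

print(t.ppf(0.975, 990))   # t* for 95% confidence with 990 df, about 1.962
print(t.ppf(0.975, 100))   # the conservative table line, about 1.984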


Paired t-test

This is a routine t-test that is done on matched-pairs data. When you can load the first data set into L1 and the second into L2 and the following two conditions hold, you are looking at a matched-pairs design. (1) Each row of the data has to be naturally linked, as in data coming from the same person--and a different person from the rest of the rows. The two lists are DEFINITELY NOT independent of each other. (2) The variable of interest is the difference between the two values, like L1 - L2. The null hypothesis is usually mu(of the differences) = 0.

To perform the test, just do the regular t-procedures on the column of differences. DF still equals n-1.

If the two sets of data are two independent samples, that's something different. . ..

Two-sample tests

Note: The t-statistic for the difference between two means IS NOT t-distributed, but it is pretty close under most conditions.

We use two-sample procedures when we are looking at two separate, independent samples and trying to make an inference about the difference between the two population means.

While most of the procedure is intuitive, the standard error and the number of degrees of freedom require a little explanation.

Std Error of the difference of the means:
Do you remember how we can't add std deviations? And how the variance of the difference of two variables is the sum of the variances? Put it together for this problem.

Find each sample's squared standard error of the mean, (s/sqrt(n))^2 = s^2/n. Add the two together. Take the square root. In these formulas, s1 is the sample std dev for the first sample, n1 is the size of the first sample, etc.

Then the std error of the difference = sqrt( (s1^2/n1) + (s2^2/n2) ).

Degrees of freedom:
For the number of degrees of freedom, either use the number that the calculator or the computer calculates for you or use the more conservative minimum of n1-1 or n2-1.
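
Here is a sketch of both the standard error and the two df choices, with hypothetical summary statistics. The long formula is the Welch-Satterthwaite approximation that your calculator reports:

import math

s1, n1 = 4.2, 25                 # hypothetical std dev and size, sample 1
s2, n2 = 5.1, 30                 # hypothetical std dev and size, sample 2

v1, v2 = s1**2 / n1, s2**2 / n2
se = math.sqrt(v1 + v2)          # std error of the difference of the means

df_calc = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))  # what the calculator reports
df_conservative = min(n1, n2) - 1                                 # the conservative choice
print(se, df_calc, df_conservative)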

Hypothesis:
Ho: mu1 = mu2, which is equivalent to Ho: mu1 - mu2 = 0

Other than these little changes, the procedures are similar to those you've already practiced.



Pooled vs unpooled

This refers to the situations when you believe that the variances of the two populations should really be equal. Using a concept similar to our Law of Large Numbers, combining the standard deviations from the samples in a clever way creates an even stronger estimate for the ONE estimated standard deviation. This is pooling of variances.

Just because we hypothesize that the means are the same, we cannot assume that the variances are equal too.

We almost never pool variances of X-bar. You can generally leave your calculator set on UNPOOLED and forget about memorizing the formula. You can only pool variances if you are really sure that the variances are equal.

Tuesday, February 06, 2007

Chapter 10 Beginning of Inference

This chapter introduces important methods under the highly unrealistic conditions where we know the population standard deviation but not the population mean.

Point estimates for the average value of X found through samples are generally good estimates, but they are almost never exactly right. You can generate a better kind of estimate by creating a confidence interval.

The confidence interval =
x-bar +/- Z* times sigma of x / (Sqrt n).

We get Z* from the t-table for a specific confidence level, for instance when we want a 95% we use 1.96.

In creating a complete solution we first write down all of the given information. Define your variable. Then we determine whether the central limit theorem has "kicked in" or if the underlying data were already normally distributed. Be sure to address whether the data were from a SRS. Graph them if you have them to make sure there are no gaps or outliers. Is the sample size less than 1/10 of the population size???

Identify what you are trying to produce-- a 95% Z interval for mu and give the formula. Show how the numbers are plugged in and calculate the interval.

Write the interpretation of your interval.

We are 95% confident that the true population mean value of [insert the contextual information here] falls between [lower bound] and [upper bound].

Refer to your notes for all of the baaaaaaaaaaaaad interpretations of a confidence interval and NEVER use them. :)

If a value of mu had been proposed before we collected our sample, we could see if the value falls within our interval. If the proposed value does fall in the interval, then it is a reasonable value, although not necessarily correct. If it does not fall in the interval, then it is not a reasonable value according to our sample values.

Hypothesis tests

For hypothesis tests, you develop a null and alternative hypothesis BEFORE you collect data. Both hypotheses use the parameter (NEVER THE STATISTIC) and they are considered logical opposites. The null hypothesis ALWAYS has an "equals" aspect to it; the alternative hypothesis is always <, >, or not equal to.

For instance: H0: mu = 15
Ha: mu > 15.

Although these are not actually opposites, finding evidence that mu is less than 15 provides no support for the alternative hypothesis. You can think of the null hypothesis in this case as mu<=15, which still has an "equals" in it. This is the way I learned hypothesis-writing back in the day and it is still acceptable, but not as common.

Alpha, Beta, Type I error, Type II error, and Power

Alpha is the likelihood of a Type I error--accidentally rejecting the null hypothesis when it was actually correct. (Like convicting the wrong guy.)

Beta is the likelihood of a Type II error--failing to reject the null when it was wrong. (Kind of an error of omission, or not enough evidence to convict.)

Power is the likelihood that the test would have been sensitive enough to pick up the difference between the hypothesized mu and the actual mu (given some other new value for mu). This is the complement of Beta. Yes, 1 - Beta = Power. 1 - power = beta. Power + beta = 1.

Notice that alpha and beta are NEVER added together. They don't live under the same conditions--one assumes that the null was true and the other that the null was false. DO not fall into the trap of EVER adding alpha and beta together (unless you are TOLD to do it and then only if they offer you a lot of money or a passing grade on a test).

Calculating beta is easier than people think on the calculator.
(1) Figure out what the critical values are for rejection of Ho in terms of x-bar.

(2) Find the area under the curve centered at the NEW mu that falls between these critical values. You can use normalcdf(left_critical_value, right_critical_value, new mean, standard dev or error of x-bar).
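
Here is that two-step recipe as a Python sketch for a hypothetical one-sided test (Ho: mu = 100, Ha: mu > 100, sigma = 15, n = 36, alpha = .05), with a proposed new mu of 105; scipy's norm.cdf plays the role of normalcdf:

from scipy.stats import norm

sigma_xbar = 15 / 36**0.5                                # std dev of x-bar = 2.5
critical = norm.ppf(0.95, loc=100, scale=sigma_xbar)     # step 1: reject Ho above this x-bar

new_mu = 105                                             # the proposed "actual" mu
beta = norm.cdf(critical, loc=new_mu, scale=sigma_xbar)  # step 2: area below the cutoff, centered at the NEW mu
print(beta, 1 - beta)                                    # beta, and power = 1 - beta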

Saturday, January 20, 2007

Chapter 9 - Sampling Distributions

How does the sample size affect our estimate and our decisions?

Parameters are the (usually unknown) measures of a population. Often they are represented by Greek letters like mu and sigma.

Statistics are the calculated measures generated from the samples. Statistics are estimators for parameters.

When the average of a statistic is the parameter itself, it is called an unbiased estimator. X-bar is an unbiased estimator for mu, the population mean.

The sampling distribution of x-bar is the distribution of all of the averages of all of the samples of size n taken from a population.

When the sample size n increases, the variability of the means of the samples decreases--the graph of the sampling distribution is taller and narrower. When the sample size n decreases, the variability of the means of the samples increases.

This holds for sample proportions. The mean of the sample proportions (p-hats) is the true proportion for the population, p. Under special conditions we can use a formula for the standard deviation of the p-hats: SQRT(p*(1-p)/n).

The condition that allows this is that the sample is less than 1/10th of the population (and, of course, we're talking about simple random samples!!)

Also, the really BIG twist is that we can also use an approximation to the normal distribution when the expected numbers of successes and failures are both 10 or more.

So, about that CLT thing. . . What was the REALLY BIG idea with the Central Limit Theorem???

How do you express the distributions for a binomial X, a geometric X, a uniform X, a normal X, the sampling distribution (X-bar), and the sample proportions (p-hat)?

When can you assume that the sampling distribution is approximately normally distributed?

What do you have to write to support your calculations of mean and standard deviation? your calculations of probabilities?

Monday, January 08, 2007

Chapter 8 - Binomial and Geometric distributions

Part 1 - Binomials

Binomial distributions have the following defining characteristics:

(1) Only two mutually-exclusive and complementary events are possible on each trial--success or failure.

(2) The number of trials is fixed (n).

(3) The probability of a success on any trial is fixed at p. This DOES NOT mean that the probability of a success is always 50%.

(4) The trials are independent--knowing one outcome does not help you predict the next.

Always define what X represents, for instance, X = number of daughters (successes).

Shorthand identification for a binomial distribution: Binom(n, p).


The calculator will provide probabilities given n and p: binompdf(n,p[,x]) and binomcdf(n,p[,x]). Use pdf when you want probabilities for individual values of X and cdf when you want cumulative values, like the probability that the number of successes is less than or equal to 5. Insert the X value when you want just one value for a specific value of X. You may omit it when you want all the probabilities. Caution! For binomials, the least value X can take is ZERO, not one, so make sure that you associate the right X values with the correct probabilities.


The formula for P(X=k) = nCk p^k * (1-p)^(n-k).

nCk is "n choose k" or n!/(k!*(n-k)!).

If you calculate these probabilities for each possible value of x from 0 to n and add them up you will get a sum of 1.

The expected value or mean of the number of successes in a binomial setting is "mu sub x" = n*p.

The variance of the number of successes in the binomial setting is sigma squared sub x = n * p * (1-p).

The standard deviation is (of course!) the square root of the variance.


What were those directions for loading binomial values into the lists and graphing as histograms? Use seq(X,X,0,n) --> L1 to populate the Xs and binompdf(n,p) --> L2 to insert the corresponding probabilities. To graph, select the histogram tool, use L1 as the xlist and L2 as the freq. You can use zoom 9 to generate a first stab at the graph. Then fix the graph using the window controls.
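
The same calculations in Python, for a hypothetical Binom(10, 0.3), in case you want to check your calculator's answers (scipy's binom.pmf and binom.cdf mirror binompdf and binomcdf):

from scipy.stats import binom

n, p = 10, 0.3                    # a hypothetical Binom(n, p) setting
print(binom.pmf(5, n, p))         # like binompdf(n, p, 5): P(X = 5)
print(binom.cdf(5, n, p))         # like binomcdf(n, p, 5): P(X <= 5)
print(n * p, n * p * (1 - p))     # mean and variance
print(sum(binom.pmf(k, n, p) for k in range(0, n + 1)))   # X starts at ZERO; the total is 1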


Part 2--Geometric Distribution

This was different from the binomial in that we are counting the number of trials UNTIL we achieve success, then we stop. This means that X is the number of trials it took and there is no "n" involved. Theoretically, it could take us infinitely many tries before we had a successful result.

Defining characteristics: fixed p, s/f, independent trials, count until success (not a fixed n).

The expected value of x, the number of trials required, is 1/p, where p is the probability of a success in one try. The variance is (1-p)/p^2.

The probability distribution for x = 1, 2, 3, 4, etc. is p, (1-p)p, (1-p)^2*p, (1-p)^3*p, etc.

What is the probability that it takes more than k attempts before you get a success?
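
Here is a Python sketch for a hypothetical p = 0.25, including the answer to that last question: P(X > k) = (1-p)^k, which scipy's geom.sf confirms:

from scipy.stats import geom

p = 0.25                              # hypothetical chance of success on each try
print(geom.pmf(3, p))                 # P(first success on try 3) = (1-p)^2 * p
print(1 / p, (1 - p) / p**2)          # expected number of tries, and the variance
print(geom.sf(4, p), (1 - p)**4)      # P(X > 4) = (1-p)^4 -- the answer to the question above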

Tuesday, January 02, 2007

State of Fear

The rhetorical questions:

What does State of Fear refer to?

What is true? How do you know? Who do you trust? What role does the statistician play in your understanding of news? What role do the media play?


The question to answer:

How can someone lie with statistics?