Saturday, February 9, 2008

Success with Random Variables

So this afternoon, I finished grading a quiz focusing on random variables associated with a sequence of Bernoulli trials. Based on these quizzes, I'm trying to understand what makes a probability course so difficult for students to understand. The other night, I was at a department social and talking with a professor who has taught Math 318 many times. He came right out and stated (without me giving any prompting) that he is always surprised at how students have such a hard time with the course, even though the actual mathematics involved in the course are fairly straight-forward.

In the last entry, I noted that at least part of the challenge is that there are so many different types of functions that appear. But I think that a significant issue comes back to confusion about what random variables represent and how they relate to questions that are posed. Recall that a random variable summarizes some aspect of a random experiment as a single number. (We will soon generalize to the ability to summarize with multiple random variables.) Typical questions in probability focus on the probability of events and the average of certain quantities. Almost always, we answer such questions by identifying an appropriate random variable. We then decide how to characterize events in terms of that random variable, or how to express the quantity being averaged in terms of that random variable.

The examples of random variables related to a sequence of Bernoulli trials provide our first real example of this type of reasoning, and I think this transition is part of why the topic was difficult for many of you. First, we must remember that the random variable is not the same as the random experiment, but simply one of many different ways to summarize an aspect of that experiment. The experiment itself is characterized completely by the sequence of Bernoulli trials, each of which is going to be either a success or a failure. Random variables will be measurement that are related to these successes or failures, and we must choose an appropriate measurement that will allow us to answer the questions of interest.

On our quiz last week, I gave the following scenario: "Apples are packaged in 3-pound bags. Suppose that 4% of the time, the bag weighs less than 3 pounds. You select bags randomly and weigh them in order." In principle, there is an unlimited supply of bags of apples, and we indefinitely select a random bag and weigh it. If we were to weigh enough bags over a long enough time, we would find that the fraction that are underweight appears 4% of the time. However, for any particular number of weighings, we could have found any number of underweight bags. The random experiment is described by an infinite sequence of F's and U's, with an F if the bag was full and a U if the bag was underweight.

So now consider the first question: "What is the probability that you must weigh at least 20 bags before you find an underweight bag?" One method that we could use to do this is to directly understand the event in question. That is, we could create a tree diagram that would eventually give enough information to answer the question. Unfortunately, this tree diagram would be much to cumbersome to answer a question about the first 20 bags weighed: 2^20 different outcomes after 20 weighings. So we try to see if there is a random variable for this random experiment that has enough information to answer the question instead.

There are actually multiple ways to choose a random variable to answer the question. The best random variable is a geometric random variable, which counts the number of Bernoulli trials until we see the first success. Since our question considers when we find the first underweight bag, we define success for our purposes as weighing a bag as underweight. Thus, X counts the number of weighings until we find the first underweight bag. If the first bag is underweight, X=1. But if the first 5 bags are full and the 6th bag is underweight, X=6. Every path on our hypothetical tree diagram is associated with a particular value of the random variable. Since an underweight bag will be chosen with probability p=0.04, we use X~Geometric(p=0.04). Now, having chosen a random variable, we must determine how we answer the question in terms of that random variable. The event of interest, "weigh at least 20 bags before you find an underweight bag," corresponds to an event, "X is greater than or equal to 20". Thus, we wish to compute P[X≥20].

Another way that we could answer the question is to consider that finding an underweight bag for the first time on or after the 20th weighing means that the first 19 bags must have all been full. Consequently, we could answer our question using a Binomial random variable with n=19. So let X count the number of underweight bags out of the first 19 that are weighed. Again, a "success" is finding an underweight bag, so that X~Binomial(n=19,p=0.04). Our event can now be restated as saying that "X is equal to 0". But remember that the X in this paragraph is different from the random variable in the previous paragraph. For this random variable, our question will be answered by computing P[X=0].

The third question on the quiz asked, "What is the probability that you find the 5th underweight bag before you weigh 80 bags?" The best choice for a random variable that will answer our question is to count the number trials until you do find the 5th underweight bag. If we say that a trial is a success if the bag is underweight, then we are counting trials until the 5th success. That is, our random variable X is a negative binomial random variable with r=5 (the number of successes needed to stop counting) and p=0.04 (the success probability). We write this: X~Neg.Binom.(r=5,p=0.04). To answer the question, we must realize that the event of interest is to say that X<80. (We stop counting before we reach 80.) Thus, the answer would be P[X<80].

And of course, as I mentioned earlier, we could actually compute the probability using another type of random variable. For this problem, we must find another way to describe the event of interest. One way to do this is to realize that we find the fifth bag prior to the 80th weighing if there are at least 5 underweight bags by the time we weigh the 79th bag. This can be represented using a binomial random variable. Let our random variable X count the number of underweight bags in the first 79 weighed. Thus, X~Binomial(n=79, p=0.04). The probability we wish to compute is P[X>5].

In summary, to do well in probability, you will need to think probabilistically. Before you can actually compute a quantity (which usually involves fairly straightforward mathematics), you must identify a random variable and understand how to answer the question in terms of that random variable. Then you can use one of the appropriate functions to finally answer the question, whether it is a probability or an expectation.

No comments: