Sampling Fundamentals - Selecting the Probability Sample
There are a variety of methods that can be used to select a probability sample. The simplest, conceptually, is termed "simple random sampling." It not only has practical value, but it is a good vehicle for gaining intuitive understanding of the logic and power of random sampling.
Simple Random Sampling
Simple random sampling is an approach in which each population member, and thus each possible sample, has an equal probability of being selected. The implementation is straightforward. Put the name of each person in the population on a tag and place the tags in a large bowl. Mix the contents of the bowl thoroughly and then draw out the desired number for the sample. Such a method was, in fact, used to select the order in which men would be drafted for military service during the Vietnam War, using birth dates. Although the bowl was thought to be well mixed, the early drawing revealed a much higher number of December dates than January dates, indicating that the randomizing process can be more involved than it seems. The apparent reason was that the December tags were put in last, and the mixing was not sufficient to create a random draw. The solution was to randomize the order in which the dates were placed in the bowl.1
The use of a table of random numbers is usually much more practical than the use of a large bowl. A random-number table is a long list of digits, each of which is computer generated by randomly selecting a number from 0 to 9. It has the property that knowledge of a string of 10 digits gives no information about what the eleventh digit is. Suppose a sample is desired from a list of 5000 opera season ticketholders. A random-number table such as that shown in Table 11-2 might provide the following sets of numbers:
7659 | 0783 | 4710 | 3749 | 7741 | 2960 | 0016 | 9347
Using these numbers, a sample of five would be created that would include these ticketholders:
0783 | 4710 | 3749 | 2960 | 0016
The numbers above 5000 are disregarded, because there are no season ticketholders associated with them.
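The table lookup just described can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, ticketholders are assumed to be numbered 1 through 5000, and skipping repeated numbers (so that nobody is chosen twice) is an added assumption that the text does not state.

```python
def sample_from_random_numbers(random_numbers, population_size, sample_size):
    """Pick sample members from a stream of random-number-table entries.

    Entries outside 1..population_size are skipped. Repeats are skipped too
    (an assumption: sampling without replacement).
    """
    chosen = []
    for number in random_numbers:
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen


# The eight four-digit groups from the example, read left to right.
table_numbers = [7659, 783, 4710, 3749, 7741, 2960, 16, 9347]
sample = sample_from_random_numbers(table_numbers, population_size=5000, sample_size=5)
print(sample)  # [783, 4710, 3749, 2960, 16]
```

With a computer at hand, the standard-library call `random.sample(range(1, 5001), 5)` draws such a sample directly, without a printed table.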
1 Seymour Sudman, Applied Sampling (New York: Academic Press, 1976), p. 50.
sampling point would be the same each day, and periods of peak travel or low usage easily could be missed.
A common use of systematic sampling is in telephone surveys. A number like 17 could be obtained from a random-number table. Then the seventeenth name on each page of a telephone directory would be a sample member. (Actually, a random number of inches from the top of the page would be used, so that names would not have to be counted.) Of course, more than one name could be selected from each page if a larger sample were needed, or every other (or every third or fourth) page could be used if a smaller sample were desired.
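A minimal sketch of the directory procedure follows, assuming the directory is available as one ordered Python list and that each "page" holds the same number of names. The function name, the 100-entry directory, and the page length of 20 are all hypothetical.

```python
import random


def systematic_sample(frame, interval, start=None, rng=random):
    """Return every `interval`-th entry of `frame`, beginning at a random start.

    `start` is a zero-based offset in range(interval); when omitted it is
    drawn at random.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    if start is None:
        start = rng.randrange(interval)
    if not 0 <= start < interval:
        raise ValueError("start must lie in range(interval)")
    return frame[start::interval]


# Hypothetical directory: 5 pages of 20 names each.
directory = [f"name-{i:03d}" for i in range(1, 101)]
# The seventeenth name on every page corresponds to zero-based offset 16.
print(systematic_sample(directory, interval=20, start=16))
# ['name-017', 'name-037', 'name-057', 'name-077', 'name-097']
```

Doubling the interval corresponds to using every other page, and a second random start on each page would yield two names per page.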
Creating Lists
The biggest problem in simple random sampling is to obtain appropriate lists. The Donnelley Company maintains a list drawn from telephone directories and automobile registrations that contains around 85 percent of U.S. households. Such a list can be used to get a national sample for a mail survey. Within a community, the local utility company will have a fairly complete list of the households.
The problem, of course, is that lists do not exist for specialized populations. There is no list of high-income people, mothers, tennis players, or cyclists, for example. A solution for this problem that is usually unsatisfactory is to use a convenient list. For example, for tennis players, a list of subscribers to Tennis World or membership lists in tennis clubs might be available. Obviously, neither would be representative of the entire tennis-playing population, but either still might be useful for some purposes. When lists that do not match the population are used, biases are introduced that should be considered. For instance, readers of Tennis World will be much more involved and knowledgeable than the average tennis player. A list of residents of a given community will not include new arrivals or people living in dwellings built since the list was created. Thus, whole new subdivisions can be omitted. If such omissions are important, it can be worthwhile to identify new construction areas and design a separate sampling plan for them.
Sometimes several lists are combined in the hope of obtaining a more complete representation of the population. For example, subscribers to Tennis World and Tennis Today might be combined with a list of those who had purchased tennis equipment through a mail-order catalog. This approach, however, introduces the problem of duplication. Those appearing on several lists will have an increased chance of being selected. Removing duplication can be expensive and must be balanced against the bias that is introduced.
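The duplication problem can be illustrated with a small sketch. The names are invented, and matching on a lowercased, whitespace-collapsed name is a deliberately crude assumption; real lists need far more careful matching, which is where the expense arises.

```python
def merge_lists(*lists):
    """Combine several lists, keeping one entry per person."""
    seen = set()
    merged = []
    for source in lists:
        for name in source:
            # Crude normalization (assumption): ignore case and extra spaces.
            key = " ".join(name.lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(name)
    return merged


# Hypothetical subscriber and mail-order lists.
tennis_world = ["Ann Lee", "Bo Diaz"]
tennis_today = ["ann  lee", "Cy Fox"]
catalog_buyers = ["Bo Diaz", "Di Roy"]
print(merge_lists(tennis_world, tennis_today, catalog_buyers))
# ['Ann Lee', 'Bo Diaz', 'Cy Fox', 'Di Roy']
```

Without the `seen` check, Ann Lee and Bo Diaz would each appear twice in the combined list and so would be twice as likely to be drawn as the others.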
Another problem with lists is that of simply keeping them current. Many industrial firms maintain lists of those who have expressed interest in their products, and these are used in part for the mailing of promotional material. Similarly, many organizations, such as charities, symphonies, and art galleries, have lists of various types; but these lists can become outdated quickly as people move and change jobs within an organization.
Telephone Interviewing
The use of telephone directories as a basis for generating a sample is extensive, as might be expected. The concern with the use of directories is the population members who are omitted because they have changed residences, requested an unlisted number, or simply do not have a telephone.
The incidence of unlisted numbers is extensive and varies dramatically from area to area. The percentage of phones that are unlisted in the major metropolitan areas ranges from 14.7 percent in West Palm Beach, Florida, to 60.3 percent in Las Vegas, Nevada, according to a study done by Survey Samples, a firm that provides telephone samples for the market research industry. Table 11-3 lists the metropolitan areas with the highest levels of unlisted numbers and the two with the lowest. Nationally, 28 percent of the phones were unlisted in 1988, up from 22 percent in 1984 and 10 percent in 1965.
About 15 percent of the unlisted numbers comprise people who have moved and have not had a chance to get a number listed. The other 85 percent are people motivated to avoid crank or prank callers, telemarketers, bill collectors, or other unwanted callers. Those with unlisted numbers differ from other telephone subscribers. On the average, they are more likely to be women, in the 25 to 35 age group, have high incomes, have modest incomes, or have moved during the past year. One study showed that of those requesting unlisted telephones, 42 percent were female, whereas 32 percent of the other subscribers were female.
TABLE 11-3
Percentage of Unlisted Telephones by Metropolitan Area
Metropolitan area | Percent unlisted
Las Vegas, Nevada | 60.3
Los Angeles-Long Beach, California | 56.0
Oakland, California | 53.6
Fresno, California | 52.6
Jersey City, New Jersey | 51.3
San Jose, California | 50.6
Albany, New York | 14.9
West Palm Beach, Florida | 14.7
Source: "Unlisted Numbers," San Jose Mercury News, November 30, 1988, p. 1A.
One way to reach unlisted numbers, dialing numbers at random, can be very costly because many of the numbers will be unassigned or will be business numbers. A variant starts from a sample of listed telephone numbers. The number called is then the number drawn from the directory plus some fixed number, such as 10. Of course, this method will still reach some nonworking numbers. A study using this approach, conducted in two Colorado communities (Sterling and Boulder), resulted in 10 percent nonworking numbers in Sterling and 29 percent in Boulder.
The method of adding a fixed number to a listed telephone number will not include those who are in a new series of numbers being activated by the telephone company. Seymour Sudman of the Survey Research Laboratory of the University of Illinois, a researcher long interested in sampling issues, therefore suggests that the last three digits in a listed telephone number be replaced by a three-digit random number. He indicates that the coverage will then increase and that about half of the resulting numbers will generally be nonworking.
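Both ways of generating numbers from a listed number can be sketched as follows. Telephone numbers are assumed to be plain seven-digit strings, the function names and the example number are hypothetical, and wrapping around within the last four digits when the sum overflows is an added assumption.

```python
import random


def add_fixed_number(listed_number, increment=10):
    """Add a fixed number to the last four digits of a listed number.

    The leading digits (the exchange) are kept; the last four wrap around
    past 9999 (an assumption, the text does not cover this case).
    """
    prefix, line = listed_number[:-4], int(listed_number[-4:])
    return prefix + f"{(line + increment) % 10000:04d}"


def replace_last_three(listed_number, rng=random):
    """Keep the leading digits and replace the last three with random digits."""
    return listed_number[:-3] + f"{rng.randrange(1000):03d}"


listed = "5550123"  # hypothetical listed number
print(add_fixed_number(listed))    # 5550133
print(replace_last_three(listed))  # 5550 followed by three random digits
```

The second function can reach any of the 1,000 numbers sharing the first four digits, which is why its coverage is broader than that of the fixed increment.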
Another approach is to buy lists from magazines, credit card firms, mail-order firms, or other such sources. One problem is that each such list has its own types of biases.
Stratified Sampling
In simple random sampling, a random sample is taken from a list (or sampling frame) representing the population. Often some information about subgroups within the sampling frame can be used to improve the efficiency of the sample plan, that is, to obtain estimates with the same reliability with a smaller sample size. Reliability refers to the variation in the estimate caused by the fact that a sample is used instead of the whole population.
Suppose information on the attitudes of students toward a proposed new intramural athletic facility is needed. Further, suppose that there are three groups of students in the school—off-campus students, dormitory dwellers, and those living in fraternity and sorority houses. Suppose, further, that those living in fraternities and sororities have very homogeneous attitudes toward the proposed facility—the variation or variance in their attitudes is very small. Assume, also, that the dormitory dwellers are less homogeneous and that the off-campus students vary widely in their opinions. In such a situation, instead of allowing the sample to come from all three groups randomly, it will be more sensible to take fewer members from the fraternity/sorority group and to draw more from the off-campus group. We would separate the student body list into the three groups and draw a simple random sample from each of the three groups.
The sample size of each of the three groups will depend on two factors. First, it will depend on the amount of attitude variation in the group. The larger the variation, the larger the sample. Second, the sample size will tend to be inversely related to the cost of sampling in the group. The smaller the cost, the larger the sample size that can be justified. (Sample-size formulas for stratified sampling are introduced in the next chapter.)
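The two factors can be made concrete with the usual textbook optimum-allocation rule, in which each group's share of the sample is proportional to its size times its standard deviation divided by the square root of its unit cost. This rule is supplied here as an assumption and may differ in detail from the formulas of the next chapter; the group sizes, standard deviations, and costs are invented for illustration.

```python
from math import sqrt


def allocate_sample(strata, total_sample):
    """Split total_sample across strata in proportion to N * sd / sqrt(cost)."""
    weights = {
        name: size * sd / sqrt(cost)
        for name, (size, sd, cost) in strata.items()
    }
    total_weight = sum(weights.values())
    return {name: total_sample * w / total_weight for name, w in weights.items()}


# Hypothetical figures: (students in group, std. dev. of attitude, cost per interview)
strata = {
    "fraternity/sorority": (2000, 1.0, 1.0),
    "dormitory": (2000, 2.0, 1.0),
    "off-campus": (2000, 3.0, 1.0),
}
allocation = allocate_sample(strata, total_sample=300)
for name, n in allocation.items():
    print(f"{name}: {n:.0f}")
# fraternity/sorority: 50
# dormitory: 100
# off-campus: 150
```

With equal group sizes and costs, the sample simply follows the standard deviations, so the most varied group (off-campus) receives the most interviews.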
In developing a sampling plan, it is wise to look for natural subgroups that will be more homogeneous than the total population. Such subgroups are called strata; hence, the term stratified sampling.
Cluster Sampling
In cluster sampling, the population again is divided into subgroups, here termed clusters instead of strata. This time, however, a random sample of subgroups is selected and all members of the subgroups become part of the sample. This method is useful when subgroups can be identified that are representative of the whole population.
Suppose a sample of high-school sophomores who took an English class was needed in a midwestern city. There were 200 English classes, each of which contained a fairly representative sample with respect to student opinions on rock groups, the subject of the study. A cluster sample would select randomly a number of classrooms, say 15, and include all members of those classrooms in the sample. The big advantage of cluster sampling is lower cost. The subgroups or clusters are selected so that the cost of obtaining the desired information within the cluster is much smaller than if a simple random sample were obtained. If the average English class had 30 students, a sample of 450 would be obtained by contacting only 15 classes. If a simple random sample of 450 students across all English classes were obtained, the cost probably would be significantly greater. The big question, of course, is whether the classes are representative of the population. If the classes from the upper-income areas have different opinions about rock groups than classes with more lower-income students, the assumption underlying the approach would not hold.
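The classroom example can be sketched as follows. The roster is invented (every class is given exactly 30 students for simplicity), and the function name is hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, rng=random):
    """Pick n_clusters clusters at random and keep every member of each."""
    chosen_ids = rng.sample(sorted(clusters), n_clusters)
    return [member for cid in chosen_ids for member in clusters[cid]]


# Hypothetical roster: 200 English classes of 30 students each.
classes = {
    c: [f"class{c}-student{s}" for s in range(1, 31)]
    for c in range(1, 201)
}
students = cluster_sample(classes, n_clusters=15, rng=random.Random(42))
print(len(students))  # 450
```

Only 15 classrooms have to be visited, whereas a simple random sample of 450 students would be scattered over most of the 200 classes.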
Multistage Designs
It is often appropriate to use a multistage design in developing a sample. Perhaps the most common example is in the case of area samples, in which a sample is desired of some area such as the United States or the state of California.
Suppose the need was to sample the state of California. The first step would be to develop a cluster sample of counties in the state. Each county would have a probability of being in the cluster sample proportionate to its population. Thus, the largest county, Los Angeles County, would be much more likely to be in the sample than a rural county. The second step would be to obtain a cluster sample of cities from each selected county. Again, each city is selected with a probability proportionate to its size. The third step is to select a cluster sample of blocks from each city, again weighting each block by the number of dwellings in it. Finally, a systematic sample of dwellings from each block is selected, and a random sample of members of each dwelling is obtained. The result is a random sample of the area, in which each dwelling has an equal chance of being in the sample. Note that individuals living alone will have a larger chance of being in the sample than individuals living in dwellings with other people.
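The closing remark about people living alone can be checked with a little arithmetic. The sketch assumes that exactly one member is drawn at random from each selected dwelling and uses an invented dwelling probability of 1 in 1,000; neither figure comes from the text.

```python
from fractions import Fraction


def person_probability(dwelling_probability, household_size):
    """Chance that a given person is drawn when one member per dwelling is chosen."""
    return dwelling_probability * Fraction(1, household_size)


p_dwelling = Fraction(1, 1000)  # hypothetical equal chance for every dwelling
for size in (1, 2, 4):
    print(size, person_probability(p_dwelling, size))
# 1 1/1000
# 2 1/2000
# 4 1/4000
```

Under these assumptions a person living alone is four times as likely to be sampled as a member of a four-person household, which is why analysts often weight responses by household size.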
To see how a cluster sample of cities is drawn so that the probability of each being selected is proportionate to its population, consider the following example. Suppose there are six cities in Ajax County. In Table 11-4, the cities, plus the rural area, are listed together with their population sizes and the "cumulative population." The cumulative population serves to associate each city with a block of numbers equal in size to its population. The total population of Ajax County is 100,000. The task is to select one city from the county, with the selection probability proportionate to the city population. The approach is simply to obtain a random number between 1 and 100,000. Taking the fourth row of Table 11-2 and starting from the right, we get the number 89,701. The selected city would be the one whose block of cumulative-population numbers contains 89,701: Austin. Clearly, the largest city, Filmore, would have the best chance of being drawn (in fact, a 60 percent chance), and Cooper the smallest chance (only 2 percent).
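The cumulative-population lookup can be sketched as below. The place names and populations are made up for illustration and are not the figures of Table 11-4; the function name is likewise hypothetical.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def select_pps(places, random_number):
    """Return the place whose cumulative-population block contains random_number."""
    names = [name for name, _ in places]
    cumulative = list(accumulate(pop for _, pop in places))
    if not 1 <= random_number <= cumulative[-1]:
        raise ValueError("random_number must lie between 1 and the total population")
    # The first cumulative total that is >= random_number marks the block.
    return names[bisect_left(cumulative, random_number)]


# Made-up county of 10,000 people (not the data of Table 11-4).
places = [("Town A", 6000), ("Town B", 3000), ("Town C", 800), ("Rural area", 200)]
total = sum(pop for _, pop in places)
print(select_pps(places, 8971))  # Town B (block 6,001 to 9,000)
print(select_pps(places, random.randint(1, total)))  # a random draw
```

Town A owns 6,000 of the 10,000 numbers, so it is drawn 60 percent of the time, while the rural area, with 200 numbers, is drawn 2 percent of the time.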
The large marketing research firms develop a set of clusters of dwellings after each U.S. census. The clusters may be counties or some other convenient groupings of dwellings. Perhaps 100 to 300 such areas are selected randomly. Each area will have a probability of being selected proportional to the population within its boundaries. This set of clusters then would be used by the marketing research firm for up to 10 years for its national surveys. For each area, data are compiled on blocks and on living units within blocks. For rural areas, these firms hire and train interviewers to be available for subsequent surveys. Respondents from each area are selected on the basis of a sampling scheme such as stratified sampling or on the basis of a multistage scheme.