Tuesday, 29 September 2015

SD and SEM

It is easy to be confused about the difference between the standard deviation (SD) and the standard error of the mean (SEM). Here are the key differences:
The SD quantifies scatter — how much the values vary from one another.
The SEM quantifies how precisely you know the true mean of the population. It takes into account both the value of the SD and the sample size.
Both SD and SEM are in the same units -- the units of the data.
The SEM, by definition, is always smaller than the SD.
The SEM gets smaller as your samples get larger. This makes sense, because the mean of a large sample is likely to be closer to the true population mean than is the mean of a small sample. With a huge sample, you'll know the value of the mean with a lot of precision even if the data are very scattered.
The SD does not change predictably as you acquire more data. The SD you compute from a sample is the best possible estimate of the SD of the overall population. As you collect more data, you'll assess the SD of the population with more precision. But you can't predict whether the SD from a larger sample will be bigger or smaller than the SD from a small sample. (This is not strictly true. It is the variance -- the SD squared -- that doesn't change predictably, but the change in SD is trivial and much much smaller than the change in the SEM.)
Note that standard errors can be computed for almost any parameter you compute from data, not just the mean. The phrase "the standard error" is a bit ambiguous. The points above refer only to the standard error of the mean.
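For concreteness, here is a minimal Python sketch (with made-up data) showing how the two quantities are computed; the only formula involved is SEM = SD / sqrt(n):

    import math
    import statistics

    data = [4.2, 5.1, 6.3, 4.8, 5.6, 5.9, 4.4, 5.2]  # hypothetical measurements

    sd = statistics.stdev(data)        # SD: how much the values scatter
    sem = sd / math.sqrt(len(data))    # SEM: how precisely the mean is known

    print(f"mean = {statistics.mean(data):.2f}, SD = {sd:.2f}, SEM = {sem:.2f}")

Collecting more data shrinks the SEM (the denominator grows) but leaves the SD hovering around the population value, exactly as described above.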
Reference 
http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_semandsdnotsame.htm

One-Way Analysis of Variance (ANOVA) example
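As a minimal illustration of a one-way ANOVA, the sketch below (hypothetical scores for three treatment groups, using SciPy's f_oneway) tests whether the group means differ by more than chance would predict:

    from scipy import stats

    # Hypothetical test scores for three independent treatment groups
    group1 = [85, 86, 88, 75, 78, 94, 98]
    group2 = [91, 92, 93, 85, 87, 84, 82]
    group3 = [79, 78, 88, 94, 92, 85, 83]

    f_stat, p_value = stats.f_oneway(group1, group2, group3)
    print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

A small p value would suggest that at least one group mean differs from the others.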



Reference

ANOVA Example

Monday, 28 September 2015

How Do You Interpret P Values?


In technical terms, a P value is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming the truth of the null hypothesis.
For example, suppose that a vaccine study produced a P value of 0.04. This P value indicates that if the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.
P values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis. This limitation leads us into the next section, which covers a very common misinterpretation of P values.

P Values Are NOT the Probability of Making a Mistake

Incorrect interpretations of P values are very common. The most common mistake is to interpret a P value as the probability of making a mistake by rejecting a true null hypothesis (a Type I error).
There are several reasons why P values can’t be the error rate.
First, P values are calculated based on the assumptions that the null is true for the population and that the difference in the sample is caused entirely by random chance. Consequently, P values can’t tell you the probability that the null is true or false because it is 100% true from the perspective of the calculations.
Second, while a low P value indicates that your data are unlikely assuming a true null, it can’t evaluate which of two competing cases is more likely:
  • The null is true but your sample was unusual.
  • The null is false.
Determining which case is more likely requires subject area knowledge and replicate studies.
Let’s go back to the vaccine study and compare the correct and incorrect way to interpret the P value of 0.04:
  • Correct: Assuming that the vaccine had no effect, you’d obtain the observed difference or more in 4% of studies due to random sampling error.
     
  • Incorrect: If you reject the null hypothesis, there’s a 4% chance that you’re making a mistake.
To see a graphical representation of how hypothesis tests work, see my post: Understanding Hypothesis Tests: Significance Levels and P Values.
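To make the definition concrete, here is a small simulation sketch in Python (all numbers are hypothetical, not from the study above): assuming the null hypothesis is true, it counts how often random sampling alone produces a difference at least as extreme as the observed one.

    import random

    random.seed(42)
    n = 500               # hypothetical participants per study arm
    p_null = 0.20         # hypothetical cold rate when the vaccine does nothing
    observed_diff = 0.05  # hypothetical observed difference in cold rates

    trials = 5000
    extreme = 0
    for _ in range(trials):
        placebo = sum(random.random() < p_null for _ in range(n)) / n
        vaccine = sum(random.random() < p_null for _ in range(n)) / n
        # "at least as extreme" as the observed difference, in either direction
        if abs(placebo - vaccine) >= observed_diff:
            extreme += 1

    print("simulated P value:", extreme / trials)

The result is exactly the quantity the correct interpretation describes: the share of null-hypothesis studies that show the observed difference or more.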
Reference
http://blog.minitab.com/blog/adventures-in-statistics/how-to-correctly-interpret-p-values

Experimental design

An Experimental Design Example

Consider the following hypothetical experiment. Acme Medicine is conducting an experiment to test a new vaccine, developed to immunize people against the common cold. To test the vaccine, Acme has 1000 volunteers - 500 men and 500 women. The participants range in age from 21 to 70.
In this lesson, we describe three experimental designs - a completely randomized design, a randomized block design, and a matched pairs design. And we show how each design might be applied by Acme Medicine to understand the effect of the vaccine, while ruling out confounding effects of other factors.

Completely Randomized Design

The completely randomized design is probably the simplest experimental design, in terms of data analysis and convenience. With this design, participants are randomly assigned to treatments.
Treatment        Placebo    Vaccine
Participants         500        500
A completely randomized design layout for the Acme experiment is shown in the table above. In this design, the experimenter randomly assigned participants to one of two treatment conditions: they received either a placebo or the vaccine. The same number of participants (500) was assigned to each treatment condition (although this is not required). The dependent variable is the number of colds reported in each treatment condition. If the vaccine is effective, participants in the "vaccine" condition should report significantly fewer colds than participants in the "placebo" condition.
A completely randomized design relies on randomization to control for the effects of extraneous variables. The experimenter assumes that, on average, extraneous factors will affect treatment conditions equally; so any significant differences between conditions can fairly be attributed to the independent variable.
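Random assignment itself is simple to carry out; here is a minimal Python sketch for the Acme layout (participant labels are invented):

    import random

    participants = list(range(1, 1001))   # the 1000 Acme volunteers
    random.shuffle(participants)

    vaccine_group = participants[:500]    # first half after shuffling
    placebo_group = participants[500:]    # second half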

Randomized Block Design

With a randomized block design, the experimenter divides participants into subgroups called blocks, such that the variability within blocks is less than the variability between blocks. Then, participants within each block are randomly assigned to treatment conditions. Because this design reduces variability and potential confounding, it produces a better estimate of treatment effects.
Gender      Placebo    Vaccine
Male            250        250
Female          250        250
The table above shows a randomized block design for the Acme experiment. Participants are assigned to blocks based on gender. Then, within each block, participants are randomly assigned to treatments. For this design, 250 men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women get the vaccine.
It is known that men and women are physiologically different and react differently to medication. This design ensures that each treatment condition has an equal proportion of men and women. As a result, differences between treatment conditions cannot be attributed to gender. This randomized block design removes gender as a potential source of variability and as a potential confounding variable.
In this Acme example, the randomized block design is an improvement over the completely randomized design. Both designs use randomization to implicitly guard against confounding. But only the randomized block design explicitly controls for gender.
Note 1: In some blocking designs, individual participants may receive multiple treatments. This is called using the participant as his own control. Using the participant as his own control is desirable in some experiments (e.g., research on learning or fatigue). But it can also be a problem (e.g., medical studies where the medicine used in one treatment might interact with the medicine used in another treatment).
Note 2: Blocks perform a similar function in experimental design as strata perform in sampling. Both divide observations into subgroups. However, they are not the same. Blocking is associated with experimental design, and stratification is associated with survey sampling.
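A minimal sketch of the blocked randomization described above (hypothetical participant IDs), which shuffles within each gender block separately:

    import random

    def randomize_block(block):
        """Randomly split one block 50/50 between placebo and vaccine."""
        random.shuffle(block)
        half = len(block) // 2
        return block[:half], block[half:]

    men = [f"M{i}" for i in range(500)]
    women = [f"W{i}" for i in range(500)]

    men_placebo, men_vaccine = randomize_block(men)
    women_placebo, women_vaccine = randomize_block(women)

Because each block is split evenly, gender cannot differ between the treatment conditions.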

Matched Pairs Design

Pair        Placebo    Vaccine
1                 1          1
2                 1          1
...             ...        ...
499               1          1
500               1          1
A matched pairs design is a special case of the randomized block design. It is used when the experiment has only two treatment conditions and participants can be grouped into pairs based on some blocking variable. Then, within each pair, participants are randomly assigned to different treatments.
The table above shows a matched pairs design for the Acme experiment. The 1000 participants are grouped into 500 matched pairs. Each pair is matched on gender and age. For example, Pair 1 might be two women, both age 21. Pair 2 might be two women, both age 22, and so on.
For the Acme example, the matched pairs design is an improvement over the completely randomized design and the randomized block design. Like the other designs, the matched pairs design uses randomization to control for confounding. However, unlike the others, this design explicitly controls for two potential lurking variables - age and gender.
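One simple way to sketch the pairing step in code (simulated participants; sorting on the matching variables is just one illustrative way to form pairs):

    import random

    random.seed(7)
    # Simulated participants: (id, gender, age); sorting puts similar people together
    people = [(i, random.choice("MF"), random.randint(21, 70)) for i in range(1000)]
    people.sort(key=lambda p: (p[1], p[2]))

    assignments = {}
    for i in range(0, len(people), 2):
        a, b = people[i], people[i + 1]
        treatments = ["placebo", "vaccine"]
        random.shuffle(treatments)        # coin flip within each matched pair
        assignments[a[0]] = treatments[0]
        assignments[b[0]] = treatments[1]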

Reference

http://stattrek.com/experiments/experimental-design.aspx?Tutorial=AP

Wednesday, 23 September 2015

Confidence in statistics

The confidence interval is the plus-or-minus figure usually reported in newspaper or television opinion poll results. For example, if you use a confidence interval of 4 and 47% of your sample picks an answer, you can be "sure" that if you had asked the question of the entire relevant population, between 43% (47-4) and 51% (47+4) would have picked that answer.

The confidence level tells you how sure you can be. It is expressed as a percentage and represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level.

When you put the confidence level and the confidence interval together, you can say that you are 95% sure that the true percentage of the population is between 43% and 51%.

The wider the confidence interval you are willing to accept, the more certain you can be that the whole population answers would be within that range. For example, if you asked a sample of 1000 people in a city which brand of cola they preferred, and 60% said Brand A, you can be very certain that between 40 and 80% of all the people in the city actually do prefer that brand, but you cannot be so sure that between 59 and 61% of the people in the city prefer the brand. 

Factors that Affect Confidence Intervals 
There are three factors that determine the size of the confidence interval for a given confidence level. These are: sample size, percentage, and population size.

Sample Size 
The larger your sample, the more sure you can be that their answers truly reflect the population. This indicates that for a given confidence level, the larger your sample size, the smaller your confidence interval. However, the relationship is not linear (i.e., doubling the sample size does not halve the confidence interval).

Percentage 
Your accuracy also depends on the percentage of your sample that picks a particular answer. If 99% of your sample said "Yes" and 1% said "No" the chances of error are remote, irrespective of sample size. However, if the percentages are 51% and 49% the chances of error are much greater. It is easier to be sure of extreme answers than of middle-of-the-road ones.

When determining the sample size needed for a given level of accuracy you must use the worst case percentage (50%). You should also use this percentage if you want to determine a general level of accuracy for a sample you already have. To determine the confidence interval for a specific answer your sample has given, you can use the percentage picking that answer and get a smaller interval.
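The formula behind these statements is the margin of error for a proportion, z * sqrt(p(1-p)/n). A small Python sketch (assuming a 95% confidence level and a hypothetical sample of 1,000) shows why 50% is the worst case:

    import math

    def margin_of_error(p, n, z=1.96):
        """95% margin of error for a proportion from a simple random sample."""
        return z * math.sqrt(p * (1 - p) / n)

    n = 1000
    print(f"middle-of-the-road answer (p=0.50): +/- {100 * margin_of_error(0.50, n):.1f} points")
    print(f"extreme answer (p=0.99): +/- {100 * margin_of_error(0.99, n):.1f} points")

With n = 1000, the interval is about +/- 3.1 points at 50% but only about +/- 0.6 points at 99%.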

Population Size 
How many people are there in the group your sample represents? This may be the number of people in a city you are studying, the number of people who buy new cars, etc. Often you may not know the exact population size. This is not a problem. The mathematics of probability proves that the size of the population is irrelevant unless the size of the sample exceeds a few percent of the total population you are examining. This means that a sample of 500 people is equally useful for examining the opinions of a state of 15,000,000 as it is for a city of 100,000. For this reason, the sample calculator ignores the population size when it is "large" or unknown. Population size is only likely to be a factor when you work with a relatively small and known group of people.

Note: 
The confidence interval calculations assume you have a genuine random sample of the relevant population. If your sample is not truly random, you cannot rely on the intervals. Non-random samples usually result from some flaw in the sampling procedure.


Reference 
http://www.gifted.uconn.edu/siegle/research/samples/confidenceinterval.htm

Friday, 18 September 2015

Descriptive vs. Inferential Statistics

Statistical procedures can be divided into two major categories: descriptive statistics and inferential statistics.
Before discussing the differences between descriptive and inferential statistics, we must first be familiar with two important concepts in social science statistics: population and sample. A population is the total set of individuals, groups, objects, or events that the researcher is studying.
For example, if we were studying employment patterns of recent U.S. college graduates, our population would likely be defined as every college student who graduated within the past year from any college across the United States.
A sample is a relatively small subset of people, objects, groups, or events, that is selected from the population. Instead of surveying every recent college graduate in the United States, which would cost a great deal of time and money, we could instead select a sample of recent graduates, which would then be used to generalize the findings to the larger population.
Descriptive statistics includes statistical procedures that we use to describe the population we are studying. The data could be collected from either a sample or a population, but the results help us organize and describe data. Descriptive statistics can only be used to describe the group that is being studied. That is, the results cannot be generalized to any larger group.
Descriptive statistics are useful and serviceable if you do not need to extend your results to any larger group. However, much of social sciences tend to include studies that give us “universal” truths about segments of the population, such as all parents, all women, all victims, etc.
Frequency distributions, measures of central tendency (mean, median, and mode), and graphs like pie charts and bar charts that describe the data are all examples of descriptive statistics.
Inferential Statistics
Inferential statistics is concerned with making predictions or inferences about a population from observations and analyses of a sample. That is, we can take the results of an analysis using a sample and can generalize it to the larger population that the sample represents. In order to do this, however, it is imperative that the sample is representative of the group to which it is being generalized.
To address this issue of generalization, we have tests of significance. A Chi-square or T-test, for example, can tell us the probability that the results of our analysis on the sample are representative of the population that the sample represents. In other words, these tests of significance tell us the probability that the results of the analysis could have occurred by chance when there is no relationship at all between the variables we studied in the population we studied.
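A minimal sketch of the contrast (hypothetical data; SciPy's one-sample t-test stands in for the tests of significance mentioned above):

    import statistics
    from scipy import stats

    sample = [52, 61, 48, 57, 55, 59, 63, 50, 54, 58]  # hypothetical sample

    # Descriptive: summarize only the group actually measured
    print("mean:", statistics.mean(sample), "median:", statistics.median(sample))

    # Inferential: ask whether the population mean could plausibly be 50
    t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")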

Reference
http://sociology.about.com/od/Statistics/a/Descriptive-inferential-statistics.htm

Population Parameter

Parameter

Trent D. Buskirk
A parameter is a numerical quantity or attribute of a population that is estimated using data collected from the population. Parameters are to populations as statistics are to samples. For example, in survey research, the true proportion of voters who vote for a presidential candidate in the next national election may be of interest. Such a parameter may be estimated using a sample proportion computed from data gathered via a probability sample of registered voters. Or, the actual annual average household "out-of-pocket" medical expenses for a given year (parameter) could be estimated from data provided by the Medical Expenditures Survey. Or, the modal race of students within a particular school is an example of an attribute parameter that could be estimated using data acquired via a cluster sample of classrooms.

Population Parameter

Sunghee Lee
Population parameters, also termed population characteristics, are numerical expressions summarizing various aspects of the entire population. One common example is the population mean, Ȳ = (1/N) Σ Yi, where Yi is some characteristic of interest observed from element i in the population of size N. Means, medians, proportions, and totals may be classified as descriptive parameters, while there are parameters measuring relationships, such as differences in descriptive parameters, correlation, and regression coefficients. Although population parameters are sometimes considered unobservable, they are taken to be fixed and potentially measurable quantities in survey statistics. This is because sampling statistics are developed for well-specified finite populations that social science studies attempt to examine, and because the population parameters depend on all elements in the population, which are fixed before any sort of data collection.
Reference
https://srmo.sagepub.com/view/encyclopedia-of-survey-research-methods/n386.xml
http://srmo.sagepub.com/view/encyclopedia-of-survey-research-methods/n370.xml

PROBABILITY VS. NON PROBABILITY SAMPLING

Sampling can be a confusing concept for managers carrying out survey research projects. By knowing some basic information about survey sampling designs and how they differ, you can understand the advantages and disadvantages of various approaches.

The two main methods used in survey research are probability sampling and nonprobability sampling. The big difference is that in probability sampling all persons have a chance of being selected, and results are more likely to accurately reflect the entire population. While it would always be nice to have a probability-based sample, other factors need to be considered (availability, cost, time, what you want to say about results). Some additional characteristics of the two methods are listed below.

Probability Sampling

• You have a complete sampling frame. You have contact information for the entire population.

• You can select a random sample from your population. Since all persons (or “units”) have an equal chance of being selected for your survey, you can randomly select participants without missing entire portions of your audience.

• You can generalize your results from a random sample. With this data collection method and a decent response rate, you can extrapolate your results to the entire population.

• Can be more expensive and time-consuming than convenience or purposive sampling.

Nonprobability Sampling 

• Used when there isn't an exhaustive population list available. Some units cannot be selected, so you have no way of knowing the size and effect of sampling error (missed persons, unequal representation, etc.).

• Not random. 

• Can be effective when trying to generate ideas and getting feedback, but you cannot generalize your results to an entire population with a high level of confidence. Quota samples (males and females, etc.) are an example.

• More convenient and less costly, but doesn't hold up to the expectations of probability theory.
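For contrast, here is what the probability-sampling case looks like when a complete frame does exist (a minimal Python sketch with an invented frame):

    import random

    # A complete sampling frame: contact information for the entire population
    frame = [f"person_{i}" for i in range(10000)]

    sample = random.sample(frame, k=500)  # every member has an equal chance

No such one-liner exists for nonprobability sampling, because there is no frame to draw from.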


Reference
http://survey.cvent.com/blog/market-research-design-tips-2/sampling-demystified-probability-vs-nonprobability-sampling

Random Error and Systematic Error

Definitions

All experimental uncertainty is due to either random errors or systematic errors. Random errors are statistical fluctuations (in either direction) in the measured data due to the precision limitations of the measurement device. Random errors usually result from the experimenter's inability to take the same measurement in exactly the same way and get exactly the same number. Systematic errors, by contrast, are reproducible inaccuracies that are consistently in the same direction. Systematic errors are often due to a problem which persists throughout the entire experiment.
Note that systematic and random errors refer to problems associated with making measurements. Mistakes made in the calculations or in reading the instrument are not considered in error analysis. It is assumed that the experimenters are careful and competent!

How to minimize experimental error: some examples


Random errors
Example: You measure the mass of a ring three times using the same balance and get slightly different values: 17.46 g, 17.42 g, 17.44 g.
How to minimize it: Take more data. Random errors can be evaluated through statistical analysis and can be reduced by averaging over a large number of observations.

Systematic errors
Examples: The cloth tape measure that you use to measure the length of an object had been stretched out from years of use (as a result, all of your length measurements were too small). The electronic scale you use reads 0.05 g too high for all your mass measurements (because it is improperly tared throughout your experiment).
How to minimize it: Systematic errors are difficult to detect and cannot be analyzed statistically, because all of the data are off in the same direction (either too high or too low). Spotting and correcting for systematic error takes a lot of care.
  • How would you compensate for the incorrect results of using the stretched out tape measure?
  • How would you correct the measurements from improperly tared scale?
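Averaging treats the two error types very differently, which a quick simulation can show (all values hypothetical):

    import random
    import statistics

    random.seed(0)
    true_mass = 17.44  # grams

    # Random error: zero-mean noise; averaging many readings approaches the truth
    readings = [true_mass + random.gauss(0, 0.02) for _ in range(100)]
    print(f"mean of noisy readings:  {statistics.mean(readings):.3f} g")

    # Systematic error: a constant +0.05 g offset; no amount of averaging removes it
    biased = [r + 0.05 for r in readings]
    print(f"mean of biased readings: {statistics.mean(biased):.3f} g")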

Reference
https://www2.southeastern.edu/Academics/Faculty/rallain/plab193/labinfo/Error_Analysis/05_Random_vs_Systematic.html

Thursday, 17 September 2015

Nonprobability Sampling

The difference between nonprobability and probability sampling is that nonprobability sampling does not involve random selection and probability sampling does. Does that mean that nonprobability samples aren't representative of the population? Not necessarily. But it does mean that nonprobability samples cannot depend upon the rationale of probability theory. At least with a probabilistic sample, we know the odds or probability that we have represented the population well. We are able to estimate confidence intervals for the statistic. With nonprobability samples, we may or may not represent the population well, and it will often be hard for us to know how well we've done so. In general, researchers prefer probabilistic or random sampling methods over nonprobabilistic ones, and consider them to be more accurate and rigorous. However, in applied social research there may be circumstances where it is not feasible, practical or theoretically sensible to do random sampling. Here, we consider a wide range of nonprobabilistic alternatives.
We can divide nonprobability sampling methods into two broad types: accidental or purposive. Most sampling methods are purposive in nature because we usually approach the sampling problem with a specific plan in mind. The most important distinctions among these types of sampling methods are the ones between the different types of purposive sampling approaches.

Accidental, Haphazard or Convenience Sampling

One of the most common methods of sampling goes under the various titles listed here. I would include in this category the traditional "man on the street" (of course, now it's probably the "person on the street") interviews conducted frequently by television news programs to get a quick (although nonrepresentative) reading of public opinion. I would also argue that the typical use of college students in much psychological research is primarily a matter of convenience. (You don't really believe that psychologists use college students because they believe they're representative of the population at large, do you?). In clinical practice, we might use clients who are available to us as our sample. In many research contexts, we sample simply by asking for volunteers. Clearly, the problem with all of these types of samples is that we have no evidence that they are representative of the populations we're interested in generalizing to -- and in many cases we would clearly suspect that they are not.


Purposive Sampling

In purposive sampling, we sample with a purpose in mind. We usually would have one or more specific predefined groups we are seeking. For instance, have you ever run into people in a mall or on the street who are carrying a clipboard and who are stopping various people and asking if they could interview them? Most likely they are conducting a purposive sample (and most likely they are engaged in market research). They might be looking for Caucasian females between 30-40 years old. They size up the people passing by and anyone who looks to be in that category they stop to ask if they will participate. One of the first things they're likely to do is verify that the respondent does in fact meet the criteria for being in the sample. Purposive sampling can be very useful for situations where you need to reach a targeted sample quickly and where sampling for proportionality is not the primary concern. With a purposive sample, you are likely to get the opinions of your target population, but you are also likely to overweight subgroups in your population that are more readily accessible.
All of the methods that follow can be considered subcategories of purposive sampling methods. We might sample for specific groups or types of people as in modal instance, expert, or quota sampling. We might sample for diversity as in heterogeneity sampling. Or, we might capitalize on informal social networks to identify specific respondents who are hard to locate otherwise, as in snowball sampling. In all of these methods we know what we want -- we are sampling with a purpose.
  • Modal Instance Sampling
In statistics, the mode is the most frequently occurring value in a distribution. In sampling, when we do a modal instance sample, we are sampling the most frequent case, or the "typical" case. In a lot of informal public opinion polls, for instance, they interview a "typical" voter. There are a number of problems with this sampling approach. First, how do we know what the "typical" or "modal" case is? We could say that the modal voter is a person who is of average age, educational level, and income in the population. But, it's not clear that using the averages of these is the fairest (consider the skewed distribution of income, for instance). And, how do you know that those three variables -- age, education, income -- are the only or even the most relevant for classifying the typical voter? What if religion or ethnicity is an important discriminator? Clearly, modal instance sampling is only sensible for informal sampling contexts.
  • Expert Sampling
Expert sampling involves the assembling of a sample of persons with known or demonstrable experience and expertise in some area. Often, we convene such a sample under the auspices of a "panel of experts." There are actually two reasons you might do expert sampling. First, because it would be the best way to elicit the views of persons who have specific expertise. In this case, expert sampling is essentially just a specific subcase of purposive sampling. But the other reason you might use expert sampling is to provide evidence for the validity of another sampling approach you've chosen. For instance, let's say you do modal instance sampling and are concerned that the criteria you used for defining the modal instance are subject to criticism. You might convene an expert panel consisting of persons with acknowledged experience and insight into that field or topic and ask them to examine your modal definitions and comment on their appropriateness and validity. The advantage of doing this is that you aren't out on your own trying to defend your decisions -- you have some acknowledged experts to back you. The disadvantage is that even the experts can be, and often are, wrong.
  • Quota Sampling
In quota sampling, you select people nonrandomly according to some fixed quota. There are two types of quota sampling: proportional and nonproportional. In proportional quota sampling you want to represent the major characteristics of the population by sampling a proportional amount of each. For instance, if you know the population has 40% women and 60% men, and that you want a total sample size of 100, you will continue sampling until you get those percentages and then you will stop. So, if you've already got the 40 women for your sample, but not the sixty men, you will continue to sample men, but even if legitimate women respondents come along, you will not sample them because you have already "met your quota" (see the sketch after this list). The problem here (as in much purposive sampling) is that you have to decide the specific characteristics on which you will base the quota. Will it be by gender, age, education, race, religion, etc.?
Nonproportional quota sampling is a bit less restrictive. In this method, you specify the minimum number of sampled units you want in each category. Here, you're not concerned with having numbers that match the proportions in the population. Instead, you simply want to have enough to assure that you will be able to talk about even small groups in the population. This method is the nonprobabilistic analogue of stratified random sampling in that it is typically used to assure that smaller groups are adequately represented in your sample.
  • Heterogeneity Sampling
We sample for heterogeneity when we want to include all opinions or views, and we aren't concerned about representing these views proportionately. Another term for this is sampling for diversity. In many brainstorming or nominal group processes (including concept mapping), we would use some form of heterogeneity sampling because our primary interest is in getting a broad spectrum of ideas, not identifying the "average" or "modal instance" ones. In effect, what we would like to be sampling is not people, but ideas. We imagine that there is a universe of all possible ideas relevant to some topic and that we want to sample this population, not the population of people who have the ideas. Clearly, in order to get all of the ideas, and especially the "outlier" or unusual ones, we have to include a broad and diverse range of participants. Heterogeneity sampling is, in this sense, almost the opposite of modal instance sampling.
  • Snowball Sampling
In snowball sampling, you begin by identifying someone who meets the criteria for inclusion in your study. You then ask them to recommend others who they may know who also meet the criteria. Although this method would hardly lead to representative samples, there are times when it may be the best method available. Snowball sampling is especially useful when you are trying to reach populations that are inaccessible or hard to find. For instance, if you are studying the homeless, you are not likely to be able to find good lists of homeless people within a specific geographical area. However, if you go to that area and identify one or two, you may find that they know very well who the other homeless people in their vicinity are and how you can find them.
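As promised above, here is a minimal sketch of a proportional quota-sampling loop (using the 40/60 gender quota from the example; passers-by are simulated with coin flips):

    import random

    random.seed(3)
    quota = {"F": 40, "M": 60}   # target counts for a sample of 100
    counts = {"F": 0, "M": 0}
    sample = []

    while len(sample) < 100:
        passerby = random.choice("FM")      # stand-in for the next person who walks up
        if counts[passerby] < quota[passerby]:
            sample.append(passerby)
            counts[passerby] += 1           # once a quota is met, that group is skipped

Note the defining feature: selection stops depending on who has already been sampled, not on random selection from a frame.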
Reference

http://www.socialresearchmethods.net/kb/sampnon.php

Randomized Block Designs

The Randomized Block Design is research design's equivalent to stratified random sampling. Like stratified sampling, randomized block designs are constructed to reduce noise or variance in the data (see Classifying the Experimental Designs). How do they do it? They require that the researcher divide the sample into relatively homogeneous subgroups or blocks (analogous to "strata" in stratified sampling). Then, the experimental design you want to implement is implemented within each block or homogeneous subgroup. The key idea is that the variability within each block is less than the variability of the entire sample. Thus each estimate of the treatment effect within a block is more efficient than estimates across the entire sample. And, when we pool these more efficient estimates across blocks, we should get an overall more efficient estimate than we would without blocking.
Here, we can see a simple example. Let's assume that we originally intended to conduct a simple posttest-only randomized experimental design. But, we recognize that our sample has several intact or homogeneous subgroups. For instance, in a study of college students, we might expect that students are relatively homogeneous with respect to class or year. So, we decide to block the sample into four groups: freshman, sophomore, junior, and senior. If our hunch is correct, that the variability within class is less than the variability for the entire sample, we will probably get more powerful estimates of the treatment effect within each block (see the discussion on Statistical Power). Within each of our four blocks, we would implement the simple post-only randomized experiment.
Notice a couple of things about this strategy. First, to an external observer, it may not be apparent that you are blocking. You would be implementing the same design in each block. And, there is no reason that the people in different blocks need to be segregated or separated from each other. In other words, blocking doesn't necessarily affect anything that you do with the research participants. Instead, blocking is a strategy for grouping people in your data analysis in order to reduce noise -- it is an analysis strategy. Second, you will only benefit from a blocking design if you are correct in your hunch that the blocks are more homogeneous than the entire sample is. If you are wrong -- if different college-level classes aren't relatively homogeneous with respect to your measures -- you will actually be hurt by blocking (you'll get a less powerful estimate of the treatment effect). How do you know if blocking is a good idea? You need to consider carefully whether the groups are relatively homogeneous. If you are measuring political attitudes, for instance, is it reasonable to believe that freshmen are more like each other than they are like sophomores or juniors? Would they be more homogeneous with respect to measures related to drug abuse? Ultimately the decision to block involves judgment on the part of the researcher.

How Blocking Reduces Noise

So how does blocking work to reduce noise in the data? To see how it works, you have to begin by thinking about the non-blocked study. The figure shows the pretest-posttest distribution for a hypothetical pre-post randomized experimental design. We use the 'X' symbol to indicate a program group case and the 'O' symbol for a comparison group member. You can see that for any specific pretest value, the program group tends to outscore the comparison group by about 10 points on the posttest. That is, there is about a 10-point posttest mean difference.
Now, let's consider an example where we divide the sample into three relatively homogeneous blocks. To see what happens graphically, we'll use the pretest measure to block. This will assure that the groups are very homogeneous. Let's look at what is happening within the third block. Notice that the mean difference is still the same as it was for the entire sample -- about 10 points within each block. But also notice that the variability of the posttest is much less than it was for the entire sample. Remember that the treatment effect estimate is a signal-to-noise ratio. The signal in this case is the mean difference. The noise is the variability. The two figures show that we haven't changed the signal in moving to blocking -- there is still about a 10-point posttest difference. But, we have changed the noise -- the variability on the posttest is much smaller within each block than it is for the entire sample. So, the treatment effect will have less noise for the same signal.
It should be clear from the graphs that the blocking design in this case will yield the stronger treatment effect. But this is true only because we did a good job assuring that the blocks were homogeneous. If the blocks weren't homogeneous -- their variability was as large as the entire sample's -- we would actually get worse estimates than in the simple randomized experimental case.
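A small simulation (hypothetical numbers throughout) makes the signal-to-noise point concrete: blocking on the pretest leaves the roughly 10-point signal alone but shrinks the posttest variability within each block.

    import random
    import statistics

    random.seed(1)
    # Hypothetical pre-post study: posttest = pretest + 10-point program effect + noise
    people = [{"pre": random.uniform(20, 80), "program": i % 2 == 0}
              for i in range(600)]
    for p in people:
        p["post"] = p["pre"] + (10 if p["program"] else 0) + random.gauss(0, 5)

    everyone = [p["post"] for p in people]
    print(f"posttest SD, whole sample: {statistics.stdev(everyone):.1f}")

    # Block on the pretest: variability within the middle block is much smaller
    block = [p["post"] for p in people if 40 <= p["pre"] < 60]
    print(f"posttest SD, middle block: {statistics.stdev(block):.1f}")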
Reference
http://www.socialresearchmethods.net/kb/expblock.php

Factorial Design

Probably the easiest way to begin understanding factorial designs is by looking at an example. Let's imagine a design where we have an educational program where we would like to look at a variety of program variations to see which works best. For instance, we would like to vary the amount of time the children receive instruction with one group getting 1 hour of instruction per week and another getting 4 hours per week. And, we'd like to vary the setting with one group getting the instruction in-class (probably pulled off into a corner of the classroom) and the other group being pulled-out of the classroom for instruction in another room. We could think about having four separate groups to do this, but when we are varying the amount of time in instruction, what setting would we use: in-class or pull-out? And, when we were studying setting, what amount of instruction time would we use: 1 hour, 4 hours, or something else?
With factorial designs, we don't have to compromise when answering these questions. We can have it both ways if we cross each of our two time in instruction conditions with each of our two settings. Let's begin by doing some defining of terms. In factorial designs, a factor is a major independent variable. In this example we have two factors: time in instruction and setting. A level is a subdivision of a factor. In this example, time in instruction has two levels and setting has two levels. Sometimes we depict a factorial design with a numbering notation. In this example, we can say that we have a 2 x 2 (spoken "two-by-two") factorial design. In this notation, the number of numbers tells you how many factors there are and the number values tell you how many levels. If I said I had a 3 x 4 factorial design, you would know that I had 2 factors and that one factor had 3 levels while the other had 4. Order of the numbers makes no difference and we could just as easily term this a 4 x 3 factorial design. The number of different treatment groups that we have in any factorial design can easily be determined by multiplying through the number notation. For instance, in our example we have 2 x 2 = 4 groups. In our notational example, we would need 3 x 4 = 12 groups.
We can also depict a factorial design in design notation. Because of the treatment level combinations, it is useful to use subscripts on the treatment (X) symbol. We can see in the figure that there are four groups, one for each combination of levels of factors. It is also immediately apparent that the groups were randomly assigned and that this is a posttest-only design.
Now, let's look at a variety of different results we might get from this simple 2 x 2 factorial design. Each of the following figures describes a different possible outcome. And each outcome is shown in table form (the 2 x 2 table with the row and column averages) and in graphic form (with each factor taking a turn on the horizontal axis). You should convince yourself that the information in the tables agrees with the information in both of the graphs. You should also convince yourself that the pair of graphs in each figure show the exact same information graphed in two different ways. The lines that are shown in the graphs are technically not necessary -- they are used as a visual aid to enable you to easily track where the averages for a single level go across levels of another factor. Keep in mind that the values shown in the tables and graphs are group averages on the outcome variable of interest. In this example, the outcome might be a test of achievement in the subject being taught. We will assume that scores on this test range from 1 to 10 with higher values indicating greater achievement. You should study carefully the outcomes in each figure in order to understand the differences between these cases.

The Null Outcome

Let's begin by looking at the "null" case. The null case is a situation where the treatments have no effect. This figure assumes that even if we didn't give the training we could expect that students would score a 5 on average on the outcome test. You can see in this hypothetical case that all four groups score an average of 5 and therefore the row and column averages must be 5. You can't see the lines for both levels in the graphs because one line falls right on top of the other.

The Main Effects

A main effect is an outcome that is a consistent difference between levels of a factor. For instance, we would say there's a main effect for setting if we find a statistical difference between the averages for the in-class and pull-out groups, at all levels of time in instruction. The first figure depicts a main effect of time. For all settings, the 4 hour/week condition worked better than the 1 hour/week one. It is also possible to have a main effect for setting (and none for time).
 In the second main effect graph we see that in-class training was better than pull-out training for all amounts of time.
 Finally, it is possible to have a main effect on both variables simultaneously as depicted in the third main effect figure. In this instance 4 hours/week always works better than 1 hour/week and in-class setting always works better than pull-out.

Interaction Effects

If we could only look at main effects, factorial designs would be useful. But, because of the way we combine levels in factorial designs, they also enable us to examine the interaction effects that exist between factors. An interaction effect exists when differences on one factor depend on the level you are on another factor. It's important to recognize that an interaction is between factors, not levels. We wouldn't say there's an interaction between 4 hours/week and in-class treatment. Instead, we would say that there's an interaction between time and setting, and then we would go on to describe the specific levels involved.
How do you know if there is an interaction in a factorial design? There are three ways you can determine there's an interaction. First, when you run the statistical analysis, the statistical table will report on all main effects and interactions. Second, you know there's an interaction when you can't talk about the effect of one factor without mentioning the other factor. If you can say at the end of your study that time in instruction makes a difference, then you know that you have a main effect and not an interaction (because you did not have to mention the setting factor when describing the results for time). On the other hand, when you have an interaction it is impossible to describe your results accurately without mentioning both factors. Finally, you can always spot an interaction in the graphs of group means -- whenever there are lines that are not parallel there is an interaction present! If you check out the main effect graphs above, you will notice that all of the lines within a graph are parallel. In contrast, for all of the interaction graphs, you will see that the lines are not parallel.
In the first interaction effect graph, we see that one combination of levels -- 4 hours/week and in-class setting -- does better than the other three. In the second interaction we have a more complex "cross-over" interaction. Here, at 1 hour/week the pull-out group does better than the in-class group, while at 4 hours/week the reverse is true. Furthermore, both of these combinations of levels do equally well.
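Because an interaction is a "difference of differences," it can be read straight off the table of group means. A minimal sketch with made-up means matching the first interaction example:

    # Hypothetical group means from the 2 x 2 example (achievement, 1-10 scale)
    means = {
        ("1 hr", "in-class"): 5,
        ("1 hr", "pull-out"): 5,
        ("4 hr", "in-class"): 7,
        ("4 hr", "pull-out"): 5,
    }

    # Effect of time within each setting, then the difference of those effects
    time_effect_in_class = means[("4 hr", "in-class")] - means[("1 hr", "in-class")]
    time_effect_pull_out = means[("4 hr", "pull-out")] - means[("1 hr", "pull-out")]
    interaction = time_effect_in_class - time_effect_pull_out

    print("interaction =", interaction)  # nonzero: the plotted lines would not be parallel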

Summary

Factorial design has several important features. First, it has great flexibility for exploring or enhancing the “signal” (treatment) in our studies. Whenever we are interested in examining treatment variations, factorial designs should be strong candidates as the designs of choice. Second, factorial designs are efficient. Instead of conducting a series of independent studies we are effectively able to combine these studies into one. Finally, factorial designs are the only effective way to examine interaction effects.
Reference
 http://www.socialresearchmethods.net/kb/expfact.php

Tuesday, 15 September 2015

Experimentation

An experiment deliberately imposes a treatment on a group of objects or subjects in the interest of observing the response. This differs from an observational study, which involves collecting and analyzing data without changing existing conditions. Because the validity of an experiment is directly affected by its construction and execution, attention to experimental design is extremely important.


Treatment

In experiments, a treatment is something that researchers administer to experimental units. For example, a corn field is divided into four parts, and each part is 'treated' with a different fertiliser to see which produces the most corn; a teacher practices different teaching methods on different groups in her class to see which yields the best results; a doctor treats a patient with a skin condition with different creams to see which is most effective. Treatments are administered to experimental units by 'level', where level implies amount or magnitude. For example, if the experimental units were given 5mg, 10mg, 15mg of a medication, those amounts would be three levels of the treatment.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)


Factor

A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter.
A factor is a general type or category of treatments. Different treatments constitute different levels of a factor. For example, three different groups of runners are subjected to different training methods. The runners are the experimental units and the training methods are the treatments, where the three types of training methods constitute three levels of the factor 'type of training'.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)


Experimental Design

We are concerned with the analysis of data generated from an experiment. It is wise to take time and effort to organize the experiment properly to ensure that the right type of data, and enough of it, is available to answer the questions of interest as clearly and efficiently as possible. This process is called experimental design. The specific questions that the experiment is intended to answer must be clearly identified before carrying out the experiment. We should also attempt to identify known or expected sources of variability in the experimental units, since one of the main aims of a designed experiment is to reduce the effect of these sources of variability on the answers to questions of interest. That is, we design the experiment in order to improve the precision of our answers.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)


Control

Suppose a farmer wishes to evaluate a new fertilizer. She uses the new fertilizer on one field of crops (A), while using her current fertilizer on another field of crops (B). The irrigation system on field A has recently been repaired and provides adequate water to all of the crops, while the system on field B will not be repaired until next season. She concludes that the new fertilizer is far superior. The problem with this experiment is that the farmer has neglected to control for the effect of the differences in irrigation. This leads to experimental bias, the favoring of certain outcomes over others. To avoid this bias, the farmer should have tested the new fertilizer in identical conditions to the control group, which did not receive the treatment. Without controlling for outside variables, the farmer cannot conclude that it was the effect of the fertilizer, and not the irrigation system, that produced a better yield of crops.
Another type of bias that is most apparent in medical experiments is the placebo effect. Since many patients are confident that a treatment will positively affect them, they react to a control treatment which actually has no physical effect at all, such as a sugar pill. For this reason, it is important to include control, or placebo, groups in medical experiments to evaluate the difference between the placebo effect and the actual effect of the treatment.
The simple existence of placebo groups is sometimes not sufficient for avoiding bias in experiments. If members of the placebo group have any knowledge (or suspicion) that they are not being given an actual treatment, then the effect of the treatment cannot be accurately assessed. For this reason, double-blind experiments are generally preferable. In this case, neither the experimenters nor the subjects are aware of the subjects' group status. This eliminates the possibility that the experimenters will treat the placebo group differently from the treatment group, further reducing experimental bias.

Randomization

Because it is generally extremely difficult for experimenters to eliminate bias using only their expert judgment, the use of randomization in experiments is common practice. In a randomized experimental design, objects or individuals are randomly assigned (by chance) to an experimental group. Using randomization is the most reliable method of creating homogeneous treatment groups, without involving any potential biases or judgments. There are several variations of randomized experimental designs, two of which are briefly discussed below.

Completely Randomized Design

In a completely randomized design, objects or subjects are assigned to groups completely at random. One standard method for assigning subjects to treatment groups is to label each subject, then use a table of random numbers to select from the labelled subjects. This may also be accomplished using a computer. In MINITAB, the "SAMPLE" command will select a random sample of a specified size from a list of objects or numbers.
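For readers without MINITAB, the same selection takes one call in most languages; a rough Python equivalent (subject labels invented):

    import random

    subjects = [f"subject_{i}" for i in range(1, 41)]  # 40 hypothetical labelled subjects
    group_a = set(random.sample(subjects, k=20))       # plays the role of MINITAB's SAMPLE
    group_b = [s for s in subjects if s not in group_a]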

Randomized Block Design

If an experimenter is aware of specific differences among groups of subjects or objects within an experimental group, he or she may prefer a randomized block design to a completely randomized design. In a block design, experimental subjects are first divided into homogeneous blocks before they are randomly assigned to a treatment group. If, for instance, an experimenter had reason to believe that age might be a significant factor in the effect of a given medication, he might choose to first divide the experimental subjects into age groups, such as under 30 years old, 30-60 years old, and over 60 years old. Then, within each age level, individuals would be assigned to treatment groups using a completely randomized design. In a block design, both control and randomization are considered.

Example
A researcher is carrying out a study of the effectiveness of four different skin creams for the treatment of a certain skin disease. He has eighty subjects and plans to divide them into 4 treatment groups of twenty subjects each. Using a randomized block design, the subjects are assessed and put in blocks of four according to how severe their skin condition is; the four most severe cases are the first block, the next four most severe cases are the second block, and so on to the twentieth block. The four members of each block are then randomly assigned, one to each of the four treatment groups.
(Example taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1)


Replication

Although randomization helps to ensure that treatment groups are as similar as possible, the results of a single experiment, applied to a small number of objects or subjects, should not be accepted without question. Randomly selecting two individuals from a group of four and applying a treatment with "great success" generally will not impress the public or convince anyone of the effectiveness of the treatment. To improve the significance of an experimental result, replication, the repetition of an experiment on a large group of subjects, is required. If a treatment is truly effective, the long-term averaging effect of replication will reflect its experimental worth. If it is not effective, then the few members of the experimental population who may have reacted to the treatment will be negated by the large numbers of subjects who were unaffected by it. Replication reduces variability in experimental results, increasing their significance and the confidence level with which a researcher can draw conclusions about an experiment.
Note: Repeated measurements on the same experimental unit may or may not constitute true replications; treating dependent observations as if they were independent is one of the most common statistical errors found in the scientific literature.

Reference

http://www.stat.yale.edu/Courses/1997-98/101/expdes.htm
http://www.lar.msstate.edu/pdf/Basics%20of%20Experimental%20Design.pdf