Running Experiments with Amazon Mechanical Turk

I’ll start by saying that I think Amazon Mechanical Turk (MTurk) and online markets offer no less than a revolution in experimental psychology. By now I’ve run over a hundred successful experiments on MTurk and have come to consider it one of the most important tools available to me as an experimental social psychologist. Together with Qualtrics (see previous posts with tips – 1, 2, 3), MTurk is a very powerful tool for quick and inexpensive data collection.

You don’t have to take my word for it; take it from those who know something. High-profile articles keep popping up in journals across many domains, reaching the same conclusion I have: MTurk is an important tool. The following examples come from psychology, management, economics, and even biology:


Social Psychology

From Buhrmester, Kwang, & Gosling (2011, Perspectives on Psychological Science) – Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data?:

Findings indicate that: (a) MTurk participants are slightly more representative of the U.S. population than are standard Internet samples and are significantly more diverse than typical American college samples; (b) participation is affected by compensation rate and task length but participants can still be recruited rapidly and inexpensively; (c) realistic compensation rates do not affect data quality; and (d) the data obtained are at least as reliable as those obtained via traditional methods.

From Paolacci and Chandler (2014, Current Directions in Psychological Science) – Inside the Turk: Understanding Mechanical Turk as a Participant Pool:

Mechanical Turk (MTurk), an online labor market created by Amazon, has recently become popular among social scientists as a source of survey and experimental data. The workers who populate this market have been assessed on dimensions that are universally relevant to understanding whether, why, and when they should be recruited as research participants. We discuss the characteristics of MTurk as a participant pool for psychology and other social sciences, highlighting the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.


Clinical Psychology

From Shapiro, Chandler, & Mueller (2013, Clinical Psychological Science) – Using Mechanical Turk to Study Clinical Populations:

Although participants with psychiatric symptoms, specific risk factors, or rare demographic characteristics can be difficult to identify and recruit for participation in research, participants with these characteristics are crucial for research in the social, behavioral, and clinical sciences. Online research in general and crowdsourcing software in particular may offer a solution. […] Findings suggest that crowdsourcing software offers several advantages for clinical research while providing insight into potential problems, such as misrepresentation, that researchers should address when collecting data online.


Economics

From Horton, Rand, & Zeckhauser (2011, Experimental Economics) – The Online Laboratory: Conducting Experiments in a Real Labor Market:

We argue that online experiments can be just as valid—both internally and externally—as laboratory and field experiments, while requiring far less money and time to design and to conduct. In this paper, we first describe the benefits of conducting experiments in online labor markets; we then use one such market to replicate three classic experiments and confirm their results. We confirm that subjects (1) reverse decisions in response to how a decision-problem is framed, (2) have pro-social preferences (value payoffs to others positively), and (3) respond to priming by altering their choices.


Management/Cognition

From Paolacci, Chandler, & Ipeirotis (2010, Judgment and Decision Making) – Running experiments on Amazon Mechanical Turk:

Although Mechanical Turk has recently become popular among social scientists as a source of experimental data, doubts may linger about the quality of data provided by subjects recruited from online labor markets. We address these potential concerns by presenting new demographic data about the Mechanical Turk subject population, reviewing the strengths of Mechanical Turk relative to other online and offline methods of recruiting subjects, and comparing the magnitude of effects obtained using Mechanical Turk and traditional subject pools. We further discuss some additional benefits such as the possibility of longitudinal, cross cultural and prescreening designs, and offer some advice on how to best manage a common subject pool.


Biology

From Rand (2012, Journal of Theoretical Biology) – The promise of Mechanical Turk: How online labor markets can help theorists run behavioral experiments:

I review numerous replication studies indicating that AMT data is reliable. I also present two new experiments on the reliability of self-reported demographics. In the first, I use IP address logging to verify AMT subjects’ self-reported country of residence, and find that 97% of responses are accurate. In the second, I compare the consistency of a range of demographic variables reported by the same subjects across two different studies, and find between 81% and 98% agreement, depending on the variable. Finally, I discuss limitations of AMT and point out potential pitfalls.


More articles


In what follows, I’d like to share a few tips I’ve adopted across the many studies I’ve run on MTurk. Lessons learned:

  1. [Update: I used to pay US$0.01 per minute to Indian workers; I now pay around US$0.05-0.1 per minute to American MTurkers, both because of changes in MTurk and changes in my available finances post-graduation.]
    [Old:] Most of my experiments are 3-5 minutes, so the payment is normally 5 cents. I usually set the number of participants to 200, so an experiment normally costs me 10US$ (participants) + 1US$ (Amazon commission) = 11US$. I can’t begin to stress how important this is for a poor grad student. This is sometimes what people pay a single participant per session in HK/US/Israel.
  2. Most of the low-paid participants are Indian. Their English proficiency varies, but you can test it and use it as a control variable or a disqualifier, or you can even set it as a requirement on MTurk before they complete the survey (especially for longer, higher-paying surveys; less so for the 3-5 minute ones). If you’d rather exclude this sample altogether, MTurk allows you to specify which countries to include or exclude in your task.
  3. Limit the experiment to workers who have successfully completed at least 50 HITs and have at least a 95% approval rate (see the API sketch after this list).
  4. You need to verify that participants read and understood your survey, and that they didn’t click their answers at random. For that I do the following:
    1. After each scenario I run a quiz to test their understanding.
    2. Obviously, every part includes a check. A manipulation should always be tested, ideally with more than a single manipulation check.
    3. Add a timer to each page and include a check in your analysis syntax to flag anyone who answered too fast.
    4. Include a funneling section asking what the survey was about, and set a minimum character count for the answer. Go over the answers to see who is putting in noise. Of course, if you included a manipulation, also test for suspicion: ask what they thought the purpose was, or whether they can see any connection between the manipulation and your tested DV.
  5. It goes without saying that you should test your survey before setting it off into the wild. But a very important point is to set email triggers and check that the answers you get are what they should be. A few times I discovered something wrong within the first ten participants, so I stopped the batch, corrected the mistake, and restarted everything.
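
If you create HITs through the API rather than the requester website, the worker requirements from point 3 can be scripted. Below is a minimal sketch using Python’s boto3 MTurk client (a tool that postdates this post); the qualification type IDs are Amazon’s built-in system qualifications, while the title, reward, counts, and question file are placeholders you’d replace with your own.

```python
import boto3

# Minimal sketch; assumes AWS credentials are already configured.
mturk = boto3.client("mturk", region_name="us-east-1")

qualifications = [
    {   # Worker_PercentAssignmentsApproved >= 95
        "QualificationTypeId": "000000000000000000L0",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [95],
    },
    {   # Worker_NumberHITsApproved >= 50
        "QualificationTypeId": "00000000000000000040",
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [50],
    },
    {   # Worker_Locale: restrict to US workers
        # (use Comparator "NotEqualTo" instead to exclude a country)
        "QualificationTypeId": "00000000000000000071",
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    },
]

hit = mturk.create_hit(
    Title="Short decision-making survey (3-5 minutes)",   # placeholder
    Description="Answer a short survey about everyday decisions.",
    Keywords="survey, psychology, experiment",
    Reward="0.25",                        # in US$, passed as a string
    MaxAssignments=200,                   # number of participants
    AssignmentDurationInSeconds=30 * 60,
    LifetimeInSeconds=3 * 24 * 60 * 60,
    # e.g. an ExternalQuestion XML pointing at your Qualtrics survey:
    Question=open("external_question.xml").read(),
    QualificationRequirements=qualifications,
)
print(hit["HIT"]["HITId"])
```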

[UPDATE 2013/02/05: my answer to a discussion about this]

  1. One should be careful with money as an incentive for answering questionnaires on MTurk. I’ve actually found that 5 cents per questionnaire can at times yield higher-quality results than a 2-dollar reward, since it reduces the chance that people participate merely for the money. People still participate for 2-5 cents, and that can’t be just for the money in it.
    [Update: while this is still true, I do believe one should try to be generous, especially given the cost of the alternatives. If you have the research grant available, there’s no need to be cheap.]
  2. There’s a special concern with participants from India. I try not to stereotype or generalize, but some studies that hadn’t worked well with an international sample worked very well on a rerun with the rule “Location NOT India”.
  3. The questionnaire should show participants that you’re a serious researcher. Meaning:
    1. Comprehension questions to make sure they understood the scenario or what they need to do in a task.
    2. Quiz questions about scenarios that they have to get right to proceed.
    3. Two or three manipulation checks may work better than a single one.
    4. Lots of decoy questions that go in opposite directions, randomized into the scales (ones I use often: “the color of the grass is blue”, “in the same week, Tuesday comes after Monday”, “rich people have less money than poor people”, etc.).
    5. Randomizing the question sequence and answer options in each section.
    6. Adding a funneling section.
    7. Language proficiency checks.
    8. Adding a timer to all questions to check how much time they spent on each page and when they clicked on things.
  4. Between-subjects manipulations are better than a simple survey: different participants see different conditions, which reduces the chance of them simply sharing answers.
  5. There’s no escape from going over the answers in detail: checking answer timing, checking for duplicates, and reading the funneling section (a concrete screening sketch follows this list). Consistently, about 20-35% of MTurk answers fail these checks.
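
To make point 5 concrete, here is a minimal screening sketch in Python/pandas, assuming a flat CSV export. Every column name here (worker_id, page_submit_seconds, grass_blue, funnel_text) is a hypothetical placeholder for whatever your own survey export actually contains, and the thresholds are illustrative.

```python
import pandas as pd

# Hypothetical export; column names are placeholders for your own fields.
df = pd.read_csv("mturk_export.csv")

flags = pd.DataFrame(index=df.index)

# 1. Too fast: flag anyone who submitted a page in under, say, 5 seconds.
flags["too_fast"] = df["page_submit_seconds"] < 5

# 2. Duplicates: the same worker ID appearing more than once.
flags["duplicate"] = df["worker_id"].duplicated(keep="first")

# 3. Decoy/attention checks: items with one obviously correct answer.
#    E.g. "the color of the grass is blue" should get strong disagreement
#    (coded 1-2 on a hypothetical 1-7 scale).
flags["failed_decoy"] = df["grass_blue"] > 2

# 4. Funneling section: free-text answers too short to be meaningful.
flags["empty_funnel"] = df["funnel_text"].fillna("").str.len() < 20

# Exclude anyone who trips at least one flag, then inspect the rest by hand.
df["exclude"] = flags.any(axis=1)
print(f"Excluding {df['exclude'].mean():.0%} of responses")
clean = df[~df["exclude"]]
```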

[end of UPDATE]

[UPDATE 2013-07-15]

For problems with running MTurkers, read:

[end of UPDATE]


For the technical details on how to set things up, read the following:


There’s also a very helpful blog I strongly recommend you visit – Experimental Turk, which describes itself as “A blog on social science experiments on Amazon Mechanical Turk”. It hasn’t been updated for a while, but there’s still some valuable info in there.

This is a presentation at HKUST (2012):

Tools:


Further reading:

Got any other MTurk tips? Have you had any experience running experiments on MTurk? Do share.