Running Experiments with Amazon Mechanical Turk

I’ll start by saying that I think Amazon Mechanical Turk (MTurk) and online markets offer no less than a revolution in experimental psychology. By now, I’ve already conducted over a hundred experiments on MTurk and have come to consider it one of the most important tools available to me. Together with Qualtrics (see previous posts with tips – 1, 2, 3), MTurk is a very powerful tool for very quick and inexpensive data collection. You don’t have to take my word for it; take it from those who know something. There are lots of high-profile articles popping up in various journals across all domains that have come to the same conclusion as I have – MTurk is an important tool. The following examples were chosen from psychology, management, economics, and even biology:


Social Psychology

From Buhrmester, Kwang, & Gosling (2011, Perspectives on Psychological Science) – Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data?:

Findings indicate that: (a) MTurk participants are slightly more representative of the U.S. population than are standard Internet samples and are significantly  more diverse than typical American college samples; (b) participation is affected by  compensation rate and task length but participants can still be recruited rapidly and  inexpensively; (c) realistic compensation rates do not affect data quality; and (d) the data  obtained are at least as reliable as those obtained via traditional methods.

From Paolacci and Chandler (2014, Current Directions in Psychological Science) – Inside the Turk: Understanding Mechanical Turk as a Participant Pool:

Mechanical Turk (MTurk), an online labor market created by Amazon, has recently become popular among social scientists as a source of survey and experimental data. The workers who populate this market have been assessed on dimensions that are universally relevant to understanding whether, why, and when they should be recruited as research participants. We discuss the characteristics of MTurk as a participant pool for psychology and other social sciences, highlighting the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.


Clinical Psychology

From Shapiro, Chandler, & Mueller (2013, Clinical Psychological Science) – Using Mechanical Turk to Study Clinical Populations:

Although participants with psychiatric symptoms, specific risk factors, or rare demographic characteristics can be difficult to identify and recruit for participation in research, participants with these characteristics are crucial for research in the social,  behavioral, and clinical sciences. Online research in general and crowdsourcing software in particular may offer a solution. […]  Findings suggest that crowdsourcing software offers several advantages for clinical research while providing insight into potential problems, such as  misrepresentation, that researchers should address when collecting data online.



Economics and Decision Making

From Horton, Rand & Zeckhauser (2010, Experimental Economics) – The Online Laboratory: Conducting Experiments in a Real Labor Market:

We argue that online experiments can be just as valid— both internally and externally—as laboratory and field experiments, while requiring far less money and time to design and to conduct. In this paper, we first describe the benefits of conducting experiments in online labor markets; we then use one such market to replicate three classic experiments and confirm their results. We confirm that subjects (1) reverse decisions in response to how a decision-problem is framed, (2) have pro-social preferences (value payoffs to others positively), and (3) respond to priming by altering their choices.



From Paolacci, Chandler & Ipeirotis (2010, Judgment and Decision Making) – Running experiments on Amazon Mechanical Turk:

Although Mechanical Turk has recently become popular among social scientists as a source of experimental data, doubts may linger about the quality of data provided by subjects recruited from online labor markets. We address these potential concerns by presenting new demographic data about the Mechanical Turk subject population, reviewing the strengths of Mechanical Turk relative to other online and offline methods of recruiting subjects, and comparing the magnitude of effects obtained using Mechanical Turk and traditional subject pools. We further discuss some additional benefits such as the possibility of longitudinal, cross-cultural and prescreening designs, and offer some advice on how to best manage a common subject pool.



Biology

From Rand (2011, Journal of Theoretical Biology) – The promise of Mechanical Turk: How online labor markets can help theorists run behavioral experiments:

I review numerous replication studies indicating that AMT data is reliable. I also present two new experiments on the reliability of self-reported demographics. In the first, I use IP address logging to verify AMT subjects’ self-reported country of residence, and find that 97% of responses are accurate. In the second, I compare the consistency of a range of demographic variables reported by the same subjects across two different studies, and find between 81% and 98% agreement, depending on the variable. Finally, I discuss limitations of AMT and point out potential pitfalls.


[Update March 1st, 2016: The APS Observer has a great summary article on MTurk – Under the Hood of Mechanical Turk]


Other articles


Before we begin, I think this article is a MUST-read for anyone thinking of using MTurk for academic research: The Internet’s hidden science factory

From the article, I strongly recommend you watch the following video of the life of one MTurker:

Also see this PBS coverage:



Lessons learned (some of these are rather old, so I would strongly advise revisiting them before relying on them):

  1. Most of the low-paid participants used to be Indian. Their level of English proficiency varies, but you can test it and use it as a control variable or disqualifier, or you can even set it as a requirement on MTurk before they complete the survey (especially for the longer, higher-paying surveys, less so for 3-5 minute surveys). If you’d rather eliminate this sample altogether, MTurk allows you to specify which countries you would like to include or exclude in your task.
  2. Limit the experiment to workers who have successfully completed at least 100 HITs with at least a 95% approval rate.
  3. You need to verify that participants read and understood your survey, and that they don’t click their answers at random. For that I do the following:
    1. After each scenario I run a quiz to test their understanding.
    2. Every part includes a check; a manipulation should always be tested, ideally with more than a single manipulation check.
    3. Add a timer to each page and include a check in your analysis syntax to test whether they answered too fast.
    4. Include a funneling section: ask them what the survey was about and set a minimum character count for the answer. Go over the answers to see who puts in noise. Of course, if you included a manipulation, also test for suspicion and ask what they thought the purpose was, or whether they can see any connection between the manipulation and your tested DV.
  4. It goes without saying that you should test your survey before sending it out into the wild. But a very important point is to set email triggers and verify that the answers you get are what they should be. A few times I discovered something wrong within the first ten participants, so I stopped the batch, corrected the mistake, and restarted everything.
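The worker requirements in points 1-2 above can be set programmatically. A minimal sketch using Python, building the qualification list that boto3’s MTurk client accepts for `create_hit` (the thresholds, country list, and the commented-out call are placeholders; the qualification type IDs are MTurk’s documented system qualifications):

```python
# Sketch: worker qualifications matching the advice above
# (>= 100 approved HITs, >= 95% approval rate, restrict by country).

def build_qualifications(min_hits=100, min_approval=95, exclude_countries=("IN",)):
    """Build the QualificationRequirements list for mturk.create_hit()."""
    return [
        {   # worker must have at least `min_hits` approved HITs
            "QualificationTypeId": "00000000000000000040",  # NumberHITsApproved
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_hits],
        },
        {   # worker approval rate must be at least `min_approval` percent
            "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approval],
        },
        {   # exclude workers located in the listed countries
            "QualificationTypeId": "00000000000000000071",  # Locale
            "Comparator": "NotIn",
            "LocaleValues": [{"Country": c} for c in exclude_countries],
        },
    ]

# With boto3 this would be passed to create_hit, e.g.:
# import boto3
# mturk = boto3.client("mturk", region_name="us-east-1")
# mturk.create_hit(..., QualificationRequirements=build_qualifications())
```

Keeping the requirements in one helper makes it easy to reuse the same screening rules across batches.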

[UPDATE 2013/02/05 my answer to a discussion about this]

  1. One should be careful with money as an incentive for answering questionnaires on MTurk. [Update: while this is still true, I do believe one should try to be generous, especially given the cost of the alternatives. If you have the research grant available, there’s no need to be cheap]
  2. There’s a special concern with participants from India. I try not to stereotype and generalize, but some studies that haven’t worked well with an international sample have worked very well on a rerun with the rule: “Location NOT India”.
  3. The questionnaire should show participants you’re a serious researcher. Meaning:
    1. Comprehension questions to make sure they understood the scenario or what they need to do in a task.
    2. Quiz questions about scenarios that they have to get right to proceed.
    3. 2 or 3 manipulation checks may work better than a single one.
    4. Lots of decoy questions that go in opposite directions, randomized into the scales (ones I use often – “the color of the grass is blue”, “in the same week, Tuesday comes after Monday”, “rich people have less money than poor people”, etc.)
    5. Randomizing question sequence and options for each section.
    6. Adding a funneling section.
    7. Adding a timer to all questions to check how much time they spent on each page and when they clicked on things.
  4. Between-subjects manipulations are better than a simple survey, since different participants see different conditions, which reduces the chance of participants simply sharing answers.
  5. There’s no escape from going over the answers in detail: checking response timing, checking for duplicates, and reading the funneling section. Consistently, about 20-35% of MTurk answers fail these checks.

[end of UPDATE]
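The screening routine in point 5 above can be sketched as a post-collection filter. This is illustrative only – the field names, decoy items, and thresholds are assumptions, not from any specific survey:

```python
# Sketch: flag low-quality responses using the checks described above:
# response timing, decoy questions, duplicate workers, and the funneling answer.

def screen_responses(responses, min_seconds=60, decoy_answers=None):
    """Return (kept, flagged) lists of (worker_id, reasons) pairs.

    `responses` is a list of dicts, one per participant.
    """
    decoy_answers = decoy_answers or {
        "grass_is_blue": "disagree",      # "the color of the grass is blue"
        "tuesday_after_monday": "agree",  # "Tuesday comes after Monday"
    }
    seen_workers = set()
    kept, flagged = [], []
    for r in responses:
        reasons = []
        if r["duration_seconds"] < min_seconds:
            reasons.append("too fast")
        for item, correct in decoy_answers.items():
            if r.get(item) != correct:
                reasons.append(f"failed decoy: {item}")
        if r["worker_id"] in seen_workers:
            reasons.append("duplicate worker")
        seen_workers.add(r["worker_id"])
        if len(r.get("funnel_text", "")) < 20:  # minimum-character funneling answer
            reasons.append("empty funneling answer")
        (flagged if reasons else kept).append((r["worker_id"], reasons))
    return kept, flagged
```

Running every submission through a filter like this, and then reading the flagged funneling answers by hand, is what the 20-35% failure figure above refers to.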


For problems with running MTurkers, read:


For the technical details on how to set things up, read the following:


There’s also a very helpful blog I strongly recommend you visit – Experimental Turk, which titles itself “a blog on social science experiments on Amazon Mechanical Turk”. It hasn’t been updated in a while, but there’s still some valuable info in there.


Tools:


Tools to run with MTurk:


Further readings:


Alternatives to MTurk:

Got any other MTurk tips? Have you had any experience running experiments on MTurk? Do share.


  1. To your comment above regarding pay where you said:

    “One should be careful with money as an incentive for answering
    questionnaires on MTurk. I’ve actually found that 5 cents a
    questionnaire may at times yield higher quality results than a 2 dollar
    reward since it reduces the chance that people merely participate for
    the money. People still participate for 2-5 cents, and that couldn’t be
    just for the money in it.”

    What is your measure for “higher quality data”? Are you speaking in terms of statistically significant differences, or just trends in the data or something else?

    I ask this question because I noticed your typical pay was a rate of 1 cent per minute (5 cents for a 5-minute task). Your comment of a $2 reward for what I am assuming is the same length task would work out to an hourly pay rate of $24 (pretty high compared to what a majority of requesters are paying). Rather, I am wondering why you can’t pay a reasonable wage–something closer to minimum wage (~$6-8 per hour). That would mean your 5-minute survey would now pay about 50 cents.

    While data quality may not change, for me it comes down to what is fair and reasonable rather than how little money I can spend while still getting good data. Sadly, the message most researchers hear is how cheap and easy it is to get good data from MTurkers. I’ll admit it was the first thing I was told when I heard about MTurk from a colleague (I am currently a 2nd-year graduate student). However, after doing my research and talking to workers in forums I learned first-hand what they considered a fair and reasonable hourly pay rate (many said $6-8 was reasonable but might need to be more depending on the difficulty of the task). As a result, I keep all my HITs within the $6-8 per hour mark. Sadly, most researchers don’t seem to care or want to take the time to listen.

    I look forward to hearing your comments!

    1. Thanks for the comments. These are important questions and I understand your concerns, but you’re raising a few very different issues.

      As for ‘high quality data’: as I point out above, I include attention checks and decoy questions throughout my studies, as well as quizzes to make sure participants understood the scenario/task at hand. Higher-quality data means fewer errors, fewer failed attention checks, and overall better responses to open questions and tasks. You’ll notice another post in this blog about honesty, and that’s another factor that often comes into play.

      In this post I report the bottom line. I have run every possible rate, and I’m still adjusting the pay amount based on task difficulty and the target MTurkers.

      I fail to understand your argument about fair and reasonable, since no one forces MTurkers to participate (unlike students, for example), and an MTurker can decide whether or not to take on a task. It’s a free labor market. Requesters listen to market forces, and since people do participate, why would you ask requesters to pay more?

      Also, talking about MTurkers as a single group is misleading and inaccurate. While MTurkers from developed economies might see $6-8/hour as fair (maybe – I’m not even sure about that), those in other parts of the world would probably settle for far less. So, if you’re not particular about the population or location you’re aiming for, there’s no need to overpay as a requester. And, as I mentioned in the post, there are consequences to setting the pay higher than it needs to be, as it attracts the sort of workers that we as researchers try to stay away from.
