I’ll start by saying that I think Amazon Mechanical Turk (MTurk) and online markets offer nothing less than a revolution in experimental psychology. By now, I’ve conducted over a hundred experiments on MTurk and have come to consider it one of the most important tools available to me. Together with Qualtrics (see previous posts with tips – 1, 2, 3), MTurk is a very powerful tool for quick and inexpensive data collection. You don’t have to take my word for it; take it from those who know. High-profile articles are popping up in journals across all domains that have come to the same conclusion as I have: MTurk is an important tool. The following examples were chosen from psychology, management, economics, and even biology:
Findings indicate that: (a) MTurk participants are slightly more representative of the U.S. population than are standard Internet samples and are significantly more diverse than typical American college samples; (b) participation is affected by compensation rate and task length but participants can still be recruited rapidly and inexpensively; (c) realistic compensation rates do not affect data quality; and (d) the data obtained are at least as reliable as those obtained via traditional methods.
Mechanical Turk (MTurk), an online labor market created by Amazon, has recently become popular among social scientists as a source of survey and experimental data. The workers who populate this market have been assessed on dimensions that are universally relevant to understanding whether, why, and when they should be recruited as research participants. We discuss the characteristics of MTurk as a participant pool for psychology and other social sciences, highlighting the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.
Although participants with psychiatric symptoms, specific risk factors, or rare demographic characteristics can be difficult to identify and recruit for participation in research, participants with these characteristics are crucial for research in the social, behavioral, and clinical sciences. Online research in general and crowdsourcing software in particular may offer a solution. […] Findings suggest that crowdsourcing software offers several advantages for clinical research while providing insight into potential problems, such as misrepresentation, that researchers should address when collecting data online.
We argue that online experiments can be just as valid, both internally and externally, as laboratory and field experiments, while requiring far less money and time to design and to conduct. In this paper, we first describe the benefits of conducting experiments in online labor markets; we then use one such market to replicate three classic experiments and confirm their results. We confirm that subjects (1) reverse decisions in response to how a decision-problem is framed, (2) have pro-social preferences (value payoffs to others positively), and (3) respond to priming by altering their choices.
Although Mechanical Turk has recently become popular among social scientists as a source of experimental data, doubts may linger about the quality of data provided by subjects recruited from online labor markets. We address these potential concerns by presenting new demographic data about the Mechanical Turk subject population, reviewing the strengths of Mechanical Turk relative to other online and offline methods of recruiting subjects, and comparing the magnitude of effects obtained using Mechanical Turk and traditional subject pools. We further discuss some additional benefits such as the possibility of longitudinal, cross cultural and prescreening designs, and offer some advice on how to best manage a common subject pool.
I review numerous replication studies indicating that AMT data is reliable. I also present two new experiments on the reliability of self-reported demographics. In the first, I use IP address logging to verify AMT subjects’ self-reported country of residence, and find that 97% of responses are accurate. In the second, I compare the consistency of a range of demographic variables reported by the same subjects across two different studies, and find between 81% and 98% agreement, depending on the variable. Finally, I discuss limitations of AMT and point out potential pitfalls.
[Update March 1st, 2016: The APS Observer has a great summary article on MTurk: Under the Hood of Mechanical Turk]
- Separate but equal? A comparison of participants and data gathered via Amazon’s MTurk, social media, and face-to-face behavioral testing (Computers in Human Behavior, Nov 2013).
- The relationship between motivation, monetary compensation, and data quality among US- and India-based workers on Mechanical Turk (Litman, Robinson, & Rosenzweig, 2014, BRM)
- Attentive Turkers: MTurk participants perform better on online attention checks than subject pool participants (Hauser & Schwarz, 2015, BRM) | Summary
- Comparing the Similarity of Responses Received from Studies in Amazon’s Mechanical Turk to Studies Conducted Online and with Direct Recruitment (Bartneck, Duenser, Moltchanova, & Zawieska, 2015, PLOS ONE)
- Notes from a Day on the Forums: Recommendations for Maintaining a Good Reputation as an Amazon Mechanical Turk Requester (David Rand’s lab at Yale, draft recommendations)
- Graduating from Undergrads: Are Mechanical Turk Workers More Attentive than Undergraduate Participants? (OSF)
- The Average Laboratory Samples a Population of 7,300 Amazon Mechanical Turk Workers (JDM, 2015) (Summary post on Experimental Turk)
- MTurk ‘Unscrubbed’: Exploring the Good, the ‘Super’, and the Unreliable on Amazon’s Mechanical Turk
- Are samples drawn from Mechanical Turk valid for research on political ideology? (Research and Politics)
- The Generalizability of Survey Experiments (Journal of Experimental Political Science, 2015)
- Conducting Clinical Research Using Crowdsourced Convenience Samples (Annual Review of Clinical Psychology, 2016)
- Psychological research in the internet age: The quality of web-based data (Computers in Human Behavior, 2016) | reviewed on BPS
- Tosti-Kharas, J., & Conley, C. (2016). Coding Psychological Constructs in Text Using Mechanical Turk: A Reliable, Accurate, and Efficient Alternative. Frontiers in Psychology, 7, 741.
Before we begin, I think this article is a MUST read for anyone thinking of using MTurk for academic research: The Internet’s hidden science factory
From the article, I strongly recommend you watch the following video about the life of one MTurker:
Also see this PBS coverage:
Lessons learned (some of these are rather old, so I strongly advise you to revisit them):
- Most of the low-paid participants used to be Indian. Their level of English proficiency varies, but you can test this and use it as a control variable or disqualifier, or you can even set it as a requirement on MTurk before they complete the survey (especially for the longer, higher-paying surveys, not so much for the 3-5 minute surveys). If you’d rather eliminate this sample altogether, MTurk allows you to specify which countries you would like to include or exclude in your task.
- Limit the experiment to workers who have successfully completed at least 100 HITs and have an approval rate of at least 95%.
- You need to verify that participants read and understand your survey, and that they don’t randomly click their answers. For that I do the following:
- After each scenario I run a quiz to test their understanding.
- Every part should include a check. A manipulation should always be tested, preferably with more than a single manipulation check.
- Add a timer for each page and include a check in your statistical syntax to test whether they answered too fast.
- Include a funneling section: ask them what the survey was about and set a minimum character count for the answer. Go over the answers to see who puts in noise. Of course, if you included a manipulation, also test for suspicion by asking what they thought the purpose was and whether they can see any connection between the manipulation and your tested DV.
- It goes without saying that you should test your survey before releasing it into the wild. Just as important, set email triggers and check that the answers you get are what they should be. A few times I discovered something wrong within the first ten participants, so I stopped the batch, corrected the mistake, and restarted everything.
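The worker restrictions mentioned above (at least 100 completed HITs, a 95% approval rate, and a country filter) can also be set programmatically. What follows is only a minimal sketch, assuming you create HITs through the MTurk API rather than the web interface. The helper name `build_qualification_requirements` is hypothetical, and the system qualification type IDs are the ones I recall from the MTurk API documentation, so verify them against the current docs before relying on them.

```python
# System qualification type IDs (assumption: check the current MTurk API docs).
APPROVED_HITS_ID = "00000000000000000040"   # Worker_NumberHITsApproved
APPROVAL_RATE_ID = "000000000000000000L0"   # Worker_PercentAssignmentsApproved
LOCALE_ID = "00000000000000000071"          # Worker_Locale


def build_qualification_requirements(min_hits=100, min_approval=95,
                                     include_countries=None,
                                     exclude_countries=None):
    """Build a list shaped like the QualificationRequirements parameter.

    Hypothetical helper: it only assembles dictionaries and makes no API call.
    """
    requirements = [
        {   # at least `min_hits` approved HITs
            "QualificationTypeId": APPROVED_HITS_ID,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_hits],
        },
        {   # approval rate of at least `min_approval` percent
            "QualificationTypeId": APPROVAL_RATE_ID,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approval],
        },
    ]
    if include_countries:
        # only workers located in one of these countries
        requirements.append({
            "QualificationTypeId": LOCALE_ID,
            "Comparator": "In",
            "LocaleValues": [{"Country": c} for c in include_countries],
        })
    if exclude_countries:
        # workers located anywhere except these countries
        requirements.append({
            "QualificationTypeId": LOCALE_ID,
            "Comparator": "NotIn",
            "LocaleValues": [{"Country": c} for c in exclude_countries],
        })
    return requirements


if __name__ == "__main__":
    for requirement in build_qualification_requirements(include_countries=["US"]):
        print(requirement)
```

The resulting list is meant to be passed as the `QualificationRequirements` argument when creating a HIT, for example with the `create_hit` call of the boto3 MTurk client.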
[UPDATE 2013/02/05 my answer to a discussion about this]
- One should be careful with money as an incentive for answering questionnaires on MTurk. [Update: while this is still true, I do believe one should try to be generous, especially given the cost of the alternatives. If you have the available research grant, then no need to be cheap]
- There’s a special concern with participants from India. I try not to stereotype or generalize, but some studies that didn’t work well with an international sample worked very well when rerun with the rule “Location NOT India”.
- The questionnaire should show participants you’re a serious researcher. Meaning:
- Comprehension questions to make sure they understood the scenario or what they need to do in a task.
- Quiz questions about scenarios that they have to get right to proceed.
- 2 or 3 manipulation checks may work better than a single one.
- Lots of decoy questions that go in opposite directions and randomized into scales (ones I use often – “the color of the grass is blue” “in the same week, Tuesday comes after Monday” “rich people have less money than poor people” etc.)
- Randomizing question sequence and options for each section.
- Adding a funneling section.
- Adding a timer to all questions to check how much time they spent on each page and when they clicked on things.
- Between-subject manipulations are better than a simple survey: different participants see different conditions, which reduces the chances of participants simply sharing answers.
- There’s no escape from going over the answers in detail: checking the answer timing, checking for duplicates, and reading the funneling section. Consistently, about 20-35% of the MTurk answers fail these checks.
[end of UPDATE]
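The screening steps described above (duplicates, answer timing, decoy questions, and the funneling section) can be partly automated before you read the answers by hand. This is a rough sketch, not my actual analysis syntax: the field names (`worker_id`, `page_seconds`, `decoys`, `funnel_text`), the decoy item labels, and both thresholds are hypothetical and need to be adapted to your own survey export.

```python
from collections import Counter

# Hypothetical decoy items and the answer an attentive participant should give
# (True = agree, False = disagree).
DECOY_KEY = {
    "grass_is_blue": False,
    "tuesday_after_monday": True,
    "rich_have_less_money": False,
}
MIN_SECONDS_PER_PAGE = 5   # assumed threshold for "answered too fast"
MIN_FUNNEL_CHARS = 20      # assumed minimum length of the funneling answer


def screen_responses(responses):
    """Return {row_index: [reasons]} for every response failing a check."""
    id_counts = Counter(r["worker_id"] for r in responses)
    flagged = {}
    for index, r in enumerate(responses):
        reasons = []
        if id_counts[r["worker_id"]] > 1:
            reasons.append("duplicate worker")
        if any(seconds < MIN_SECONDS_PER_PAGE for seconds in r["page_seconds"]):
            reasons.append("answered too fast")
        if any(r["decoys"].get(item) != expected
               for item, expected in DECOY_KEY.items()):
            reasons.append("failed decoy item")
        if len(r["funnel_text"].strip()) < MIN_FUNNEL_CHARS:
            reasons.append("funnel answer too short")
        if reasons:
            flagged[index] = reasons
    return flagged
```

A flag here is only a prompt for a closer look. The funneling answers still need to be read by a person, since a long answer can be noise just as easily as a short one.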
For problems with running MTurkers, read:
- Let’s keep discussing M Turk sample validity
- What’s a “valid” sample? Problems with Mechanical Turk study samples, part 1
- Fooled twice, shame on who? Problems with Mechanical Turk study samples, part 2
- My Experience as an Amazon Mechanical Turk (MTurk) Worker (Utpal Dholakia)
For the technical details on how to set things up, read the following:
- Experiments using Mechanical Turk. Part 1
- Experiments using Mechanical Turk. Part 2
- THE TECHNICAL DETAILS, TUTORIALS, WALK-THROUGHS
- How to connect Qualtrics and mturk, Part II
- The right way to prevent duplicate workers – How to Block Past Workers from Doing Surveys
- MTurk + Qualtrics
There’s also a very helpful blog I strongly recommend that you visit: Experimental Turk, which describes itself as “A blog on social science experiments on Amazon Mechanical Turk”. It hasn’t been updated for a while, but there is still valuable information in there.
- If you’re using MTurk for academic data collection, you absolutely must use Turkprime (read my review)
- Preventing MTurkers who participated in one study from participating in certain other studies – Turk Check.
- PsyTurk (see presentation here)
- How to setup notifications for HITs
- Qualtrics surveys, of course.
Multiple player games:
- Software Platform for Human Interaction Experiments (SoPHIE) (e.g. gossip games)
- “Breadboard is a software platform for developing and conducting human interaction experiments on networks. It allows researchers to rapidly design experiments using a flexible domain-specific language and provides researchers with immediate access to a diverse pool of online participants.”
- oTree offers integration with Amazon Mechanical Turk
- Identifying Careless Responses in Survey Data (Meade & Craig, 2012, Psychological Methods) – an excellent article on careless responses with online and student samples. A worthy read. Another article is Detecting and Deterring Insufficient Effort Responding to Surveys (Huang et al., 2012, JBS)
- Deneme – a blog of experiments on Amazon Mechanical Turk (by the creator of Iterative Tasks on Mechanical Turk)
- Is Mechanical Turk the future of cognitive science research?
- Looking for Subjects? Amazon’s Mechanical Turk
- The Pros & Cons of Amazon Mechanical Turk for Scientific Surveys
- Slides from ACR 2012 (good tips)
- Evaluating Amazon’s Mechanical Turk as a Tool for Experimental Behavioral Research (published at PLOS ONE, with a related blog post)
- Mechanical Turk and Experiments in the Social Sciences
- How naïve are MTurk workers? and the followup response – mTurk: Method, Not Panacea and the followup post – Consequences of Worker Nonnaïvete: The Cognitive Reflection Test
- ITWorld – Experimenting on Mechanical Turk: 5 How Tos
- High quality MTurk data
- Graduating from undergrads: Are MTurk workers less attentive than undergraduate students? (Poster from Manylabs)
- Recent studies on MTurk validity (Mturk for academics, 2016)
Alternatives to MTurk:
- Prolific Academic
- Call for participants
- Find participants
- Reddit (see academic paper about this option)
Got any other MTurk tips? Have you had any experience running experiments on MTurk? Do share.