I’ll start by saying that I think Amazon Mechanical Turk (MTurk) and online labor markets offer no less than a revolution in experimental psychology. By now, I’ve run over fifty successful experiments on MTurk and have come to consider it one of the most important tools available to me as an experimental social psychologist. Together with Qualtrics (see previous posts with tips – 1, 2, 3), MTurk is a very powerful tool for very quick and inexpensive data collection. You don’t have to take my word for it; take it from those who know something. High-profile articles are popping up in journals across many domains, and they have come to the same conclusion I have – MTurk is an important tool. The following examples were chosen from psychology, management, economics, and even biology:
Findings indicate that: (a) MTurk participants are slightly more representative of the U.S. population than are standard Internet samples and are significantly more diverse than typical American college samples; (b) participation is affected by compensation rate and task length but participants can still be recruited rapidly and inexpensively; (c) realistic compensation rates do not affect data quality; and (d) the data obtained are at least as reliable as those obtained via traditional methods.
Mechanical Turk (MTurk), an online labor market created by Amazon, has recently become popular among social scientists as a source of survey and experimental data. The workers who populate this market have been assessed on dimensions that are universally relevant to understanding whether, why, and when they should be recruited as research participants. We discuss the characteristics of MTurk as a participant pool for psychology and other social sciences, highlighting the traits of the MTurk samples, why people become MTurk workers and research participants, and how data quality on MTurk compares to that from other pools and depends on controllable and uncontrollable factors.
Although participants with psychiatric symptoms, specific risk factors, or rare demographic characteristics can be difficult to identify and recruit for participation in research, participants with these characteristics are crucial for research in the social, behavioral, and clinical sciences. Online research in general and crowdsourcing software in particular may offer a solution. [...] Findings suggest that crowdsourcing software offers several advantages for clinical research while providing insight into potential problems, such as misrepresentation, that researchers should address when collecting data online.
We argue that online experiments can be just as valid—both internally and externally—as laboratory and field experiments, while requiring far less money and time to design and to conduct. In this paper, we first describe the benefits of conducting experiments in online labor markets; we then use one such market to replicate three classic experiments and confirm their results. We confirm that subjects (1) reverse decisions in response to how a decision-problem is framed, (2) have pro-social preferences (value payoffs to others positively), and (3) respond to priming by altering their choices.
Although Mechanical Turk has recently become popular among social scientists as a source of experimental data, doubts may linger about the quality of data provided by subjects recruited from online labor markets. We address these potential concerns by presenting new demographic data about the Mechanical Turk subject population, reviewing the strengths of Mechanical Turk relative to other online and offline methods of recruiting subjects, and comparing the magnitude of effects obtained using Mechanical Turk and traditional subject pools. We further discuss some additional benefits such as the possibility of longitudinal, cross cultural and prescreening designs, and offer some advice on how to best manage a common subject pool.
I review numerous replication studies indicating that AMT data is reliable. I also present two new experiments on the reliability of self-reported demographics. In the first, I use IP address logging to verify AMT subjects’ self-reported country of residence, and find that 97% of responses are accurate. In the second, I compare the consistency of a range of demographic variables reported by the same subjects across two different studies, and find between 81% and 98% agreement, depending on the variable. Finally, I discuss limitations of AMT and point out potential pitfalls.
- Separate but equal? A comparison of participants and data gathered via Amazon’s MTurk, social media, and face-to-face behavioral testing (Computers in Human Behavior, Nov2013).
- The relationship between motivation, monetary compensation, and data quality among US- and India-based workers on Mechanical Turk (Litman, Robinson, & Rosenzweig, 2014, BRM)
- Seems like MTurk has finally made its way to AMJ
Below I’d like to share a few tips that I’ve adopted over the many studies I’ve run on MTurk. Lessons learned:
- I normally pay 0.01 US$ per minute. Most of my experiments take 3-5 minutes, so the payment is normally 5 cents. I usually set the number of participants to 200, so an experiment normally costs me 10 US$ for participants + 1 US$ Amazon commission = 11 US$. I can’t begin to stress how important this is for a poor grad student. This is sometimes what people pay a single participant per session in HK/US/Israel.
- Most of the low-paid participants are Indian. Their level of English proficiency varies, but you can test it and use it as a control variable or disqualifier, or you can even set it as a requirement on MTurk before they complete the survey (especially for the longer, higher-paying surveys, not so much for the 3-5 minute surveys). If you’d rather exclude this sample altogether, MTurk allows you to specify which countries you would like to include or exclude in your task.
- Limit the experiment to workers who have successfully completed at least 50 HITs and have an acceptance rate of at least 95%.
- You need to verify that participants read and understand your survey, and that they don’t randomly click their answers. For that I do the following:
- After each scenario I run a quiz to test their understanding.
- Obviously, every part includes a check. A manipulation should always be tested, better with more than a single manipulation check.
- Add a timer for each page and include a check in your stat syntax to test whether they answered too fast.
- Include a funneling section, ask them what the survey was about, and set a minimum character count for the answer. Go over the answers to see who puts in noise. Of course, if you included a manipulation, also test for suspicion: ask them what they thought the purpose was, or whether they can see any connection between the manipulation and your tested DV.
- It goes without saying that you should test your survey before releasing it into the wild. But a very important point is to set email triggers and check that the answers you get are what they should be. A few times I discovered something wrong within the first ten participants, so I stopped the batch, corrected the mistake and restarted everything.
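The cost arithmetic from the first tip above can be written out as a small helper. This is only a sketch: the function name `batch_cost` is made up for illustration, and the 10% commission rate is simply what the figures in this post imply (1 US$ on 10 US$ of rewards). Amazon's fees may be different by the time you read this, so check the current pricing before budgeting.

```python
from decimal import Decimal


def batch_cost(reward, participants, commission_rate):
    """Return (rewards, commission, total) for one MTurk batch, in US$.

    reward          -- payment per participant
    participants    -- number of participants requested
    commission_rate -- Amazon's fee as a fraction of the rewards
                       (assumption: 0.10, as implied by the post's numbers)
    """
    rewards = reward * participants
    commission = rewards * commission_rate
    return rewards, commission, rewards + commission


# 5 cents per participant, 200 participants, 10% commission (assumed)
rewards, commission, total = batch_cost(Decimal("0.05"), 200, Decimal("0.10"))
print(f"rewards={rewards:.2f} commission={commission:.2f} total={total:.2f}")
```

Using `Decimal` rather than floats keeps the cents exact, which matters once you start summing many small payments.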
[UPDATE 2013/02/05 my answer to a discussion about this]
- One should be careful with money as an incentive for answering questionnaires on MTurk. I’ve actually found that 5 cents a questionnaire may at times yield higher quality results than a 2 dollar reward since it reduces the chance that people merely participate for the money. People still participate for 2-5 cents, and that couldn’t be just for the money in it.
- There’s a special concern with participants from India. I try not to stereotype or generalize, but some studies that haven’t worked well with an international sample have worked very well on the rerun with the rule: “Location NOT India”.
- The questionnaire should show participants you’re a serious researcher. Meaning:
- Comprehension questions to make sure they understood the scenario or what they need to do in a task.
- Quiz questions about scenarios that they have to get right to proceed.
- 2 or 3 manipulation checks may work better than a single one.
- Lots of decoy questions that go in opposite directions and randomized into scales (ones I use often – “the color of the grass is blue” “in the same week, Tuesday comes after Monday” “rich people have less money than poor people” etc.)
- Randomizing question sequence and options for each section.
- Adding a funneling section.
- Language proficiency checks.
- Adding a timer to all questions to check how much time they spent on each page and when they clicked on things.
- Between-subject manipulations are better than a simple survey, since different participants see different conditions, which reduces the chances of participants simply sharing answers.
- There’s no escape from going over the answers in detail, checking the answer timing, checking for duplicates and reading the funneling section. Consistently, about 20-35% of the MTurk answers fail these checks.
[end of UPDATE]
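The screening steps from the update above (answer timing, duplicates, funneling answers, decoy items) can be partly automated before you read through the data by hand. The following is a minimal sketch, not my actual syntax: the field names (`worker_id`, `page_seconds`, `funnel_text`, `decoys`), the decoy keys and the thresholds are all hypothetical and should be adapted to whatever your survey export looks like.

```python
from collections import Counter

# Hypothetical decoy items mapped to the answer an attentive reader gives
# (True = agree, False = disagree).
DECOY_KEY = {
    "grass_is_blue": False,
    "tuesday_after_monday": True,
    "rich_have_less_money": False,
}


def screen(responses, min_seconds=5.0, min_chars=20):
    """Return {row index: [reasons]} for responses that fail any check."""
    id_counts = Counter(r["worker_id"] for r in responses)
    flagged = {}
    for i, r in enumerate(responses):
        reasons = []
        # Every row sharing a worker ID is flagged, for manual review.
        if id_counts[r["worker_id"]] > 1:
            reasons.append("duplicate worker")
        # The fastest page decides: one rushed page is enough to flag.
        if min(r["page_seconds"]) < min_seconds:
            reasons.append("answered too fast")
        if len(r["funnel_text"].strip()) < min_chars:
            reasons.append("funnel answer too short")
        if any(r["decoys"].get(k) != v for k, v in DECOY_KEY.items()):
            reasons.append("failed decoy item")
        if reasons:
            flagged[i] = reasons
    return flagged


good_decoys = dict(DECOY_KEY)
responses = [
    {"worker_id": "A1", "page_seconds": [12.0, 30.5],
     "funnel_text": "It seemed to be about fairness judgments.",
     "decoys": good_decoys},
    {"worker_id": "A2", "page_seconds": [1.2, 2.0],
     "funnel_text": "asdf",
     "decoys": {**good_decoys, "grass_is_blue": True}},
    {"worker_id": "A3", "page_seconds": [15.0, 22.0],
     "funnel_text": "Something about how people make decisions.",
     "decoys": good_decoys},
    {"worker_id": "A3", "page_seconds": [14.0, 25.0],
     "funnel_text": "Something about how people make decisions.",
     "decoys": good_decoys},
]

flagged = screen(responses)
for row, reasons in sorted(flagged.items()):
    print(row, responses[row]["worker_id"], reasons)
```

A script like this only narrows things down. A funneling answer can be long enough and still be noise, so the flagged and unflagged rows both still deserve a read.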
For problems with running MTurkers, read:
- Let’s keep discussing M Turk sample validity
- What’s a “valid” sample? Problems with Mechanical Turk study samples, part 1
- Fooled twice, shame on who? Problems with Mechanical Turk study samples, part 2
For the technical details on how to set things up, read the following:
- Experiments using Mechanical Turk. Part 1
- Experiments using Mechanical Turk. Part 2
- THE TECHNICAL DETAILS, TUTORIALS, WALK-THROUGHS
- How to connect Qualtrics and mturk, Part II
There’s also a very helpful blog I strongly recommend that you visit – Experimental Turk, which describes itself as “A blog on social science experiments on Amazon Mechanical Turk”. It hasn’t been updated for a while, but there is still some valuable info in there. This is a quick presentation I gave on the topic to an FSU lab:
And a presentation at HKUST:
- Preventing MTurkers who participated in one study from participating in certain other studies – Turk Check.
- PsyTurk (see presentation here)
- Identifying Careless Responses in Survey Data (Meade & Craig, 2012, Psychological Methods) – an excellent article on careless responses with online and student samples. A worthy read. Another article is Detecting and Deterring Insufficient Effort Responding to Surveys (Huang et al., 2012, JBS)
- Deneme – a blog of experiments on Amazon Mechanical Turk (who created – Iterative Tasks on Mechanical Turk)
- Is Mechanical Turk the future of cognitive science research?
- Looking for Subjects? Amazon’s Mechanical Turk
- The Pros & Cons of Amazon Mechanical Turk for Scientific Surveys
- Experimenting on Mechanical Turk: 5 How Tos
- Slides from ACR 2012 (good tips)
- Evaluating Amazon’s Mechanical Turk as a Tool for Experimental Behavioral Research (published in PLOS ONE, with a related blog post)
- Mechanical Turk and Experiments in the Social Sciences
- How naïve are MTurk workers? and the followup response - mTurk: Method, Not Panacea and the followup post - Consequences of Worker Nonnaïvete: The Cognitive Reflection Test
- ITWorld - Experimenting on Mechanical Turk: 5 How Tos
Got any other MTurk tips? Have you had any experience running experiments on MTurk? Do share.