Getting Good Smartsourcing Results (Amazon Mechanical Turk Best Practices)

  
Brent Frei's blog - Apr 17 2009 - 10:11am

Several of our customers have asked for 'best practices' help in getting the best possible results when Smartsourcing their work. I've looked for good reference sites but, frankly, haven't found any that aren't written for developers or quants. So here are some of my own observed best practices in four areas.

1. Testing your questions for optimal results
2. Parameter settings that have the most impact
3. Establishing a reputation and tolerance for what is acceptable work
4. Best time of day and week to get work done

To put context to my points, I’ll use data from my recent submission. I had run a second survey to test the results of the Reader Quality Score data in last week’s post. This run asked the question in a different way. Rather than, “grade the substance and thoughtfulness of the blog’s comments and therefore their Readers”, I asked, “Based on what you read in an article’s comments, do you think the Readers are productive in their daily work lives and why.”

There wasn’t much new news here. The top and bottom half of the lists remained within their previous halves. The order did change some, the scale widened a bit, and the explanations for scoring were very entertaining. However, since I’d carefully managed the entire process throughout, it does serve as great data for some best practices suggestions.

                

The 1,250 individual responses came in across 4 days last week (represented by 8, 9, 10 and 11 in the charts below – April 8, 9, 10 and 11).

1) Test your question
Properly framing your question for a worker is the best way to improve results. Invariably you will discover that your first attempt at a work request produces results slightly different than you expected. A few rounds of refining the form and verbiage of your request will tighten up the quality of the results.

Submitting 10 to 25 rows as a test is also wise, because even very large volumes of work tend to be completed within hours of submission. Without a test run, the whole job will often be done before you have a chance to cancel it when you see poor results.
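For anyone preparing rows in a script before submitting them, here is a minimal Python sketch of carving off a small test batch. The function name `split_pilot` and the default of 20 rows are my own illustrative choices, not part of any Smartsheet or MTurk tool.

```python
import random


def split_pilot(rows, pilot_size=20, seed=0):
    """Return (pilot, remainder): a small random test batch and the rest.

    Hypothetical helper: pilot_size=20 simply sits inside the 10 to 25 row
    range suggested above.
    """
    if not 0 < pilot_size <= len(rows):
        raise ValueError("pilot_size must be between 1 and len(rows)")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    chosen = set(rng.sample(range(len(rows)), pilot_size))
    pilot = [row for i, row in enumerate(rows) if i in chosen]
    remainder = [row for i, row in enumerate(rows) if i not in chosen]
    return pilot, remainder


if __name__ == "__main__":
    pilot, remainder = split_pilot(list(range(1250)))
    print(len(pilot), "test rows,", len(remainder), "rows held back")
```

Sampling at random rather than taking the first rows is a design choice: it gives the test batch a better chance of looking like the full job.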

In my test, over a hundred distinct workers submitted answers during the 4 days. The color bands depict volume by a specific worker. As you can see, most of the work was done in the first day. The x-axis labels refer to the day of the month April 8, 9, 10 and 11.

                               

2) Demand a High Approval Rating and Pay Well
By specifying a worker Approval Rating of 85% or higher, you improve the chances of a quality worker taking on your request. Good workers care about maintaining their high approval rating as it enables them to accept higher value work.
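If you post HITs directly through the MTurk API rather than through Smartsheet (an assumption on my part; this post does not cover that route), the approval-rating filter is expressed as a qualification requirement. The sketch below only builds the requirement as a plain dictionary in the shape that boto3's `create_hit` accepts in its `QualificationRequirements` list. The helper name `approval_requirement` is hypothetical.

```python
# System qualification type ID that MTurk uses for
# "percent of assignments approved".
PERCENT_APPROVED_QUAL_ID = "000000000000000000L0"


def approval_requirement(min_percent=85):
    """Build a qualification requirement for a minimum approval rating.

    Hypothetical helper: it makes no API call, it only returns the dict.
    """
    if not 0 <= min_percent <= 100:
        raise ValueError("min_percent must be between 0 and 100")
    return {
        "QualificationTypeId": PERCENT_APPROVED_QUAL_ID,
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [min_percent],
    }


if __name__ == "__main__":
    print(approval_requirement(85))
```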

Pay well so you attract the best workers. The additional cost is more than offset by the time saved managing poor or fraudulent work. My rule of thumb for ‘paying well’ is to pay better than $10/hr, which equates to about $0.17/minute. In practice, this means doing one of my own requests, timing how long it takes to do it right, and then pricing it from this table.

                                           Price     Time per task
                                           $ 0.02        7 secs
                                           $ 0.05      18 secs
                                           $ 0.10      30 secs
                                           $ 0.25      90 secs
                                           $ 0.50        3 mins
                                           $ 1.00        6 mins
                                           $ 2.50      15 mins
                                           $ 5.00      30 mins
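The arithmetic behind the table can be sketched in a few lines of Python. This is my reading of the $10/hr rule, rounded up to the next whole cent; the function name `price_per_task` is illustrative. It reproduces every row of the table except the 30-second one, where strict rounding gives $0.09 and the table uses a rounder $0.10.

```python
def price_per_task(seconds, hourly_rate_cents=1000):
    """Smallest whole-cent price that pays at least the hourly rate.

    seconds: how long the task takes to do right (whole seconds).
    hourly_rate_cents: target rate in cents per hour (1000 = $10/hr).
    Returns the price in dollars.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    cents = -(-seconds * hourly_rate_cents // 3600)
    return max(cents, 1) / 100


if __name__ == "__main__":
    for secs in (7, 18, 30, 90, 180, 360, 900, 1800):
        print(f"{secs:>5} secs  ->  ${price_per_task(secs):.2f}")
```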

Vary the parameters when testing for quality of work at particular prices. Pricing can be dramatically reduced by improving the worker’s speed and reducing any uncertainty associated with your expectations. Some of the key ways to do this are:

  • Reduce the steps a worker has to take to accomplish your task (e.g. include pre-configured web links to the data research sites). 
  • Optimize the number of data fields asked for. Sometimes it is cheaper and faster to send a batch back through to fill out the busy work of copy-and-paste data fields, and leave the primary run for the real research and brain power tasks.
  • Show an example of what an approved result would look like.
  • Sometimes it’s beneficial to bunch work up into a larger job and pay more.  See my post on how.

3) Build a reputation and set a tone for your tolerance of what is acceptable work
Another advantage of running early, smaller test runs, beyond refining your request, is that you can establish a group of capable, interested workers for the bulk of your work. You can also set the tone for your intolerance of poor work before it’s too late. Fraudsters are more likely to test the waters gently, doing only a few of your requests to see whether they’ll be met with rejection or have smooth sailing to whip unfettered through your list with flaky work.

In my example, over half the work was through the door before I began accepting or rejecting the results on the second day (9 in the chart below – purple = Rejected, green = Approved).

                           

As you can see from the work submission timeline below, the results started out mostly good, but as time wore on without repercussion, the shoddy and fraudulent work increased. It is important to note that some of the shoddy work was a result of workers assuming these requests were the same as previous work I’d submitted. Despite the title, instructions and data fields being entirely different, their familiarity with me as a requester and the similar nature of the research were enough for them to perform the work incorrectly. Once I began accepting and rejecting, the work returned almost entirely to an acceptable set of results.

       

While hundreds of workers often contribute to your work, they tend to follow the 80-20 rule: eighty percent of the work is done by 20% of the workers. Because so much of the work was done before I began establishing a tone, I did not establish as high a concentration of success among the top 20% of workers as desired. Each bar below represents a single worker’s total submissions. Again, purple is rejected, green is approved.
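If you export your results with one worker id per submission, you can check how close your own run comes to 80-20. A minimal Python sketch, assuming the input is just a list of worker ids; the function name `top_worker_share` is my own.

```python
from collections import Counter


def top_worker_share(worker_ids, top_percent=20):
    """Fraction of all submissions done by the busiest top_percent of workers."""
    if not worker_ids:
        raise ValueError("worker_ids must not be empty")
    counts = Counter(worker_ids)  # submissions per worker
    ranked = sorted(counts.values(), reverse=True)
    # Always count at least one worker, even for a very small pool.
    top_n = max(1, len(ranked) * top_percent // 100)
    return sum(ranked[:top_n]) / len(worker_ids)


if __name__ == "__main__":
    # Made-up sample: one worker does 16 of 20 submissions.
    sample = ["w1"] * 16 + ["w2", "w3", "w4", "w5"]
    print(f"Top 20% of workers did {top_worker_share(sample):.0%} of the work")
```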

 

Your best workers will most often participate across multiple days, so doing multiple runs is fine. They search out the work they like each day and typically check whether you’ve posted more of what fits their interests. The graph below shows the worker IDs of the 20 biggest contributors across the 4 days of work submissions (each day a different color in the bar).
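To find these returning workers in your own data, you can count the distinct days on which each worker submitted. A minimal Python sketch, assuming the results are available as (worker id, day) pairs; both function names are illustrative.

```python
from collections import defaultdict


def days_active(submissions):
    """Map each worker id to the number of distinct days they submitted on.

    submissions: iterable of (worker_id, day) pairs.
    """
    days = defaultdict(set)
    for worker_id, day in submissions:
        days[worker_id].add(day)
    return {worker_id: len(seen) for worker_id, seen in days.items()}


def repeat_workers(submissions, min_days=2):
    """Sorted ids of workers who submitted on at least min_days distinct days."""
    active = days_active(submissions)
    return sorted(w for w, n in active.items() if n >= min_days)


if __name__ == "__main__":
    # Made-up sample using day-of-month numbers like the charts.
    sample = [("w1", 8), ("w1", 9), ("w1", 9), ("w2", 8), ("w3", 10), ("w3", 11)]
    print(repeat_workers(sample))
```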

                        

4) Submit work during the Daytime hours in the U.S.
I’ve had the best results with work performed during U.S. business hours on weekdays, and poorer results with work performed at night and on weekends. It is estimated that as many as 80% of the 200,000 workers are located within the U.S.  The x-axis is the hour of the day in Pacific Daylight Time.

          

Again, assume that a large portion of your work will be performed in the hours following your submission.
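If you schedule submissions from a script, a quick check like the following can tell you whether a given moment falls in the U.S. business day. This is a sketch under stated assumptions: it uses a fixed Pacific Daylight Time offset (UTC-7) to match the charts, and the 9am to 5pm weekday window is my own stand-in for "business hours". The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-7 offset; this does not switch to standard time in winter.
PDT = timezone(timedelta(hours=-7), "PDT")


def is_us_business_hours(moment, start_hour=9, end_hour=17):
    """True if moment falls on a weekday between start_hour and end_hour PDT.

    moment must be a timezone-aware datetime.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(PDT)
    is_weekday = local.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and start_hour <= local.hour < end_hour


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Good time to submit?", is_us_business_hours(now))
```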

There are dozens of other dimensions to consider in Smartsourcing, but hopefully these will give you some useful baselines. -Brent

Comments

How big is the MTurk community?

Great article, Brent. It's difficult to tell how big Amazon has built up the community. Do you think it can really scale for HITs that number in the thousands?

It's a very big MTurk community - 100s of Thousands

Shawn,

There are more than 250,000 people that regularly or semi-regularly sign on to do MTurk work. We routinely submit 10,000 HIT jobs that are done in a matter of days if not hours (depends on how much we are paying per HIT). Others submit jobs in the 100s of thousands of HITs with results in days. It's big.
-Brent
