Yes, people can transcribe texts in a language they don’t know

In a recent talk at the Conference on the Indigenous Languages of Latin America, I presented the results of one of my first forays into crowdsourcing the transcription of handwritten indigenous-language manuscripts. Since the documents are handwritten and often make liberal use of special phonetic characters (and loads of combinations of diacritic marks in the most baroque documents), OCR isn’t a viable option, and we’re left with transcription. People have set up projects to transcribe sets of texts like these, but most have been led by a researcher who trains a small, dedicated team on paleography and the details of the language being transcribed. These efforts can produce (and have produced!) quality work, but they are not particularly scalable models for the large volume of handwritten texts that exist in all of Indigenous America’s diverse languages. To make headway on that volume, we might consider crowdsourcing the transcriptions. But could novice transcribers produce quality transcriptions of a language they don’t understand?

To test this, I asked an undergraduate class to complete a transcription task on a document in Chalcatongo Mixtec and compared their transcriptions against my own. The class was an introductory survey on endangered languages; the students were not specifically trained in linguistics, and only a few had done any reading about the Mixtec languages. I have done a bit of reading about the Mixtec languages, and was familiar with many of the transcription conventions (since they’re largely the same as what’s found in Josserand (1983)), so I am confident my transcription can serve as the gold standard.


Measured by average Levenshtein distance, the 56 transcribers’ transcriptions of Chalcatongo Mixtec matched my own surprisingly well. The lowest-scoring transcriber reached about a 65% match with the gold transcription, and a little less than half of the transcribers matched gold at over 90%. The much more conservative “Item match” criterion measures the percentage of entire sentences that matched the gold standard exactly, and even there over half of all transcribers got over half of the sentences they transcribed exactly right.
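
For readers who want to see how scores like these can be computed, here is a minimal Python sketch of both measures. It is an illustration under assumptions, not the exact script behind the numbers above: in particular, turning an edit distance into a percentage by dividing by the length of the longer string is one common choice that I am assuming here, and the function names are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(transcription: str, gold: str) -> float:
    """Edit distance rescaled to a 0..1 match score.
    Assumption: normalize by the length of the longer string."""
    longest = max(len(transcription), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(transcription, gold) / longest


def item_match(transcriptions: list[str], golds: list[str]) -> float:
    """Fraction of items that match the gold standard exactly."""
    if not golds:
        return 0.0
    exact = sum(t == g for t, g in zip(transcriptions, golds))
    return exact / len(golds)
```

A transcriber’s overall score would then be the mean of `similarity` over the sentences they transcribed, while `item_match` gives no credit at all to a sentence with even one wrong character, which is why it is so much more conservative.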

The importance of providing quality instructions is highlighted by the 14 transcribers who have item match scores below 10%. These transcribers did not follow the substitution key, which instructed them to, e.g., type <š> as <sh>, and instead inserted the special character itself.
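
One way to keep this kind of mismatch from swamping the scores would be to apply the substitution key to every transcription before comparing it to gold. The sketch below is hypothetical and was not part of the analysis reported here; the only pair it includes is the <š> to <sh> example mentioned above, and the name `apply_key` is invented for illustration.

```python
import unicodedata

# Only the one pair mentioned in this post; a full key would list more.
SUBSTITUTION_KEY = {"š": "sh"}


def apply_key(text: str, key: dict[str, str] = SUBSTITUTION_KEY) -> str:
    """Replace special characters with their typed equivalents."""
    # NFC first, so that "s" followed by a combining caron is treated
    # the same as the single precomposed character "š".
    text = unicodedata.normalize("NFC", text)
    for special, typed in key.items():
        text = text.replace(special, typed)
    return text
```

Running both the gold transcription and each submission through the same function before scoring would separate real misreadings of the handwriting from simple failures to follow the key.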

Interested in transcribing some handwritten Mixtec documents yourself? Give it a whirl at my Kathryn Josserand Mixtec Language Surveys Collection on our FromThePage instance. You can transcribe a few pages as a guest before having to create an account. This is very much a beta of the crowdsourcing approach, and any feedback is welcome.