Mixtepec-Mixtec Corpus and Lexicography

This project documents the Mixtepec-Mixtec language variety (Sa'an Savi; ISO 639-3: mix), abbreviated MIX below. MIX is spoken by roughly 9,000-10,000 people in the Juxtlahuaca district of Oaxaca, Mexico, and is scarcely described in the published linguistic literature. In the most general terms, the project aims to support the preservation and documentation of the language and, potentially, to provide a fully described source of language data for use in efforts to revitalize it. The first version of the corpus is being encoded in TEI XML; the eventual goal is to encode and release the final version of the language resources (LR) in a format compatible with linked open data and the semantic web.
The foremost component of the project is the creation of a portable (reusable) corpus of Mixtepec-Mixtec that can serve as a basis for any efforts in pursuit of these aims. Central to this process are the collection and markup of all publicly available, pre-existing LR and the creation of original multimedia and text-based LR. The other major priority is to establish an accurate inventory of the lexical and grammatical features of the language, based on the data created and collected.
The primary LR created in this project comprise over 1000 '.wav' files of spoken MIX from conversations, interviews, and utterances elicited through translation-, image-, and media-based prompts. Most of these resources originate from consultation sessions with two native speakers from Yucanani, a town of around 500 people in the municipality of San Juan de Mixtepec. The vocabulary covered spans a large number of semantic domains and includes a wide array of construction types.
External LR that have been integrated into this project include several publicly available videos featuring spoken MIX and, most significantly, a collection of 33 written children's texts published by SIL Mexico (Summer Institute of Linguistics). Some of these materials are designed primarily for young MIX speakers, for use as classroom handouts and/or lessons at the primary/elementary school level. The publications cover both culturally specific and non-culturally-specific subject matter, supplemented by illustrations; topics include vocabulary, mathematics, telling time, geography, seasons, weather, and local agricultural practices, among others. These publications make up the second most important collection in this project. Other external LR to be integrated into the corpus include some lexical data from a small number of unpublished academic presentations on this language by Mary Pastor and Rosemary Beam de Azcona (2004-2007).
For all spoken language data, phonetic transcriptions (IPA and ZSAMPA) and orthographic transcriptions (using the working Mixtec orthographic conventions) are created and, where possible, gloss annotations are recorded. Each utterance is transcribed, and where recordings or media files exist for a given utterance, they are linked to its transcription. As of 2014-08-12, this information is encoded in two separate TEI documents: one for language data obtained from utterances and recordings created in this project, and the other for language data obtained from publicly available MIX utterances and recordings.
Project Visibility: Public - accessible to all site users