ParTy: a parallel corpus for typology

The corpus is freely available on GitHub (link).

The ParTy corpus currently contains subtitles of various films and TED talks in fifteen or more languages. The corpus was created for typological and contrastive purposes and is particularly suited to the comparison of European languages. All files were downloaded from online repositories and aligned automatically at the level of sentences or their smaller constituents. The corpus is updated continually.

Differences from existing parallel corpora

The ParTy corpus differs from other massively parallel corpora (see Cysouw & Wälchli 2007), such as Europarl, translations of the Bible or fiction, UN legal documents, etc., mainly in its informal register. For the film subtitles component, this claim is empirically supported by a clustering analysis of n-grams in original and translated English film subtitles and other registers of spoken and written English (see link). The analysis shows that both original and translated subtitles are linguistically highly similar to the informal conversations that form part of the British National Corpus and the Santa Barbara Corpus of Spoken American English. The main difference from Tiedemann's Opus corpus is that I tried to collect subtitles for the same film or talk in as many languages as possible, rather than focusing on a particular language pair.


The files are in plain-text format. Each file represents one language aligned with English. The sentence IDs (the left column) and line numbers correspond to the same English sentence in all files, so one can easily find equivalent fragments in any of the languages. The equivalents were identified automatically with the alignment software 'subalign' created by Jörg Tiedemann (download).
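
Because the sentence IDs are shared across files, looking up equivalents comes down to reading each file into a table keyed by ID. The Python sketch below illustrates the idea under assumptions that are mine and not taken from the corpus itself: one sentence per line, with the ID and the text separated by a tab. The function names and the sample sentences are hypothetical, so please check the actual files and adjust the separator if the layout differs.

```python
from typing import Dict, Iterable


def read_aligned(lines: Iterable[str], sep: str = "\t") -> Dict[str, str]:
    """Map sentence ID -> text for one language file.

    Assumes each non-empty line is '<ID><sep><text>'; the tab separator
    is an assumption, not a documented property of the corpus.
    """
    sentences: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        sent_id, _, text = line.partition(sep)
        sentences[sent_id.strip()] = text.strip()
    return sentences


def equivalents(corpus: Dict[str, Dict[str, str]], sent_id: str) -> Dict[str, str]:
    """Collect the text for one sentence ID in every language that has it."""
    return {
        lang: sents[sent_id]
        for lang, sents in corpus.items()
        if sent_id in sents
    }


# Hypothetical sample data standing in for two language files.
french = ["1\tBonjour.\n", "2\tMerci beaucoup.\n"]
german = ["1\tGuten Tag.\n", "2\tVielen Dank.\n"]

corpus = {"fr": read_aligned(french), "de": read_aligned(german)}
print(equivalents(corpus, "2"))
```

With real files, the lists above could be replaced by open file handles, for example read_aligned(open(path, encoding="utf-8")), where path is whatever file name the corpus uses for the language in question.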

Please let me know if you have found this corpus useful. Any comments and suggestions will be very much appreciated!


This corpus was created as part of my project "Mapping the causative continuum: A multivariate typological investigation of causative constructions based on a multilingual parallel corpus", supported by a grant from the F.R.S. - FNRS (Belgium).