Literary corpora

The National Laboratory for Digital Heritage is producing three literary corpora with machine-generated annotations. The ELTE Poetry Corpus presents Hungarian canonical poetry, the ELTE Novel Corpus presents Hungarian canonical and less canonical fiction, and the ELTE Drama Corpus presents Hungarian canonical and less canonical drama. The three corpora make it possible to look at Hungarian literature from the perspective of “distant reading”, i.e. to examine a very large number of texts on the basis of their quantitative characteristics. This quantitative approach to literary texts can complement the more traditional methods of literary studies with new perspectives and can raise questions that it has not been possible to answer so far.

In addition to the texts, the corpora contain annotations of the structural units of the texts, as well as the lemma, the part of speech, and the morphosyntactic features of each word. The Poetry Corpus also contains a number of additional annotations related to the versification of the poems. The texts, together with their annotations, can be downloaded in TEI XML format from the corpora's GitHub site and used freely for research purposes. Each of the three corpora also has an online query interface with a variety of search functions, which anyone can use free of charge. The easy-to-use search functions give access to quantitative data on words, parts of speech, and other grammatical features without requiring any IT knowledge. The three corpora will be continuously extended with additional texts, and we also plan to add further annotation layers. We hope that the literary corpora will be useful not only in research but also in other contexts, such as public education.
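For readers who prefer to work with the downloadable TEI XML files directly, the following minimal Python sketch shows how word-level annotations could be counted. It rests on assumptions: it supposes that each word is encoded as a TEI `<w>` element carrying `lemma` and `pos` attributes (the standard TEI attributes for linguistic annotation), and it uses a small invented sample rather than a real corpus file. The actual encoding in the ELTE corpora may differ, so the element and attribute names should be checked against the downloaded files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace; TEI documents declare it on the root element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample for illustration only, NOT taken from the corpora.
# Assumption: words are <w> elements with `lemma` and `pos` attributes.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <lg type="stanza">
      <l><w lemma="a" pos="DET">A</w> <w lemma="nap" pos="NOUN">nap</w> <w lemma="süt" pos="VERB">süt</w></l>
      <l><w lemma="a" pos="DET">A</w> <w lemma="szél" pos="NOUN">szél</w> <w lemma="fúj" pos="VERB">fúj</w></l>
    </lg>
  </body></text>
</TEI>"""


def count_attribute(root, attribute):
    """Count the values of one attribute over all <w> elements under root."""
    counts = Counter()
    for word in root.iterfind(".//tei:w", TEI_NS):
        value = word.get(attribute)
        if value is not None:  # skip words that lack this annotation
            counts[value] += 1
    return counts


root = ET.fromstring(SAMPLE)
pos_counts = count_attribute(root, "pos")      # part-of-speech frequencies
lemma_counts = count_attribute(root, "lemma")  # lemma frequencies

for pos, n in pos_counts.most_common():
    print(pos, n)
```

To apply the same function to a downloaded file, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE)`, where `path` is the location of a local TEI XML file.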