Natural Language Processing (NLP)

Developing NLP tools is a priority. Machine analysis of the Hungarian language is a prerequisite both for exploring and preserving the national digital heritage and for the commercial use of the specialized tools developed.

The aim of this work is to create new tools (or further develop existing ones) that can be used not only in our own projects but also in a wider context, including by humanities scholars with limited IT skills.

Our main ongoing sub-projects are the gold standard corpus, HTR (handwritten text recognition), and Huwikifier.

Gold standard corpus

Large, high-quality text databases that can serve as training data for machine learning are a fundamental requirement for computational language processing. For this reason, DH-LAB’s main objective is to create a gold standard corpus: a manually annotated reference corpus of Hungarian, with linguistic annotation covering the main levels of analysis from lemmatization to syntactic analysis.
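
To make the idea of layered annotation concrete, here is a minimal sketch of how one manually annotated sentence might be represented, from lemma up to syntactic head. The `Token` class, its field names, and the tag labels are illustrative assumptions loosely modeled on common dependency-treebank conventions; the annotation scheme actually used in the corpus is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One annotated token (hypothetical layout, not the corpus format)."""
    index: int      # 1-based position in the sentence
    form: str       # surface form as it appears in the text
    lemma: str      # dictionary form
    pos: str        # part-of-speech tag (illustrative labels)
    head: int       # index of the syntactic head, 0 for the root
    relation: str   # dependency relation to the head (illustrative labels)


# "A kutyák ugatnak." ("The dogs bark."), annotated by hand for illustration.
sentence = [
    Token(1, "A", "a", "DET", 2, "det"),
    Token(2, "kutyák", "kutya", "NOUN", 3, "nsubj"),
    Token(3, "ugatnak", "ugat", "VERB", 0, "root"),
    Token(4, ".", ".", "PUNCT", 3, "punct"),
]


def lemmas(tokens):
    """Return the lemma layer of a sentence."""
    return [t.lemma for t in tokens]


def root(tokens):
    """Return the token that heads the whole sentence."""
    return next(t for t in tokens if t.head == 0)
```

Keeping every layer on the same token record means a tool can read only the layers it needs, for example just the lemmas for training a lemmatizer.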

HTR

The HTR sub-project aims to provide Hungarian researchers with a tool for the automatic processing and digitization of manuscripts. Our first model has been trained on the manuscripts of János Arany, but we are also working on the digitization of tabular data (e.g. handwritten birth records) in cooperation with the Hungarian National Archives.

Huwikifier

The Huwikifier sub-project aims to create a service that finds and disambiguates Wikipedia entities mentioned in a text. Software that uses the service can then use these entities to enrich the text, making it easier to integrate and search.
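
The two steps described above, finding a mention and choosing between candidate entities, can be sketched with a toy example. This is not the Huwikifier implementation: the `CANDIDATES` table, the keyword sets, and the `link_entities` function are hypothetical, and a simple keyword-overlap heuristic stands in for whatever method the service actually uses.

```python
import re

# Hypothetical candidate table: lowercased surface form -> list of
# (entity title, context keywords that suggest this reading).
CANDIDATES = {
    "arany": [
        ("Arany János", {"költő", "vers", "ballada"}),  # the poet
        ("Arany", {"fém", "ékszer", "bánya"}),          # the metal (gold)
    ],
}


def link_entities(text):
    """Return (mention, chosen entity title) pairs for known surface forms.

    Each candidate is scored by how many of its keywords occur in the
    text; on a tie, the first candidate listed wins.
    """
    tokens = re.findall(r"\w+", text.lower())
    context = set(tokens)
    links = []
    for token in tokens:
        if token in CANDIDATES:
            best = max(CANDIDATES[token], key=lambda c: len(c[1] & context))
            links.append((token, best[0]))
    return links
```

For instance, `link_entities("Arany híres költő volt.")` picks the poet, while `link_entities("Az arany értékes fém.")` picks the metal, because a different keyword appears in each context. A real service would also need to handle Hungarian inflection, which this exact-match sketch ignores.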