11 November 2026
Slovenia
Europe/Ljubljana timezone
Registration is required!

Course provider: Jožef Stefan Institute (JSI), Department of Knowledge Technologies (E8)
Participating organisations:
Common Language Resources and Technology Infrastructure CLARIN.SI; CLARIN knowledge centre for South Slavic languages CLASSLA
Instructors:
Nikola Ljubešić (E8 JSI), Taja Kuzman Pungeršek (E8 JSI)

Language data are the foundation for developing and evaluating modern language technologies, including large language models, automatic speech recognition models, machine translation systems and chatbots. Training data determine which languages a model can handle and what kinds of biases are reflected in its behavior. High-quality data are essential not only for initial model training but also for adapting models to specific tasks, such as question answering or text classification, where large volumes of carefully designed and manually annotated examples are required. Finally, reliable and representative data are indispensable for evaluating model capabilities in the target language, such as Slovenian. This lecture introduces the open language data provided through the CoreTrustSeal-certified CLARIN.SI repository as a key resource for responsible and effective development of language technologies.

Learning objectives: The main objective of the course is to inform small enterprises and language technology developers about the availability of open language data that can be used across all stages of AI development and evaluation. Participants will become familiar with the CLARIN.SI infrastructure, which serves as the central national hub for language resources in Slovenia.

Course content: The lecture provides an overview of the types of open language data used in language technology development, including massive text corpora, speech data and evaluation datasets. Special attention is given to the CLARIN.SI repository, which contains approximately 700 language resource entries, around 400 of them focused on the Slovenian language, together comprising about 9 terabytes of data. The lecture presents concrete examples of widely used resources, such as the large web-based text corpora used for training large language models, instruction-tuning and task-specific datasets, automatic speech recognition resources, and evaluation benchmarks for Slovenian and other South Slavic languages.

Learning outcomes: After the lecture, participants will understand the central role of language data in shaping the performance and reliability of language technologies. They will gain a clear overview of what the CLARIN.SI repository offers and how its resources can be effectively leveraged for the development and evaluation of AI systems for Slovenian and South Slavic languages. Participants will also be aware that the CLARIN.SI repository can be used to deposit their own language data, ensuring long-term archiving, increased visibility and reuse, and compliance with data management plan requirements.

Conference information

Location

Slovenia

Chairpersons

  • Nikola Ljubešić
  • Taja Kuzman Pungeršek

Extra information

Language: Slovenian; English; Croatian

Prerequisites: none
Target audience: developers of language technologies and AI systems, including large language models and speech technologies; companies and developers who require text and speech data for model development, adaptation, or evaluation; researchers in the fields of language