Milan Gritta

London, England, United Kingdom
2K followers · 500+ connections

About

A mixture of Science, Engineering, Business and Leadership skills, I prefer companies…

Activity


Experience

Education

  • University of Cambridge

    Activities and Societies: Data Science, Natural Language Processing, AI, Linguistics, Applied Machine Learning, Leadership, Philosophy, Human Behaviour Sciences

    Panda Alert - Bioinformatics, NLP, Deep Machine Learning Research, Information Extraction

  • -

    Activities and Societies: Cambridge University Entrepreneurs, 80,000 Hours, President of Fitzwilliam College Entrepreneurs Society, Cambridge University Technology and Enterprise Club

  • -

    Activities and Societies: Learning to Lead Scheme; led a team of six to win the software engineering project ahead of almost 20 other student teams.

Publications

  • Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems

    Transactions of the Association for Computational Linguistics

    Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, the current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. Aiming to reduce all the detected limitations, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages to enable training and evaluation of multilingual and cross-lingual ToD systems. We describe a complex bottom–up data collection process that yielded the final dataset, and offer the first sets of baseline scores across different ToD-related tasks for future reference, also highlighting its challenging nature.

  • A Systematic Study of Performance Disparities in Multilingual Task-Oriented Dialogue Systems

    Achieving robust language technologies that can perform well across the world's many languages is a central goal of multilingual NLP. In this work, we take stock of and empirically analyse task performance disparities that exist between multilingual task-oriented dialogue (ToD) systems. We first define new quantitative measures of absolute and relative equivalence in system performance, capturing disparities across languages and within individual languages. Through a series of controlled experiments, we demonstrate that performance disparities depend on a number of factors: the nature of the ToD task at hand, the underlying pretrained language model, the target language, and the amount of ToD annotated data. We empirically prove the existence of the adaptation and intrinsic biases in current ToD systems: e.g., ToD systems trained for Arabic or Turkish using annotated ToD data fully parallel to English ToD data still exhibit diminished ToD task performance. Beyond providing a series of insights into the performance disparities of ToD systems in different languages, our analyses offer practical tips on how to approach ToD data collection and system development for new languages.
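    The equivalence measures described above can be illustrated with a small sketch. The function names, the reference language, and the 90% tolerance below are illustrative assumptions, not the paper's exact definitions:

```python
def absolute_equivalence(scores, threshold):
    """A language passes absolute equivalence if its task score
    meets a fixed quality bar, regardless of other languages."""
    return {lang: s >= threshold for lang, s in scores.items()}

def relative_equivalence(scores, reference="en", tolerance=0.9):
    """A language passes relative equivalence if it reaches a given
    fraction of the reference (source) language's score."""
    ref = scores[reference]
    return {lang: s >= tolerance * ref for lang, s in scores.items()}

# Illustrative intent-accuracy scores for a multilingual ToD system.
scores = {"en": 0.92, "ar": 0.81, "tr": 0.84}
# en and tr pass; ar falls short of 90% of the English score.
print(relative_equivalence(scores))
```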

  • PanGu-Coder: Program Synthesis with Function-Level Language Modeling

    arXiv preprint arXiv:2207.11280

    We present PanGu-Coder, a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation, i.e. the synthesis of programming language solutions given a natural language problem description. We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling (CLM) to pre-train on raw programming language data, while the second stage uses a combination of Causal Language Modelling and Masked Language Modelling (MLM) training objectives that focus on the downstream task of text-to-code generation and train on loosely curated pairs of natural language program definitions and code functions. Finally, we discuss PanGu-Coder-FT, which is fine-tuned on a combination of competitive programming problems and code with continuous integration tests. We evaluate PanGu-Coder with a focus on whether it generates functionally correct programs and demonstrate that it achieves equivalent or better performance than similarly sized models, such as CodeX, while attending a smaller context window and training on less data.
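    The second-stage focus on text-to-code generation can be sketched as a loss mask over a concatenated prompt-and-code sequence. This is one common way to implement such an objective; the helper below is illustrative, not the paper's exact recipe:

```python
IGNORE = -100  # conventional label value meaning "no loss at this position"

def make_labels(prompt_ids, code_ids):
    """Build labels for a concatenated [NL prompt][code] sequence:
    the model still reads the prompt, but the causal LM loss is only
    computed on the code tokens, focusing training on text-to-code."""
    return [IGNORE] * len(prompt_ids) + list(code_ids)

prompt = [101, 7, 8, 9]   # toy token ids for the problem description
code = [55, 56, 57]       # toy token ids for the solution function
print(make_labels(prompt, code))  # [-100, -100, -100, -100, 55, 56, 57]
```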

  • CrossAligner & Co: Zero-Shot Transfer Methods for Task-Oriented Cross-lingual Natural Language Understanding

    Findings of the Association for Computational Linguistics: ACL 2022

    Task-oriented personal assistants enable people to interact with a host of devices and services using natural language. One of the challenges of making neural dialogue systems available to more users is the lack of training data for all but a few languages. Zero-shot methods try to solve this issue by acquiring task knowledge in a high-resource language such as English with the aim of transferring it to the low-resource language(s). To this end, we introduce CrossAligner, the principal method of a variety of effective approaches for zero-shot cross-lingual transfer based on learning alignment from unlabelled parallel data. We present a quantitative analysis of individual methods as well as their weighted combinations, several of which exceed state-of-the-art (SOTA) scores as evaluated across nine languages, fifteen test sets and three benchmark multilingual datasets. A detailed qualitative error analysis of the best methods shows that our fine-tuned language models can zero-shot transfer the task knowledge better than anticipated.

  • XeroAlign: Zero-Shot Cross-lingual Transformer Alignment

    Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

    The introduction of pretrained cross-lingual language models brought decisive improvements to multilingual NLP tasks. However, the lack of labelled task data necessitates a variety of methods aiming to close the gap to high-resource languages. Zero-shot methods in particular, often use translated task data as a training signal to bridge the performance gap between the source and target language(s). We introduce XeroAlign, a simple method for task-specific alignment of cross-lingual pretrained transformers such as XLM-R. XeroAlign uses translated task data to encourage the model to generate similar sentence embeddings for different languages. The XeroAligned XLM-R, called XLM-RA, shows strong improvements over the baseline models to achieve state-of-the-art zero-shot results on three multilingual natural language understanding tasks. XLM-RA's text classification accuracy exceeds that of XLM-R trained with labelled data and performs on par with state-of-the-art models on a cross-lingual adversarial paraphrasing task.
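    The core idea, encouraging similar sentence embeddings for a sentence and its translation, can be sketched as an auxiliary loss added to the task loss. The squared-error distance and the weighting below are assumptions for illustration, not necessarily the paper's exact formulation:

```python
import numpy as np

def alignment_loss(src_emb, tgt_emb):
    """Auxiliary objective: pull the sentence embedding of a
    translation towards that of its English source sentence."""
    return float(np.mean((src_emb - tgt_emb) ** 2))

def total_loss(task_loss, src_emb, tgt_emb, weight=1.0):
    """Task loss on labelled English data plus the alignment term
    computed on unlabelled translated sentence pairs."""
    return task_loss + weight * alignment_loss(src_emb, tgt_emb)

en = np.array([0.2, 0.5, -0.1])   # toy sentence embedding, English
de = np.array([0.1, 0.6, -0.1])   # toy embedding of its translation
print(round(total_loss(0.7, en, de), 4))  # 0.7067
```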

  • Conversation Graph: Data Augmentation, Training and Evaluation for Non-Deterministic Dialogue Management

    Transactions of the Association for Computational Linguistics

    Task-oriented dialogue systems typically rely on large amounts of high-quality training data or require complex handcrafted rules. However, existing datasets are often limited in size considering the complexity of the dialogues. Additionally, conventional training signal inference is not suitable for non-deterministic agent behavior, namely, considering multiple actions as valid in identical dialogue states. We propose the Conversation Graph (ConvGraph), a graph-based representation of dialogues that can be exploited for data augmentation, multi-reference training and evaluation of non-deterministic agents. ConvGraph generates novel dialogue paths to augment data volume and diversity. Intrinsic and extrinsic evaluation across three datasets shows that data augmentation and/or multi-reference training with ConvGraph can improve dialogue success rates by up to 6.4%.
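    The graph-based augmentation idea can be sketched as follows: merge observed dialogues into one graph keyed by dialogue state, then walk the graph to generate action sequences that recombine edges from different source dialogues. The state and action encodings here are illustrative, not the paper's:

```python
import random
from collections import defaultdict

def build_graph(dialogues):
    """Merge observed dialogues into one graph: each node is a
    dialogue state, each edge an (action, next_state) pair."""
    graph = defaultdict(set)
    for turns in dialogues:
        for state, action, next_state in turns:
            graph[state].add((action, next_state))
    return graph

def sample_path(graph, start, max_len=5, seed=0):
    """Walk the merged graph to produce a novel action sequence
    that may combine edges from different source dialogues."""
    rng = random.Random(seed)
    path, state = [], start
    while state in graph and len(path) < max_len:
        action, state = rng.choice(sorted(graph[state]))
        path.append(action)
    return path

dialogues = [
    [("greet", "ask_area", "area_known"), ("area_known", "offer", "end")],
    [("greet", "ask_price", "price_known"), ("price_known", "offer", "end")],
]
g = build_graph(dialogues)
# "greet" now has one outgoing edge from each source dialogue, so
# sampling can yield action sequences seen in neither original.
print(sample_path(g, "greet"))
```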

  • A Pragmatic Guide to Geoparsing Evaluation

    Under review @ Springer (Language Resources and Evaluation)

    The manuscript introduces an evaluation framework in three parts. Part 1) Task Definition: clarified via corpus linguistic analysis proposing a fine-grained Pragmatic Taxonomy of Toponyms with new guidelines. Part 2) Evaluation Data: shared via a dataset called GeoWebNews to provide test/train data to enable immediate use of our contributions. In addition to fine-grained Geotagging and Toponym Resolution (Geocoding), this dataset is also suitable for prototyping machine learning NLP models. Part 3) Metrics: discussed and reviewed for a rigorous evaluation with appropriate recommendations for NER/Geoparsing practitioners.
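    Two metrics commonly reviewed in geoparsing evaluation, error distance and accuracy within a distance threshold, can be sketched as follows. The 161 km threshold (roughly 100 miles) is a convention from the literature; this is an illustration, not necessarily the paper's recommended setup:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between predicted and gold coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def accuracy_at_k(errors_km, k=161):
    """Fraction of toponyms resolved within k km of the gold location."""
    return sum(e <= k for e in errors_km) / len(errors_km)

# Melbourne, Australia vs Melbourne, Florida: the cost of resolving
# the toponym to the wrong candidate.
err = haversine_km(-37.81, 144.96, 28.08, -80.61)
print(round(err))  # thousands of km, an unambiguous miss
```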

  • Which Melbourne? Augmenting Geocoding with Maps.

    Association for Computational Linguistics 2018

    ACL 2018 paper presented in Melbourne, Australia. The purpose of text geolocation is to associate geographic information contained in a document with a set (or sets) of coordinates, either implicitly by using linguistic features and/or explicitly by using geographic metadata combined with heuristics.
    Topics: Geoparsing, Toponym Resolution, Geocoding, Named Entity Recognition, Information Retrieval, Machine Learning, Deep Learning.

  • Vancouver Welcomes You! Minimalist Location Metonymy Resolution

    Association for Computational Linguistics

    Outstanding Paper Award 2017 @ ACL
    We show how a minimalist neural approach combined with a novel predicate window method can achieve competitive results on the SemEval 2007 task on Metonymy Resolution. Additionally, we contribute with a new Wikipedia-based MR dataset called RelocaR, which is tailored towards locations as well as improving previous deficiencies in annotation guidelines.

  • What’s missing in geographical parsing?

    Springer - Language Resources and Evaluation

    In this study, we evaluate and analyse the performance of a number of leading geoparsers on a number of corpora and highlight the challenges in detail. We also publish an automatically geotagged Wikipedia corpus to alleviate the dearth of (open source) corpora in this domain.


Honors & Awards

  • Outstanding Paper Award at www.acl2017.org

    Association for Computational Linguistics

    LINK: https://acl2017.wordpress.com/2017/06/02/wednesday-2-august-detailed-program/

    TITLE: Vancouver Welcomes You! Minimalist Location Metonymy Resolution

    RESOURCES: https://github.com/milangritta/Minimalist-Location-Metonymy-Resolution

  • Natural Environment Research Council Doctoral Studentship

    NERC Centre for Doctoral Training in Data, Risk and Environmental Analytical Methods

    Bioinformatics, NLP, DTAL, MML Cambridge

  • Winners of AppathonUK 2014

    Founders4Schools

    Two-day hackathon building apps to help children tackle growing illiteracy. The app, Clever Creatures, lets children (ages 3-7) practise spelling and reading, learn new words, do basic maths, and win rewards along the way.

Test Scores

  • IELTS

    Score: 8 out of 9

    International English Language Testing System (Academic)

Languages

  • German

    Limited working proficiency

  • Slovak

    Native or bilingual proficiency

  • English

    Native or bilingual proficiency
