The Natural Language Processing Lab
Created: Damir Cavar, 2023-06-07
Last change: Damir Cavar, 2024-08-17
Ellipsis and related phenomena, in which words are elided or omitted from sentences and utterances, are of great interest from the perspective of theoretical linguistics and of the cognitive language faculty. As a starting point, we recommend The Oxford Handbook of Ellipsis and the research articles, books, and dissertations discussed in its sections. Further relevant articles are listed below under publications and on the websites of the various ellipsis corpus projects.
There are various reasons why we are working on ellipsis and other word-omitting phenomena. Some of those are:
We will soon provide research reports here with quantified data related to these claims. The claims are strong, but in our experience the limited usefulness of phrase structure and dependency parsers is closely related to their failure to process Dark Matter in Language. Semantic and pragmatic approaches could certainly be tried to reconstruct omitted linguistic content; we focus on syntactic and pattern-based methods with neural and symbolic algorithms, modeling the fast and slow processing of elided linguistic content by the human language faculty.
Our goals are ambitious:
The corpus format is documented here: The Hoosier Ellipsis Corpus - Data Format
We identified the following resources online:
If you have more links or want to share your data sets, please send a note to dcavar at iu.edu.
The main corpus code and data links can be found at:
Damir Cavar, Zoran Tiganj, Ludovic Mompelat, Billy Dickson (2024) Computing Ellipsis Constructions: Comparing Classical NLP and LLM Approaches. In Proceedings of the 2024 Meeting of the Society for Computation in Linguistics (SCiL).
Damir Cavar, Ludovic V. Mompelat, Muhammad S. Abdo (2024) The Typology of Ellipsis: A Corpus for Linguistic Analysis and Machine Learning Applications. Paper to be presented at the ACL Special Interest Group on Typology (SIGTYP) 2024, co-located with the 18th Conference of the European Chapter of the Association for Computational Linguistics, St Julian’s, Malta (March 2024). (full paper)
Vance Holthenrichs, Damir Cavar, Zoran Tiganj, Billy Dickson (2024) On Ellipsis in Slavic: The Ellipsis Corpus and Natural Language Processing Results. Presented at Formal Approaches to Slavic Linguistics 33 (FASL33), Halifax, Canada, 16-19 May 2024. (slides)
Damir Cavar, Ludovic Mompelat, Muhammad S. Abdo (2024) The Hoosier Ellipsis Corpus (HELC): Documenting Linguistic Dark Matter. Poster presented at the Midwest Speech and Language Days at the University of Michigan in Ann Arbor, April 15-16, 2024. (poster)
Muhammad S. Abdo, Damir Cavar (2024) The Hoosier Ellipsis Corpus: Building a Corpus of Ellipsis for Arabic Natural Language Processing. Poster presented at the Midwest Speech and Language Days at the University of Michigan in Ann Arbor, April 15-16, 2024. (poster)
This presentation covers ellipsis constructions in Arabic and three types of experiments, using Logistic Regression, BERT-type classifiers and guessers, and GPT-4 (ChatGPT) Large Language Models (LLMs), to guess whether a sentence contains ellipsis, where the ellipsis is located, and what the elided words are:
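The three guessing tasks named above (detection, location, and recovery of the elided words) can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the EllipsisGuess record, the reconstruct and score helpers, and the English gapping sentence are hypothetical and do not reflect the actual corpus format, the Arabic data, or the evaluation used in the experiments.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EllipsisGuess:
    """Hypothetical record for one sentence; not the corpus's actual format."""
    tokens: List[str]                    # the sentence as written
    has_ellipsis: bool                   # task 1: does it contain ellipsis?
    position: Optional[int] = None       # task 2: token index of the gap
    elided: Optional[List[str]] = None   # task 3: the missing words


def reconstruct(guess: EllipsisGuess) -> List[str]:
    """Return the token sequence with the elided words put back at the gap."""
    if not guess.has_ellipsis or guess.position is None or not guess.elided:
        return list(guess.tokens)
    return guess.tokens[:guess.position] + guess.elided + guess.tokens[guess.position:]


def score(gold: EllipsisGuess, predicted: EllipsisGuess) -> dict:
    """Compare a prediction with the gold annotation, one flag per task.

    Each task only counts as solved if the previous one was solved too.
    """
    detected = gold.has_ellipsis == predicted.has_ellipsis
    located = detected and gold.position == predicted.position
    recovered = located and gold.elided == predicted.elided
    return {"detected": detected, "located": located, "recovered": recovered}


# Textbook-style English gapping example, invented for illustration only.
gold = EllipsisGuess(
    tokens="Mary ordered coffee and John tea".split(),
    has_ellipsis=True,
    position=5,
    elided=["ordered"],
)
# A hypothetical system output that finds the gap but guesses the wrong verb.
predicted = EllipsisGuess(
    tokens=gold.tokens, has_ellipsis=True, position=5, elided=["drank"]
)

print(" ".join(reconstruct(gold)))
print(score(gold, predicted))
```

Scoring the tasks as a cascade is one possible design choice: it keeps a system from being credited for guessing the right words at the wrong place. Other setups, for example scoring each task independently, are equally plausible.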