NLP Lab


The Natural Language Processing Lab

View the Project on GitHub dcavar/nlp-lab.github.io

Ellipsis and Elided Elements in Natural Language: The Hoosier Ellipsis Corpus

Created: Damir Cavar, 2023-06-07

Last change: Damir Cavar, 2024-01-27

Ellipsis and related phenomena, in which words are elided or omitted from sentences and utterances, are of great interest from the perspective of both theoretical linguistics and the cognitive language faculty. As a general starting point, we recommend The Oxford Handbook of Ellipsis and the numerous research articles, books, and dissertations discussed in its sections. Further relevant articles are listed under Publications below and on the websites of the ellipsis corpus projects listed under Online Resources.

We work on ellipsis and other word-omission phenomena for several reasons. Some of them are:

We will soon provide research reports here with quantified data related to these claims. The claims are strong, but our experience has shown that the limited use of phrase structure and dependency parsers is closely related to their failure to process Dark Matter in Language. While semantic and pragmatic approaches could certainly be used to reconstruct omitted linguistic content, we focus on syntactic and pattern-based methods using neural and symbolic algorithms, modeling the fast and slow processing of elided linguistic content by the human language faculty.

Our goals are ambitious:

The corpus format is documented here: The Hoosier Ellipsis Corpus - Data Format

Online Resources

We identified the following resources online:

If you know of further resources or want to share your data sets, please send us a note at dcavar at iu.edu.

The Hoosier Ellipsis Corpus (THEC)

The main corpus code and data links can be found at:

Publications

Presentations

This presentation covers ellipsis constructions in Arabic and three types of experiments, using Logistic Regression, BERT-type classifiers and guessers, and the GPT-4 (ChatGPT) Large Language Model (LLM), to guess whether a sentence contains ellipsis, where the ellipsis is located, and what the elided words are: