Speaker
Description
Soup-to-Nuts is a program that automatically induces a lexicon of MultiWord Expressions (MWEs) from a corpus and re-tokenizes the corpus based on the MWEs that were found. I will discuss how the program works and give a demo to show how it performs.
Choueka (1988) was the first to induce a phrasal lexicon from a corpus, and that work is highly related. Bigrams and longer n-grams (up to six-grams) were extracted from a 10-million-word corpus of the New York Times, and the n-grams were processed using an algorithm similar to my own: rejecting candidates that contain a closed-class word, identifying and rejecting chunks, and rejecting candidates that contain numbers, dates, and times. Schone & Jurafsky (2001) also induced a multiword lexicon from a corpus, using a 6.7-million-word subset of the TREC databases (a set of corpora used for evaluating information retrieval systems). These corpora are too small. To put things in perspective, McKeown et al. (2017) note that, starting in fourth grade, the average frequency of the words to be acquired is one per million tokens or less. MWEs are usually less frequent than individual words, which makes the problem even harder for vocabulary acquisition of MWEs. Frequency is thus a problem both for computational acquisition of MWEs and for acquisition by children.

Many factors are important for inducing a phrasal lexicon. I compared three lexical association metrics: Pointwise Mutual Information (Church & Hanks, 1990), Log Likelihood (Dunning, 1993), and Mutual Rank Ratio (Deane, 2005), and found that Mutual Rank Ratio (MRR) was the best for inducing a lexicon. The expressions ranked highest by Pointwise Mutual Information were very infrequent, such as clupea harengus, balaena mysticetus, secale cereale, and erodium cicutarium; these are expressions that can occur in a dictionary, but they are mostly unfamiliar. In contrast, the expressions ranked highest by Log Likelihood included of the, for the, and such as. The expressions ranked highest by MRR included prime minister, science fiction, San Francisco, and human rights. Filtering candidates that start or end with a closed-class word, and filtering inflectional variants, made a major improvement in recognizing good candidates.
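To make the comparison concrete, here is a minimal sketch (my own illustration, not the program's actual code) of how the three metrics can be computed for a bigram from simple counts. The rank-ratio computation only approximates Deane's (2005) formulation, and all function names and data structures here are assumptions:

    import math

    def pmi(c_xy, c_x, c_y, n):
        """Pointwise Mutual Information (Church & Hanks, 1990): log2 of p(xy) / (p(x)p(y))."""
        return math.log2((c_xy * n) / (c_x * c_y))

    def log_likelihood(c_xy, c_x, c_y, n):
        """Dunning's log-likelihood (G^2) statistic over the 2x2 bigram contingency table."""
        k = [[c_xy, c_x - c_xy],
             [c_y - c_xy, n - c_x - c_y + c_xy]]
        rows = [sum(k[0]), sum(k[1])]
        cols = [k[0][0] + k[1][0], k[0][1] + k[1][1]]
        g2 = 0.0
        for i in range(2):
            for j in range(2):
                if k[i][j] > 0:
                    g2 += k[i][j] * math.log(k[i][j] / (rows[i] * cols[j] / n))
        return 2.0 * g2

    def rank_ratio(target, anchor, cooc, global_freq):
        """Simplified rank ratio (after Deane, 2005): expected rank / actual rank of
        `target` among the words that co-occur with `anchor`."""
        neighbours = cooc[anchor]                       # {word: co-occurrence count}
        by_cooc = sorted(neighbours, key=neighbours.get, reverse=True)
        by_freq = sorted(neighbours, key=lambda w: global_freq[w], reverse=True)
        return (by_freq.index(target) + 1) / (by_cooc.index(target) + 1)

    def mutual_rank_ratio(x, y, cooc, global_freq):
        """Geometric mean of the two rank ratios for the bigram (x, y)."""
        return math.sqrt(rank_ratio(y, x, cooc, global_freq) *
                         rank_ratio(x, y, cooc, global_freq))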
However, even with this filtering, none of the association metrics was effective when evaluated on a large corpus with the frequency threshold set to capture at least 50% of the attested MWEs. This was true for two different dictionaries used as a gold standard. I found it was essential to use multiple corpora. I evaluated four datasets: 1) a download of Wikipedia; 2) a download of 30,000 books from Project Gutenberg; 3) Medline, a corpus of titles and abstracts from the biomedical literature; and 4) Juris, a corpus of legal text. Candidates with an MRR score greater than 2.0 that appear in two or more corpora were identified, and these candidates were re-ranked by dispersion (the number of corpora in which the candidate occurs): candidates that occur in all four corpora were ranked first, then those that occur in three, and then those in two. This more than doubled the average precision.
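A minimal sketch of the dispersion step, assuming each corpus has already produced a table of candidates with their MRR scores (the within-group tie-breaking by best score is my assumption; the talk only specifies ordering by dispersion):

    def rerank_by_dispersion(scored_corpora, threshold=2.0):
        """Keep candidates whose MRR score exceeds `threshold` in at least two corpora,
        then rank by dispersion (number of corpora), breaking ties by best score."""
        support = {}                                    # candidate -> scores from corpora where it passed
        for scores in scored_corpora:                   # one {candidate: mrr} dict per corpus
            for cand, mrr in scores.items():
                if mrr > threshold:
                    support.setdefault(cand, []).append(mrr)
        ranked = [(cand, len(s), max(s)) for cand, s in support.items() if len(s) >= 2]
        ranked.sort(key=lambda t: (t[1], t[2]), reverse=True)
        return ranked

    # Toy scores for the four corpora used in the evaluation:
    wikipedia = {"prime minister": 7.3, "of the": 0.4}
    gutenberg = {"prime minister": 6.1, "science fiction": 3.2}
    medline   = {"prime minister": 2.4, "high density lipoprotein": 9.0}
    juris     = {"prime minister": 4.8, "human rights": 5.5}
    print(rerank_by_dispersion([wikipedia, gutenberg, medline, juris]))
    # [('prime minister', 4, 7.3)]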
In addition to the association metric, we also need to consider morphological variation. This is important for selecting a normal form for the headword and for identifying and rejecting chunks. A chunk is a part of a multiword expression that does not have an identity of its own, for example Osama bin and Leonardo da. In contrast, bin Laden and da Vinci are not chunks because they have an independent existence: we can look for bin Laden, or say I saw a da Vinci in a museum. We need to recognize that automated teller is a chunk because it is a component of automated teller machine, and that Yellowstone National is a chunk because it is a component of Yellowstone National Park.
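One rough heuristic for flagging such chunks (my own illustration, not necessarily the test the program uses) is to check whether a candidate ever occurs outside the longer expressions that contain it:

    def is_chunk(candidate, freq, containing_mwes, independence_ratio=0.1):
        """Heuristic: treat `candidate` as a chunk if it almost never occurs
        outside the longer expressions that contain it.

        freq             -> {ngram: corpus frequency}
        containing_mwes  -> longer candidate MWEs that contain `candidate`
        """
        inside = sum(freq[m] for m in containing_mwes)   # occurrences inside longer MWEs
        independent = freq[candidate] - inside           # occurrences on their own
        return independent < independence_ratio * freq[candidate]

    freq = {"osama bin": 1000, "osama bin laden": 995, "bin laden": 1500,
            "automated teller": 410, "automated teller machine": 400}
    print(is_chunk("osama bin", freq, ["osama bin laden"]))   # True: no identity of its own
    print(is_chunk("bin laden", freq, ["osama bin laden"]))   # False: occurs independently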
But we also see automated teller machines, and we see Yellowstone National Parks because of contexts like Grand Teton and Yellowstone National Parks. Sometimes the plural form occurs more frequently (e.g., civil rights, human rights). I use the most frequent form, singular or plural, as the norm for an individual corpus, and the form chosen by the greatest plurality of corpora when creating a common inventory. In tokenization, all morphological and orthographic variants are treated as an equivalence class. There are at least three types of chunks: 1) expressions like Osama bin and Leonardo da that are part of a single longer expression; 2) expressions like density lipoprotein that are part of both high density lipoprotein and low density lipoprotein; both of these expressions are supported by acronyms (HDL, LDL), and they are important in biomedical natural language processing; 3) expressions such as Institute of Technology and Bureau of Investigation, which represent a generative pattern in which a typed variable is part of the longer expression (e.g., Indian, Massachusetts, National, Federal). Dictionaries often include typed variables such as someone or something in their definitions, and these examples illustrate other generative patterns that can occur in MWEs.
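Returning to the choice of normal form, here is a minimal sketch of one way the per-corpus and cross-corpus decisions could be made (the data layout and the plurality-voting rule are my reading of the description above, not the program's actual code):

    from collections import Counter

    def normal_form(variant_counts_per_corpus):
        """Pick a headword for an MWE from its morphological/orthographic variants.
        Within each corpus the most frequent variant wins; across corpora the form
        chosen by the most corpora (a plurality) becomes the common headword."""
        votes = Counter()
        for counts in variant_counts_per_corpus:         # one {variant: frequency} dict per corpus
            votes[max(counts, key=counts.get)] += 1
        return votes.most_common(1)[0][0]

    # Toy example: the plural wins in three of the four corpora.
    per_corpus = [
        {"human rights": 900, "human right": 40},
        {"human rights": 300, "human right": 60},
        {"human right": 20, "human rights": 10},
        {"human rights": 120},
    ]
    headword = normal_form(per_corpus)                   # -> "human rights"
    # During re-tokenization, every variant maps to the same headword token:
    equivalence_class = {"human right": headword, "human rights": headword}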
The most important issue to address is compositionality. There is previous work on using machine learning to help with this (Roberts & Egg, 2018; Cordeiro et al., 2019), but I believe a larger and more focused effort is needed. I am in the process of preparing a dataset that is divided into strong idioms (hot dog), weak idioms (room temperature), and compositional expressions. The compositional expressions are annotated with the type of relationship, using Lexical Functions (Mel'čuk, 2023) and Qualia relations (Pustejovsky, 1998). The aim of these annotations is to increase confidence that an MWE candidate does or does not belong in the lexicon.
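As a sketch of what one record in such a dataset might look like (the field names and label values here are hypothetical, not the dataset's actual schema):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MWEEntry:
        expression: str
        category: str                             # "strong_idiom" | "weak_idiom" | "compositional"
        lexical_function: Optional[str] = None    # e.g., Magn (Mel'čuk)
        qualia_relation: Optional[str] = None     # e.g., telic, constitutive (Pustejovsky)

    entries = [
        MWEEntry("hot dog", "strong_idiom"),
        MWEEntry("room temperature", "weak_idiom"),
        MWEEntry("bread knife", "compositional", qualia_relation="telic"),
    ]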
The primary purpose of re-tokenization is to create associations between MWEs and other MWEs (e.g., Abraham_Lincoln and Emancipation_Proclamation), and between MWEs and individual words (e.g., Abraham_Lincoln and slavery). The goal of creating these associations is to develop better methods for teaching and assessing vocabulary. Most work on vocabulary assessment focuses on breadth, but previous work has used lexical associations to assess depth of knowledge as well (McKeown et al., 2017). I am extending that work to include associations involving MWEs. The demo will show some of the associations that were identified using the program.
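A minimal sketch of the re-tokenization step, using greedy longest-match lookup against the induced lexicon (the matching strategy and lowercased lookup are my simplifications; the variant equivalence classes described above are omitted here):

    def retokenize(tokens, mwe_lexicon, max_len=4):
        """Greedy longest-match re-tokenization: replace any token span found in
        `mwe_lexicon` with a single underscore-joined token."""
        out, i = [], 0
        while i < len(tokens):
            for n in range(min(max_len, len(tokens) - i), 1, -1):
                if " ".join(tokens[i:i + n]).lower() in mwe_lexicon:
                    out.append("_".join(tokens[i:i + n]))
                    i += n
                    break
            else:
                out.append(tokens[i])
                i += 1
        return out

    lexicon = {"abraham lincoln", "emancipation proclamation"}
    print(retokenize("Abraham Lincoln signed the Emancipation Proclamation".split(), lexicon))
    # ['Abraham_Lincoln', 'signed', 'the', 'Emancipation_Proclamation']

Once the corpus is re-tokenized, association scores like the ones sketched earlier can be computed between MWE tokens and other tokens.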
Soup-to-Nuts is open-source and will be distributed via GitHub.