Investigating Idiomaticity in Word Representations

Wei He, Tiago Kramer Vieira
Idiomatic expressions are an integral part of human languages, often used to express complex
ideas in compressed or conventional ways (e.g. eager beaver as a keen and enthusiastic person).
However, their interpretations may not be straightforwardly linked to the meanings of their
individual components in isolation and this may have an impact for compositional approaches.
In this paper, we investigate to what extent word representation models are able to go beyond
compositional word combinations and capture multiword expression idiomaticity and some of
the expected properties related to idiomatic meanings. We focus on noun compounds of varying
levels of idiomaticity in two languages (English and Portuguese), presenting a dataset of minimal
pairs containing human idiomaticity judgments for each noun compound at both type and token
levels, their paraphrases and their occurrences in naturalistic and sense-neutral contexts, totalling
32,200 sentences. We propose this set of minimal pairs for evaluating how well a model captures
idiomatic meanings, and define a set of fine-grained metrics of Affinity and Scaled Similarity, to
determine how sensitive the models are to perturbations that may lead to changes in idiomaticity.
Affinity is a comparative measure of the similarity between an experimental item, a target and
a potential distractor, and Scaled Similarity incorporates a rescaling factor to magnify the
meaningful similarities within the spaces defined by each specific model. The results obtained with
a variety of representative and widely used models indicate that, despite superficial indications
to the contrary in the form of high similarities, idiomaticity is not yet accurately represented in
current models. Moreover, the performance of models with different levels of contextualisation
suggests that their ability to capture context is not yet able to go beyond more superficial lexical
clues provided by the words and to actually incorporate the relevant semantic clues needed
for idiomaticity. By proposing model-agnostic measures for assessing the ability of models to
capture idiomaticity, this paper contributes to determining limitations in the handling of non-
compositional structures, which is one of the directions that needs to be considered for more
natural, accurate and robust language understanding. The source code and additional materials
related to this paper are available at our GitHub repository.1
1. Introduction
The evolution of word representation models has resulted in models with seemingly remarkable
language abilities. Not surprisingly, these models have been found to store a wealth of linguistic
information (Henderson 2020; Manning et al. 2020; Vulić et al. 2020; Lenci et al. 2022), displaying
high levels of performance on various tasks: even static models are able to detect semantic
similarities between different words (Lin 1999; Mikolov et al. 2013; Baroni, Dinu, and Kruszewski
2014), while contextualised models can group representations into clusters that seem to correspond
to the various senses of a word (Schuster et al. 2019) and can be matched to specific sense
definitions (Chang and Chen 2019). While substantial evaluation efforts
have concentrated on word and subword units and on larger compositional combinations derived
from them, there is less understanding of their ability to handle less compositional structures,
such as those found in multiword expressions (MWEs), like noun compounds (NCs) (Garcia et al.
2021a), verb-noun combinations (King and Cook 2018; Hashempour and Villavicencio 2020) and
idioms (Yu and Ettinger 2020; Dankers, Lucas, and Titov 2022). Indeed, MWEs include a variety
of distinct phenomena and have been described as interpretations that cross word boundaries (Sag
et al. 2002), whose meanings are not always straightforwardly derivable from the meanings of
their individual components. Moreover, although they include, on the one hand, more transparent
and compositional expressions (like salt and pepper) or expressions with implicit relations (like
olive oil as oil made from olives), on the other hand they also include more idiomatic expressions
(like eager beaver as a person who is willing to work very hard2 ), falling into a continuum
of idiomaticity3 (Sag et al. 2002; Fazly, Cook, and Stevenson 2009). This leads to potential
problems for models if they follow the Principle of Compositionality (Frege 1956; Montague
1973), building the meaning of a larger unit (like a sentence or an expression) from a combination
of the individual meanings of the words that are contained in it, as this would result in potentially
incomplete or incorrect interpretation for more idiomatic cases (e.g. the idiomatic eager beaver
interpreted literally as impatient rodent). Although understanding the meaning of an MWE may
require knowledge that goes beyond that of the meanings of these individual words in isolation
(Nunberg, Sag, and Wasow 1994), failure to take idiomaticity into account can affect the quality
of downstream tasks (Sag et al. 2002; Constant et al. 2017; Cordeiro et al. 2019) such as reasoning
and inference (Chakrabarty, Choi, and Shwartz 2022; Chakrabarty et al. 2022; Saakyan et al.
2022), information retrieval (Acosta, Villavicencio, and Moreira 2011) and machine translation
(Dankers, Lucas, and Titov 2022). For machine translation, for example, the degree of idiomaticity
1 https://github.com/risehnhew/Finding-Idiomaticity-in-Word-Representations
2 Definition from the Cambridge dictionary.
3 We understand idiomaticity as semantic opacity, and its continuum as different degrees of opacity.
and ambiguity of MWEs (literal vs. idiomatic usages) were found to have an impact on the quality
of the results obtained (Dankers, Lucas, and Titov 2022). Due to their non-compositional nature,
idiomatic expressions result in lower quality translations than literal expressions, as evidenced
by lower BLEU scores for translations that are paraphrased rather than translated word-for-word.
In this paper, we investigate to what extent widely used word representation models are able to
capture idiomaticity in MWEs. We focus, in particular, on their initial abilities for representing
idiomaticity, looking at noun compounds of varying degrees of idiomaticity.4 In addition to the
complex interactions between MWEs, their component words and their contexts (Sag et al. 2002),
characteristics of languages and of word representation models may affect how accurately MWEs
can be represented and processed, and we investigate the impact of some of these factors for
compounds in two different languages (English and Portuguese).
One of the challenges is that uncovering how word representation models capture a specific
type of knowledge is a non-trivial problem (Vulić et al. 2020), and may depend on factors like the
particular model and the way it encodes different types of linguistic information (Yu and Ettinger
2020). For instance, while in Transformer-based models, the initial layers seem to represent
more lexical level knowledge and the final layers seem to capture more semantic and pragmatic
information (Rogers, Kovaleva, and Rumshisky 2020), determining where phenomena which sit
at the interface of various levels are encoded, like multiword expressions (Sag et al. 2002), is
challenging since they could potentially involve information distributed across different layers.
Moreover, the possible findings from an investigation about where in the architecture of a given
model idiomaticity is encoded, or about the role of particular components in representing it
may not generalise to other models and architectures. In this paper we propose instead a set of
model-agnostic idiomatic probes for assessing the representation of idiomaticity. These probes
contain NCs of different levels of idiomaticity, ranging from idiomatic to compositional cases,
which form the basis for minimal pairs. In these pairs, one sentence contains an NC and the other
contains a semantically related item (such as a synonym) or a distractor. The hypothesis is that if a
model is able to accurately represent an NC, higher similarities will be observed for minimal pairs
involving NCs and their synonyms (e.g. for the idiomatic eager beaver and hardworking person).
Conversely, for minimal pairs with variants that may incorporate changes in meaning, such as
those containing NCs and synonyms of their individual component words (e.g. the idiomatic eager
beaver and impatient rodent) or other distractors, lower similarities should be observed.
As word representation models may form spaces that are anisotropic (Ethayarajh and Jurafsky
2021) with representations concentrating on parts of the space, or may have rogue dimensions
that dominate similarity measures (Timkey and van Schijndel 2021), these could lead to high
similarities overall (Liu et al. 2020), affecting the ability to distinguish meaningful similarities
from spurious ones arising from specific characteristics of a given space. In this paper, we propose
two new measures to assess idiomaticity within a model while taking into account its potential for
high similarities. The first, Assessment of Feature Familiarity and Idiomatic Nuance by Interpreting
Target Yielding (Affinity), takes two representations of different levels of relatedness to a given
target, and can be used to determine if a model accurately reflects their degree of similarity to
the target. Focusing on idiomaticity, we use Affinity to assess if greater similarities are observed
for NCs and related words (in this case their synonyms), than for NCs and other potentially less
related alternatives including distractors. The second measure, Scaled Similarity, determines a
4 We use the off-the-shelf publicly available pre-trained versions of widely adopted word representation models,
standard operations and common similarity measures. Even in scenarios in which adopting additional optimisations,
more complex operations or fine-tuning could lead to improvements in performance, this may depend on the
availability of comprehensive training data for the target model, domain and language. Measuring the initial
idiomatic abilities of models can help understand the potential loss of idiomatic meaning that could be propagated to
the downstream tasks that use them off-the-shelf.
new lowerbound for a given space in terms of similarities for unrelated representations, rescaling
the space to help distinguish them from the meaningful similarities for related representations.
For idiomaticity, we analyse the similarities between the NCs and their synonyms adopting the
similarities between the NCs and random items as a new lowerbound. These measures of Affinity
and Scaled Similarity do not directly address the problem of rogue dimensions, and we discuss
this further in the Conclusions section.
Using these metrics and minimal pairs for evaluation, this paper presents a fine-grained
analysis of the ability of a model to capture idiomaticity. The main contributions of this paper are:
• The Noun Compound Idiomaticity Minimal Pairs (NCIMP) Dataset, a dataset of minimal
pair sentences containing NCs of varying levels of idiomaticity, along with human judgments
about the degree of NC idiomaticity and gold standard paraphrases, at both type and token
level. In total, the dataset contains 32,200 sentences for two languages (19,600 in English
and 12,600 in Portuguese).5
• A comparative measure of Affinity to help determine how accurately idiomaticity is
incorporated in these representations contrasting similarities for semantically related and
unrelated representations.
• A novel model-agnostic measure of Scaled Similarity, which rescales a space in relation to a
new lowerbound taking into account expected similarities among random items to magnify
meaningful similarities among semantically related representations.
• In-depth analyses of the representation of idiomaticity in widely used word representation
models, examining their ability to display sensitivity to changes in idiomaticity.
5 This work extends the idiomatic probes proposed by Garcia et al. (2021b) and the type and token annotations by
Garcia et al. (2021a), also introducing new measures, additional tests and substantially expanding the analyses with
new baselines and results from a larger set of models.
The remainder of this paper is organised as follows: Section 2 presents related work, and
Section 3 the NCIMP dataset (Section 3.1), the models (Section 3.2) and the proposed idiomatic
probes and measures (Section 3.3). Finally, in Section 4 we discuss the results of our experiments
and draw conclusions in Section 5.
2. Related Work

A variety of vector models have been used to investigate the representation of multiword
expressions (MWEs), ranging from static to contextualised representations, each with its own
set of challenges (Contreras Kallens and Christiansen 2022; Garcia et al. 2021a; Liu and Neubig
2022). The former include models like Word2Vec (Mikolov et al. 2013), GloVe (Pennington,
Socher, and Manning 2014) and fastText (Bojanowski et al. 2017), which represent words at
type-level, producing a single vector for each word that conflates all its senses. At this level,
MWEs are often represented based on their overall syntactic and semantic properties as they are
generally understood, without taking into account the variability of contexts. For example, both
the literal and the idiomatic meaning of gold mine6 would be represented jointly in a single vector
regardless of its use in any specific sentence. At the other end of the scale are the contextualised
models, from ELMo (Peters et al. 2018), BERT (Devlin et al. 2019) and GPT-3 (Brown et al. 2020)
to LLaMA (Touvron et al. 2023) and other large language models, which produce token-level
dynamic representations dedicated to capturing specific usages of a word in a particular context,
resulting in several vectors for each word (Lenci et al. 2022; Apidianaki 2022). Token-level
representations focus on the specific occurrences of words or subwords within contexts, and how
their meaning or function may vary or be influenced by the surrounding text. Therefore, they have
the potential for accurately representing MWEs, capturing the interdependence of the idiomatic
meaning on a particular configuration of words, while also anchoring the MWEs in relation to their
immediate linguistic environment. The primary challenge at token-level is accurately determining
the presence, meaning and role of MWEs in specific contexts, especially when they have possibly
multiple literal and idiomatic readings or when they are part of complex syntactic structures (Zeng
and Bhat 2021).
Evaluation of successive generations of word representation models, ranging from static
(Landauer and Dumais 1997; Lin 1999; Baroni and Lenci 2010; Mikolov et al. 2013; Bojanowski
et al. 2017) to contextualised models (Peters et al. 2018; Devlin et al. 2019; Brown et al.
2020; Touvron et al. 2023), has devoted considerable attention to their linguistic abilities
(Mandera, Keuleers, and Brysbaert 2017; Wang et al. 2018; Henderson 2020; Rogers, Kovaleva,
and Rumshisky 2020; Lenci et al. 2022). On lexical semantics, the representations extracted
from contextualised models seem to be able to reflect word senses in clusters of vectors (e.g.,
Wiedemann et al. (2019) for BERT) including in cross-lingual alignments involving polysemous
words (e.g., Schuster et al. (2019) for ELMo). However, controlled evaluations of different
generations of word representation models in uniform settings have also reported strong
performances from static models, which were able to outperform contextualised models in most
tasks (Lenci et al. 2022).
6 “Opportunity for making a lot of money” (definition from the Cambridge dictionary).
7 See Miletić and Walde (2024) for a recent survey on the representation of MWEs in Transformer-based models.
In addition to the level of contextualisation, the performance of vector space models may also
be affected by the way the target words of an expression are composed, with functions like sum,
concatenation and multiplication used for combining the words of static models (Cordeiro et al.
2019; Mitchell and Lapata 2010; Reddy, McCarthy, and Manandhar 2011) or the subwords of
contextualised models (Garcia et al. 2021b). For the embeddings extracted from language models,
other potential sources of variation include which input is given to the model (e.g., one vs. several
sentences including the target MWE in evaluations at the type level), or the number of layers that
will be taken into account to obtain the vector representation (Miletić and Walde 2024). In this
regard, the intermediate and last layers seem to encode more semantic information at the token
level (Tenney, Das, and Pavlick 2019; Garcia 2021), while other evaluations at the type level found
that averaging the initial layers for the target expressions achieved the best results (e.g., Miletic
and Schulte im Walde (2023) for NCs and Vulić et al. (2020) for single-word semantic tasks). With
respect to semantic composition, Yu and Ettinger (2020) explored the type-level representation
of two-word phrases (which in many cases correspond to NCs like the ones used in our study) in
various contextualised models, showing that phrase representations miss compositionality effects
as they heavily rely on word content. Similar conclusions, for neural machine translation, can
be inferred from Dankers, Lucas, and Titov (2022). While some of these evaluations rely on
substitutivity and the changes to a larger phrase representation caused by substitutions to its
constituents (Garcia et al. 2021b; Yu and Ettinger 2020), alternatively, the notion of localism has
also been analysed (Liu and Neubig 2022) focusing on whether the operations of a model are
local (Hupkes et al. 2021), that is, the extent to which the representation of a phrase is derivable
from its local structure.
Crucially, a substantial number of the discussed studies evaluate idiomaticity at the type
level, i.e., they obtain the embedding of a given MWE by averaging its representations in several
sentences that have been previously extracted automatically. A more detailed controlled
comparison of type-level and token-level idiomaticity reported compatible results for both levels,
with type-level being a close approximation for token-level (Garcia et al. 2021a) in sentences
where the NC occurs with the same sense. Further analysis of the occurrences of these NCs in
fine-grained sense annotations of literal and idiomatic usages (Tayyar Madabushi et al. 2021)
provided additional confirmation that the ability of contextualised models to capture idiomaticity
during pre-training was limited, with approaches for building single token representations (Phelps
et al. 2022) and for fine-tuning leading to more accurate representations (Tayyar Madabushi
et al. 2022). Recent alternatives for representing idiomatic expressions also include adding a
new adapter module which has been developed and trained to recognise idioms (Zeng and Bhat
2022). This module functions as a language expert for idioms, augmenting the learning process of
BART (Lewis et al. 2019) with additional information, and this approach effectively improves the
representation of idiomatic expressions in off-the-shelf pre-trained language models, equipping
them with greater ability to navigate the intricacies of natural language. Zeng and Bhat (2023) also
proposed PIER+, a language model improvement for handling both literal and figurative language.
This is achieved by combining a base model with an additional curriculum learning framework
that gradually introduces more complex potentially idiomatic expressions. Compared to other
models, PIER+ demonstrates better performance at identifying, understanding, and maintaining
proficiency in both types of expressions. Finally, Zeng et al. (2023) introduce a knowledge graph
designed to enhance the understanding of idiomatic expressions, which integrates commonsense
knowledge to aid in deciphering the non-literal meanings of idioms. This work demonstrates
how to inject MWE-related knowledge into pre-trained language models effectively. However,
it is still unclear to what extent the context and its representation in contextualised models are
able to capture idiomatic meanings.
Shedding some light on these questions requires a more controlled evaluation setup and measures
that can abstract away from the particularities of these word representation spaces. In this effort,
we take inspiration in psycholinguistic methodologies, which have been traditionally used to
examine how humans process language in controlled experimental setups, to allow the removal
of obvious biases and potentially confounding factors from evaluations (Linzen, Dupoux, and
Goldberg 2016; Gulordava et al. 2018). They also enable comparative analyses of performance in
artificially constructed but controlled sentences and in naturally occurring sentences.
Setups like these have been used, for instance, to investigate how models represent syntax,
if they understand negation (van Schijndel and Linzen 2018; Prasad, van Schijndel, and Linzen
2019; Ettinger 2020; Kassner and Schütze 2020), and if they are aware of which properties are
relevant for which concepts (Misra, Rayz, and Ettinger 2023). Adopting evaluation protocols that
use minimal pair sentences (e.g., Warstadt et al. (2020); Misra, Rayz, and Ettinger (2023)) allows
for a controlled comparison of the target item against carefully selected distractors that may share
linguistic properties with them. For instance, a dataset of Conceptual Minimal Pair Sentences
(COMPS) was used to compare the performance of 22 large language models including both
masked language models (like BERT) and autoregressive language models (like GPT-2), where
the models have to validate which of two concepts a given property belongs to (e.g. stripes for
zebras vs. oaks). Although the models seem to obtain relatively high accuracies for attributing
properties to concepts, when semantically related concepts are involved or distractors are included,
performances drop substantially, and go below chance even for models like GPT-3 (Misra, Rayz,
and Ettinger 2023). Similarly, in targeted syntactic evaluation (Marvin and Linzen 2018), models
are assessed using minimal pairs datasets focused on specific syntactic phenomena, such as those
included in the BLiMP dataset for English (Warstadt et al. 2020). Analyses like these highlight
the importance of adding controls to the experimental setup to distinguish seemingly sophisticated
behaviour with high performances that give the illusion of knowledge from robust understanding
with access to meaning (Misra, Rayz, and Ettinger 2023; de Dios-Flores, Garcia Amboage, and
Garcia 2023). With this in mind, we follow Garcia et al. (2021b) and use minimal pairs to propose
a set of intrinsic evaluations including probes and affinity measures aimed at gaining a better
understanding of how vector space models represent MWEs with different degrees of semantic
compositionality in context.
Concerning experimental data, the first datasets to evaluate computational models were composed
of different types of multiword expressions annotated at the type-level (McCarthy, Keller, and
Carroll 2003; Venkatapathy and Joshi 2005). Further studies released annotations of MWEs in
context, such as the VNC-tokens dataset (Cook, Fazly, and Stevenson 2008), which includes
60 English verb-noun combinations occurring in almost 3,000 sentences annotated as idiomatic
or literal, or the IDIX corpus (Sporleder et al. 2010), with almost 6,000 labeled sentences of
78 expressions extracted from the BNC. Using a crowdsourcing platform, Reddy, McCarthy,
and Manandhar (2011) released a dataset with numerical ratings of the compositionality degree
of 90 noun compounds in English, which also includes the contribution of each component to
the meaning of the MWEs. Similar efforts were carried out for other languages, such as the
GhoSt-NN dataset for German (Schulte im Walde et al. 2016), or the NC Compositionality (NCC)
dataset (Cordeiro et al. 2019), which expanded the resource provided by Reddy, McCarthy, and
Manandhar (2011) with additional NCs for English, and new data for Portuguese and French.
Semi-automatic techniques combined with crowdsourced annotations were used to compile
MAGPIE (Haagsma, Bos, and Nissim 2020), a large resource of more than 50,000 sentences
with binary annotations at the token level of potentially idiomatic expressions. Similarly, the
AStitchInLanguageModels dataset (Tayyar Madabushi et al. 2021), used in the “SemEval-2022
Task 2: Multilingual Idiomaticity Detection and Sentence Embedding” (Tayyar Madabushi et al.
2022), also contains potentially idiomatic expressions annotated in naturalistic sentences.
Recently, Garcia et al. (2021a) and Garcia et al. (2021b) enriched the English and Portuguese
data of the NCC dataset with crowdsourced annotations of the compositionality degree of noun
compounds and their components at the token level, paraphrases of the NCs in context, and
different types of controlled replacements. These variants compose a large set of minimal pairs
which allow for the systematic exploration of the representation of idiomaticity in vector space
models.8
In this paper, we adopt the minimal pairs paradigm as one of the bases for the evaluation
and present the Noun Compound Idiomaticity Minimal Pairs dataset, which contains a set of
idiomatic probes to explore to what extent idiomaticity is captured in word representation models.
To do so, we rely on the datasets for English and Portuguese by Garcia et al. (2021a) and Garcia
et al. (2021b) and extend them with new semantically related variants and distractors and sets of
minimal pairs as discussed in the next section to conduct in-depth intrinsic evaluations.
3.1 The NCIMP Dataset

The Noun Compound Idiomaticity Minimal Pairs (NCIMP) dataset contains 32,200 sentences
targeting two-word NCs in two languages, 280 in English (EN) and 180 in Portuguese (PT), with
idiomatic (e.g. gravy train9 ), partly compositional (e.g., grandfather clock10 ), and compositional
(e.g., research project) NCs.11 For each NC, the dataset contains minimal pairs formed by a first
sentence with the target NC and a second sentence where the NC was replaced by an experimental
item. These experimental items were selected on the basis of MWE properties, like more limited
substitutability (or greater lexical fixedness), and can be used to determine if models are sensitive
to perturbations to these properties, and if this is affected by how idiomatic the NCs are. For
example, depending on the degree of lexical fixedness of an NC, the variants generated may not
fully retain its original meaning (e.g. panda car12 and ?bear automobile). In particular, we analyse
the following:
• $NC_{Syn}$: the minimal pairs are formed by the NC being replaced by one of the gold
standard synonyms provided holistically for the NC by the annotators (e.g. brain for grey
matter). In this case, we adopted the synonyms provided by the Noun Compound Senses
(NCS) dataset (Garcia et al. 2021b), which were selected on the basis of the most frequent
paraphrases given by native speaker annotators. These pairs are used to assess if the models
8 We refer to Ramisch (2023) for a recent review on MWEs processing, including datasets, and to Schulte im Walde
(2023) for a comprehensive overview on compositionality ratings for MWEs.
9 Referring to an easy way of making money without doing much work (Cambridge Dictionary).
10 A type of tall free-standing clock.
11 The NCIMP dataset is based on the Noun Compound Senses (Garcia et al. 2021b), the Noun Compound Type and
Token Idiomaticity (Garcia et al. 2021a) and the NC Compositionality (Cordeiro et al. 2019) datasets, significantly
extending them with new data.
12 Referring to a police car.
provide similar representations for NCs and their synonyms, even if they involve lexically
diverse surface forms.
• $NC_{WordsSyn}$: minimal pairs where each component word of the NC is replaced individually
by a synonym, generating new two-word compositional replacements (e.g. forming
alligator sobs for the NC crocodile tears by substituting alligator for crocodile and sobs for
tears). The synonyms were manually selected from WordNet (Miller 1995) for English,
and OpenWordNet (Rademaker et al. 2014) for Portuguese, and from online dictionaries of
synonyms where additional coverage was required. In cases of ambiguity (due to polysemy
or homonymy), the most common meaning of each component was selected. For each
NC, 5 compositional replacements were generated. These pairs are used to evaluate how
sensitive a model is to the conventionality and lexical fixedness of these NCs, especially the
more idiomatic ones, and if it can detect when the (idiomatic) meaning changes with the
replacements.
• $NC_{Comp}$: the minimal pairs are formed by replacing the NC by only one of its component
words, i.e., replacing the NC by its head in one minimal pair, and by the modifier in the other
pair (e.g. crocodile for crocodile tears and tears for crocodile tears). These pairs are used
to explore if the models can distinguish the cases where the meaning of an NC is related to
the meaning of a component (in more compositional cases) from those where it is not (in
more idiomatic cases).
• $NC_{Rand}$: the random replacement controlled by frequency is a two-word expression
in which the words are chosen to match the frequencies of the components of the target
NC. The frequency values were extracted from corpora (in this case ukWaC and brWaC)
as follows: we averaged the frequency of each NC and of its components ($f_{avg} = (f_{NC} +
f_{w1} + f_{w2})/3$), and extracted the compound with the closest average value (e.g. police car
and supermarket city), as sketched in the code after this list. For each NC, 5 random
replacements were used for each sentence. These pairs are used as controls to determine the
lowerbound similarities for the target NCs, avoiding the potential impact of any differences
in frequency.
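As an illustration of this selection procedure, here is a minimal sketch in Python; the frequency dictionary, candidate pool and function names are hypothetical placeholders rather than the dataset's actual pipeline.

```python
# Minimal sketch of the frequency-matched random replacement selection.
# `freq` is assumed to map expressions and words to corpus frequencies
# (e.g. counts from ukWaC or brWaC); all names here are illustrative.

def average_frequency(expression, freq):
    """f_avg = (f_NC + f_w1 + f_w2) / 3 for a two-word expression."""
    w1, w2 = expression.split()
    return (freq.get(expression, 0) + freq.get(w1, 0) + freq.get(w2, 0)) / 3

def closest_random_replacement(target_nc, candidates, freq):
    """Return the candidate two-word expression whose average frequency
    is closest to that of the target NC."""
    target_avg = average_frequency(target_nc, freq)
    return min(candidates,
               key=lambda cand: abs(average_frequency(cand, freq) - target_avg))
```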
The NCs were pre-selected by experts trying to maintain a balance between the 3 classes
(idiomatic, partial, and compositional),13 and they appear in the context of three naturalistic
sentences (Nat) from corpora that exemplify the same compound sense (Garcia et al. 2021a).
Using Amazon Mechanical Turk (for English) and a dedicated custom-built online platform (for
Portuguese), compositionality scores for each NC and its components were obtained following the
procedure of Reddy, McCarthy, and Manandhar (2011) and Cordeiro et al. (2019). A Likert scale
from 0 (idiomatic) to 5 (compositional) was used for the human judgements, and the resulting
scores were aggregated as the average over the different annotators (Garcia et al. 2021a).14
The annotators also provided synonyms or paraphrases for the NCs in these sentences, which
were used by language experts to manually generate the N CSyn variants (Garcia et al. 2021a).
These annotations, including the synonyms, were collected at two levels of granularity: a more
fine-grained token level, where annotations for each sentence are collected individually, and a
more rough-grained type level, where a single annotation for each NC is collected considering
all three sentences at once (Garcia et al. 2021b). This allows for analyses of the impact of each
individual context in the interpretation of the NC. A total of 8,725 annotations was obtained for
English (421 annotators, each labelling an average of 21 sentences, resulting in 10.4 annotations
13 The two-word compounds were selected to be representative cases of compositional NCs (meaning related to the two
words), partly idiomatic (meaning related to one of the words) and idiomatic (meaning unrelated to either of the two
words), as our aim is to investigate to what extent the degree of idiomaticity affects the ability of models to generate an
accurate representation. For English, the dataset contains 103, 88, and 89 idiomatic, partial, and compositional
expressions respectively, while for Portuguese it has 60 NCs per class.
14 On average, the compositionality scores were 0.95/2.34/4.13 for English, and 1.52/2.46/3.61 for Portuguese
(idiomatic/partial/compositional).
per sentence). In Portuguese, 5,091 annotations were provided by 33 annotators (with an average
of 154 annotated sentences per annotator, and 9.4 annotations per sentence).
# | NC | Sentence
1 | Original | John Paul II was an effective front man for the catholic church.
2 | $NC_{Syn}$ | John Paul II was an effective representative for the catholic church.
3 | $NC_{WordsSyn}$ | John Paul II was an effective forepart woman for the catholic church.
4 | $NC_{Comp}$ | John Paul II was an effective man for the catholic church.
  |  | John Paul II was an effective front for the catholic church.
5 | $NC_{Rand}$ | John Paul II was an effective battlefront serviceman for the catholic church.

Table 1: Naturalistic sentence containing the NC front man (in row 1) forming minimal pairs with
the sentences in rows 2-4, and with a control random baseline in row 5.
In addition, NCIMP also contains sense-neutral sentences (Neut) in which the NCs appear
in uninformative contexts containing only 5 words and following the pattern This is a/an <NC>
for English (e.g. “This is an eager beaver”) and the Portuguese equivalent Este/a é um(a) <NC>.15
These neutral contexts can be used to examine how much contextual information is added to a
representation in the more informative naturalistic contexts. Moreover, as some NCs may have
more than one meaning (e.g. fish story as either the literal aquatic tale or the idiomatic big lie),
they can also be used to determine the default usage elicited for the NC in the absence of any
informative contextual clues, in particular, whether it leans towards an idiomatic or a literal sense,
potentially serving as an indication of the predominant sense sampled during training.
Experts (native or near-native speakers with background in Linguistics) reviewed both
the naturalistic and the sense-neutral sentences in the minimal pairs, editing them if needed
for preserving grammaticality after substitution (e.g. revising gender, number and definiteness
agreement with determiners and adjectives). However, some of the variants generated may be
semantically nonsensical, especially those involving random replacements. Table 1 displays an
example with the original sentence in the first row and the relevant sentences for each of the
minimal pairs in the other rows.
Finally, each NC was also annotated with frequency, Pointwise Mutual Information (Church
and Hanks 1989) and Positive Pointwise Mutual Information values, calculated from the ukWaC
(2.25B tokens, Baroni et al. (2009)) and brWaC corpora (2.7B tokens, Wagner Filho et al. (2018)),
which can serve as approximations for their familiarity and conventionality.
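For reference, PMI and PPMI can be computed from raw corpus counts as in the sketch below; the count arguments are placeholders standing in for the ukWaC/brWaC statistics.

```python
import math

def pmi(count_pair, count_w1, count_w2, total_tokens):
    """PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ), from raw counts."""
    p_pair = count_pair / total_tokens
    p_w1 = count_w1 / total_tokens
    p_w2 = count_w2 / total_tokens
    return math.log2(p_pair / (p_w1 * p_w2))

def ppmi(count_pair, count_w1, count_w2, total_tokens):
    """Positive PMI: negative associations are clipped to zero."""
    return max(0.0, pmi(count_pair, count_w1, count_w2, total_tokens))
```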
3.2 Models

We evaluate representative static and contextualised models. For the former, we compare GloVe
and Word2Vec, using the official models for English, and the 300-dimension vectors for
Portuguese (Hartmann et al. 2017).
For the latter, we evaluate a large set of models, including the Bi-LSTM-based ELMo (Peters
et al. 2018), and several Transformer-based language models: BERT (Devlin et al. 2019) and
15 NCIMP also contains a second longer pattern of uninformative neutral sentences (10 words in English and 9 in
Portuguese) following the patterns This is what a/an <NC> is supposed to be and the Portuguese equivalent Isto é o
que um/uma <NC> deveria ser, to measure the potential impact of the length of the neutral context and of the
position of the NC in the sentence. As the two types of neutral sentences elicit similar results, in the paper we only
present the results for the short neutral sentences.
some of its variants, such as multilingual BERT (mBERT16) (Pires, Schlinger, and Garrette
2019), multilingual DistilBERT (mDistilB17) (Sanh et al. 2019) and multilingual Sentence-BERT
(mSBERT18) (Reimers and Gurevych 2019b). The recent flagship model LLaMA2 (Touvron et al.
2023) is also included in our experiments. OpenAI text embeddings (Neelakantan et al. 2022) are
included in the evaluations at sentence level only, as they can be accessed solely via the API19
rather than by direct inspection of the whole model, which would be required for the analyses at
NC level; these analyses are therefore not conducted for the OpenAI text embeddings. For ELMo, we use the small
model provided by Peters et al. (2018), and for Portuguese we adopt the weights released by
Quinta de Castro, Félix Felipe da Silva, and da Silva Soares (2018). For LLaMA2 and OpenAI’s
embeddings, we use the 13B version and text-embedding-ada-002 version, respectively. For all
other contextualised models, we use the pre-trained weights publicly available through Flair20
(Akbik et al. 2019) and HuggingFace21 (Wolf et al. 2020). For BERT-based models (and for
DistilB in English), we report the results obtained both by the multilingual uncased (ML) and
by monolingual models for English (large, uncased) and Portuguese (large, cased), all available
through HuggingFace.
3.2.1 Sentence and NC Embeddings. Embeddings for the whole sentence as well as for the NCs
are generated by averaging the (sub)word embeddings22 of the relevant tokens involved, according
to the model:
• for static models, the word embeddings are derived directly from the vocabulary, with
out-of-vocabulary words being ignored;
• for ELMo the output word embeddings are averaged, and the concatenation of its three
layers is adopted;
• for Transformer-based models, the word embeddings are generated by averaging the
representations of the sub-tokens, and we report results using the last four layers.23 A
minimal sketch of this pooling is given after this list.
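The sketch below illustrates this mean-pooling strategy for Transformer-based models using the HuggingFace transformers API; the checkpoint name and the way the NC span is located are illustrative assumptions rather than the exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint choice; any BERT-style model would work similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

def embed(sentence, nc=None):
    """Mean-pooled embedding of a sentence (or of the NC span inside it),
    averaging the sub-token vectors of the last four hidden layers."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states           # tuple of (1, T, D) tensors
    layers = torch.stack(hidden[-4:]).mean(dim=0)[0]  # (T, D), last four layers
    if nc is None:
        return layers.mean(dim=0)                     # whole-sentence embedding
    # Keep only sub-tokens whose character span falls inside the NC occurrence
    # (assumes the NC appears verbatim in the sentence).
    start = sentence.index(nc)
    end = start + len(nc)
    offsets = tokenizer(sentence, return_offsets_mapping=True)["offset_mapping"]
    idx = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end and e > s]
    return layers[idx].mean(dim=0)                    # NC-level embedding
```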
In general we adopt standard widely used configurations to determine what the landscape of
results is before any task optimisation, even if alternative tokenisation approaches (Gow-Smith
et al. 2022), dedicated representations for MWEs as single-tokens (Cordeiro et al. 2019; Phelps
et al. 2022) and different combinations of layers and weighting schemes (Reimers and Gurevych
2019a; Vulić, Korhonen, and Glavaš 2020; Rogers, Kovaleva, and Rumshisky 2020) may generate
better results in downstream tasks. Additional configurations were also extensively analysed and
as they produced qualitatively similar results, they are not included in the paper.
3.3 Idiomatic Probes and Measures

The general premises of this work, shared by many similar investigations, are that:
16 https://huggingface.co/google-bert/bert-base-multilingual-cased
17 https://huggingface.co/distilbert/distilbert-base-multilingual-cased
18 https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased
19 https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
20 https://github.com/flairNLP/flair
21 https://github.com/huggingface/transformers
22 In our preliminary experiments, we tested various pooling strategies, including max pooling, min pooling, the CLS
token from BERT, concatenation, and mean pooling. The performance was similar across these methods, but to
maintain simplicity and avoid complications from variable vector lengths, we chose mean pooling for the reported
experiments.
23 Extensive evaluation of the individual layers and their combination were performed, but as the results follow the trend
of those reported here, they are not included in the paper.
3.3.1 The probing strategies. To evaluate how word representation models deal with idiomaticity,
we propose a probing strategy where a target item in a sentence, in this case an NC, is
systematically replaced by a set of different paraphrases or probes (P), forming the minimal
pairs discussed in section 3.1. We then use similarity measures to compare the representation for
the sentence before and after replacing NC by P. Given the focus on idiomaticity we select a set of
probes specifically for the expected changes in meaning they would induce in a sentence, and we
refer to these potential changes in meaning as Linguistic Predictions (LPs). If the representations
generated by a model reflect these predictions, passing the probing tests, then we consider that
particular model as capturing to some extent the idiomatic meaning in NCs. The idiomatic probes
are defined as follows, where Comp is the average human annotation compositionality score:
• $P_{Syn}$ - The true synonym. The replacement is a single word or a two-word compositional
noun compound that closely represents the meaning of the target NC, forming the minimal
pair $NC_{Syn}$. Linguistic Prediction: after the replacement, the resulting sentence should be
a near perfect paraphrase of the original sentence. Therefore, high similarities are expected
for all minimal pairs independently of the degree of compositionality of the target NC, from
the more idiomatic grey matter (and brain) to the more literal economic aid (and financial
assistance), with no correlation expected with Comp.
• $P_{Comp}$ - The partial expression. The replacement is one of the component words of the
target compound, and in particular we consider the one that preserves most of the meaning,
forming the minimal pair $NC_{Comp}$. Linguistic Prediction: the resulting sentence may
preserve some of the original meaning for more compositional cases, but not for idiomatic
cases. Therefore, high similarities are only expected between minimal pairs involving
compositional and partly compositional cases (e.g. economic aid and aid, crocodile tears
and tears, but not for wet blanket and blanket or wet), with some correlation expected with
Comp.
• $P_{WordsSyn}$ - The literal synonyms of the individual NC components. The replacement
is a two-word expression formed from frequent out-of-context synonyms for each of the
component words of an NC when considered independently, forming the minimal pair
$NC_{WordsSyn}$. Linguistic Prediction: after replacement, the resulting sentence may not
preserve the meaning of the original sentence, especially for more idiomatic cases. Therefore,
higher similarities are only expected for minimal pairs involving more compositional NCs
(e.g. wedding day and marriage date but not eager beaver and restless rodent), with a high
correlation expected with Comp.

24 Other compositional operations and measures of distance were also used during these analyses, but with qualitatively
similar results, and have been omitted from the paper.
• $P_{Rand}$ - The random replacement controlled by frequency. The replacement is a two-word
expression where the words are chosen to match the frequencies of the components of the
target NC, forming the $NC_{Rand}$ minimal pair. Linguistic Prediction: after replacement, the
resulting sentence should not preserve the meaning of the original sentence, independently
of the level of idiomaticity of the original NC (e.g. for police car and supermarket city),
with no correlation expected with Comp. These linguistic predictions are summarised in
the sketch after this list.
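To make the protocol concrete, the linguistic predictions can be summarised in a small lookup table; this is only an informal sketch of the expectations stated above, not part of the released dataset.

```python
# Expected behaviour of each probe, following the Linguistic Predictions above.
LINGUISTIC_PREDICTIONS = {
    "P_Syn":      {"expected_similarity": "high for all NCs",
                   "expected_correlation_with_Comp": "none"},
    "P_Comp":     {"expected_similarity": "high only for (partly) compositional NCs",
                   "expected_correlation_with_Comp": "moderate"},
    "P_WordsSyn": {"expected_similarity": "high only for more compositional NCs",
                   "expected_correlation_with_Comp": "high"},
    "P_Rand":     {"expected_similarity": "low for all NCs",
                   "expected_correlation_with_Comp": "none"},
}
```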
For a more in-depth analysis of expected changes in meaning, we follow Garcia et al. (2021b)
comparing representations both at a macro sentence level and also at a micro NC level, analysing
the representations of NC (and its variants P) extracted from the context of the sentence. Although
any differences in meaning should be reflected both at sentence and at NC representation levels
(only magnified in the latter), this comparison aims to highlight the impact of the level of
granularity used when analysing idiomaticity.25
3.4 Metrics
3.4.1 The Human Compositionality score (Comp). Assuming a list of N NCs, chosen to provide
balanced test scenarios of different levels of idiomaticity, we denote by $NC_\alpha$, with $\alpha = 1, ..., N$, the
different NCs to be evaluated. The meaning of these NCs is exemplified by a set of $N \times M$
sentences $Sent_{\alpha\beta}$, with $\alpha = 1, ..., N$ and $\beta = 1, ..., M$ the sentence index. The dataset contains
$M=3$ naturalistic sentences to exemplify the use of each NC (see section 3.1), with each sentence
annotated by human judges according to the compositionality of the target NC in the sentence.
The resulting scores are denoted $Comp_{\alpha\beta j}$, with $\alpha = 1, ..., N$, $\beta = 1, ..., M$, and $j = 1, ..., A_{\alpha\beta}$,
where $A_{\alpha\beta}$ is the number of annotators for sentence $Sent_{\alpha\beta}$. $Comp_{\alpha\beta j}$ are integer values derived
from a Likert scale and range from 0 (totally idiomatic) to 5 (totally compositional). We define the
compositionality score for a specific $NC_\alpha$ as the average of the annotations for sentences $Sent_{\alpha\beta}$,

$$Comp(NC_\alpha) = \Big\langle \big\langle Comp_{\alpha\beta j} \big\rangle_{Annot} \Big\rangle_{Sent} \qquad (1)$$

where $\langle \cdots \rangle_{Sent}$ are averages over sentences and $\langle \cdots \rangle_{Annot}$ averages over annotations. These average
values are the gold standard in this work.
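Equation (1) is simply a nested average, first over annotators and then over the sentences of an NC; a minimal sketch, assuming the annotations are stored as one list of integer ratings per sentence:

```python
def comp_score(annotations_per_sentence):
    """Comp(NC) = mean over sentences of the mean over annotators,
    e.g. annotations_per_sentence = [[4, 5, 3], [5, 4], [4, 4, 5]]."""
    sentence_means = [sum(a) / len(a) for a in annotations_per_sentence]
    return sum(sentence_means) / len(sentence_means)
```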
3.4.2 The Similarity score (Sim). Probing the meaning of a compound $NC_\alpha$ in a sentence $Sent_{\alpha\beta}$
requires the generation of a new set of modified sentences $Sent^{P_i}_{\beta\gamma}$ where $NC_\alpha$ is replaced by a
probe $P_i$ (discussed in section 3.3.1). We measure the effect of the probe substitution directly from
the similarity between the representation of the original expression, X, and the representation of
the new expression after substitution, Y, adopting, throughout this paper, cosine similarity as a
measure of the similarity of meaning between two vector embeddings:

$$cossim(X, Y) = \frac{\epsilon_X \cdot \epsilon_Y}{||\epsilon_X||\,||\epsilon_Y||} \qquad (2)$$
25 Our prior work reveals that only looking at similarities at sentence level when comparing the representations of the
original and the resulting sentences may not accurately reflect their differences (Garcia et al. 2021b).
where $\epsilon_X$ and $\epsilon_Y$ are vector embeddings with D components, $\epsilon_X \cdot \epsilon_Y$ their inner product, and
$||\epsilon_X||$, $||\epsilon_Y||$ their L2 norms. Therefore, the average similarity between the original expression
and the probe-modified expression for a given NC can be defined as

$$Sim(P_i, NC) = \big\langle cossim\big(expr(NC), expr(P_i)\big) \big\rangle_{P_i} \qquad (3)$$

where expr(NC) is the target NC expression, expr($P_i$) is the expression where the NC is replaced
by a probe of the type $P_i$, and $\langle \cdots \rangle_{P_i}$ means the average over possible substitutions of this type.
We use more than one substitution only for random probes ($P_{Rand}$); for all other probes a single
substitution is reported.
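A minimal numpy sketch of the cosine similarity of Eq. (2) and of the averaged similarity described above; the embeddings are assumed to be obtained with the pooling procedure of Section 3.2.1.

```python
import numpy as np

def cossim(x, y):
    """Cosine similarity between two embedding vectors (Eq. 2)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def sim(nc_embedding, probe_embeddings):
    """Average similarity between the original expression and the
    probe-modified expressions; more than one substitution is used
    only for random probes."""
    return float(np.mean([cossim(nc_embedding, p) for p in probe_embeddings]))
```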
3.4.3 The Affinity score (Aff). Cosine similarity measures are not sensitive enough to capture
subtle meaning differences, especially in anisotropic representation spaces (Ethayarajh and
Jurafsky 2021). Additionally, there may be a 'horizon of interest,' beyond which word connections
lose meaningful inference (Karlgren and Kanerva 2021), which may be a challenge for representing
idiomatic expressions, as the necessary context may lie within this critical boundary. Investigating
measures that account for anisotropic spaces and for a horizon of interest is an interesting avenue
for future research for improving idiomaticity detection. In this paper, we propose a comparative
measure that we refer to as Affinity (Assessment of Feature Familiarity and Idiomatic Nuance by
Interpreting Target Yielding), which identifies which of two representations is the closer to a
given target representation.
Given a target representation $Target$ and two possible probes $P_i$ and $P_j$, the affinity is defined
as:
Affinities closer to 1 or larger indicate a greater similarity between the target and the first probe
$P_i$, values closer to -1 or lower indicate the opposite situation, where the target is more similar to
the second probe $P_j$, and values near zero indicate no preference. Given the focus of this paper on
detecting idiomaticity in representations, we measure the affinities involving the minimal pairs
defined in section 3.3.1, analysing if, as expected, the target NCs have higher similarities with
probes whose substitutions maintain the original meaning (as $P_i$) than with probes that involve
potential changes in meaning (as $P_j$). In particular:
• Affinity $A_{Syn|WordsSyn}$ = Aff($P_{Syn}$, $P_{WordsSyn}$|NC) measures if the target NCs have
greater similarities with their gold synonyms than with synonyms of the individual
components (e.g. eager beaver with hardworking person than with restless rodent).
• Affinity $A_{Syn|Rand}$ = Aff($P_{Syn}$, $P_{Rand}$|NC) compares if the target NCs display greater
similarities to their gold synonyms than to random substitutions.
Our Affinity measure extends traditional forced-choice evaluations (Warstadt et al. 2020)
by quantifying the degree of similarity preference between two options. Unlike binary choices,
Affinity provides a continuous measure of relative similarity, offering a more detailed assessment
of how well models capture idiomatic meanings. This nuanced analysis reveals subtle differences
in model performance, providing deeper insights into the representation of idiomatic expressions.
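Affinity itself is given by Eq. (4); the sketch below only illustrates the forced-choice comparison that Affinity extends, together with a plain similarity difference as an informal continuous proxy, neither of which should be read as the paper's actual definition.

```python
def forced_choice(sim_to_pi, sim_to_pj):
    """Traditional binary forced choice: which probe is closer to the target?"""
    return "Pi" if sim_to_pi > sim_to_pj else "Pj"

def preference_strength(sim_to_pi, sim_to_pj):
    """Informal continuous proxy (not the Affinity of Eq. 4): positive values
    favour Pi, negative values favour Pj, values near zero show no preference."""
    return sim_to_pi - sim_to_pj
```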
3.4.4 The Scaled Similarity score ($Sim_R$). Even though Affinity is an advance over the simple
similarity measure, additional measures may still need to be adopted for models in which the average
similarity between two random embeddings is larger than zero, as affinities will tend to have small
values even for very dissimilar probes (see discussion). To address this issue, we propose a scaled
version of the similarity:

$$Sim_R(P_i|Target) = \left\langle \frac{Sim(P_i, Target) - Sim(P_{Rand}, Target)}{1 - Sim(P_{Rand}, Target)} \right\rangle_{Sent} \qquad (5)$$

where $\langle \cdots \rangle_{Sent}$ denotes the average over the M sentences that illustrate the meaning of a
particular NC and $P_{Rand}$ is a random substitution. The scaled similarity is defined such that if
replacing the target with a probe $P_i$ results in cosine similarities close to one ($Sim(P_i, Target) \approx 1$),
the scaled similarity is also close to one, $Sim_R \approx 1$. Conversely, if the replacement is similar to
a random replacement ($Sim(P_i, Target) \approx Sim(P_{Rand}, Target)$), then $Sim_R \approx 0$. This approach
is equivalent to a max-min normalisation26 in the anisotropic space of a model.
In particular, given the focus on idiomaticity, we focus as before on two similarities:
• $Sim_{R|Syn}$ = $Sim_R(P_{Syn}|NC)$, where the NCs are replaced by gold synonyms and no
changes in meaning are expected, therefore $Sim_{R|Syn}$ should be close to 1.
• $Sim_{R|WordsSyn}$ = $Sim_R(P_{WordsSyn}|NC)$, where the NCs are replaced by synonyms of
the individual components and greater changes in meaning, and therefore small values
($\sim 0$) of $Sim_R$, are expected for more idiomatic cases.
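A minimal sketch of the Scaled Similarity of Eq. (5), assuming the per-sentence similarities to the probe and to the random replacements have already been computed with the Sim function above:

```python
import numpy as np

def scaled_similarity(sim_probe_per_sentence, sim_rand_per_sentence):
    """Sim_R(Pi|Target): rescale each per-sentence similarity by the random
    lowerbound and average over the sentences illustrating the NC (Eq. 5)."""
    scaled = [(s_p - s_r) / (1.0 - s_r)
              for s_p, s_r in zip(sim_probe_per_sentence, sim_rand_per_sentence)]
    return float(np.mean(scaled))
```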
3.4.5 The Correlation measure (ρ). Finally, to assess the impact of idiomaticity for the probe
substitutions, we use the Spearman correlation between the different measurements and the gold
standard human annotations of compositionality (Comp) given by Eq. 1.
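The reported correlations can be obtained with a standard implementation of Spearman's ρ, for instance from scipy; a minimal sketch, assuming one similarity value and one Comp score per NC:

```python
from scipy.stats import spearmanr

def correlation_with_comp(similarities, comp_scores, alpha=0.05):
    """Spearman correlation between probe similarities and the human
    compositionality scores; returns None when not significant (p > alpha),
    mirroring how non-significant cells are omitted from the tables."""
    rho, p_value = spearmanr(similarities, comp_scores)
    return rho if p_value <= alpha else None
```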
4. Results

4.1 Are the representations of the NCs and their synonyms similar?
A first indication of the successful modeling of idiomaticity is if a model assigns similar
representations for the target NCs and for their synonyms, regardless of their level of compositionality.
We measure this using the minimal pairs of probe $P_{Syn}$ and compare it with less appropriate
substitutions represented by the other probes $P_j$. The distribution of similarities obtained for
each of the probes is shown in Figure 1, along with the correlations of these similarities with the
human compositionality scores for the NCs at sentence ($\rho_{Sent}$) and NC ($\rho_{NC}$) levels, in Tables 2
and 3. Considered in isolation, the high similarity scores for $P_{Syn}$ at sentence level (close to 1
for naturalistic sentences, and mostly above 0.75 for neutral sentences, Figure 1, $P_{Syn}$) seem to
suggest that these models are able to capture idiomaticity. However, when compared against the
scores for the minimal pairs of the other probes, a different story emerges.
When a target NC is replaced with one of its component words (Figure 1, $P_{Comp}$) or with
synonyms of its component words (Figure 1, $P_{WordsSyn}$), lower similarities should be observed
between the minimal pairs since, although these substitutions could preserve some of the meaning
of the more compositional cases, they would not do so for the more idiomatic cases. Moreover,
random substitutions should lead to even lower similarities
for all NCs (Figure 1, $P_{Rand}$), since they could result in nonsensical sentences. This expected
staggered pattern of similarities, highest for $P_{Syn}$, moderate for $P_{Comp}$ and $P_{WordsSyn}$, and lowest
for $P_{Rand}$, illustrated in Figure 1 (Ideal Values), does not seem to be reflected in a visible reduction
of the similarities at sentence level in Figure 1. In fact, even random substitutions seem to result
in high sentence similarities, even if they are not as high as those for the other substitutions.

26 $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.

[Figure 1: four panels of similarity distributions, one per probe ($P_{Syn}$, $P_{Comp}$, $P_{WordsSyn}$, $P_{Rand}$), plus an Ideal Values panel.]
Figure 1: Distribution of cosine similarities between the minimal pairs at sentence level, with the
original NC and the probe-modified substitution for English (EN, in blue) and Portuguese (PT, in
orange), with naturalistic (Nat) sentences in darker shade and neutral (Neut) in lighter. The lower
panel (Ideal Values) is an illustration of similarity values ideally expected for the different probes.
The means and standard deviations are in Table 10 in the Appendix.
Another important point relates to the correlation of these similarities for the different
NCs with human judgments of compositionality. It is expected that there would be almost no
correlation for the similarities derived from $P_{Syn}$ and $P_{Rand}$, and a moderate correlation for
$P_{Comp}$ and $P_{WordsSyn}$, as they may be more acceptable for compositional NCs than for idiomatic
ones. However, this expected pattern is not observed in the results presented in Table 2. For most
models, $\rho_{Sent}(P_{Syn})$ shows a moderate correlation, while $\rho_{Sent}(P_{Comp})$ and $\rho_{Sent}(P_{WordsSyn})$
are either weak or non-significant.
$\rho_{Sent}$ | Word2Vec | GloVe | ELMo | SBERT ML | BERT | BERT ML | DistilB ML | LLaMA2 | OpenAI
$P_{Syn}$
EN-Nat | 0.30 | 0.31 | 0.43 | 0.47 | 0.39 | 0.51 | 0.38 | 0.15 | 0.41
EN-Neut | 0.60 | 0.58 | 0.55 | 0.60 | 0.51 | 0.53 | 0.56 | 0.37 | 0.54
PT-Nat | 0.18 | 0.13 | 0.33 | 0.31 | 0.32 | 0.29 | 0.20 | 0.27 | 0.46
PT-Neut | 0.31 | 0.22 | 0.37 | 0.46 | 0.35 | 0.30 | 0.31 | 0.31 | 0.51
$P_{Comp}$
EN-Nat | - | - | - | - | 0.17 | - | - | - | 0.37
EN-Neut | 0.19 | 0.29 | - | - | - | - | -0.12 | - | 0.51
PT-Nat | - | -0.12 | 0.12 | - | 0.16 | - | -0.15 | - | 0.21
PT-Neut | 0.13 | - | 0.17 | - | - | -0.14 | - | - | 0.27
$P_{WordsSyn}$
EN-Nat | - | - | - | - | - | - | - | - | 0.21
EN-Neut | 0.19 | - | - | -0.13 | -0.15 | - | - | 0.20 | 0.13
PT-Nat | -0.12 | -0.19 | - | - | - | - | -0.14 | - | 0.11
PT-Neut | - | -0.13 | - | - | - | - | - | - | 0.17
$P_{Rand}$
EN-Nat | - | -0.11 | -0.13 | -0.16 | -0.27 | -0.11 | -0.18 | -0.11 | -
EN-Neut | 0.11 | - | -0.31 | -0.36 | -0.29 | - | -0.13 | - | -
PT-Nat | -0.17 | -0.20 | -0.13 | -0.11 | -0.14 | -0.12 | - | -0.18 | -
PT-Neut | 0.13 | -0.17 | -0.14 | -0.11 | -0.22 | -0.11 | - | - | -

Table 2: Spearman ρ correlation between cosine similarities and human compositionality
judgments (Comp) at sentence level. Only significant results (p ≤ 0.05) are displayed, for $P_{Syn}$,
$P_{Comp}$, $P_{WordsSyn}$ and $P_{Rand}$, for English (EN) and Portuguese (PT), naturalistic (Nat) and
neutral (Neut) sentences.
Since in these minimal pairs only the target NCs and their substitutions change, the high
similarities found may be an effect of the lexical overlap between the sentences of a minimal
pair. Indeed, comparing the output of the models in relation to sentence lengths for naturalistic
sentences, there is a significant moderate to strong positive correlation between the lexical overlap
and the cosine similarity of a pair, for both English and Portuguese (Table 4), where the greater
the overlap between the sentences, the higher their similarity. This can also explain the higher
similarities observed for naturalistic than for neutral sentences, since the former are longer than
the latter with a higher lexical overlap proportional to the length of the sentence: average sentence
length for naturalistic sentences is 23.4 words for English (lexical overlap > 91%) and 13.0
words for Portuguese (overlap > 84%), while for the neutral sentences it is five words (overlap >
60%) for both languages.27 It could be argued that the influence of lexical overlap is expected,
given that a compositional representation is used for sentences, where the embeddings for each
token are added. However, while this holds true for static models, it may not necessarily apply
to contextualised models. In contextualised models, it is expected that each token/word would
interact with the others via attention heads, and, if the model accurately captures semantics, all
tokens/words would adjust to the context of the sentence as a whole. Ideally, even with the simple
compositional representation of the sentence, we would anticipate that a correct sentence would
exhibit low similarity with the mostly nonsensical sentences produced by random probes. Even
though the similarities coming from contextualised models seem to present lower correlations with
sentence length, lexical overlap still appears to dominate across all types of models.
27 We also compared longer neutral contexts with 10 words for English (> 80%), and 9 words for Portuguese (> 77%),
and found similar results.
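A sketch of the lexical overlap check is given below; the overlap definition (shared tokens over the longer sentence) and variable names are assumptions chosen for illustration rather than the exact measure used for Table 4.

# Sketch: proportion of lexical overlap between the two sentences of a minimal
# pair, correlated (Spearman) with the cosine similarity of the pair.
from scipy.stats import spearmanr

def lexical_overlap(sent_a: str, sent_b: str) -> float:
    a, b = sent_a.lower().split(), sent_b.lower().split()
    shared = sum(1 for tok in a if tok in b)
    return shared / max(len(a), len(b))   # assumed normalisation

def overlap_vs_similarity(sentence_pairs, cosine_sims):
    """sentence_pairs: list of (original, substituted) sentences;
    cosine_sims: the corresponding sentence-level cosine similarities."""
    overlaps = [lexical_overlap(a, b) for a, b in sentence_pairs]
    return spearmanr(overlaps, cosine_sims)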
Figure 2: Distribution of cosine similarities between the minimal pairs at NC level (panels: PSyn,
PComp, PWordsSyn, PRand and Ideal Values), with the original NC and the probe-modified substitution
for English (blue) and Portuguese (orange), with naturalistic sentences in darker shade and neutral
in lighter. The lower panel (Ideal Values) is an illustration of similarity values ideally expected for
the different probes. The means and standard deviations are in Table 11 in the Appendix.

To minimise the effect of the lexical overlap in the similarities, we now focus our analyses
only on the similarities among the tokens representing the NCs and their substitutions in the
context of the target sentences. In this case, lower similarities were obtained for all probes and
all models compared to those at sentence level (Figure 2 vs. Figure 1). This is even the case for
the similarities between the NCs and their gold standard synonyms (PSyn), which are centred around
the same values as those between the NCs and the synonyms of their individual components
(PWordsSyn); the similarities for the random replacements (PRand) follow the same trend, and all
of them are lower than the similarities between the NCs and only one NC component (PComp).
In fact, similarities for the gold standard synonyms are lower than for many of the other probes,
regardless of the extent to which the original NC meaning is changed, as probes PComp to PRand
involve some change in meaning while PSyn does not.
Finally, there is more variation displayed among the models, as there are lower similarities for
static than for most contextualised models. Overall, the resulting similarities at NC level do not
follow the expected patterns for representing idiomaticity, illustrated in Figure 2 (Ideal Values).
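As a concrete illustration of the NC-level comparison, the sketch below extracts an NC representation by mean-pooling the subword vectors of the compound inside its sentence and compares the original and substituted members of a minimal pair; the model name, layer and pooling choices are assumptions rather than the paper's exact configuration.

# Minimal sketch (not the paper's exact pipeline): extract an NC-level
# representation by mean-pooling the subword vectors of the compound inside
# its sentence, then compare the original and the substituted minimal pair.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def nc_vector(sentence: str, nc: str) -> torch.Tensor:
    """Mean-pool the last-layer subword embeddings of the NC span."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    start = sentence.lower().index(nc.lower())
    end = start + len(nc)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    # keep subword tokens whose character span overlaps the NC span
    mask = [(s < end and e > start) and (s, e) != (0, 0) for s, e in offsets.tolist()]
    return hidden[torch.tensor(mask)].mean(dim=0)

def nc_similarity(sent_orig: str, nc: str, sent_sub: str, substitute: str) -> float:
    v1, v2 = nc_vector(sent_orig, nc), nc_vector(sent_sub, substitute)
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()

# Example minimal pair for the PSyn probe (illustrative sentences only)
print(nc_similarity("He is an eager beaver at work.", "eager beaver",
                    "He is a hard worker at work.", "hard worker"))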
The same holds true for their correlations with human judgements. In line with what occurs
at the sentence level, the similarities at NC level exhibit correlations that contradict linguistic
expectations. In particular, it is expected that true synonymous substitutions work well across
the idiomatic-compositionality spectrum. Therefore, no correlation should be expected for PSyn,
while for PComp and PWordsSyn a moderate correlation is expected, and no correlation for PRand.
However, Table 3 indicates that for most models ρNC(PSyn) > ρNC(PComp or PWordsSyn),
with the latter being either weak or not significant.
In the next section we analyse if, at least at a detailed level, the similarities between NCs and
their synonyms are mostly higher than those of other alternatives.
4.2 Are the representations of the NCs and their synonyms relatively more similar when
compared to other alternatives?
Figure 3: Affinity at the NC level for English (blue) and Portuguese (orange), with naturalistic
sentences in darker shade and neutral in lighter. The lower panel (Ideal Values) is an illustration
of values ideally expected for the different affinities. The means and standard deviations are in
Table 12 in the Appendix.

Comparing the Affinities across compositionality classes (Figure 4), compositional NCs tend to
display higher Affinities than partly compositional NCs, and the latter show higher Affinities
than idiomatic NCs. This is
confirmed by the correlation analysis in Table 5, with most models displaying significant weak
to moderate correlations between Affinities and human compositionality judgements, for all
Affinity types, including neutral sentences and Portuguese data.28 This contradicts what was
generally expected: Affinity ASyn|WordsSyn values should exhibit a negative correlation with
compositionality, while Affinity ASyn|Rand should show no correlation at all.
These results suggest that representations of idiomatic NCs may not be accurately incorporat-
ing their meanings, since NCs are not closer to their synonyms than to other alternatives, even if
they are random. Moreover, the more idiomatic NCs seem to be more similar to synonyms of their
individual components, which suggests that the surface clues about their individual components
may be playing a greater role in driving these similarities, even in contextualised models. This
result remains valid even after removing compounds from the dataset that have lexical overlaps
with the NCSyn produced by the annotators (see Table 14 in the Appendix).
28 We omitted the equivalent of Figure 4 for neutral sentences and Portuguese data due to their visual
similarity to the English naturalistic version.

Figure 4: Affinity by idiomaticity Class at NC level for English (EN) and Portuguese (PT) natural-
istic sentences. Idiomatic (I) in green, partly compositional (PC) in yellow and Compositional
NCs (C) in blue.
If random substitutions that should result in Affinities around 1 (ASyn|Rand in Figure 3, Ideal
Values) result instead in values mostly below 0.5, the latter may represent the de facto upper bound
for Affinity for these models. In this case, a rescaling factor may need to be adopted that could
magnify meaningful similarity values. To implement this, we propose the Scaled Similarity (Eq. 5),
which takes into account the threshold defined by random replacements when calculating the
cosine similarities between the target representation and a given probe. In this section we explore
the behavior of SimR|Syn and SimR|WordsSyn defined in section 3.4.
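For concreteness, a hedged sketch of the two comparative measures is shown below. The exact definitions are those of section 3.4 and Eq. 5; here Affinity is assumed to be a difference of cosines and Scaled Similarity a rescaling by the random-replacement similarity, purely for illustration.

# Illustrative sketch of the comparative measures; the function bodies are
# assumptions for the sake of a concrete example, not the paper's exact equations.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def affinity(target: np.ndarray, probe: np.ndarray, distractor: np.ndarray) -> float:
    """A_{probe|distractor}: how much closer the target is to the probe than
    to the distractor (assumed here to be a difference of cosines)."""
    return cosine(target, probe) - cosine(target, distractor)

def scaled_similarity(target: np.ndarray, probe: np.ndarray, rand: np.ndarray) -> float:
    """Sim_{R|probe}: cosine to the probe rescaled by the similarity floor
    given by a random replacement (assumed rescaling for illustration)."""
    floor = cosine(target, rand)
    return (cosine(target, probe) - floor) / (1.0 - floor)

# toy vectors standing in for an NC, its NCSyn and a random replacement
rng = np.random.default_rng(0)
nc, nc_syn, nc_rand = rng.normal(size=(3, 768))
print(affinity(nc, nc_syn, nc_rand), scaled_similarity(nc, nc_syn, nc_rand))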
The Scaled Similarity values (Figure 5) reveal, even more than the Affinities, the equivalences
displayed by the behavior of these models, with SimR being able to abstract away from the
particularities of the spaces defined by each of these models. Interestingly, comparing different
levels of contextualisation (e.g. static models on the left and contextualised on the right half
of Figure 5), the Scaled Similarities produced by static models like Word2Vec and GloVe are
comparable to those of a contextualised large language model like LLaMA2. These results seem
to hold independently of how informative the context is (naturalistic vs. neutral sentences), with
NC representations from naturalistic sentences displaying no real advantage over those from
neutral sentences. Overall, these results suggest that the informative contexts provided by the
naturalistic sentences may not yet be adequately incorporated even by the larger contextualised
models.
Figure 6: SimR|Syn per Compositionality Class: green for Idiomatic (I), yellow for Partly Composi-
tional (PC) and blue for Compositional (C), in English (EN) and Portuguese (PT), in Naturalistic
and Neutral sentences.
Inspecting the SimR|Syn values according to idiomaticity level (Figure 6), the models display
lower Scaled Similarities for the more idiomatic than for the more compositional NCs, confirming
what was already indicated by the Affinities: the models are less able to capture the idiomatic
meanings and, as a consequence, the expected high similarities with their gold standard synonyms
are not observed. This is further confirmed by analysing the values obtained for the synonyms
of the individual components (Figures 5 and 7), with the distributions of SimR|WordsSyn values
having similar averages but considerably lower variances when compared to SimR|Syn, whereas
the expected result would be the opposite: lower averages and variances for SimR|Syn. In fact, the
average and standard deviation of the ratio SimR|Syn/SimR|WordsSyn (Figure 8) show that the
ratio oscillates around 1, which indicates that as a whole the models respond similarly to PSyn
and PWordsSyn substitutions. In addition, the average values and variances for SimR|WordsSyn
do not depend on the degree of compositionality of the target NC (Figure 7 for SimR|WordsSyn
and Figure 9 for the average and standard deviation of the ratio SimR|Syn/SimR|WordsSyn,
according to idiomaticity level). The whole picture indicates that for all models (contextualised or
not) replacing the NC by literal synonyms of the component words is more effective (produces
higher similarities) than using their gold synonyms. In particular, for idiomatic NCs we observe
that SimR|Syn < SimR|WordsSyn, which indicates that lexical similarity (as opposed to similarity
of meaning) is still a dominant factor in the representations, even for the contextualised models,
and provides additional confirmation for the possibility that the component words of an idiomatic
NC may be represented individually and combined compositionally by these models.

Figure 7: SimR|WordsSyn per Compositionality Class: green for Idiomatic (I), yellow for Partly
Compositional (PC) and blue for Compositional (C), in English (EN) and Portuguese (PT), in
Naturalistic and Neutral sentences.
Table 6 summarises these results in terms of the Spearman correlations between SimR values
and the human judgments for compositionality. It shows that, considering the different models,
SimR|Syn is almost always moderately correlated with the compositionality score: the higher the
compositionality score, the higher the SimR|Syn value, and consequently the more the meaning
is preserved with a PSyn substitution. SimR|WordsSyn, in contrast, rarely displays a significant
correlation with the compositionality score. As discussed above, this is a demonstration that the
idiomatic meaning is not captured by these models, not even by those that are contextualised. As
with Affinities, this discrepancy in the behavior of Scaled Similarities persists even after removing
compounds from the dataset that have lexical overlaps with the NCSyn produced by the annotators
(see Table 15 in the Appendix).

Figure 8: Ratio of Average of Scaled Similarity (SimR|Syn/SimR|WordsSyn). English data are in
blue, Portuguese data in orange, values for naturalistic sentences in darker shade and for neutral
in lighter.
We have evaluated several vector models from different architectures in two languages, ranging
from static to contextual representations as well as monolingual and multilingual models. Although
the results are generally far from being satisfactory, in this section we highlight some differences
and similarities between models and languages.
Across models, the similarities are in general higher for Transformer-based models than for
static representations. In this respect, it is worth noting that the results of ELMo and mSBERT
are more similar to those of Word2Vec and GloVe than to those of the other BERT variants (for
instance in Figure 2). Although further research would be needed to determine the precise factors,
for ELMo this behaviour could be due either to the different vector space constructed by LSTMs or
to the smaller number of hidden layers when compared to the other models (2 vs. 6 and 12 layers),
which may imply lower contextualisation effects across the network (Ethayarajh 2019).

Figure 9: Ratio of Average of Scaled Similarity (SimR|Syn/SimR|WordsSyn), per Compositionality
Class, in four panels (EN-Nat, PT-Nat, EN-Neut, PT-Neut): green for Idiomatic (I), orange for
Partly Compositional (PC) and blue for Compositional (C), in English (EN) and Portuguese (PT),
in Naturalistic (Nat) and Neutral (Neut) sentences.
For the Transformer-based models, there are clear differences between the similarities
produced by the BERT-based models and those of the autoregressive models, which are lower
and with a wider range, especially for neutral sentences. When comparing monolingual and
multilingual models, namely BERT and BERT ML, similar tendencies are found both in
similarities and in correlations with the human judgments. In general, multilingual models seem to
place the vector representations in a more restricted space, implying higher degrees of similarity
and lower ranges of variation. Similar tendencies are found for DistilBERT-ML.
The proposed measures also suggest that the representations of the large autoregressive
models are more similar to those of the static embeddings than to the other Transformer-based
encoder models.
Although the results of the different models across languages follow very similar trends, they
also display two main differences. The first one is that when comparing the minimal pairs of
the naturalistic data, the representations in English seem to be closer and occupy less space than
those in Portuguese, in both monolingual and multilingual models of all types. The second is
that the differences between languages are larger for neutral than for naturalistic sentences,
especially at the sentence level, with similar results at the NC level, except for ELMo and
BERT embeddings in PWordsSyn and PRand (Figure 2). The trends are even more aligned when
considering Affinities and Scaled Similarities for most models in both languages.
Indeed, high correlations were found among all models, reflected by the correlogram in
Figure 10. Correlations are particularly high for the expected congruent variants involving NCSyn,
as reflected by the darker red shades: PSyn, ASyn|WordsSyn and SimR|Syn. They are also higher
for Affinities and Scaled Similarities, indicating that taking into account the relative preferences
and random similarities within each model reveals how comparable they are in their ability to
represent idiomaticity. That is, regardless of any superiority of specific models for other tasks,
and in spite of their seemingly different individual performances in terms of cosine similarities,
this sample of models has not revealed one that is clearly better than the others at representing
idiomaticity. Moreover, the high correlations with the static models also suggest that the
relevant contextual clues for idiomaticity representation are not yet adequately incorporated by
the contextualised models.
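The correlogram itself can be reproduced along these lines, assuming per-NC values of a given measure have been collected for every model (column names and data loading are placeholders):

# Sketch: pairwise Spearman correlations between models for one measure
# (e.g. SimR|Syn per NC), as summarised in the correlogram of Figure 10.
import pandas as pd

def model_correlogram(per_model_values: dict) -> pd.DataFrame:
    """per_model_values: {model_name: per-NC values, aligned across models}."""
    df = pd.DataFrame(per_model_values)
    return df.corr(method="spearman")   # models x models correlation matrix

# usage (placeholder variables): model_correlogram({"Word2Vec": w2v_vals, "BERT": bert_vals})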
In sum, our results indicate that the different models evaluated in general follow the same
tendencies when representing idiomaticity in context, suggesting that they are not yet able to
adequately capture the semantics of the MWEs. More investigation is needed to determine how
to effectively achieve this with these architectures and training regimes, or whether a change in
paradigm is required. We will now discuss some representative cases, to give a flavour of how
these models handle a spectrum of idiomaticity.
For a more concrete qualitative overview of the ability of models to represent different levels of
idiomaticity, we now look at some representative English NCs evenly distributed among the three
levels of compositionality (compositional, partly compositional and idiomatic) in three naturalistic
sentences (Table 7). We start with the probes for 6 English NCs and then look at the highest and
lowest values for the ASyn|WordsSyn Affinity, focusing on the relation between a given NC and its
NCSyn and NCWordSyn variants.
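The minimal-pair probes for sentences like those in Table 7 can be generated by simple substitution, as sketched below; the replacement strings, the component kept for PComp, and the random replacement are illustrative stand-ins rather than the dataset's actual annotations.

# Sketch: build one substituted sentence per probe type from a naturalistic sentence.
def build_probes(sentence: str, nc: str, nc_syn: str, nc_words_syn: str, nc_rand: str) -> dict:
    head = nc.split()[-1]                                # e.g. "beaver" for "eager beaver"
    return {
        "PSyn": sentence.replace(nc, nc_syn),            # gold synonym of the whole NC
        "PComp": sentence.replace(nc, head),             # only one component kept
        "PWordsSyn": sentence.replace(nc, nc_words_syn), # word-by-word synonyms
        "PRand": sentence.replace(nc, nc_rand),          # random replacement
    }

probes = build_probes("Eric was being an eager beaver and left work late.",
                      "eager beaver", "hard worker", "restless rodent", "parking lot")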
Probes. Considering the probing measures in terms of the average scores of all sentences for each
of the 6 NCs (Table 8), we focus on the cosine similarities for the probes and whether they differ
from the expected behavior compatible with capturing idiomatic meaning.
First of all, for PSyn the similarities should be close to 1. Indeed, at sentence level all
similarities for all models are above 0.9, and tend to be higher for compositional NCs (0.98) than
for partly compositional (0.95) and idiomatic NCs (0.90). However, at NC level, they
display considerable variation, and while the similarities are high for all models for compositional
NCs, for idiomatic NCs, in particular, the similarities are the lowest and vary considerably per
model (from 0.27 for SBERT ML and Word2Vec to 0.81 for LLama2 for grey matter). For partly
compositional NCs, although some of the models assign the expected high similarities for some
NCs (0.94 for BERT for Dutch courage), other NCs have lower similarities (0.43 for Word2Vec
for eternal rest).
For PComp, lower similarities are expected for idiomatic NCs, as the idiomatic meaning may
be lost when one of the component words is missing (e.g. grey matter vs. grey or vs. matter).
However, at sentence level they are higher than 0.93 for all models.
Figure 10: Correlograms for all models for all sentences and languages, with only significant
values displayed (p < 0.05). Red indicates positive correlation and blue negative; darker shades
are for higher values, lighter for lower values.
Idiomatic
  eager beaver (NCSyn: hard worker; NCWordSyn: restless rodent)
    S1: Eric was being an eager beaver and left work late.
    S2: Eager beavers willing to learn your job for less pay are almost always waiting in the wings.
    S3: If you are a really eager beaver you can pre-order the DVD now from either of the below retailers.
Partly Compositional
  Dutch courage (NCSyn: alcoholic courage; NCWordSyn: Hollander bravery)
    S1: We had to go down to the pub to get some Dutch courage!
    S2: We suggest you try the following cocktail to work up a bit of Dutch courage to get you through the match!
    S3: After some Dutch courage (a few vodkas) in the nightclub, and a nerve-racking conversation, we kissed!
  eternal rest (NCSyn: death; NCWordSyn: permanent break)
    S1: They have been called home to their eternal rest and we are left behind.
    S2: These tolls announce the death of a nun and call for prayers for her eternal rest.
    S3: The passengers, with early morning porridge complexions, don’t look far from eternal rest.
Compositional
  economic aid (NCSyn: financial assistance; NCWordSyn: budgetary assistance)
    S1: We have already extended to Greece certain types of relief and economic aid but these are inadequate.
    S2: The USSR was soon giving Cuba economic aid, technical support and military ’advisers’ from the USSR.
    S3: A government’s success in reducing population movement should be a key factor in allocating economic aid.
  research lab (NCSyn: research facility; NCWordSyn: investigation workplace)
    S1: The fourth year is spent doing a research project in a ’real’ research lab.
    S2: Being part of a research lab provides at times very exciting fieldwork experiences for individual students.
    S3: Bath operates several undergraduate degree programmes that include a professional placement year in industry or a research lab.
Table 7: NCs with their NCSyn and NCWordSyn replacements and three sentences each, grouped by
compositionality class, used for the qualitative analyses.
At NC level, although these idiomatic NCs have lower similarities, they are still high (from 0.77
for SBERT ML for grey matter to 0.94 for BERT ML for eager beaver). For partly compositional
and compositional NCs they are mostly high for all models, except for LLama2 for Dutch courage (0.67).
Although lower PWordsSyn similarities were also expected for more idiomatic NCs, at sentence level
all idiomatic, partly compositional and compositional NCs display similarities above 0.95, even
though the NCWordSyn in PWordsSyn does not preserve the idiomatic meaning (e.g. grey
matter vs. silvery material). At NC level, even if lower values were found for idiomatic NCs with
static models (Word2Vec and GloVe), high similarities were still found (e.g. 0.91 for BERT for
grey matter).
Finally, for PRand, there should be low similarities between all NCs and their randomly generated
substitutions. However, most of the similarities are still high, regardless of the level of idiomaticity
(e.g. 0.92 for BERT for grey matter and for BERT ML for Dutch courage).
Overall, the expected high similarities for PSyn are not displayed by these models at NC
level, and for the other probes the perturbations to the idiomatic meaning are not reflected by
lower similarities.
Affinities. For the Affinity measures, considering the examples with the highest and lowest values
for ASyn|WordsSyn as a proxy for how a particular model represents an NC compared to its
synonym and to a word-by-word replacement (NCSyn and NCWordSyn), we focus on the results for
BERT in the naturalistic sentences in English. As discussed in section 4.1, we expect higher
ASyn|WordsSyn values for idiomatic NCs, since the model should display a stronger preference for
a semantically related synonym than for a potentially unrelated substitution, representing the former
as closely as possible to the NC in the vector space. In contrast, for more compositional cases,
both substitutions may be possible and close to one another (reflected by ASyn|WordsSyn values
around 0). However, the NCs with the highest ASyn|WordsSyn values were mostly compositional
(starting with video game with ASyn|WordsSyn = 0.44, and parking lot with ASyn|WordsSyn =
0.40), with the first partly compositional NC appearing at position 16 (sparkling water with
ASyn|WordsSyn = 0.32). The idiomatic NC with the highest ASyn|WordsSyn value is at position
53 (box office, referring to the popularity of a movie, with ASyn|WordsSyn = 0.24).
At the other end of the ranking, we find mostly idiomatic cases. Among the top 10 examples
with the lowest values we find 7 idiomatic NCs (e.g., agony aunt with ASyn|WordsSyn = −0.29 and
the NC with the lowest value, grey matter, with ASyn|WordsSyn = −0.40), 2 partly compositional NCs,
and only one compositional NC, in position 10 (cooking stove with ASyn|WordsSyn = −0.24).
In sum, these results confirm that the models do not display the expected preference for representing
NCs closer to their synonyms than to distractors, even when these involve idiomatic NCs and/or
random items.
5. Conclusions
This paper presented an evaluation of the ability of widely available word representation models
to capture idiomatic meaning, focusing on noun compounds in two languages, English and
Portuguese. For evaluation we introduced the NCIMP dataset, containing NCs in English and
Portuguese in naturalistic and neutral sentences forming minimal pairs with idiomatic probes
using their component words, synonyms and other variant replacements, resulting in a dataset
containing 29,900 items, extending the datasets by Garcia et al. (2021a) and Garcia et al. (2021b).
These pairs can be used to measure the ability of models to detect the loss of the idiomatic
meaning in the presence of lexical substitutions and different contexts. We also propose two types
of measure for quantifying this ability: Affinities and Scaled Similarities. Affinity is a relative
measure of the proximity of the NC to two alternative probes, determining which of them is the
closest to the NC. Focusing on idiomaticity, we analysed if the models were able to generate a
representation for a given NC that was more similar to a semantically related paraphrase given by
the gold standard synonym than to an alternative possibly semantically unrelated representation.
The proposed measures of scaled similarities, SimR , take sample random similarities into account
for rescaling the space of a given model, to magnify high similarities and distinguish them from
those that are artifacts of the characteristics of the landscape of that model. As a consequence,
SimR also seems to abstract away from the particularities of the semantic space of each model
and provides a more direct way of comparing idiomaticity representation across models. The
results obtained indicate that models are not able to accurately capture idiomaticity, as they fail to
reflect actual similarities between NCs and their gold synonyms, especially for idiomatic cases,
while at the same time not displaying enough awareness of perturbations that lead to changes in
meaning, such as those involving the synonyms of the component words, and even random words.
It seems that the lexical clues provided by the component words are prioritised when representing
an NC over a more holistic combination of the relevant semantic clues needed for representing its
idiomatic meaning. Moreover, although the contexts could provide relevant information about
the idiomatic meanings, they do not seem to be adequately incorporated in these widely adopted
models, regardless of their degree of contextualisation. They also seem to fail to incorporate the
relevant context for idiomaticity, seeing as static and contextualised models show comparable
performances.
In this paper we evaluated the proposed measures focusing on idiomaticity, but they may be
applied to other tasks, and serve as a basis for detecting unwanted biases towards non-target meanings
more generally. Moreover, they may be informative when fine-tuning models, to assess whether the
changes move towards the intended target representations.
In this paper, we inspected the similarities produced by a number of models to determine how
accurately they represent idiomatic expressions. The results indicate that not even large models
like LLama2 seem to display the expected patterns that would confirm idiomatic understanding.
It is important to note that some of the difficulties in extracting information from cosine
similarity measures may be attributed to the presence of rogue dimensions (Timkey and van
Schijndel 2021) rather than anisotropy in semantic space. Measures like Affinity and Scaled
Similarity may not fully address this issue. We conducted a preliminary analysis using the method
of Timkey and van Schijndel (2021) to identify and standardise the top three rogue dimensions per
model/layer. After standardisation, we conducted an analysis focusing on PSyn measures and
found correlations mostly above 0.85, except for BERT-PT-Neut (0.79) and LLama2-EN-Neut
(0.65) (see Table 16 in the Appendix). Further investigation is needed to assess the impact of
standardising these dimensions and different approaches for standardisation, but given the high
correlations with our original results, we will leave this for future work.
Although our proposed assessment protocol and measures are model-independent, they
rely on access to the models and to their representations for subwords, words and multiwords.
Therefore, probing large generative AI chatbots for their understanding of idiomaticity, especially
closed-source models, presents additional challenges potentially requiring adaptation in the
application of the protocol, due to the restricted access to their base models and to the potential
variation in their answers. These warrant further investigations that are outside the scope of this
paper. However, one possible alternative would be to perform probing using question-answering,
following Zeng and Bhat (2022). We illustrate this question-answering approach with recent AI
Chatbots: GPT-3, Gemini Pro (Team et al. 2023)29 and ChatGPT 4.30 For testing these models,
simple questions containing idiomatic expressions are used, after having instructed each model
to provide the shortest answer to each question before asking them. The assumption is that the
questions could only be answered correctly if the model understood the meaning of the idioms in
context. The questions and answers are included in Table 9.
29 https://gemini.google.com/app
30 https://chat.openai.com/
Table 8: Similarity, Affinity and Scaled Similarity values for the NCs selected in Table 7. Values
in parentheses represent the standard deviations among the three sentences. The static models are
independent of context, so for them the variance is omitted, except in the case of eager beaver,
for which one of the sentences contains the compound in its plural form.
Idiom | Question | GPT-3 | Gemini Pro | ChatGPT 4
Never say die | If I have a never say die attitude, would I run the marathon injured or forfeit? | I would never run a marathon injured. | Run. | Run the marathon.
All at sea | If I am all at sea with my math assignment, am I making progress or am I lost? | You are making progress. | Lost. | Lost.
Every trick in the book | If I use every trick in the book to guarantee my grade, am I a magician or a determined student? | A magician. | Cheater. | Determined student.
Kill two birds in one stone | If I wanted to kill two birds in one stone, what kind of a workplace should I work in? | A slaughterhouse. | Multiskilled. | A multi-tasking environment.
Ahead of the game | If I want to be ahead of the game, would I study early or procrastinate? | Procrastinate. | Early. | Study early.
Table 9: Questions used to probe the understanding of idioms and the answers provided by recent
generative models. The leftmost column lists idiomatic expressions, the second column presents
hypothetical questions using these expressions, and the following columns show the responses
from different models, including GPT-3, Gemini Pro, and ChatGPT 4. Answers by GPT-3 are
from (Zeng and Bhat 2022).
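Such question-answering probing can be scripted along the following lines; the client interface, model name and prompt wording are assumptions used for illustration, not the exact setup behind Table 9.

# Sketch using the OpenAI Python client (assumes OPENAI_API_KEY is set);
# the system prompt mirrors the "shortest answer" instruction described above.
from openai import OpenAI

client = OpenAI()

def ask_idiom_question(question: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Provide the shortest possible answer to each question."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

print(ask_idiom_question("If I am all at sea with my math assignment, "
                         "am I making progress or am I lost?"))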
The responses from different systems vary in terms of correctly interpreting the idioms.
The responses from GPT-3 often miss the mark, while the responses from Gemini Pro and
ChatGPT 4 are mixed, with some answers suggesting correct interpretations and some incorrect.
For instance, for “Every trick in the book”, GPT-3 responds with “A magician”, which is a literal
interpretation, while ChatGPT 4 correctly identifies the figurative meaning with “Determined
student”. Gemini Pro’s response to the question is “Cheater”. The idiom “every trick in the book”
generally means to use all available means or strategies to achieve one’s goal, often implying
ingenuity or resourcefulness rather than dishonesty. The response from Gemini Pro could either
be driven by the word “trick” or be seen as a misinterpretation. This table could also be seen as
indicative of the evolution of AI language models over time, with newer models potentially
being trained to better handle idiomatic expressions and context, as seen in the generally more
accurate responses from ChatGPT 4 compared to GPT-3. Although the questions in the table
are indeed useful for exemplifying the comprehension of idiomatic expressions by these models,
they only cover a very limited and focused sample. In this paper, we propose the use of minimal
pairs containing synonyms and other distractors for a more in-depth assessment of idiomatic
understanding. Although their adaptation to a question-answering setting is left for future work,
our results for open models are in line with comparative analyses of the ability of some of these
models for idiomatic and figurative language (Phelps et al. 2024).
Moreover, as idiomatic expressions can be extremely diverse and nuanced, a comprehensive
evaluation of the ability of a model to understand them requires a controlled but extensive set of
idiomatic expressions and their variations. Therefore, we plan to extend the test items to contain
additional types of multiword expressions, including verb-noun combinations and phrasal verbs.
In addition, for a larger crosslingual examination of idiomaticity, and in particular of whether
multilingual models capture language-specific realisations of idiomatic expressions, we plan to
extend the dataset with additional languages. These would also allow the investigation of factors
relevant to specific tasks, such as machine translation, for which the translatability of MWEs from
source into target languages may also affect performance when processing MWEs (Dankers,
Lucas, and Titov 2022).
Possible next steps also include extending the probing strategy with additional measures
that go beyond similarities and correlations. Moreover, for ambiguous NCs in particular, we
intend to add sense-specific probes which could be used to measure and address training biases
towards particular senses. Finally, this paper has focused the evaluation on off-the-shelf pre-trained
models to provide an analysis of their ability to capture idiomaticity, and left the investigation
of fine-tuned models for future work. In particular, although fine-tuning can improve model
performance (Tayyar Madabushi et al. 2022) it is unclear to what extent the models are able to
generalise beyond the specific items seen to other unseen idiomatic expressions, or if each new
expression would have to be individually learned by the model. But these points are left for future
investigation.
Acknowledgments
This work was partly supported by UKRI EPSRC EP/T02450X/1 and NAF/R2/202209 (UK), by CNPq
311497/2021-7 and CAPES/PRINT 88887.583995/2020-00 (Brazil), by
MCIN/AEI/10.13039/501100011033 (grants PID2021-128811OA-I00 and TED2021-130295B-C33, the
latter also funded by “European Union Next Generation EU/PRTR”), by the Galician Government (ERDF
2014-2020: Call ED431G 2019/04, ED431F 2021/01, and ED431F 2021/01), and by a Ramón y Cajal grant
(RYC2019-028473-I), and by COST-Action UniDive.
References
Cordeiro, Silvio, Aline Villavicencio, Marco Idiart, and Carlos Ramisch. 2019. Unsupervised compositionality prediction of nominal compounds. Computational Linguistics, 45(1):1–57.
Dankers, Verna, Christopher Lucas, and Ivan Titov. 2022. Can transformer be too compositional? Analysing idiom processing in neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3608–3626, Association for Computational Linguistics, Dublin, Ireland.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Association for Computational Linguistics, Minneapolis, Minnesota.
de Dios-Flores, Iria, Juan Garcia Amboage, and Marcos Garcia. 2023. Dependency resolution at the syntax-semantics interface: psycholinguistic and computational insights on control dependencies. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 203–222, Association for Computational Linguistics, Toronto, Canada.
Erk, Katrin. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653.
Ethayarajh, Kawin. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Association for Computational Linguistics, Hong Kong, China.
Ethayarajh, Kawin and Dan Jurafsky. 2021. Attention flows are shapley value explanations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 49–54, Association for Computational Linguistics, Online.
Ettinger, Allyson. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.
Fakharian, Samin and Paul Cook. 2021. Contextualized embeddings encode monolingual and cross-lingual knowledge of idiomaticity. In Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021), pages 23–32, Association for Computational Linguistics, Online.
Fazly, Afsaneh, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1):61–103.
Frege, Gottlob. 1956. The thought: A logical inquiry. Mind, 65(259):289–311.
Garcia, Marcos. 2021. Exploring the representation of word meanings in context: A case study on homonymy and synonymy. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3625–3640, Association for Computational Linguistics, Online.
Garcia, Marcos, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, and Aline Villavicencio. 2021a. Assessing the representations of idiomaticity in vector models with a noun compound dataset labeled at type and token levels. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2730–2741, Association for Computational Linguistics, Online.
Garcia, Marcos, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, and Aline Villavicencio. 2021b. Probing for idiomaticity in vector space models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3551–3564, Association for Computational Linguistics, Online.
Gow-Smith, Edward, Harish Tayyar Madabushi, Carolina Scarton, and Aline Villavicencio. 2022. Improving tokenisation by alternative treatment of spaces. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11430–11443, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates.
Gulordava, Kristina, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, Association for Computational Linguistics, New Orleans, Louisiana.
Mandera, Paweł, Emmanuel Keuleers, and Marc Brysbaert. 2017. Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92:57–78.
Manning, Christopher D., Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48):30046–30054.
Marvin, Rebecca and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Association for Computational Linguistics, Brussels, Belgium.
McCarthy, Diana, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 73–80, Association for Computational Linguistics, Sapporo, Japan.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, Curran Associates Inc., USA.
Miletic, Filip and Sabine Schulte im Walde. 2023. A systematic search for compound semantics in pretrained BERT architectures. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1499–1512, Association for Computational Linguistics, Dubrovnik, Croatia.
Miletić, Filip and Sabine Schulte im Walde. 2024. Semantics of multiword expressions in transformer-based models: A survey. Transactions of the Association for Computational Linguistics, 12:593–612.
Miller, George A. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Misra, Kanishka, Julia Rayz, and Allyson Ettinger. 2023. COMPS: Conceptual minimal pair sentences for testing robust property knowledge and its inheritance in pre-trained language models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2928–2949, Association for Computational Linguistics, Dubrovnik, Croatia.
Mitchell, Jeff and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In Approaches to Natural Language: Proceedings of the 1970 Stanford Workshop on Grammar and Semantics, pages 221–242, Springer.
Nandakumar, Navnita, Timothy Baldwin, and Bahar Salehi. 2019. How well do embedding models capture non-compositionality? A view from multiword expressions. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pages 27–34, Association for Computational Linguistics, Minneapolis, USA.
Nedumpozhimana, Vasudevan and John Kelleher. 2021. Finding BERT's idiomatic key. In Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021), pages 57–62, Association for Computational Linguistics, Online.
Neelakantan, Arvind, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. 2022. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005.
Nunberg, Geoffrey, Ivan A. Sag, and Thomas Wasow. 1994. Idioms. Language, 70:491–538.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Association for Computational Linguistics, Doha, Qatar.
Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, Association for Computational Linguistics, New Orleans, Louisiana.
Phelps, Dylan, Xuan-Rui Fan, Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, and Aline Villavicencio. 2022. Sample efficient approaches for idiomaticity detection. In Proceedings of the 18th Workshop on Multiword Expressions @LREC2022, pages 105–111, European Language Resources Association, Marseille, France.
Phelps, Dylan, Thomas Pickard, Maggie Mi, Edward Gow-Smith, and Aline Villavicencio. 2024. Sign of the times: Evaluating the use of large language models for idiomaticity detection. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD 2024), Association for Computational Linguistics.
Pires, Telmo, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.
Prasad, Grusha, Marten van Schijndel, and Tal Linzen. 2019. Using priming to uncover the organization of syntactic representations in neural language models. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 66–76, Association for Computational Linguistics, Hong Kong, China.
Rademaker, Alexandre, Valeria de Paiva, Gerard de Melo, Livy Real, and Maira Gatti. 2014. OpenWordNet-PT: A project report. In Proceedings of the Seventh Global Wordnet Conference, pages 383–390, University of Tartu Press, Tartu, Estonia.
Ramisch, Carlos. 2023. Multiword expressions in computational linguistics. Down the rabbit hole and through the looking glass. Aix Marseille University.
Reddy, Siva, Diana McCarthy, and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 210–218, Asian Federation of Natural Language Processing, Chiang Mai, Thailand.
Reimers, Nils and Iryna Gurevych. 2019a. Alternative weighting schemes for ELMo embeddings. CoRR, abs/1904.02954.
Reimers, Nils and Iryna Gurevych. 2019b. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Association for Computational Linguistics, Hong Kong, China.
Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.
Saakyan, Arkadiy, Tuhin Chakrabarty, Debanjan Ghosh, and Smaranda Muresan. 2022. A report on the FigLang 2022 shared task on understanding figurative language. In Proceedings of the 3rd Workshop on Figurative Language Processing (FLP), pages 178–183, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid).
Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2002), pages 1–15, Springer, Berlin, Heidelberg, Mexico City, Mexico.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
van Schijndel, Marten and Tal Linzen. 2018. A neural model of adaptation in reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4704–4710, Association for Computational Linguistics, Brussels, Belgium.
Schuster, Tal, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Association for Computational Linguistics, Minneapolis, Minnesota.
Shwartz, Vered and Ido Dagan. 2019. Still a pain in the neck: Evaluating text representations on lexical composition. Transactions of the Association for Computational Linguistics, 7:403–419.
Sporleder, Caroline, Linlin Li, Philip Gorinski, and Xaver Koch. 2010. Idioms in context: The IDIX corpus. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta.
Tayyar Madabushi, Harish, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, and Aline Villavicencio. 2022. SemEval-2022 task 2: Multilingual idiomaticity detection and sentence embedding. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 107–121, Association for Computational Linguistics, Seattle, United States.
Tayyar Madabushi, Harish, Edward Gow-Smith, Carolina Scarton, and Aline Villavicencio. 2021. AStitchInLanguageModels: Dataset and methods for the exploration of idiomaticity in pre-trained language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3464–3477, Association for Computational Linguistics, Punta Cana, Dominican Republic.
Appendix
In this section we present the mean and standard deviation for the NCs in English and Portuguese
in naturalistic and neutral sentences, for the different probes at Sentence level (Table 10) and at
NC level (Table 11), for Affinities (Table 12) and for Scaled Similarities (Table 13).
PSyn
Model Name       EN-Nat (mean  std)   EN-Neut (mean  std)   PT-Nat (mean  std)   PT-Neut (mean  std)
Word2Vec 0.985 0.012 0.811 0.083 0.968 0.025 0.883 0.062
GloVe 0.990 0.008 0.868 0.063 0.980 0.018 0.931 0.054
ELMo 0.974 0.022 0.841 0.070 0.938 0.045 0.782 0.116
SBERT ML 0.974 0.022 0.810 0.101 0.955 0.035 0.833 0.096
BERT 0.988 0.011 0.927 0.035 0.980 0.017 0.915 0.041
BERT ML 0.992 0.007 0.924 0.040 0.984 0.012 0.929 0.044
DistilBERT ML 0.996 0.003 0.952 0.023 0.991 0.007 0.966 0.018
LLama2 0.992 0.010 0.955 0.020 0.981 0.018 0.903 0.065
PComp
Word2Vec 0.996 0.004 0.941 0.018 0.987 0.011 0.957 0.026
GloVe 0.996 0.003 0.955 0.011 0.993 0.006 0.982 0.012
ELMo 0.989 0.009 0.914 0.019 0.966 0.020 0.890 0.035
SBERT ML 0.990 0.007 0.922 0.021 0.982 0.013 0.929 0.029
BERT 0.992 0.007 0.951 0.016 0.986 0.013 0.933 0.025
BERT ML 0.996 0.003 0.957 0.016 0.993 0.005 0.962 0.016
DistilBERT ML 0.998 0.001 0.977 0.006 0.996 0.002 0.987 0.005
LLama2 0.995 0.008 0.986 0.007 0.991 0.008 0.964 0.020
PWordsSyn
Word2Vec 0.983 0.013 0.797 0.049 0.958 0.031 0.845 0.060
GloVe 0.989 0.009 0.863 0.041 0.974 0.025 0.904 0.062
ELMo 0.975 0.020 0.861 0.048 0.930 0.042 0.760 0.088
SBERT ML 0.977 0.017 0.844 0.057 0.956 0.033 0.855 0.060
BERT 0.983 0.014 0.919 0.032 0.967 0.025 0.891 0.038
BERT ML 0.991 0.006 0.925 0.036 0.983 0.012 0.934 0.032
DistilBERT ML 0.995 0.003 0.952 0.016 0.990 0.006 0.963 0.014
LLama2 0.986 0.014 0.945 0.021 0.977 0.017 0.891 0.052
PRand
Word2Vec 0.984 0.012 0.799 0.043 0.960 0.033 0.851 0.099
GloVe 0.988 0.009 0.849 0.038 0.974 0.026 0.911 0.095
ELMo 0.966 0.025 0.829 0.040 0.912 0.048 0.725 0.115
SBERT ML 0.968 0.023 0.769 0.053 0.935 0.043 0.768 0.063
BERT 0.979 0.018 0.924 0.027 0.956 0.028 0.886 0.033
BERT ML 0.990 0.008 0.925 0.024 0.980 0.013 0.933 0.030
DistilBERT ML 0.995 0.004 0.951 0.012 0.990 0.007 0.967 0.016
LLama2 0.980 0.019 0.937 0.015 0.962 0.026 0.879 0.058
Table 10: Mean and standard deviation at Sentence level for PSyn, PComp, PWordsSyn and PRand,
for English (EN) and Portuguese (PT) for naturalistic (Nat) and neutral (Neut) sentences.
As the NCSyn were selected from the synonyms proposed by the human annotators, and chosen
according to frequency, this led to cases of lexical overlap. Removing the NCs with lexical overlap
with their NCSyn and analysing the correlations for Affinities and Scaled Similarities, the results
are as shown in Tables 14 and 15.
PSyn
Model Name       EN-Nat (mean  std)   EN-Neut (mean  std)   PT-Nat (mean  std)   PT-Neut (mean  std)
Word2Vec 0.517 0.209 0.517 0.207 0.498 0.251 0.488 0.258
GloVe 0.551 0.227 0.555 0.222 0.465 0.278 0.473 0.275
ELMo 0.714 0.147 0.646 0.155 0.629 0.166 0.551 0.192
SBERT ML 0.591 0.208 0.577 0.203 0.632 0.199 0.612 0.198
BERT 0.816 0.086 0.854 0.060 0.824 0.090 0.831 0.079
BERT ML 0.876 0.061 0.861 0.059 0.880 0.056 0.866 0.063
DistilBERT ML 0.867 0.058 0.864 0.057 0.868 0.059 0.870 0.056
LLama2 0.702 0.189 0.612 0.200 0.533 0.216 0.589 0.205
PComp
Word2Vec 0.840 0.039 0.838 0.039 0.714 0.269 0.703 0.280
GloVe 0.835 0.041 0.837 0.040 0.715 0.276 0.710 0.282
ELMo 0.859 0.042 0.823 0.040 0.781 0.080 0.733 0.093
SBERT ML 0.815 0.042 0.805 0.038 0.823 0.050 0.808 0.052
BERT 0.849 0.060 0.886 0.037 0.855 0.066 0.864 0.041
BERT ML 0.923 0.022 0.913 0.020 0.930 0.021 0.921 0.023
DistilBERT ML 0.922 0.015 0.922 0.013 0.929 0.018 0.931 0.014
LLama2 0.828 0.102 0.844 0.086 0.741 0.174 0.749 0.174
PWordsSyn
Word2Vec 0.524 0.098 0.524 0.097 0.459 0.185 0.450 0.189
GloVe 0.569 0.119 0.572 0.116 0.356 0.196 0.357 0.198
ELMo 0.759 0.083 0.707 0.091 0.644 0.100 0.557 0.110
SBERT ML 0.659 0.112 0.645 0.112 0.670 0.119 0.662 0.122
BERT 0.780 0.105 0.850 0.064 0.783 0.077 0.820 0.054
BERT ML 0.881 0.035 0.867 0.040 0.887 0.035 0.877 0.039
DistilBERT ML 0.870 0.029 0.868 0.027 0.875 0.027 0.877 0.026
LLama2 0.668 0.148 0.601 0.151 0.490 0.137 0.560 0.118
PRand
Word2Vec 0.419 0.064 0.423 0.065 0.460 0.185 0.371 0.151
GloVe 0.413 0.108 0.419 0.108 0.356 0.196 0.293 0.219
ELMo 0.674 0.082 0.628 0.069 0.644 0.100 0.482 0.097
SBERT ML 0.479 0.067 0.473 0.067 0.670 0.119 0.479 0.072
BERT 0.746 0.117 0.855 0.061 0.783 0.077 0.808 0.037
BERT ML 0.872 0.031 0.872 0.028 0.887 0.035 0.883 0.032
DistilBERT ML 0.879 0.024 0.879 0.021 0.875 0.027 0.898 0.021
LLama2 0.631 0.100 0.568 0.105 0.490 0.137 0.544 0.102
Table 11: Mean and standard deviation at NC level for PSyn, PComp, PWordsSyn and PRand, for
English (EN) and Portuguese (PT) for naturalistic (Nat) and neutral (Neut) sentences.
The results are compatible with those of Tables 5 and 6 for the complete set of NCs. As expected,
the correlations are smaller and less significant than those obtained for the full set, since with the
removal of the NCs with lexical overlap a smaller set was used to calculate correlations. The
ultimate test will be to redo the analysis with the full list of
NCs but only using NCSyn without lexical overlap, but this requires additional human annotation
and is left for future work.
ASyn|WordsSyn
Model Name       EN-Nat (mean  std)   EN-Neut (mean  std)   PT-Nat (mean  std)   PT-Neut (mean  std)
Word2Vec -0.002 0.149 0.004 0.156 0.025 0.152 0.038 0.160
GloVe -0.009 0.166 -0.006 0.170 0.058 0.193 0.072 0.193
ELMo -0.023 0.108 -0.040 0.134 -0.003 0.124 0.008 0.182
SBERT ML -0.036 0.160 -0.051 0.178 -0.019 0.154 -0.036 0.176
BERT 0.021 0.090 0.006 0.067 0.027 0.077 0.017 0.073
BERT ML -0.002 0.044 -0.003 0.056 -0.003 0.044 -0.008 0.059
DistilBERT ML -0.001 0.041 -0.003 0.045 -0.003 0.047 -0.002 0.049
LLama2 0.020 0.137 0.011 0.154 0.024 0.169 0.021 0.166
ASyn|Rand
Word2Vec 0.049 0.156 0.054 0.162 0.076 0.165 0.074 0.182
GloVe 0.070 0.177 0.077 0.179 0.110 0.213 0.100 0.219
ELMo 0.024 0.116 0.015 0.135 0.051 0.127 0.062 0.198
SBERT ML 0.059 0.173 0.072 0.186 0.081 0.165 0.099 0.171
BERT 0.040 0.103 0.001 0.065 0.057 0.085 0.026 0.070
BERT ML 0.003 0.048 -0.006 0.052 0.000 0.042 -0.011 0.054
DistilBERT ML -0.005 0.047 -0.007 0.049 -0.012 0.044 -0.015 0.046
LLama2 0.042 0.124 0.031 0.133 0.064 0.165 0.034 0.162
Table 12: Mean and standard deviation at NC level for ASyn|WordsSyn and ASyn|Rand, for English
(EN) and Portuguese (PT) for naturalistic (Nat) and neutral (Neut) sentences.
SimR|Syn
Model Name       EN-Nat (mean  std)   EN-Neut (mean  std)   PT-Nat (mean  std)   PT-Neut (mean  std)
Word2Vec 0.164 0.365 0.159 0.362 0.221 0.356 0.183 0.373
GloVe 0.221 0.407 0.220 0.406 0.264 0.391 0.225 0.424
ELMo 0.076 0.512 0.012 0.470 0.154 0.384 0.104 0.412
SBERT ML 0.190 0.441 0.172 0.429 0.259 0.419 0.244 0.395
BERT 0.075 0.735 -0.166 0.659 0.289 0.486 0.098 0.437
BERT ML -0.024 0.533 -0.128 0.525 -0.057 0.510 -0.194 0.566
DistilBERT ML -0.147 0.566 -0.166 0.544 -0.257 0.589 -0.320 0.618
LLama2 0.194 0.506 0.095 0.466 0.129 0.389 0.056 0.448
SimR|WordsSyn
Word2Vec 0.173 0.182 0.167 0.181 0.165 0.187 0.124 0.204
GloVe 0.245 0.236 0.243 0.233 0.113 0.237 0.061 0.246
ELMo 0.231 0.276 0.193 0.272 0.185 0.234 0.118 0.262
SBERT ML 0.336 0.231 0.315 0.230 0.339 0.258 0.340 0.256
BERT 0.092 0.307 -0.058 0.272 0.169 0.275 0.057 0.238
BERT ML 0.034 0.294 -0.068 0.357 0.007 0.296 -0.098 0.366
DistilBERT ML -0.105 0.284 -0.104 0.246 -0.196 0.300 -0.244 0.304
LLama2 0.094 0.380 0.058 0.368 0.099 0.240 0.039 0.255
Table 13: Mean and standard deviation at NC level for SimR|Syn and SimR|WordsSyn, for English
(EN) and Portuguese (PT) for naturalistic (Nat) and neutral (Neut) sentences.
Table 15: Spearman ρ correlation between the Scaled Similarities and human judgments, for
SimR|Syn and SimR|WordsSyn in both English and Portuguese after removing NCs with lexical
overlap between NC and NCSyn . Non-significant (p > 0.05) results were omitted from the table.
To mitigate the impact of rogue dimensions, a standardisation process using z-scores31 was applied
as proposed by Timkey and van Schijndel (2021). The mean vector µ was calculated across the
NC sentences and subtracted from each embedding vector to center the data. Each dimension of
the embedding was divided by its standard deviation σ.
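A minimal sketch of this standardisation is given below; selecting the rogue dimensions by the magnitude of their mean activation is an assumed proxy for the contribution-based criterion of Timkey and van Schijndel (2021), and the helper name is illustrative.

# Sketch: z-score standardisation of the top-k "rogue" dimensions of a set of
# NC representations from one model/layer (the ranking criterion is assumed).
import numpy as np

def standardise_rogue_dims(embs: np.ndarray, top_k: int = 3) -> np.ndarray:
    """embs: (n_items, dim) matrix of representations."""
    mu = embs.mean(axis=0)
    sigma = embs.std(axis=0) + 1e-12
    rogue = np.argsort(-np.abs(mu))[:top_k]       # assumed proxy for rogue dimensions
    out = embs.copy()
    out[:, rogue] = (embs[:, rogue] - mu[rogue]) / sigma[rogue]   # z = (x - mu) / sigma
    return out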
To assess the impact of standardisation, Spearman correlation was calculated between the PSyn
cosine similarities before and after standardisation:
• Pre-standardisation: Cosine similarities calculated using the original representations.
• Post-standardisation: Cosine similarities recalculated after standardisation.
The results are reported in Table 16.
31 z = (x − µ)/σ.