Investigating Idiomaticity in Word Representations

Wei He, Tiago Kramer Vieira
Idiomatic expressions are an integral part of human languages, often used to express complex
ideas in compressed or conventional ways (e.g. eager beaver as a keen and enthusiastic person).
However, their interpretations may not be straightforwardly linked to the meanings of their
individual components in isolation and this may have an impact for compositional approaches.
In this paper, we investigate to what extent word representation models are able to go beyond
compositional word combinations and capture multiword expression idiomaticity and some of
the expected properties related to idiomatic meanings. We focus on noun compounds of varying
levels of idiomaticity in two languages (English and Portuguese), presenting a dataset of minimal
pairs containing human idiomaticity judgments for each noun compound at both type and token
levels, their paraphrases and their occurrences in naturalistic and sense-neutral contexts, totalling
32,200 sentences. We propose this set of minimal pairs for evaluating how well a model captures
idiomatic meanings, and define a set of fine-grained metrics of Affinity and Scaled Similarity, to
determine how sensitive the models are to perturbations that may lead to changes in idiomaticity.
Affinity is a comparative measure of the similarity between an experimental item, a target and
a potential distractor, and Scaled Similarity incorporates a rescaling factor to magnify the
meaningful similarities within the spaces defined by each specific model. The results obtained with
a variety of representative and widely used models indicate that, despite superficial indications
to the contrary in the form of high similarities, idiomaticity is not yet accurately represented in
current models. Moreover, the performance of models with different levels of contextualisation
suggests that their ability to capture context is not yet able to go beyond more superficial lexical
clues provided by the words and to actually incorporate the relevant semantic clues needed
for idiomaticity. By proposing model-agnostic measures for assessing the ability of models to
capture idiomaticity, this paper contributes to determining limitations in the handling of non-
compositional structures, which is one of the directions that needs to be considered for more
natural, accurate and robust language understanding. The source code and additional materials
related to this paper are available at our GitHub repository.1
1. Introduction
The evolution of word representation models has resulted in models with seemingly remarkable
language abilities. Not surprisingly, these models have been found to store a wealth of linguistic
information (Henderson 2020; Manning et al. 2020; Vulić et al. 2020; Lenci et al. 2022), displaying
high levels of performance on various tasks: even static models are able to detect semantic
similarities between different words (Lin 1999; Mikolov et al. 2013; Baroni, Dinu, and Kruszewski
2014), while contextualised models can group representations into clusters that seem to correspond
to the various senses of a word (Schuster et al. 2019) and can be matched to specific sense
definitions (Chang and Chen 2019). While substantial evaluation efforts
have concentrated on word and subword units and on larger compositional combinations derived
from them, there is less understanding of their ability to handle less compositional structures,
such as those found in multiword expressions (MWEs), like noun compounds (NCs) (Garcia et al.
2021a), verb-noun combinations (King and Cook 2018; Hashempour and Villavicencio 2020) and
idioms (Yu and Ettinger 2020; Dankers, Lucas, and Titov 2022). Indeed, MWEs include a variety
of distinct phenomena and have been described as interpretations that cross word boundaries (Sag
et al. 2002), whose meanings are not always straightforwardly derivable from the meanings of
their individual components. Moreover, although they include, on the one hand, more transparent
and compositional expressions (like salt and pepper) or expressions with implicit relations (like
olive oil as oil made from olives), on the other hand they also include more idiomatic expressions
(like eager beaver as a person who is willing to work very hard2 ), falling into a continuum
of idiomaticity3 (Sag et al. 2002; Fazly, Cook, and Stevenson 2009). This leads to potential
problems for models if they follow the Principle of Compositionality (Frege 1956; Montague
1973), building the meaning of a larger unit (like a sentence or an expression) from a combination
of the individual meanings of the words that are contained in it, as this would result in potentially
incomplete or incorrect interpretation for more idiomatic cases (e.g. the idiomatic eager beaver
interpreted literally as impatient rodent). Although understanding the meaning of an MWE may
require knowledge that goes beyond that of the meanings of these individual words in isolation
(Nunberg, Sag, and Wasow 1994), failure to take idiomaticity into account can affect the quality
of downstream tasks (Sag et al. 2002; Constant et al. 2017; Cordeiro et al. 2019) such as reasoning
and inference (Chakrabarty, Choi, and Shwartz 2022; Chakrabarty et al. 2022; Saakyan et al.
2022), information retrieval (Acosta, Villavicencio, and Moreira 2011) and machine translation
(Dankers, Lucas, and Titov 2022). For machine translation, for example, the degree of idiomaticity
1 https://github.com/risehnhew/Finding-Idiomaticity-in-Word-Representations
2 Definition from the Cambridge dictionary.
3 We understand idiomaticity as semantic opacity, and its continuum as different degrees of opacity.
and ambiguity of MWEs (literal vs. idiomatic usages) were found to have an impact on the quality
of the results obtained (Dankers, Lucas, and Titov 2022). Due to their non-compositional nature,
idiomatic expressions result in lower quality translations than literal expressions, as evidenced
by lower BLEU scores for translations that are paraphrased rather than translated word-for-word.
In this paper, we investigate to what extent widely used word representation models are able to
capture idiomaticity in MWEs. We focus, in particular, on their initial abilities for representing
idiomaticity, looking at noun compounds of varying degrees of idiomaticity.4 In addition to the
complex interactions between MWEs, their component words and their contexts (Sag et al. 2002),
characteristics of languages and of word representation models may affect how accurately MWEs
can be represented and processed, and we investigate the impact of some of these factors for
compounds in two different languages (English and Portuguese).
One of the challenges is that uncovering how word representation models capture a specific
type of knowledge is a non-trivial problem (Vulić et al. 2020), and may depend on factors like the
particular model and the way it encodes different types of linguistic information (Yu and Ettinger
2020). For instance, while in Transformer-based models, the initial layers seem to represent
more lexical level knowledge and the final layers seem to capture more semantic and pragmatic
information (Rogers, Kovaleva, and Rumshisky 2020), determining where phenomena which sit
at the interface of various levels are encoded, like multiword expressions (Sag et al. 2002), is
challenging since they could potentially involve information distributed across different layers.
Moreover, the possible findings from an investigation about where in the architecture of a given
model idiomaticity is encoded, or about the role of particular components in representing it
may not generalise to other models and architectures. In this paper we propose instead a set of
model-agnostic idiomatic probes for assessing the representation of idiomaticity. These probes
contain NCs of different levels of idiomaticity, ranging from idiomatic to compositional cases,
which form the basis for minimal pairs. In these pairs, one sentence contains an NC and the other
contains a semantically related item (such as a synonym) or a distractor. The hypothesis is that if a
model is able to accurately represent an NC, higher similarities will be observed for minimal pairs
involving NCs and their synonyms (e.g. for the idiomatic eager beaver and hardworking person).
Conversely, for minimal pairs with variants that may incorporate changes in meaning, such as
those containing NCs and synonyms of their individual component words (e.g. the idiomatic eager
beaver and impatient rodent) or other distractors, lower similarities should be observed.
As word representation models may form spaces that are anisotropic (Ethayarajh and Jurafsky
2021) with representations concentrating on parts of the space, or may have rogue dimensions
that dominate similarity measures (Timkey and van Schijndel 2021), these could lead to high
similarities overall (Liu et al. 2020), affecting the ability to distinguish meaningful similarities
from spurious ones arising from specific characteristics of a given space. In this paper, we propose
two new measures to assess idiomaticity within a model while taking into account its potential for
high similarities. The first, Assessment of Feature Familiarity and Idiomatic Nuance by Interpreting
Target Yielding (Affinity), takes two representations of different levels of relatedness to a given
target, and can be used to determine if a model accurately reflects their degree of similarity to
the target. Focusing on idiomaticity, we use Affinity to assess if greater similarities are observed
for NCs and related words (in this case their synonyms), than for NCs and other potentially less
related alternatives including distractors. The second measure, Scaled Similarity, determines a
4 We use the off-the-shelf publicly available pre-trained versions of widely adopted word representation models,
standard operations and common similarity measures. Even in scenarios in which adopting additional optimisations,
more complex operations or fine-tuning could lead to improvements in performance, this may depend on the
availability of comprehensive training data for the target model, domain and language. Measuring the initial
idiomatic abilities of models can help understand the potential loss of idiomatic meaning that could be propagated to
the downstream tasks that use them off-the-shelf.
new lowerbound for a given space in terms of similarities for unrelated representations, rescaling
the space to help distinguish them from the meaningful similarities for related representations.
For idiomaticity, we analyse the similarities between the NCs and their synonyms adopting the
similarities between the NCs and random items as a new lowerbound. These measures of Affinity
and Scaled Similarity do not directly address the problem of rogue dimensions, and we discuss
this further in the Conclusions section.
Using these metrics and minimal pairs for evaluation, this paper presents a fine-grained
analysis of the ability of a model to capture idiomaticity. The main contributions of this paper are:
• The Noun Compound Idiomaticity Minimal Pairs (NCIMP) Dataset, a dataset of minimal
pair sentences containing NCs of varying levels of idiomaticity, along with human judgments
about the degree of NC idiomaticity and gold standard paraphrases, at both type and token
level. In total, the dataset contains 32,200 sentences for two languages (19,600 in English
and 12,600 in Portuguese).5
• A comparative measure of Affinity to help determine how accurately idiomaticity is
incorporated in these representations contrasting similarities for semantically related and
unrelated representations.
• A novel model-agnostic measure of Scaled Similarity, which rescales a space in relation to a
new lowerbound taking into account expected similarities among random items to magnify
meaningful similarities among semantically related representations.
• In-depth analyses of the representation of idiomaticity in widely used word representation
models, examining their ability to display sensitivity to changes in idiomaticity.
5 This work extends the idiomatic probes proposed by Garcia et al. (2021b) and the type and token annotations by
Garcia et al. (2021a), also introducing new measures, additional tests and substantially expanding the analyses with
new baselines and results from a larger set of models.
The remainder of this paper is organised as follows: Section 2 presents related work, and
Section 3 the NCIMP dataset (Section 3.1), the models (Section 3.2) and the proposed idiomatic
probes and measures (Section 3.3). Finally, in Section 4 we discuss the results of our experiments
and draw conclusions in Section 5.
2. Related Work

A variety of vector models have been used to investigate the representation of multiword
expressions (MWEs), ranging from static to contextualised representations, each with its own
set of challenges (Contreras Kallens and Christiansen 2022; Garcia et al. 2021a; Liu and Neubig
2022). The former include models like Word2Vec (Mikolov et al. 2013), GloVe (Pennington,
Socher, and Manning 2014) and fastText (Bojanowski et al. 2017), which represent words at
type-level, producing a single vector for each word that conflates all its senses. At this level,
MWEs are often represented based on their overall syntactic and semantic properties as they are
generally understood, without taking into account the variability of contexts. For example, both
the literal and the idiomatic meaning of gold mine6 would be represented jointly in a single vector
regardless of its use in any specific sentence. At the other end of the scale are the contextualised
models, from ELMo (Peters et al. 2018), BERT (Devlin et al. 2019) and GPT-3 (Brown et al. 2020)
to LLaMA (Touvron et al. 2023) and other large language models, which produce token-level
dynamic representations dedicated to capturing specific usages of a word in a particular context,
resulting in several vectors for each word (Lenci et al. 2022; Apidianaki 2022). Token-level
representations focus on the specific occurrences of words or subwords within contexts, and how
their meaning or function may vary or be influenced by the surrounding text. Therefore, they have
the potential for accurately representing MWEs, capturing the interdependence of the idiomatic
meaning on a particular configuration of words, while also anchoring the MWEs in relation to their
immediate linguistic environment. The primary challenge at token-level is accurately determining
the presence, meaning and role of MWEs in specific contexts, especially when they have possibly
multiple literal and idiomatic readings or when they are part of complex syntactic structures (Zeng
and Bhat 2021).
Evaluation of successive generations of word representation models, ranging from static
(Landauer and Dumais 1997; Lin 1999; Baroni and Lenci 2010; Mikolov et al. 2013; Bojanowski
et al. 2017) to contextualised models (Peters et al. 2018; Devlin et al. 2019; Brown et al.
2020; Touvron et al. 2023), has devoted considerable attention to their linguistic abilities
(Mandera, Keuleers, and Brysbaert 2017; Wang et al. 2018; Henderson 2020; Rogers, Kovaleva,
and Rumshisky 2020; Lenci et al. 2022). On lexical semantics, the representations extracted
from contextualised models seem to be able to reflect word senses in clusters of vectors (e.g.,
Wiedemann et al. (2019) for BERT) including in cross-lingual alignments involving polysemous
words (e.g., Schuster et al. (2019) for ELMo). However, controlled evaluations of different
generations of word representation models in uniform settings have also reported strong
performances from static models, which were able to outperform contextualised models in most
tasks (Lenci et al. 2022).
6 “Opportunity for making a lot of money” (definition from the Cambridge dictionary).
7 See Miletić and Walde (2024) for a recent survey on the representation of MWEs in Transformer-based models.
In addition to the level of contextualisation, the performance of vector space models may also
be affected by the way the target words of an expression are composed, with functions like sum,
concatenation and multiplication used for combining the words of static models (Cordeiro et al.
2019; Mitchell and Lapata 2010; Reddy, McCarthy, and Manandhar 2011) or the subwords of
contextualised models (Garcia et al. 2021b). For the embeddings extracted from language models,
other potential sources of variation include which input is given to the model (e.g., one vs. several
sentences including the target MWE in evaluations at the type level), or the number of layers that
will be taken into account to obtain the vector representation (Miletić and Walde 2024). In this
regard, the intermediate and last layers seem to encode more semantic information at the token
level (Tenney, Das, and Pavlick 2019; Garcia 2021), while other evaluations at the type level found
that averaging the initial layers for the target expressions achieved the best results (e.g., Miletic
and Schulte im Walde (2023) for NCs and Vulić et al. (2020) for single-word semantic tasks). With
respect to semantic composition, Yu and Ettinger (2020) explored the type-level representation
of two-word phrases (which in many cases correspond to NCs like the ones used in our study) in
various contextualised models, showing that phrase representations miss compositionality effects
as they heavily rely on word content. Similar conclusions, for neural machine translation, can
be inferred from Dankers, Lucas, and Titov (2022). While some of these evaluations rely on
substitutivity and the changes to a larger phrase representation caused by substitutions to its
constituents (Garcia et al. 2021b; Yu and Ettinger 2020), alternatively, the notion of localism has
also been analysed (Liu and Neubig 2022) focusing on whether the operations of a model are
local (Hupkes et al. 2021), that is, the extent to which the representation of a phrase is derivable
from its local structure.
Crucially, a substantial number of the discussed studies evaluate idiomaticity at the type
level, i.e., they obtain the embedding of a given MWE by averaging its representations in several
sentences that have been previously extracted automatically. A more detailed controlled
comparison of type-level and token-level idiomaticity reported compatible results for both levels,
with type-level being a close approximation for token-level (Garcia et al. 2021a) in sentences
where the NC occurs with the same sense. Further analysis of the occurrences of these NCs in
fine-grained sense annotations of literal and idiomatic usages (Tayyar Madabushi et al. 2021)
provided additional confirmation that the ability of contextualised models to capture idiomaticity
during pre-training was limited, with approaches for building single token representations (Phelps
et al. 2022) and for fine-tuning leading to more accurate representations (Tayyar Madabushi
et al. 2022). Recent alternatives for representing idiomatic expressions also include adding a
new adapter module which has been developed and trained to recognise idioms (Zeng and Bhat
2022). This module functions as a language expert for idioms, augmenting the learning process of
BART (Lewis et al. 2019) with additional information, and this approach effectively improves the
representation of idiomatic expressions in off-the-shelf pre-trained language models, equipping
them with greater ability to navigate the intricacies of natural language. Zeng and Bhat (2023) also
proposed PIER+, a language model improvement for handling both literal and figurative language.
This is achieved by combining a base model with an additional curriculum learning framework
that gradually introduces more complex potentially idiomatic expressions. Compared to other
models, PIER+ demonstrates better performance at identifying, understanding, and maintaining
proficiency in both types of expressions. Finally, Zeng et al. (2023) introduce a knowledge graph
designed to enhance the understanding of idiomatic expressions, which integrates commonsense
knowledge to aid in deciphering the non-literal meanings of idioms. This work demonstrates
how to inject MWE-related knowledge into pre-trained language models effectively. However,
it is still unclear to what extent the context and its representation in contextualised models are
able to capture idiomatic meanings.
Shedding some light on these questions requires a more controlled evaluation setup and measures
that can abstract away from the particularities of these word representation spaces. In this effort,
we take inspiration in psycholinguistic methodologies, which have been traditionally used to
examine how humans process language in controlled experimental setups, to allow the removal
of obvious biases and potentially confounding factors from evaluations (Linzen, Dupoux, and
Goldberg 2016; Gulordava et al. 2018). They also enable comparative analyses of performance in
artificially constructed but controlled sentences and in naturally occurring sentences.
Setups like these have been used, for instance, to investigate how models represent syntax,
if they understand negation (van Schijndel and Linzen 2018; Prasad, van Schijndel, and Linzen
2019; Ettinger 2020; Kassner and Schütze 2020), and if they are aware of which properties are
relevant for which concepts (Misra, Rayz, and Ettinger 2023). Adopting evaluation protocols that
use minimal pair sentences (e.g., Warstadt et al. (2020); Misra, Rayz, and Ettinger (2023)) allows
for a controlled comparison of the target item against carefully selected distractors that may share
linguistic properties with them. For instance, a dataset of Conceptual Minimal Pair Sentences
(COMPS) was used to compare the performance of 22 large language models including both
masked language models (like BERT) and autoregressive language models (like GPT-2), where
the models have to validate which of two concepts a given property belongs to (e.g. stripes for
zebras vs. oaks). Although the models seem to obtain relatively high accuracies for attributing
properties to concepts, when semantically related concepts are involved or distractors are included,
performances drop substantially, and go below chance even for models like GPT-3 (Misra, Rayz,
and Ettinger 2023). Similarly, in targeted syntactic evaluation (Marvin and Linzen 2018), models
are assessed using minimal pairs datasets focused on specific syntactic phenomena, such as those
included in the BLiMP dataset for English (Warstadt et al. 2020). Analyses like these highlight
the importance of adding controls to the experimental setup to distinguish seemingly sophisticated
behaviour with high performances that give the illusion of knowledge from robust understanding
with access to meaning (Misra, Rayz, and Ettinger 2023; de Dios-Flores, Garcia Amboage, and
Garcia 2023). With this in mind, we follow Garcia et al. (2021b) and use minimal pairs to propose
a set of intrinsic evaluations including probes and affinity measures aimed at gaining a better
understanding of how vector space models represent MWEs with different degrees of semantic
compositionality in context.
Concerning experimental data, the first datasets to evaluate computational models were composed
of different types of multiword expressions annotated at the type-level (McCarthy, Keller, and
Carroll 2003; Venkatapathy and Joshi 2005). Further studies released annotations of MWEs in
context, such as the VNC-tokens dataset (Cook, Fazly, and Stevenson 2008), which includes
60 English verb-noun combinations occurring in almost 3,000 sentences annotated as idiomatic
or literal, or the IDIX corpus (Sporleder et al. 2010), with almost 6,000 labeled sentences of
78 expressions extracted from the BNC. Using a crowdsourcing platform, Reddy, McCarthy,
and Manandhar (2011) released a dataset with numerical ratings of the compositionality degree
of 90 noun compounds in English, which also includes the contribution of each component to
the meaning of the MWEs. Similar efforts were carried out for other languages, such as the
GhoSt-NN dataset for German (Schulte im Walde et al. 2016), or the NC Compositionality (NCC)
dataset (Cordeiro et al. 2019), which expanded the resource provided by Reddy, McCarthy, and
Manandhar (2011) with additional NCs for English, and new data for Portuguese and French.
Semi-automatic techniques combined with crowdsourced annotations were used to compile
MAGPIE (Haagsma, Bos, and Nissim 2020), a large resource of more than 50,000 sentences
with binary annotations at the token level of potentially idiomatic expressions. Similarly, the
AStitchInLanguageModels dataset (Tayyar Madabushi et al. 2021), used in the “SemEval-2022
Task 2: Multilingual Idiomaticity Detection and Sentence Embedding” (Tayyar Madabushi et al.
2022), also contains potentially idiomatic expressions annotated in naturalistic sentences.
Recently, Garcia et al. (2021a) and Garcia et al. (2021b) enriched the English and Portuguese
data of the NCC dataset with crowdsourced annotations of the compositionality degree of noun
compounds and their components at the token level, paraphrases of the NCs in context, and
different types of controlled replacements. These variants compose a large set of minimal pairs
which allow for the systematic exploration of the representation of idiomaticity in vector space
models.8
In this paper, we adopt the minimal pairs paradigm as one of the bases for the evaluation
and present the Noun Compound Idiomaticity Minimal Pairs dataset, which contains a set of
idiomatic probes to explore to what extent idiomaticity is captured in word representation models.
To do so, we rely on the datasets for English and Portuguese by Garcia et al. (2021a) and Garcia
et al. (2021b) and extend them with new semantically related variants and distractors and sets of
minimal pairs as discussed in the next section to conduct in-depth intrinsic evaluations.
3.1 The NCIMP Dataset

The Noun Compound Idiomaticity Minimal Pairs (NCIMP) dataset contains 32,200 sentences
targeting two-word NCs in two languages, 280 in English (EN) and 180 in Portuguese (PT), with
idiomatic (e.g. gravy train9 ), partly compositional (e.g., grandfather clock10 ), and compositional
(e.g., research project) NCs.11 For each NC, the dataset contains minimal pairs formed by a first
sentence with the target NC and a second sentence where the NC was replaced by an experimental
item. These experimental items were selected on the basis of MWE properties, like more limited
substitutability (or greater lexical fixedness), and can be used to determine if models are sensitive
to perturbations to these properties, and if this is affected by how idiomatic the NCs are. For
example, depending on the degree of lexical fixedness of an NC, the variants generated may not
fully retain its original meaning (e.g. panda car12 and ?bear automobile). In particular, we analyse
the following:
• $NC_{Syn}$: the minimal pairs are formed by the NC being replaced by one of the gold
standard synonyms provided holistically for the NC by the annotators (e.g. brain for grey
matter). In this case, we adopted the synonyms provided by the Noun Compound Senses
(NCS) dataset (Garcia et al. 2021b), which were selected on the basis of the most frequent
paraphrases given by native speaker annotators. These pairs are used to assess if the models
8 We refer to Ramisch (2023) for a recent review on MWEs processing, including datasets, and to Schulte im Walde
(2023) for a comprehensive overview on compositionality ratings for MWEs.
9 Referring to an easy way of making money without doing much work (Cambridge Dictionary).
10 A type of tall free-standing clock.
11 The NCIMP dataset is based on the Noun Compound Senses (Garcia et al. 2021b), the Noun Compound Type and
Token Idiomaticity (Garcia et al. 2021a) and the NC Compositionality (Cordeiro et al. 2019) datasets, significantly
extending them with new data.
12 Referring to a police car.
provide similar representations for NCs and their synonyms, even if they involve lexically
diverse surface forms.
• $NC_{WordsSyn}$: minimal pairs where each component word of the NC is replaced individually
by a synonym, generating new two-word compositional replacements (e.g. forming
alligator sobs for the NC crocodile tears by substituting alligator for crocodile and sobs for
tears). The synonyms were manually selected from WordNet (Miller 1995) for English,
and OpenWordNet (Rademaker et al. 2014) for Portuguese, and from online dictionaries of
synonyms where additional coverage was required. In cases of ambiguity (due to polysemy
or homonymy), the most common meaning of each component was selected. For each
NC, 5 compositional replacements were generated. These pairs are used to evaluate how
sensitive a model is to the conventionality and lexical fixedness of these NCs, especially the
more idiomatic ones, and if it can detect when the (idiomatic) meaning changes with the
replacements.
• $NC_{Comp}$: the minimal pairs are formed by replacing the NC by only one of its component
words, i.e., replacing the NC by its head in one minimal pair, and by the modifier in the other
pair (e.g. crocodile for crocodile tears and tears for crocodile tears). These pairs are used
to explore if the models can distinguish the cases where the meaning of an NC is related to
the meaning of a component (in more compositional cases) from those where it is not (in
more idiomatic cases).
• $NC_{Rand}$: the random replacement controlled by frequency is a two-word expression
in which the words are chosen to match the frequencies of the components of the target
NC. The frequency values were extracted from corpora (in this case ukWaC and brWaC)
as follows: we averaged the frequency of each NC and of its components ($f_{avg} = (f_{NC} +
f_{w1} + f_{w2})/3$), and extracted the compound with the closest average value (e.g. police car
and supermarket city), as sketched in the code after this list. For each NC, 5 random
replacements were used for each sentence. These pairs are used as controls to determine the
lowerbound similarities for the target NCs, avoiding the potential impact of any differences
in frequency.
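As an illustration of this selection procedure, here is a minimal sketch in Python; the frequency dictionary, candidate pool and function names are hypothetical placeholders rather than the dataset's actual pipeline.

```python
# Minimal sketch of the frequency-matched random replacement selection.
# `freq` is assumed to map expressions and words to corpus frequencies
# (e.g. counts from ukWaC or brWaC); all names here are illustrative.

def average_frequency(expression, freq):
    """f_avg = (f_NC + f_w1 + f_w2) / 3 for a two-word expression."""
    w1, w2 = expression.split()
    return (freq.get(expression, 0) + freq.get(w1, 0) + freq.get(w2, 0)) / 3

def closest_random_replacement(target_nc, candidates, freq):
    """Return the candidate two-word expression whose average frequency
    is closest to that of the target NC."""
    target_avg = average_frequency(target_nc, freq)
    return min(candidates,
               key=lambda cand: abs(average_frequency(cand, freq) - target_avg))
```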
The NCs were pre-selected by experts trying to maintain a balance between the 3 classes
(idiomatic, partial, and compositional),13 and they appear in the context of three naturalistic
sentences (Nat) from corpora that exemplify the same compound sense (Garcia et al. 2021a).
Using Amazon Mechanical Turk (for English) and a dedicated custom-built online platform (for
Portuguese), compositionality scores for each NC and its components were obtained following the
procedure of Reddy, McCarthy, and Manandhar (2011) and Cordeiro et al. (2019). A Likert scale
from 0 (idiomatic) to 5 (compositional) was used for the human judgements, and the resulting
scores were aggregated as the average over the different annotators (Garcia et al. 2021a).14
The annotators also provided synonyms or paraphrases for the NCs in these sentences, which
were used by language experts to manually generate the N CSyn variants (Garcia et al. 2021a).
These annotations, including the synonyms, were collected at two levels of granularity: a more
fine-grained token level, where annotations for each sentence are collected individually, and a
more rough-grained type level, where a single annotation for each NC is collected considering
all three sentences at once (Garcia et al. 2021b). This allows for analyses of the impact of each
individual context in the interpretation of the NC. A total of 8,725 annotations was obtained for
English (421 annotators, each labelling an average of 21 sentences, resulting in 10.4 annotations
13 The two-word compounds were selected to be representative cases of compositional NCs (meaning related to the two
words), partly idiomatic (meaning related to one of the words) and idiomatic (meaning unrelated to either of the two
words), as our aim is to investigate to what extent the degree of idiomaticity affects the ability of models to generate an
accurate representation. For English, the dataset contains 103, 88, and 89 idiomatic, partial, and compositional
expressions respectively, while for Portuguese it has 60 NCs per class.
14 On average, the compositionality scores were 0.95/2.34/4.13 for English, and 1.52/2.46/3.61 for Portuguese
(idiomatic/partial/compositional).
per sentence). In Portuguese, 5,091 annotations were provided by 33 annotators (with an average
of 154 annotated sentences per annotator, and 9.4 annotations per sentence).
# | NC | Sentence
1 | Original | John Paul II was an effective front man for the catholic church.
2 | $NC_{Syn}$ | John Paul II was an effective representative for the catholic church.
3 | $NC_{WordsSyn}$ | John Paul II was an effective forepart woman for the catholic church.
4 | $NC_{Comp}$ | John Paul II was an effective man for the catholic church.
  |  | John Paul II was an effective front for the catholic church.
5 | $NC_{Rand}$ | John Paul II was an effective battlefront serviceman for the catholic church.

Table 1: Naturalistic sentence containing the NC front man (in row 1) forming minimal pairs with
the sentences in rows 2-4, and with a control random baseline in row 5.
In addition, NCIMP also contains sense-neutral sentences (Neut) in which the NCs appear
in uninformative contexts containing only 5 words and following the pattern This is a/an <NC>
for English (e.g. “This is an eager beaver”) and the Portuguese equivalent Este/a é um(a) <NC>.15
These neutral contexts can be used to examine how much contextual information is added to a
representation in the more informative naturalistic contexts. Moreover, as some NCs may have
more than one meaning (e.g. fish story as either the literal aquatic tale or the idiomatic big lie),
they can also be used to determine the default usage elicited for the NC in the absence of any
informative contextual clues, in particular, whether it leans towards an idiomatic or a literal sense,
potentially serving as an indication of the predominant sense sampled during training.
Experts (native or near-native speakers with background in Linguistics) reviewed both
the naturalistic and the sense-neutral sentences in the minimal pairs, editing them if needed
for preserving grammaticality after substitution (e.g. revising gender, number and definiteness
agreement with determiners and adjectives). However, some of the variants generated may be
semantically nonsensical, especially those involving random replacements. Table 1 displays an
example with the original sentence in the first row and the relevant sentences for each of the
minimal pairs in the other rows.
Finally, each NC was also annotated with frequency, Pointwise Mutual Information (Church
and Hanks 1989) and Positive Pointwise Mutual Information values, calculated from the ukWaC
(2.25B tokens, Baroni et al. (2009)) and brWaC corpora (2.7B tokens, Wagner Filho et al. (2018)),
which can serve as approximations for their familiarity and conventionality.
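For reference, PMI and PPMI can be computed from raw corpus counts as in the sketch below; the count arguments are placeholders standing in for the ukWaC/brWaC statistics.

```python
import math

def pmi(count_pair, count_w1, count_w2, total_tokens):
    """PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ), from raw counts."""
    p_pair = count_pair / total_tokens
    p_w1 = count_w1 / total_tokens
    p_w2 = count_w2 / total_tokens
    return math.log2(p_pair / (p_w1 * p_w2))

def ppmi(count_pair, count_w1, count_w2, total_tokens):
    """Positive PMI: negative associations are clipped to zero."""
    return max(0.0, pmi(count_pair, count_w1, count_w2, total_tokens))
```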
3.2 Models

We evaluate representative static and contextualised models. For the former, we compare GloVe
and Word2Vec, using the official models for English, and the 300-dimension vectors for
Portuguese (Hartmann et al. 2017).
For the latter, we evaluate a large set of models, including the Bi-LSTM-based ELMo (Peters
et al. 2018), and several Transformer-based language models: BERT (Devlin et al. 2019) and
15 NCIMP also contains a second longer pattern of uninformative neutral sentences (10 words in English and 9 in
Portuguese) following the patterns This is what a/an <NC> is supposed to be and the Portuguese equivalent Isto é o
que um/uma <NC> deveria ser, to measure the potential impact of the length of the neutral context and of the
position of the NC in the sentence. As the two types of neutral sentences elicit similar results, in the paper we only
present the results for the short neutral sentences.
some of its variants, such as multilingual BERT (mBERT16) (Pires, Schlinger, and Garrette
2019), multilingual DistilBERT (mDistilB17) (Sanh et al. 2019) and multilingual Sentence-BERT
(mSBERT18) (Reimers and Gurevych 2019b). The recent flagship model LLaMA2 (Touvron et al.
2023) is also included in our experiments. OpenAI text embeddings (Neelakantan et al. 2022) are
included in the evaluations at sentence level only, as they can be accessed solely via the API19
rather than by direct inspection of the whole model, which would be required for the analyses at
NC level; these analyses are therefore not conducted for the OpenAI text embeddings. For ELMo, we use the small
model provided by Peters et al. (2018), and for Portuguese we adopt the weights released by
Quinta de Castro, Félix Felipe da Silva, and da Silva Soares (2018). For LLaMA2 and OpenAI’s
embeddings, we use the 13B version and text-embedding-ada-002 version, respectively. For all
other contextualised models, we use the pre-trained weights publicly available through Flair20
(Akbik et al. 2019) and HuggingFace21 (Wolf et al. 2020). For BERT-based models (and for
DistilB in English), we report the results obtained both by the multilingual uncased (ML) and
by monolingual models for English (large, uncased) and Portuguese (large, cased), all available
through HuggingFace.
3.2.1 Sentence and NC Embeddings. Embeddings for the whole sentence as well as for the NCs
are generated by averaging the (sub)word embeddings22 of the relevant tokens involved, according
to the model:
• for static models, the word embeddings are derived directly from the vocabulary, with
out-of-vocabulary words being ignored;
• for ELMo the output word embeddings are averaged, and the concatenation of its three
layers is adopted;
• for Transformer-based models, the word embeddings are generated by averaging the
representations of the sub-tokens, and we report results using the last four layers.23 A
minimal sketch of this pooling is given after this list.
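The sketch below illustrates this mean-pooling strategy for Transformer-based models using the HuggingFace transformers API; the checkpoint name and the way the NC span is located are illustrative assumptions rather than the exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint choice; any BERT-style model would work similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

def embed(sentence, nc=None):
    """Mean-pooled embedding of a sentence (or of the NC span inside it),
    averaging the sub-token vectors of the last four hidden layers."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states           # tuple of (1, T, D) tensors
    layers = torch.stack(hidden[-4:]).mean(dim=0)[0]  # (T, D), last four layers
    if nc is None:
        return layers.mean(dim=0)                     # whole-sentence embedding
    # Keep only sub-tokens whose character span falls inside the NC occurrence
    # (assumes the NC appears verbatim in the sentence).
    start = sentence.index(nc)
    end = start + len(nc)
    offsets = tokenizer(sentence, return_offsets_mapping=True)["offset_mapping"]
    idx = [i for i, (s, e) in enumerate(offsets) if s >= start and e <= end and e > s]
    return layers[idx].mean(dim=0)                    # NC-level embedding
```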
In general we adopt standard widely used configurations to determine what the landscape of
results is before any task optimisation, even if alternative tokenisation approaches (Gow-Smith
et al. 2022), dedicated representations for MWEs as single-tokens (Cordeiro et al. 2019; Phelps
et al. 2022) and different combinations of layers and weighting schemes (Reimers and Gurevych
2019a; Vulić, Korhonen, and Glavaš 2020; Rogers, Kovaleva, and Rumshisky 2020) may generate
better results in downstream tasks. Additional configurations were also extensively analysed and
as they produced qualitatively similar results, they are not included in the paper.
3.3 Idiomatic Probes and Measures

The general premises of this work, shared by many similar investigations, are that:
16 https://huggingface.co/google-bert/bert-base-multilingual-cased
17 https://huggingface.co/distilbert/distilbert-base-multilingual-cased
18 https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased
19 https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
20 https://github.com/flairNLP/flair
21 https://github.com/huggingface/transformers
22 In our preliminary experiments, we tested various pooling strategies, including max pooling, min pooling, the CLS
token from BERT, concatenation, and mean pooling. The performance was similar across these methods, but to
maintain simplicity and avoid complications from variable vector lengths, we chose mean pooling for the reported
experiments.
23 Extensive evaluation of the individual layers and their combination were performed, but as the results follow the trend
of those reported here, they are not included in the paper.
3.3.1 The probing strategies. To evaluate how word representation models deal with idiomaticity,
we propose a probing strategy where a target item in a sentence, in this case an NC, is
systematically replaced by a set of different paraphrases or probes (P), forming the minimal
pairs discussed in section 3.1. We then use similarity measures to compare the representation for
the sentence before and after replacing NC by P. Given the focus on idiomaticity we select a set of
probes specifically for the expected changes in meaning they would induce in a sentence, and we
refer to these potential changes in meaning as Linguistic Predictions (LPs). If the representations
generated by a model reflect these predictions, passing the probing tests, then we consider that
particular model as capturing to some extent the idiomatic meaning in NCs. The idiomatic probes
are defined as follows, where Comp is the average human annotation compositionality score:
• $P_{Syn}$ - The true synonym. The replacement is a single word or a two-word compositional
noun compound that closely represents the meaning of the target NC, forming the minimal
pair $NC_{Syn}$. Linguistic Prediction: after the replacement, the resulting sentence should be
a near perfect paraphrase of the original sentence. Therefore, high similarities are expected
for all minimal pairs independently of the degree of compositionality of the target NC, from
the more idiomatic grey matter (and brain) to the more literal economic aid (and financial
assistance), with no correlation expected with Comp.
• $P_{Comp}$ - The partial expression. The replacement is one of the component words of the
target compound, and in particular we consider the one that preserves most of the meaning,
forming the minimal pair $NC_{Comp}$. Linguistic Prediction: the resulting sentence may
preserve some of the original meaning for more compositional cases, but not for idiomatic
cases. Therefore, high similarities are only expected between minimal pairs involving
compositional and partly compositional cases (e.g. economic aid and aid, crocodile tears
and tears, but not for wet blanket and blanket or wet), with some correlation expected with
Comp.
• $P_{WordsSyn}$ - The literal synonyms of the individual NC components. The replacement
is a two-word expression formed from frequent out-of-context synonyms for each of the
component words of an NC when considered independently, forming the minimal pair
$NC_{WordsSyn}$. Linguistic Prediction: after replacement, the resulting sentence may not
preserve the meaning of the original sentence, especially for more idiomatic cases. Therefore,
higher similarities are only expected for minimal pairs involving more compositional NCs
(e.g. wedding day and marriage date but not eager beaver and restless rodent), with a high
correlation expected with Comp.

24 Other compositional operations and measures of distance were also used during these analyses, but with qualitatively
similar results, and have been omitted from the paper.
• $P_{Rand}$ - The random replacement controlled by frequency. The replacement is a two-word
expression where the words are chosen to match the frequencies of the components of the
target NC, forming the $NC_{Rand}$ minimal pair. Linguistic Prediction: after replacement, the
resulting sentence should not preserve the meaning of the original sentence, independently
of the level of idiomaticity of the original NC (e.g. for police car and supermarket city),
with no correlation expected with Comp. These linguistic predictions are summarised in
the sketch after this list.
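To make the protocol concrete, the linguistic predictions can be summarised in a small lookup table; this is only an informal sketch of the expectations stated above, not part of the released dataset.

```python
# Expected behaviour of each probe, following the Linguistic Predictions above.
LINGUISTIC_PREDICTIONS = {
    "P_Syn":      {"expected_similarity": "high for all NCs",
                   "expected_correlation_with_Comp": "none"},
    "P_Comp":     {"expected_similarity": "high only for (partly) compositional NCs",
                   "expected_correlation_with_Comp": "moderate"},
    "P_WordsSyn": {"expected_similarity": "high only for more compositional NCs",
                   "expected_correlation_with_Comp": "high"},
    "P_Rand":     {"expected_similarity": "low for all NCs",
                   "expected_correlation_with_Comp": "none"},
}
```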
For a more in-depth analysis of expected changes in meaning, we follow Garcia et al. (2021b)
comparing representations both at a macro sentence level and also at a micro NC level, analysing
the representations of NC (and its variants P) extracted from the context of the sentence. Although
any differences in meaning should be reflected both at sentence and at NC representation levels
(only magnified in the latter), this comparison aims to highlight the impact of the level of
granularity used when analysing idiomaticity.25
3.4 Metrics
3.4.1 The Human Compositionality score (Comp). Assuming a list of N NCs, chosen to provide
balanced test scenarios of different levels of idiomaticity, we denote by $NC_\alpha$, with $\alpha = 1, ..., N$, the
different NCs to be evaluated. The meaning of these NCs is exemplified by a set of $N \times M$
sentences $Sent_{\alpha\beta}$, with $\alpha = 1, ..., N$ and $\beta = 1, ..., M$ the sentence index. The dataset contains
$M=3$ naturalistic sentences to exemplify the use of each NC (see section 3.1), with each sentence
annotated by human judges according to the compositionality of the target NC in the sentence.
The resulting scores are denoted $Comp_{\alpha\beta j}$, with $\alpha = 1, ..., N$, $\beta = 1, ..., M$, and $j = 1, ..., A_{\alpha\beta}$,
where $A_{\alpha\beta}$ is the number of annotators for sentence $Sent_{\alpha\beta}$. $Comp_{\alpha\beta j}$ are integer values derived
from a Likert scale and range from 0 (totally idiomatic) to 5 (totally compositional). We define the
compositionality score for a specific $NC_\alpha$ as the average of the annotations for sentences $Sent_{\alpha\beta}$,

$$Comp(NC_\alpha) = \Big\langle \big\langle Comp_{\alpha\beta j} \big\rangle_{Annot} \Big\rangle_{Sent} \qquad (1)$$

where $\langle \cdots \rangle_{Sent}$ are averages over sentences and $\langle \cdots \rangle_{Annot}$ averages over annotations. These average
values are the gold standard in this work.
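Equation (1) is simply a nested average, first over annotators and then over the sentences of an NC; a minimal sketch, assuming the annotations are stored as one list of integer ratings per sentence:

```python
def comp_score(annotations_per_sentence):
    """Comp(NC) = mean over sentences of the mean over annotators,
    e.g. annotations_per_sentence = [[4, 5, 3], [5, 4], [4, 4, 5]]."""
    sentence_means = [sum(a) / len(a) for a in annotations_per_sentence]
    return sum(sentence_means) / len(sentence_means)
```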
3.4.2 The Similarity score (Sim). Probing the meaning of a compound $NC_\alpha$ in a sentence $Sent_{\alpha\beta}$
requires the generation of a new set of modified sentences $Sent^{P_i}_{\beta\gamma}$ where $NC_\alpha$ is replaced by a
probe $P_i$ (discussed in section 3.3.1). We measure the effect of the probe substitution directly from
the similarity between the representation of the original expression, X, and the representation of
the new expression after substitution, Y, adopting, throughout this paper, cosine similarity as a
measure of the similarity of meaning between two vector embeddings:

$$cossim(X, Y) = \frac{\epsilon_X \cdot \epsilon_Y}{||\epsilon_X||\,||\epsilon_Y||} \qquad (2)$$
25 Our prior work reveals that only looking at similarities at sentence level when comparing the representations of the
original and the resulting sentences may not accurately reflect their differences (Garcia et al. 2021b).
where $\epsilon_X$ and $\epsilon_Y$ are vector embeddings with D components, $\epsilon_X \cdot \epsilon_Y$ their inner product, and
$||\epsilon_X||$, $||\epsilon_Y||$ their L2 norms. Therefore, the average similarity between the original expression
and the probe-modified expression for a given NC can be defined as

$$Sim(P_i, NC) = \big\langle cossim\big(expr(NC), expr(P_i)\big) \big\rangle_{P_i} \qquad (3)$$

where expr(NC) is the target NC expression, expr($P_i$) is the expression where the NC is replaced
by a probe of the type $P_i$, and $\langle \cdots \rangle_{P_i}$ means the average over possible substitutions of this type.
We use more than one substitution only for random probes ($P_{Rand}$); for all other probes a single
substitution is reported.
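A minimal numpy sketch of the cosine similarity of Eq. (2) and of the averaged similarity described above; the embeddings are assumed to be obtained with the pooling procedure of Section 3.2.1.

```python
import numpy as np

def cossim(x, y):
    """Cosine similarity between two embedding vectors (Eq. 2)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def sim(nc_embedding, probe_embeddings):
    """Average similarity between the original expression and the
    probe-modified expressions; more than one substitution is used
    only for random probes."""
    return float(np.mean([cossim(nc_embedding, p) for p in probe_embeddings]))
```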
3.4.3 The Affinity score (Aff). Cosine similarity measures are not sensitive enough to capture
subtle meaning differences, especially in anisotropic representation spaces (Ethayarajh and
Jurafsky 2021). Additionally, there may be a 'horizon of interest,' beyond which word connections
lose meaningful inference (Karlgren and Kanerva 2021), which may be a challenge for representing
idiomatic expressions, as the necessary context may lie within this critical boundary. Investigating
measures that account for anisotropic spaces and for a horizon of interest is an interesting avenue
for future research for improving idiomaticity detection. In this paper, we propose a comparative
measure that we refer to as Affinity (Assessment of Feature Familiarity and Idiomatic Nuance by
Interpreting Target Yielding), which identifies which of two representations is the closer to a
given target representation.
Given a target representation $Target$ and two possible probes $P_i$ and $P_j$, the affinity is defined
as:
Affinities closer to 1 or larger indicate a greater similarity between the target and the first probe
$P_i$, values closer to -1 or lower indicate the opposite situation, where the target is more similar to
the second probe $P_j$, and values near zero indicate no preference. Given the focus of this paper on
detecting idiomaticity in representations, we measure the affinities involving the minimal pairs
defined in section 3.3.1, analysing if, as expected, the target NCs have higher similarities with
probes whose substitutions maintain the original meaning (as $P_i$) than with probes that involve
potential changes in meaning (as $P_j$). In particular:
• Affinity $A_{Syn|WordsSyn}$ = Aff($P_{Syn}$, $P_{WordsSyn}$|NC) measures if the target NCs have
greater similarities with their gold synonyms than with synonyms of the individual
components (e.g. eager beaver with hardworking person than with restless rodent).
• Affinity $A_{Syn|Rand}$ = Aff($P_{Syn}$, $P_{Rand}$|NC) compares if the target NCs display greater
similarities to their gold synonyms than to random substitutions.
Our Affinity measure extends traditional forced-choice evaluations (Warstadt et al. 2020)
by quantifying the degree of similarity preference between two options. Unlike binary choices,
Affinity provides a continuous measure of relative similarity, offering a more detailed assessment
of how well models capture idiomatic meanings. This nuanced analysis reveals subtle differences
in model performance, providing deeper insights into the representation of idiomatic expressions.
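Affinity itself is given by Eq. (4); the sketch below only illustrates the forced-choice comparison that Affinity extends, together with a plain similarity difference as an informal continuous proxy, neither of which should be read as the paper's actual definition.

```python
def forced_choice(sim_to_pi, sim_to_pj):
    """Traditional binary forced choice: which probe is closer to the target?"""
    return "Pi" if sim_to_pi > sim_to_pj else "Pj"

def preference_strength(sim_to_pi, sim_to_pj):
    """Informal continuous proxy (not the Affinity of Eq. 4): positive values
    favour Pi, negative values favour Pj, values near zero show no preference."""
    return sim_to_pi - sim_to_pj
```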
3.4.4 The Scaled Similarity score ($Sim_R$). Even though Affinity is an advance over the simple
similarity measure, additional measures may still need to be adopted for models in which the average
similarity between two random embeddings is larger than zero, as affinities will tend to have small
values even for very dissimilar probes (see discussion). To address this issue, we propose a scaled
version of the similarity:

$$Sim_R(P_i|Target) = \left\langle \frac{Sim(P_i, Target) - Sim(P_{Rand}, Target)}{1 - Sim(P_{Rand}, Target)} \right\rangle_{Sent} \qquad (5)$$

where $\langle \cdots \rangle_{Sent}$ denotes the average over the M sentences that illustrate the meaning of a
particular NC and $P_{Rand}$ is a random substitution. The scaled similarity is defined such that if
replacing the target with a probe $P_i$ results in cosine similarities close to one ($Sim(P_i, Target) \approx 1$),
the scaled similarity is also close to one, $Sim_R \approx 1$. Conversely, if the replacement is similar to
a random replacement ($Sim(P_i, Target) \approx Sim(P_{Rand}, Target)$), then $Sim_R \approx 0$. This approach
is equivalent to a max-min normalisation26 in the anisotropic space of a model.
In particular, given the focus on idiomaticity, we focus as before on two similarities:
• $Sim_{R|Syn}$ = $Sim_R(P_{Syn}|NC)$, where the NCs are replaced by gold synonyms and no
changes in meaning are expected, therefore $Sim_{R|Syn}$ should be close to 1.
• $Sim_{R|WordsSyn}$ = $Sim_R(P_{WordsSyn}|NC)$, where the NCs are replaced by synonyms of
the individual components and greater changes in meaning, and therefore small values
($\sim 0$) of $Sim_R$, are expected for more idiomatic cases.
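A minimal sketch of the Scaled Similarity of Eq. (5), assuming the per-sentence similarities to the probe and to the random replacements have already been computed with the Sim function above:

```python
import numpy as np

def scaled_similarity(sim_probe_per_sentence, sim_rand_per_sentence):
    """Sim_R(Pi|Target): rescale each per-sentence similarity by the random
    lowerbound and average over the sentences illustrating the NC (Eq. 5)."""
    scaled = [(s_p - s_r) / (1.0 - s_r)
              for s_p, s_r in zip(sim_probe_per_sentence, sim_rand_per_sentence)]
    return float(np.mean(scaled))
```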
3.4.5 The Correlation measure (ρ). Finally, to assess the impact of idiomaticity for the probe
substitutions, we use the Spearman correlation between the different measurements and the gold
standard human annotations of compositionality (Comp) given by Eq. 1.
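The reported correlations can be obtained with a standard implementation of Spearman's ρ, for instance from scipy; a minimal sketch, assuming one similarity value and one Comp score per NC:

```python
from scipy.stats import spearmanr

def correlation_with_comp(similarities, comp_scores, alpha=0.05):
    """Spearman correlation between probe similarities and the human
    compositionality scores; returns None when not significant (p > alpha),
    mirroring how non-significant cells are omitted from the tables."""
    rho, p_value = spearmanr(similarities, comp_scores)
    return rho if p_value <= alpha else None
```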
4. Results

4.1 Are the representations of the NCs and their synonyms similar?
A first indication of the successful modeling of idiomaticity is if a model assigns similar
representations for the target NCs and for their synonyms, regardless of their level of compositionality.
We measure this using the minimal pairs of probe $P_{Syn}$ and compare it with less appropriate
substitutions represented by the other probes $P_j$. The distribution of similarities obtained for
each of the probes is shown in Figure 1, along with the correlations of these similarities with the
human compositionality scores for the NCs at sentence ($\rho_{Sent}$) and NC ($\rho_{NC}$) levels, in Tables 2
and 3. Considered in isolation, the high similarity scores for $P_{Syn}$ at sentence level (close to 1
for naturalistic sentences, and mostly above 0.75 for neutral sentences, Figure 1, $P_{Syn}$) seem to
suggest that these models are able to capture idiomaticity. However, when compared against the
scores for the minimal pairs of the other probes, a different story emerges.
When a target NC is replaced with one of its component words (Figure 1, $P_{Comp}$) or with
synonyms of its component words (Figure 1, $P_{WordsSyn}$), lower similarities should be observed
between the minimal pairs since, although these substitutions could preserve some of the meaning
of the more compositional cases, they would not do so for the more idiomatic cases. Moreover,
random substitutions should lead to even lower similarities
for all NCs (Figure 1, $P_{Rand}$), since they could result in nonsensical sentences. This expected
staggered pattern of similarities, highest for $P_{Syn}$, moderate for $P_{Comp}$ and $P_{WordsSyn}$, and lowest
for $P_{Rand}$, illustrated in Figure 1 (Ideal Values), does not seem to be reflected in a visible reduction
of the similarities at sentence level in Figure 1. In fact, even random substitutions seem to result
in high sentence similarities, even if they are not as high as those for the other substitutions.

26 $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.

[Figure 1: four panels of similarity distributions, one per probe ($P_{Syn}$, $P_{Comp}$, $P_{WordsSyn}$, $P_{Rand}$), plus an Ideal Values panel.]
Figure 1: Distribution of cosine similarities between the minimal pairs at sentence level, with the
original NC and the probe-modified substitution for English (EN, in blue) and Portuguese (PT, in
orange), with naturalistic (Nat) sentences in darker shade and neutral (Neut) in lighter. The lower
panel (Ideal Values) is an illustration of similarity values ideally expected for the different probes.
The means and standard deviations are in Table 10 in the Appendix.
Another important point relates to the correlation of these similarities for the different
NCs with human judgments of compositionality. It is expected that there would be almost no
correlation for the similarities derived from $P_{Syn}$ and $P_{Rand}$, and a moderate correlation for
$P_{Comp}$ and $P_{WordsSyn}$, as they may be more acceptable for compositional NCs than for idiomatic
ones. However, this expected pattern is not observed in the results presented in Table 2. For most
models, $\rho_{Sent}(P_{Syn})$ shows a moderate correlation, while $\rho_{Sent}(P_{Comp})$ and $\rho_{Sent}(P_{WordsSyn})$
are either weak or non-significant.
$\rho_{Sent}$ | Word2Vec | GloVe | ELMo | SBERT ML | BERT | BERT ML | DistilB ML | LLaMA2 | OpenAI
$P_{Syn}$
EN-Nat | 0.30 | 0.31 | 0.43 | 0.47 | 0.39 | 0.51 | 0.38 | 0.15 | 0.41
EN-Neut | 0.60 | 0.58 | 0.55 | 0.60 | 0.51 | 0.53 | 0.56 | 0.37 | 0.54
PT-Nat | 0.18 | 0.13 | 0.33 | 0.31 | 0.32 | 0.29 | 0.20 | 0.27 | 0.46
PT-Neut | 0.31 | 0.22 | 0.37 | 0.46 | 0.35 | 0.30 | 0.31 | 0.31 | 0.51
$P_{Comp}$
EN-Nat | - | - | - | - | 0.17 | - | - | - | 0.37
EN-Neut | 0.19 | 0.29 | - | - | - | - | -0.12 | - | 0.51
PT-Nat | - | -0.12 | 0.12 | - | 0.16 | - | -0.15 | - | 0.21
PT-Neut | 0.13 | - | 0.17 | - | - | -0.14 | - | - | 0.27
$P_{WordsSyn}$
EN-Nat | - | - | - | - | - | - | - | - | 0.21
EN-Neut | 0.19 | - | - | -0.13 | -0.15 | - | - | 0.20 | 0.13
PT-Nat | -0.12 | -0.19 | - | - | - | - | -0.14 | - | 0.11
PT-Neut | - | -0.13 | - | - | - | - | - | - | 0.17
$P_{Rand}$
EN-Nat | - | -0.11 | -0.13 | -0.16 | -0.27 | -0.11 | -0.18 | -0.11 | -
EN-Neut | 0.11 | - | -0.31 | -0.36 | -0.29 | - | -0.13 | - | -
PT-Nat | -0.17 | -0.20 | -0.13 | -0.11 | -0.14 | -0.12 | - | -0.18 | -
PT-Neut | 0.13 | -0.17 | -0.14 | -0.11 | -0.22 | -0.11 | - | - | -

Table 2: Spearman ρ correlation between cosine similarities and human compositionality
judgments (Comp) at sentence level. Only significant results (p ≤ 0.05) are displayed, for $P_{Syn}$,
$P_{Comp}$, $P_{WordsSyn}$ and $P_{Rand}$, for English (EN) and Portuguese (PT), naturalistic (Nat) and
neutral (Neut) sentences.
Since in these minimal pairs only the target NCs and their substitutions change, the high
similarities found may be an effect of the lexical overlap between the sentences of a minimal
pair. Indeed, comparing the output of the models in relation to sentence lengths for naturalistic
sentences, there is a significant moderate to strong positive correlation between the lexical overlap
and the cosine similarity of a pair, for both English and Portuguese (Table 4), where the greater
the overlap between the sentences, the higher their similarity. This can also explain the higher
similarities observed for naturalistic than for neutral sentences, since the former are longer than
the latter with a higher lexical overlap proportional to the length of the sentence: average sentence
length for naturalistic sentences is 23.4 words for English (lexical overlap > 91%) and 13.0
words for Portuguese (overlap > 84%), while for the neutral sentences it is five words (overlap >
60%) for both languages.27 It could be argued that the influence of lexical overlap is expected,
given that a compositional representation is used for sentences, where the embeddings for each
token are added. However, while this holds true for static models, it may not necessarily apply
to contextualised models. In contextualised models, it is expected that each token/word would
interact with the others via attention heads, and, if the model accurately captures semantics, all
tokens/words would adjust to the context of the sentence as a whole. Ideally, even with the simple
compositional representation of the sentence, we would anticipate that a correct sentence would
exhibit low similarity with the mostly nonsensical sentences produced by random probes. Even
though the similarities coming from contextualised models seem to present lower correlations with
sentence length, lexical overlap still appears to dominate across all types of models.
27 We also compared longer neutral contexts with 10 words for English (> 80%), and 9 words for Portuguese (> 77%),
and found similar results.
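A sketch of the lexical overlap check is given below; the overlap definition (shared tokens over the longer sentence) and variable names are assumptions chosen for illustration rather than the exact measure used for Table 4.

# Sketch: proportion of lexical overlap between the two sentences of a minimal
# pair, correlated (Spearman) with the cosine similarity of the pair.
from scipy.stats import spearmanr

def lexical_overlap(sent_a: str, sent_b: str) -> float:
    a, b = sent_a.lower().split(), sent_b.lower().split()
    shared = sum(1 for tok in a if tok in b)
    return shared / max(len(a), len(b))   # assumed normalisation

def overlap_vs_similarity(sentence_pairs, cosine_sims):
    """sentence_pairs: list of (original, substituted) sentences;
    cosine_sims: the corresponding sentence-level cosine similarities."""
    overlaps = [lexical_overlap(a, b) for a, b in sentence_pairs]
    return spearmanr(overlaps, cosine_sims)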
Figure 2: Distribution of cosine similarities between the minimal pairs at NC level (panels: PSyn,
PComp, PWordsSyn, PRand and Ideal Values), with the original NC and the probe-modified substitution
for English (blue) and Portuguese (orange), with naturalistic sentences in darker shade and neutral
in lighter. The lower panel (Ideal Values) is an illustration of similarity values ideally expected for
the different probes. The means and standard deviations are in Table 11 in the Appendix.

To minimise the effect of the lexical overlap in the similarities, we now focus our analyses
only on the similarities among the tokens representing the NCs and their substitutions in the
context of the target sentences. In this case, lower similarities were obtained for all probes and
all models compared to those at sentence level (Figure 2 vs. Figure 1). This is even the case for
the similarities between the NCs and their gold standard synonyms (PSyn), which are centred around
the same values as those between the NCs and the synonyms of their individual components
(PWordsSyn); the similarities for the random replacements (PRand) follow the same trend, and all
of them are lower than the similarities between the NCs and only one NC component (PComp).
In fact, similarities for the gold standard synonyms are lower than for many of the other probes,
regardless of the extent to which the original NC meaning is changed, as probes PComp to PRand
involve some change in meaning while PSyn does not.
Finally, there is more variation displayed among the models, as there are lower similarities for
static than for most contextualised models. Overall, the resulting similarities at NC level do not
follow the expected patterns for representing idiomaticity, illustrated in Figure 2 (Ideal Values).
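As a concrete illustration of the NC-level comparison, the sketch below extracts an NC representation by mean-pooling the subword vectors of the compound inside its sentence and compares the original and substituted members of a minimal pair; the model name, layer and pooling choices are assumptions rather than the paper's exact configuration.

# Minimal sketch (not the paper's exact pipeline): extract an NC-level
# representation by mean-pooling the subword vectors of the compound inside
# its sentence, then compare the original and the substituted minimal pair.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def nc_vector(sentence: str, nc: str) -> torch.Tensor:
    """Mean-pool the last-layer subword embeddings of the NC span."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    start = sentence.lower().index(nc.lower())
    end = start + len(nc)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    # keep subword tokens whose character span overlaps the NC span
    mask = [(s < end and e > start) and (s, e) != (0, 0) for s, e in offsets.tolist()]
    return hidden[torch.tensor(mask)].mean(dim=0)

def nc_similarity(sent_orig: str, nc: str, sent_sub: str, substitute: str) -> float:
    v1, v2 = nc_vector(sent_orig, nc), nc_vector(sent_sub, substitute)
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()

# Example minimal pair for the PSyn probe (illustrative sentences only)
print(nc_similarity("He is an eager beaver at work.", "eager beaver",
                    "He is a hard worker at work.", "hard worker"))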
The same holds true for their correlations with human judgements. In line with what occurs
at the sentence level, the similarities at NC level exhibit correlations that contradict linguistic
expectations. In particular, it is expected that true synonymous substitutions work well across
the idiomatic-compositionality spectrum. Therefore, no correlation should be expected for PSyn,
while for PComp and PWordsSyn a moderate correlation is expected, and no correlation for PRand.
However, Table 3 indicates that for most models ρNC(PSyn) > ρNC(PComp or PWordsSyn),
with the latter being either weak or not significant.
In the next section we analyse if, at least at a detailed level, the similarities between NCs and
their synonyms are mostly higher than those of other alternatives.
4.2 Are the representations of the NCs and their synonyms relatively more similar when
compared to other alternatives?
Figure 3: Affinity at the NC level for English (blue) and Portuguese (orange), with naturalistic
sentences in darker shade and neutral in lighter. The lower panel (Ideal Values) is an illustration
of values ideally expected for the different affinities. The means and standard deviations are in
Table 12 in the Appendix.

Comparing the Affinities across compositionality classes (Figure 4), compositional NCs tend to
display higher Affinities than partly compositional NCs, and the latter show higher Affinities
than idiomatic NCs. This is
confirmed by the correlation analysis in Table 5, with most models displaying significant weak
to moderate correlations between Affinities and human compositionality judgements, for all
Affinity types, including neutral sentences and Portuguese data.28 This contradicts what was
generally expected: Affinity ASyn|WordsSyn values should exhibit a negative correlation with
compositionality, while Affinity ASyn|Rand should show no correlation at all.
These results suggest that representations of idiomatic NCs may not be accurately incorporat-
ing their meanings, since NCs are not closer to their synonyms than to other alternatives, even if
they are random. Moreover, the more idiomatic NCs seem to be more similar to synonyms of their
individual components, which suggests that the surface clues about their individual components
may be playing a greater role in driving these similarities, even in contextualised models. This
result remains valid even after removing compounds from the dataset that have lexical overlaps
with the NCSyn produced by the annotators (see Table 14 in the Appendix).
28 We omitted the equivalent of Figure 4 for neutral sentences and Portuguese data due to their visual
similarity to the English naturalistic version.

Figure 4: Affinity by idiomaticity Class at NC level for English (EN) and Portuguese (PT) natural-
istic sentences. Idiomatic (I) in green, partly compositional (PC) in yellow and Compositional
NCs (C) in blue.
If random substitutions that should result in Affinities around 1 (ASyn|Rand in Figure 3, Ideal
Values) result instead in values mostly below 0.5, the latter may represent the de facto upper bound
for Affinity for these models. In this case, a rescaling factor may need to be adopted that could
magnify meaningful similarity values. To implement this, we propose the Scaled Similarity (Eq. 5),
which takes into account the threshold defined by random replacements when calculating the
cosine similarities between the target representation and a given probe. In this section we explore
the behavior of SimR|Syn and SimR|WordsSyn defined in section 3.4.
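For concreteness, a hedged sketch of the two comparative measures is shown below. The exact definitions are those of section 3.4 and Eq. 5; here Affinity is assumed to be a difference of cosines and Scaled Similarity a rescaling by the random-replacement similarity, purely for illustration.

# Illustrative sketch of the comparative measures; the function bodies are
# assumptions for the sake of a concrete example, not the paper's exact equations.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def affinity(target: np.ndarray, probe: np.ndarray, distractor: np.ndarray) -> float:
    """A_{probe|distractor}: how much closer the target is to the probe than
    to the distractor (assumed here to be a difference of cosines)."""
    return cosine(target, probe) - cosine(target, distractor)

def scaled_similarity(target: np.ndarray, probe: np.ndarray, rand: np.ndarray) -> float:
    """Sim_{R|probe}: cosine to the probe rescaled by the similarity floor
    given by a random replacement (assumed rescaling for illustration)."""
    floor = cosine(target, rand)
    return (cosine(target, probe) - floor) / (1.0 - floor)

# toy vectors standing in for an NC, its NCSyn and a random replacement
rng = np.random.default_rng(0)
nc, nc_syn, nc_rand = rng.normal(size=(3, 768))
print(affinity(nc, nc_syn, nc_rand), scaled_similarity(nc, nc_syn, nc_rand))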
The Scaled Similarity values (Figure 5) reveal, even more than the Affinities, the equivalences
displayed by the behavior of these models, with SimR being able to abstract away from the
particularities of the spaces defined by each of these models. Interestingly, comparing different
levels of contextualisation (e.g. static models on the left and contextualised on the right half
of Figure 5), the Scaled Similarities produced by static models like Word2Vec and GloVe are
comparable to those of a contextualised large language model like LLaMA2. These results seem
to hold independently of how informative the context is (naturalistic vs. neutral sentences), with
NC representations from naturalistic sentences displaying no real advantage over those from
neutral sentences. Overall, these results suggest that the informative contexts provided by the
naturalistic sentences may not yet be adequately incorporated even by the larger contextualised
models.
Figure 6: SimR|Syn per Compositionality Class: green for Idiomatic (I), yellow for Partly Composi-
tional (PC) and blue for Compositional (C), in English (EN) and Portuguese (PT), in Naturalistic
and Neutral sentences.
Inspecting the SimR|Syn values according to idiomaticity level (Figure 6), the models display
lower Scaled Similarities for the more idiomatic than for the more compositional NCs, confirming
what was already indicated by the Affinities: the models are less able to capture the idiomatic
meanings and, as a consequence, the expected high similarities with their gold standard synonyms
are not observed. This is further confirmed by analysing the values obtained for the synonyms
of the individual components (Figures 5 and 7), with the distributions of SimR|WordsSyn values
having similar averages but considerably lower variances when compared to SimR|Syn, whereas
the expected result would be the opposite: lower averages and variances for SimR|Syn. In fact, the
average and standard deviation of the ratio SimR|Syn/SimR|WordsSyn (Figure 8) show that the
ratio oscillates around 1, which indicates that as a whole the models respond similarly to PSyn
and PWordsSyn substitutions. In addition, the average values and variances for SimR|WordsSyn
do not depend on the degree of compositionality of the target NC (Figure 7 for SimR|WordsSyn
and Figure 9 for the average and standard deviation of the ratio SimR|Syn/SimR|WordsSyn,
according to idiomaticity level). The whole picture indicates that for all models (contextualised or
not) replacing the NC by literal synonyms of the component words is more effective (produces
higher similarities) than using their gold synonyms. In particular, for idiomatic NCs we observe
that SimR|Syn < SimR|WordsSyn, which indicates that lexical similarity (as opposed to similarity
of meaning) is still a dominant factor in the representations, even for the contextualised models,
and provides additional confirmation for the possibility that the component words of an idiomatic
NC may be represented individually and combined compositionally by these models.

Figure 7: SimR|WordsSyn per Compositionality Class: green for Idiomatic (I), yellow for Partly
Compositional (PC) and blue for Compositional (C), in English (EN) and Portuguese (PT), in
Naturalistic and Neutral sentences.
Table 6 summarises these results in terms of the Spearman correlations between SimR values
and the human judgments for compositionality. It shows that, considering the different models,
SimR|Syn is almost always moderately correlated with the compositionality score: the higher the
compositionality score, the higher the SimR|Syn value, and consequently the more the meaning
is preserved with a PSyn substitution. SimR|WordsSyn, in contrast, rarely displays a significant
correlation with the compositionality score. As discussed above, this is a demonstration that the
idiomatic meaning is not captured by these models, not even by those that are contextualised. As
with Affinities, this discrepancy in the behavior of Scaled Similarities persists even after removing
compounds from the dataset that have lexical overlaps with the NCSyn produced by the annotators
(see Table 15 in the Appendix).

Figure 8: Ratio of Average of Scaled Similarity (SimR|Syn/SimR|WordsSyn). English data are in
blue, Portuguese data in orange, values for naturalistic sentences in darker shade and for neutral
in lighter.
We have evaluated several vector models from different architectures in two languages, ranging
from static to contextual representations as well as monolingual and multilingual models. Although
the results are generally far from being satisfactory, in this section we highlight some differences
and similarities between models and languages.
Across models, the similarities are in general higher for Transformer-based models than for
static representations. In this respect, it is worth noting that the results of ELMo and mSBERT
are more similar to those of Word2Vec and GloVe than to those of the other BERT variants (for
instance in Figure 2). Although further research would be needed to determine the precise factors,
for ELMo this behaviour could be due either to the different vector space constructed by LSTMs or
to the smaller number of hidden layers when compared to the other models (2 vs. 6 and 12 layers),
which may imply lower contextualisation effects across the network (Ethayarajh 2019).

Figure 9: Ratio of Average of Scaled Similarity (SimR|Syn/SimR|WordsSyn), per Compositionality
Class, in four panels (EN-Nat, PT-Nat, EN-Neut, PT-Neut): green for Idiomatic (I), orange for
Partly Compositional (PC) and blue for Compositional (C), in English (EN) and Portuguese (PT),
in Naturalistic (Nat) and Neutral (Neut) sentences.
For the Transformer-based models, there are clear differences between the similarities
produced by the BERT-based models and those of the autoregressive models, which are lower
and with a wider range, especially for neutral sentences. When comparing monolingual and
multilingual models, namely BERT and BERT ML, similar tendencies are found both in
similarities and in correlations with the human judgments. In general, multilingual models seem to
place the vector representations in a more restricted space, implying higher degrees of similarity
and lower ranges of variation. Similar tendencies are found for DistilBERT-ML.
The proposed measures also suggest that the representations of the large autoregressive
models are more similar to those of the static embeddings than to the other Transformer-based
encoder models.
Although the results of the different models across languages follow very similar trends, they
also display two main differences. The first one is that when comparing the minimal pairs of
the naturalistic data, the representations in English seem to be closer and occupy less space than
those in Portuguese, in both monolingual and multilingual models of all types. The second is
that the differences between languages are larger for neutral than for naturalistic sentences,
especially at the sentence level, with similar results at the NC level, except for ELMo and
BERT embeddings in PWordsSyn and PRand (Figure 2). The trends are even more aligned when
considering Affinities and Scaled Similarities for most models in both languages.
Indeed, high correlations were found among all models, reflected by the correlogram in
Figure 10. Correlations are particularly high for the expected congruent variants involving NCSyn,
as reflected by the darker red shades: PSyn, ASyn|WordsSyn and SimR|Syn. They are also higher
for Affinities and Scaled Similarities, indicating that taking into account the relative preferences
and random similarities within each model reveals how comparable they are in their ability to
represent idiomaticity. That is, regardless of any superiority of specific models for other tasks,
and in spite of their seemingly different individual performances in terms of cosine similarities,
this sample of models has not revealed one that is clearly better than the others at representing
idiomaticity. Moreover, the high correlations with the static models also suggest that the
relevant contextual clues for idiomaticity representation are not yet adequately incorporated by
the contextualised models.
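The correlogram itself can be reproduced along these lines, assuming per-NC values of a given measure have been collected for every model (column names and data loading are placeholders):

# Sketch: pairwise Spearman correlations between models for one measure
# (e.g. SimR|Syn per NC), as summarised in the correlogram of Figure 10.
import pandas as pd

def model_correlogram(per_model_values: dict) -> pd.DataFrame:
    """per_model_values: {model_name: per-NC values, aligned across models}."""
    df = pd.DataFrame(per_model_values)
    return df.corr(method="spearman")   # models x models correlation matrix

# usage (placeholder variables): model_correlogram({"Word2Vec": w2v_vals, "BERT": bert_vals})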
In sum, our results indicate that the different models evaluated in general follow the same
tendencies when representing idiomaticity in context, suggesting that they are not yet able to
adequately capture the semantics of the MWEs. More investigation is needed to determine how
to effectively achieve this with these architectures and training regimes, or whether a change in
paradigm is required. We will now discuss some representative cases, to give a flavour of how
these models handle a spectrum of idiomaticity.
For a more concrete qualitative overview of the ability of models to represent different levels of
idiomaticity, we now look at some representative English NCs evenly distributed among the three
levels of compositionality (compositional, partly compositional and idiomatic) in three naturalistic
sentences (Table 7). We start with the probes for 6 English NCs and then look at the highest and
lowest values for the ASyn|WordsSyn Affinity, focusing on the relation between a given NC and its
NCSyn and NCWordSyn variants.
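The minimal-pair probes for sentences like those in Table 7 can be generated by simple substitution, as sketched below; the replacement strings, the component kept for PComp, and the random replacement are illustrative stand-ins rather than the dataset's actual annotations.

# Sketch: build one substituted sentence per probe type from a naturalistic sentence.
def build_probes(sentence: str, nc: str, nc_syn: str, nc_words_syn: str, nc_rand: str) -> dict:
    head = nc.split()[-1]                                # e.g. "beaver" for "eager beaver"
    return {
        "PSyn": sentence.replace(nc, nc_syn),            # gold synonym of the whole NC
        "PComp": sentence.replace(nc, head),             # only one component kept
        "PWordsSyn": sentence.replace(nc, nc_words_syn), # word-by-word synonyms
        "PRand": sentence.replace(nc, nc_rand),          # random replacement
    }

probes = build_probes("Eric was being an eager beaver and left work late.",
                      "eager beaver", "hard worker", "restless rodent", "parking lot")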
Probes. Considering the probing measures in terms of the average scores of all sentences for each
of the 6 NCs (Table 8), we focus on the cosine similarities for the probes and whether they differ
from the expected behavior compatible with capturing idiomatic meaning.
First of all, for PSyn the similarities should be close to 1. Indeed, at sentence level all
similarities for all models are above 0.9, and tend to be higher for compositional NCs (0.98) than
for partly compositional (0.95) and idiomatic NCs (0.90). However, at NC level, they
display considerable variation, and while the similarities are high for all models for compositional
NCs, for idiomatic NCs, in particular, the similarities are the lowest and vary considerably per
model (from 0.27 for SBERT ML and Word2Vec to 0.81 for LLama2 for grey matter). For partly
compositional NCs, although some of the models assign the expected high similarities for some
NCs (0.94 for BERT for Dutch courage), other NCs have lower similarities (0.43 for Word2Vec
for eternal rest).
For PComp, lower similarities are expected for idiomatic NCs, as the idiomatic meaning may
be lost when one of the component words is missing (e.g. grey matter vs. grey or vs. matter).
However, at sentence level they are higher than 0.93 for all models.
Figure 10: Correlograms for all models for all sentences and languages, with only significant
values displayed (p < 0.05). Red indicates positive correlation and blue negative; darker shades
are for higher values, lighter for lower values.
Idiomatic
  eager beaver (NCSyn: hard worker; NCWordSyn: restless rodent)
    S1: Eric was being an eager beaver and left work late.
    S2: Eager beavers willing to learn your job for less pay are almost always waiting in the wings.
    S3: If you are a really eager beaver you can pre-order the DVD now from either of the below retailers.
Partly Compositional
  Dutch courage (NCSyn: alcoholic courage; NCWordSyn: Hollander bravery)
    S1: We had to go down to the pub to get some Dutch courage!
    S2: We suggest you try the following cocktail to work up a bit of Dutch courage to get you through the match!
    S3: After some Dutch courage (a few vodkas) in the nightclub, and a nerve-racking conversation, we kissed!
  eternal rest (NCSyn: death; NCWordSyn: permanent break)
    S1: They have been called home to their eternal rest and we are left behind.
    S2: These tolls announce the death of a nun and call for prayers for her eternal rest.
    S3: The passengers, with early morning porridge complexions, don’t look far from eternal rest.
Compositional
  economic aid (NCSyn: financial assistance; NCWordSyn: budgetary assistance)
    S1: We have already extended to Greece certain types of relief and economic aid but these are inadequate.
    S2: The USSR was soon giving Cuba economic aid, technical support and military ’advisers’ from the USSR.
    S3: A government’s success in reducing population movement should be a key factor in allocating economic aid.
  research lab (NCSyn: research facility; NCWordSyn: investigation workplace)
    S1: The fourth year is spent doing a research project in a ’real’ research lab.
    S2: Being part of a research lab provides at times very exciting fieldwork experiences for individual students.
    S3: Bath operates several undergraduate degree programmes that include a professional placement year in industry or a research lab.
Table 7: NCs with their NCSyn and NCWordSyn replacements and three sentences each, grouped by
compositionality class, used for the qualitative analyses.
At NC level, although these idiomatic NCs have lower similarities, they are still high (from 0.77
for SBERT ML for grey matter to 0.94 for BERT ML for eager beaver). For partly compositional
and compositional NCs they are mostly high for all models, except for LLama2 for Dutch courage (0.67).
Although lower PWordsSyn similarities were also expected for more idiomatic NCs, at sentence level
all idiomatic, partly compositional and compositional NCs display similarities above 0.95, even
though the NCWordSyn in PWordsSyn does not preserve the idiomatic meaning (e.g. grey
matter vs. silvery material). At NC level, even if lower values were found for idiomatic NCs with
static models (Word2Vec and GloVe), high similarities were still found (e.g. 0.91 for BERT for
grey matter).
Finally, for PRand, there should be low similarities between all NCs and their randomly generated
substitutions. However, most of the similarities are still high, regardless of the level of idiomaticity
(e.g. 0.92 for BERT for grey matter and for BERT ML for Dutch courage).
Overall, the expected high similarities for PSyn are not displayed by these models at NC
level, and for the other probes the perturbations to the idiomatic meaning are not reflected by
lower similarities.
Affinities. For the Affinity measures, considering the examples with the highest and lowest values
for ASyn|WordsSyn as a proxy for how a particular model represents an NC compared to its
synonym and to a word-by-word replacement (NCSyn and NCWordSyn), we focus on the results for
BERT in the naturalistic sentences in English. As discussed in section 4.1, we expect higher
ASyn|WordsSyn values for idiomatic NCs, since the model should display a stronger preference for
a semantically related synonym than for a potentially unrelated substitution, representing the former
as closely as possible to the NC in the vector space. In contrast, for more compositional cases,
both substitutions may be possible and close to one another (reflected by ASyn|WordsSyn values
around 0). However, the NCs with the highest ASyn|WordsSyn values were mostly compositional
(starting with video game with ASyn|WordsSyn = 0.44, and parking lot with ASyn|WordsSyn =
0.40), with the first partly compositional NC appearing at position 16 (sparkling water with
ASyn|WordsSyn = 0.32). The idiomatic NC with the highest ASyn|WordsSyn value is at position
53 (box office, referring to the popularity of a movie, with ASyn|WordsSyn = 0.24).
At the other end of the ranking, we find mostly idiomatic cases. Among the top 10 examples
with the lowest values we find 7 idiomatic NCs (e.g., agony aunt with ASyn|WordsSyn = −0.29 and
the NC with the lowest value, grey matter, with ASyn|WordsSyn = −0.40), 2 partly compositional NCs,
and only one compositional NC, in position 10 (cooking stove with ASyn|WordsSyn = −0.24).
In sum, these results confirm that the models do not display the expected preference for representing
NCs closer to their synonyms than to distractors, even when these involve idiomatic NCs and/or
random items.
5. Conclusions
This paper presented an evaluation of the ability of widely available word representation models
to capture idiomatic meaning, focusing on noun compounds in two languages, English and
Portuguese. For evaluation we introduced the NCIMP dataset, containing NCs in English and
Portuguese in naturalistic and neutral sentences forming minimal pairs with idiomatic probes
using their component words, synonyms and other variant replacements, resulting in a dataset
containing 29,900 items, extending the datasets by Garcia et al. (2021a) and Garcia et al. (2021b).
These pairs can be used to measure the ability of models to detect the loss of the idiomatic
meaning in the presence of lexical substitutions and different contexts. We also propose two types
of measure for quantifying this ability: Affinities and Scaled Similarities. Affinity is a relative
measure of the proximity of the NC to two alternative probes, determining which of them is the
closest to the NC. Focusing on idiomaticity, we analysed if the models were able to generate a
representation for a given NC that was more similar to a semantically related paraphrase given by
the gold standard synonym than to an alternative possibly semantically unrelated representation.
The proposed measures of scaled similarities, SimR , take sample random similarities into account
for rescaling the space of a given model, to magnify high similarities and distinguish them from
those that are artifacts of the characteristics of the landscape of that model. As a consequence,
SimR also seems to abstract away from the particularities of the semantic space of each model
and provides a more direct way of comparing idiomaticity representation across models. The
results obtained indicate that models are not able to accurately capture idiomaticity, as they fail to
reflect actual similarities between NCs and their gold synonyms, especially for idiomatic cases,
while at the same time not displaying enough awareness of perturbations that lead to changes in
meaning, such as those involving the synonyms of the component words, and even random words.
It seems that the lexical clues provided by the component words are prioritised when representing
an NC over a more holistic combination of the relevant semantic clues needed for representing its
idiomatic meaning. Moreover, although the contexts could provide relevant information about
the idiomatic meanings, they do not seem to be adequately incorporated in these widely adopted
models, regardless of their degree of contextualisation. They also seem to fail to incorporate the
relevant context for idiomaticity, seeing as static and contextualised models show comparable
performances.
In this paper we evaluated the proposed measures focusing on idiomaticity, but they may be
applied to other tasks, and serve as a basis for detecting unwanted biases towards non-target meanings
more generally. Moreover, they may be informative when fine-tuning models, to assess whether the
changes move towards the intended target representations.
In this paper, we inspected the similarities produced by a number of models to determine how
accurately they represent idiomatic expressions. The results indicate that not even large models
like LLama2 seem to display the expected patterns that would confirm idiomatic understanding.
It is important to note that some of the difficulties in extracting information from cosine
similarity measures may be attributed to the presence of rogue dimensions (Timkey and van
Schijndel 2021) rather than anisotropy in semantic space. Measures like Affinity and Scaled
Similarity may not fully address this issue. We conducted a preliminary analysis using the method
of Timkey and van Schijndel (2021) to identify and standardise the top three rogue dimensions per
model/layer. After standardisation, we conducted an analysis focusing on PSyn measures and
found correlations mostly above 0.85, except for BERT-PT-Neut (0.79) and LLama2-EN-Neut
(0.65) (see Table 16 in the Appendix). Further investigation is needed to assess the impact of
standardising these dimensions and different approaches for standardisation, but given the high
correlations with our original results, we will leave this for future work.
Although our proposed assessment protocol and measures are model-independent, they
rely on access to the models and to their representations for subwords, words and multiwords.
Therefore, probing large generative AI chatbots for their understanding of idiomaticity, especially
closed-source models, presents additional challenges potentially requiring adaptation in the
application of the protocol, due to the restricted access to their base models and to the potential
variation in their answers. These warrant further investigations that are outside the scope of this
paper. However, one possible alternative would be to perform probing using question-answering,
following Zeng and Bhat (2022). We illustrate this question-answering approach with recent AI
Chatbots: GPT-3, Gemini Pro (Team et al. 2023)29 and ChatGPT 4.30 For testing these models,
simple questions containing idiomatic expressions are used, after having instructed each model
to provide the shortest answer to each question before asking them. The assumption is that the
questions could only be answered correctly if the model understood the meaning of the idioms in
context. The questions and answers are included in Table 9.
29 https://gemini.google.com/app
30 https://chat.openai.com/
Table 8: Similarity, Affinity and Scaled Similarity values for the NCs selected in Table 7. Values
in parentheses represent the standard deviations among the three sentences. The static models are
independent of context, so for them the variance is omitted, except in the case of eager beaver,
for which one of the sentences contains the compound in its plural form.
Idiom | Question | GPT-3 | Gemini Pro | ChatGPT 4
Never say die | If I have a never say die attitude, would I run the marathon injured or forfeit? | I would never run a marathon injured. | Run. | Run the marathon.
All at sea | If I am all at sea with my math assignment, am I making progress or am I lost? | You are making progress. | Lost. | Lost.
Every trick in the book | If I use every trick in the book to guarantee my grade, am I a magician or a determined student? | A magician. | Cheater. | Determined student.
Kill two birds in one stone | If I wanted to kill two birds in one stone, what kind of a workplace should I work in? | A slaughterhouse. | Multiskilled. | A multi-tasking environment.
Ahead of the game | If I want to be ahead of the game, would I study early or procrastinate? | Procrastinate. | Early. | Study early.
Table 9: Questions used to probe the understanding of idioms and the answers provided by recent
generative models. The leftmost column lists idiomatic expressions, the second column presents
hypothetical questions using these expressions, and the following columns show the responses
from different models, including GPT-3, Gemini Pro, and ChatGPT 4. Answers by GPT-3 are
from (Zeng and Bhat 2022).
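Such question-answering probing can be scripted along the following lines; the client interface, model name and prompt wording are assumptions used for illustration, not the exact setup behind Table 9.

# Sketch using the OpenAI Python client (assumes OPENAI_API_KEY is set);
# the system prompt mirrors the "shortest answer" instruction described above.
from openai import OpenAI

client = OpenAI()

def ask_idiom_question(question: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Provide the shortest possible answer to each question."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

print(ask_idiom_question("If I am all at sea with my math assignment, "
                         "am I making progress or am I lost?"))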
The responses from different systems vary in terms of correctly interpreting the idioms.
The responses from GPT-3 often miss the mark, while the responses from Gemini Pro and
ChatGPT 4 are mixed, with some answers suggesting correct interpretations and some incorrect.
For instance, for “Every trick in the book”, GPT-3 responds with “A magician”, which is a literal
interpretation, while ChatGPT 4 correctly identifies the figurative meaning with “Determined
student”. Gemini Pro’s response to the question is “Cheater”. The idiom “every trick in the book”
generally means to use all available means or strategies to achieve one’s goal, often implying
ingenuity or resourcefulness rather than dishonesty. The response from Gemini Pro could either
be driven by the word “trick” or be seen as a misinterpretation. This table could also be seen as
indicative of the evolution of AI language models over time, with newer models potentially
being trained to better handle idiomatic expressions and context, as seen in the generally more
accurate responses from ChatGPT 4 compared to GPT-3. Although the questions in the table
are indeed useful for exemplifying the comprehension of idiomatic expressions by these models,
they only cover a very limited and focused sample. In this paper, we propose the use of minimal
pairs containing synonyms and other distractors for a more in-depth assessment of idiomatic
understanding. Although their adaptation to a question-answering setting is left for future work,
our results for open models are in line with comparative analyses of the ability of some of these
models for idiomatic and figurative language (Phelps et al. 2024).
Moreover, as idiomatic expressions can be extremely diverse and nuanced, a comprehensive
evaluation of the ability of a model to understand them requires a controlled but extensive set of
idiomatic expressions and their variations. Therefore, we plan to extend the test items to contain
additional types of multiword expressions, including verb-noun combinations and phrasal verbs.
In addition, for a larger crosslingual examination of idiomaticity, and in particular of whether
multilingual models capture language-specific realisations of idiomatic expressions, we plan to
extend the dataset with additional languages. These would also allow the investigation of factors
relevant to specific tasks, such as machine translation, for which the translatability of MWEs from
source into target languages may also affect performance when processing MWEs (Dankers,
Lucas, and Titov 2022).
Possible next steps also include extending the probing strategy with additional measures
that go beyond similarities and correlations. Moreover, for ambiguous NCs in particular, we
intend to add sense-specific probes which could be used to measure and address training biases
towards particular senses. Finally, this paper has focused the evaluation on off-the-shelf pre-trained
models to provide an analysis of their ability to capture idiomaticity, and left the investigation
of fine-tuned models for future work. In particular, although fine-tuning can improve model
performance (Tayyar Madabushi et al. 2022) it is unclear to what extent the models are able to
generalise beyond the specific items seen to other unseen idiomatic expressions, or if each new
expression would have to be individually learned by the model. But these points are left for future
investigation.
Acknowledgments
This work was partly supported by UKRI EPSRC EP/T02450X/1 and NAF/R2/202209 (UK), by CNPq
311497/2021-7 and CAPES/PRINT 88887.583995/2020-00 (Brazil), by
MCIN/AEI/10.13039/501100011033 (grants PID2021-128811OA-I00 and TED2021-130295B-C33, the
latter also funded by “European Union Next Generation EU/PRTR”), by the Galician Government (ERDF
2014-2020: Call ED431G 2019/04, ED431F 2021/01, and ED431F 2021/01), and by a Ramón y Cajal grant
(RYC2019-028473-I), and by COST-Action UniDive.
References
Cordeiro, Silvio, Aline Villavicencio, Marco Idiart, and Carlos Ramisch. 2019. Unsupervised compositionality prediction of nominal compounds. Computational Linguistics, 45(1):1–57.
Dankers, Verna, Christopher Lucas, and Ivan Titov. 2022. Can transformer be too compositional? Analysing idiom processing in neural machine translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3608–3626, Association for Computational Linguistics, Dublin, Ireland.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Association for Computational Linguistics, Minneapolis, Minnesota.
de Dios-Flores, Iria, Juan Garcia Amboage, and Marcos Garcia. 2023. Dependency resolution at the syntax-semantics interface: psycholinguistic and computational insights on control dependencies. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 203–222, Association for Computational Linguistics, Toronto, Canada.
Erk, Katrin. 2012. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 6(10):635–653.
Ethayarajh, Kawin. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Association for Computational Linguistics, Hong Kong, China.
Ethayarajh, Kawin and Dan Jurafsky. 2021. Attention flows are shapley value explanations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 49–54, Association for Computational Linguistics, Online.
Ettinger, Allyson. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.
Fakharian, Samin and Paul Cook. 2021. Contextualized embeddings encode monolingual and cross-lingual knowledge of idiomaticity. In Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021), pages 23–32, Association for Computational Linguistics, Online.
Fazly, Afsaneh, Paul Cook, and Suzanne Stevenson. 2009. Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1):61–103.
Frege, Gottlob. 1956. The thought: A logical inquiry. Mind, 65(259):289–311.
Garcia, Marcos. 2021. Exploring the representation of word meanings in context: A case study on homonymy and synonymy. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3625–3640, Association for Computational Linguistics, Online.
Garcia, Marcos, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, and Aline Villavicencio. 2021a. Assessing the representations of idiomaticity in vector models with a noun compound dataset labeled at type and token levels. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2730–2741, Association for Computational Linguistics, Online.
Garcia, Marcos, Tiago Kramer Vieira, Carolina Scarton, Marco Idiart, and Aline Villavicencio. 2021b. Probing for idiomaticity in vector space models. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3551–3564, Association for Computational Linguistics, Online.
Gow-Smith, Edward, Harish Tayyar Madabushi, Carolina Scarton, and Aline Villavicencio. 2022. Improving tokenisation by alternative treatment of spaces. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11430–11443, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates.
Gulordava, Kristina, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, Association for Computational Linguistics, New Orleans, Louisiana.
Mandera, Paweł, Emmanuel Keuleers, and Marc Brysbaert. 2017. Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92:57–78.
Manning, Christopher D., Kevin Clark, John Hewitt, Urvashi Khandelwal, and Omer Levy. 2020. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48):30046–30054.
Marvin, Rebecca and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Association for Computational Linguistics, Brussels, Belgium.
McCarthy, Diana, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 73–80, Association for Computational Linguistics, Sapporo, Japan.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, Curran Associates Inc., USA.
Miletic, Filip and Sabine Schulte im Walde. 2023. A systematic search for compound semantics in pretrained BERT architectures. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1499–1512, Association for Computational Linguistics, Dubrovnik, Croatia.
Miletić, Filip and Sabine Schulte im Walde. 2024. Semantics of multiword expressions in transformer-based models: A survey. Transactions of the Association for Computational Linguistics, 12:593–612.
Miller, George A. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.
Misra, Kanishka, Julia Rayz, and Allyson Ettinger. 2023. COMPS: Conceptual minimal pair sentences for testing robust property knowledge and its inheritance in pre-trained language models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2928–2949, Association for Computational Linguistics, Dubrovnik, Croatia.
Mitchell, Jeff and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.
Montague, Richard. 1973. The proper treatment of quantification in ordinary English. In Approaches to Natural Language: Proceedings of the 1970 Stanford Workshop on Grammar and Semantics, pages 221–242, Springer.
Nandakumar, Navnita, Timothy Baldwin, and Bahar Salehi. 2019. How well do embedding models capture non-compositionality? A view from multiword expressions. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pages 27–34, Association for Computational Linguistics, Minneapolis, USA.
Nedumpozhimana, Vasudevan and John Kelleher. 2021. Finding BERT's idiomatic key. In Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021), pages 57–62, Association for Computational Linguistics, Online.
Neelakantan, Arvind, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. 2022. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005.
Nunberg, Geoffrey, Ivan A. Sag, and Thomas Wasow. 1994. Idioms. Language, 70:491–538.
Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Association for Computational Linguistics, Doha, Qatar.
Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, Association for Computational Linguistics, New Orleans, Louisiana.
Phelps, Dylan, Xuan-Rui Fan, Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, and Aline Villavicencio. 2022. Sample efficient approaches for idiomaticity detection. In Proceedings of the 18th Workshop on Multiword Expressions @LREC2022, pages 105–111, European Language Resources Association, Marseille, France.
Phelps, Dylan, Thomas Pickard, Maggie Mi, Edward Gow-Smith, and Aline Villavicencio. 2024. Sign of the times: Evaluating the use of large language models for idiomaticity detection. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD 2024), Association for Computational Linguistics.
Pires, Telmo, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.
Prasad, Grusha, Marten van Schijndel, and Tal Linzen. 2019. Using priming to uncover the organization of syntactic representations in neural language models. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 66–76, Association for Computational Linguistics, Hong Kong, China.
Rademaker, Alexandre, Valeria de Paiva, Gerard de Melo, Livy Real, and Maira Gatti. 2014. OpenWordNet-PT: A project report. In Proceedings of the Seventh Global Wordnet Conference, pages 383–390, University of Tartu Press, Tartu, Estonia.
Ramisch, Carlos. 2023. Multiword expressions in computational linguistics. Down the rabbit hole and through the looking glass. Aix Marseille University.
Reddy, Siva, Diana McCarthy, and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 210–218, Asian Federation of Natural Language Processing, Chiang Mai, Thailand.
Reimers, Nils and Iryna Gurevych. 2019a. Alternative weighting schemes for ELMo embeddings. CoRR, abs/1904.02954.
Reimers, Nils and Iryna Gurevych. 2019b. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Association for Computational Linguistics, Hong Kong, China.
Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.
Saakyan, Arkadiy, Tuhin Chakrabarty, Debanjan Ghosh, and Smaranda Muresan. 2022. A report on the FigLang 2022 shared task on understanding figurative language. In Proceedings of the 3rd Workshop on Figurative Language Processing (FLP), pages 178–183, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid).
Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2002), pages 1–15, Springer, Berlin, Heidelberg, Mexico City, Mexico.
Sanh, Victor, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
van Schijndel, Marten and Tal Linzen. 2018. A neural model of adaptation in reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4704–4710, Association for Computational Linguistics, Brussels, Belgium.
Schuster, Tal, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Association for Computational Linguistics, Minneapolis, Minnesota.
Shwartz, Vered and Ido Dagan. 2019. Still a pain in the neck: Evaluating text representations on lexical composition. Transactions of the Association for Computational Linguistics, 7:403–419.
Sporleder, Caroline, Linlin Li, Philip Gorinski, and Xaver Koch. 2010. Idioms in context: The IDIX corpus. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta.
Tayyar Madabushi, Harish, Edward Gow-Smith, Marcos Garcia, Carolina Scarton, Marco Idiart, and Aline Villavicencio. 2022. SemEval-2022 task 2: Multilingual idiomaticity detection and sentence embedding. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 107–121, Association for Computational Linguistics, Seattle, United States.
Tayyar Madabushi, Harish, Edward Gow-Smith, Carolina Scarton, and Aline Villavicencio. 2021. AStitchInLanguageModels: Dataset and methods for the exploration of idiomaticity in pre-trained language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3464–3477, Association for Computational Linguistics, Punta Cana, Dominican Republic.
Appendix
In this section we present the mean and standard deviation for the NCs in English and Portuguese
in naturalistic and neutral sentences, for the different probes at Sentence level (Table 10) and at
NC level (Table 11), for Affinities (Table 12) and for Scaled Similarities (Table 13).
PSyn
Model Name       EN-Nat (mean  std)   EN-Neut (mean  std)   PT-Nat (mean  std)   PT-Neut (mean  std)
Word2Vec 0.985 0.012 0.811 0.083 0.968 0.025 0.883 0.062
GloVe 0.990 0.008 0.868 0.063 0.980 0.018 0.931 0.054
ELMo 0.974 0.022 0.841 0.070 0.938 0.045 0.782 0.116
SBERT ML 0.974 0.022 0.810 0.101 0.955 0.035 0.833 0.096
BERT 0.988 0.011 0.927 0.035 0.980 0.017 0.915 0.041
BERT ML 0.992 0.007 0.924 0.040 0.984 0.012 0.929 0.044
DistilBERT ML 0.996 0.003 0.952 0.023 0.991 0.007 0.966 0.018
LLama2 0.992 0.010 0.955 0.020 0.981 0.018 0.903 0.065
PComp
Word2Vec 0.996 0.004 0.941 0.018 0.987 0.011 0.957 0.026
GloVe 0.996 0.003 0.955 0.011 0.993 0.006 0.982 0.012
ELMo 0.989 0.009 0.914 0.019 0.966 0.020 0.890 0.035
SBERT ML 0.990 0.007 0.922 0.021 0.982 0.013 0.929 0.029
BERT 0.992 0.007 0.951 0.016 0.986 0.013 0.933 0.025
BERT ML 0.996 0.003 0.957 0.016 0.993 0.005 0.962 0.016
DistilBERT ML 0.998 0.001 0.977 0.006 0.996 0.002 0.987 0.005
LLama2 0.995 0.008 0.986 0.007 0.991 0.008 0.964 0.020
PWordsSyn
Word2Vec 0.983 0.013 0.797 0.049 0.958 0.031 0.845 0.060
GloVe 0.989 0.009 0.863 0.041 0.974 0.025 0.904 0.062
ELMo 0.975 0.020 0.861 0.048 0.930 0.042 0.760 0.088
SBERT ML 0.977 0.017 0.844 0.057 0.956 0.033 0.855 0.060
BERT 0.983 0.014 0.919 0.032 0.967 0.025 0.891 0.038
BERT ML 0.991 0.006 0.925 0.036 0.983 0.012 0.934 0.032
DistilBERT ML 0.995 0.003 0.952 0.016 0.990 0.006 0.963 0.014
LLama2 0.986 0.014 0.945 0.021 0.977 0.017 0.891 0.052
PRand
Word2Vec 0.984 0.012 0.799 0.043 0.960 0.033 0.851 0.099
GloVe 0.988 0.009 0.849 0.038 0.974 0.026 0.911 0.095
ELMo 0.966 0.025 0.829 0.040 0.912 0.048 0.725 0.115
SBERT ML 0.968 0.023 0.769 0.053 0.935 0.043 0.768 0.063
BERT 0.979 0.018 0.924 0.027 0.956 0.028 0.886 0.033
BERT ML 0.990 0.008 0.925 0.024 0.980 0.013 0.933 0.030
DistilBERT ML 0.995 0.004 0.951 0.012 0.990 0.007 0.967 0.016
LLama2 0.980 0.019 0.937 0.015 0.962 0.026 0.879 0.058
Table 10: Mean and standard deviation at Sentence level for PSyn, PComp, PWordsSyn and PRand,
for English (EN) and Portuguese (PT) for naturalistic (Nat) and neutral (Neut) sentences.
As the NCSyn were selected from the synonyms proposed by the human annotators, and chosen
according to frequency, this led to cases of lexical overlap. Removing the NCs with lexical overlap
with their NCSyn and analysing the correlations for Affinities and Scaled Similarities, the results
are as shown in Tables 14 and 15.
PSyn
Model Name       EN-Nat (mean  std)   EN-Neut (mean  std)   PT-Nat (mean  std)   PT-Neut (mean  std)
Word2Vec 0.517 0.209 0.517 0.207 0.498 0.251 0.488 0.258
GloVe 0.551 0.227 0.555 0.222 0.465 0.278 0.473 0.275
ELMo 0.714 0.147 0.646 0.155 0.629 0.166 0.551 0.192
SBERT ML 0.591 0.208 0.577 0.203 0.632 0.199 0.612 0.198
BERT 0.816 0.086 0.854 0.060 0.824 0.090 0.831 0.079
BERT ML 0.876 0.061 0.861 0.059 0.880 0.056 0.866 0.063
DistilBERT ML 0.867 0.058 0.864 0.057 0.868 0.059 0.870 0.056
LLama2 0.702 0.189 0.612 0.200 0.533 0.216 0.589 0.205
PComp
Word2Vec 0.840 0.039 0.838 0.039 0.714 0.269 0.703 0.280
GloVe 0.835 0.041 0.837 0.040 0.715 0.276 0.710 0.282
ELMo 0.859 0.042 0.823 0.040 0.781 0.080 0.733 0.093
SBERT ML 0.815 0.042 0.805 0.038 0.823 0.050 0.808 0.052
BERT 0.849 0.060 0.886 0.037 0.855 0.066 0.864 0.041
BERT ML 0.923 0.022 0.913 0.020 0.930 0.021 0.921 0.023
DistilBERT ML 0.922 0.015 0.922 0.013 0.929 0.018 0.931 0.014
LLama2 0.828 0.102 0.844 0.086 0.741 0.174 0.749 0.174
PWordsSyn
Word2Vec 0.524 0.098 0.524 0.097 0.459 0.185 0.450 0.189
GloVe 0.569 0.119 0.572 0.116 0.356 0.196 0.357 0.198
ELMo 0.759 0.083 0.707 0.091 0.644 0.100 0.557 0.110
SBERT ML 0.659 0.112 0.645 0.112 0.670 0.119 0.662 0.122
BERT 0.780 0.105 0.850 0.064 0.783 0.077 0.820 0.054
BERT ML 0.881 0.035 0.867 0.040 0.887 0.035 0.877 0.039
DistilBERT ML 0.870 0.029 0.868 0.027 0.875 0.027 0.877 0.026
LLama2 0.668 0.148 0.601 0.151 0.490 0.137 0.560 0.118
PRand
Word2Vec 0.419 0.064 0.423 0.065 0.460 0.185 0.371 0.151
GloVe 0.413 0.108 0.419 0.108 0.356 0.196 0.293 0.219
ELMo 0.674 0.082 0.628 0.069 0.644 0.100 0.482 0.097
SBERT ML 0.479 0.067 0.473 0.067 0.670 0.119 0.479 0.072
BERT 0.746 0.117 0.855 0.061 0.783 0.077 0.808 0.037
BERT ML 0.872 0.031 0.872 0.028 0.887 0.035 0.883 0.032
DistilBERT ML 0.879 0.024 0.879 0.021 0.875 0.027 0.898 0.021
LLama2 0.631 0.100 0.568 0.105 0.490 0.137 0.544 0.102
Table 11: Mean and standard deviation at NC level for PSyn, PComp, PWordsSyn and PRand, for
English (EN) and Portuguese (PT) for naturalistic (Nat) and neutral (Neut) sentences.
The results are compatible with those of Tables 5 and 6 for the complete set of NCs. As expected,
the correlations are smaller and less significant than those obtained for the full set, since with the
removal of the NCs with lexical overlap a smaller set was used to calculate correlations. The
ultimate test will be to redo the analysis with the full list of
NCs but only using NCSyn without lexical overlap, but this requires additional human annotation
and is left for future work.
ASyn|WordsSyn
Model Name       EN-Nat (mean  std)   EN-Neut (mean  std)   PT-Nat (mean  std)   PT-Neut (mean  std)
Word2Vec -0.002 0.149 0.004 0.156 0.025 0.152 0.038 0.160
GloVe -0.009 0.166 -0.006 0.170 0.058 0.193 0.072 0.193
ELMo -0.023 0.108 -0.040 0.134 -0.003 0.124 0.008 0.182
SBERT ML -0.036 0.160 -0.051 0.178 -0.019 0.154 -0.036 0.176
BERT 0.021 0.090 0.006 0.067 0.027 0.077 0.017 0.073
BERT ML -0.002 0.044 -0.003 0.056 -0.003 0.044 -0.008 0.059
DistilBERT ML -0.001 0.041 -0.003 0.045 -0.003 0.047 -0.002 0.049
LLama2 0.020 0.137 0.011 0.154 0.024 0.169 0.021 0.166
ASyn|Rand
Word2Vec 0.049 0.156 0.054 0.162 0.076 0.165 0.074 0.182
GloVe 0.070 0.177 0.077 0.179 0.110 0.213 0.100 0.219
ELMo 0.024 0.116 0.015 0.135 0.051 0.127 0.062 0.198
SBERT ML 0.059 0.173 0.072 0.186 0.081 0.165 0.099 0.171
BERT 0.040 0.103 0.001 0.065 0.057 0.085 0.026 0.070
BERT ML 0.003 0.048 -0.006 0.052 0.000 0.042 -0.011 0.054
DistilBERT ML -0.005 0.047 -0.007 0.049 -0.012 0.044 -0.015 0.046
LLama2 0.042 0.124 0.031 0.133 0.064 0.165 0.034 0.162
Table 12: Mean and standard deviation at NC level for ASyn|WordsSyn and ASyn|Rand, for English
(EN) and Portuguese (PT) for naturalistic (Nat) and neutral (Neut) sentences.
SimR|Syn
Model Name       EN-Nat (mean  std)   EN-Neut (mean  std)   PT-Nat (mean  std)   PT-Neut (mean  std)
Word2Vec 0.164 0.365 0.159 0.362 0.221 0.356 0.183 0.373
GloVe 0.221 0.407 0.220 0.406 0.264 0.391 0.225 0.424
ELMo 0.076 0.512 0.012 0.470 0.154 0.384 0.104 0.412
SBERT ML 0.190 0.441 0.172 0.429 0.259 0.419 0.244 0.395
BERT 0.075 0.735 -0.166 0.659 0.289 0.486 0.098 0.437
BERT ML -0.024 0.533 -0.128 0.525 -0.057 0.510 -0.194 0.566
DistilBERT ML -0.147 0.566 -0.166 0.544 -0.257 0.589 -0.320 0.618
LLama2 0.194 0.506 0.095 0.466 0.129 0.389 0.056 0.448
SimR|WordsSyn
Word2Vec 0.173 0.182 0.167 0.181 0.165 0.187 0.124 0.204
GloVe 0.245 0.236 0.243 0.233 0.113 0.237 0.061 0.246
ELMo 0.231 0.276 0.193 0.272 0.185 0.234 0.118 0.262
SBERT ML 0.336 0.231 0.315 0.230 0.339 0.258 0.340 0.256
BERT 0.092 0.307 -0.058 0.272 0.169 0.275 0.057 0.238
BERT ML 0.034 0.294 -0.068 0.357 0.007 0.296 -0.098 0.366
DistilBERT ML -0.105 0.284 -0.104 0.246 -0.196 0.300 -0.244 0.304
LLama2 0.094 0.380 0.058 0.368 0.099 0.240 0.039 0.255
Table 13: Mean and standard deviation at NC level for SimR|Syn and SimR|WordsSyn, for English
(EN) and Portuguese (PT) for naturalistic (Nat) and neutral (Neut) sentences.
Table 15: Spearman ρ correlation between the Scaled Similarities and human judgments, for
SimR|Syn and SimR|WordsSyn in both English and Portuguese after removing NCs with lexical
overlap between NC and NCSyn . Non-significant (p > 0.05) results were omitted from the table.
To mitigate the impact of rogue dimensions, a standardisation process using z-scores31 was applied
as proposed by Timkey and van Schijndel (2021). The mean vector µ was calculated across the
NC sentences and subtracted from each embedding vector to center the data. Each dimension of
the embedding was divided by its standard deviation σ.
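A minimal sketch of this standardisation is given below; selecting the rogue dimensions by the magnitude of their mean activation is an assumed proxy for the contribution-based criterion of Timkey and van Schijndel (2021), and the helper name is illustrative.

# Sketch: z-score standardisation of the top-k "rogue" dimensions of a set of
# NC representations from one model/layer (the ranking criterion is assumed).
import numpy as np

def standardise_rogue_dims(embs: np.ndarray, top_k: int = 3) -> np.ndarray:
    """embs: (n_items, dim) matrix of representations."""
    mu = embs.mean(axis=0)
    sigma = embs.std(axis=0) + 1e-12
    rogue = np.argsort(-np.abs(mu))[:top_k]       # assumed proxy for rogue dimensions
    out = embs.copy()
    out[:, rogue] = (embs[:, rogue] - mu[rogue]) / sigma[rogue]   # z = (x - mu) / sigma
    return out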
To assess the impact of standardisation, Spearman correlation was calculated between the PSyn
cosine similarities before and after standardisation:
• Pre-standardisation: Cosine similarities calculated using the original representations.
• Post-standardisation: Cosine similarities recalculated after standardisation.
The results are reported in Table 16.
31 z = (x − µ)/σ.