This is a Plain English Papers summary of a research paper called Do LLMs Understand Who Did What? Syntax vs. Meaning in Language Models. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.
Understanding Language vs. Understanding Meaning: The LLM Debate
Large Language Models (LLMs) have sparked heated debate about whether they truly "understand" language. Critics often claim LLMs lack understanding because they fail at tasks requiring common sense, logical reasoning, or grounding in the physical world. But this research paper argues these criticisms miss a crucial distinction: many abilities often categorized as "language understanding" are actually distinct cognitive processes in humans.
Instead of focusing on these broader cognitive abilities, the researchers examine a more fundamental aspect of language understanding: the ability to determine "who did what to whom" in a sentence. These thematic roles are a core part of sentence meaning, distinct from syntactic structure.
The central question: Does the word prediction objective that trains LLMs lead to sentence representations that capture thematic roles? The researchers investigated this across four different models: BERT, GPT-2, Llama2, and Persimmon.
This focus on thematic roles addresses a gap identified in other research on linguistic blind spots in large language models, which found LLMs struggle with certain linguistic phenomena despite their fluent outputs.
Experiment 1: How LLMs Represent Sentence Meaning vs. Structure
The first experiment tested whether LLM representations distinguish between sentences based on thematic roles (who did what to whom) or primarily syntactic structure. The researchers compared sentences that shared or differed in both meaning and structure.
For example, take the base sentence "The lawyer saved the author." The researchers created variations that:
- Maintained both meaning and structure by swapping in synonyms ("The attorney rescued the writer")
- Maintained meaning but changed structure ("The author was saved by the lawyer")
- Changed meaning but maintained structure ("The author saved the lawyer")
- Changed both meaning and structure ("The attorney was rescued by the writer")
| | Same Structure (active) | Different Structure (passive) |
|---|---|---|
| Same Meaning | (A) The attorney rescued the writer | (B) The author was saved by the lawyer |
| Different Meaning | (C) The author saved the lawyer | (D) The attorney was rescued by the writer |
Table 1. Sample Experiment 1 materials, compared to the base sentence "The lawyer saved the author".
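To make the 2×2 design concrete, here is a minimal sketch in Python that pairs each variant from Table 1 with its meaning/structure condition. The sentence strings come from the table; the data structure and loop are purely illustrative, not the authors' materials pipeline.

```python
# Illustrative 2x2 stimulus design from Table 1: each variant of the base
# sentence either shares or changes the meaning and/or the syntactic structure.
base = "The lawyer saved the author."

conditions = {
    ("same meaning", "same structure"):           "The attorney rescued the writer.",
    ("same meaning", "different structure"):      "The author was saved by the lawyer.",
    ("different meaning", "same structure"):      "The author saved the lawyer.",
    ("different meaning", "different structure"): "The attorney was rescued by the writer.",
}

print(f"Base: {base}")
for (meaning, structure), sentence in conditions.items():
    print(f"{meaning:18s} | {structure:20s} | {sentence}")
```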
Using representational similarity analysis, the researchers examined how LLMs encoded these sentences. The results revealed a striking pattern across all four models:
| Comparison | BERT | GPT-2 | Llama2 | Persimmon |
|---|---|---|---|---|
| $\mathrm{SEM_d}{-}\mathrm{SYNT_s} > \mathrm{SEM_s}{-}\mathrm{SYNT_d}$ | $z=8.34$, $p=10^{-15}$, $\mathrm{CI}=[0.95, 1]$ | $z=8.23$, $p<10^{-14}$, $\mathrm{CI}=[0.88, 0.98]$ | $z=6.86$, $p<10^{-10}$, $\mathrm{CI}=[0.77, 0.91]$ | $z=7.14$, $p<10^{-11}$, $\mathrm{CI}=[0.87, 0.97]$ |
| $\mathrm{SEM_s}{-}\mathrm{SYNT_s} > \mathrm{SEM_d}{-}\mathrm{SYNT_d}$ | $z=5.11$, $p=10^{-2}$, $\mathrm{CI}=[0.66, 0.84]$ | $z=4.50$, $p<10^{-4}$, $\mathrm{CI}=[0.61, 0.79]$ | $z=4.05$, $p<10^{-3}$, $\mathrm{CI}=[0.60, 0.79]$ | $z=5.16$, $p<10^{-2}$, $\mathrm{CI}=[0.65, 0.83]$ |
| $\mathrm{SEM_s}{-}\mathrm{SYNT_d} > \mathrm{SEM_d}{-}\mathrm{SYNT_d}$ | $z=8.34$, $p=10^{-15}$, $\mathrm{CI}=[0.89, 0.99]$ | $z=8.33$, $p<10^{-16}$, $\mathrm{CI}=[1, 1]$ | $z=7.13$, $p<10^{-11}$, $\mathrm{CI}=[0.73, 0.89]$ | $z=6.47$, $p<10^{-9}$, $\mathrm{CI}=[0.73, 0.89]$ |
| $\mathrm{SEM_s}{-}\mathrm{SYNT_d} > \mathrm{SEM_s}{-}\mathrm{SYNT_s}$ | $z=6.34$, $p=10^{-8}$, $\mathrm{CI}=[0.71, 0.87]$ | $z=8.19$, $p<10^{-14}$, $\mathrm{CI}=[0.95, 1]$ | $z=4.40$, $p<10^{-4}$, $\mathrm{CI}=[0.57, 0.76]$ | $z=3.48$, $p=.003$, $\mathrm{CI}=[0.53, 0.73]$ |
| $\mathrm{SEM_d}{-}\mathrm{SYNT_s} > \mathrm{SEM_d}{-}\mathrm{SYNT_d}$ | $z=8.41$, $p=10^{-15}$, $\mathrm{CI}=[1, 1]$ | $z=8.32$, $p<10^{-15}$, $\mathrm{CI}=[1, 1]$ | $z=7.85$, $p<10^{-13}$, $\mathrm{CI}=[0.85, 0.97]$ | $z=7.85$, $p<10^{-13}$, $\mathrm{CI}=[0.91, 0.99]$ |
Table 2. Results of Experiment 1 for each LLM. SEM = meaning, SYNT = structure; subscript s = same as the base sentence, d = different. Each cell reports the test statistic (z), p-value, and confidence interval (CI) for the comparison in that row.
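To give a feel for the kind of comparison summarized in Table 2, here is a minimal, illustrative sketch: it embeds the base sentence and two variants with a pretrained model and compares cosine similarities. The model choice, mean pooling, and similarity metric are assumptions made for illustration, not the paper's exact representational similarity analysis.

```python
# Sketch of the comparison logic: does a model's sentence representation sit
# closer to a same-meaning/different-structure variant, or to a
# different-meaning/same-structure one?
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the final-layer hidden states into a single sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

base = embed("The lawyer saved the author.")
same_meaning_diff_structure = embed("The author was saved by the lawyer.")
diff_meaning_same_structure = embed("The author saved the lawyer.")

cos = torch.nn.functional.cosine_similarity
print("same meaning, different structure:",
      cos(base, same_meaning_diff_structure, dim=0).item())
print("different meaning, same structure:",
      cos(base, diff_meaning_same_structure, dim=0).item())
# The paper's key finding predicts the second similarity tends to be higher:
# shared syntax outweighs shared thematic roles in the overall representation.
```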
The key finding: LLMs prioritize syntactic structure over thematic roles in their representations. This contrasts with human behavior, where meaning typically takes precedence over syntax. This aligns with findings from research on how humans and LLMs interpret subjective content, which has identified similar differences in processing priorities.
Experiment 2: Where (If Anywhere) Do LLMs Encode Thematic Roles?
Since the overall representations didn't prioritize thematic roles, the researchers conducted a second experiment to determine whether this information existed anywhere within the models' components. They expanded their stimulus set to include a variety of syntactic constructions that maintained the same thematic roles:
| Active / Passive | DO / PO | Simple / Cleft | Sentence |
|---|---|---|---|
| Active | DO | Simple | The man gave the woman the milk. |
| Active | DO | Cleft (subject) | It was the man who gave the woman the milk. |
| Active | DO | Cleft (dir. object) | It was the milk that the man gave the woman. |
| Active | PO | Simple | The man gave the milk to the woman. |
| Active | PO | Cleft (subject) | It was the man who gave the milk to the woman. |
| Active | PO | Cleft (dir. object) | It was the milk that the man gave to the woman. |
| Active | PO | Cleft (indir. object) | It was the woman who the man gave the milk to. |
| Passive | DO | Simple | The woman was given the milk by the man. |
| Passive | DO | Cleft (indir. object) | It was the woman who was given the milk by the man. |
| Passive | PO | Simple | The milk was given to the woman by the man. |
| Passive | PO | Cleft (dir. object) | It was the milk that was given to the woman by the man. |
| Passive | PO | Cleft (indir. object) | It was the woman who the milk was given to by the man. |
Table 3. Example stimuli for Experiment 2 (DO = double-object dative, PO = prepositional-object dative).
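The point of this design is that many different surface forms share a single role assignment. A small illustrative sketch of that idea follows; the dictionary structure is ours, and the sentences come from Table 3.

```python
# Many surface forms, one "who did what to whom" assignment.
roles = {"agent": "the man", "recipient": "the woman", "theme": "the milk"}

paraphrases = [
    "The man gave the woman the milk.",
    "It was the man who gave the woman the milk.",
    "The man gave the milk to the woman.",
    "The woman was given the milk by the man.",
    "The milk was given to the woman by the man.",
    "It was the woman who the milk was given to by the man.",
]

# A probe for thematic roles must recover the same role assignment from every
# paraphrase, regardless of voice, dative type, or clefting.
for sentence in paraphrases:
    print(sentence, "->", roles)
```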
The researchers analyzed three components of the LLMs:
- Hidden states (overall sentence representations)
- Individual units (specific neurons)
- Attention heads (components that control information flow)
Using probing classifiers, they tested whether any of these components could identify thematic roles regardless of syntax. This approach echoes methods used in research examining how sentences are encoded in large language models.
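As a rough illustration of the probing approach, the sketch below trains a simple logistic-regression probe to recover a thematic-role label from the activations of different model components. The feature extraction is replaced with random stand-in arrays, and the choice of classifier is our assumption, not the paper's exact setup.

```python
# Probing-classifier sketch: can a linear classifier predict a thematic-role
# label from one component's activations? Repeat per component and compare.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in features: one row per sentence, one matrix per component
# (e.g., a given attention head's output, a hidden-state layer, or a unit).
n_sentences, dim = 200, 64
features_by_component = {
    "attention_head_L5_H3": rng.normal(size=(n_sentences, dim)),
    "hidden_state_layer_8": rng.normal(size=(n_sentences, dim)),
}

# Label: e.g., 1 if the first noun is the agent, 0 otherwise, balanced across
# the syntactic constructions in Table 3 so the probe cannot rely on syntax.
labels = rng.integers(0, 2, size=n_sentences)

for name, X in features_by_component.items():
    probe = LogisticRegression(max_iter=1000)
    accuracy = cross_val_score(probe, X, labels, cv=5).mean()
    print(f"{name}: probe accuracy = {accuracy:.2f}")
```

With real activations in place of the random arrays, high probe accuracy for a component would indicate that it encodes thematic roles in a linearly decodable way; in the paper, this was the case for specific attention heads but not for hidden states or individual units.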
The results revealed a fascinating pattern: thematic roles were not strongly encoded in hidden states or individual units, but they were robustly encoded in specific attention heads. These attention heads could identify who was doing what to whom regardless of the syntactic structure used to express those relationships.
This finding suggests that LLMs do extract thematic role information, but this information doesn't strongly influence their overall representations. This partial encoding may explain why LLMs sometimes appear to understand "who did what to whom" in practical applications despite not prioritizing this information in their overall representations.
What This Tells Us About LLM "Understanding"
This research provides nuanced insight into what aspects of meaning LLMs capture. Rather than making blanket statements about whether LLMs "understand" language, the findings demonstrate specific strengths and limitations in how they process thematic roles.
The apparent tension between these results and human-like behavior of LLMs in practical applications reflects an important distinction: having information encoded somewhere in a neural network versus using that information to shape behavior. LLMs encode thematic roles in specific attention heads, but this information doesn't strongly influence their overall sentence representations.
This differs from human language processing, where thematic roles play a central role in comprehension. Humans prioritize "who did what to whom" over syntactic structure when judging sentence similarity, while LLMs do the opposite.
The research contributes to a growing body of work examining the effects of LLMs on human language processing by clarifying what aspects of meaning these models capture and how they differ from human processing.
Conclusion: Refining Our Understanding of LLM Capabilities
This research provides a more precise understanding of what LLMs "know" about who did what to whom in sentences. The key takeaway: LLMs can extract thematic roles through specific attention mechanisms, but this information influences their representations more weakly than in humans.
Rather than asking whether LLMs generally "understand" language, we should focus on which specific aspects of meaning they capture and how that information is represented. The findings suggest that while LLMs extract core semantic relationships, their representational priorities differ from those of humans in significant ways.
Future work might explore how to align LLM representations more closely with human priorities, potentially by designing training objectives that emphasize thematic roles. This could lead to models that not only process language fluently but also represent meaning in more human-like ways.
The research demonstrates the importance of precise characterizations of LLM capabilities rather than broad claims about "understanding," contributing valuable insight to ongoing debates about the nature of language processing in artificial intelligence systems.