This is a Plain English Papers summary of a research paper called LLMs Fail Symmetry Test: New Training Improves Relational Reasoning. If you like these kinds of analyses, you should join AImodels.fyi or follow us on Twitter.

Understanding Symmetric and Antisymmetric Relations in Language Models

Large language models (LLMs) struggle with understanding symmetry and antisymmetry in relations, performing at random chance levels on related tasks. This research introduces a novel approach to enhance LLMs' ability to capture symmetric relations (like "country borders another country") and antisymmetric relations (like "parent_of") through symmetry-aware training objectives.

When a relation is symmetric, if "Country A borders Country B" then "Country B borders Country A" is also true. Conversely, with antisymmetric relations like "is a parent of," the reverse statement is false. Current LLMs fail to grasp these fundamental properties, creating challenges for applications in relation extraction, natural language inference, fact-checking, and common sense reasoning.

The researchers created a Wikidata-derived benchmark dataset and demonstrated that standard LLMs perform comparably to random chance on this benchmark. Their solution involves retraining the encoder component of LLMs using contrastive learning with k-nearest neighbors, incorporating specialized symmetry-aware objectives. This approach focuses on enhancing the encoder itself rather than adding classification heads, creating a more versatile model that can be applied across tasks without task-specific fine-tuning.

The retrained encoders match the performance of models with fine-tuned classification heads while requiring fewer training samples and exhibiting better retention of previously acquired knowledge. This efficiency in few-shot learning and reduced catastrophic forgetting showcases the effectiveness of symmetry-aware training for enhancing LLMs' relational understanding.

Teaching Models to Recognize Symmetry: Symmetry-Aware Training Approaches

The researchers explored several training approaches, comparing their effectiveness in helping language models understand symmetric and antisymmetric relations.

| Method/Property | Retraining with Random Label Embeddings | Retraining with k-NN | Retraining with k-NN and Learnt Distance Metric | Standard Fine-Tuning |
|---|---|---|---|---|
| Encoder Functionality | Single sentence encoding | Single sentence encoding | Pair of sentences encoding | Pair of sentences encoding |
| Trained Parameters | Encoder only | Encoder only | Encoder only | Encoder and head |
| Training Objective | Symmetry-aware training | Symmetry-aware training | Symmetry-aware training | Cross entropy |
| Label Embeddings | Randomly initialized, static during training | Implicitly represented by cluster centroids | Implicitly represented by cluster centroids | Absent (no explicit label embeddings) |
| Distance Metric | Fixed (Equation 4) | Fixed (Equation 4) | Dynamically learned by the encoder | Implicitly learned by the classification head |
| Probing Method | Nearest label embedding selection, no k-NN | k-NN for top-k closest neighbors, followed by majority voting | k-NN for top-k closest neighbors, followed by majority voting | Argmax on output of the classification head, no k-NN |

Table 1: Comparative Analysis of Models for Capturing Symmetric and Antisymmetric Relations

Defining Symmetric and Antisymmetric Relations

Symmetric and antisymmetric relations are formally defined by what a relation holding in one direction implies about the reverse direction:

A relation r is symmetric if, for all entities x and y, r(x,y) implies r(y,x).

Conversely, a relation r is antisymmetric if, for all entities x and y, r(x,y) implies the negation of r(y,x).
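In formal notation, these two definitions read:

$$\forall x, y:\ r(x, y) \Rightarrow r(y, x) \qquad \text{(symmetric)}$$

$$\forall x, y:\ r(x, y) \Rightarrow \neg\, r(y, x) \qquad \text{(antisymmetric)}$$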

To test language models' understanding of these relations, the researchers adopted a sentence-pair classification approach similar to natural language inference (NLI). The model receives a premise sentence stating a relation and a hypothesis formed by swapping the subject and object. The model should recognize whether the relation is symmetric or antisymmetric and determine if the hypothesis is entailed or contradicted.
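As a concrete illustration of this setup, here is a minimal sketch of the premise/hypothesis construction. The template strings and entity names are invented for illustration and are not the paper's exact data:

```python
# Hypothetical illustration of the sentence-pair (NLI-style) setup:
# the hypothesis is formed by swapping the subject and object.

def make_pair(template: str, subj: str, obj: str):
    """Build an NLI-style (premise, hypothesis) pair from a relation template."""
    premise = template.format(X=subj, Y=obj)
    hypothesis = template.format(X=obj, Y=subj)
    return premise, hypothesis

# Symmetric relation: the swapped hypothesis should be entailed.
print(make_pair("{X} borders {Y}.", "France", "Spain"))
# Antisymmetric relation: the swapped hypothesis should be contradicted.
print(make_pair("{X} is a parent of {Y}.", "Anne", "Ben"))
```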

This approach to capturing symmetry in language models is part of a broader effort to improve relational understanding in AI. Similar work in unified frameworks for symmetry enforcement has shown promising results for enhancing model capabilities.

Learning with Random Label Embeddings

The first approach retrains only the encoder component of the LLM without adding any classification head. During training, the model encodes the premise and hypothesis sentences as separate inputs, minimizing the distance between each sentence embedding and its correct label embedding while pushing incorrect label embeddings away by at least a fixed margin.

A critical innovation is the symmetry-aware distance metric derived from the RotatE knowledge graph embedding model. Standard similarity functions such as the dot product and cosine similarity are not symmetry-aware, whereas the proposed metric captures both the symmetric and antisymmetric properties of relations.

During inference, the model computes the distance between the input sentence embedding and each label embedding, selecting the label with the closest embedding as its prediction.
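A minimal PyTorch-style sketch of this idea follows. It assumes frozen, randomly initialized label embeddings and uses a plain Euclidean distance as a stand-in for the paper's fixed symmetry-aware metric (Equation 4, which this summary does not reproduce); all names and hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

num_labels, dim, margin = 2, 768, 1.0
label_emb = torch.randn(num_labels, dim)  # randomly initialized, frozen

def distance(u, v):
    # Stand-in for the paper's fixed symmetry-aware metric (Eq. 4).
    return torch.norm(u - v, dim=-1)

def retraining_loss(sent_emb, labels):
    # sent_emb: (batch, dim) encoder outputs; labels: (batch,) gold labels.
    d = distance(sent_emb.unsqueeze(1), label_emb.unsqueeze(0))  # (batch, num_labels)
    pos = d.gather(1, labels.unsqueeze(1)).squeeze(1)            # distance to gold label
    mask = F.one_hot(labels, num_labels).bool()
    neg = d.masked_fill(mask, float("inf")).min(dim=1).values    # closest wrong label
    # Pull each sentence toward its label; push wrong labels beyond the margin.
    return (pos + F.relu(margin - neg)).mean()

def predict(sent_emb):
    # Inference: choose the label whose embedding is closest.
    return distance(sent_emb.unsqueeze(1), label_emb.unsqueeze(0)).argmin(dim=1)
```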

Improving with k-Nearest Neighbors

The second approach incorporates k-Nearest Neighbors (k-NN) into the retraining process. The model processes pairs of samples without explicit labels, using an objective function that minimizes the distance between label embeddings of samples with the same label and maximizes it for samples with different labels.

A key advantage of the k-NN approach is its flexibility in handling varying numbers of labels. Unlike fine-tuning methods that require retraining for new labels, k-NN adapts to new labels without extensive retraining. This flexibility proves beneficial for tasks like fact-checking where claims can be ambiguous.

During inference, the model computes distances from the test sample's label embedding to the label embeddings of training samples, with the final label determined by majority voting among the k nearest neighbors.
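Sketched in Python, the probing step might look like this (assuming embeddings are already computed and stored as NumPy arrays; the plain Euclidean distance again stands in for the actual metric):

```python
import numpy as np
from collections import Counter

def knn_predict(test_emb, train_embs, train_labels, k=5):
    """Majority vote among the k nearest training embeddings."""
    dists = np.linalg.norm(train_embs - test_emb, axis=1)  # stand-in metric
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```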

The researchers also explored a variant where the model learns the distance metric dynamically rather than using a fixed metric. This approach encodes pairs of samples together and allows the model to adaptively determine the best representation for label embeddings based on training data.
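A rough sketch of the difference between the two variants (the `[SEP]`-style joint input and the `encoder` callable are assumptions for illustration, not the authors' exact interface):

```python
# Fixed-metric variant: each sample is encoded alone and compared
# with a fixed distance function:
#   d = fixed_distance(encoder(sample_a), encoder(sample_b))

# Learned-metric variant: both samples are encoded together, so the
# encoder itself determines how "close" two samples are.
def learned_distance(encoder, sample_a: str, sample_b: str) -> float:
    return encoder(f"{sample_a} [SEP] {sample_b}")
```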

Recent research on symmetry transformations in generative models shows similar principles being applied in other AI domains, highlighting the broader relevance of symmetry-aware approaches.

Testing Symmetry Understanding: Experimental Evaluation

Creating a Symmetry Test Bench

To evaluate language models' understanding of symmetric and antisymmetric relations, the researchers developed a specialized dataset derived from Wikidata. Each example consists of triples demonstrating these relation types, formatted to fit the Natural Language Inference framework.

The dataset includes both lexicalized and delexicalized versions. In the lexicalized format, statements use natural language (e.g., "Nibong LRT Station is part of LRT Singapore" contradicts "LRT Singapore is part of Nibong LRT Station"). The delexicalized version uses Wikidata IDs for entities (e.g., "Q7024230 is part of Q2231347" contradicts "Q2231347 is part of Q7024230").

This comprehensive dataset contains 400,000 examples spanning 14 symmetric and antisymmetric relations, with 100,000 examples reserved for testing. The evaluation measured accuracy, few-shot learning efficiency (training samples needed), and resilience to catastrophic forgetting (performance drop on MNLI after training).

This novel approach to testing relational understanding complements other research on capturing symmetry in language models, establishing benchmarks for future work in this area.

Performance Results: Better Understanding with Less Training

The experimental results revealed that pre-trained LLMs, even those fine-tuned on MNLI, barely outperform random chance on symmetry and antisymmetry recognition tasks.

| Model (Method) | Accuracy (Lexicalized) | Accuracy (Delexicalized) | Training Samples | Catastrophic Forgetting (Δ ↓) |
|---|---|---|---|---|
| Random Baseline | 50% | 50% | - | - |
| RoBERTa-Large | 48.3% | 49.7% | - | - |
| RoBERTa-Large-MNLI | 51.2% | 56.7% | - | - |
| Random Label Embeddings | 100% | 100% | 48 | ↓ 5.8% |
| k-NN | 100% | 100% | 64 | ↓ 7.7% |
| k-NN with Learnt Distance Metric | 100% | 100% | 400 | ↓ 21.5% |
| Fine-Tuning | 100% | 100% | 336 | ↓ 11.2% |

Table 2: Accuracy, Training Samples, and Catastrophic Forgetting of Different Models on Symmetric and Antisymmetric Relation Tasks

All the proposed retraining methods achieved 100% accuracy in both lexicalized and delexicalized formats. This high performance across both formats demonstrates the models' ability to generalize beyond specific entities and capture underlying relational semantics.

Most notably, methods employing a fixed symmetry-aware distance metric required significantly fewer training samples. The Random Label Embeddings approach needed only 48 samples, while standard fine-tuning required 336 samples to achieve the same performance. This efficiency suggests that with a well-designed distance metric, the task becomes more straightforward for the encoder.

The retraining methods also showed varied resilience to catastrophic forgetting. While fine-tuning caused an 11.2% drop in performance on the MNLI dataset, methods with a fixed symmetry-aware distance metric retained more knowledge, with the Random Label Embeddings approach showing only a 5.8% drop.

The researchers also tested smaller language models (MiniLM and all-MiniLM with 6 and 12 layers), achieving similar results across model sizes. This demonstrates the approach's scalability and generalizability across different LLM architectures.

These findings relate to recent research on removing symmetries to control model expressivity, showing complementary approaches to manipulating symmetry properties in AI models.

Key Takeaways: Improving Relational Understanding in LLMs

The study demonstrates that retraining LLM encoders effectively enhances their understanding of symmetric and antisymmetric relations. This approach matches the performance achieved by fine-tuning classification heads while offering additional benefits:

  1. Improved few-shot learning: Methods with symmetry-aware distance metrics require significantly fewer training samples.
  2. Better knowledge retention: Less catastrophic forgetting compared to standard fine-tuning, preserving more previously acquired knowledge.
  3. Greater flexibility: The k-NN approach adapts to new labels without extensive retraining, beneficial for tasks requiring nuanced understanding.

These advantages make symmetry-aware training a promising approach for enhancing LLMs' relational reasoning capabilities across various NLP applications.

Current Limitations and Future Directions

Despite the impressive performance of the retrained models, several limitations remain. The study relies on automatically generated datasets from Wikidata, which may lack the syntactic diversity found in natural language. While the models achieved perfect accuracy, this doesn't mean they fully understand the nuances of symmetric and antisymmetric relations in all contexts.

The dataset's focus on Wikidata may overlook domain-specific nuances present in other knowledge sources. This limitation could affect performance in real-world applications like information extraction, question-answering, and natural language inference, where contextual understanding is crucial.

These limitations suggest that understanding symmetry and antisymmetry in language models remains an open challenge requiring further research, despite the significant progress demonstrated in this work. As noted in research on language model rigidity, LLMs often struggle with flexible reasoning about relationships, highlighting the importance of continued work in this area.

Ethical Considerations in Using Crowdsourced Knowledge

The research relies on Wikidata, a collaborative, open-domain knowledge base widely used in the research community. However, crowdsourced resources can reflect biases from contributors, potentially introducing cultural, geographical, or individual perspectives into the dataset.

These biases could inadvertently skew model evaluations, lead to partial perspectives, or reinforce existing stereotypes. Researchers and practitioners using such datasets should approach them with awareness of these limitations and interpret results with consideration for potential biases.

Future work should incorporate diverse data sources to reduce reliance on a single, potentially biased repository. Transparency and continued scrutiny in data curation and model evaluation processes remain essential for responsible AI development in this domain.

The templates used for generating the dataset include a variety of relation types:

| Property ID | Template |
|---|---|
| P40 | [Y] is a child of [X]. |
| P1382 | [Y] partially overlaps with [X]. |
| P279 | [X] is a type of [Y]. |
| P3373 | [X] is a sibling of [Y]. |
| P1560 | [X] is an equivalent name of [Y] for other gender. |
| P131 | [X] is located in [Y]. |
| P25 | [Y] is the mother of [X]. |
| P22 | [Y] is the father of [X]. |
| P460 | [X] possibly the same as [Y]. |
| P2670 | [X] has part(s) that are instances of [Y]. |
| P1542 | [X] led to [Y]. |
| P1889 | [X] is different from [Y]. |
| P361 | [X] is part of [Y]. |
| P828 | [X] caused by [Y]. |

Table 3: Templates used for natural language conversion of triples where [X] and [Y] are placeholders

Additional experiments with smaller language models demonstrated the approach's effectiveness across model sizes:

| Method | MiniLM (6 layers) | MiniLM (12 layers) | all-MiniLM (6 layers) | all-MiniLM (12 layers) |
|---|---|---|---|---|
| **Accuracy (Lexicalized)** | | | | |
| Pretrained | 46.7% | 49.2% | 51.2% | 48.3% |
| Random Label Embeddings | 100% | 100% | 100% | 100% |
| k-NN | 99.1% | 99.1% | 100% | 100% |
| k-NN with Learnt Distance Metric | 97.6% | 100% | 100% | 100% |
| Fine-Tuning | 99.8% | 100% | 100% | 100% |
| **Accuracy (Delexicalized)** | | | | |
| Pretrained | 49.2% | 48.3% | 52.3% | 52.3% |
| Random Label Embeddings | 100% | 100% | 100% | 100% |
| k-NN | 100% | 100% | 100% | 100% |
| k-NN with Learnt Distance Metric | 100% | 100% | 100% | 100% |
| Fine-Tuning | 100% | 100% | 99.8% | 100% |

Table 4: Accuracy of Different Models on Symmetric and Antisymmetric Relation Tasks with MiniLM and all-MiniLM Variants

Click here to read the full summary of this paper