Using the LLM Security Leaderboard to Select Models for Safe and Sustainable Code

Most language model benchmarking and comparison is focused on speed and accuracy. But with AI code generation, language model choice affects the safety and sustainability of resulting code. While many popular AI code-generation approaches rely on frontier models from providers like OpenAI and Anthropic, small- and mid-sized open-source models have advanced significantly and address specific needs for speed, efficiency, privacy, security, and compliance. To ensure developers and enterprises make informed choices, we’ve launched the LLM Security Leaderboard on Hugging Face to evaluate open-source models across four (initial) security dimensions. We’re taking an open, community-driven approach to this evaluation, and encourage you to join us in refining this benchmark.

You can read more about our criteria and methodology here. Below are our takeaways from this first wave of analysis:

Key Findings

  1. All models struggle with Bad Package Detection: Llama 3.2-3B led, but only correctly flagged ~29% of bad NPM and PyPI packages. Nearly all the models we evaluated detected less than 5% of bad packages, and several popular models detected 0%, they simply provided instructions on how to install the package, regardless of whether the package existed or included typos. These models put the responsibility for bad package detection squarely on the user.
  2. CVE Knowledge is Alarmingly Low: Awareness of Common Vulnerabilities and Exposures (CVEs) in dependencies is a basic requirement for secure code. Yet most models scored between 8% and 18% accuracy in this category. Qwen2.5-Coder-3B-Instruct was the leader, but still scored low at 18.25%. These results suggest that the depth and consistency of CVE knowledge needs to be significantly improved.
  3. Insecure Code Recognition is a Mixed Bag: Top models like Qwen2.5-Coder-32B-Instruct and microsoft/phi-4 successfully identified vulnerabilities in roughly half of the code snippets presented. Lower-performing models recognized vulnerabilities in fewer than a quarter of cases; the inconsistency underscores the need for more targeted training on secure coding practices.
  4. Model Size != Security: While larger models often perform better on general benchmarks, security-specific performance varied significantly. Smaller models like Llama-3.2-3B-Instruct and IBM's Granite 3.3-2B-Instruct punched above their weight, reinforcing that sheer model size is not decisive and that architecture, training methodologies, and datasets play crucial roles in security capabilities.
  5. Newer != Better: Newer models like Qwen2.4-Coder-32b (knowledge cutoff June 24, 2024) and Granite-3.3-2b-Instruct (knowledge cutoff April 24, 2024) have about the same or lower bad package and CVE detection capabilities as older models like Llama-3.2-3b-Instruct (knowledge cutoff March 23, 2023), suggesting that these newer models were not trained on the latest bad package and CVE knowledge.

What This Means for Developers and Researchers

These findings should guide how teams approach secure AI adoption for software development:

  • Select models thoughtfully, especially when using LLMs in security-sensitive codegen workflows.
  • Prioritize secure prompting techniques - careless prompting can exacerbate vulnerabilities.
  • Complement LLMs with security-aware tools, like Stacklok's open-source project CodeGate, to reinforce defenses.
  • Augment LLMs with Retrieval-Augmented Generation (RAG), using knowledge from leading vulnerability datasets such as NVD, OSV, Stacklok Insight, etc.
  • Push for better fine-tuning and training on security datasets across the community.

Get Involved

This is just the beginning. The LLM Security Leaderboard is live at Hugging Face, and we're inviting the community to submit models, suggest new evaluation methods, and contribute to a stronger, safer AI ecosystem.

Explore the leaderboard. Submit your models. Join the conversation.

Let's build a future where AI coding is safe and secure.