A while ago, a friend and I were discussing the cognitive complexity and maintainability of code. He wished for a tool that could automatically evaluate whether a piece of code is hard to maintain. I wasn’t sure this was even possible: maintainability is notoriously hard to quantify programmatically.

But LLMs can understand and generate human-like text or even code, and I wondered if that same capability could be applied to interpreting and evaluating code quality, going beyond what traditional static analysis tools can do.

That thought led to the tool I eventually built. It’s now available on PyPI, and I believe it could be a valuable addition to any CI pipeline.

In a previous post, I shared some early thoughts on maintainability and cognitive complexity of code that emerged while working on this tool. In this post, I’d like to go deeper and walk through the development process, using my CLI tool as a case study for building an LLM-based application.

🧠 Leveraging LangChain for LLM-Based Applications

To bring this idea to life, I used LangChain, a Python library for building LLM-powered applications. LangChain abstracts away the APIs of specific language models, so I could focus entirely on building the core functionality without worrying about the details of how to communicate with an LLM.

One of its most useful features is the seamless integration with Pydantic. I defined the output schemas as Pydantic models, and LangChain automatically generated prompts that guided the LLM to return structured responses; without this, far more time would have gone into parsing and error handling of raw LLM output.
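The mechanism is easy to picture even without LangChain in the loop: the Pydantic model defines the schema, and the raw JSON reply is validated back into a typed object. A minimal sketch, assuming Pydantic v2 (`FunctionScore` is a made-up example model, not the tool’s actual schema):

```python
from pydantic import BaseModel, Field

class FunctionScore(BaseModel):
    """Hypothetical output schema, in the spirit of the models shown later."""
    function_name: str = Field(description="Name of the function")
    complexity_score: float = Field(description="Complexity score from 0 to 5")

# Stand-in for a raw LLM reply; LangChain's output parsers perform this
# validation step for you behind the scenes.
raw_reply = '{"function_name": "parse_config", "complexity_score": 2.5}'
parsed = FunctionScore.model_validate_json(raw_reply)
```

If the reply doesn’t match the schema, validation raises immediately, which is far easier to handle than free-form text.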

💡 Attempt 1: Overall Design and Estimation with a Simple Prompt

With the basic setup ready, I began designing the metric. The idea was straightforward:

Ask the LLM to evaluate each function individually and assign a cognitive complexity score from 0 to 5, where higher scores indicate code that’s harder to understand and maintain.

The prompt included both formatting instructions (generated via LangChain) and an explanation of grading criteria. I experimented with different ways of phrasing the grading scale to make the results more stable and meaningful.

To refine the evaluation, I added a few extra fields:

  • A is_setup_or_declaration flag to identify and skip boilerplate code (like config or constant declarations).
  • start_line_number and end_line_number to estimate the size of each function.

At first, I tried using a single function_length field, but estimating start and end lines separately produced more reliable results.

from pydantic import BaseModel, Field

class CodeComplexityEvaluation(BaseModel):
    function_name: str = Field(description="Name of the function")

    is_setup_or_declaration: bool = Field(
        description="The code is part of setup or declaration boilerplate, such as defining constants or configuring a framework."
    )

    start_line_number: int = Field(
        description="Number of the first line of the function, considering existing formatting"
    )

    end_line_number: int = Field(
        description="Number of the last line of the function, considering existing formatting"
    )

    complexity_score: float = Field(
        description=(
            "Overall code complexity on a scale from 0 to 5, as discussed in the article "
            "'Simplifying Complex Code with Advanced Programming Approaches.'\n\n"
            "Interpretation:\n"
            "0 - 1: Very low complexity. The code is straightforward, easy to read, and requires minimal domain or technical knowledge.\n"
            "2 - 3: Moderate complexity. The code may use some advanced techniques or domain knowledge, but remains relatively approachable.\n"
            "4: High complexity. The code relies on multiple advanced concepts, intricate domain logic, or specialized optimizations.\n"
            "5: Extremely high complexity. The code likely combines various advanced paradigms, deep domain knowledge, and complex abstractions, "
            "making it very challenging to understand or maintain."
        )
    )

The code was evaluated file by file. For each file, the LLM assessed the maintainability of each function individually. The overall file score was then calculated as a length-weighted average of the per-function complexity scores.

The same approach was applied at the project level: the total score for the entire codebase was computed as a weighted average across all files, again weighting by the size of each file to better reflect its impact on overall maintainability.
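The length-weighted averaging can be sketched as follows (the helper and the sample numbers are illustrative, not the tool’s actual code):

```python
def weighted_average(scores, weights):
    """Weighted mean: longer functions (or files) count for more."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total if total else 0.0

# Hypothetical per-function results for one file: (complexity score, line count)
functions = [(1.0, 5), (4.0, 40), (2.0, 15)]
file_score = weighted_average(
    [score for score, _ in functions],
    [lines for _, lines in functions],
)
# The 40-line function dominates, pulling the file score toward its rating.
```

The same helper applies one level up, with per-file scores weighted by file size.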

✅ Initial Results and Observations

This approach yielded some promising early results. The LLM was able to evaluate functions and assign scores that often aligned with my own assessments.

However, I noticed the results weren’t fully consistent. Scores varied by 5–15% between runs, likely due to the stochastic nature of LLMs and their sensitivity to slight changes in input or internal randomness.

In the context of a CI pipeline, this kind of inconsistency is a problem. CI tools need reliable metrics to determine whether code meets quality standards, and an unstable score makes it hard to track whether code is actually improving or degrading over time.

📖 Attempt 2: Enhancing Accuracy with Explanations

To address the inconsistency, my next step was to enhance the prompt by explicitly asking the LLM to explain WHY it assigned a particular score.

From previous experiments, I had noticed that when an LLM is prompted to provide reasoning, it tends to produce more thoughtful and consistent responses. I suspect this happens because generating an explanation forces the model to “think through” its decision, leading to better alignment between the score and the reasoning behind it.

These explanations also helped me spot recurring patterns and biases in the model’s behavior. In some cases, I could identify where the prompt needed fine-tuning or where the LLM had misunderstood my intent. This kind of prompt iteration is an essential part of building robust LLM-based applications.

While this adjustment did lead to slightly more stable scores, the improvement wasn’t enough. Variability was reduced, but not to a level I felt comfortable using in a CI pipeline. I needed a more robust solution.

📊 Attempt 3: Incorporating Multiple Metrics for Robustness

I realized that relying on a single score to capture code complexity was too limiting and too fragile. So I shifted toward evaluating multiple metrics, each representing a different aspect of cognitive complexity.

The idea was simple: by breaking complexity into several dimensions and scoring each separately, I expected to create a more stable composite score. If one metric fluctuated slightly due to randomness, the others could help balance it out, leading to a more reliable overall result.

Introducing multiple metrics also opened the door to a new approach:

Treating each metric as a probability between 0 and 1. This standard scale made it easier for the LLM to reason about each factor, as it aligned with common patterns found in prompts and training data.

It also gave me more flexibility when interpreting the results. Instead of making a strict yes/no decision on whether a factor was present, I could adjust thresholds, trading off sensitivity against accuracy.
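Thresholding probability-style metrics might look like this (metric names and values are invented for illustration):

```python
# Soft probabilities reported per function by the LLM (hypothetical values):
metrics = {"recursion": 0.82, "magic_numbers": 0.35, "global_variables": 0.05}

def detected(metrics, threshold=0.5):
    """Turn soft probabilities into hard flags.

    Lowering the threshold increases sensitivity (more factors flagged)
    at the cost of precision.
    """
    return {name for name, p in metrics.items() if p >= threshold}

strict = detected(metrics)          # default threshold: only clear signals
lenient = detected(metrics, 0.3)    # more sensitive: borderline cases too
```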

from pydantic import BaseModel, Field

class CodeComplexityConfidenceEvaluation(BaseModel):
    function_name: str = Field(description="Name of the function")

    is_setup_or_declaration: bool = Field(
        description="The code is part of setup or declaration boilerplate, such as defining constants or configuring a framework."
    )

    start_line_number: int = Field(
        description="Number of the first line of the function, considering existing formatting"
    )

    end_line_number: int = Field(
        description="Number of the last line of the function, considering existing formatting"
    )

    use_of_advanced_algorithms: float = Field(
        description="Use of advanced algorithms requiring domain-specific knowledge. 0 means no such algorithms, 1 means heavily reliant on them."
    )
    low_level_optimizations: float = Field(
        description="Low-level optimizations that require deep knowledge of hardware or language internals."
    )
    complex_third_party_libraries: float = Field(
        description="Use of complex third-party libraries (e.g., Rx.js, Pandas, TensorFlow)."
    )
    business_logic_domain_expertise: float = Field(
        description="Business logic requiring domain-specific expertise."
    )
    advanced_coding_techniques: float = Field(
        description="Use of advanced coding techniques (e.g., functional programming)."
    )
    excessive_mutable_state: float = Field(
        description="Excessive reliance on mutable state. 0 means purely immutable or minimal state, 1 means heavy reliance on mutable data."
    )
    deeply_nested_control_structures: float = Field(
        description="Deeply nested control structures (more than 3 levels)."
    )
    long_classes: float = Field(
        description="Long classes (over 200 lines). 0 means no long classes, 1 means code is dominated by extremely large classes."
    )
    long_functions: float = Field(
        description="Long functions (over 100 lines). 0 means short functions, 1 means extremely long, monolithic functions."
    )
    parallelism_and_concurrency: float = Field(
        description="Usage of parallelism or concurrency patterns (threads, async, futures, etc.)."
    )
    recursion: float = Field(
        description="Usage of recursive functions or algorithms."
    )
    global_variables: float = Field(
        description="Use of global variables."
    )
    magic_numbers: float = Field(
        description="Magic numbers (unexplained constants) that reduce readability."
    )
    long_lists_of_positional_parameters: float = Field(
        description="Functions with a large number of positional parameters."
    )
    advanced_language_features: float = Field(
        description="Use of advanced language features (e.g., metaprogramming, reflection)."
    )
    inconsistent_indentation_or_formatting: float = Field(
        description="Poorly formatted code, inconsistent indentation, or misaligned braces."
    )
    long_monolithic_blocks_of_code: float = Field(
        description="Large uninterrupted blocks of code lacking clear separation."
    )
    non_descriptive_variable_function_names: float = Field(
        description="Non-descriptive or misleading names for variables, functions, or classes."
    )
    excessive_branching: float = Field(
        description="Frequent or complicated branching (if/else, switch), making logic harder to follow."
    )
    inconsistent_error_handling: float = Field(
        description="Multiple, inconsistent ways of handling errors throughout the code."
    )
    complex_boolean_logic: float = Field(
        description="Multiple combined boolean expressions making the logic difficult to parse."
    )
    code_duplication: float = Field(
        description="Repetitive code blocks or functions duplicated across the codebase."
    )
    non_idiomatic_use_of_language_features: float = Field(
        description="Using language features in a way that goes against common idioms or best practices."
    )

However, this approach came with a trade-off. It was challenging to balance the level of detail in the description of each factor with the overall size of the prompt. The more detailed and explicit the prompt, the better the LLM could identify specific factors, but longer prompts also meant slower response times, higher costs, and a greater chance of hitting token limits.

On the other hand, shorter prompts were faster and cheaper to run but often resulted in weaker detection accuracy, leading to higher error rates and less reliable evaluations.

🧩 Attempt 4: More Factors with Enums

Doubling down on the idea of averaging out inconsistencies, I decided to increase the number of factors even further, while simplifying the prompt by representing each factor as an enum value.

By scaling up the number of simpler, well-defined factors, I aimed to make the scoring system both more granular and consistent, without overwhelming the model with lengthy descriptions.

For example:

class CodeComplexityFactors(str, Enum):
    use_advanced_coding_techniques = (
        "Use of advanced coding techniques, such as functional programming, that are less commonly understood."
    )
    use_advanced_algorithms = (
        "Use of advanced algorithms requiring specialized knowledge, making the code harder to understand."
    )
    use_parallelism_concurrency_patterns = (
        "Use of parallelism, concurrency, or recursion, which adds complexity due to the challenges of handling state across multiple threads or processes."
    )
    use_advanced_language_features = (
        "Use of advanced language features, such as reflection or metaprogramming, which can obscure code readability and require deep understanding."
    )
    complicated_arithmetic_expressions = (
        "Complex arithmetic expressions that involve multiple operations or formulas, making it harder to reason about."
    )
    complicated_boolean_expressions = (
        "Complex boolean logic, including multiple conditions that can be difficult to follow and debug."
    )
    complicated_string_manipulation = (
        "Complex string manipulations that involve multiple functions or operations, reducing clarity."
    )
    complicated_bitwise_operations = (
        "Use of bitwise operations and manipulation, which are generally low-level and harder to understand."
    )
    use_complex_third_party_libraries = (
        "Use of complex third-party libraries (e.g., Rx.js, Pandas, TensorFlow) that require specialized knowledge to understand and work with."
    )
    business_domain_expertise = (
        "Code that requires specific knowledge in the business domain, such as finance or healthcare."
    )
    technical_domain_expertise = (
        "Code that requires technical domain knowledge, such as signal processing or computer graphics."
    )
    application_domain_expertise = (
        "Code that requires understanding of the business logic unique to the application."
    )
    use_global_variables = (
        "Use of global variables or hidden mutable state, which makes the code harder to reason about and introduces potential side effects."
    )
    non_standard_coding_conventions = (
        "Use of non-standard or inconsistent coding and naming conventions, which can confuse engineers unfamiliar with the code."
    )
    excessive_mutable_state = (
        "Excessive reliance on mutable state, making the code harder to predict and test."
    )
    magic_numbers = (
        "Use of magic numbers (unexplained constants) that lack context, reducing clarity."
    )
    long_lists_of_positional_parameters = (
        "Functions with long lists of positional parameters, which can lead to confusion and misuse."
    )
    excessive_boilerplate_code = (
        "Excessive boilerplate code, which can obscure the core functionality and make the code harder to maintain."
    )
    inconsistent_indentation_or_formatting = (
        "Inconsistent indentation or formatting, reducing readability and maintainability."
    )
    long_monolithic_blocks_of_code = (
        "Long, monolithic blocks of code without clear separation of concerns, making it difficult to follow."
    )
    non_descriptive_variable_function_names = (
        "Non-descriptive or misleading names for variables or functions, reducing clarity and making it harder to understand the code."
    )
    overly_complex_function_signatures = (
        "Overly complex function signatures, making it hard to understand the purpose and use of the function."
    )
    deeply_nested_control_flow = (
        "Deeply nested branching in control flow (e.g., if/else, switch), making it hard to follow the execution logic."
    )
    complicated_control_flow_branching = (
        "Complicated branching in control flow, adding difficulty in understanding the code's decision-making."
    )
    deeply_nested_loops = (
        "Deeply nested loops (e.g., for, while), which can reduce code readability and increase cognitive load."
    )
    complicated_loop_structure = (
        "Complicated loop structures that involve multiple conditions, breaking out of loops, or complex logic."
    )
    hidden_side_effects = (
        "Hidden side effects that are not immediately obvious from the function signature, making debugging and reasoning more difficult."
    )
    code_duplication = (
        "Code duplication across functions or classes, which increases maintenance complexity."
    )
    non_idiomatic_use_of_language_features = (
        "Non-idiomatic use of language features, which may be unfamiliar or unintuitive for engineers working in the language."
    )
    complex_math_concepts = (
        "Use of advanced mathematical concepts or models, which require specialized knowledge to understand."
    )
    functional_programming = (
        "Use of functional programming paradigms, which require a different way of thinking and may not be familiar to all engineers."
    )
    complex_inheritance = (
        "Complex inheritance hierarchies, which can be hard to trace and understand."
    )
    complex_polymorphism = (
        "Complex use of polymorphism, which may introduce unexpected behavior and harder-to-understand relationships between classes."
    )
    complex_data_structures = (
        "Use of complex data structures (e.g., graphs, trees) that require specialized knowledge to work with."
    )
    bitwise_operations = (
        "Use of bitwise operations, which are generally low-level and harder to understand."
    )
    concurrency_mechanisms = (
        "Use of complex concurrency mechanisms, which add complexity in terms of state management and performance."
    )
    complex_regular_expressions = (
        "Use of complex regular expressions, which are often hard to read and understand at a glance."
    )
    reflection_and_metaprogramming = (
        "Use of reflection, metaprogramming, or other runtime code manipulation that reduces readability and increases cognitive load."
    )
    high_performance_computations = (
        "High-performance computations or low-level system optimizations, requiring specialized knowledge and potentially obscuring clarity."
    )
    low_level_networking = (
        "Low-level networking or socket programming, which requires specialized technical knowledge."
    )
    use_of_category_theory = (
        "Use of category theory concepts, which are very abstract and require a deep understanding to work with."
    )
    domain_specific_languages = (
        "Use of domain-specific languages (DSLs), which introduce custom syntax or rules that may be unfamiliar."
    )

To make the scoring more meaningful, I also assigned a custom weight to each enum value. By asking the LLM to identify which factors were present in the code and then applying the corresponding weights, I could compute a weighted sum reflecting the overall estimated complexity. This gave me a more flexible way to evaluate code, where each factor contributed proportionally to how much it affects readability, maintainability, or onboarding effort.

code_complexity_factors_weight = {
    CodeComplexityFactors.use_advanced_coding_techniques: 10,
    CodeComplexityFactors.use_advanced_algorithms: 6,
    CodeComplexityFactors.use_parallelism_concurrency_patterns: 4,
    CodeComplexityFactors.use_advanced_language_features: 6,
    CodeComplexityFactors.complicated_arithmetic_expressions: 3,
    CodeComplexityFactors.complicated_boolean_expressions: 3,
    CodeComplexityFactors.complicated_string_manipulation: 2,
    CodeComplexityFactors.complicated_bitwise_operations: 4,
    CodeComplexityFactors.use_complex_third_party_libraries: 3,
    CodeComplexityFactors.business_domain_expertise: 4,
    CodeComplexityFactors.technical_domain_expertise: 4,
    CodeComplexityFactors.application_domain_expertise: 3,
    CodeComplexityFactors.use_global_variables: 2,
    CodeComplexityFactors.non_standard_coding_conventions: 2,
    CodeComplexityFactors.excessive_mutable_state: 2,
    CodeComplexityFactors.magic_numbers: 1,
    CodeComplexityFactors.long_lists_of_positional_parameters: 2,
    CodeComplexityFactors.excessive_boilerplate_code: 1,
    CodeComplexityFactors.inconsistent_indentation_or_formatting: 1,
    CodeComplexityFactors.long_monolithic_blocks_of_code: 2,
    CodeComplexityFactors.non_descriptive_variable_function_names: 2,
    CodeComplexityFactors.overly_complex_function_signatures: 2,
    CodeComplexityFactors.deeply_nested_control_flow: 3,
    CodeComplexityFactors.complicated_control_flow_branching: 2,
    CodeComplexityFactors.deeply_nested_loops: 3,
    CodeComplexityFactors.complicated_loop_structure: 2,
    CodeComplexityFactors.hidden_side_effects: 4,
    CodeComplexityFactors.code_duplication: 2,
    CodeComplexityFactors.non_idiomatic_use_of_language_features: 3,
    CodeComplexityFactors.complex_math_concepts: 7,
    CodeComplexityFactors.functional_programming: 10,
    CodeComplexityFactors.complex_inheritance: 4,
    CodeComplexityFactors.complex_polymorphism: 4,
    CodeComplexityFactors.complex_data_structures: 6,
    CodeComplexityFactors.bitwise_operations: 4,
    CodeComplexityFactors.concurrency_mechanisms: 5,
    CodeComplexityFactors.complex_regular_expressions: 4,
    CodeComplexityFactors.reflection_and_metaprogramming: 4,
    CodeComplexityFactors.high_performance_computations: 5,
    CodeComplexityFactors.low_level_networking: 6,
    CodeComplexityFactors.use_of_category_theory: 10,
    CodeComplexityFactors.domain_specific_languages: 6,
}
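With the weight table in place, scoring reduces to a weighted sum over the factors the LLM flags. A self-contained sketch with a tiny stand-in weight table (the real mapping is the one above):

```python
# Stand-in for code_complexity_factors_weight, keyed by plain strings:
weights = {
    "functional_programming": 10,
    "magic_numbers": 1,
    "deeply_nested_loops": 3,
}

def weighted_factor_score(detected_factors, weights):
    """Sum the weights of every factor the LLM reported as present."""
    return sum(weights.get(factor, 0) for factor in detected_factors)

score = weighted_factor_score(["functional_programming", "magic_numbers"], weights)
```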

The prompt asked for JSON output in which one field contained an array of enum values. During experimentation, I started noticing an increase in poorly formatted JSON outputs from the LLM. It turned out that LangChain was including the full enum descriptions directly in the prompt and expecting the output to contain an array of strings exactly matching those descriptions. The responses were long and verbose, which led to inconsistent representations and parsing errors. To fix this, I revised the enums to use concise, clear values that reduced ambiguity and minimized the chance of misinterpretation. This helped with formatting, but didn’t fully solve the problem.

Despite these improvements, the approach still had reliability issues. The LLM would sometimes miss key factors or falsely detect ones that weren’t present. This was especially problematic for high-weight factors, since errors in those would heavily skew the final score.

⚖️ Attempt 5: Grouping Factors into Key Categories

After playing with the trade-offs of fine-grained metrics, I realized I needed a better balance:

  • Too many individual factors overwhelmed the LLM and made the scoring process fragile.
  • A single metric was too simplistic and lacked the nuance needed for meaningful evaluation.

But by spending time refining the individual factors, I started to see patterns: groups of related traits that could be consolidated into broader categories. This led me to define five key dimensions of complexity:

  1. Readability Issues – Problems related to naming, formatting, or clarity that reduce how easily code can be understood.

  2. Control Flow Complexity – Use of deeply nested logic, recursion, or heavy branching that increases cognitive load.

  3. Project-Specific Knowledge – Dependencies on internal business logic, frameworks, or custom libraries that are hard to understand without context.

  4. Domain-Specific Knowledge – Use of specialized concepts (e.g., from machine learning, graphics, physics, or signal processing) that require prior expertise.

  5. Advanced Coding Techniques – Patterns like metaprogramming, functional programming, or other techniques that are powerful but often used out of personal preference rather than necessity.

Grouping the factors this way allowed me to keep the prompt concise while still capturing the most important sources of complexity. It also made it easier to assign meaningful weights to each category based on how difficult they are to understand, refactor, or onboard new developers into.

from pydantic import BaseModel, Field

class FunctionComplexityEvaluation(BaseModel):
    function_name: str = Field(description="Name of the function")

    is_setup_or_declaration: bool = Field(
        description="The code is part of setup or declaration boilerplate, such as defining constants or configuring a framework."
    )

    start_line_number: int = Field(
        description="Number of the first line of the function, considering existing formatting"
    )

    end_line_number: int = Field(
        description="Number of the last line of the function, considering existing formatting"
    )

    readability_score: float = Field(
        description="Estimate how readable the code is based on factors like naming conventions, formatting, and non-runtime characteristics."
    )

    cognitive_complexity_score: float = Field(
        description="Estimate the cognitive complexity of control structures and expressions. Higher scores result from deeply nested control flow, complex expressions, and multiple branching levels."
    )

    project_specific_knowledge_score: float = Field(
        description="Estimate how much project-specific knowledge is required, such as the use of third-party libraries or specific business rules."
    )

    technical_domain_knowledge_score: float = Field(
        description="Estimate the level of deep technical domain knowledge required, such as advanced algorithms, parallel programming, signal processing, or low-level optimizations."
    )

    advanced_code_techniques_score: float = Field(
        description="Estimate the use of advanced coding techniques (like functional programming paradigms) that are not essential for solving the task but reflect the developer’s preference."
    )

By focusing on five broad categories, I could include clear definitions and concrete examples for each, which helped the model make more consistent and accurate evaluations. It also made the weighting process much simpler. Instead of juggling dozens of individual factors, I could assign meaningful weights to just five core categories, each representing a different dimension of complexity.

After some refinement, I realized that adding more detailed grading guidelines for each category directly into the prompt would further improve consistency. Clear score ranges gave the LLM a more structured way to estimate maintainability and helped align its output with my expectations.

from pydantic import BaseModel, Field

class FunctionComplexityEvaluation(BaseModel):
    function_name: str = Field(description="Name of the function")

    is_setup_or_declaration: bool = Field(
        description="The code is part of setup or declaration boilerplate, such as defining constants or configuring a framework."
    )

    start_line_number: int = Field(
        description="Number of the first line of the function, considering existing formatting"
    )

    end_line_number: int = Field(
        description="Number of the last line of the function, considering existing formatting"
    )

    readability_score: float = Field(
        description="Estimate how readable the code is based on factors like naming conventions, formatting, and non-runtime characteristics.\n\
        Score ranges:\n\
        0 - 0.3: The code follows standard naming conventions, is well-formatted, and lacks clutter (e.g., no magic numbers or excessive boilerplate).\n\
        0.3 - 0.7: Minor readability issues, inconsistent formatting, occasional use of non-descriptive names, or slight violations of coding standards.\n\
        0.7 - 1: Significant readability problems, non-standard conventions, poor naming, inconsistent formatting, or extensive use of boilerplate code."
    )

    cognitive_complexity_score: float = Field(
        description="Estimate the cognitive complexity of control structures and expressions.\n\
        Score ranges:\n\
        0 - 0.3: Simple control structures (minimal nesting, straightforward logic, few operators).\n\
        0.3 - 0.7: Moderate complexity, involving some nesting (2–3 levels), more complex boolean/arithmetic expressions, or multiple operators.\n\
        0.7 - 1: Highly complex control structures, deeply nested (3+ levels), intricate logic with many operators, or multiple conditional/loop combinations."
    )

    project_specific_knowledge_score: float = Field(
        description="Estimate how much project-specific knowledge is required, such as the use of third-party libraries or specific business rules.\n\
        Score ranges:\n\
        0 - 0.3: Little to no project-specific knowledge required, uses common third-party libraries or standard business rules.\n\
        0.3 - 0.7: Some project-specific knowledge is needed, involving custom libraries or moderately complex business rules.\n\
        0.7 - 1: Extensive project-specific knowledge required, highly customized third-party libraries or intricate, specific business logic."
    )

    technical_domain_knowledge_score: float = Field(
        description="Estimate the level of deep technical domain knowledge required, such as advanced algorithms, parallel programming, signal processing, or low-level optimizations.\n\
        Score ranges:\n\
        0 - 0.3: Minimal technical domain knowledge required, standard algorithms and techniques used.\n\
        0.3 - 0.7: Moderate technical domain knowledge, involving specialized algorithms, parallel programming, or some scientific/engineering calculations.\n\
        0.7 - 1: High level of technical domain knowledge required, including advanced algorithms, low-level optimizations, or complex scientific/mathematical concepts."
    )

    advanced_code_techniques_score: float = Field(
        description="Estimate the use of advanced coding techniques (like functional programming paradigms) that are not essential for solving the task but reflect the developer’s preference.\n\
        Score ranges:\n\
        0 - 0.3: No or minimal use of advanced techniques, the code is straightforward and easy to follow.\n\
        0.3 - 0.7: Some use of advanced techniques (e.g., functional programming, metaprogramming) that increase complexity but do not dominate the code.\n\
        0.7 - 1: Heavy use of advanced techniques that significantly add complexity without being essential for solving the problem (e.g., monads, currying, complex metaprogramming)."
    )

I assigned weights to each category based on how difficult it is to address that type of complexity in real-world scenarios:

  • Readability Issues (Weight: 1) – These are usually easy to fix. Renaming variables, cleaning up formatting, or adding comments, minimal effort or low risk.

  • Control Flow Complexity (Weight: 2) – Refactoring deeply nested logic or simplifying branching structures is harder and can introduce bugs if not done carefully.

  • Project-Specific Knowledge (Weight: 3) – This often requires onboarding or digging through internal documentation. It makes it harder to onboard new team members and for engineers to keep their knowledge up to date.

  • Domain-Specific Knowledge (Weight: 4) – Understanding concepts from fields like machine learning or graphics can take significant time and isn’t always easily accessible.

  • Advanced Coding Techniques (Weight: 5) – These often add unnecessary complexity, reflecting personal preference rather than project needs, and understanding them may require deep technical knowledge and hands-on experience.
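Combining the five category scores with these weights could look like the following sketch (the weighted sum is my reading of the approach; the tool’s exact normalization may differ):

```python
# Category weights as listed above:
CATEGORY_WEIGHTS = {
    "readability": 1,
    "control_flow": 2,
    "project_knowledge": 3,
    "domain_knowledge": 4,
    "advanced_techniques": 5,
}

def composite_score(scores):
    """Weighted sum of per-category scores (each in 0..1)."""
    return sum(scores[category] * weight
               for category, weight in CATEGORY_WEIGHTS.items())

# Hypothetical LLM output for one function:
scores = {
    "readability": 0.2,
    "control_flow": 0.5,
    "project_knowledge": 0.1,
    "domain_knowledge": 0.0,
    "advanced_techniques": 0.4,
}
value = composite_score(scores)
```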

📏 Function Length Component

In addition to the cognitive complexity estimated by the language model, I decided to also factor function length into the final score. Long functions force developers to hold more context in their heads, which becomes especially difficult when the logic itself is complex.

A short function that handles something complex can still be understandable. But even simple logic, when stretched over dozens of lines, becomes difficult to follow. That’s why keeping functions small is a well-known best practice, something I wanted the tool to encourage.

To capture this, I introduced a function size factor:

import math

desired_length = 10  # lines of code; the baseline function size
function_size_factor = math.sqrt(function_length / desired_length)

A 10-line function is treated as the baseline. Shorter functions are typically simpler and easier to reason about, while longer ones are penalized. The square root limits the factor’s growth, preventing excessively long functions from dominating the entire score.
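To see how the square-root cap behaves, here is the same formula wrapped in a helper with a couple of sample lengths:

```python
import math

def function_size_factor(length, desired_length=10):
    # The square root keeps the penalty growing sub-linearly with length.
    return math.sqrt(length / desired_length)

baseline = function_size_factor(10)   # the 10-line baseline
quadruple = function_size_factor(40)  # 4x longer, but only 2x the factor
```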

📈 Computing the Final Score

To keep the final score human-friendly, I wanted it in the range from 1 to 5, like a typical star rating. I applied a hyperbolic tangent (tanh) function to the adjusted composite score:

from math import tanh

MIN_VALUE = 1
MAX_VALUE = 5
VALUE_RANGE = MAX_VALUE - MIN_VALUE  # 4
final_score = MIN_VALUE + VALUE_RANGE * tanh(composite_score * function_size_factor)

The hyperbolic tangent function brings several useful properties to the scoring formula:

  • Near zero it grows almost linearly, capturing the initial steady increase in perceived complexity.
  • As the input gets larger, the output asymptotically approaches 1, ensuring the score stays bounded and doesn’t spike uncontrollably.

By applying this function, I ensured the final score stays within the desired range, while also modeling the non-linear way developers experience complexity.
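Both properties are easy to check numerically (a small sanity check, not part of the tool):

```python
from math import tanh

# Nearly linear around zero:
small = tanh(0.1)    # close to the input itself
# Saturates toward 1 for large inputs:
large = tanh(5.0)    # just under 1

# So the mapped score always lands within [1, 5):
final = 1 + 4 * tanh(2.0 * 1.2)  # e.g. composite 2.0, size factor 1.2
```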

🚢 Finalizing the Tool and Additional Features

As the tool evolved, I added a few more features to improve its performance and usability:

  • Progressive Evaluation – To save time and compute, the tool caches previous results and skips files that haven’t changed since the last run. This makes it much faster to use in CI pipelines or large projects.

  • Improvement Suggestions – When a file exceeds the target complexity score, the tool generates helpful, actionable feedback on what could be improved, highlighting specific areas that contribute most to the score.

  • Configuration Options – The tool’s behaviour can be customized through a config file or CLI flags, allowing teams to adapt it to their needs.
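The progressive-evaluation idea boils down to content hashing; a minimal sketch, not the tool’s actual cache format:

```python
import hashlib
import os
import tempfile

def file_digest(path):
    """SHA-256 of the file contents; any edit changes the digest."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_evaluation(path, cache):
    """True when the file is new or changed since the cached run."""
    return cache.get(path) != file_digest(path)

# Demo with a throwaway file:
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("def hello():\n    return 'world'\n")
    path = f.name

cache = {}
first = needs_evaluation(path, cache)    # not cached yet -> evaluate
cache[path] = file_digest(path)
second = needs_evaluation(path, cache)   # unchanged -> skip
os.unlink(path)
```

In practice the digests would be persisted between runs (e.g. as JSON) so CI jobs only pay for files that actually changed.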

🤩 Wrapping Up

I hope this tool will be useful to other engineers and companies looking to bring code complexity evaluation into their CI workflows. While it’s still early, I see this as one of the first practical steps toward automating parts of the code review process using LLMs.

With this post, I didn’t just want to showcase the tool; I wanted to share the journey of building it, from experimenting with prompts to balancing reliability, performance, and cost. This project taught me a lot about working with LLMs in real-world scenarios.

It also gave me a deeper understanding of what actually makes code complex, readable, or maintainable. Now I approach code quality with a more structured mindset, and I hope these insights help others do the same.


🧪 Try it out: codepass on PyPI
🚀 Code: GitHub repo
💬 Got feedback or ideas? Drop a comment below!