Introduction

Welcome to Project Stage 02! 😄

In our last stage, we went through the journey of making a basic GCC pass that goes through statements and functions to print out simple statistics. Building off of it, for this stage, we are creating a more advanced pass.

To avoid extra implications, I will just try to overwrite the pass I already made in stage 01. As always, I will be using aarch64 and x86 servers to test my logic.

❇️ Please find the source code here: GitHub ❇️


📌 Understanding the Requirements

According to my course instructions (Here), let's summarize what we have to do this time around:

  • Identify cloned functions (name pattern function.variant)
  • Compare cloned functions to check if they are substantially the same.
  • Output a string on whether to prune the functions

Let's incrementally implement the code.


📌 Cleaning Up Pass Structure

Basing of our code in Stage 01, we're gonna set up the structure so that we can start fresh. Nothing special here. 😄

#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "tree.h"
#include "gimple.h"
#include "tree-pass.h"
#include "ssa.h"
#include "tree-pretty-print.h"
#include "gimple-iterator.h"
#include "gimple-walk.h"
#include "internal-fn.h"
#include "gimple-pretty-print.h"

// Additional headers needed for clone analysis
#include "gimple-ssa.h"
#include "cgraph.h"
#include "attribs.h"
#include "pretty-print.h"
#include "tree-inline.h"
#include "intl.h"
#include 
#include 
#include 

namespace {

const pass_data pass_data_kzaw =
{
  GIMPLE_PASS, /* type */
  "kzaw", /* name */
  OPTGROUP_NONE, /* optinfo_flags */
  TV_TREE_NRV, /* tv_id */
  PROP_cfg , /* properties_required */
  0, /* properties_provided */
  0, /* properties_destroyed */
  0, /* todo_flags_start */
  0, /* todo_flags_finish */
};

class pass_kzaw : public gimple_opt_pass
{
public:
  pass_kzaw (gcc::context *ctxt)
    : gimple_opt_pass (pass_data_kzaw, ctxt)
  {}

  bool gate (function *) final override {
    return 1;
  }

  unsigned int execute (function *) final override;

private:
  // Helper methods for clone analysis will go here
};

// Pass execution code will go here

} // anonymous namespace

// Factory function that creates an instance of the pass
gimple_opt_pass *
make_pass_kzaw (gcc::context *ctxt)
{
  return new pass_kzaw (ctxt);
}

📌 Pseudocode for execute()

It's important to know what we are trying to do in execute() before implementing any function. Here is what I can come up with:

  • Get the function declaration (a function fun will be passed as argument)
  • Check if the function's name follows the pattern of a clone (i.e., contains a dot separating a base name and a variant).
  • If a clone is detected, add its info (statements, name, etc.) to a map of clones (mapped by base name)
  • If the clone group contains two or more variants, it compares the first two variants' statements
  • Log a prune decision

Since we are trying to store the function's information, we will need a data structure and the map group that I mentioned. You might have noticed from the above code that there is a private: block in our class. Let's add it there.

private:
  // Data structure to store function information
  struct function_info {
    tree decl;                 // Function declaration
    function *fun;             // Function pointer
    std::string base_name;     // Base function name (without variant)
    std::string variant;       // Variant part of the name
    std::vector<gimple *> stmts; // Statements in the function
  };

  std::map<std::string, std::vector<function_info>> clone_groups;

This pseudocode looks reasonable for now so let us continue!


📌 is_clone_function: Check for a Clone

Before we implement the code, we should add a helper method. Let's add our function signature in the private block.

Every signature of subsequent functions I create in this project will be added in this block. I will not state it explicitly from now on.

👉 Function Signature and Purpose

bool is_clone_function(tree decl, std::string &base_name, std::string &variant);

It takes a function declaration (decl) and two string references (base_name and variant).

The tree data structure is used to represent various programming constructs, such as expressions, statements, declarations, and types, in a uniform way.

Purpose
To check if a given function declaration is a clone function. A clone function is identified by having a dot (.) in its name that separates the base name from a variant suffix. The function returns true if the pattern is found and the function is not a resolver; otherwise, it returns false.

👉 Function Logic

We will get the name of the function and then look for a pattern. If it matches, return true, otherwise, false. Please note that it is important especially in working with GCC compilation process to print out any useful information to the dump file.

Source Code

bool
pass_kzaw::is_clone_function(tree decl, std::string &base_name, std::string &variant)
{
  // Get the function name
  const char *full_name = IDENTIFIER_POINTER(DECL_NAME(decl));
  std::string func_name(full_name);

  // Look for the '.variant' pattern in the function name
  size_t dot_pos = func_name.find('.');
  if (dot_pos != std::string::npos) {
    base_name = func_name.substr(0, dot_pos);
    variant = func_name.substr(dot_pos);

    // Skip resolver functions
    if (variant == ".resolver") {
      return false;
    }

    if (dump_file) {
      fprintf(dump_file, "Found potential clone: %s (base: %s, variant: %s)\n", 
              full_name, base_name.c_str(), variant.c_str());
    }
    return true;
  }

  return false;
}

📌 collect_function_statements: Collect Function Statements

We will need a function to collect all the statements from a function.

👉 Function Signature and Purpose

void collect_function_statements(function *fun, std::vector<gimple *> &stmts);

It takes a reference to a vector of gimple * (statements). Its goal is to collect all GIMPLE statements from the function into the provided vector.

Purpose
To go through all basic blocks in the given function and collect each GIMPLE statement from those blocks into the stmts vector.

👉 Function Logic

We did something like this in project stage 01.

Source Code

void
pass_kzaw::collect_function_statements(function *fun, std::vector<gimple *> &stmts)
{
  basic_block bb;

  // Clear the vector in case it's being reused
  stmts.clear();

  // Collect all statements in the function
  FOR_EACH_BB_FN(bb, fun) {
    for (gimple_stmt_iterator gsi = gsi_start_bb(bb); !gsi_end_p(gsi); gsi_next(&gsi)) {
      gimple *stmt = gsi_stmt(gsi);
      stmts.push_back(stmt);
    }
  }

  if (dump_file) {
    fprintf(dump_file, "Collected %zu statements from function %s\n", 
            stmts.size(), function_name(fun));
  }
}

📌 compare_functions: Compare Functions

👉 Function Signature and Purpose

bool pass_kzaw::compare_functions(const std::vector<gimple *> &func1_stmts, 
    const std::vector<gimple *> &func2_stmts);

This function takes two vectors of GIMPLE statements and compares them structurally.

Purpose
To determine if the two functions are equal or not, ignoring minor differences like variable renaming. This function detects the clones by comparing their statements after grouping them by a common base name.

👉 Function Logic

  • Statement Count Check

We first check if the number of statements in both functions is the same. If they are different, it could count as being different.

if (func1_stmts.size() != func2_stmts.size()) {
  if (dump_file) {
    fprintf(dump_file, "Functions have different statement counts: %zu vs %zu\n", 
            func1_stmts.size(), func2_stmts.size());
  }
  return false;
}
  • Iterating Through Statements

Next, we will loop over the statements from both functions at the same time. For each pair at the same index, the first check is to compare their GIMPLE codes. (which denote the type of statement).

for (size_t i = 0; i < func1_stmts.size(); i++) {
  gimple *stmt1 = func1_stmts[i];
  gimple *stmt2 = func2_stmts[i];

  // Check if statement codes are different
  if (gimple_code(stmt1) != gimple_code(stmt2)) {
    if (dump_file) {
      fprintf(dump_file, "Statement %zu: Different gimple codes\n", i);
      print_gimple_stmt(dump_file, stmt1, 0, TDF_SLIM);
      print_gimple_stmt(dump_file, stmt2, 0, TDF_SLIM);
    }
    return false;
  }
  • Detailed Comparison based on Statement Type

A switch statement is used to perform type-specific comparisons for each GIMPLE statement.

⚠️ Now there are a LOT of GIMPLE statements and this code would be way complex and out of scope if we were to implement each of them. So I decided to only implement for 4: GIMPLE_ASSIGN, GIMPLE_CALL, GIMPLE_COND and GIMPLE_RETURN since these are the basic code logics.

Please also note that help was retrieved from community members at my Stack Overflow and Discord. Macros were also referenced from GCC Documentation. At this point of the project, I'm not really confident in the logic but I really appreciate everyone who helped me. 🙏.

// Compare based on statement type
  switch (gimple_code(stmt1)) {

➡️ GIMPLE_ASSIGN (Assignment Statements)

For assignment statements, the function compares:

  • The right-hand side operation code.
  • The number of operands.
  • The operand types and, for constants, their exact values.
case GIMPLE_ASSIGN:
  {
    // Check operation code
    if (gimple_assign_rhs_code(stmt1) != gimple_assign_rhs_code(stmt2)) {
      if (dump_file) {
        fprintf(dump_file, "Assignment operation mismatch at statement %zu\n", i);
      }
      return false;
    }

    // Check operand count
    unsigned op_count1 = gimple_num_ops(stmt1);
    unsigned op_count2 = gimple_num_ops(stmt2);
    if (op_count1 != op_count2) {
      if (dump_file) {
        fprintf(dump_file, "Different number of operands at statement %zu\n", i);
      }
      return false;
    }

    // For constants and literals, values must match exactly
    for (unsigned j = 1; j < op_count1; j++) {
      tree op1 = gimple_op(stmt1, j);
      tree op2 = gimple_op(stmt2, j);

      if (TREE_CODE(op1) != TREE_CODE(op2)) {
        if (dump_file) {
          fprintf(dump_file, "Different operand types at statement %zu\n", i);
        }
        return false;
      }

      // For constants, compare values
      if (CONSTANT_CLASS_P(op1) && CONSTANT_CLASS_P(op2)) {
        if (!vrp_operand_equal_p(op1, op2)) {
          if (dump_file) {
            fprintf(dump_file, "Different constant values at statement %zu\n", i);
          }
          return false;
        }
      }
    }
  }
  break;

➡️ GIMPLE_CALL (Function Calls)

For function calls, the function checks:

  • That the type of function call (tree code) is the same.
  • If the call is direct (address expression), it compares the function declarations.
  • The number of arguments in the call.
case GIMPLE_CALL:
  {
    // Check if calling the same function
    tree func1 = gimple_call_fn(as_a<gcall *>(stmt1));
    tree func2 = gimple_call_fn(as_a<gcall *>(stmt2));

    if (TREE_CODE(func1) != TREE_CODE(func2)) {
      if (dump_file) {
        fprintf(dump_file, "Different function call types at statement %zu\n", i);
      }
      return false;
    }

    // If it's a direct function call, compare the function declarations
    if (TREE_CODE(func1) == ADDR_EXPR) {
      tree fn_decl1 = TREE_OPERAND(func1, 0);
      tree fn_decl2 = TREE_OPERAND(func2, 0);

      if (DECL_NAME(fn_decl1) != DECL_NAME(fn_decl2)) {
        if (dump_file) {
          fprintf(dump_file, "Calling different functions at statement %zu\n", i);
        }
        return false;
      }
    }

    // Check argument count
    unsigned arg_count1 = gimple_call_num_args(as_a<gcall *>(stmt1));
    unsigned arg_count2 = gimple_call_num_args(as_a<gcall *>(stmt2));
    if (arg_count1 != arg_count2) {
      if (dump_file) {
        fprintf(dump_file, "Different number of arguments in call at statement %zu\n", i);
      }
      return false;
    }
  }
  break;

➡️ GIMPLE_COND (Conditional Statements)

For conditional statements, the function compares the conditional codes of both statements.

case GIMPLE_COND:
  {
    // Check condition code
    enum tree_code code1 = gimple_cond_code(as_a<gcond *>(stmt1));
    enum tree_code code2 = gimple_cond_code(as_a<gcond *>(stmt2));
    if (code1 != code2) {
      if (dump_file) {
        fprintf(dump_file, "Different conditional codes at statement %zu\n", i);
      }
      return false;
    }
  }
  break;

➡️ GIMPLE_COND (Conditional Statements)

For return statements, the function checks whether one statement returns a value while the other does not.

case GIMPLE_RETURN:
  {
    // Check if one returns a value and the other doesn't
    tree ret_val1 = gimple_return_retval(as_a<greturn *>(stmt1));
    tree ret_val2 = gimple_return_retval(as_a<greturn *>(stmt2));

    if ((ret_val1 == NULL_TREE) != (ret_val2 == NULL_TREE)) {
      if (dump_file) {
        fprintf(dump_file, "One function returns a value, the other doesn't at statement %zu\n", i);
      }
      return false;
    }
  }
  break;

For any statement types not explicitly handled, the function assumes that if the GIMPLE code is the same, the statements are considered equal.

  • Successful Comparison

If all pairs of statements pass the comparison checks, the function logs that the functions are substantially the same and returns true.

// If we've made it this far, the functions are considered substantially the same
if (dump_file) {
  fprintf(dump_file, "Functions are substantially the same\n");
}
return true;

Source Code

bool
pass_kzaw::compare_functions(const std::vector<gimple *> &func1_stmts, 
                             const std::vector<gimple *> &func2_stmts)
{
  // First check: different statement count means different functions
  if (func1_stmts.size() != func2_stmts.size()) {
    if (dump_file) {
      fprintf(dump_file, "Functions have different statement counts: %zu vs %zu\n", 
              func1_stmts.size(), func2_stmts.size());
    }
    return false;
  }

  // Iterate through statements and compare them
  for (size_t i = 0; i < func1_stmts.size(); i++) {
    gimple *stmt1 = func1_stmts[i];
    gimple *stmt2 = func2_stmts[i];

    // Check if statement codes are different
    if (gimple_code(stmt1) != gimple_code(stmt2)) {
      if (dump_file) {
        fprintf(dump_file, "Statement %zu: Different gimple codes\n", i);
        print_gimple_stmt(dump_file, stmt1, 0, TDF_SLIM);
        print_gimple_stmt(dump_file, stmt2, 0, TDF_SLIM);
      }
      return false;
    }

    // Compare based on statement type
    switch (gimple_code(stmt1)) {
    case GIMPLE_ASSIGN:
      {
        // Check operation code
        if (gimple_assign_rhs_code(stmt1) != gimple_assign_rhs_code(stmt2)) {
          if (dump_file) {
            fprintf(dump_file, "Assignment operation mismatch at statement %zu\n", i);
          }
          return false;
        }

        // Check operand count
        unsigned op_count1 = gimple_num_ops(stmt1);
        unsigned op_count2 = gimple_num_ops(stmt2);
        if (op_count1 != op_count2) {
          if (dump_file) {
            fprintf(dump_file, "Different number of operands at statement %zu\n", i);
          }
          return false;
        }

        // For constants and literals, values must match exactly
        for (unsigned j = 1; j < op_count1; j++) {
          tree op1 = gimple_op(stmt1, j);
          tree op2 = gimple_op(stmt2, j);

          if (TREE_CODE(op1) != TREE_CODE(op2)) {
            if (dump_file) {
              fprintf(dump_file, "Different operand types at statement %zu\n", i);
            }
            return false;
          }

          // For constants, compare values
          if (CONSTANT_CLASS_P(op1) && CONSTANT_CLASS_P(op2)) {
            if (!vrp_operand_equal_p(op1, op2)) {
              if (dump_file) {
                fprintf(dump_file, "Different constant values at statement %zu\n", i);
              }
              return false;
            }
          }
        }
      }
      break;

    case GIMPLE_CALL:
      {
        // Check if calling the same function
        tree func1 = gimple_call_fn(as_a<gcall *>(stmt1));
        tree func2 = gimple_call_fn(as_a<gcall *>(stmt2));

        if (TREE_CODE(func1) != TREE_CODE(func2)) {
          if (dump_file) {
            fprintf(dump_file, "Different function call types at statement %zu\n", i);
          }
          return false;
        }

        // If it's a direct function call, compare the function declarations
        if (TREE_CODE(func1) == ADDR_EXPR) {
          tree fn_decl1 = TREE_OPERAND(func1, 0);
          tree fn_decl2 = TREE_OPERAND(func2, 0);

          if (DECL_NAME(fn_decl1) != DECL_NAME(fn_decl2)) {
            if (dump_file) {
              fprintf(dump_file, "Calling different functions at statement %zu\n", i);
            }
            return false;
          }
        }

        // Check argument count
        unsigned arg_count1 = gimple_call_num_args(as_a<gcall *>(stmt1));
        unsigned arg_count2 = gimple_call_num_args(as_a<gcall *>(stmt2));
        if (arg_count1 != arg_count2) {
          if (dump_file) {
            fprintf(dump_file, "Different number of arguments in call at statement %zu\n", i);
          }
          return false;
        }
      }
      break;

    case GIMPLE_COND:
      {
        // Check condition code
        enum tree_code code1 = gimple_cond_code(as_a<gcond *>(stmt1));
        enum tree_code code2 = gimple_cond_code(as_a<gcond *>(stmt2));
        if (code1 != code2) {
          if (dump_file) {
            fprintf(dump_file, "Different conditional codes at statement %zu\n", i);
          }
          return false;
        }
      }
      break;

    case GIMPLE_RETURN:
      {
        // Check if one returns a value and the other doesn't
        tree ret_val1 = gimple_return_retval(as_a<greturn *>(stmt1));
        tree ret_val2 = gimple_return_retval(as_a<greturn *>(stmt2));

        if ((ret_val1 == NULL_TREE) != (ret_val2 == NULL_TREE)) {
          if (dump_file) {
            fprintf(dump_file, "One function returns a value, the other doesn't at statement %zu\n", i);
          }
          return false;
        }
      }
      break;

    default:
      // For other statement types, we consider them the same if the code matches
      break;
    }
  }

  // If we've made it this far, the functions are considered substantially the same
  if (dump_file) {
    fprintf(dump_file, "Functions are substantially the same\n");
  }
  return true;
}

📌 print_prune_decision: Print Decision

Finally, this is the function that prints whether a function should be pruned or not.

👉 Function Signature and Purpose

void print_prune_decision(const std::string &base_name, bool should_prune);

Purpose
Logs a decision based on whether a clone function group (identified by its base name) should be pruned or not.

👉 Function Logic

Source Code

void
pass_kzaw::print_prune_decision(const std::string &base_name, bool should_prune)
{
  if (dump_file) {
    if (should_prune) {
      fprintf(dump_file, "PRUNE: %s\n", base_name.c_str());
    } else {
      fprintf(dump_file, "NOPRUNE: %s\n", base_name.c_str());
    }
  }
}

📌 execute()

Now that all of our required functions have been implemented, it's time to tie everything together by implementing the main execute() function.

👉 Function Logic

Source Code

unsigned int
pass_kzaw::execute(function *fun)
{
  // Get the function declaration
  tree fndecl = current_function_decl;

  // Skip external functions or those with no body
  if (!fndecl || DECL_EXTERNAL(fndecl) || !DECL_STRUCT_FUNCTION(fndecl)) {
    return 0;
  }

  // Check if this is a clone function
  std::string base_name, variant;
  bool is_clone = is_clone_function(fndecl, base_name, variant);

  if (is_clone) {
    // Collect statements in this function
    function_info info;
    info.decl = fndecl;
    info.fun = fun;
    info.base_name = base_name;
    info.variant = variant;
    collect_function_statements(fun, info.stmts);

    // Add to the appropriate clone group
    clone_groups[base_name].push_back(info);

    // Check if we've seen enough clones of this function to make a decision
    if (clone_groups[base_name].size() >= 2) {
      if (dump_file) {
        fprintf(dump_file, "Analyzing clones of function: %s\n", base_name.c_str());
      }

      // Compare the first two variants
      const function_info &info1 = clone_groups[base_name][0];
      const function_info &info2 = clone_groups[base_name][1];

      bool are_same = compare_functions(info1.stmts, info2.stmts);

      // Print the pruning decision
      print_prune_decision(base_name, are_same);

      // Clear the clone group after making the decision
      // This ensures we only emit one decision per base function
      clone_groups.erase(base_name);
    }
  }

  return 0;
}

Final Thoughts

I would like to end this part of the project stage 02 here. We have implemented the logic for the pass. I won't go into details of compiling and building gcc again as they have been mentioned in Lab 4 and Project Stage 01.

The next step will be to test out this logic in our x86 and aarch64 servers. Let us go to the next part of this Project Stage 02.

Thank you for coding with me! 😄