Introduction
Welcome to Project Stage 02! 😄
In our last stage, we went through the journey of making a basic GCC pass that goes through statements and functions to print out simple statistics. Building off of it, for this stage, we are creating a more advanced pass.
To avoid extra implications, I will just try to overwrite the pass I already made in stage 01. As always, I will be using aarch64
and x86
servers to test my logic.
❇️ Please find the source code here: GitHub ❇️
📌 Understanding the Requirements
According to my course instructions (Here), let's summarize what we have to do this time around:
- Identify cloned functions (name pattern
function.variant
) - Compare cloned functions to check if they are substantially the same.
- Output a string on whether to prune the functions
Let's incrementally implement the code.
📌 Cleaning Up Pass Structure
Basing of our code in Stage 01, we're gonna set up the structure so that we can start fresh. Nothing special here. 😄
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "tree.h"
#include "gimple.h"
#include "tree-pass.h"
#include "ssa.h"
#include "tree-pretty-print.h"
#include "gimple-iterator.h"
#include "gimple-walk.h"
#include "internal-fn.h"
#include "gimple-pretty-print.h"
// Additional headers needed for clone analysis
#include "gimple-ssa.h"
#include "cgraph.h"
#include "attribs.h"
#include "pretty-print.h"
#include "tree-inline.h"
#include "intl.h"
#include
#include
#include
namespace {
const pass_data pass_data_kzaw =
{
GIMPLE_PASS, /* type */
"kzaw", /* name */
OPTGROUP_NONE, /* optinfo_flags */
TV_TREE_NRV, /* tv_id */
PROP_cfg , /* properties_required */
0, /* properties_provided */
0, /* properties_destroyed */
0, /* todo_flags_start */
0, /* todo_flags_finish */
};
class pass_kzaw : public gimple_opt_pass
{
public:
pass_kzaw (gcc::context *ctxt)
: gimple_opt_pass (pass_data_kzaw, ctxt)
{}
bool gate (function *) final override {
return 1;
}
unsigned int execute (function *) final override;
private:
// Helper methods for clone analysis will go here
};
// Pass execution code will go here
} // anonymous namespace
// Factory function that creates an instance of the pass
gimple_opt_pass *
make_pass_kzaw (gcc::context *ctxt)
{
return new pass_kzaw (ctxt);
}
📌 Pseudocode for execute()
It's important to know what we are trying to do in execute()
before implementing any function. Here is what I can come up with:
- Get the function declaration (a function
fun
will be passed as argument) - Check if the function's name follows the pattern of a clone (i.e., contains a dot separating a base name and a variant).
- If a clone is detected, add its info (statements, name, etc.) to a map of clones (mapped by base name)
- If the clone group contains two or more variants, it compares the first two variants' statements
- Log a prune decision
Since we are trying to store the function's information, we will need a data structure and the map group that I mentioned. You might have noticed from the above code that there is a private:
block in our class. Let's add it there.
private:
// Data structure to store function information
struct function_info {
tree decl; // Function declaration
function *fun; // Function pointer
std::string base_name; // Base function name (without variant)
std::string variant; // Variant part of the name
std::vector<gimple *> stmts; // Statements in the function
};
std::map<std::string, std::vector<function_info>> clone_groups;
This pseudocode looks reasonable for now so let us continue!
📌 is_clone_function
: Check for a Clone
Before we implement the code, we should add a helper method. Let's add our function signature in the private
block.
Every signature of subsequent functions I create in this project will be added in this block. I will not state it explicitly from now on.
👉 Function Signature and Purpose
bool is_clone_function(tree decl, std::string &base_name, std::string &variant);
It takes a function declaration (decl
) and two string references (base_name
and variant
).
The
tree
data structure is used to represent various programming constructs, such as expressions, statements, declarations, and types, in a uniform way.
Purpose
To check if a given function declaration is a clone function. A clone function is identified by having a dot (.) in its name that separates the base name from a variant suffix. The function returns true
if the pattern is found and the function is not a resolver; otherwise, it returns false
.
👉 Function Logic
We will get the name of the function and then look for a pattern. If it matches, return true
, otherwise, false
. Please note that it is important especially in working with GCC compilation process to print out any useful information to the dump file.
Source Code
bool
pass_kzaw::is_clone_function(tree decl, std::string &base_name, std::string &variant)
{
// Get the function name
const char *full_name = IDENTIFIER_POINTER(DECL_NAME(decl));
std::string func_name(full_name);
// Look for the '.variant' pattern in the function name
size_t dot_pos = func_name.find('.');
if (dot_pos != std::string::npos) {
base_name = func_name.substr(0, dot_pos);
variant = func_name.substr(dot_pos);
// Skip resolver functions
if (variant == ".resolver") {
return false;
}
if (dump_file) {
fprintf(dump_file, "Found potential clone: %s (base: %s, variant: %s)\n",
full_name, base_name.c_str(), variant.c_str());
}
return true;
}
return false;
}
📌 collect_function_statements
: Collect Function Statements
We will need a function to collect all the statements from a function.
👉 Function Signature and Purpose
void collect_function_statements(function *fun, std::vector<gimple *> &stmts);
It takes a reference to a vector of gimple * (statements). Its goal is to collect all GIMPLE statements from the function into the provided vector.
Purpose
To go through all basic blocks in the given function and collect each GIMPLE statement from those blocks into the stmts
vector.
👉 Function Logic
We did something like this in project stage 01.
Source Code
void
pass_kzaw::collect_function_statements(function *fun, std::vector<gimple *> &stmts)
{
basic_block bb;
// Clear the vector in case it's being reused
stmts.clear();
// Collect all statements in the function
FOR_EACH_BB_FN(bb, fun) {
for (gimple_stmt_iterator gsi = gsi_start_bb(bb); !gsi_end_p(gsi); gsi_next(&gsi)) {
gimple *stmt = gsi_stmt(gsi);
stmts.push_back(stmt);
}
}
if (dump_file) {
fprintf(dump_file, "Collected %zu statements from function %s\n",
stmts.size(), function_name(fun));
}
}
📌 compare_functions
: Compare Functions
👉 Function Signature and Purpose
bool pass_kzaw::compare_functions(const std::vector<gimple *> &func1_stmts,
const std::vector<gimple *> &func2_stmts);
This function takes two vectors of GIMPLE statements and compares them structurally.
Purpose
To determine if the two functions are equal or not, ignoring minor differences like variable renaming. This function detects the clones by comparing their statements after grouping them by a common base name.
👉 Function Logic
- Statement Count Check
We first check if the number of statements in both functions is the same. If they are different, it could count as being different.
if (func1_stmts.size() != func2_stmts.size()) {
if (dump_file) {
fprintf(dump_file, "Functions have different statement counts: %zu vs %zu\n",
func1_stmts.size(), func2_stmts.size());
}
return false;
}
- Iterating Through Statements
Next, we will loop over the statements from both functions at the same time. For each pair at the same index, the first check is to compare their GIMPLE codes. (which denote the type of statement).
for (size_t i = 0; i < func1_stmts.size(); i++) {
gimple *stmt1 = func1_stmts[i];
gimple *stmt2 = func2_stmts[i];
// Check if statement codes are different
if (gimple_code(stmt1) != gimple_code(stmt2)) {
if (dump_file) {
fprintf(dump_file, "Statement %zu: Different gimple codes\n", i);
print_gimple_stmt(dump_file, stmt1, 0, TDF_SLIM);
print_gimple_stmt(dump_file, stmt2, 0, TDF_SLIM);
}
return false;
}
- Detailed Comparison based on Statement Type
A switch
statement is used to perform type-specific comparisons for each GIMPLE statement.
⚠️ Now there are a LOT of GIMPLE statements and this code would be way complex and out of scope if we were to implement each of them. So I decided to only implement for 4: GIMPLE_ASSIGN, GIMPLE_CALL, GIMPLE_COND and GIMPLE_RETURN since these are the basic code logics.
Please also note that help was retrieved from community members at my Stack Overflow and Discord. Macros were also referenced from GCC Documentation. At this point of the project, I'm not really confident in the logic but I really appreciate everyone who helped me. 🙏.
// Compare based on statement type
switch (gimple_code(stmt1)) {
➡️ GIMPLE_ASSIGN (Assignment Statements)
For assignment statements, the function compares:
- The right-hand side operation code.
- The number of operands.
- The operand types and, for constants, their exact values.
case GIMPLE_ASSIGN:
{
// Check operation code
if (gimple_assign_rhs_code(stmt1) != gimple_assign_rhs_code(stmt2)) {
if (dump_file) {
fprintf(dump_file, "Assignment operation mismatch at statement %zu\n", i);
}
return false;
}
// Check operand count
unsigned op_count1 = gimple_num_ops(stmt1);
unsigned op_count2 = gimple_num_ops(stmt2);
if (op_count1 != op_count2) {
if (dump_file) {
fprintf(dump_file, "Different number of operands at statement %zu\n", i);
}
return false;
}
// For constants and literals, values must match exactly
for (unsigned j = 1; j < op_count1; j++) {
tree op1 = gimple_op(stmt1, j);
tree op2 = gimple_op(stmt2, j);
if (TREE_CODE(op1) != TREE_CODE(op2)) {
if (dump_file) {
fprintf(dump_file, "Different operand types at statement %zu\n", i);
}
return false;
}
// For constants, compare values
if (CONSTANT_CLASS_P(op1) && CONSTANT_CLASS_P(op2)) {
if (!vrp_operand_equal_p(op1, op2)) {
if (dump_file) {
fprintf(dump_file, "Different constant values at statement %zu\n", i);
}
return false;
}
}
}
}
break;
➡️ GIMPLE_CALL (Function Calls)
For function calls, the function checks:
- That the type of function call (tree code) is the same.
- If the call is direct (address expression), it compares the function declarations.
- The number of arguments in the call.
case GIMPLE_CALL:
{
// Check if calling the same function
tree func1 = gimple_call_fn(as_a<gcall *>(stmt1));
tree func2 = gimple_call_fn(as_a<gcall *>(stmt2));
if (TREE_CODE(func1) != TREE_CODE(func2)) {
if (dump_file) {
fprintf(dump_file, "Different function call types at statement %zu\n", i);
}
return false;
}
// If it's a direct function call, compare the function declarations
if (TREE_CODE(func1) == ADDR_EXPR) {
tree fn_decl1 = TREE_OPERAND(func1, 0);
tree fn_decl2 = TREE_OPERAND(func2, 0);
if (DECL_NAME(fn_decl1) != DECL_NAME(fn_decl2)) {
if (dump_file) {
fprintf(dump_file, "Calling different functions at statement %zu\n", i);
}
return false;
}
}
// Check argument count
unsigned arg_count1 = gimple_call_num_args(as_a<gcall *>(stmt1));
unsigned arg_count2 = gimple_call_num_args(as_a<gcall *>(stmt2));
if (arg_count1 != arg_count2) {
if (dump_file) {
fprintf(dump_file, "Different number of arguments in call at statement %zu\n", i);
}
return false;
}
}
break;
➡️ GIMPLE_COND (Conditional Statements)
For conditional statements, the function compares the conditional codes of both statements.
case GIMPLE_COND:
{
// Check condition code
enum tree_code code1 = gimple_cond_code(as_a<gcond *>(stmt1));
enum tree_code code2 = gimple_cond_code(as_a<gcond *>(stmt2));
if (code1 != code2) {
if (dump_file) {
fprintf(dump_file, "Different conditional codes at statement %zu\n", i);
}
return false;
}
}
break;
➡️ GIMPLE_COND (Conditional Statements)
For return statements, the function checks whether one statement returns a value while the other does not.
case GIMPLE_RETURN:
{
// Check if one returns a value and the other doesn't
tree ret_val1 = gimple_return_retval(as_a<greturn *>(stmt1));
tree ret_val2 = gimple_return_retval(as_a<greturn *>(stmt2));
if ((ret_val1 == NULL_TREE) != (ret_val2 == NULL_TREE)) {
if (dump_file) {
fprintf(dump_file, "One function returns a value, the other doesn't at statement %zu\n", i);
}
return false;
}
}
break;
For any statement types not explicitly handled, the function assumes that if the GIMPLE code is the same, the statements are considered equal.
- Successful Comparison
If all pairs of statements pass the comparison checks, the function logs that the functions are substantially the same and returns true
.
// If we've made it this far, the functions are considered substantially the same
if (dump_file) {
fprintf(dump_file, "Functions are substantially the same\n");
}
return true;
Source Code
bool
pass_kzaw::compare_functions(const std::vector<gimple *> &func1_stmts,
const std::vector<gimple *> &func2_stmts)
{
// First check: different statement count means different functions
if (func1_stmts.size() != func2_stmts.size()) {
if (dump_file) {
fprintf(dump_file, "Functions have different statement counts: %zu vs %zu\n",
func1_stmts.size(), func2_stmts.size());
}
return false;
}
// Iterate through statements and compare them
for (size_t i = 0; i < func1_stmts.size(); i++) {
gimple *stmt1 = func1_stmts[i];
gimple *stmt2 = func2_stmts[i];
// Check if statement codes are different
if (gimple_code(stmt1) != gimple_code(stmt2)) {
if (dump_file) {
fprintf(dump_file, "Statement %zu: Different gimple codes\n", i);
print_gimple_stmt(dump_file, stmt1, 0, TDF_SLIM);
print_gimple_stmt(dump_file, stmt2, 0, TDF_SLIM);
}
return false;
}
// Compare based on statement type
switch (gimple_code(stmt1)) {
case GIMPLE_ASSIGN:
{
// Check operation code
if (gimple_assign_rhs_code(stmt1) != gimple_assign_rhs_code(stmt2)) {
if (dump_file) {
fprintf(dump_file, "Assignment operation mismatch at statement %zu\n", i);
}
return false;
}
// Check operand count
unsigned op_count1 = gimple_num_ops(stmt1);
unsigned op_count2 = gimple_num_ops(stmt2);
if (op_count1 != op_count2) {
if (dump_file) {
fprintf(dump_file, "Different number of operands at statement %zu\n", i);
}
return false;
}
// For constants and literals, values must match exactly
for (unsigned j = 1; j < op_count1; j++) {
tree op1 = gimple_op(stmt1, j);
tree op2 = gimple_op(stmt2, j);
if (TREE_CODE(op1) != TREE_CODE(op2)) {
if (dump_file) {
fprintf(dump_file, "Different operand types at statement %zu\n", i);
}
return false;
}
// For constants, compare values
if (CONSTANT_CLASS_P(op1) && CONSTANT_CLASS_P(op2)) {
if (!vrp_operand_equal_p(op1, op2)) {
if (dump_file) {
fprintf(dump_file, "Different constant values at statement %zu\n", i);
}
return false;
}
}
}
}
break;
case GIMPLE_CALL:
{
// Check if calling the same function
tree func1 = gimple_call_fn(as_a<gcall *>(stmt1));
tree func2 = gimple_call_fn(as_a<gcall *>(stmt2));
if (TREE_CODE(func1) != TREE_CODE(func2)) {
if (dump_file) {
fprintf(dump_file, "Different function call types at statement %zu\n", i);
}
return false;
}
// If it's a direct function call, compare the function declarations
if (TREE_CODE(func1) == ADDR_EXPR) {
tree fn_decl1 = TREE_OPERAND(func1, 0);
tree fn_decl2 = TREE_OPERAND(func2, 0);
if (DECL_NAME(fn_decl1) != DECL_NAME(fn_decl2)) {
if (dump_file) {
fprintf(dump_file, "Calling different functions at statement %zu\n", i);
}
return false;
}
}
// Check argument count
unsigned arg_count1 = gimple_call_num_args(as_a<gcall *>(stmt1));
unsigned arg_count2 = gimple_call_num_args(as_a<gcall *>(stmt2));
if (arg_count1 != arg_count2) {
if (dump_file) {
fprintf(dump_file, "Different number of arguments in call at statement %zu\n", i);
}
return false;
}
}
break;
case GIMPLE_COND:
{
// Check condition code
enum tree_code code1 = gimple_cond_code(as_a<gcond *>(stmt1));
enum tree_code code2 = gimple_cond_code(as_a<gcond *>(stmt2));
if (code1 != code2) {
if (dump_file) {
fprintf(dump_file, "Different conditional codes at statement %zu\n", i);
}
return false;
}
}
break;
case GIMPLE_RETURN:
{
// Check if one returns a value and the other doesn't
tree ret_val1 = gimple_return_retval(as_a<greturn *>(stmt1));
tree ret_val2 = gimple_return_retval(as_a<greturn *>(stmt2));
if ((ret_val1 == NULL_TREE) != (ret_val2 == NULL_TREE)) {
if (dump_file) {
fprintf(dump_file, "One function returns a value, the other doesn't at statement %zu\n", i);
}
return false;
}
}
break;
default:
// For other statement types, we consider them the same if the code matches
break;
}
}
// If we've made it this far, the functions are considered substantially the same
if (dump_file) {
fprintf(dump_file, "Functions are substantially the same\n");
}
return true;
}
📌 print_prune_decision
: Print Decision
Finally, this is the function that prints whether a function should be pruned or not.
👉 Function Signature and Purpose
void print_prune_decision(const std::string &base_name, bool should_prune);
Purpose
Logs a decision based on whether a clone function group (identified by its base name) should be pruned or not.
👉 Function Logic
Source Code
void
pass_kzaw::print_prune_decision(const std::string &base_name, bool should_prune)
{
if (dump_file) {
if (should_prune) {
fprintf(dump_file, "PRUNE: %s\n", base_name.c_str());
} else {
fprintf(dump_file, "NOPRUNE: %s\n", base_name.c_str());
}
}
}
📌 execute()
Now that all of our required functions have been implemented, it's time to tie everything together by implementing the main execute()
function.
👉 Function Logic
Source Code
unsigned int
pass_kzaw::execute(function *fun)
{
// Get the function declaration
tree fndecl = current_function_decl;
// Skip external functions or those with no body
if (!fndecl || DECL_EXTERNAL(fndecl) || !DECL_STRUCT_FUNCTION(fndecl)) {
return 0;
}
// Check if this is a clone function
std::string base_name, variant;
bool is_clone = is_clone_function(fndecl, base_name, variant);
if (is_clone) {
// Collect statements in this function
function_info info;
info.decl = fndecl;
info.fun = fun;
info.base_name = base_name;
info.variant = variant;
collect_function_statements(fun, info.stmts);
// Add to the appropriate clone group
clone_groups[base_name].push_back(info);
// Check if we've seen enough clones of this function to make a decision
if (clone_groups[base_name].size() >= 2) {
if (dump_file) {
fprintf(dump_file, "Analyzing clones of function: %s\n", base_name.c_str());
}
// Compare the first two variants
const function_info &info1 = clone_groups[base_name][0];
const function_info &info2 = clone_groups[base_name][1];
bool are_same = compare_functions(info1.stmts, info2.stmts);
// Print the pruning decision
print_prune_decision(base_name, are_same);
// Clear the clone group after making the decision
// This ensures we only emit one decision per base function
clone_groups.erase(base_name);
}
}
return 0;
}
Final Thoughts
I would like to end this part of the project stage 02 here. We have implemented the logic for the pass. I won't go into details of compiling and building gcc
again as they have been mentioned in Lab 4 and Project Stage 01.
The next step will be to test out this logic in our x86
and aarch64
servers. Let us go to the next part of this Project Stage 02.
Thank you for coding with me! 😄