Project Stage 2: Creating a Custom GCC Pass to Identify Function Clones
Contents
- Introduction
- Understanding Our Custom GCC Pass
- My Approach to Finding Cloned Functions
- Getting Started
- Finding Those Cloned Functions
- Building the Resolver Map
- Challenges I Encountered
- Tracking Down Clone Variants
- Rebuilding and Testing Result
- Building the Resolver Map
- Challenges I Encountered
- Tracking Down Clone Variants
Introduction
Hi there! I hope all is fine with you and that you're prepared to start the second phase of our project! I'll show you how to make a unique GCC pass that detects duplicated functions today. Since I'm expanding on my earlier work, I won't go into how to integrate the pass into the source directory again (if you need that information, see my "Project Stage 1" post). I'll instead concentrate on the new pass, its rationale, and the challenges I faced.
Understanding Our Custom GCC Pass
You will understand our goal if you have read my earlier post on Automatic Function Multi-Versioning, or AFMV. To put it simply, this pass must detect functions that were copied during compilation. There are resolvers and matching variations for every copied function. After identifying these clones, our pass will return "PRUNE" or "NOPRUNE" based on whether they are similar or different. The first step, determining which functions have been cloned, is covered in this article.
My Approach to Finding Cloned Functions
Before writing any code, I sketched out my algorithm:
- Loop through all functions in the program
- For each function, get its complete name
- Extract the base name and suffix:
- If the name contains a dot (.), split it into two parts
- Base name: everything before the first dot
- A suffix: everything after the dot
- First pass - Create a resolver map:
- If the suffix is "resolver", store the mapping from base name to full resolver function name
- Second pass - Find the clone variants:
- Using similar logic as the first pass, but this time identifying and logging variants and their resolvers
- For each function, get its complete name
- If the name contains a dot (.), split it into two parts
- Base name: everything before the first dot
- A suffix: everything after the dot
- If the suffix is "resolver", store the mapping from base name to full resolver function name
- Using similar logic as the first pass, but this time identifying and logging variants and their resolvers
There's probably a more elegant way to do this, but this approach made sense to me. I might simplify it in a future iteration.
Getting Started
Before diving into the code, I took a few preparatory steps:
- Created a new GitHub repository for this pass
- I didn't want to overwrite my Stage 1 work, so I started fresh to keep things organized
- Pushed my Stage 1 code to the new repo as a starting point
- Started developing the function identification code
I decided to build on my existing pass to save time. I also checked the passes.def file and confirmed my pass was already positioned late enough in the compilation pipeline. If things don't work, I can always move it further down and rebuild.
Finding Those Cloned Functions
I worked on my solution in phases. Allow me to provide the code and talk about the problems I ran across.
Part 1 - Building Resolver Map
/* This pass was creatd by Kavya Shah with the help of
Professor Chris Tyler's SPO600 wiki and lecture.
This pass accomplishes the following:
*/
// These headers were taken from Professor Chris Tyler's Week 7 Lecture
// These headers were taken from Professor Chris Tyler's Week 7 Lecture
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "tree.h"
#include "gimple.h"
#include "tree-pass.h"
#include "ssa.h"
#include "tree-pretty-print.h"
#include "gimple-iterator.h"
#include "gimple-walk.h"
#include "internal-fn.h"
#include "gimple-pretty-print.h"
#include "cgraph.h"
// Added headers
#include "gimple-ssa.h"
#include "attribs.h"
#include "pretty-print.h"
#include "tree-inline.h"
#include "intl.h"
#include "function.h"
#include "basic-block.h"
#include <string>
#include <map>
#include <cstdio>
// Namespace <--- This section I learned from SPO600 Week 7 - Class 1 Lecture from Professor Chris Tyler
namespace{
const pass_data pass_data_kbshah6 = {
GIMPLE_PASS, /* type */
"kbshah6", /* name of my pass [We will use this inside passes.def as pass_kbshah6_pass]*/
OPTGROUP_NONE, /* optinfo_ flags */
TV_NONE, /* tv_id */
PROP_cfg, /* specify that we need properties */
0, /* the properties provided */
0, /* the properties destroyed */
0, /* todo_flags_start */
0, /* todo_flags_finish */
};
/*
Please refer to the instructions below from Professor CHris Tyler that helped me build the class:
Pass Code
Each pass provides a C++ source file which provides:
- A pass_data structure which defines the pass details (defined in tree-pass.h);
- A class which defines a gimple_opt_pass and includes the public methods gate and execute.
- The gate method controls whether the pass is active. It can consider multiple factors, including command-line arguments, source language, and target. This method returns a boolean.
- The execute method executes the pass. It accepts a pointer to a function object and will be called once for each function in the unit being compiled (which is not always what you want!).
- A method in the anon namespace named make_pass_name which returns a pointer to the gimple_opt_pass described above.
*/
// This is where you identify the class
class pass_kbshah6 : public gimple_opt_pass {
public:
// Constructor
pass_kbshah6(gcc::context *ctxt) : gimple_opt_pass(pass_data_kbshah6, ctxt) {
}
// The gate function
bool gate(function *) final override {
// return
return true;
}
// The execute function: this is where the magic happens
unsigned int execute (function * /*func*/) override {
// Instantiate function name
/*
Inside function.cc, there's a function_name method that returns
the name of a function. Check out line 6454:
https://github.com/gcc-mirror/gcc/blob/master/gcc/function.cc
*/
//const char* function_Name = function_name(func);
// This is where we will get started with identifying the functions that have been cloned
if (dump_file) {
// Lets create a map that holds the functions
std::map<std::string, std::string> resolverMap;
// Use cgraph node
cgraph_node *node;
// Lets use FOR_EACH_FUNCTION
FOR_EACH_FUNCTION(node) {
// Get the function pointer
function *current_function_pointer = node->get_fun();
// Validate
if (!current_function_pointer)
continue;
// Get the complete funciton name
std::string functionName(function_name(current_function_pointer));
// Get the dot
size_t dot = functionName.find('.');
// Check if the dot is there
if (dot== std::string::npos){
continue;
}
// Get the suffix first
std::string suffix = functionName.substr(dot + 1);
// Now we check that if the function has a resolver suffix, we simply store its base name
if (suffix == "resolver") {
std::string baseName = functionName.substr(0, dot);
resolverMap[baseName] = functionName;
// Show an output
fprintf(dump_file, "Resolver was found for base function: %s\n", baseName.c_str());
}
}
// Second pass goes here where we use the names inside our map and find all the variants
}
// Return value
return 0;
}
};
}
// This is used inside the tree-pass.h file
gimple_opt_pass* make_pass_kbshah6(gcc::context *ctxt) {
return new pass_kbshah6(ctxt);
}
/* This pass was creatd by Kavya Shah with the help of
Professor Chris Tyler's SPO600 wiki and lecture.
This pass accomplishes the following:
*/
// These headers were taken from Professor Chris Tyler's Week 7 Lecture
// These headers were taken from Professor Chris Tyler's Week 7 Lecture
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "tree.h"
#include "gimple.h"
#include "tree-pass.h"
#include "ssa.h"
#include "tree-pretty-print.h"
#include "gimple-iterator.h"
#include "gimple-walk.h"
#include "internal-fn.h"
#include "gimple-pretty-print.h"
#include "cgraph.h"
// Added headers
#include "gimple-ssa.h"
#include "attribs.h"
#include "pretty-print.h"
#include "tree-inline.h"
#include "intl.h"
#include "function.h"
#include "basic-block.h"
#include <string>
#include <map>
#include <cstdio>
// Namespace <--- This section I learned from SPO600 Week 7 - Class 1 Lecture from Professor Chris Tyler
namespace{
const pass_data pass_data_kbshah6 = {
GIMPLE_PASS, /* type */
"kbshah6", /* name of my pass [We will use this inside passes.def as pass_kbshah6_pass]*/
OPTGROUP_NONE, /* optinfo_ flags */
TV_NONE, /* tv_id */
PROP_cfg, /* specify that we need properties */
0, /* the properties provided */
0, /* the properties destroyed */
0, /* todo_flags_start */
0, /* todo_flags_finish */
};
/*
Please refer to the instructions below from Professor CHris Tyler that helped me build the class:
Pass Code
Each pass provides a C++ source file which provides:
- A pass_data structure which defines the pass details (defined in tree-pass.h);
- A class which defines a gimple_opt_pass and includes the public methods gate and execute.
- The gate method controls whether the pass is active. It can consider multiple factors, including command-line arguments, source language, and target. This method returns a boolean.
- The execute method executes the pass. It accepts a pointer to a function object and will be called once for each function in the unit being compiled (which is not always what you want!).
- A method in the anon namespace named make_pass_name which returns a pointer to the gimple_opt_pass described above.
*/
// This is where you identify the class
class pass_kbshah6 : public gimple_opt_pass {
public:
// Constructor
pass_kbshah6(gcc::context *ctxt) : gimple_opt_pass(pass_data_kbshah6, ctxt) {
}
// The gate function
bool gate(function *) final override {
// return
return true;
}
// The execute function: this is where the magic happens
unsigned int execute (function * /*func*/) override {
// Instantiate function name
/*
Inside function.cc, there's a function_name method that returns
the name of a function. Check out line 6454:
https://github.com/gcc-mirror/gcc/blob/master/gcc/function.cc
*/
//const char* function_Name = function_name(func);
// This is where we will get started with identifying the functions that have been cloned
if (dump_file) {
// Lets create a map that holds the functions
std::map<std::string, std::string> resolverMap;
// Use cgraph node
cgraph_node *node;
// Lets use FOR_EACH_FUNCTION
FOR_EACH_FUNCTION(node) {
// Get the function pointer
function *current_function_pointer = node->get_fun();
// Validate
if (!current_function_pointer)
continue;
// Get the complete funciton name
std::string functionName(function_name(current_function_pointer));
// Get the dot
size_t dot = functionName.find('.');
// Check if the dot is there
if (dot== std::string::npos){
continue;
}
// Get the suffix first
std::string suffix = functionName.substr(dot + 1);
// Now we check that if the function has a resolver suffix, we simply store its base name
if (suffix == "resolver") {
std::string baseName = functionName.substr(0, dot);
resolverMap[baseName] = functionName;
// Show an output
fprintf(dump_file, "Resolver was found for base function: %s\n", baseName.c_str());
}
}
// Second pass goes here where we use the names inside our map and find all the variants
}
// Return value
return 0;
}
};
}
// This is used inside the tree-pass.h file
gimple_opt_pass* make_pass_kbshah6(gcc::context *ctxt) {
return new pass_kbshah6(ctxt);
}
Challenges I Faced with My Initial Pass
When implementing my first version of the code, I ran into several issues:
-
Variable Naming Problems
I made some basic mistakes with variable names - a reminder that even simple errors can cause headaches! -
String Handling Complications
Initially, I tried using thechardata type for string manipulation, but it required tedious memory management. Switching to C++std::stringobjects made everything much easier, especially when extracting substrings withsubstr(). This was particularly helpful when creating the base name extraction logic. -
Undeclared Variable Errors
When I ranmakein my gcc-build-001 directory, the compiler complained about undeclared variables likefunc. I had to comment out those problematic sections to get the code to compile successfully.
These issues reminded me that even though the algorithm seems straightforward on paper, implementation details can still trip you up if you're not careful with the basics.
Part 2: Identifying the clone variants
/* This pass was creatd by Kavya Shah with the help of
Professor Chris Tyler's SPO600 wiki and lecture.
This pass accomplishes the following:
*/
// These headers were taken from Professor Chris Tyler's Week 7 Lecture
// These headers were taken from Professor Chris Tyler's Week 7 Lecture
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "backend.h"
#include "tree.h"
#include "gimple.h"
#include "tree-pass.h"
#include "ssa.h"
#include "tree-pretty-print.h"
#include "gimple-iterator.h"
#include "gimple-walk.h"
#include "internal-fn.h"
#include "gimple-pretty-print.h"
#include "cgraph.h"
// Added headers
#include "gimple-ssa.h"
#include "attribs.h"
#include "pretty-print.h"
#include "tree-inline.h"
#include "intl.h"
#include "function.h"
#include "basic-block.h"
#include <string>
#include <map>
#include <cstdio>
// Namespace <--- This section I learned from SPO600 Week 7 - Class 1 Lecture from Professor Chris Tyler
namespace{
const pass_data pass_data_kbshah6 = {
GIMPLE_PASS, /* type */
"kbshah6", /* name of my pass [We will use this inside passes.def as pass_kbshah6_pass]*/
OPTGROUP_NONE, /* optinfo_ flags */
TV_NONE, /* tv_id */
PROP_cfg, /* specify that we need properties */
0, /* the properties provided */
0, /* the properties destroyed */
0, /* todo_flags_start */
0, /* todo_flags_finish */
};
/*
Please refer to the instructions below from Professor CHris Tyler that helped me build the class:
Pass Code
Each pass provides a C++ source file which provides:
- A pass_data structure which defines the pass details (defined in tree-pass.h);
- A class which defines a gimple_opt_pass and includes the public methods gate and execute.
- The gate method controls whether the pass is active. It can consider multiple factors, including command-line arguments, source language, and target. This method returns a boolean.
- The execute method executes the pass. It accepts a pointer to a function object and will be called once for each function in the unit being compiled (which is not always what you want!).
- A method in the anon namespace named make_pass_name which returns a pointer to the gimple_opt_pass described above.
*/
// This is where you identify the class
class pass_kbshah6 : public gimple_opt_pass {
public:
// Constructor
pass_kbshah6(gcc::context *ctxt) : gimple_opt_pass(pass_data_kbshah6, ctxt) {
}
// The gate function
bool gate(function *) final override {
// return
return true;
}
// The execute function: this is where the magic happens
unsigned int execute (function * /*func*/) override {
// Instantiate function name
/*
Inside function.cc, there's a function_name method that returns
the name of a function. Check out line 6454:
https://github.com/gcc-mirror/gcc/blob/master/gcc/function.cc
*/
//const char* function_Name = function_name(func);
// This is where we will get started with identifying the functions that have been cloned
if (dump_file) {
// Lets create a map that holds the functions
std::map<std::string, std::string> resolverMap;
// Use cgraph node
cgraph_node *node;
// Lets use FOR_EACH_FUNCTION
FOR_EACH_FUNCTION(node) {
// Get the function pointer
function *current_function_pointer = node->get_fun();
// Validate
if (!current_function_pointer)
continue;
// Get the complete funciton name
std::string functionName(function_name(current_function_pointer));
// Get the dot
size_t dot = functionName.find('.');
// Check if the dot is there
if (dot== std::string::npos){
continue;
}
// Get the suffix first
std::string suffix = functionName.substr(dot + 1);
// Now we check that if the function has a resolver suffix, we simply store its base name
if (suffix == "resolver") {
std::string baseName = functionName.substr(0, dot);
resolverMap[baseName] = functionName;
// Show an output
fprintf(dump_file, "Resolver was found for base function: %s\n", baseName.c_str());
}
}
// Second pass goes here where we use the names inside our map and find all the variants
FOR_EACH_FUNCTION(node) {
// Get the function pointer
function *current_function_pointer = node->get_fun();
// Validate
if (!current_function_pointer)
continue;
// Get the complete funciton name
std::string functionName(function_name(current_function_pointer));
// Get the dot
size_t dot = functionName.find('.');
// Check if the dot is there
if (dot== std::string::npos){
continue;
}
// Get the suffix first
std::string suffix = functionName.substr(dot + 1);
// Now we check that if the function has a resolver suffix, we simply store its base name
if (suffix == "resolver") {
continue;
}
// Get base name
std::string baseName = functionName.substr(0, dot);
// Now we check if base has a resolver
if (resolverMap.find(baseName) != resolverMap.end()) {
fprintf(dump_file, "Clone variant successfully found: %s (base function: %s) with resolver: %s\n", functionName.c_str(), baseName.c_str(), resolverMap[baseName].c_str());
}
}
}
// Return value
return 0;
}
};
}
// This is used inside the tree-pass.h file
gimple_opt_pass* make_pass_kbshah6(gcc::context *ctxt) {
return new pass_kbshah6(ctxt);
}
Rebuilding and testing the pcc
After making changes to my custom GCC pass file, I followed these steps to rebuild the compiler and test my implementation:
Rebuilding the Compiler
- Navigated to my build directory:
cd ~/gcc-build-001
- Rebuilt GCC with parallel processing:
time make -j$(nproc) |& tee rebuild_log.log
- Installed the updated compiler:
make install
cd ~/gcc-build-001time make -j$(nproc) |& tee rebuild_log.logmake installTesting with Sample Code:
To verify that my pass was working correctly, I used the sample code provided by Professor Chris Tyler:
- Went back to my home directory:
cd ~
- Unpacked the test files:
tar xvf /public/spo600-test-clone.tgz
- Navigated to the test directory
- Modified the Makefile to enable comprehensive dumps by uncommenting
DUMP_ALL
- Verified I was using my custom-built GCC:
which gcc
- If necessary, I updated my PATH:
PATH="$HOME/gcc-test-001/bin:$PATH"
- Cleaned any previous build artifacts:
make clean
- Built the test code:
make
- Examined my pass's output by viewing the dump file:
cat clone-test-aarch64-prune-clone-test-core.c.265t.kbshah6
cd ~tar xvf /public/spo600-test-clone.tgzDUMP_ALLwhich gcc
- If necessary, I updated my PATH:
PATH="$HOME/gcc-test-001/bin:$PATH"
make cleanmakecat clone-test-aarch64-prune-clone-test-core.c.265t.kbshah6
This process allowed me to confirm whether my pass was correctly identifying cloned functions and their resolvers as intended.
Result
Success! Looking at the results, I can see that my first function is performing exactly as intended - it correctly identified the variant function and matched it with its corresponding resolver. This is such a satisfying moment - the smile on my face says it all!
Getting this to work properly took quite a bit of time and effort, so seeing the successful output makes all that troubleshooting worthwhile. It's always rewarding when code you've been struggling with finally works as expected!

Comments
Post a Comment