What Is Static Analysis for C/C++ Code and Why Should You Use It?
The “four stages of competence” is a well-known learning model that depicts the stages a learner passes through while acquiring a skill. If you have never heard of static code analysis, this blog post is for you. By the end of it, if you decide to learn more about static code analysis, you will have successfully transitioned from the “I don’t know that I don’t know” state to the “I know that I don’t know” state.
At their heart, compilers are tools that transform human-readable text into machine-readable code. If the compiler doesn’t encounter any errors during this transformation, an executable is born from the source code. 99% of the time, a program is born with logical errors/vulnerabilities/functional defects that the compiler knows nothing about. (What about the remaining 1%? They are miracle births that are celebrated far and wide.)
Compilers not only faithfully carry out the programmer’s instructions but also give hints on how to improve a program. These hints depend on compiler settings known as warning levels, and they are the result of static code analysis the compiler performs during compilation.
The “90/10” rule is a well-known paradigm that applies to many areas of computer science. For example, 90 percent of a program’s execution time is spent in 10 percent of its code. Extrapolating this rule to compiler warnings, we could claim that 90 percent of programmers fix only 10 percent of them. They are just warnings, I hear you say!
Bad things happen when the compiler’s well-intended warnings are not heeded. Programmers also make subtle mistakes that cannot be caught by the compiler at all. These mistakes manifest themselves as software errors or vulnerabilities. Let us go through some such examples:
Ariane 5 Disaster
What can be worse than losing a rocket about 37 seconds into its launch, blowing up 370 million dollars of taxpayers’ money? All because the generated code tried to cram a 64-bit floating-point value into a 16-bit signed integer.
A 16-bit integer: | 0 | 1 | 2 | … | 15 |
A 64-bit value:   | 0 | 1 | 2 | … | 15 | 16 | … | 63 |
A 64-bit value can be a very large number (up to 2^64 − 1, to be precise). Clearly, this cannot fit into a 16-bit integer, but the compiler generated instructions to do just that! You can read more about the disaster here.
Therac-25 radiation overdose
Therac-25 is the story of how a machine intended to save patients’ lives ultimately became their killer – all because of bugs in its software. Race conditions and integer overflows caused the machine to malfunction, delivering massive overdoses of radiation.
Race conditions happen when multiple threads executing a block of code are not properly synchronized. For example:
| Thread 1        | Thread 1 local copy | Shared variable | Thread 2 local copy | Thread 2        |
| --------------- | ------------------- | --------------- | ------------------- | --------------- |
| Read value      | 0                   | 0               | 0                   | Read value      |
| Increment value | 1                   | 0               | 1                   | Increment value |
| Write value     | 1                   | 1               | 1                   | Write value     |

Both threads read 0 before either writes, so after two increments the shared variable holds 1 instead of 2.
Data races give incorrect results because unsynchronized threads execute concurrently. Such bugs are non-deterministic and quite hard to track down.
Integer overflows happen when an operation results in values that are outside of the range that can be represented with a given number of digits. Assume that a 16-bit unsigned integer is holding its maximum value 65535 like so:
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
If we try to add 1 to this number, an integer overflow happens because 65536 cannot be represented in 16 bits. With signed integers, overflow is undefined behavior and can corrupt program state in unpredictable ways. With unsigned integers, the behavior is defined and predictable – the value wraps around modulo 2^16 – but a silent wraparound can still be a logic bug.
Neither race conditions nor integer overflows are normally detected during compilation – although a special class of integer overflows involving only constants can be.
Heartbleed – vulnerability in OpenSSL
This is one of those rare bugs that has a webpage of its own! OpenSSL is a cryptographic library used to secure information. It is open source and well supported on all modern operating systems. A missing bounds check in the handling of the TLS heartbeat extension could be exploited to reveal sensitive information. A bounds-check omission, though not caught by the compiler, can be caught by a static code analysis tool. A critical software vulnerability can even be weaponized. The most famous such case is Stuxnet, where industrial automation software was specifically targeted to deliver a worm that disrupted programmable logic controllers, effectively acting as a weapon.
Aside: What is a bounds check and why is it important?
Arrays are variables that store values of the same type in contiguous memory locations. Assume we have an array like so:
42 | 0 | 100 | 7 | 13 | 5 | 3 | 2 |
The bounds of the array are (0, 7), both inclusive. Some languages like C# or Java check whether an array access is within these bounds. Languages like C and C++ leave array access entirely to the programmer, and no bounds check is added by the compiler. This leads to subtle bugs that can be caught by static code analysis.
There are different ways in which the software industry tries to deal with such bugs. One is the tacit understanding that not all bugs can be eliminated. The following can and should be done to minimize software errors:
- Code reviews – The practice of reviewing code before it reaches production is a great way of catching bugs. The effectiveness of this approach depends greatly on the capability of the reviewer.
- Software testing – More than 50% of the time in software projects is spent on testing. Testing can catch bugs, and automated test runs ensure the bugs don’t regress. But testing is costly and can also lead to a false sense of security.
- Static code analysis – Unlike code reviews, which need human reviewers, static code analysis uses tools to check programs. These checks can even be integrated into nightly builds to generate daily build reports. A disadvantage can be the cost of the tool.
What is static code analysis?
During compilation, source code is transformed into intermediate representations such as the Abstract Syntax Tree (AST) and the Control Flow Graph (CFG). Compilers run data flow analysis (DFA) algorithms over these intermediate representations to perform code optimizations. During the optimization stage, it is possible to detect unused variables and unreachable code (dead code). The primary goal of a compiler is to transform these intermediate representations into executable code, whereas the primary goal of a static code analysis tool is to use them to find issues in the code.
What can static code analysis do for you?
Static code analysis can:
- Detect code that deviates from a coding standard (e.g. MISRA C)
- Detect code that can lead to resource or memory leaks
- Detect code that can lead to null pointer dereferencing
- Detect concurrency issues in code leading to race conditions
- Detect incorrect use of APIs
- Detect conditionals that always evaluate to either true or false
- Detect operator precedence issues
- More…
(Check out the appendix section for more details on some of these issues.)
Most static code analysis tools are well integrated into the development environment. This gives the programmer a chance to run static code analysis on demand. In practice, this opportunity comes after the “last elusive bug” is fixed or the last customer feature is done – which is never.
Hence the ideal way to run static code analysis is to integrate it with source control management and its nightly build setup. Our tool Softacheck lets you seamlessly analyze C and C++ code hosted on GitHub. Unlike some static code analysis tools that are prohibitively expensive, Softacheck is currently free!
Appendix
- AST: Abstract Syntax Tree. A data structure obtained as a result of lexing and parsing a program. For more information see https://en.wikipedia.org/wiki/Abstract_syntax_tree
- CFG: Control Flow Graph. A graph in which the basic blocks of a program constitute the nodes and the control flow between them forms the edges. It is built from the compiler’s intermediate representation, and most static code analyses require it as a prerequisite. For more information see https://en.wikipedia.org/wiki/Control-flow_graph
- DFA: Data flow analysis. Data flow analysis sets up recurrence equations whose solutions determine whether particular analyses and optimizations (e.g. liveness analysis, code hoisting, copy propagation, and common sub-expression elimination) can be applied. For more information see https://en.wikipedia.org/wiki/Data-flow_analysis
- Incorrect use of API: Every API – be it a webservice request, a third-party library call or even the call to a standard library function – has a contract that needs to be followed. Take for example the standard C function strtok:
char * strtok ( char * str, const char * delimiters );
The contract says: On a first call, the function expects a C string as an argument for str, whose first character is used as the starting location to scan for tokens. In subsequent calls, the function expects a null pointer and uses the position right after the end of the last token as the new starting location for scanning.
Without understanding and following this contract, we are guaranteed to have bugs.
- Operator precedence issues: An expression can involve multiple operators whose precedence might not be what the programmer intended. Take for example the following statement:
if (isUser = AuthenticateUser(username, password) == FAIL) {
The expression involves two operators: the equality operator (==) and the assignment operator (=). The equality operator has higher precedence, so isUser is assigned the result of the comparison rather than the return value – a classic operator precedence bug. The fix uses parentheses to force the intended order of evaluation:
if ((isUser = AuthenticateUser(username, password)) == FAIL) {