How to Use Traffic Analysis to Defeat TOR
As was mentioned by the great OTW last week, TOR, aka The Onion Router, has had its integrity attacked by the NSA. In an attempt to reduce the anonymity granted by the service, the NSA has opened a great many nodes of their own. The purpose is presumably to trace the origin of a communication by compromising some entrance and exit nodes. Once both are compromised, it is much easier to correlate traffic with a particular individual.
This is hardly the first attack against TOR, but the question many may wonder is how this attack actually works, and better yet, what methods exist that do not require a billion dollar budget and a horde of computer specialists. As I have a fondness for probability and computers, I thought I would try to help explain how a correlation attack works, how long it takes for a positive identification, what other types of attacks exist, and finally, how to defend yourself.
You can hardly attack a system you don't understand. So understanding how TOR works is paramount to understanding how it is attacked.
TOR is an anonymizing service that uses multiple hops to obfuscate both origination and endpoint of a particular internet communication. In essence, its purpose is to ensure that no one can look at any packet of a particular transmission and read both the sender and its destination.
It accomplishes this task by scrubbing a packet of identifying information, and then creating multiple layers of encrypted routing instructions. The purpose is to ensure that at all hops, only the outermost layer of routing information is view-able. This next poorly drawn image I made in MSPaint should help visualization.
All TOR routes are made up of three hops which include two relays and an exit node. Traffic can only leave the system through an exit node. All traffic leaving the exit node is unencrypted.
Note that each packet is encrypted in its entirety for each layer of encryption. So even if someone could analyze the packets from each relay, short of breaking the encryption algorithm (RSA in this case), you cannot simply follow a packet by how it looks. If the relays are not busy, an attacker (Oscar) could attempt to use packet counting in order to find the identity of the person Bob is communicating with, but in general, most nodes are too busy to be attacked this way easily.
So what is this traffic conformation attack we are worried about?
Suppose we were in a large crowded room with a lot of talking people. Further, suppose that all people were standing in a big circle and looked only at the center of the room while speaking. How could you determine whom was speaking to whom among this rude bunch?
Most people may try to use names, but if names are never mentioned (or alternately, everyone used pseudonyms), then you would have to rely upon other methods. The most common method being the statistical method.
If we were going to identify who was talking to who, we might want to consider looking at when they are communicating and who else is communicating at roughly the same time (or slightly after).
For example, we generally assume that when someone stops speaking, they will stop and wait to listen to what the other person wants to say before they continue. If we pay attention to the right two people long enough, we will eventually come to the conclusion that they are speaking to each other by simply observing their patterns of starting and stopping communication. We may not understand what they are saying (they may speak a different language), but we still can know, statistically, that they are speaking to each other.
This is an example of traffic confirmation. It is most commonly used when someone is already of the suspicion that two people (or a person and a server) are communicating.
With respect to the TOR network, the most dreaded fear users have is the idea that both the entrance relay (labeled 1 above) and the exit node are monitored. The only thing worse than this would be if every relay was corrupted along the path. However, as tracking packets would be trivial in that circumstance, I shall ignore it here. It is well known that if both the entrance and exit nodes are compromised, then is becomes easy to use traffic conformation to link two people. This is done with statistical methods like Bayesian probability.
The above is the famous Bayes' Theorem. There are many uses for it. My favorite being iterative evidence gathering. It is rare in real life to gather that one perfect clue that tells you exactly what happened somewhere.
For example, a police officer is not always going to be presented with a sweater with the name Arnold stitched into it and the victim's blood covered all over it. Sure, sometimes it happens, but more likely he will gather some evidence that suggests a small pool of possible offenders, and then more evidence shortening that pool, and eventually some more evidence that closes in on the murderer.
Bayes' Theorem works exactly like this as well. You start with some initial probabilities, and then use the equation to find the posterior probability (the probability of something happening given the evidence). Then you update the probabilities and run the equation again should you find more evidence. Given enough iterations, it will eventually tell you what the probability is that something is occurring. This kind of math can be exploited for traffic conformation.
First, we must be sure we define our problem correctly as Bayes' Theorem will not work unless we do. Toward this end, we shall define A to be "Bob is communicating with Alice" and B will be "Bob sent packets right before Oscar Intercepted them at Node 3". Note a few complications though we must keep in mind.
The first thing is that there are more relays in the system than the ones Oscar is controlling. While Oscar controls enough to be sure of where Bob is sending, and to be sure from where Alice is receiving, neither piece gives him enough information to tie Bob and Alice together.
For example, Bob could be communicating with another exit node through the middle node that Oscar is not surveilling. Likewise, Alice could be receiving from another relay that is communicating through the middle relay. In a busy network (and TOR is very busy), no one packet can give all the information you need. What we do have though is a very good idea of how to estimate our prior probabilities.
I am not going to go into detail how I came up with these probabilities except to say that I based them off of naive notions so as to create the most unbiased and weak model I could make. Then I over simplified a lot. The simplifications I made are:
- I assumed all relays are chosen randomly (an okay assumption).
- I assumed that all packets sent through relay 3 were captured and correlated properly. i.e. there is no error.
- I assumed we could determine with some ease the amount of delay so that there was no error in correlating Bob's packets with Alice's.
- As can be determined from my previous two assumptions, I assumed that latency is more or less constant or predictable.
These simplifications make the resultant model both easier to understand, and excessively powerful. In real life, there are more variables than I tried to show, but this is good enough to demonstrate. I sat down and wrote a Python program that creates a method called BayesCorr() that can estimate the exact probability given the parameters relay number, exit node number, and packet count. The image below shows results for 5, 10, 15 and 20 packets with 3200 relays and 900 exit nodes.
And this is my source code:
Note: I have never used that website before and am not sure how long my document will persist.
As you can see, the more packets that we capture and correlate, the more sure we are that we know who is communicating. How sure we need to be will depend on our circumstances of course.
Ultimately the biggest problem with my model is its insistence that the world is perfect and predictable in its entirety. This is obviously wrong.
In real life, latency varies all the time as a function of network traffic. This network traffic is itself a function of time of day, location of the relay, popularity of said relay, the load a relay's ISP is suffering under, and the alignment of the stars.
Okay, I made that last one up.
Furthermore, we cannot honestly believe that whatever criteria we use to filter packets from the entrance and exit nodes will be perfect and capture every communique. Even if we could guarantee that, with latency varying all the time, it is hard to be certain which packet coming from node 2 originated in node 1.
All of this means a lot of things, but mostly it means that when I estimated P(B|A) = 1, I didn't just make a mistake, but I committed a math crime punishable by death. In reality, this probability will be lower. Most importantly, this probability is so difficult to estimate, I would actually suggest trying to measure this parameter. Basically, whatever software you are using to capture packets (look at OTW's tutorials on Snort for an example), and whatever method you are using to estimate latency in order to assist in matching will have a certain error rate. I would suggest running your programs under as real a set of conditions you can muster, and discover how many errors it makes.
In general, this is a better idea than naive estimation. I still contend that my example has its uses, if for nothing else but illustration.
The biggest problem with this attack is that unless we happen to have a large multi-billion dollar budget, it is hard to get enough servers and bandwidth to get a lot of nodes on the TOR network. This means that our chances of actually getting both the entrance and exit nodes are slim at best. So while a government could afford to do this, we, as normal everyday people cannot.
The first bright side is that the correlation attack above can work even without having both nodes compromised. However, the amount of traffic that passes through ordinary routers or MITM machines is too high for much accuracy. Indeed, what makes the node attack so effective is the fact that basically all traffic of interest is sent to you, and generally only traffic of interest.
A better attack for the average hacker is the latency attack. Suppose a hacker has a small botnet. This Botnet is not necessarily large enough to bring someone down, but it could slow their systems in order to increase latency.
Once the hacker Oscar begins introducing latency, he simply needs to observe the latency in different nodes by trying to connect through them. Alternatively, the botnet could attack the nodes while observing Alice for the same effect.
Once the whole path is traced, Oscar could then read incoming and outgoing packets from node one, and then perform the same latency attack while observing everyone on node one. If they are slowed down, they are clearly communicating with Alice. Otherwise, Oscar keeps going until he finds the identity of the communicator (Bob).
To speed this attack up, Oscar can attack half the relays at a time in order to determine which nodes are being used. Then he simply splits each group in half while eliminating the ineffective portion. Eventually he will find the path. From there he can once again find Bob using the same strategy.
Lastly, Oscar could try to take advantage of non-anonymized traffic. He can either look for this traffic in the wild, or attempt to induce it. In the latter case, Oscar could try to inject an exploit into the unencrypted return stream from Alice to Bob. Assuming Bob is vulnerable, Oscar could force Bob to broadcast his IP address through the TOR route. Oscar can now pick up the information in the clear.
Note that this method is unlikely to work unless Oscar already has an idea who is communicating with Alice (and has done recon on Bob), or Oscar's exploit is widespread and unpatched.
Sadly, there is little you can do to protect yourself from a determined hacker or a sophisticated government. These attacks hurt not just TOR, but any low latency system out there like Garlic routing. There are some guidelines though to make tracing you harder.
- Use a proxy in addition to TOR.
- Limit your time communicating through TOR. Remember, it took 20 packets to get to 95% confidence, and that assumed perfect conditions.
- Use busy nodes if possible.
- Look for odd behavior. I made it seem like a latency attack is like quick lightning, but ultimately it is about as fast per step as regular correlation attacks. If your internet is speeding up and slowing down regularly, try switching nodes.
- Keep your computer up to date! Do not ignore an update just because it is inconvenient.
Luck is always welcome when it comes down to it.
I am done. Thank you for reading this extremely long post. As always, I would appreciate feedback in the comments section.