Open-source data scraping is an essential reconnaissance tool for government agencies and hackers alike, with big data turning our digital fingerprints into giant neon signs. The problem is no longer whether the right data exists; it's filtering it down to the exact answer you want. TheHarvester is a Python email scraper that does just that by searching open-source data for target email addresses.
In many social engineering or recon scenarios, you'll want to find email addresses for a person or members of an organization. There are many reasons for this depending on your goal, whether as targets for technical attacks or as a way to contact the target by email.
Some uses of email scraping data include provoking a response from a target, presenting a service, sending a phishing email, or generating a list of employees to pretend to be. Sometimes, you will only need to learn an organization's email formatting to guess what the email address would be for a particular user.
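As a rough illustration of guessing an organization's email formatting, here is a minimal Python sketch. The format names, the sample people, and the example.com domain are all hypothetical and not part of theHarvester; it simply checks a few known name and address pairs against common patterns and reuses the best match.

```python
from collections import Counter

# Hypothetical candidate formats; a real organization may use something else.
FORMATS = {
    "first.last": lambda f, l: f"{f}.{l}",
    "firstlast": lambda f, l: f"{f}{l}",
    "flast": lambda f, l: f"{f[0]}{l}",
    "first": lambda f, l: f,
    "last.first": lambda f, l: f"{l}.{f}",
}

def infer_format(known):
    """Return the format name matching the most (first, last, email) samples, or None."""
    counts = Counter()
    for first, last, email in known:
        f, l = first.strip().lower(), last.strip().lower()
        if not f or not l:
            continue  # skip incomplete names
        local = email.split("@", 1)[0].lower()
        for name, build in FORMATS.items():
            if build(f, l) == local:
                counts[name] += 1
    return counts.most_common(1)[0][0] if counts else None

def guess_email(first, last, domain, fmt):
    """Build an address for a new person using a previously inferred format."""
    return f"{FORMATS[fmt](first.lower(), last.lower())}@{domain}"

# Made-up sample data for illustration only.
samples = [
    ("Jane", "Doe", "jane.doe@example.com"),
    ("John", "Smith", "john.smith@example.com"),
]
fmt = infer_format(samples)
print(fmt, guess_email("Alex", "Jones", "example.com", fmt))
```

A guess produced this way is only a hypothesis; it still needs to be confirmed against other sources.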
Today, I'll teach you to begin searching for emails like an OSINT researcher with the classic tool theHarvester for macOS (or Mac OS X) and Kali Linux.
Open-source intelligence (OSINT) is the branch of intelligence that relies on searching unclassified data to build a picture of a target. Private companies have advanced this art into a science, with police and intelligence forces routinely buying software tools from private vendors that source data from public APIs to build invasive profiles on targets. These tools are used to skirt laws on data collection against protesters and can return more information than the subject may know or remember about themselves.
Hackers use frameworks like Maltego to build detailed profiles of targets by pulling from APIs to notice patterns. In Maltego, transforms like SocialLinks can be run against a person to find their close friends and associates without setting foot outside.
My tutorials will cover a number of ways to track, gather, analyze, and act on data left in public databases by a target. Police, intelligence agencies, and scam artists use data as a weapon, and my tutorials on Maltego, the Operative Framework, and other OSINT tools will prepare you to know more about a target than they know about themselves, to support bold social engineering strategies that require detailed information to pull off.
Even with modest results, social engineering attacks can benefit from increasingly specific tidbits of data strung together in a way to make it seem as though you have much more information than you really do about a target. Conversely, a target that yields volumes of information about their activities may support a "we know all"-style tactic where you convince them you already know all the details of their organization.
Either tactic lowers the target's inhibitions when discussing things they possibly shouldn't be sharing, since they assume you already have the information. This is why phone scams based on data scraping tools like these frequently fool victims into giving personal information over the phone to scammers who present personal details about the victim while pretending to be from another business.
Organizations using PGP, such as journalists or anyone sending and receiving encrypted emails, are very easy to find in theHarvester. I was able to download a detailed list of The Guardian's journalists with a single string.
Organizations using encrypted mail like the Electronic Frontier Foundation (EFF) are also prime targets. The following identifies individual members and the formatting for official email addresses.
TheHarvester can also pull up associated domains and hostnames of a target. We are able to probe for more information about the domain, subdomain, and organization. In this case, we learned a hostname's IP address.
And now that you know a little bit about OSINT, theHarvester, and what it can do, let's get into actually using theHarvester on your system to scrape email addresses.
If you're using Kali, hit up the next step for instructions on installing theHarvester. Otherwise, if you're on macOS or Mac OS X, make sure you have Xcode installed, then run the following in a terminal window:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null
Press enter and wait for the command to finish. When it's done, run:
brew install theharvester
After that, run the following command to confirm that it's working.
For a minimal footprint, theHarvester works great on our Kali Pi. Of course, any Kali system will work, too.
On Kali Linux, run theHarvester in a terminal window to see if it's installed. If not, you'll see:
You can sometimes run apt-get install theharvester and Kali will fetch this for you, but in my case, it didn't work. So instead, clone it directly and confirm the installation by running the following in a terminal.
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
sudo python ./theHarvester.py
To initiate a harvester search, you'll need to spell out a couple of variables for the script to understand. With each command, theHarvester will run searches on multiple platforms to find email addresses and websites related to the organization or domain you specify. If you have a screen name for your target, this will suffice. In this case, our target is WonderHowTo.
The simplest search you can run looks like this:
theHarvester.py -d wonderhowto.com -b all -l 200
In this command, we're telling it to pull from all data sources and to limit the search to 200 results.
Sometimes this is enough. If it's not, we can turn up the volume at the risk of an API getting upset at us for making too many queries. On a Mac, excessive (or sometimes any) queries to the Bing API can cause the script to crash, requiring you to run queries sequentially rather than via the all argument.
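If the all argument crashes, one way to run the sources one at a time is to generate a separate command for each. The following Python sketch only builds and prints the commands; the helper name build_commands and the chosen subset of sources are my own assumptions, and the flags are the -d, -b, and -l options used above.

```python
import shlex

# A subset of the data sources listed in this article; adjust to your version.
SOURCES = ["google", "bing", "yahoo", "baidu", "pgp", "linkedin"]

def build_commands(domain, sources=SOURCES, limit=200, script="theHarvester.py"):
    """Return one argument list per data source instead of a single '-b all' run."""
    return [[script, "-d", domain, "-b", src, "-l", str(limit)] for src in sources]

# Print each command so it can be reviewed before running it by hand.
for cmd in build_commands("wonderhowto.com"):
    print(shlex.join(cmd))
```

Each argument list could also be handed to subprocess.run to execute the queries in sequence, with a pause between them to stay friendly with the search engines.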
In total, the available arguments to refine your searches include:
- -d: Domain to search or company name.
- -b: Data source: baidu, bing, bingapi, dogpile, google, googleCSE, googleplus, google-profiles, linkedin, pgp, twitter, vhost, yahoo, all.
- -s: Start at result number X (default: 0).
- -v: Verify host name via DNS resolution and search for virtual hosts.
- -f: Save the results into an HTML and XML file (both).
- -n: Perform a DNS reverse query on all ranges discovered.
- -c: Perform a DNS brute force for the domain name.
- -t: Perform a DNS TLD expansion discovery.
- -e: Use this DNS server.
- -l: Limit the number of results to work with (bing returns results in batches of 50, google in batches of 100, and pgp doesn't use this option).
- -h: Use SHODAN database to query discovered hosts.
In some cases, our more invasive all query will have turned up nothing new, so we can try another tactic to pry more information out of the internet. Using the -s argument to skip false hits within the first few results, by specifying how far into the results to start, can help. Running a deep scan of 1,000–5,000 results on each engine individually can also yield additional data on a target.
If you hit on valuable results, you can save them to an HTML and XML file using the -f option followed by the name to save the file as. A note about accuracy: theHarvester is a database scraper. It doesn't pull these results from the domains directly, and thus it's passive. The trade-off, however, is that it cannot validate the results, so you may get some fake results mixed in. Sometimes these can be easily spotted; sometimes it takes some scrutiny.
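To check whether a deeper scan actually turned up anything new, you can compare its addresses against those from the first run. This is a minimal Python sketch that assumes you have already copied the addresses from each run into lists; the function name and sample addresses are hypothetical.

```python
def new_addresses(baseline, deeper):
    """Return addresses found in `deeper` but not in `baseline`.

    Comparison ignores case and surrounding whitespace; output is sorted.
    """
    seen = {a.strip().lower() for a in baseline}
    found = {a.strip().lower() for a in deeper}
    return sorted(found - seen)

# Made-up results from a quick run and a deeper run.
first_run = ["Jane.Doe@example.com", "press@example.com"]
deep_run = ["jane.doe@example.com", "press@example.com", "john.smith@example.com"]
print(new_addresses(first_run, deep_run))
```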
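A first pass at sorting fact from fiction can be automated. The Python sketch below is only a rough filter under my own assumptions: the regular expression is a simplified shape check rather than a full email validator, and the helper name triage and the sample addresses are made up. It flags anything malformed or off-domain for manual review.

```python
import re

# Simplified shape check; it does not cover every valid address form.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def triage(addresses, domain):
    """Split scraped addresses into (plausible, suspect) lists for manual review."""
    plausible, suspect = [], []
    for addr in addresses:
        a = addr.strip().lower()
        # Accept the target domain itself or any of its subdomains.
        on_domain = a.endswith("@" + domain) or a.endswith("." + domain)
        if EMAIL_RE.match(a) and on_domain:
            plausible.append(a)
        else:
            suspect.append(a)
    return plausible, suspect

# Made-up scrape results for illustration.
scraped = [
    "jane.doe@example.com",
    "info@mail.example.com",
    "someone@example.org",
    "not-an-email",
]
plausible, suspect = triage(scraped, "example.com")
print("plausible:", plausible)
print("suspect:", suspect)
```

Passing this filter does not prove an address is real; it only narrows down which results deserve a closer look.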
As a researcher, you need to think critically and understand the tools you are using so you can sort fact from fiction. Can you spot emails that might not be legitimate in the following pull?
Now that you have an email or two, you can begin to build profiles on these targets by plugging the data into other database search tools like Maltego, or even Facebook and LinkedIn networks. Social media accounts, work presentations, domain names, and screen names can all be correlated with some data from theHarvester. While theHarvester may not always return results, it's a valuable data-scraping tool, in particular for email addresses hosted on private domains or in cases where a business name or screen name is known.
Learning to think like an open-source intelligence researcher will help you take each piece of data to find new ones and weave pieces of information together to build a profile of a target. Each use of theHarvester should be viewed as an investigation with the intention of answering a question. Ensuring you are asking the right question, in the right way, is an important part of getting the right results from your query. If you're not sure what you're looking for, you may often find nothing.