News: A Basic Website Crawler, in Python, in 12 Lines of Code.

Your first, very basic web crawler.

Hello again. Today I will show you how to code a web crawler using only 12 lines of code (excluding whitespace and comments).

Requirements

  • Python
  • A website with lots of links!

Step 1 Lay out the logic.

OK, as far as crawlers (web spiders) go, this one cannot be more basic. Well, it can, if you remove lines 11-12, but then it's about as useful as a broken pencil: there's just no point. (Get it? Hehe...he...I'm a sad person...)

So what does a web crawler do? Well, it scours a page for URLs (in our case) and puts them in a neat list. But it does not stop there. Nooooo sir. It then iterates through each URL it found, opens it, and retrieves the URLs in that page. And so on (if you code it further).

What we are coding is a very scaled-down version of what makes Google its millions. Well, it used to be. Now it's 50% searches, 20% advertising, 10% users' profile sales and 20% data theft. But hey, who's counting?

This has a LOT of potential, and should you wish to expand on it, I'd love to see what you come up with.

So let's plan the program.

The logic here is fairly straightforward:

  • user enters the beginning URL
  • crawler goes in and goes through the source code, gathering all URLs inside
  • crawler then visits each URL in another for loop, gathering child URLs from the initial parent URLs.
  • profit???
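To make the "gathering all URLs inside" step concrete, here is a minimal sketch that runs the crawler's href pattern (the same one used in the code in Step 2) against a small, made-up page source. The sample HTML and the helper name extract_links are assumptions for illustration, and the sketch is written for Python 3, unlike the Python 2 code in this post.

```python
import re

# Same pattern the crawler uses: capture whatever sits between the quotes
# of an href attribute, ignoring case (so HREF= matches too).
HREF_PATTERN = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


def extract_links(html):
    """Return every href value found in the HTML text, in document order."""
    return HREF_PATTERN.findall(html)


# Made-up page source, only for illustration.
sample_page = '''
<a href="http://example.com/a">A</a>
<A HREF='http://example.com/b'>B</A>
<link href="/style.css">
'''

links = extract_links(sample_page)
print(links)
```

Note that the pattern grabs every href it sees, including stylesheets and relative paths such as /style.css, not just links to other pages.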

Step 2 The Code:

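#! C:\python27
# Python 2 only: relies on the print statement, file() and urllib.urlopen().

import re, urllib

# Output file for the child URLs found one level below the start page.
textfile = file('depth_1.txt','wt')
print "Enter the URL you wish to crawl.."
print 'Usage  - "http://phocks.org/stumble/creepy/" <-- With the double quotes'
# input() evaluates what is typed as a Python expression, hence the quotes.
myurl = input("@> ")
# Outer loop: every href value on the start page.
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
        print i
        # Inner loop: every href value on each linked page.
        for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
                print ee
                textfile.write(ee+'\n')
textfile.close()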

That's it... No really.. That. Is. It.

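So we create a file called depth_1.txt and prompt the user for a URL, which should be entered in the following format, with the quotation marks: "http://www.google.com/"

The quotes are needed because input() in Python 2 evaluates whatever is typed as a Python expression, so the URL has to arrive as a string literal.

Then we fetch the page we were given, parse the source for URLs and print each one on the screen. For each of those URLs we fetch that page too, print its child URLs and write them to the file. Finally, we close the file.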

Done!
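If you are on Python 3, the code above will not run as written (print is a function there, and urlopen lives in urllib.request). Below is a rough Python 3 sketch of the same two-level idea, not a drop-in replacement. The function names fetch and crawl, the fetch_page and out_path parameters, and the try/except that skips links which cannot be opened are all my own additions for illustration.

```python
import re
from urllib.request import urlopen

HREF_PATTERN = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


def fetch(url):
    """Download a page and return its source as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, out_path="depth_1.txt"):
    """Print the links on start_url, then print and save the links on each of those pages."""
    with open(out_path, "w") as textfile:
        # Outer loop: links found on the start page.
        for parent in HREF_PATTERN.findall(fetch_page(start_url)):
            print(parent)
            try:
                child_source = fetch_page(parent)
            except (OSError, ValueError) as err:
                # Relative or dead links land here instead of stopping the crawl.
                print("  skipped:", err)
                continue
            # Inner loop: links found on each linked page.
            for child in HREF_PATTERN.findall(child_source):
                print(child)
                textfile.write(child + "\n")
```

Calling crawl("http://phocks.org/stumble/creepy/") would start from the example URL in the usage line. Since Python 3's input() returns the typed text as a plain string, no quotes would be needed if you read the URL from a prompt. Passing your own fetch_page function lets you try the logic without touching the network.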

Finishing Statement

So, I hope this aids you in some way, and again, if you improve on it - please share it with us!

Regards

Mr.F

16 Comments

Though, the format seems to have been lost xD. If you want, I'll paste it somewhere, or if you like it, you can just toss it in your source :).

Can you pastebin that? I tried to clean the indents up, but I get this:

if sys.argv[1] == '-h':
IndexError: list index out of range

You may have already figured this out, but that error means the script was started without an argument, so sys.argv[1] does not exist. I'm not seeing that line of code anywhere in the text though, so perhaps this has been changed?

Hm, yea I sort of did... Yet that does not help me. I dun goofed.

*wonders if this is a windows thing*

And yes, I know you told me to sysarg that, but for the life of me I could not figure out what it does. Afaik this sys.argv thing just seems to hold things like the app path and so forth... Am I missing something?

No, sys.argv holds the command-line arguments that follow the program name, so you can use them within the program. For example: crawler.py http://www.google.com. That would spider Google. sys.argv is indexed from 0, with 0 being the program itself, so the arguments following it are sys.argv[1], sys.argv[2], etc.
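A small sketch of that indexing, which also shows one way to avoid the IndexError mentioned above when no argument is given. The helper name start_url_from_args and the usage message are made up for illustration; the sketch targets Python 3.

```python
import sys


def start_url_from_args(argv):
    """Return the first command-line argument, or None when it is missing."""
    # argv[0] is the script name; the real arguments start at index 1.
    if len(argv) < 2:
        return None
    return argv[1]


if __name__ == "__main__":
    url = start_url_from_args(sys.argv)
    if url is None:
        print("Usage: crawler.py <start-url>")
    else:
        print("Would crawl:", url)
```

Checking the length of sys.argv before indexing it means a missing argument produces a usage hint instead of a traceback.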

I've made a modification to your source :3. I added argument functionality to it, to make it shorter. However, I took out the usage thing...it could be re-added :D. Awesome stuff man, thanks for the tutorial :D.

import re, sys, urllib  # sys is needed for sys.argv

textfile = file('depth_1.txt','wt')
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(sys.argv[1]).read(), re.I):
        print i
        for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
                print ee
                textfile.write(ee+'\n')
textfile.close()

When I enter "http://www.google.com", I get:

IOError: [Errno 22] The filename, directory name, or volume label syntax is incorrect: '\\search?'

For many other sites, I get:

IOError: [Errno 2] The system cannot find the path specified: '\\services\\Services.css'

I guess they block the access.
Is there any way to circumvent this?

Have you sorted out this issue? I too have this problem.
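A note on those errors: the paths in the messages ('\\search?', '\\services\\Services.css') look like relative links taken straight from the page. Python 2's urllib.urlopen treats a string with no scheme as a local file path, which would explain the "filename" and "path" wording, so the site is probably not blocking anything. One possible fix is to resolve every href against the URL of the page it came from before opening it. The sketch below is for Python 3 (in Python 2 the same functions live in the urlparse module); the helper name resolve_links and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(page_url, hrefs):
    """Turn hrefs found on page_url into absolute http(s) URLs."""
    absolute = []
    for href in hrefs:
        # urljoin leaves absolute URLs alone and resolves relative ones.
        full = urljoin(page_url, href)
        # Drop things a crawler cannot fetch, such as mailto: links.
        if urlparse(full).scheme in ("http", "https"):
            absolute.append(full)
    return absolute


found = resolve_links(
    "http://www.example.com/gallery/index.html",
    ["/services/Services.css", "live.jpg",
     "http://example.org/page", "mailto:someone@example.com"],
)
print(found)
```

The crawler would then call urlopen on the resolved URLs rather than on the raw href values.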

Can someone explain this error?

Traceback (most recent call last):
  File "webcrawl.py", line 9, in <module>
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
  File "/Applications/Canopy.app/appdata/updates/ready/canopy-1.2.0.1610.macosx-x86_64/Canopy.app/Contents/lib/python2.7/urllib.py", line 86, in urlopen
    return opener.open(url)
  File "/Applications/Canopy.app/appdata/updates/ready/canopy-1.2.0.1610.macosx-x86_64/Canopy.app/Contents/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/Applications/Canopy.app/appdata/updates/ready/canopy-1.2.0.1610.macosx-x86_64/Canopy.app/Contents/lib/python2.7/urllib.py", line 462, in open_file
    return self.open_local_file(url)
  File "/Applications/Canopy.app/appdata/updates/ready/canopy-1.2.0.1610.macosx-x86_64/Canopy.app/Contents/lib/python2.7/urllib.py", line 476, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'live.jpg'

12 lines is huge. What about two lines like this:

SOSingleTagMatch(WCMethodPage('http://www.innosia.com/transform', 'GET', '', ''), 'name="_RequestVerificationToken"', 'value="', '"'));

It is syntax from the ScrapperMin app on Android. It lets you do web crawling, parsing, login, download and upload, and compile your script into an APK.

Just a different query: I want to know how you can check which language online games, like the ones on the come2play site, are written in.

Enter the URL you wish to crawl..
Usage - "http://phocks.org/stumble/creepy/" <-- With the double quotes
@> "https://www.facebook.com"
Traceback (most recent call last):
  File "/home/miet/mycrawlertest1.py", line 8, in <module>
    for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
AttributeError: 'module' object has no attribute 'urlopen'

Very nice post. One suggestion: using BeautifulSoup instead of regex would make the code more concise.
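To give a feel for what parser-based link extraction looks like, here is a sketch that uses Python 3's built-in html.parser module as a stand-in. This is not BeautifulSoup's API; BeautifulSoup is a third-party package and would need installing first. The class name LinkCollector and the helper extract_links are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> tag in the HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```

Unlike the regex, this only picks up hrefs on anchor tags, so stylesheet links and similar are left out. With BeautifulSoup installed, something along the lines of soup.find_all("a", href=True) should cover the same ground in fewer lines.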
