Introduction to Web Scraping by Mohamed Ahmed

Oct 21, 2017 07:55 PM

First, let us look at the COMPONENTS OF A WEB PAGE (for yassin ehab):

When visiting a web page, our browser is the client (I recommend you read about the client/server architecture to understand a little more about this) and it makes a request to a web server. This request uses the GET method of the HTTP protocol, since we are receiving files from the server. The server then sends back the files that tell our web browser how to render the page for us. The files are usually divided as follows:

HTML: Contains the main content of the web page.

JavaScript: Adds interactivity to the webpage.

CSS: Adds style to the web page, making it look nicer.

Images: Image formats such as .PNG and .JPG allow web pages to display images.

Our web browser receives all the files and shows us the page. Many more things really happen, but you can strengthen your knowledge by reading about the architecture I mentioned earlier. When building a spider or doing web scraping, what interests us most is the main content of the web page, i.e. the HTML.
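To make this concrete, here is a minimal, hedged sketch of the GET request described above, using only Python's built-in http.client module (example.com is just a placeholder domain):

Code: Python

import http.client

# Open a connection to the server, exactly as a browser would.
conn = http.client.HTTPSConnection("example.com")

# Send a GET request for the root page of the site.
conn.request("GET", "/")
response = conn.getresponse()

print(response.status)        # 200 means the server responded correctly
print(response.read()[:200])  # the first bytes of the HTML the server sent back

conn.close()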

UNDERSTANDING HTML

HyperText Markup Language (HTML) is the language in which web pages are written. HTML is not a programming/scripting language like Python; instead, it is a markup language that tells a browser how to lay out the content to be displayed.

Let's get to know HTML a little more so we know how to scrape effectively. HTML consists of elements called tags. The most basic tag is <html>. This tag tells the web browser that everything inside it is HTML.

Let's see an example: we create an HTML document in our notepad or favorite text editor (Notepad++, Sublime, Atom, among others):

Code: HTML5

<html>

</html>

We have not yet added any content to our page, so if we open it in a browser we would see a blank page.

Within the html tags we must add the other two essential tags that make up an HTML document: the head tag and the body tag. The body tag contains the main content of the web page; the head tag contains data such as the title of the page, the character encoding to use, meta tags and, well, other information that is not useful for scraping.

Code: HTML5

<html>

<head>

</head>

<body>

</body>

</html>

Notice that we have only added the tags, without any content, so we still see nothing in the browser.

Now, we'll add our first content to the page with the <p> tag. The p tag defines a paragraph, and any text inside this tag is displayed as a separate paragraph.

Code: HTML5

<html>

<head>

</head>

<body>

<p>
First paragraph of text!
</p>

<p>
Second paragraph of text!
</p>

</body>

</html>

Save it and view it in your browser :)

We can also add hyperlinks with the <a> tag:

Code: HTML5

<html>

<head>

</head>

<body>

<p>
First paragraph of text!
<a href="https://www.youtube.com/watch?v=VloUbNHTyew">quran!</a>
</p>

<p>
Second paragraph of text!
<a href="https://thehackersnews.com">Learn Hacking and CyberSecurity with the platform!</a>
</p>

</body>

</html>

Save the code and see what we have done in your browser...

The a tag marks a link and tells the browser that it should link to another web page. Which web page? That is what the href attribute is for: it indicates the destination of the link.

With this understood, we will look at the requirements/installation needed to build spiders and/or do web scraping.

That is enough HTML; let's focus on the scraping... I will give you a sample.

REQUIREMENTS

We must download the following libraries (you can go to the link directly by clicking on each element):

BeautifulSoup: https://pypi.python.org/pypi/beautifulsoup4

requests: https://pypi.python.org/pypi/requests

Python 3.x: https://www.python.org/ftp/python/3.6.3/

INSTALLATION

Assuming that we already have Python 3.x, we download and proceed to install BeautifulSoup (bs4) and requests.
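A quick note: assuming pip came along with your Python 3.x installation, both libraries can be installed from cmd or a terminal like this:

Code: Terminal

pip install beautifulsoup4
pip install requests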

Once the libraries are installed, I will show you an example of how to use them and the function for which they were mainly designed.

REQUESTS

With the requests library we can fetch the source content of any page and make requests using methods such as GET and POST. The requests.get method also returns a status code.

Code: Python

import requests

page_get = requests.get("https://example.com/index.html")
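Since the POST method was mentioned too, here is a minimal hedged sketch of a POST request; httpbin.org is a public testing service used here purely as a placeholder target:

Code: Python

import requests

# Send form-style data with the POST method; httpbin simply echoes back what it received.
page_post = requests.post("https://httpbin.org/post", data={"name": "value"})

print(page_post.status_code)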

The status code tells us whether the page responded correctly (200 means OK); in other cases it can return other types of status codes, such as 500, which corresponds to an Internal Server Error.
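For example, a minimal check of the status code could look like this (example.com is again a placeholder):

Code: Python

import requests

page_get = requests.get("https://example.com/index.html")

# status_code holds the HTTP response code returned by the server.
if page_get.status_code == 200:
    print("The page responded correctly")
else:
    print("Something went wrong:", page_get.status_code)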

We can print the content of the page using page_get.content:

Code: Python

import requests

page_get = requests.get("http://example.com/index.html")

print(page_get.content)

Which will return all the HTML content of the page (tags, newlines and text), something like:

b'<html>\n<head>\n</head>\n<body>\n<p>Here is some simple content for this page.</p>\n</body>\n</html>\n'

Let's go with BS4

BS4 allows us to extract particular data from the web page. We can use it for .html or .xml files. We can parse the document and extract all the existing paragraph tags. For this, we can use the following code:

Code: Python

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")

soup = BeautifulSoup(page.content, 'html.parser')

soup.find_all('p')

BeautifulSoup allows us to extract tags, attributes like href, paragraphs, even elements matched by CSS selectors. If you want to see the many examples that currently exist for this great library, you can go to this link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
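As a short hedged sketch of those two ideas, here is how we could extract the href attribute from every link and use a CSS selector, again against the placeholder http://example.com:

Code: Python

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract the href attribute from every <a> tag on the page.
for link in soup.find_all('a'):
    print(link.get('href'))

# select() takes a CSS selector; here, every <p> inside the <body>.
for paragraph in soup.select("body p"):
    print(paragraph.get_text())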

To conclude this part: there are libraries somewhat more powerful than bs4; however, used well, bs4 also has great potential for scraping and/or information gathering.

Next is the spider that I wrote, which you can understand through the comments that I left in the code:

Code: Python

# We import the libraries. #
import requests
import sys
from bs4 import BeautifulSoup

# We create our lists. #
urls = []
urls2 = []

# We receive the arguments. #
target_url = sys.argv[1]

# We make a connection to the argument and read all the content of the source code that exists on the page. #
url = requests.get(target_url).content

# We use our bs4 library to later extract what we want. #
soup = BeautifulSoup(url, 'html.parser')

# Using a for loop and the bs4 method called find_all, we collect all tags where there is an href. #
for line in soup.find_all('a'):
    new_line = line.get('href')
    try:
        # If the link starts with http, we store it in our list called urls. #
        if new_line[:4] == "http":
            if target_url in new_line:
                urls.append(str(new_line))
        # If it does not, we try to combine our argument (the page url) + what we found. #
        elif new_line[:1] == "/":
            try:
                comb_line = target_url + new_line
                urls.append(str(comb_line))
            except:
                pass
    except:
        pass

# We go through what we previously saved in our list (urls). #
for get_this in urls:
    # Since urls contains only links, we make a connection to each url and read its source code. #
    url = requests.get(get_this).content

    # We use our bs4 library to later extract what we want. #
    soup = BeautifulSoup(url, 'html.parser')

    for line in soup.find_all('a'):
        new_line = line.get('href')
        try:
            if new_line[:4] == "http":
                if target_url in new_line:
                    urls2.append(str(new_line))
            elif new_line[:1] == "/":
                comb_line = target_url + new_line
                urls2.append(str(comb_line))
        except:
            pass

# We use a set to remove the duplicates from our list. #
urls_3 = set(urls2)

# We go through our set called urls_3 and print all the links our spider found. #
for value in urls_3:
    print(value)

I hope it serves you. To run it, just type in cmd or a terminal: python spider.py http://example.com

In a future post I will show you how to extract information when it sits in tables, and how we can combine this with DataFrame libraries to make our data collection much more powerful.
