Forum Thread: Web Scraping Workshop 1, Consuming APIs, Etc. By Mohamed Ahmed

Well, let's get started with the first post (can it be moved to the Python section? Or maybe to the Null Byte workshops, wherever it fits).

We'll use Python 2.7.

To make HTTP requests we'll use the requests library. It is easy to use and has good documentation (read it when you want to do something I have not explained):

http://docs.python-requests.org/en/master/user/quickstart/

When you see code, run it; don't just take my word for what it does. It is also useful to modify it and see what happens.

Everyone knows what HTTP is, right?

  • It is the protocol used on the web. HTTPS is encrypted HTTP (secure HTTP), and for our purposes it works exactly the same, because the library handles the encryption.

Each time you open a web page, the browser connects to the server and makes a request. Every time you submit a form, the same thing happens. The same goes for loading CSS, images, JavaScript, etc. (yes, connections can be reused, but that is outside what I am going to explain).

Most of these requests are of the GET type. In GET requests, all the parameters are passed in the URL. Sometimes, when you submit a form, POST is used instead of GET. In POST, you can send data that does not go in the URL but separately, in the body of the request, and can therefore be larger.
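To make the GET/POST difference concrete, here is a small sketch that only builds the pieces and sends nothing over the network. The form fields ("action", "query") are made up for illustration, and the import is written so it runs on both Python 2.7 and 3.

```python
try:
    from urllib import urlencode          # Python 2.7
except ImportError:
    from urllib.parse import urlencode    # Python 3

# Hypothetical search form on the example forum (the field names are made up).
base_url = "https://anyforum.org/forum/index.php"
fields = [("action", "search"), ("query", "python")]

encoded = urlencode(fields)

# GET: the parameters travel inside the URL, after a "?".
get_url = base_url + "?" + encoded

# POST: the URL stays clean and the same encoded data goes in the request body.
post_url = base_url
post_body = encoded

print("GET  url : " + get_url)
print("POST url : " + post_url)
print("POST body: " + post_body)
```

With requests you would not build these by hand: requests.get(url, params=...) puts the fields in the URL and requests.post(url, data=...) puts them in the body.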

There are more methods, but they are rarely used on ordinary websites.

Let's make a GET request to the index of any forum:
========================================================================================
Code: Python
import requests  # imports the requests library

result = requests.get("https://anyforum.org/forum/index.php")  # call the get function of the requests module

print result.text  # result is an object with several methods and properties; for now we use .text
========================================================================================

How do you find all that out?
It is in the library documentation I linked above.
If you get an error about the requests library, you have to install it with pip install requests

That will print the HTML of the forum index. Web scraping is about extracting data, so let's try to get the number of users online. We need to find where that data is and what surrounds it, so that the program can identify it:

=========================================================================================
Code: HTML5
<p class="inline stats">
1738 Guests, 209 Members (16 Spiders, 2 Hidden)
</p>
=========================================================================================

I want to get that 209, which is going to change, and show it with the program.

There are several ways to get that out. We are not going to parse the HTML; we are going to do it in a cruder way, with regular expressions. You know how in some search engines you can use * and ? as wildcards? Regular expressions are something like that, but much more powerful.

In Python you have to use the re module:
========================================================================================
Code: Python
import re
========================================================================================

Documentation:
https://pymotw.com/2/re/index.html
https://docs.python.org/2/library/re.html

Quote from: python
re.search("regexp", "string")

Where regexp is the regular expression and string is the text to search in (in our case, result.text).
For example:
=========================================================================================
Code: Python
import re
print re.search("planet", "hello world")
print re.search("world", "hello world")
=========================================================================================

The first print shows None, because "planet" was not found in "hello world", and the second one shows an object of type SRE_Match. This allows us to use the result in an if:

=========================================================================================
Code: Python
import requests
import re

result = requests.get("https://anyforum.org/forum/index.php")

if re.search("anyforum", result.text):
    print "anyforum was found in the index html"
else:
    print "anyforum was not found in the index html"

if re.search("elephant", result.text):
    print "elephant was found in the index html"
else:
    print "elephant was not found in the index html"
=========================================================================================
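If you want to poke at the match object without hitting the network, something like the following may help. The sample_html string is a made-up stand-in for result.text, and contains() is just a helper name I picked for this sketch.

```python
import re

def contains(pattern, text):
    """Return True if pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# Made-up stand-in for result.text, so nothing is downloaded.
sample_html = "<title>anyforum - Index</title>"

match = re.search("forum", sample_html)
print(match.group(0))   # the text that matched
print(match.start())    # index in sample_html where the match begins
print(contains("elephant", sample_html))
```

Comparing against None explicitly says the same thing as the plain if above; it just makes the intent a bit more obvious.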

So far this is the same as the search functions for strings, but in a regular expression many characters have a special meaning.

For example, a . is a wildcard that matches any character:
=========================================================================================
Code: Python
import re
regexp = "any.forum"
text = "any forum"
if re.search(regexp, text):
    print text + " matches " + regexp
else:
    print text + " DOES NOT MATCH " + regexp
=========================================================================================
That one is going to match.
If I want the . to be interpreted as a literal . , I have to escape it like this: \.
Replacing the variables in the example above with these, it will not match:
=========================================================================================
Code: Python
regexp = r"any\.forum"  # raw string, so the backslash reaches re untouched
text = "any forum"
=========================================================================================
But this one will:
========================================================================================
Code: Python
regexp = r"any\.forum"
text = "any.forum"
=========================================================================================
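When the text you want to match literally comes from somewhere else, escaping by hand gets tedious. The standard re.escape function does it for you; a small sketch, with sample strings made up for the example:

```python
import re

literal = "any.forum"
pattern = re.escape(literal)   # backslash-escapes the special characters

print(pattern)

escaped_hits_dot = re.search(pattern, "any.forum") is not None
escaped_hits_space = re.search(pattern, "any forum") is not None
unescaped_hits_space = re.search(literal, "any forum") is not None

print(escaped_hits_dot)       # True: the literal dot is there
print(escaped_hits_space)     # False: the dot no longer acts as a wildcard
print(unescaped_hits_space)   # True: an unescaped dot matches the space
```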

A * matches the previous character between zero and infinitely many times. Examples:
====================================================================================

  • Code: Python

regexp = "any*forum"
text = "anforum"

  • Code: Python

regexp = "any*forum"
text = "anyforum"

  • Code: Python

regexp = "any*forum"
text = "anyyyforum"

All three match. We can also combine that with a . , which matches almost anything:

=======================================================================================
Code: Python
regexp = ".*"
text = "any forum"
========================================================================================
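To check the * behaviour quickly, here is a sketch that runs one pattern against several made-up strings. It also shows one detail that bites when scraping: by default . does not match a newline, unless you pass the re.DOTALL flag.

```python
import re

pattern = "any*forum"   # the "y" may appear zero or more times
texts = ["anforum", "anyforum", "anyyyforum", "any forum"]
results = [re.search(pattern, t) is not None for t in texts]
for text, found in zip(texts, results):
    print("%-12s %s" % (text, found))

# By default "." does not match a newline, which matters for multi-line HTML.
no_match_across_lines = re.search("a.*b", "a\nb") is None
match_with_dotall = re.search("a.*b", "a\nb", re.DOTALL) is not None
print(no_match_across_lines)
print(match_with_dotall)
```

The last text does not match because * applies only to the "y", and a space is not a "y".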

That is not enough for us, because we want to extract data. This is done by wrapping in parentheses the part of the pattern that should match the data we want. To see what the parentheses matched, we use the .groups() method of the object returned by re.search():

Code: Python
import re

result = re.search("an(.*)um", "anyforum")
print result.groups()

The result:
Code: Select
('yfor',)

=======================================================
That is a tuple, which in this case has only one element. There can also be two:
Code: Python
import re

result = re.search("a(...)o(..)m", "anyforum")
print result.groups()
========================================================

Result:
Code: Select
('nyf', 'ru')

Let's go with another case:

==============================================================
Code: Python
import re

result = re.search("any(.*)m", "anyforum anyforum")
print result.groups()

What will it return? "foru" or "forum anyforu"? It turns out that * matches as much as it can (it is greedy), so it returns something that is probably not what we are looking for:

Code: Select
('forum anyforu',)

To make it match as little as possible, add a ? :
Code: Python
import re

result = re.search("any(.*?)m", "anyforum anyforum")
print result.groups()

This is very useful, for example, if we have HTML like this:
Code: HTML5
<div id="sidebar"><div id="whatwewant">THE DATA HERE</div></div>

The regular expression we want is this:
Code: Select
<div id="whatwewant">(.*?)</div>

Without the ? it would match up to the last </div>, so the captured text would also include the first </div>.
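You can see the greedy and non-greedy versions side by side with a sketch like this one. It uses a cleaned-up copy of the sidebar snippet (the id is written without a space here, which is my assumption about what the markup looked like).

```python
import re

html = '<div id="sidebar"><div id="whatwewant">THE DATA HERE</div></div>'

# Greedy: .* runs to the LAST </div> it can find.
greedy = re.search('<div id="whatwewant">(.*)</div>', html)
# Non-greedy: .*? stops at the FIRST </div>.
lazy = re.search('<div id="whatwewant">(.*?)</div>', html)

print(greedy.group(1))
print(lazy.group(1))
```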

To access one particular value we can use .group(ID + 1), where ID is the index in the tuple. You have to add 1 because .group(0) returns everything that matched, not just what is inside the parentheses.
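A quick sketch of how the numbering lines up, using a made-up pattern with two groups:

```python
import re

match = re.search("a(...)o(..)m", "anyforum")

print(match.groups())   # tuple with every captured group
print(match.group(0))   # the whole text that matched
print(match.group(1))   # same value as match.groups()[0]
print(match.group(2))   # same value as match.groups()[1]
```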

Let's get back to the thing we wanted to extract:
Code: HTML5

<p class="inline stats">
1738 Guests, 209 Members (16 Spiders, 2 Hidden)
</p>

The regular expression can be this:
Code: Select
Guests, (.*?) Members

And the code would look like this:
Code: Python
import requests
import re

web = requests.get("https://anyforum.org/forum/index.php")
regexp = 'Guests, (.*?) Members'
result = re.search(regexp, web.text)
print result.group(1)

That prints the number of members online.
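One thing to watch out for: if the page changes and the pattern is not found, re.search returns None and calling .group(1) on it blows up. A possible defensive version, tested here on a copy of the HTML snippet instead of the live page (members_online is a name I made up, and using \d+ instead of .*? is my own tweak so only digits are accepted):

```python
import re

def members_online(html):
    """Return the member count as an int, or None if the pattern is absent."""
    match = re.search(r"Guests, (\d+) Members", html)
    if match is None:
        return None
    return int(match.group(1))

sample = '<p class="inline stats">\n1738 Guests, 209 Members (16 Spiders, 2 Hidden)\n</p>'

print(members_online(sample))
print(members_online("<p>maintenance page</p>"))
```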

Now I want to get the list of the most recently published messages, the one further down the page. The HTML looks like this:

========================================================================
Code: HTML5
<dl id="ic_recentposts" class="middletext">

<dt><strong><a href="...">Re: Occupy the RAM of useless processes</a></strong> by <a href="...">...</a> (<a href=".../general-questions-121/">Doubts and general requests</a>)</dt>

<dd><strong>Today</strong> at 07:14:59 pm</dd>

<dt><strong><a href="/general-questions-121/kali-payload-and-no-ip-attack-out-lan/msg104687/?topicseen#msg104687">Re: kali payload and no-ip attack off lan</a></strong> by <a href="https://anyforum.org/forum/profile/blackdrake/">blackdrake</a> (<a href="...">Doubts and general requests</a>)</dt>

<dd><strong>Today</strong> at 07:11:33 pm</dd>

<dt><strong><a href="doubts-121/android-doubt-base-of-date/msg104686/?topicseen#msg104686">Re: Android - Duda Database for waiters in a bar</a></strong> by <a href="https://anyforum.org/forum/profile/seth/">seth</a> (<a href="...">Doubts and general requests</a>)</dt>

<dd><strong>Today</strong> at 07:10:00 pm</dd>

<dt><strong><a href="general-questions-121/doubts-without-clarifying/msg104685/?topicseen#msg104685">Re: Start in the branches of Phishing and Programming</a></strong> by <a href="https://anyforum.org/forum/profile/mohamedx/">blackdrake</a> (<a href="https://anyforum.org/forum/general-questions-121/">Doubts and general requests</a>)</dt>

<dd><strong>Today</strong> at 07:08:29 pm</dd>

<dt><strong><a href=".../stub-clean-server-detected-by-that/msg104684/?topicseen#msg104684">Re: stub clean server detected, why?</a></strong> by <a href="https://anyforum.org/forum/profile/mohamedx/">blackdrake</a> (<a href="...">Doubts and general requests</a>)</dt>

<dd><strong>Today</strong> at 07:06:05 pm</dd>

<dt><strong><a href="...">Re: Symlink</a></strong> by <a href="https://anyforum.org/forum/profile/mohamedx/">mohamedx</a> (<a href="https://anyforum.org/forum/general-questions-121/">Doubts and general requests</a>)</dt>

<dd><strong>Today</strong> at 07:03:10 pm</dd>

<dt><strong><a href="mohamedx/general-questions-121/what-me-i-advise-en-fans-and-samples-of-my-data/msg104682/?topicseen#msg104682" rel="nofollow">Re: Who advise me on fans and samples of my data.</a></strong> by <a href="https://anyforum.org/forum/profile/mohamedx/">mohamedx</a> (<a href="https://anyforum.org/forum...">Doubts and general requests</a>)</dt>

<dd><strong>Today</strong> at 06:35:01 pm</dd>
</dl>
========================================================================================

The complication is that we want to match several posts with a single regexp; for that we are going to use re.findall():

Code: Python

import requests
import re

web = requests.get("https://anyforum.org/forum/index.php")
regexp = '<dt><strong><a href=".*?">(.*?)</a></strong>'
result = re.findall(regexp, web.text)
print result

Notice that we use .*? so that the href matches any URL, but without parentheses, because we are not interested in capturing the URLs.
This prints a list. Each element of that list is one of the matches:

====================================================================================
Code: Select

[u'Re: [Help] Window is not mounted in the grub (est\xe1 encrypted)', u'Re: Knowing a JQuery object ID?', u'Re: Occupy the RAM of useless processes', u'Re: kali payload and no-ip attack off lan', u'Re: Android - Duda Database for waiters in a bar', u'Re: Start in the branches of Phishing and Programming', u'Re: stub clean server detected, why?']

===================================================================================
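To try re.findall without downloading anything, you can feed it a small hand-written sample. The two entries below are simplified and their URLs are invented; only the tag layout follows the recent-posts list.

```python
import re

# Simplified, hand-written sample; the hrefs are made up.
html = (
    '<dt><strong><a href="/topic-1/">Re: Symlink</a></strong> by someone</dt>\n'
    '<dt><strong><a href="/topic-2/">Re: stub clean server detected, why?</a></strong> by someone</dt>\n'
)

regexp = '<dt><strong><a href=".*?">(.*?)</a></strong>'
titles = re.findall(regexp, html)
print(titles)

# When nothing matches, findall returns an empty list instead of None.
nothing = re.findall(regexp, "<p>no posts here</p>")
print(nothing)
```

That empty list is handy: you can loop over the result directly without checking for None first.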

Now, if we add parentheses to capture the URLs, we get back a list of tuples, where each tuple has the URL and the title of the post:

Code: Python
import requests
import re

web = requests.get("https://anyforum.org/forum/index.php")
regexp = '<dt><strong><a href="(.*?)">(.*?)</a></strong>'
result = re.findall(regexp, web.text)
print result

=========================================================================================
Code: Select

[(u'https://anyforum.org/forum/general-doubts-121/(help)-window-not-is-mounted-in-the-grub-(this-encrypted)/msg104690/?topicseen;PHPSESSID=ub91qif2cdbng1itkjomd0g646#msg104690', u'Re: [Help] Window is not mounted in the grub (est\xe1 encrypted)'), (u'https://anyforum.org/forum/general-doubts-121/saber-id-of-a-jquery-object/msg104689/?topicseen;PHPSESSID=ub91qif2cdbng1itkjomd0g646#msg104689', u'Re: Knowing a JQuery object ID?'), (u'https://anyforum.org/forum/general-doubts-121/the/msg104688/?topicseen;PHPSESSID=ub91qif2cdbng1itkjomd0g646#msg104688', u'Re: Occupy the RAM of useless processes'), (u'https://anyforum.org/forum/general-doubts-121/kali-payload-and-no-ip-attack-out-lan/msg104687/?topicseen;PHPSESSID=ub91qif2cdbng1itkjomd0g646#msg104687', u'Re: kali payload and no-ip attack off lan'), (u'https://anyforum.org/forum/general-doubts-121/android-doubt-base-of-date/msg104686/?topicseen;PHPSESSID=ub91qif2cdbng1itkjomd0g646#msg104686', u'Re: Android - Duda Database for waiters in a bar'), (u'https://anyforum.org/forum/general-doubts-121/doubts-without-clarifying/msg104685/?topicseen;PHPSESSID=ub91qif2cdbng1itkjomd0g646#msg104685', u'Re: Start in the branches of Phishing and Programming'), (u'https://anyforum.org/forum/general-doubts-121/stub-clean-server-detected-by-that/msg104684/?topicseen;PHPSESSID=ub91qif2cdbng1itkjomd0g646#msg104684', u'Re: stub clean server detected, why?')]

===================================================================================
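Once findall gives you tuples, you can unpack them in a for loop. A sketch on a hand-written sample (the hrefs are invented, and storing each post as a dictionary is just one option I chose for the example):

```python
import re

# Simplified, hand-written sample; the hrefs are made up.
html = (
    '<dt><strong><a href="/topic-1/">Re: Symlink</a></strong> by someone</dt>\n'
    '<dt><strong><a href="/topic-2/">Re: stub clean server detected, why?</a></strong> by someone</dt>\n'
)

regexp = '<dt><strong><a href="(.*?)">(.*?)</a></strong>'

posts = []
for url, title in re.findall(regexp, html):
    # Each match is a (url, title) tuple, in the order of the parentheses.
    posts.append({"title": title, "url": url})
    print("%s -> %s" % (title, url))
```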

I leave as an exercise writing a function that returns a list with the names of all the users that appear under "Active Users in the last 25 minutes:"; call it and show the result. When you have it working, or if you get stuck, post the code.

You may have to escape some special characters, although I doubt it. If necessary, remember that you escape with \

It may also happen that your regular expression matches more data than you are looking for. For example, it might also bring in the names of the users that appear next to the post titles. You can try to make the regular expression more complex, or maybe use two: one that matches the div where the data is, and another that pulls the users out of it.
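To give an idea of the two-regexp approach without solving the exercise for you, here is a sketch on markup I invented. The id "active_users" and the link layout are assumptions, so check the real HTML before copying anything.

```python
import re

# Invented markup; the real page will look different.
html = (
    '<dt><a href="/profile/seth/">seth</a> posted</dt>\n'
    '<div id="active_users">Active Users in the last 25 minutes:\n'
    '<a href="/profile/blackdrake/">blackdrake</a>, '
    '<a href="/profile/mohamedx/">mohamedx</a></div>\n'
)

def active_users(page):
    """Step 1: isolate the block. Step 2: pull the names out of that block only."""
    # re.DOTALL lets "." match newlines, since the block spans several lines.
    block = re.search('<div id="active_users">(.*?)</div>', page, re.DOTALL)
    if block is None:
        return []
    return re.findall('<a href=".*?">(.*?)</a>', block.group(1))

users = active_users(html)
print(users)
```

Because the second regexp only sees the inside of the div, the "seth" link next to the post title is left out.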

You may find some errors; if so, just leave a comment.

Greetings
