Web Scraping with Python

Install Beautiful Soup from the terminal (for example: pip install beautifulsoup4).

I wrote some custom functions on top of this module. Note: since Beautiful Soup only works on HTML content, we need urllib to open a URL and read the page, as done in the following code. The first part, before the functions, is required by every code example in this post.
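The setup described above can be sketched roughly as follows, assuming Python 3 and the bs4 package. The helper names fetch_html and make_soup and the URL in the usage comment are illustrative assumptions, not the functions from the original listing.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def make_soup(html):
    """Parse HTML given as str or bytes, using Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Usage against a live page (placeholder URL, needs network access):
# soup = make_soup(fetch_html("https://example.com"))

# Offline demonstration on an inline document:
soup = make_soup("<html><body><p>Hello</p></body></html>")
first_paragraph = soup.find("p").string
print(first_paragraph)
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing step on a local string before touching the network.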

Explanation of the functions, with examples:

imgs(src_content=""): list of image URLs whose src contains the given value (partial match, not exact).
tag_attr(tag, attr_list, noattr_list=""): list of the given tags that have the attributes in attr_list and do not have the attributes in noattr_list.
list = tag_attr("", ["class", "label"]): list of all tags having the attributes "class" AND "label".
tag(tag, _dict=""): list of the given tags having the given attributes with the given values.
list = tag("div", {"class": ["class1", "class2"]}): list of divs having "class1" OR "class2" as class.
list = tag("p"): list of every p tag.
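One possible way to write helpers with the behaviour listed above is sketched below. This is a reconstruction under assumptions, not the author's code: the soup object is passed in explicitly instead of being a global, the defaults differ slightly from the signatures above, and the sample HTML is made up for the demonstration.

```python
from bs4 import BeautifulSoup


def imgs(soup, src_content=""):
    """URLs of all <img> tags whose src contains src_content."""
    return [img["src"] for img in soup.find_all("img", src=True)
            if src_content in img["src"]]


def tag_attr(soup, tag, attr_list, noattr_list=()):
    """Tags having every attribute in attr_list and none in noattr_list.

    An empty tag name means "any tag".
    """
    name = tag if tag else True
    return [el for el in soup.find_all(name)
            if all(el.has_attr(a) for a in attr_list)
            and not any(el.has_attr(a) for a in noattr_list)]


def tag(soup, name, _dict=None):
    """Tags called name whose attributes match _dict.

    A list of values for one attribute matches any of them (OR).
    """
    return soup.find_all(name, attrs=_dict or {})


# Made-up sample document for the demonstration
SAMPLE = """
<div class="class1" label="a">one</div>
<div class="class2">two</div>
<div class="class3">three</div>
<p>text</p>
<img src="/static/logo.png">
<img src="/static/photo.jpg">
"""
soup = BeautifulSoup(SAMPLE, "html.parser")

logo_urls = imgs(soup, "logo")
both = tag_attr(soup, "", ["class", "label"])
divs = tag(soup, "div", {"class": ["class1", "class2"]})
paragraphs = tag(soup, "p")
```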

Now let’s take a look at some functions directly from the module.
In the following code regular expressions can be used:

It gives the content (.string) of the first ([0]) div whose class and label values each match any character (.).
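A call matching that description might look like the sketch below. The sample HTML is an assumption made up for the demonstration; find_all and re.compile are used as documented in Beautiful Soup and the standard library.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample document: only the last two divs have a label attribute
html = """
<div class="box">no label here</div>
<div class="box" label="first">has both</div>
<div class="box" label="second">also both</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "." matches any single character, so any non-empty value qualifies
any_char = re.compile(".")
matches = soup.find_all("div", {"class": any_char, "label": any_char})

# Content of the first matching div
content = matches[0].string
print(content)
```

A div without a label attribute is skipped, because a missing attribute has no value for the regular expression to match.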

It finds all link (a) tags having "price" as text.
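As a hedged illustration of such a search, the sketch below uses the string argument of find_all, which was called text in Beautiful Soup versions before 4.4.0. The sample HTML and its href values are made up for the demonstration.

```python
from bs4 import BeautifulSoup

# Made-up sample document with links and one non-link "price" element
html = """
<a href="/item/1">price</a>
<a href="/item/2">details</a>
<span>price</span>
<a href="/item/3">price</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Only <a> tags whose text is exactly "price"
price_links = soup.find_all("a", string="price")
hrefs = [link["href"] for link in price_links]
print(hrefs)
```

Passing a compiled regular expression such as re.compile("price") instead of the plain string would also match links whose text merely contains the word.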
As a last example, assume our HTML is the following:

By executing this code:

Result for el2: address
Result for el: 18A
Note: further results will accumulate in the "l" list!
