BeautifulSoup is a Python library for extracting data from HTML and XML documents. A classic use case is to fetch a page's source with the requests library and then parse the content with BeautifulSoup to extract the data you need. Here is an example:
# You might have to install the following Python packages: beautifulsoup4, lxml, requests
from bs4 import BeautifulSoup
import requests

# Send a GET request to the server at the specified URL
response = requests.get(url="http://example.com/")
# Parse the response content using the lxml parser
soup = BeautifulSoup(response.content, 'lxml')
# Print the soup content in prettified format
print(soup.prettify())
print(type(soup))
<!DOCTYPE html>
<html>
 …
 <h1>
  Example Domain
 </h1>
 <p>
  This domain is …
 </p>
 <p>
  <a href="https://www.iana.org/domains/example">
   More information...
  </a>
 </p>
</html>
<class 'bs4.BeautifulSoup'>
Note: You can view the page source of a given website by visiting the site and pressing Ctrl+U, by opening the browser's developer tools with Ctrl+Shift+I, or by right-clicking the page and selecting “View Page Source.”
The soup variable is a BeautifulSoup object, not a string. If you want the soup as a Python string, pass it to the built-in str function.
soup_string = str(soup)
print(type(soup_string))
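To make the difference between the two string forms concrete, here is a minimal sketch that compares str(soup) with soup.prettify(). The markup is a made-up snippet rather than a downloaded page, and it uses Python's built-in "html.parser" instead of lxml so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for downloaded page content.
html = "<p>Hello <b>world</b></p>"

# "html.parser" ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(html, "html.parser")

compact = str(soup)       # the markup as a plain string, without added whitespace
pretty = soup.prettify()  # the same markup with one tag or text node per line

print(type(compact), type(pretty))
print(compact)
print(pretty)
```

Both results are ordinary Python strings. The prettified form is easier to read, while the str form is usually the better choice when the markup is stored or compared.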
Note: If you are using requests to get web content, you can read the response body as a string directly from its text attribute. In the example above, response.text gives the response as a string, so you don’t need BeautifulSoup if the raw page source is all you want.
Let us work on another example where we want to convert a Tag object into a string.
from bs4 import BeautifulSoup

# A string (think of it as web content).
source_html = """<section class="reviewContent">
<time datetime="2022-09-28T07:37:15.000Z" class="time1" title="Wednesday, September 28, 2022 at 10:37:15 AM">A day ago</time>
<a href="/reviews/6333dd8b113376521b9e6a3" class="link_internal">
<h2 class="heading">Title here</h2>
</a>
<p class="body" data-service="true">Some text here</p>
</section>"""

# Parsing source_html using the lxml parser
soup = BeautifulSoup(source_html, 'lxml')
time_tag = soup.find("time")
print(time_tag)
print(type(time_tag))
<time class="time1" datetime="2022-09-28T07:37:15.000Z" title="Wednesday, September 28, 2022 at 10:37:15 AM">A day ago</time>
<class 'bs4.element.Tag'>
If you want to convert the Tag element into a string, you can just cast it using the str function as we did before.
tag_string = str(time_tag)
print(tag_string)
print(type(tag_string))
<time class="time1" datetime="2022-09-28T07:37:15.000Z" title="Wednesday, September 28, 2022 at 10:37:15 AM">A day ago</time>
<class 'str'>
To get the inner content of the tag as a string:
time_tag = soup.find("time")
print(time_tag.text)  # or time_tag.get_text()
A day ago
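When a tag contains other tags, "inner content" can mean either the text alone or the nested markup. The sketch below contrasts the two on a shortened, hypothetical version of the link from the example; it uses the built-in "html.parser" so that lxml is not assumed to be installed.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the <a> element from the example above.
source_html = '<a href="/reviews/1" class="link_internal"><h2 class="heading">Title here</h2></a>'

soup = BeautifulSoup(source_html, "html.parser")
link = soup.find("a")

outer_html = str(link)                  # the tag itself plus everything inside it
inner_html = link.decode_contents()     # only the markup inside the tag
inner_text = link.get_text(strip=True)  # only the text, with nested tags dropped

print(outer_html)
print(inner_html)
print(inner_text)
```

Use get_text() when only the human-readable text matters, and decode_contents() when the nested tags need to be kept.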
You can also use soup.find(<tag>).get(<attribute>) to get the value of <attribute> on <tag> as a string:
# Get the value of the title attribute in the time tag.
time1 = soup.find("time").get("title")  # or soup.find("time")["title"]
print(time1)
Wednesday, September 28, 2022 at 10:37:15 AM
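Two details of attribute access are worth seeing in code: what happens when the attribute is missing, and how the class attribute is returned. The following is a minimal sketch on a hypothetical time tag that deliberately has no title attribute, parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical tag without a title attribute, to show the missing-attribute case.
source_html = '<time datetime="2022-09-28T07:37:15.000Z" class="time1">A day ago</time>'

soup = BeautifulSoup(source_html, "html.parser")
time_tag = soup.find("time")

datetime_value = time_tag.get("datetime")  # an ordinary attribute comes back as a string
missing = time_tag.get("title")            # .get() returns None when the attribute is absent
fallback = time_tag.get("title", "n/a")    # an optional default can be supplied instead

# class can hold several values, so BeautifulSoup returns it as a list.
css_classes = time_tag["class"]
class_string = " ".join(css_classes)       # join the list to get a single string

# Square-bracket access raises KeyError for a missing attribute.
try:
    time_tag["title"]
    raised = False
except KeyError:
    raised = True

print(datetime_value, missing, fallback, css_classes, class_string, raised)
```

Prefer .get() when an attribute may be absent, and join the list when a multi-valued attribute such as class is needed as one string.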