Python Web Scraper Example

Collecting and cleaning the necessary data is often the most challenging part of a data science project.

In many instances, building a web scraper will be a valuable asset for collecting the data you need for an analytics project.

This particular web scraper was built to collect page content to feed into a natural language processing model for page categorization. Keep in mind that you may need to adjust the code depending on the website you are scraping and the data you want to collect from each page.
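As a hypothetical example of that kind of adjustment, suppose the site you are scraping wraps its copy in <article> tags rather than the rich-text divs targeted later in this post; the selector is the only thing that changes. The page URL and <article> tag below are assumptions for illustration, not part of the original scraper.

#hypothetical adjustment: swap the selector to match your site's markup
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.YOURWEBSITE/example-page")  #placeholder URL
soup = BeautifulSoup(r.content, "html.parser")
for block in soup.find_all("article"):  #instead of the rich-text divs
    for paragraph in block.find_all("p"):
        print(paragraph.text)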

#importing the libraries
import requests
from bs4 import BeautifulSoup
import pandas

#open the CSV of sitemap links and create a list of strings
with open("site_map_links.csv") as f:
    content = f.readlines()
content = [x.strip() for x in content]

Pro Tip: any website that aims to rank organically on Google will have a sitemap. Typically you will find a link to the sitemap in the footer of the website; if you do not see one on the home page, try adding “/sitemap” to the end of the site's root URL.

If you are scraping content from a large website, you will first want to build a small scraper to collect all of the URLs from the sitemap. You can then save those URLs to a CSV to feed into your full website scraper; a minimal sketch of that step follows.
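This sketch assumes a standard XML sitemap served at /sitemap.xml with one <loc> tag per page; the domain is the same placeholder used throughout this post. Only the path of each URL is written out, since the main scraper below prepends the domain itself.

#minimal sketch of a sitemap link collector (assumes a standard
#XML sitemap at /sitemap.xml with one <loc> tag per page)
import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

r = requests.get("https://www.YOURWEBSITE/sitemap.xml")  #placeholder domain
soup = BeautifulSoup(r.content, "html.parser")

with open("site_map_links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for loc in soup.find_all("loc"):
        #keep only the path so the main scraper can prepend the domain
        writer.writerow([urlparse(loc.text).path])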

#creating the full hyperlinks from the link list
link_list = []
for link in content:
    full_link = "https://www.YOURWEBSITE{}".format(link)
    link_list.append(full_link)
#loop to scrape each page of the website
data = {"Page URL": [], "Page Content": []}
for page in link_list:
    page_content = []
    try:
        r = requests.get(page)
        soup = BeautifulSoup(r.content, "html.parser")
        #collect the text of every paragraph inside the rich-text divs
        for div in soup.find_all("div", {"class": "rich-text-component"}):
            for paragraph in div.find_all("p"):
                page_content.append(paragraph.text)
    except requests.exceptions.RequestException:
        #skip pages that fail to load and move on to the next URL
        pass
    print(page)  #simple progress indicator
    full_page_content = " ".join(page_content)
    data["Page URL"].append(page)
    data["Page Content"].append(full_page_content)
#saving the data to a pandas data frame
df = pandas.DataFrame(data)
#saving your data to a csv file
df.to_csv("Page_Content.csv", index=False)

There you have it! You have now successfully collected the content from your web pages and are ready to build an NLP model to categorize them.
