01/05/2024
5 min read

Web Scraping with Python and Selenium

Learn how to extract data from websites when APIs aren't available using Python and Selenium for automated browsing.

Automation, Data Collection, Python, Selenium, Web Scraping

Web scraping is a powerful technique for data collection when APIs aren't available. In this post, I'll show you how to use Python and Selenium to automate web browsing and data extraction.

When to Use Web Scraping

While APIs are generally preferred, web scraping becomes necessary when:

  • No API is available
  • API access is limited or expensive
  • You need data from diverse sources
  • The website's content changes frequently

Setting Up Selenium

First, install the necessary packages:

```bash
pip install selenium webdriver-manager pandas
```

Then, set up a basic scraper:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in background
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

# Set up the webdriver
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=chrome_options
)

# Navigate to the website
driver.get("https://example.com")

# Extract data
elements = driver.find_elements(By.CSS_SELECTOR, ".product-item")
data = []

for element in elements:
    name = element.find_element(By.CSS_SELECTOR, ".product-name").text
    price = element.find_element(By.CSS_SELECTOR, ".product-price").text
    data.append({"name": name, "price": price})

# Convert to DataFrame
df = pd.DataFrame(data)
print(df.head())

# Close the driver
driver.quit()
```
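Scraped values arrive as text, so the DataFrame usually needs light cleaning before analysis or export. A quick sketch, using hypothetical sample rows shaped like the scraper's output:

```python
import pandas as pd

# Hypothetical rows shaped like the scraper's {"name": ..., "price": ...} output
data = [
    {"name": "Widget A", "price": "$19.99"},
    {"name": "Widget B", "price": "$4.50"},
]

df = pd.DataFrame(data)
# Strip the currency symbol so prices can be treated as numbers
df["price"] = df["price"].str.lstrip("$").astype(float)

# Serialize to CSV (pass a path like "products.csv" to write a file instead)
csv_text = df.to_csv(index=False)
print(csv_text)
```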

Best Practices

  1. Be respectful
    • Check the website's robots.txt
    • Add delays between requests
    • Don't overload the server
  2. Handle dynamic content
    • Use explicit waits for elements to load
    • Handle AJAX requests
    • Navigate pagination properly
  3. Handle errors gracefully
    • Implement try/except blocks
    • Log errors and continue
    • Plan for site structure changes

Real-World Application

At Ventask, I used web scraping to automate data collection from various job boards when APIs weren't available. This automation reduced manual data entry by 80% and ensured we had timely data for our operations.

Conclusion

Web scraping with Selenium is a powerful tool in your data collection arsenal. Use it responsibly and in conjunction with other techniques like API integration for comprehensive data solutions.

João Vicente

Developer & Data Analyst

Sharing insights on automation, data analysis, and web development. Based in Lisbon, Portugal.