dputils Documentation

View the Project on GitHub digipodium/dputils

dputils Scraper Module Documentation

Introduction

The dputils library includes a web scraping module called scraper (imported as dputils.scrape). The module extracts data from web pages using a combination of Python’s httpx, BeautifulSoup, and custom data classes. It handles tasks such as fetching data from a single page and extracting repeated data from lists of items.

Data extraction from a page

Here’s a basic tutorial to help you get started with the scraper module.

  1. Import the required classes and functions:
from dputils.scrape import Scraper, Tag
  2. Initialize the Scraper class with the URL of the webpage you want to scrape:
url = "https://www.example.com"
scraper = Scraper(url)
  3. Define the tags you want to scrape using the Tag class:
title_tag = Tag(name='h1', cls='title', output='text')
price_tag = Tag(name='span', cls='price', output='text')
  4. Extract data from the page:
data = scraper.get_data_from_page(title=title_tag, price=price_tag)
print(data)
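
Once the data is extracted, it usually needs light cleanup. The sketch below assumes that get_data_from_page() returns a dict keyed by the keyword names passed in (title, price); that shape is an assumption, not something confirmed above. parse_price is a hypothetical helper, and sample_data holds placeholder values rather than real scraper output.

```python
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Convert a scraped price string such as '$1,299.00' to a float, or None."""
    if not raw:
        return None
    # Keep only digits and the decimal point; drop currency symbols and commas.
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Placeholder standing in for the value returned by get_data_from_page();
# the dict shape and the values are assumptions for illustration only.
sample_data = {"title": "Example Product", "price": "$1,299.00"}

record = {
    "title": (sample_data.get("title") or "").strip(),
    "price": parse_price(sample_data.get("price")),
}
print(record)
```

Because missing or unparsable values become None instead of raising, one malformed field does not discard the rest of the record.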

Advanced Tutorial - extracting a list of items from a page

For more advanced usage, such as extracting repeated data from lists of items on a page, you can use the following approach:

  1. Initialize the Scraper class:
url = "https://www.example.com/products"
scraper = Scraper(url)
  2. Define the tags for the target section and for the items within it. For repeated data extraction, define a target tag and an items tag, plus one tag per field to extract, and pass them to the get_repeating_data_from_page() method.
    • target - defines the Tag() for the area of the page that contains the list of items.
    • items - defines the Tag() for each repeated item within the target section, such as a product card in a product grid or list.
      target_tag = Tag(name='div', cls='product-list')
      item_tag = Tag(name='div', cls='product-item')
      title_tag = Tag(name='h2', cls='product-title', output='text')
      price_tag = Tag(name='span', cls='product-price', output='text')
      link_tag = Tag(name='a', cls='product-link', output='href')
      
  3. Extract repeated data from the page:
products = scraper.get_repeating_data_from_page(
    target=target_tag,
    items=item_tag,
    title=title_tag,
    price=price_tag,
    link=link_tag
)
for product in products:
    print(product)
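
A common next step is to save the extracted items. The sketch below assumes that get_repeating_data_from_page() returns a list of dicts, one per item, keyed by the keyword names passed in (title, price, link); that return shape is an assumption. products_to_csv is a hypothetical helper built on the standard csv module, and sample_products holds placeholder values rather than real scraper output.

```python
import csv
import io


def products_to_csv(products, fieldnames=("title", "price", "link")):
    """Serialize a list of product dicts to CSV text; missing keys become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(fieldnames),
        extrasaction="ignore",  # skip keys that are not in fieldnames
        restval="",             # fill in fields that an item lacks
    )
    writer.writeheader()
    for product in products:
        writer.writerow(product)
    return buffer.getvalue()


# Placeholder items standing in for the scraper's result (assumed shape).
sample_products = [
    {"title": "Widget", "price": "$10", "link": "/widget"},
    {"title": "Gadget", "price": "$20"},  # no link was found for this item
]
print(products_to_csv(sample_products))
```

Writing to an in-memory buffer keeps the helper easy to test; to write a file instead, pass the returned text to open(path, "w", newline="").write(...).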

Detailed Description

Scraper Class

The Scraper class is the main class for initializing the scraper and handling the extraction of data from web pages.

Constructor

def __init__(self, webpage_url: str, user_agent: str = None, cookies: dict = None, clean: bool = False):

Methods

Tag Class

The Tag class is used to define the HTML tags and attributes to be extracted from the web page.

Constructor

@dataclass
class Tag:
    name: str = 'div'
    output: str = 'text'
    cls: str = None
    id: str = None
    attrs: dict = None
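
To make the role of each field concrete, here is one way the fields could translate into a BeautifulSoup lookup. This is a sketch under assumptions: the Tag class below is a local copy of the fields listed above (with Optional type hints added), to_find_kwargs is a hypothetical helper, and dputils itself may build its queries differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tag:
    # Local stand-in that mirrors the documented fields; not the dputils class itself.
    name: str = 'div'
    output: str = 'text'
    cls: Optional[str] = None
    id: Optional[str] = None
    attrs: Optional[dict] = None


def to_find_kwargs(tag: Tag) -> dict:
    """Hypothetical helper: build keyword arguments in the style of BeautifulSoup's find()."""
    kwargs = {"name": tag.name}
    if tag.cls:
        kwargs["class_"] = tag.cls  # BeautifulSoup uses class_ because class is a keyword
    if tag.id:
        kwargs["id"] = tag.id
    if tag.attrs:
        kwargs["attrs"] = dict(tag.attrs)
    return kwargs


print(to_find_kwargs(Tag(name='span', cls='price')))
# {'name': 'span', 'class_': 'price'}
```

With BeautifulSoup installed, the resulting dict could be used as soup.find(**kwargs). The output field is not part of the lookup; it describes what to read from the matched element, such as its text or its href attribute.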

Methods

Helper Functions

Extract Function

The extract function is used to extract data from a BeautifulSoup object based on the given tags.

def extract(dom_item, tags, data, errors):
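
The signature suggests an accumulator pattern: values found are written into data, and problems are collected in errors. The sketch below illustrates that pattern only. It is not the dputils implementation: it uses the standard library's xml.etree.ElementTree in place of BeautifulSoup so that it runs without third-party packages, sketch_extract is a hypothetical name, tags are simplified to (name, cls, output) tuples, and treating errors as a list of messages is an assumption.

```python
import xml.etree.ElementTree as ET


def sketch_extract(dom_item, tags, data, errors):
    """Fill `data` with one value per tag and record misses in `errors`.

    `tags` maps an output key to a (name, cls, output) tuple, a simplified
    stand-in for the Tag class.
    """
    for key, (name, cls, output) in tags.items():
        # Find the first element with the right tag name and, if given, class.
        match = next(
            (el for el in dom_item.iter(name)
             if cls is None or cls in (el.get("class") or "").split()),
            None,
        )
        if match is None:
            data[key] = None
            errors.append(f"{key}: no <{name}> element found")
        elif output == "text":
            data[key] = "".join(match.itertext()).strip()
        else:
            # Any other output value is treated as an attribute name, e.g. 'href'.
            data[key] = match.get(output)
    return data


# Well-formed placeholder markup for one product card.
html = ('<div><h2 class="product-title"> Widget </h2>'
        '<a class="product-link" href="/widget">View</a></div>')
item = ET.fromstring(html)

data, errors = {}, []
sketch_extract(item, {
    "title": ("h2", "product-title", "text"),
    "link": ("a", "product-link", "href"),
    "price": ("span", "product-price", "text"),  # absent from the markup
}, data, errors)
print(data)
print(errors)
```

Collecting errors instead of raising lets a caller process every item in a list and report the missing fields afterwards.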

Example Usage

Here’s a complete example of using the scraper module to extract data from a webpage:

from dputils.scrape import Scraper, Tag

url = "https://www.example.com"
scraper = Scraper(url)

title_tag = Tag(name='h1', cls='title', output='text')
price_tag = Tag(name='span', cls='price', output='text')

data = scraper.get_data_from_page(title=title_tag, price=price_tag)
print(data)

This documentation provides an overview of the scraper module in the dputils library, including basic and advanced usage tutorials, descriptions of its classes and functions, and a complete usage example.