dputils Scraper Module Documentation
Introduction
The dputils library provides a web scraping module called scraper. It extracts data from web pages using Python’s httpx and BeautifulSoup together with custom data classes. The module handles tasks such as fetching data from a single page and extracting repeated data from a list of items on a page.
Data extraction from a page
Here’s a basic tutorial to help you get started with the scraper module.
- Import the required classes and functions:
from dputils.scrape import Scraper, Tag
- Initialize the Scraper class with the URL of the webpage you want to scrape:
url = "https://www.example.com"
scraper = Scraper(url)
- Define the tags you want to scrape using the Tag class:
title_tag = Tag(name='h1', cls='title', output='text')
price_tag = Tag(name='span', cls='price', output='text')
- Extract data from the page:
data = scraper.get_data_from_page(title=title_tag, price=price_tag)
print(data)
Advanced Tutorial - extracting a list of items from a page
For more advanced usage, such as extracting repeated data from lists of items on a page, you can use the following approach:
- Initialize the Scraper class:
url = "https://www.example.com/products"
scraper = Scraper(url)
- Define the tags for the target section and the items within that section:
For repeated data extraction, define target and items tags and pass them to the get_repeating_data_from_page() method.
  - target - defines the Tag() for the area of the page containing the list of items.
  - items - defines the Tag() for the repeated items within the target section, such as a product card in a product grid or list.
target_tag = Tag(name='div', cls='product-list')
item_tag = Tag(name='div', cls='product-item')
title_tag = Tag(name='h2', cls='product-title', output='text')
price_tag = Tag(name='span', cls='product-price', output='text')
link_tag = Tag(name='a', cls='product-link', output='href')
- Extract repeated data from the page:
products = scraper.get_repeating_data_from_page(
    target=target_tag,
    items=item_tag,
    title=title_tag,
    price=price_tag,
    link=link_tag
)
for product in products:
    print(product)
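The target, items, and per-field steps above can be pictured with a standard-library stand-in. This is only a sketch of the idea, not how dputils implements it: it uses xml.etree.ElementTree on a small, well-formed, made-up page (the PAGE markup and product names are hypothetical), whereas the library itself works on BeautifulSoup objects fetched over the network.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a fetched product page.
PAGE = """
<html><body>
  <div class="product-list">
    <div class="product-item">
      <h2 class="product-title">Widget</h2>
      <span class="product-price">$10</span>
      <a class="product-link" href="/widget">View</a>
    </div>
    <div class="product-item">
      <h2 class="product-title">Gadget</h2>
      <span class="product-price">$25</span>
      <a class="product-link" href="/gadget">View</a>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Step 1: locate the target section that wraps the whole list.
target = root.find(".//div[@class='product-list']")

# Step 2: collect every repeated item inside the target.
items = target.findall(".//div[@class='product-item']")

# Step 3: pull the same named fields out of each item.
products = []
for item in items:
    products.append({
        "title": item.find(".//h2[@class='product-title']").text,
        "price": item.find(".//span[@class='product-price']").text,
        "link": item.find(".//a[@class='product-link']").get("href"),
    })

for product in products:
    print(product)
```

Each dictionary key mirrors a keyword argument name (title, price, link) in the call above. Note that the ElementTree attribute predicate matches the class string exactly, so this stand-in would miss elements carrying several classes.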
Detailed Description
Scraper Class
The Scraper class is the main class for initializing the scraper and handling the extraction of data from web pages.
Constructor
def __init__(self, webpage_url: str, user_agent: str = None, cookies: dict = None, clean: bool = False):
- webpage_url (str): URL of the webpage to scrape.
- user_agent (str): User agent string (optional).
- cookies (dict): Cookies for the request (optional).
- clean (bool): Flag to clean the URL (optional, default is False).
Methods
- _validate_url(self) -> bool: Validates the URL.
- _clean_url(self): Cleans the URL by removing query parameters.
- _get_soup(self, headers=None, cookies=None, clean=False) -> BeautifulSoup: Fetches the webpage content and returns a BeautifulSoup object.
- get_data_from_page(self, errors=False, **tags) -> dict: Extracts data based on given tags and returns a dictionary.
- get_repeating_data_from_page(self, target: Tag = None, items: Tag = None, errors=False, info=False, **tags) -> list: Extracts data for multiple items and returns a list of dictionaries.
- get_tag(self, tag: Tag, errors=False): Extracts data for a single Tag object and returns a dictionary.
- get_all_tags(self, tags: list, errors=False): Extracts data for multiple Tag objects and returns a dictionary.
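To make the URL validation and cleaning steps concrete, here is a minimal standard-library sketch. The function names is_valid_url and strip_query are hypothetical, and the rules they apply (http or https scheme plus a host, and dropping only the query string) are assumptions; the private dputils methods may behave differently.

```python
from urllib.parse import urlsplit, urlunsplit


def is_valid_url(url: str) -> bool:
    """Return True when the URL has an http(s) scheme and a host part."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def strip_query(url: str) -> str:
    """Return the URL with its query string removed."""
    parts = urlsplit(url)
    # Rebuild the URL from its parts, leaving the query component empty.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


print(is_valid_url("https://www.example.com/products?page=2"))
print(is_valid_url("not a url"))
print(strip_query("https://www.example.com/products?page=2"))
```

Cleaning of this kind is useful when tracking or pagination parameters should not influence which page is fetched.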
Tag Class
The Tag class is used to define the HTML tags and attributes to be extracted from the web page.
Constructor
@dataclass
class Tag:
    name: str = 'div'
    output: str = 'text'
    cls: str = None
    id: str = None
    attrs: dict = None
- name (str): The name of the HTML tag (default is ‘div’).
- output (str): The type of data to extract (default is ‘text’).
- cls (str): The class attribute of the HTML tag (optional).
- id (str): The ID attribute of the HTML tag (optional).
- attrs (dict): Additional attributes for the HTML tag (optional).
Methods
- __post_init__(self): Validates the output type and sets the attributes.
- __str__(self): Returns a string representation of the Tag object.
- __repr__(self): Returns a string representation of the Tag object.
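The validate-then-set behaviour described for __post_init__ can be sketched with a plain dataclass. SketchTag is a hypothetical stand-in, not the library's Tag: the allowed output set below contains only the two values this document shows ('text' and 'href'), and merging cls and id into attrs is an assumption about what "sets the attributes" means.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only the output values that appear in this document.
ALLOWED_OUTPUTS = {"text", "href"}


@dataclass
class SketchTag:
    name: str = "div"
    output: str = "text"
    cls: Optional[str] = None
    id: Optional[str] = None
    attrs: Optional[dict] = None

    def __post_init__(self):
        # Reject output types the sketch does not know how to extract.
        if self.output not in ALLOWED_OUTPUTS:
            raise ValueError(f"unsupported output type: {self.output!r}")
        # Fold cls and id into one attribute dictionary for later lookups.
        merged = dict(self.attrs or {})
        if self.cls is not None:
            merged["class"] = self.cls
        if self.id is not None:
            merged["id"] = self.id
        self.attrs = merged


tag = SketchTag(name="h1", cls="title")
print(tag.attrs)
```

Validating in __post_init__ means a bad output value fails when the tag is defined rather than later, during extraction.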
Helper Functions
_get_random_user_agent(): Returns a random User-Agent string from a predefined list.
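A helper like this usually amounts to a random choice over a fixed list. The sketch below assumes that shape; pick_user_agent and the two User-Agent strings are hypothetical, since the list dputils actually ships is not shown here.

```python
import random

# Hypothetical list; the strings dputils actually uses are not documented here.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_user_agent() -> str:
    """Return one User-Agent string chosen at random from the list."""
    return random.choice(USER_AGENTS)


# The chosen value would typically be sent as a request header.
headers = {"User-Agent": pick_user_agent()}
print(headers)
```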
Extract Function
The extract function is used to extract data from a BeautifulSoup object based on the given tags.
def extract(dom_item, tags, data, errors):
- dom_item: The BeautifulSoup object to extract data from.
- tags: A dictionary of Tag objects.
- data: A dictionary to store the extracted data.
- errors: A flag to indicate whether to print errors.
Example Usage
Here’s a complete example of using the scraper module to extract data from a webpage:
from dputils.scrape import Scraper, Tag
url = "https://www.example.com"
scraper = Scraper(url)
title_tag = Tag(name='h1', cls='title', output='text')
price_tag = Tag(name='span', cls='price', output='text')
data = scraper.get_data_from_page(title=title_tag, price=price_tag)
print(data)
This documentation provides an overview of the scraper module in the dputils library, including basic and advanced usage tutorials, detailed descriptions of classes and functions, and an example usage.