dputils Scraper Module Documentation
Introduction
The dputils library provides an easy-to-use web scraping module called scraper. This module lets users extract data from web pages using a combination of Python's httpx, BeautifulSoup, and custom data classes. It can fetch data from a single page and extract repeated data from lists of items on a page.
Data extraction from a page
Here's a basic tutorial to help you get started with the scraper module.
- Import the required classes and functions:
from dputils.scrape import Scraper, Tag
- Initialize the Scraper class with the URL of the webpage you want to scrape:
url = "https://www.example.com"
scraper = Scraper(url)
- Define the tags you want to scrape using the Tag class:
title_tag = Tag(name='h1', cls='title', output='text')
price_tag = Tag(name='span', cls='price', output='text')
- Extract data from the page:
data = scraper.get_data_from_page(title=title_tag, price=price_tag)
print(data)
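Once the dictionary comes back, it can be useful to check which fields were actually found. The following is a minimal sketch, not part of dputils: the helper name missing_fields and the sample dictionary are hypothetical, and it assumes that a tag that is not found yields None or an empty string, which this documentation does not confirm.

```python
def missing_fields(data: dict) -> list:
    """Return the keys whose values are None or an empty string."""
    return [key for key, value in data.items() if value is None or value == ""]


# Hypothetical result, shaped like the dictionary get_data_from_page is
# described as returning (one key per keyword argument).
sample = {"title": "Sample Product", "price": None}

print(missing_fields(sample))
```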
Advanced Tutorial - extracting list of items from a page
For more advanced usage, such as extracting repeated data from lists of items on a page, you can use the following approach:
- Initialize the Scraper class:
url = "https://www.example.com/products"
scraper = Scraper(url)
- Define the tags for the target section and the items within that section:
For repeated data extraction, define the target and items tags and pass them to the get_repeating_data_from_page() method.
- target - defines the Tag() for the area of the page containing the list of items.
- items - defines the Tag() for the repeated items within the target section, such as a product card in a product grid or list.
target_tag = Tag(name='div', cls='product-list')
item_tag = Tag(name='div', cls='product-item')
title_tag = Tag(name='h2', cls='product-title', output='text')
price_tag = Tag(name='span', cls='product-price', output='text')
link_tag = Tag(name='a', cls='product-link', output='href')
- Extract repeated data from the page:
products = scraper.get_repeating_data_from_page(
    target=target_tag,
    items=item_tag,
    title=title_tag,
    price=price_tag,
    link=link_tag
)
for product in products:
    print(product)
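Because the method is described as returning a list of dictionaries, the result can be written out with the standard csv module. This is a minimal sketch under that assumption: the helper name products_to_csv and the sample rows are hypothetical and stand in for real scraped data.

```python
import csv
import io


def products_to_csv(products, fieldnames):
    """Serialize a list of dictionaries to CSV text, one row per item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(products)
    return buffer.getvalue()


# Hypothetical rows, shaped like the output of get_repeating_data_from_page
# when called with title, price, and link tags.
sample_products = [
    {"title": "Widget", "price": "$9.99", "link": "/widget"},
    {"title": "Gadget", "price": "$19.99", "link": "/gadget"},
]

print(products_to_csv(sample_products, ["title", "price", "link"]))
```

To save the output to disk instead, the same writer can be pointed at a file opened with newline="".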
Detailed Description
Scraper Class
The Scraper class is the main class for initializing the scraper and handling the extraction of data from web pages.
Constructor
def __init__(self, webpage_url: str, user_agent: str = None, cookies: dict = None, clean: bool = False):
- webpage_url (str): URL of the webpage to scrape.
- user_agent (str): User agent string (optional).
- cookies (dict): Cookies for the request (optional).
- clean (bool): Flag to clean the URL (optional, default is False).
Methods
- _validate_url(self) -> bool: Validates the URL.
- _clean_url(self): Cleans the URL by removing query parameters.
- _get_soup(self, headers=None, cookies=None, clean=False) -> BeautifulSoup: Fetches the webpage content and returns a BeautifulSoup object.
- get_data_from_page(self, errors=False, **tags) -> dict: Extracts data based on the given tags and returns a dictionary.
- get_repeating_data_from_page(self, target: Tag = None, items: Tag = None, errors=False, info=False, **tags) -> list: Extracts data for multiple items and returns a list of dictionaries.
- get_tag(self, tag: Tag, errors=False): Extracts data for a single Tag object and returns a dictionary.
- get_all_tags(self, tags: list, errors=False): Extracts data for multiple Tag objects and returns a dictionary.
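To illustrate what URL validation and cleaning can look like, here is a minimal sketch built on the standard urllib.parse module. It is not the library's implementation: the function names is_valid_url and strip_query are hypothetical, and the rule that a valid URL needs an http or https scheme plus a host is an assumption.

```python
from urllib.parse import urlsplit


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host (assumed rule)."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def strip_query(url: str) -> str:
    """Return the URL with its query string removed."""
    return urlsplit(url)._replace(query="").geturl()


print(is_valid_url("https://www.example.com/products"))
print(strip_query("https://www.example.com/products?page=2&sort=asc"))
```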
Tag Class
The Tag class is used to define the HTML tags and attributes to be extracted from the web page.
Constructor
@dataclass
class Tag:
    name: str = 'div'
    output: str = 'text'
    cls: str = None
    id: str = None
    attrs: dict = None
- name (str): The name of the HTML tag (default is 'div').
- output (str): The type of data to extract (default is 'text').
- cls (str): The class attribute of the HTML tag (optional).
- id (str): The ID attribute of the HTML tag (optional).
- attrs (dict): Additional attributes for the HTML tag (optional).
Methods
- __post_init__(self): Validates the output type and sets the attributes.
- __str__(self): Returns a string representation of the Tag object.
- __repr__(self): Returns a string representation of the Tag object.
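The validation step in __post_init__ can be pictured with a stand-in dataclass. This is a sketch only: TagSketch is a hypothetical name, the allowed output values are limited to the two that appear in this documentation ('text' and 'href') and the real library may accept others, and merging cls and id into attrs is an assumed reading of "sets the attributes".

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only the output values shown in this documentation.
ALLOWED_OUTPUTS = {"text", "href"}


@dataclass
class TagSketch:
    name: str = "div"
    output: str = "text"
    cls: Optional[str] = None
    id: Optional[str] = None
    attrs: Optional[dict] = None

    def __post_init__(self):
        # Reject output types outside the assumed allowed set.
        if self.output not in ALLOWED_OUTPUTS:
            raise ValueError(f"Unsupported output type: {self.output!r}")
        # Fold cls and id into one attribute dictionary (assumed behavior).
        merged = dict(self.attrs or {})
        if self.cls:
            merged["class"] = self.cls
        if self.id:
            merged["id"] = self.id
        self.attrs = merged


title_tag = TagSketch(name="h1", cls="title")
print(title_tag.attrs)
```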
Helper Functions
- _get_random_user_agent(): Returns a random User-Agent string from a predefined list.
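Picking a random User-Agent from a fixed list needs little more than random.choice. The sketch below shows the idea; the function name pick_user_agent and the placeholder strings are hypothetical and are not the list shipped with dputils.

```python
import random

# Placeholder values for illustration only, not real browser strings.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder)",
    "ExampleBrowser/2.0 (placeholder)",
    "ExampleCrawler/0.1 (placeholder)",
]


def pick_user_agent() -> str:
    """Return one entry from the predefined list, chosen at random."""
    return random.choice(USER_AGENTS)


# Build a headers dictionary of the kind passed along with an HTTP request.
headers = {"User-Agent": pick_user_agent()}
print(headers)
```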
Extract Function
The extract function is used to extract data from a BeautifulSoup object based on the given tags.
def extract(dom_item, tags, data, errors):
- dom_item: The BeautifulSoup object to extract data from.
- tags: A dictionary of Tag objects.
- data: A dictionary to store the extracted data.
- errors: A flag to indicate whether to print errors.
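The general pattern behind such a function (loop over named tags, look each one up in the document, and record the result) can be sketched with the standard library alone. The example below uses xml.etree.ElementTree as a stand-in for BeautifulSoup so that it runs without extra dependencies. The name extract_sketch, the (name, class) tuples used in place of Tag objects, and the choice to store None for a missing element are all assumptions, not the library's behavior.

```python
import xml.etree.ElementTree as ET


def extract_sketch(dom_item, tags, data, errors=False):
    """Fill `data` with the text of each tag found under `dom_item`."""
    for key, (name, cls) in tags.items():
        # Find the first descendant with this tag name and class attribute.
        node = dom_item.find(f".//{name}[@class='{cls}']")
        if node is None:
            if errors:
                print(f"No match for {key!r}")
            data[key] = None  # assumed placeholder for a missing element
        else:
            data[key] = (node.text or "").strip()
    return data


# Hypothetical markup standing in for a fetched page.
page = ET.fromstring(
    "<div><h1 class='title'>Sample Item</h1><span class='price'>9.99</span></div>"
)
wanted = {"title": ("h1", "title"), "price": ("span", "price"), "stock": ("p", "stock")}

result = extract_sketch(page, wanted, {})
print(result)
```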
Example Usage
Here's a complete example of using the scraper module to extract data from a webpage:
from dputils.scrape import Scraper, Tag
url = "https://www.example.com"
scraper = Scraper(url)
title_tag = Tag(name='h1', cls='title', output='text')
price_tag = Tag(name='span', cls='price', output='text')
data = scraper.get_data_from_page(title=title_tag, price=price_tag)
print(data)
This documentation provides an overview of the scraper module in the dputils library, including basic and advanced tutorials, detailed descriptions of its classes and functions, and a complete usage example.