AsyncURLCrawler package

AsyncURLCrawler.crawler module

class AsyncURLCrawler.crawler.Crawler(seed_urls: List[str], parser: Parser, deep: bool = False, exact: bool = True, delay: float = 0)

Bases: object

Extracts URLs from target websites using a Breadth-First Search (BFS) algorithm.

Args:
seed_urls (List[str]):

Initial URLs to start crawling from. Each must follow a valid URL pattern, e.g., 'https://example.com'.

parser (Parser):

Instance of the Parser class, responsible for fetching and extracting URLs from a given URL.

deep (bool, optional):

If True, crawls all discovered URLs recursively. Defaults to False. Not recommended due to high resource usage.

exact (bool, optional):

If True, restricts crawling to URLs with the same subdomain as the seed URL. Ignored if 'deep' is True. Defaults to True.

delay (float, optional):

Time delay (in seconds) between requests to prevent overwhelming the target server. Defaults to 0.
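
A minimal construction sketch, assuming the import paths AsyncURLCrawler.parser.Parser and AsyncURLCrawler.crawler.Crawler documented in this reference; the parameter values are illustrative:

    from AsyncURLCrawler.parser import Parser
    from AsyncURLCrawler.crawler import Crawler

    # The Parser fetches pages and extracts links; the Crawler drives the BFS.
    parser = Parser(delay_start=0.1, max_retries=5, request_timeout=1)
    crawler = Crawler(
        seed_urls=["https://example.com"],  # must be valid absolute URLs
        parser=parser,
        deep=False,   # stay on the seed's site instead of crawling everything
        exact=True,   # restrict crawling to the seed's subdomain
        delay=0.5,    # wait 0.5 seconds between requests
    )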

async crawl() → Dict

Asynchronously crawls all seed URLs using BFS.

Returns:

Dict: A dictionary where each key is a seed URL and each value is a set of visited URLs for that seed.
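
A usage sketch, reusing the crawler constructed above and assuming crawl() is awaited inside an event loop:

    import asyncio

    async def main():
        results = await crawler.crawl()
        for seed, visited in results.items():
            print(seed, "->", len(visited), "URLs visited")

    asyncio.run(main())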

get_visited_urls() → Dict

Returns the URLs that have been visited, keyed by seed URL.

Returns:

Dict: A dictionary where each key is a seed URL and each value is a set of visited URLs for that seed.
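
For example, after crawl() has completed on the crawler above (a sketch):

    visited = crawler.get_visited_urls()
    for seed, urls in visited.items():
        print(f"{seed}: {len(urls)} URLs")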

async yielded_crawl() → str

Asynchronously crawls seed URLs using BFS and yields each visited URL.

Yields:

str: Each URL as it is visited.
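
A sketch of consuming the generator with async for, assuming yielded_crawl() behaves as described above:

    import asyncio

    async def stream():
        async for url in crawler.yielded_crawl():
            print("visited:", url)

    asyncio.run(stream())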

AsyncURLCrawler.parser module

class AsyncURLCrawler.parser.Parser(delay_start: float = 0.1, max_retries: int = 5, request_timeout: float = 1, user_agent: str = 'Mozilla/5.0')

Bases: object

Fetches a URL, parses its HTML content, and extracts URLs from <a> tags. Implements exponential backoff for retrying failed requests.

Args:
delay_start (float, optional):

Initial delay in the exponential backoff strategy. Defaults to 0.1 seconds.

max_retries (int, optional):

Maximum number of retry attempts. Defaults to 5.

request_timeout (float, optional):

Timeout for each HTTP request in seconds. Defaults to 1 second.

user_agent (str, optional):

User-Agent string for HTTP request headers. Defaults to 'Mozilla/5.0'.
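
A construction sketch based on the signature above; the non-default values are illustrative:

    from AsyncURLCrawler.parser import Parser

    # Retry up to 3 times, starting the backoff at 0.2 s, with a 2 s request timeout.
    parser = Parser(
        delay_start=0.2,
        max_retries=3,
        request_timeout=2,
        user_agent="Mozilla/5.0",
    )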

async probe(url: str) → List[str]

Fetches a URL and extracts URLs from its content, retrying with exponential backoff on failure.

Args:
url (str):

The URL to probe.

Returns:

List[str]: A list of extracted URLs. Returns an empty list if the fetch fails after retries.
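
A minimal sketch, assuming probe() is awaited directly on a Parser instance:

    import asyncio
    from AsyncURLCrawler.parser import Parser

    async def main():
        parser = Parser()
        parser.reset()  # clear backoff state before fetching (see reset() below)
        links = await parser.probe("https://example.com")
        print(len(links), "URLs extracted")

    asyncio.run(main())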

reset()

Resets the backoff state for a new URL fetch attempt. Must be called before each new URL fetch.
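
For example, when probing several URLs with a single Parser instance (a sketch following the requirement above):

    async def probe_all(parser, urls):
        results = {}
        for url in urls:
            parser.reset()                  # reset backoff state before each fetch
            results[url] = await parser.probe(url)
        return results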

AsyncURLCrawler.url_utils module

exception AsyncURLCrawler.url_utils.InvalidURL(url)

Bases: Exception

Raised when a URL does not match the expected pattern.

Args:
url (str):

The invalid URL.

Attributes:

message (str): Explanation of the error.
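
A sketch of constructing and handling the exception; the malformed URL here is illustrative:

    from AsyncURLCrawler.url_utils import InvalidURL

    try:
        raise InvalidURL("htp:/broken")
    except InvalidURL as exc:
        print(exc.message)  # explanation of the error, per the Attributes above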

AsyncURLCrawler.url_utils.have_exact_domain(url1: str, url2: str) → bool

Checks if two URLs share the exact same domain.

Args:
url1 (str):

The first URL.

url2 (str):

The second URL.

Returns:

bool: True if both URLs have the exact same domain including subdomains.
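
For example (a sketch; both URLs point at the same host, so the check is expected to pass):

    from AsyncURLCrawler.url_utils import have_exact_domain

    # Same host on both sides -> expected True.
    print(have_exact_domain("https://example.com/a", "https://example.com/b"))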

AsyncURLCrawler.url_utils.have_exact_subdomain(url1: str, url2: str) → bool

Checks if two URLs share the same subdomain.

Args:
url1 (str):

The first URL.

url2 (str):

The second URL.

Returns:

bool: True if both URLs have the same subdomain, domain, and suffix.
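
For example (a sketch; subdomain, domain, and suffix are identical here, so the check is expected to pass):

    from AsyncURLCrawler.url_utils import have_exact_subdomain

    # Identical subdomain, domain, and suffix -> expected True.
    print(have_exact_subdomain("https://docs.example.com/a", "https://docs.example.com/b"))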

AsyncURLCrawler.url_utils.normalize_url(url: str, base_url: str) → str

Converts a relative URL to an absolute URL based on a base URL.

Args:
url (str):

The URL to normalize, which may be relative.

base_url (str):

The base URL to resolve relative URLs against.

Returns:

str: The absolute URL.
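
For example (a sketch; the exact output depends on the resolution rules, but a result like "https://example.com/about" is expected):

    from AsyncURLCrawler.url_utils import normalize_url

    # Resolve a relative link found on a page against that page's URL.
    print(normalize_url("/about", "https://example.com/blog"))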

AsyncURLCrawler.url_utils.validate_url(url: str) → bool

Validates a single URL against a predefined regex pattern.

Args:
url (str):

The URL to validate.

Returns:

bool: True if the URL matches the pattern, otherwise False.
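
For example (a sketch; the regex itself is internal, but these inputs illustrate the intent):

    from AsyncURLCrawler.url_utils import validate_url

    print(validate_url("https://example.com"))  # expected True
    print(validate_url("not-a-url"))            # expected False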

AsyncURLCrawler.url_utils.validate_urls(urls: List[str]) → None

Validates a list of URLs against a predefined regex pattern.

Args:
urls (List[str]):

A list of URLs to validate.

Raises:

InvalidURL: If any URL does not match the pattern.
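
A sketch of guarding seed URLs before handing them to the Crawler:

    from AsyncURLCrawler.url_utils import InvalidURL, validate_urls

    seeds = ["https://example.com", "https://example.org"]
    try:
        validate_urls(seeds)  # raises InvalidURL on any malformed entry
    except InvalidURL as exc:
        print("bad seed URL:", exc)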