AsyncURLCrawler package

AsyncURLCrawler.crawler module

class AsyncURLCrawler.crawler.Crawler(seed_urls: List[str], parser: Parser, deep: bool = False, exact: bool = True, delay: float = 0)

Bases: object

Extracts URLs from target websites using a Breadth-First Search (BFS) algorithm.

Args:
seed_urls (List[str]):

Initial URLs to start crawling from. Each must follow a valid URL pattern, e.g., 'https://example.com'.

parser (Parser):

Instance of the Parser class, responsible for fetching and extracting URLs from a given URL.

deep (bool, optional):

If True, crawls all discovered URLs recursively. Defaults to False. Not recommended due to high resource usage.

exact (bool, optional):

If True, restricts crawling to URLs with the same subdomain as the seed URL. Ignored if 'deep' is True. Defaults to True.

delay (float, optional):

Time delay (in seconds) between requests to prevent overwhelming the target server. Defaults to 0.
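
A minimal construction sketch, assuming the import paths AsyncURLCrawler.parser.Parser and AsyncURLCrawler.crawler.Crawler documented in this reference; the parameter values are illustrative:

    from AsyncURLCrawler.parser import Parser
    from AsyncURLCrawler.crawler import Crawler

    # The Parser fetches pages and extracts links; the Crawler drives the BFS.
    parser = Parser(delay_start=0.1, max_retries=5, request_timeout=1)
    crawler = Crawler(
        seed_urls=["https://example.com"],  # must be valid absolute URLs
        parser=parser,
        deep=False,   # stay on the seed's site instead of crawling everything
        exact=True,   # restrict crawling to the seed's subdomain
        delay=0.5,    # wait 0.5 seconds between requests
    )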

async crawl() → Dict

Asynchronously crawls all seed URLs using BFS.

Returns:

Dict: A dictionary where each key is a seed URL and each value is a set of visited URLs for that seed.
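
A usage sketch, reusing the crawler constructed above and assuming crawl() is awaited inside an event loop:

    import asyncio

    async def main():
        results = await crawler.crawl()
        for seed, visited in results.items():
            print(seed, "->", len(visited), "URLs visited")

    asyncio.run(main())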

get_visited_urls() → Dict

Returns the URLs that have been visited, keyed by seed URL.

Returns:

Dict: A dictionary where each key is a seed URL and each value is a set of visited URLs for that seed.
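
For example, after crawl() has completed on the crawler above (a sketch):

    visited = crawler.get_visited_urls()
    for seed, urls in visited.items():
        print(f"{seed}: {len(urls)} URLs")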

async yielded_crawl() → str

Asynchronously crawls seed URLs using BFS and yields each visited URL.

Yields:

str: Each URL as it is visited.
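
A sketch of consuming the generator with async for, assuming yielded_crawl() behaves as described above:

    import asyncio

    async def stream():
        async for url in crawler.yielded_crawl():
            print("visited:", url)

    asyncio.run(stream())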

AsyncURLCrawler.parser module

class AsyncURLCrawler.parser.Parser(delay_start: float = 0.1, max_retries: int = 5, request_timeout: float = 1, user_agent: str = 'Mozilla/5.0')

Bases: object

Fetches a URL, parses its HTML content, and extracts URLs from <a> tags. Implements exponential backoff for retrying failed requests.

Args:
delay_start (float, optional):

Initial delay in the exponential backoff strategy. Defaults to 0.1 seconds.

max_retries (int, optional):

Maximum number of retry attempts. Defaults to 5.

request_timeout (float, optional):

Timeout for each HTTP request in seconds. Defaults to 1 second.

user_agent (str, optional):

User-Agent string for HTTP request headers. Defaults to 'Mozilla/5.0'.
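
A construction sketch based on the signature above; the non-default values are illustrative:

    from AsyncURLCrawler.parser import Parser

    # Retry up to 3 times, starting the backoff at 0.2 s, with a 2 s request timeout.
    parser = Parser(
        delay_start=0.2,
        max_retries=3,
        request_timeout=2,
        user_agent="Mozilla/5.0",
    )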

async probe(url: str) → List[str]

Fetches a URL and extracts URLs from its content, retrying with exponential backoff on failure.

Args:
url (str):

The URL to probe.

Returns:

List[str]: A list of extracted URLs. Returns an empty list if the fetch fails after retries.
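
A minimal sketch, assuming probe() is awaited directly on a Parser instance:

    import asyncio
    from AsyncURLCrawler.parser import Parser

    async def main():
        parser = Parser()
        parser.reset()  # clear backoff state before fetching (see reset() below)
        links = await parser.probe("https://example.com")
        print(len(links), "URLs extracted")

    asyncio.run(main())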

reset()

Resets the backoff state for a new URL fetch attempt. Must be called before each new URL fetch.
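
For example, when probing several URLs with a single Parser instance (a sketch following the requirement above):

    async def probe_all(parser, urls):
        results = {}
        for url in urls:
            parser.reset()                  # reset backoff state before each fetch
            results[url] = await parser.probe(url)
        return results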

AsyncURLCrawler.url_utils module

exception AsyncURLCrawler.url_utils.InvalidURL(url)

Bases: Exception

Raised when a URL does not match the expected pattern.

Args:
url (str):

The invalid URL.

Attributes:

message (str): Explanation of the error.
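
A sketch of constructing and handling the exception; the malformed URL here is illustrative:

    from AsyncURLCrawler.url_utils import InvalidURL

    try:
        raise InvalidURL("htp:/broken")
    except InvalidURL as exc:
        print(exc.message)  # explanation of the error, per the Attributes above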

AsyncURLCrawler.url_utils.have_exact_domain(url1: str, url2: str) → bool

Checks if two URLs share the exact same domain.

Args:
url1 (str):

The first URL.

url2 (str):

The second URL.

Returns:

bool: True if both URLs have the exact same domain including subdomains.
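
For example (a sketch; both URLs point at the same host, so the check is expected to pass):

    from AsyncURLCrawler.url_utils import have_exact_domain

    # Same host on both sides -> expected True.
    print(have_exact_domain("https://example.com/a", "https://example.com/b"))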

AsyncURLCrawler.url_utils.have_exact_subdomain(url1: str, url2: str) → bool

Checks if two URLs share the same subdomain.

Args:
url1 (str):

The first URL.

url2 (str):

The second URL.

Returns:

bool: True if both URLs have the same subdomain, domain, and suffix.
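
For example (a sketch; subdomain, domain, and suffix are identical here, so the check is expected to pass):

    from AsyncURLCrawler.url_utils import have_exact_subdomain

    # Identical subdomain, domain, and suffix -> expected True.
    print(have_exact_subdomain("https://docs.example.com/a", "https://docs.example.com/b"))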

AsyncURLCrawler.url_utils.normalize_url(url: str, base_url: str) → str

Converts a relative URL to an absolute URL based on a base URL.

Args:
url (str):

The URL to normalize, which may be relative.

base_url (str):

The base URL to resolve relative URLs against.

Returns:

str: The absolute URL.
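
For example (a sketch; the exact output depends on the resolution rules, but a result like "https://example.com/about" is expected):

    from AsyncURLCrawler.url_utils import normalize_url

    # Resolve a relative link found on a page against that page's URL.
    print(normalize_url("/about", "https://example.com/blog"))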

AsyncURLCrawler.url_utils.validate_url(url: str) → bool

Validates a single URL against a predefined regex pattern.

Args:
url (str):

The URL to validate.

Returns:

bool: True if the URL matches the pattern, otherwise False.
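
For example (a sketch; the regex itself is internal, but these inputs illustrate the intent):

    from AsyncURLCrawler.url_utils import validate_url

    print(validate_url("https://example.com"))  # expected True
    print(validate_url("not-a-url"))            # expected False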

AsyncURLCrawler.url_utils.validate_urls(urls: List[str]) → None

Validates a list of URLs against a predefined regex pattern.

Args:
urls (List[str]):

A list of URLs to validate.

Raises:

InvalidURL: If any URL does not match the pattern.
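
A sketch of guarding seed URLs before handing them to the Crawler:

    from AsyncURLCrawler.url_utils import InvalidURL, validate_urls

    seeds = ["https://example.com", "https://example.org"]
    try:
        validate_urls(seeds)  # raises InvalidURL on any malformed entry
    except InvalidURL as exc:
        print("bad seed URL:", exc)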