AsyncURLCrawler package
AsyncURLCrawler.crawler module
- class AsyncURLCrawler.crawler.Crawler(seed_urls: List[str], parser: Parser, deep: bool = False, exact: bool = True, delay: float = 0)
Bases: object
Extracts URLs from target websites using a Breadth-First Search (BFS) algorithm.
- Args:
- seed_urls (List[str]):
Initial URLs to start crawling. Must follow a valid URL pattern, e.g., 'https://example.com'.
- parser (Parser):
Instance of the Parser class, responsible for fetching and extracting URLs from a given URL.
- deep (bool, optional):
If True, crawls all discovered URLs recursively. Defaults to False. Not recommended due to high resource usage.
- exact (bool, optional):
If True, restricts crawling to URLs with the same subdomain as the seed URL. Ignored if 'deep' is True. Defaults to True.
- delay (float, optional):
Time delay (in seconds) between requests to prevent overwhelming the target server. Defaults to 0.
- async crawl() → Dict
Asynchronously crawls all seed URLs using BFS.
- Returns:
Dict: A dictionary where each key is a seed URL and each value is a set of visited URLs for that seed.
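- Example:
A minimal usage sketch based on the signatures above; the seed URL and parameter values are illustrative:

import asyncio

from AsyncURLCrawler.crawler import Crawler
from AsyncURLCrawler.parser import Parser


async def main() -> None:
    parser = Parser(delay_start=0.1, max_retries=5, request_timeout=1)
    crawler = Crawler(
        seed_urls=["https://example.com"],  # illustrative seed URL
        parser=parser,
        deep=False,  # do not follow external URLs recursively
        exact=True,  # restrict crawling to the seed's subdomain
        delay=0.5,   # half a second between requests
    )
    result = await crawler.crawl()
    # The same mapping is also available afterwards via crawler.get_visited_urls().
    for seed, visited in result.items():
        print(f"{seed}: {len(visited)} URLs visited")


if __name__ == "__main__":
    asyncio.run(main())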
- get_visited_urls() → Dict
Returns the visited URLs.
- Returns:
Dict: A dictionary where each key is a seed URL and each value is a set of visited URLs for that seed.
- async yielded_crawl() → str
Asynchronously crawls seed URLs using BFS and yields each visited URL.
- Yields:
str: Each URL as it is visited.
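- Example:
A sketch of streaming results with yielded_crawl(), under the same illustrative assumptions as the crawl() example above:

import asyncio

from AsyncURLCrawler.crawler import Crawler
from AsyncURLCrawler.parser import Parser


async def main() -> None:
    crawler = Crawler(
        seed_urls=["https://example.com"],  # illustrative seed URL
        parser=Parser(),
        exact=True,
        delay=0.5,
    )
    # Consume URLs as they are visited instead of waiting for the whole crawl to finish.
    async for url in crawler.yielded_crawl():
        print(url)


if __name__ == "__main__":
    asyncio.run(main())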
AsyncURLCrawler.parser module
- class AsyncURLCrawler.parser.Parser(delay_start: float = 0.1, max_retries: int = 5, request_timeout: float = 1, user_agent: str = 'Mozilla/5.0')
Bases: object
Fetches a URL, parses its HTML content, and extracts URLs from <a> tags. Implements exponential backoff for retrying failed requests.
- Args:
- delay_start (float, optional):
Initial delay in the exponential backoff strategy. Defaults to 0.1 seconds.
- max_retries (int, optional):
Maximum number of retry attempts. Defaults to 5.
- request_timeout (float, optional):
Timeout for each HTTP request in seconds. Defaults to 1 second.
- user_agent (str, optional):
User-Agent string for HTTP request headers. Defaults to 'Mozilla/5.0'.
- async probe(url: str) → List[str]
Fetches a URL and extracts URLs using an exponential backoff strategy on failures.
- Args:
- url (str):
The URL to probe.
- Returns:
List[str]: A list of extracted URLs. Returns an empty list if the fetch fails after retries.
- reset()
Resets the backoff state for a new URL fetch attempt. Must be called before each new URL fetch.
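- Example:
A sketch of using Parser directly, calling reset() before each probe() as required; the URLs and parameter values are illustrative:

import asyncio

from AsyncURLCrawler.parser import Parser


async def main() -> None:
    parser = Parser(delay_start=0.1, max_retries=3, request_timeout=2)
    for url in ["https://example.com", "https://example.org"]:  # illustrative URLs
        parser.reset()  # clear the backoff state before fetching a new URL
        links = await parser.probe(url)
        print(f"{url}: extracted {len(links)} URLs")


if __name__ == "__main__":
    asyncio.run(main())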
AsyncURLCrawler.url_utils module
- exception AsyncURLCrawler.url_utils.InvalidURL(url)
Bases: Exception
Raised when a URL does not match the expected pattern.
- Args:
- url (str):
The invalid URL.
- Attributes:
message (str): Explanation of the error.
- AsyncURLCrawler.url_utils.have_exact_domain(url1: str, url2: str) → bool
Checks if two URLs share the exact same domain.
- Args:
- url1 (str):
The first URL.
- url2 (str):
The second URL.
- Returns:
bool: True if both URLs have the exact same domain including subdomains.
- AsyncURLCrawler.url_utils.have_exact_subdomain(url1: str, url2: str) → bool
Checks if two URLs share the same subdomain.
- Args:
- url1 (str):
The first URL.
- url2 (str):
The second URL.
- Returns:
bool: True if both URLs have the same subdomain, domain, and suffix.
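- Example:
An illustrative comparison of the two checks; the exact results depend on how the library splits subdomain, domain, and suffix:

from AsyncURLCrawler.url_utils import have_exact_domain, have_exact_subdomain

url_a = "https://blog.example.com/post"  # illustrative URLs
url_b = "https://example.com/about"

print(have_exact_domain(url_a, url_b))
print(have_exact_subdomain(url_a, url_b))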
- AsyncURLCrawler.url_utils.normalize_url(url: str, base_url: str) → str
Converts a relative URL to an absolute URL based on a base URL.
- Args:
- url (str):
The URL to normalize, which may be relative.
- base_url (str):
The base URL to resolve relative URLs against.
- Returns:
str: The absolute URL.
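- Example:
A sketch of resolving a relative link found on a page against that page's URL; the values are illustrative:

from AsyncURLCrawler.url_utils import normalize_url

absolute = normalize_url("/about", "https://example.com/index.html")
print(absolute)  # expected to be an absolute URL such as https://example.com/about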
- AsyncURLCrawler.url_utils.validate_url(url: str) → bool
Validates a single URL against a predefined regex pattern.
- Args:
- url (str):
The URL to validate.
- Returns:
bool: True if the URL matches the pattern, otherwise False.
- AsyncURLCrawler.url_utils.validate_urls(urls: List[str]) → None
Validates a list of URLs against a predefined regex pattern.
- Args:
- urls (List[str]):
A list of URLs to validate.
- Raises:
InvalidURL: If any URL does not match the pattern.
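- Example:
A sketch of validating URLs and handling the InvalidURL exception; the URLs are illustrative:

from AsyncURLCrawler.url_utils import InvalidURL, validate_url, validate_urls

print(validate_url("https://example.com"))  # True for a well-formed URL
print(validate_url("not-a-url"))            # False

try:
    validate_urls(["https://example.com", "not-a-url"])
except InvalidURL as exc:
    print(exc.message)  # explanation of which URL failed validation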