Beautiful Soup
Beautiful Soup is a Python library for pulling data out of HTML and XML files, widely used for web scraping and screen scraping tasks. It provides a parse tree API with simple methods for navigating, searching, and modifying parsed HTML/XML documents. Beautiful Soup automatically handles encoding, supports multiple parsers (html.parser, lxml, html5lib), and integrates with CSS selectors via the Soup Sieve library. Current stable version is 4.14.3.
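A minimal sketch of the workflow described above: parse a document, then navigate and search the resulting tree. The sample markup and variable names are invented for illustration, and the built-in html.parser is used so that nothing beyond Beautiful Soup itself is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this example.
html = """
<html><head><title>Sample page</title></head>
<body>
  <p class="intro">Hello, <b>world</b>!</p>
  <a href="/one">First</a>
  <a href="/two">Second</a>
</body></html>
"""

# Build the parse tree with the standard-library parser.
soup = BeautifulSoup(html, "html.parser")

# Attribute-style navigation: the first <title> tag and its text.
title = soup.title.string

# Search by tag name and class, then collect all text inside the match.
intro = soup.find("p", class_="intro").get_text()

# find_all() returns every match; tag attributes are read like dict keys.
links = [a["href"] for a in soup.find_all("a")]

print(title, intro, links)
```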
APIs
Beautiful Soup
Beautiful Soup 4 is a Python library providing a parse tree API for HTML and XML documents. It exposes Tag, NavigableString, BeautifulSoup, and Comment objects with navigation m...
Features
Supports html.parser (built-in), lxml (fast), and html5lib (browser-like) parsers for flexible HTML/XML parsing.
CSS selector support via the Soup Sieve library, covering most of the CSS Selectors Level 4 draft, exposed through select() and select_one().
Rich API for searching and navigating the parse tree upward, downward, and sideways, including find(), find_all(), .parents, .children, and .next_siblings/.previous_siblings.
Automatically detects document encoding with the bundled Unicode, Dammit module and converts input to Unicode, so extracted text is decoded correctly in most cases.
Full tree modification support including append, insert, extract, decompose, replace_with, wrap, and unwrap operations.
Multiple output options: prettify() for indented markup, get_text() for text only, and built-in or custom formatters that control entity substitution during serialization.
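The features listed above can be sketched together in one short example: CSS selection, sideways and upward navigation, tree modification, and output. The markup is a made-up fragment, and html.parser is assumed; other parsers may wrap the fragment in extra html and body tags.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, invented for illustration.
html = (
    '<div id="main"><ul><li class="item">A</li><li class="item">B</li></ul>'
    "<script>track()</script><p>Keep <i>this</i> text</p></div>"
)
soup = BeautifulSoup(html, "html.parser")

# CSS selectors (delegated to Soup Sieve).
items = [li.get_text() for li in soup.select("ul > li.item")]

# Sideways and upward navigation from the first matching element.
first = soup.select_one("li.item")
second = first.find_next_sibling("li")
parent_names = [p.name for p in first.parents]

# Tree modification.
soup.script.decompose()       # remove the tag and destroy its contents
soup.i.unwrap()               # drop the <i> tag but keep its text
new_li = soup.new_tag("li")
new_li.string = "C"
soup.ul.append(new_li)        # add a new last child to the list

# Output: compact markup, indented markup, or text only.
compact = str(soup)
pretty = soup.prettify()
text = soup.p.get_text()
```

Note that decompose() discards the removed tag, while extract() would return it for reuse elsewhere in the tree.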
Use Cases
Extract data from websites by parsing HTML pages and navigating the resulting parse tree to find target elements.
Mine structured data from HTML tables, lists, and other markup patterns across large numbers of web pages.
Extract article text, product information, or other content from web pages for NLP pipelines and data analysis.
Automate data extraction from legacy HTML web interfaces that lack modern APIs.
Parse and clean HTML documents by removing unwanted tags, scripts, and formatting.
Parse and query XML documents using Beautiful Soup's tree navigation and search capabilities.
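Two of the use cases above, table mining and HTML cleaning, can be sketched as small helpers. The function names table_to_records and clean_html are hypothetical, not part of Beautiful Soup, and the table helper assumes a flat table (no nested tables, no rowspan or colspan) whose first row holds the headers.

```python
from bs4 import BeautifulSoup

def table_to_records(markup, table_id):
    """Turn a simple HTML table into a list of dicts (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    # Assumption: the first row contains the column headers.
    headers = [c.get_text(strip=True) for c in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [c.get_text(strip=True) for c in row.find_all("td")]
        # Skip rows whose cell count does not match the header row.
        if len(cells) == len(headers):
            records.append(dict(zip(headers, cells)))
    return records

def clean_html(markup):
    """Drop script and style tags and return the remaining text (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    return soup.get_text(" ", strip=True)

# Made-up sample inputs.
table_html = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>4.50</td></tr>
  <tr><td>Gadget</td><td>12.00</td></tr>
</table>
"""
page_html = "<div><script>x()</script><p>Hello</p><style>p{}</style><p>World</p></div>"

records = table_to_records(table_html, "prices")
cleaned = clean_html(page_html)
```

Cell values come back as strings; converting prices or dates to typed values is left to the caller.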
Integrations
Python HTTP library used in combination with Beautiful Soup to fetch and parse web pages.
Python web crawling framework whose response bodies can be passed to Beautiful Soup for content extraction.
lxml: fast XML and HTML parsing library used as an alternate parser backend for Beautiful Soup.
html5lib: pure-Python HTML5 parser used with Beautiful Soup for browser-compatible HTML parsing.
DataFrame library commonly used with Beautiful Soup to convert scraped HTML tables into structured data.
Browser automation tool used with Beautiful Soup to scrape JavaScript-rendered pages.
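Because the alternate parser backends listed above are optional installs, code that prefers them usually needs a fallback. The sketch below tries each parser in turn and falls back to the built-in html.parser; make_soup and its preferred parameter are hypothetical names chosen for this example. In practice the markup argument would be whatever string or bytes an HTTP client or browser automation tool returned.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first installed parser (hypothetical helper).

    Returns the soup together with the name of the parser that was used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is not installed.
            continue
    raise RuntimeError("no usable parser found")

# Made-up input standing in for a fetched page body.
soup, used = make_soup("<p>Hi</p>")
print(used, soup.p.get_text())
```

Different parsers can build different trees from the same invalid markup, so pinning one parser is the safer choice when results must be reproducible across machines.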