Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files, widely used for web scraping and screen scraping tasks. It provides a parse tree API with simple methods for navigating, searching, and modifying parsed HTML/XML documents. Beautiful Soup automatically handles encoding, supports multiple parsers (html.parser, lxml, html5lib), and integrates with CSS selectors via the Soup Sieve library. Current stable version is 4.14.3.


Tags: Data Extraction, HTML Parsing, Python, Scraping, Web Scraping, XML Parsing

APIs

Beautiful Soup

Beautiful Soup 4 is a Python library providing a parse tree API for HTML and XML documents. It exposes Tag, NavigableString, BeautifulSoup, and Comment objects with navigation m...

Features

Multi-Parser Support

Supports html.parser (built-in), lxml (fast), and html5lib (browser-like) parsers for flexible HTML/XML parsing.
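As a minimal sketch of parser selection, the snippet below tries several parser names in order and falls back to the built-in html.parser. The helper name make_soup and the preference order are illustrative assumptions, not part of Beautiful Soup itself; lxml and html5lib are only used if they happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: build a soup with the first available parser."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # The requested parser package is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used_parser = make_soup("<p>Hello, <b>world</b></p>")
bold_text = soup.b.get_text()
print(used_parser, bold_text)
```

Because different parsers repair invalid markup differently, naming the parser explicitly keeps results reproducible across machines.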

CSS Selector Support

CSS selector support via the Soup Sieve library, covering most selectors through the CSS Selectors Level 4 drafts, for familiar CSS-based element selection.
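A short sketch of CSS-based selection with select() and select_one(), which delegate to Soup Sieve. The markup and the ids and class names in it are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the id and class names are assumptions for this example.
html = """
<ul id="menu">
  <li class="item active"><a href="/home">Home</a></li>
  <li class="item"><a href="/docs">Docs</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every match for the selector.
links = [a["href"] for a in soup.select("ul#menu li.item > a")]

# select_one() returns only the first match (or None if nothing matches).
active_label = soup.select_one("li.active a").get_text()

print(links, active_label)
```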

Tree Navigation API

Rich API for navigating the parse tree upward, downward, and sideways including find(), find_all(), parents, children, and siblings.
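The navigation calls named above can be sketched on a small document as follows. The markup is invented for the example, and it contains no whitespace between tags so that the child list holds only Tag objects.

```python
from bs4 import BeautifulSoup

html = "<div id='box'><h1>Title</h1><p class='lead'>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Searching: find() returns the first match, find_all() returns every match.
lead = soup.find("p", class_="lead")
paragraph_count = len(soup.find_all("p"))

# Upward: .parent is the enclosing tag, .parents iterates over all ancestors.
parent_id = lead.parent["id"]
nearest_ancestor = next(iter(lead.parents)).name

# Sideways: move to the next sibling that is a <p> tag.
next_text = lead.find_next_sibling("p").get_text()

# Downward: .children iterates over direct children only.
child_names = [child.name for child in soup.div.children]

print(parent_id, next_text, child_names)
```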

Automatic Encoding Detection

Automatically detects document encoding and converts text to Unicode using the bundled Unicode, Dammit component (the UnicodeDammit class), so extracted text decodes correctly.
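A minimal sketch of using the UnicodeDammit class directly on bytes of uncertain encoding. The sample bytes and the suggested encoding list are assumptions chosen for the example; with real pages the detected encoding depends on the content and on which detection libraries are installed.

```python
from bs4 import UnicodeDammit

# Bytes encoded as Latin-1; in practice these would come from a file or HTTP response.
raw = "Sacré bleu!".encode("latin-1")

# The second argument is a list of encodings to try first.
dammit = UnicodeDammit(raw, ["latin-1"])

decoded = dammit.unicode_markup
print(decoded)
print(dammit.original_encoding)
```

When parsing a whole document, passing from_encoding to the BeautifulSoup constructor is another way to state a known encoding rather than relying on detection.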

Tree Modification

Full tree modification support including append, insert, extract, decompose, replace_with, wrap, and unwrap operations.
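Several of the modification operations listed above can be sketched together. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><p>Keep <i>this</i></p><script>track()</script><span>old</span></div>"
soup = BeautifulSoup(html, "html.parser")

# decompose(): remove a tag and its contents from the tree and destroy it.
soup.script.decompose()

# replace_with(): swap one element for another.
bold = soup.new_tag("b")
bold.string = "new"
soup.span.replace_with(bold)

# unwrap(): drop the tag itself but keep its contents in place.
soup.i.unwrap()

# append(): add a new child at the end of a tag.
footer = soup.new_tag("footer")
footer.string = "end"
soup.div.append(footer)

result = str(soup)
print(result)
```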

Output Formatting

Output methods such as prettify() and get_text(), plus built-in and custom formatters that control entity escaping during serialization.
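The sketch below contrasts text extraction, pretty-printing, and two formatter settings on the same invented markup. It uses the formatter names "minimal" and None, which control how characters such as the ampersand are escaped on output.

```python
from bs4 import BeautifulSoup

html = "<div><p>Hello</p><p>World &amp; friends</p></div>"
soup = BeautifulSoup(html, "html.parser")

# get_text(): text only, joined with a separator, each string stripped.
text = soup.get_text(" ", strip=True)

# prettify(): indented markup, one tag or string per line.
pretty = soup.prettify()

# formatter="minimal" escapes only what is needed for valid markup.
minimal = soup.decode(formatter="minimal")

# formatter=None performs no escaping, which may yield invalid markup.
unescaped = soup.decode(formatter=None)

print(text)
print(pretty)
```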

Use Cases

Web Scraping

Extract data from websites by parsing HTML pages with Beautiful Soup and navigating the resulting parse tree to find target elements.

Data Mining

Mine structured data from HTML tables, lists, and other markup patterns across large numbers of web pages.
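As one possible sketch of table mining, the snippet below turns an HTML table into a list of dictionaries keyed by the header row. The table contents are invented, and the approach assumes a simple table with one header row and no merged cells.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Tea</td><td>3.50</td></tr>
<tr><td>Coffee</td><td>4.00</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

# The first row supplies the column names.
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# Each remaining row becomes one record; values stay as strings.
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
    for row in rows[1:]
]

print(records)
```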

Content Extraction

Extract article text, product information, or other content from web pages for NLP pipelines and data analysis.

Screen Scraping Legacy Systems

Automate data extraction from legacy HTML web interfaces that lack modern APIs.

HTML Sanitization

Parse and clean HTML documents by removing unwanted tags, scripts, and formatting.
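A rough sketch of the cleanup idea: drop script and style elements, then strip every attribute that is not on an allow-list. The allow-list and the markup are assumptions for illustration, and this is not a security-grade sanitizer; untrusted HTML generally calls for a dedicated sanitizing library.

```python
from bs4 import BeautifulSoup

# Hypothetical allow-list of attributes to keep.
ALLOWED_ATTRS = {"href"}

html = (
    '<div onclick="x()"><script>alert(1)</script>'
    '<p style="color:red">Hi <b>there</b></p></div>'
)
soup = BeautifulSoup(html, "html.parser")

# Remove unwanted elements together with their contents.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

# find_all(True) matches every remaining tag; filter its attributes.
for tag in soup.find_all(True):
    tag.attrs = {k: v for k, v in tag.attrs.items() if k in ALLOWED_ATTRS}

cleaned = str(soup)
print(cleaned)
```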

XML Processing

Parse and query XML documents using Beautiful Soup's tree navigation and search capabilities.

Integrations

Requests

Python HTTP library used in combination with Beautiful Soup to fetch and parse web pages.

Scrapy

Python web crawling framework; spider callbacks can pass response bodies to Beautiful Soup for content extraction.

lxml

Fast XML and HTML parsing library used as an alternate parser backend for Beautiful Soup.

html5lib

Pure-Python HTML5 parser used with Beautiful Soup for browser-compatible HTML parsing.

Pandas

DataFrame library commonly used with Beautiful Soup to convert scraped HTML tables into structured data.

Selenium

Browser automation tool used with Beautiful Soup to scrape JavaScript-rendered pages.

Resources

GitHubOrganization

Sources

aid: beautiful-soup
name: Beautiful Soup
description: >-
  Beautiful Soup is a Python library for pulling data out of HTML and XML files,
  widely used for web scraping and screen scraping tasks. It provides a parse tree
  API with simple methods for navigating, searching, and modifying parsed HTML/XML
  documents. Beautiful Soup automatically handles encoding, supports multiple parsers
  (html.parser, lxml, html5lib), and integrates with CSS selectors via the Soup Sieve
  library. Current stable version is 4.14.3.
type: Index
image: https://kinlane-productions.s3.amazonaws.com/apis-json/apis-json-logo.jpg
tags:
  - Data Extraction
  - HTML Parsing
  - Python
  - Scraping
  - Web Scraping
  - XML Parsing
url: >-
  https://raw.githubusercontent.com/api-evangelist/beautiful-soup/refs/heads/main/apis.yml
created: '2026-03-29'
modified: '2026-04-19'
specificationVersion: '0.19'
apis:
  - aid: beautiful-soup:beautiful-soup
    name: Beautiful Soup
    description: >-
      Beautiful Soup 4 is a Python library providing a parse tree API for HTML and
      XML documents. It exposes Tag, NavigableString, BeautifulSoup, and Comment
      objects with navigation methods (find, find_all, CSS selectors), tree traversal
      (parents, children, siblings), and modification methods (append, extract, replace).
      Supports html.parser, lxml, and html5lib parsers with automatic encoding detection.
    humanURL: https://www.crummy.com/software/BeautifulSoup/
    tags:
      - Data Extraction
      - HTML Parsing
      - Python
      - Scraping
      - Web Scraping
      - XML Parsing
    properties:
      - type: Documentation
        url: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
      - type: GettingStarted
        url: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
      - type: SDK
        url: https://pypi.org/project/beautifulsoup4/
        title: Python Package (PyPI)

common:
  - type: Website
    url: https://www.crummy.com/software/BeautifulSoup/
  - type: Documentation
    url: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
  - type: SDK
    url: https://pypi.org/project/beautifulsoup4/
    title: PyPI Package
  - type: GitHubOrganization
    url: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4
  - type: ChangeLog
    url: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG
  - type: Features
    data:
      - name: Multi-Parser Support
        description: Supports html.parser (built-in), lxml (fast), and html5lib (browser-like) parsers for flexible HTML/XML parsing.
      - name: CSS Selector Support
        description: CSS selector support via the Soup Sieve library, covering most selectors through the CSS Selectors Level 4 drafts, for familiar CSS-based element selection.
      - name: Tree Navigation API
        description: Rich API for navigating the parse tree upward, downward, and sideways including find(), find_all(), parents, children, and siblings.
      - name: Automatic Encoding Detection
        description: Automatically detects document encoding and converts text to Unicode using the bundled Unicode, Dammit component (the UnicodeDammit class), so extracted text decodes correctly.
      - name: Tree Modification
        description: Full tree modification support including append, insert, extract, decompose, replace_with, wrap, and unwrap operations.
      - name: Output Formatting
        description: Output methods such as prettify() and get_text(), plus built-in and custom formatters that control entity escaping during serialization.
  - type: UseCases
    data:
      - name: Web Scraping
        description: Extract data from websites by parsing HTML pages with Beautiful Soup and navigating the resulting parse tree to find target elements.
      - name: Data Mining
        description: Mine structured data from HTML tables, lists, and other markup patterns across large numbers of web pages.
      - name: Content Extraction
        description: Extract article text, product information, or other content from web pages for NLP pipelines and data analysis.
      - name: Screen Scraping Legacy Systems
        description: Automate data extraction from legacy HTML web interfaces that lack modern APIs.
      - name: HTML Sanitization
        description: Parse and clean HTML documents by removing unwanted tags, scripts, and formatting.
      - name: XML Processing
        description: Parse and query XML documents using Beautiful Soup's tree navigation and search capabilities.
  - type: Integrations
    data:
      - name: Requests
        description: Python HTTP library used in combination with Beautiful Soup to fetch and parse web pages.
      - name: Scrapy
        description: Python web crawling framework; spider callbacks can pass response bodies to Beautiful Soup for content extraction.
      - name: lxml
        description: Fast XML and HTML parsing library used as an alternate parser backend for Beautiful Soup.
      - name: html5lib
        description: Pure-Python HTML5 parser used with Beautiful Soup for browser-compatible HTML parsing.
      - name: Pandas
        description: DataFrame library commonly used with Beautiful Soup to convert scraped HTML tables into structured data.
      - name: Selenium
        description: Browser automation tool used with Beautiful Soup to scrape JavaScript-rendered pages.
maintainers:
  - FN: Kin Lane
    email: [email protected]