A lightweight Python library that converts BeautifulSoup4 HTML elements into structured JSON. Parse any HTML and get clean, traversable dictionaries — preserving document order, with full control over comments, whitespace, and label naming.
Python 3.8+ | Only dependency: beautifulsoup4
Table of Contents
| Section | Description |
|---|---|
| Installation | How to install |
| Quick Start | Basic usage example |
| Output Format | How HTML maps to JSON |
| Conversion | Converting tags, multiple tags, from BeautifulSoup |
| Options | group_by_tag, comments, whitespace, labels, config |
| Output | Save to file, pretty print |
| Advanced Usage | Context manager, callable, extension mode |
| API Reference | BS2Json methods, ConversionConfig fields |
| Contributing | How to contribute |
Installation
pip install -U bs2json
Quick Start
from bs2json import BS2Json
html = """
<html>
<head><title>My Page</title></head>
<body>
<h1>Welcome</h1>
<p class="intro">Hello <b>world</b></p>
<a href="/link1">Link 1</a>
<a href="/link2">Link 2</a>
</body>
</html>
"""
converter = BS2Json(html)
result = converter.convert()
converter.prettify()
Output Format
Elements preserve their original document order. The JSON structure follows these rules:
| HTML | JSON |
|---|---|
| `<h1>text</h1>` | `{"h1": "text"}` |
| `<p class="x">text</p>` | `{"p": {"attrs": {"class": ["x"]}, "text": "text"}}` |
| `<div><h1>A</h1><p>B</p></div>` | `{"div": {"children": [{"h1": "A"}, {"p": "B"}]}}` |
| `<a href="/">link</a>` | `{"a": {"attrs": {"href": "/"}, "text": "link"}}` |
| `<!-- note -->` | `{"comment": "<!-- note -->"}` |
- Single text child stays simple: `{"tag": "text"}`
- Multiple children use: `{"tag": {"children": [...]}}`
- Attributes appear under the `"attrs"` key
- Mixed content (text + tags) preserves order in `children`
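The shapes listed above are regular enough to walk with a small recursive helper. The sketch below is illustrative only: `iter_tags` and `text_of` are hypothetical helpers, not part of bs2json, and they assume the default labels (`attrs`, `text`) and the default ordered output rather than `group_by_tag=True`.

```python
# Hypothetical helpers for reading a converted dict; not part of bs2json.

def iter_tags(node, attrs_key="attrs", text_key="text"):
    """Yield (tag, value) pairs in document order from a converted dict."""
    for key, value in node.items():
        if key in (attrs_key, text_key):
            continue  # attribute and bare-text entries are not tags
        if key == "children":
            # Ordered list of single-key dicts, one per child node
            for child in value:
                yield from iter_tags(child, attrs_key, text_key)
        else:
            yield key, value
            if isinstance(value, dict):
                yield from iter_tags(value, attrs_key, text_key)


def text_of(value, text_key="text"):
    """Return a tag's direct text for both the simple and expanded shapes."""
    if isinstance(value, str):
        return value
    return value.get(text_key, "")


# Sample input assembled by hand from the shapes in the table above
doc = {"div": {"children": [
    {"h1": "A"},
    {"p": {"attrs": {"class": ["x"]}, "text": "B"}},
    {"a": {"attrs": {"href": "/"}, "text": "link"}},
]}}

tags = [(tag, text_of(value)) for tag, value in iter_tags(doc)]
print(tags)
```

With this input the helper visits `div`, `h1`, `p`, and `a` in that order. Comment nodes would be reported under the comment label, since in the output they look like any other single-key entry.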
Full output example
{'html': {'head': {'title': 'My Page'},
          'body': {'children': [{'h1': 'Welcome'},
                                {'p': {'attrs': {'class': ['intro']},
                                       'children': [{'text': 'Hello'},
                                                    {'b': 'world'}]}},
                                {'a': {'attrs': {'href': '/link1'},
                                       'text': 'Link 1'}},
                                {'a': {'attrs': {'href': '/link2'},
                                       'text': 'Link 2'}}]}}}
Conversion
Convert Specific Tags
converter = BS2Json(html)
# By tag name
converter.convert('body')
# By CSS class
converter.convert(class_='intro')
# By attribute
converter.convert('a', href='/link1')
# {'a': {'attrs': {'href': '/link1'}, 'text': 'Link 1'}}
Convert Multiple Tags
converter = BS2Json(html)
# As a list of individual results
converter.convert_all('a')
# [{'a': {'attrs': {'href': '/link1'}, 'text': 'Link 1'}},
# {'a': {'attrs': {'href': '/link2'}, 'text': 'Link 2'}}]
# Grouped by tag name: one dict holding every match, still wrapped in a list
converter.convert_all('a', join=True)
# [{'a': [{'attrs': {'href': '/link1'}, 'text': 'Link 1'},
#         {'attrs': {'href': '/link2'}, 'text': 'Link 2'}]}]
From BeautifulSoup Objects
You can pass an existing BeautifulSoup object or Tag instead of raw HTML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# From a soup object
BS2Json(soup).convert()
# From a specific tag
BS2Json(soup.find('body')).convert()
# Convert a tag on the fly, with no HTML passed to the constructor
converter = BS2Json()
converter.convert(soup.body)
Options
Group by Tag Name
By default, elements preserve document order. Use group_by_tag=True to group siblings by tag name — useful when you don't care about order and want quick access by tag:
html = '<html><body><h3>First</h3><p>Text</p><h3>Second</h3></body></html>'
# Default: preserves document order
BS2Json(html).convert()
# {'html': {'body': {'children': [{'h3': 'First'}, {'p': 'Text'}, {'h3': 'Second'}]}}}
# Grouped: siblings merged by tag name
BS2Json(html, group_by_tag=True).convert()
# {'html': {'body': {'h3': ['First', 'Second'], 'p': 'Text'}}}
Comments
comment_html = '<html><body><!-- TODO --><p>text</p></body></html>'
# Included by default
BS2Json(comment_html).convert()
# {'html': {'body': {'children': [{'comment': '<!-- TODO -->'}, {'p': 'text'}]}}}
# Exclude comments
BS2Json(comment_html, include_comments=False).convert()
# {'html': {'body': {'p': 'text'}}}
Whitespace
ws_html = '<html><body><p> hello </p></body></html>'
# Stripped by default
BS2Json(ws_html).convert()
# {'html': {'body': {'p': 'hello'}}}
# Preserve whitespace
BS2Json(ws_html, strip=False).convert()
# {'html': {'body': {'p': ' hello '}}}
Custom Labels
Change the JSON key names for attributes, text content, and comments:
converter = BS2Json('<html><body><p class="x">hello</p></body></html>')
converter.labels(attrs='attributes', text='content', comment='notes')
result = converter.convert()
# {'html': {'body': {'p': {'attributes': {'class': ['x']}, 'content': 'hello'}}}}
Or via constructor:
BS2Json(html, attr_name='@', text_name='#text', comment_name='#comment')
Configuration Object
All options are stored in a ConversionConfig dataclass, accessible and modifiable at any time:
from bs2json import BS2Json, ConversionConfig
converter = BS2Json(html, strip=False)
print(converter.config)
# ConversionConfig(attr_name='attrs', text_name='text', comment_name='comment',
# include_comments=True, strip=False, group_by_tag=False)
# Modify config directly
converter.config.group_by_tag = True
converter.config.include_comments = False
Output
Save to File
converter = BS2Json(html)
converter.convert()
# Save to JSON file (pretty-printed, 4-space indent)
converter.save('output.json')
# Save compact
converter.save('compact.json', prettify=False)
# Custom indent
converter.save('indented.json', indent=2)
# Save to a file-like object
import io
buf = io.StringIO()
converter.save(buf)
Pretty Print
converter = BS2Json(html)
converter.convert()
converter.prettify()  # prints to stdout
Advanced Usage
Context Manager and Callable
# Use as context manager
with BS2Json(html) as converter:
    result = converter.convert()
# Use as callable (shortcut for .convert())
converter = BS2Json(html)
result = converter()
Extension Mode
Monkey-patch .to_json() directly onto every BeautifulSoup Tag element:
from bs4 import BeautifulSoup
from bs2json import install, remove
install()
soup = BeautifulSoup(html, 'html.parser')
# Now every tag has .to_json()
soup.find('body').to_json()
soup.find('a').to_json(include_comments=False, strip=False)
remove()  # clean up when done
API Reference
BS2Json
| Method | Description |
|---|---|
| `BS2Json(soup, features, *, include_comments, strip, group_by_tag, **kwargs)` | Initialize from HTML string, Tag, or BeautifulSoup object |
| `.convert(element=None, json=None, *, inplace=False, **kwargs)` | Convert a single tag to a dict |
| `.convert_all(elements=None, lst=None, *, join=False, **kwargs)` | Convert multiple tags to a list of dicts |
| `.labels(attrs=..., text=..., comment=...)` | Change JSON key names |
| `.save(file, /, mode='w', *, prettify=True, indent=4)` | Save last result to file path or file object |
| `.prettify()` | Pretty-print last result to stdout |
| `.config` | ConversionConfig dataclass with all options |
| `.last_obj` | Result of the most recent conversion |
| `.soup` | The underlying BeautifulSoup object |
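The `.save()` signature above accepts either a file path or a file object. As a rough illustration of that calling convention, the sketch below writes a result dict with the standard `json` module. `save_json` is a hypothetical stand-in, not bs2json's implementation, and it assumes only that `prettify=False` means output without indentation.

```python
import io
import json


def save_json(obj, file, mode="w", *, prettify=True, indent=4):
    """Hypothetical stand-in: write obj as JSON to a path or a file object."""
    text = json.dumps(obj, indent=indent if prettify else None)
    if hasattr(file, "write"):
        file.write(text)  # the caller owns the file object, so leave it open
    else:
        with open(file, mode, encoding="utf-8") as handle:
            handle.write(text)


# A result shaped like the single-link example earlier in this README
result = {"a": {"attrs": {"href": "/link1"}, "text": "Link 1"}}

pretty = io.StringIO()
save_json(result, pretty)  # indented, one key per line

compact = io.StringIO()
save_json(result, compact, prettify=False)  # single line

print(pretty.getvalue())
print(compact.getvalue())
```

Either buffer can be read back with `json.loads`, which is a convenient way to check a saved result in tests.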
ConversionConfig
| Field | Default | Description |
|---|---|---|
| `attr_name` | `"attrs"` | JSON key for element attributes |
| `text_name` | `"text"` | JSON key for text content |
| `comment_name` | `"comment"` | JSON key for HTML comments |
| `include_comments` | `True` | Whether to include HTML comments |
| `strip` | `True` | Strip leading/trailing whitespace from text |
| `group_by_tag` | `False` | Group siblings by tag name instead of preserving order |
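To make the field table concrete, here is a minimal sketch of a dataclass with the same field names and defaults. `ConfigSketch` is a hypothetical mirror written for illustration; the real `ConversionConfig` may carry extra validation or behaviour that this does not show.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class ConfigSketch:
    """Hypothetical mirror of the documented ConversionConfig fields."""
    attr_name: str = "attrs"
    text_name: str = "text"
    comment_name: str = "comment"
    include_comments: bool = True
    strip: bool = True
    group_by_tag: bool = False


config = ConfigSketch(strip=False)
config.group_by_tag = True  # fields can be reassigned after creation

# replace() copies the config with selected fields overridden
variant = replace(config, include_comments=False)

print(asdict(variant))
```

Copying with `replace` leaves the original object untouched, which can be handy when one converter needs two slightly different sets of options.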
Contributing
See CONTRIBUTING.md for development setup, versioning guide, and how to submit changes.