We're excited to announce the release of bx-jsoup, a powerful new BoxLang module that brings enterprise-grade HTML parsing and cleaning capabilities to your applications. Built on top of the proven Jsoup library, this module provides developers with safe, flexible tools for handling HTML content while maintaining BoxLang's signature ease of use. It also enhances the core document classes to return a fluent BoxDocument that you can navigate, query, and even convert to XML or JSON.
Why HTML Parsing and Cleaning Matters
In today's web applications, handling HTML content safely is crucial. Whether you're building a content management system, processing user-generated content, scraping data from websites, or simply transforming HTML for different output formats, you need tools that are both powerful and secure. The bx-jsoup module addresses these needs with:
- XSS Protection: Built-in safeguards against Cross-Site Scripting attacks
- Flexible Content Processing: Parse, manipulate, and transform HTML with ease
- Data Extraction: Use familiar CSS selectors to extract specific content
- Multiple Output Formats: Convert HTML to JSON, XML, or plain text
Getting Started
Installation is simple using either CommandBox or the BoxLang Installer Scripts:
# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup
Core Functions
The module provides two primary Built-in Functions (BIFs) that handle the most common HTML processing scenarios:
htmlParse() - Transform HTML into Manipulable Objects
The htmlParse() function converts HTML strings into BoxDocument objects, giving you programmatic access to the document structure:
// Parse HTML content
htmlContent = "<html><head><title>My Blog</title></head><body><h1>Welcome</h1><p>Latest posts...</p></body></html>";
doc = htmlParse( htmlContent );
// Extract information
pageTitle = doc.title(); // Returns "My Blog"
mainHeading = doc.select( "h1" ).text(); // Returns "Welcome"
allText = doc.text(); // Returns plain text without HTML tags
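The same select() call can also pull attribute values. The sketch below assumes that select() returns a Jsoup Elements collection, so that Jsoup's attr(), eachAttr(), and eachText() methods are available on the result. The sample markup and variable names are made up for illustration.

```boxlang
// Hypothetical sample markup with two links
navHtml = '<ul><li><a href="https://boxlang.io">BoxLang</a></li><li><a href="https://jsoup.org">Jsoup</a></li></ul>';
navDoc = htmlParse( navHtml );

// Select every anchor that carries an href attribute
links = navDoc.select( "a[href]" );

// attr() reads the attribute from the first matching element
firstHref = links.attr( "href" );

// eachAttr() and eachText() collect one value per matching element
allHrefs = links.eachAttr( "href" );
linkLabels = links.eachText();
```

Selecting on a[href] rather than a skips anchors that have no link target, which keeps the two lists aligned.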
htmlClean() - Sanitize Content for Safe Display
The htmlClean() function removes malicious content while preserving safe HTML elements:
// Clean potentially dangerous content
userContent = "<p>Great article!</p><script>alert('XSS attempt')</script>";
safeContent = htmlClean( userContent );
// Result: "<p>Great article!</p>"
// Different safety levels for different needs
strictContent = htmlClean( userContent, "basic" );
textOnly = htmlClean( userContent, "none" );
Advanced Features
Enhanced BoxDocument Methods
Beyond standard Jsoup functionality, BoxDocument includes BoxLang-specific enhancements:
// Convert to structured JSON
htmlContent = `
<article class="blog-post">
<h1>BoxLang 1.0 Released</h1>
<p class="author">By: BoxLang Team</p>
<div class="content">
<p>We're excited to announce BoxLang 1.0...</p>
</div>
</article>
`;
doc = htmlParse( htmlContent );
// Generate JSON for API responses
jsonData = doc.toJSON( true ); // Pretty-printed JSON
// Perfect for REST APIs or data processing pipelines
// Generate XML for legacy systems
xmlData = doc.toXML( true, 2 ); // Pretty-printed with 2-space indentation
// Ideal for XML-based integrations
Flexible Safety Levels
Choose the right balance between security and functionality:
- none: Maximum security - plain text only
- simpletext: Basic inline formatting (<b>, <i>, <br>)
- basic: Standard safe tags without images
- basicwithimages: Standard safe tags plus images
- relaxed: More permissive for trusted content (default)
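One way to pick a level is to run the same input through each of them and compare the results. The following is a minimal sketch that uses only the level names listed above. The sample markup is invented, and the exact output depends on the module's safe lists, so none is shown here.

```boxlang
// Hypothetical input mixing formatting, a link, an image, and a script
sample = '<p>Hello <b>world</b> <a href="https://example.com">link</a> <img src="https://example.com/logo.png" alt="Logo"></p><script>alert(1)</script>';

levels = [ "none", "simpletext", "basic", "basicwithimages", "relaxed" ];

// Collect the cleaned output per level so the results can be compared side by side
results = {};
for ( level in levels ) {
    results[ level ] = htmlClean( sample, level );
}
```

Inspecting results for content that represents your real input is a quick way to confirm that a level keeps the tags you need before you commit to it.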
Real-World Use Cases
Content Management Systems
// Clean user submissions before storage
userSubmission = `
<h2>My Article</h2>
<p>This is my content with <script>maliciousCode()</script> embedded.</p>
<img src="photo.jpg" alt="My photo">
`;
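// Clean with image support
cleanContent = htmlClean(
    html: userSubmission,
    safeList: "basicwithimages"
);
// The script is removed and the <img> element is kept.
// Assuming the safe lists mirror Jsoup's, <h2> is not in this list, so the tag is dropped and its text is kept.
// A relative src such as "photo.jpg" may also be dropped unless relative links are preserved.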
Web Scraping and Data Extraction
// Extract product information from scraped content
productHtml = `
<div class="product-card">
<h3 class="product-name">BoxLang Pro License</h3>
<span class="price">$199.00</span>
<div class="description">
<p>Professional BoxLang development license with premium features.</p>
</div>
<ul class="features">
<li>Advanced debugging tools</li>
<li>Performance profiling</li>
<li>Premium support</li>
</ul>
</div>
`;
doc = htmlParse( productHtml );
// Extract structured data
productInfo = {
    "name": doc.select( ".product-name" ).text(),
    "price": doc.select( ".price" ).text(),
    "description": doc.select( ".description p" ).text(),
    "features": doc.select( ".features li" ).map( ( element ) => element.text() )
};
// Result: Clean, structured data ready for your application
Email Template Processing
// Process email templates safely
emailTemplate = `
<div class="email-container">
<h1>Welcome {{customerName}}!</h1>
<p>Thank you for joining our service.</p>
<script>trackOpen()</script>
<a href="{{confirmationLink}}">Confirm your account</a>
</div>
`;
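// Clean the template before placeholder replacement
cleanTemplate = htmlClean(
    html: emailTemplate,
    safeList: "basic"
);
// The script is removed and text placeholders such as {{customerName}} survive.
// Assuming the safe lists mirror Jsoup's, <div> and <h1> are not in "basic", so those tags are dropped and their text is kept.
// href="{{confirmationLink}}" is not a valid URL and may be stripped; check the output, or substitute link placeholders before cleaning.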
Data Transformation Pipelines
// Transform HTML content for different systems
newsArticle = `
<article>
<header>
<h1>BoxLang Adoption Grows</h1>
<time datetime="2024-01-15">January 15, 2024</time>
<span class="author">Tech Reporter</span>
</header>
<section class="content">
<p>Enterprise adoption of BoxLang continues to accelerate...</p>
<blockquote>
"BoxLang has transformed our development process" - CTO, Fortune 500 Company
</blockquote>
</section>
</article>
`;
doc = htmlParse( newsArticle );
// For search indexing (plain text)
searchableText = doc.text();
// For API responses (structured JSON)
apiResponse = doc.toJSON( true );
// For legacy XML systems
xmlFeed = doc.toXML( true, 4 );
// Each format optimized for its intended use
Security by Design
The bx-jsoup module prioritizes security without sacrificing functionality:
- Whitelist-based Cleaning: Only explicitly allowed elements and attributes are preserved
- XSS Prevention: Automatically removes dangerous scripts and event handlers
- Configurable Safety: Choose the right balance for your security requirements
- Link Handling: Control how relative links are processed and resolved
// Example of comprehensive XSS protection
maliciousContent = `
<p>Legitimate content</p>
<script>stealCookies()</script>
<img src="x" onerror="alert('XSS')">
<a href="javascript:malicious()">Dangerous link</a>
<div onclick="badStuff()">Clickjacking attempt</div>
`;
safeContent = htmlClean( maliciousContent, "relaxed" );
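// Result: the script element, the onerror and onclick handlers, and the javascript: link target are removed.
// Allowed tags such as <p>, <a>, and <div> remain along with their text.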
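If you want a regression check on top of this, one option is to re-parse the cleaned output and confirm that no script elements or on* attributes survived. This is a sketch rather than part of the module. It assumes select() accepts Jsoup's selector syntax, including the [^on] attribute-prefix selector, and the variable names are hypothetical.

```boxlang
// Hypothetical input with an inline handler and a script element
dirty = '<p onclick="track()">Legitimate content</p><script>stealCookies()</script>';
cleaned = htmlClean( dirty, "relaxed" );

// Re-parse the sanitized markup so it can be queried
check = htmlParse( cleaned );

// Count leftover script elements and elements with any attribute starting with "on"
scriptCount = check.select( "script" ).size();
handlerCount = check.select( "[^on]" ).size();

if ( scriptCount > 0 || handlerCount > 0 ) {
    throw( message = "Sanitized HTML still contains active content" );
}
```

A check like this belongs in a test suite rather than in the request path, since it parses the content a second time.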
Getting Help and Contributing
The bx-jsoup module is part of the growing BoxLang ecosystem. For support, documentation, and contributions:
- Documentation: Complete API reference at BoxLang Documentation
- Issues and Features: Report bugs or request features via Jira
- Source Code: View and contribute on GitHub
- Community: Join the BoxLang community for discussions and support
Professional Open Source
BoxLang is a professional open-source product with three different licenses:
- Open-Source Apache 2
- BoxLang+
- BoxLang++
BoxLang is free, open-source software under the Apache 2.0 license. We encourage and support community contributions. BoxLang+ and BoxLang++ are commercial versions offering support and enterprise features. Our licensing model is based on fairness and the golden rule: Do to others as you want them to do to you. No hidden pricing or pricing on cores, RAM, SaaS, multi-domain or ridiculous ways to get your money. Transparent and fair.
BoxLang is more than just a language; it's a movement.
Join us and redefine development on the JVM. Ready to learn more? Explore BoxLang's Features, Documentation, and Community.
Join the BoxLang Community ⚡️
Be part of the movement shaping the future of web development. Stay connected and receive the latest updates on anything BoxLang.
Subscribe to our newsletter for exclusive content.