Luis Majano

July 15, 2025

We're excited to announce the release of bx-jsoup, a powerful new BoxLang module that brings enterprise-grade HTML parsing and cleaning capabilities to your applications. Built on top of the proven Jsoup library, this module provides developers with safe, flexible tools for handling HTML content while maintaining BoxLang's signature ease of use. It also enhances the core document classes to give you a fluent BoxDocument result that you can navigate, query, and even convert to XML or JSON.

Why HTML Parsing and Cleaning Matters

In today's web applications, handling HTML content safely is crucial. Whether you're building a content management system, processing user-generated content, scraping data from websites, or simply transforming HTML for different output formats, you need tools that are both powerful and secure. The bx-jsoup module addresses these needs with:

  • XSS Protection: Built-in safeguards against Cross-Site Scripting attacks
  • Flexible Content Processing: Parse, manipulate, and transform HTML with ease
  • Data Extraction: Use familiar CSS selectors to extract specific content
  • Multiple Output Formats: Convert HTML to JSON, XML, or plain text

Getting Started

Installation is simple using either CommandBox or the BoxLang Installer Scripts:

# BoxLang Installer Script
install-bx-module bx-jsoup

# CommandBox
box install bx-jsoup

Core Functions

The module provides two primary Built-in Functions (BIFs) that handle the most common HTML processing scenarios:

htmlParse() - Transform HTML into Manipulable Objects

The htmlParse() function converts HTML strings into BoxDocument objects, giving you programmatic access to the document structure:

// Parse HTML content
htmlContent = "<html><head><title>My Blog</title></head><body><h1>Welcome</h1><p>Latest posts...</p></body></html>";
doc = htmlParse( htmlContent );

// Extract information
pageTitle = doc.title(); // Returns "My Blog"
mainHeading = doc.select( "h1" ).text(); // Returns "Welcome"
allText = doc.text(); // Returns plain text without HTML tags
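
Attribute values can be read in much the same way. The following is a minimal sketch, assuming that select() returns a standard Jsoup Elements collection (as the select( "h1" ).text() call above suggests), so that Jsoup's first(), attr(), eachAttr(), and eachText() methods are callable from BoxLang. The sample markup and variable names are illustrative only.

```boxlang
// Illustrative markup with two links (example.com URLs are placeholders)
linkHtml = '<ul><li><a href="https://example.com/one">One</a></li><li><a href="https://example.com/two">Two</a></li></ul>';
linkDoc = htmlParse( linkHtml );

// Select only anchors that carry an href attribute
links = linkDoc.select( "a[href]" );

// Jsoup: attribute value of the first matched element
firstHref = links.first().attr( "href" );

// Jsoup: one entry per matched element
allHrefs = links.eachAttr( "href" );
allLabels = links.eachText();
```

If the module wraps these collections differently than assumed here, the same values should still be reachable by looping over the selection and calling attr() and text() on each element.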

htmlClean() - Sanitize Content for Safe Display

The htmlClean() function removes malicious content while preserving safe HTML elements:

// Clean potentially dangerous content
userContent = "<p>Great article!</p><script>alert('XSS attempt')</script>";
safeContent = htmlClean( userContent );
// Result: "<p>Great article!</p>"

// Different safety levels for different needs
strictContent = htmlClean( userContent, "basic" );
textOnly = htmlClean( userContent, "none" );

Advanced Features

Enhanced BoxDocument Methods

Beyond standard Jsoup functionality, BoxDocument includes BoxLang-specific enhancements:

// Convert to structured JSON
htmlContent = `
<article class="blog-post">
    <h1>BoxLang 1.0 Released</h1>
    <p class="author">By: BoxLang Team</p>
    <div class="content">
        <p>We're excited to announce BoxLang 1.0...</p>
    </div>
</article>
`;
doc = htmlParse( htmlContent );

// Generate JSON for API responses
jsonData = doc.toJSON( true ); // Pretty-printed JSON
// Perfect for REST APIs or data processing pipelines

// Generate XML for legacy systems
xmlData = doc.toXML( true, 2 ); // Pretty-printed with 2-space indentation
// Ideal for XML-based integrations

Flexible Safety Levels

Choose the right balance between security and functionality:

  • none: Maximum security - plain text only
  • simpletext: Basic inline formatting (<b>, <em>, <i>, <strong>, <u>)
  • basic: Standard safe tags without images
  • basicwithimages: Standard safe tags plus images
  • relaxed: More permissive for trusted content (default)
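
To compare the levels side by side, a minimal sketch like the following runs one fragment through each safe list. It relies only on the htmlClean() call shown earlier; the sample markup and variable names are illustrative, and no expected output is shown because the exact result depends on the Jsoup safe list each name maps to.

```boxlang
// Illustrative input mixing inline formatting, a link, an image, and a script
sample = '<p>Hello <b>world</b> <a href="https://example.com">link</a> <img src="https://example.com/a.png" alt="pic"><script>alert(1)</script></p>';

// Safe list names as listed above, from most to least restrictive
levels = [ "none", "simpletext", "basic", "basicwithimages", "relaxed" ];

// Print the cleaned result for each level
for ( level in levels ) {
    println( level & ": " & htmlClean( sample, level ) );
}
```

Running a representative sample of your own content through each level this way is a quick check before settling on a safe list for production input.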

Real-World Use Cases

Content Management Systems

// Clean user submissions before storage
userSubmission = `
    <h2>My Article</h2>
    <p>This is my content with <script>maliciousCode()</script> embedded.</p>
    <img src="photo.jpg" alt="My photo">
`;

// Clean with image support
cleanContent = htmlClean( 
    html: userSubmission,
    safeList: "basicwithimages"
);
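// Safe to store and display: the script is removed and the <img> element is kept
// Note: headings such as <h2> are not part of Jsoup's basic safe lists, so only their text is kept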

Web Scraping and Data Extraction

// Extract product information from scraped content
productHtml = `
<div class="product-card">
    <h3 class="product-name">BoxLang Pro License</h3>
    <span class="price">$199.00</span>
    <div class="description">
        <p>Professional BoxLang development license with premium features.</p>
    </div>
    <ul class="features">
        <li>Advanced debugging tools</li>
        <li>Performance profiling</li>
        <li>Premium support</li>
    </ul>
</div>
`;

doc = htmlParse( productHtml );

// Extract structured data
productInfo = {
    "name": doc.select( ".product-name" ).text(),
    "price": doc.select( ".price" ).text(),
    "description": doc.select( ".description p" ).text(),
    "features": doc.select( ".features li" ).map( ( element ) => element.text() )
};

// Result: Clean, structured data ready for your application

Email Template Processing

// Process email templates safely
emailTemplate = `
<div class="email-container">
    <h1>Welcome {{customerName}}!</h1>
    <p>Thank you for joining our service.</p>
    <script>trackOpen()</script>
    <a href="{{confirmationLink}}">Confirm your account</a>
</div>
`;

// Clean template while preserving placeholders
cleanTemplate = htmlClean( 
    html: emailTemplate,
    safeList: "basic"
);
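// The script is removed and text placeholders such as {{customerName}} are kept
// Verify the output before sending: Jsoup's basic list omits <div> and <h1>, and it may drop an href that is not yet a valid URL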

Data Transformation Pipelines

// Transform HTML content for different systems
newsArticle = `
<article>
    <header>
        <h1>BoxLang Adoption Grows</h1>
        <time datetime="2024-01-15">January 15, 2024</time>
        <span class="author">Tech Reporter</span>
    </header>
    <section class="content">
        <p>Enterprise adoption of BoxLang continues to accelerate...</p>
        <blockquote>
            "BoxLang has transformed our development process" - CTO, Fortune 500 Company
        </blockquote>
    </section>
</article>
`;

doc = htmlParse( newsArticle );

// For search indexing (plain text)
searchableText = doc.text();

// For API responses (structured JSON)
apiResponse = doc.toJSON( true );

// For legacy XML systems
xmlFeed = doc.toXML( true, 4 );

// Each format optimized for its intended use

Security by Design

The bx-jsoup module prioritizes security without sacrificing functionality:

  • Whitelist-based Cleaning: Only explicitly allowed elements and attributes are preserved
  • XSS Prevention: Automatically removes dangerous scripts and event handlers
  • Configurable Safety: Choose the right balance for your security requirements
  • Link Handling: Control how relative links are processed and resolved

// Example of comprehensive XSS protection
maliciousContent = `
<p>Legitimate content</p>
<script>stealCookies()</script>
<img src="x" onerror="alert('XSS')">
<a href="javascript:malicious()">Dangerous link</a>
<div onclick="badStuff()">Clickjacking attempt</div>
`;

safeContent = htmlClean( maliciousContent, "relaxed" );
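// Result: the script element, the onerror and onclick handlers, and the javascript: href are stripped;
// the remaining safe markup and its text are kept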

Getting Help and Contributing

The bx-jsoup module is part of the growing BoxLang ecosystem. For support, documentation, and contributions:

  • Documentation: Complete API reference at BoxLang Documentation
  • Issues and Features: Report bugs or request features via Jira
  • Source Code: View and contribute on GitHub
  • Community: Join the BoxLang community for discussions and support

Professional Open Source

BoxLang is a professional open-source product with three different licenses:

  1. Open-Source Apache 2
  2. BoxLang+
  3. BoxLang++

BoxLang is free, open-source software under the Apache 2.0 license. We encourage and support community contributions. BoxLang+ and BoxLang++ are commercial versions offering support and enterprise features. Our licensing model is based on fairness and the golden rule: Do to others as you want them to do to you. No hidden pricing, and no pricing based on cores, RAM, SaaS, multi-domain use, or other ridiculous ways to get your money. Transparent and fair.

BoxLang Subscription Plans

BoxLang is more than just a language; it's a movement.

Join us and redefine development on the JVM. Ready to learn more? Explore BoxLang's Features, Documentation, and Community.

Join the BoxLang Community ⚡️

Be part of the movement shaping the future of web development. Stay connected and receive the latest updates on anything BoxLang.

Subscribe to our newsletter for exclusive content.
