How to do structured HTML-aware comparison (e.g., with Diff subclass or tokenizer)? #606

AliRezaBeitari · 2025-04-29T13:29:56Z

Hi,

I’m currently using diffArrays with a custom tokenizer to compare two HTML documents, treating tags and text separately. Here’s a simplified version of my setup:

I tokenize the HTML into words and tags (preserving spaces).
I use diffArrays() with a custom comparator that tries to ignore formatting-only changes.
I wrap added/removed tokens in <span> tags to render diffs inline in the browser.

However, I’m running into a few key issues:

Formatting-only changes (like switching from <b> to <strong>) are still flagged as additions/removals, even when they are semantically equivalent.
Sentence-level changes are broken down into many small word-level diffs instead of treating them as grouped phrases.
I saw in the README that extending the Diff class could allow deeper customization, but I couldn’t find any examples or guidance on how to use it in this context.

Here’s the code I’m using:

import * as jsdiff from 'diff'

export const compareTwoDocuments = (original: string, modified: string) => {
  // Minify HTML by removing extra whitespace and normalizing newlines
  const minifyHtml = (html: string): string => {
    return html
      .replace(/&nbsp;/g, ' ') // Replace &nbsp; with a regular space
      .replace(/\s+/g, ' ') // Collapse any run of whitespace (including newlines) into one space
      .replace(/>\s+</g, '><') // Remove whitespace between adjacent tags
      .replace(/\s+>/g, '>') // Remove whitespace before a tag's closing ">"
      .replace(/<\s+/g, '<') // Remove whitespace after a tag's opening "<"
      .trim()
  }

  // Custom tokenizer that separates HTML tags and content while preserving spaces
  const tokenizeHtml = (text: string): string[] => {
    const tokens: string[] = []
    let currentIndex = 0
    const tagRegex = /<[^>]+>/g // Matches opening, closing, and self-closing tags
    let match

    while ((match = tagRegex.exec(text)) !== null) {
      // Add content before the tag if it exists
      if (match.index > currentIndex) {
        const content = text.slice(currentIndex, match.index)
        if (content) {
          // Split content into words and spaces
          const words = content.split(/(\s+)/)
          tokens.push(...words.filter((w) => w.length > 0))
        }
      }
      // Add the tag
      tokens.push(match[0])
      currentIndex = match.index + match[0].length
    }

    // Add any remaining content after the last tag
    if (currentIndex < text.length) {
      const remainingContent = text.slice(currentIndex)
      if (remainingContent) {
        // Split remaining content into words and spaces
        const words = remainingContent.split(/(\s+)/)
        tokens.push(...words.filter((w) => w.length > 0))
      }
    }

    return tokens
  }

  const originalTokens = tokenizeHtml(minifyHtml(original))
  const modifiedTokens = tokenizeHtml(minifyHtml(modified))

  const differences = jsdiff.diffArrays(originalTokens, modifiedTokens, {
    comparator: (left, right) => {
      // Treat single-space tokens as equal (there are many of them)
      if (left === ' ' && right === ' ') return true

      // Compare HTML tags exactly
      if (left.startsWith('<') && right.startsWith('<')) {
        return left === right
      }

      // For content, compare case-insensitively and normalize whitespace
      const normalize = (str: string) => str.replace(/\s+/g, ' ').trim().toLowerCase()
      return normalize(left) === normalize(right)
    },
  })

  const finalHtmlResult = differences
    .map((part) => {
      const value = part.value.join('')
      if (part.added) return `<span class="text-green-600 bg-green-100">${value}</span>`
      if (part.removed) return `<span class="text-red-600 bg-red-100 line-through">${value}</span>`
      return value
    })
    .join('')

  return finalHtmlResult
}
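
For the first issue (formatting-only changes such as <b> versus <strong>), one possible direction is to canonicalize tag tokens before the comparator compares them. The following is only a minimal sketch: `TAG_ALIASES`, `canonicalizeTag`, and `tokensEqual` are hypothetical names that are not part of jsdiff, and the alias table is an assumption about which tags should count as equivalent.

```typescript
// Hypothetical alias table (an assumption): tags on the left are treated
// as equivalent to the tags on the right. Extend as needed.
const TAG_ALIASES: Record<string, string> = { b: 'strong', i: 'em' }

// Rewrite a single tag token such as "<b>" or "</B>" to a canonical form.
// Tokens that are not tags are returned unchanged.
const canonicalizeTag = (token: string): string => {
  const match = /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>$/.exec(token)
  if (!match) return token
  const [, slash, name, rest] = match
  const lower = name.toLowerCase()
  const canonical = TAG_ALIASES[lower] ?? lower
  // Attributes (the "rest" part) are kept as-is, so they still have to match.
  return `<${slash}${canonical}${rest}>`
}

// Comparator sketch: two tokens are equal when their canonical forms match.
const tokensEqual = (left: string, right: string): boolean =>
  canonicalizeTag(left) === canonicalizeTag(right)
```

A function like `tokensEqual` could be called from the tag branch of the `comparator` passed to `diffArrays` in place of `left === right`. Attribute differences would still be reported as changes under this sketch.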

My questions:

Is there a recommended way to improve grouping of similar phrases/sentences instead of word-by-word diffs?
Is there any documentation or example for subclassing Diff to improve diff scoring or heuristics?
Would it make sense to preprocess HTML into block-level elements and diff those instead? Or is there a more robust way to make jsdiff HTML-aware?
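
To make the block-level and sentence-level ideas in these questions concrete, here is a rough sketch of the preprocessing involved. It is regex-based and makes several assumptions: `BLOCK_TAGS`, `splitIntoBlocks`, and `splitIntoSentences` are hypothetical names, the tag list is not exhaustive, and nested blocks, comments, and scripts are not handled.

```typescript
// Hypothetical list of block-level tag names used as chunk boundaries
// (an assumption, not exhaustive).
const BLOCK_TAGS = ['p', 'div', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote', 'tr']

// Split an HTML string into chunks, cutting after every closing block-level tag.
const splitIntoBlocks = (html: string): string[] => {
  const closing = new RegExp(`(</(?:${BLOCK_TAGS.join('|')})>)`, 'gi')
  // split() with a capture group yields [content, closingTag, content, closingTag, ..., tail]
  const parts = html.split(closing)
  const blocks: string[] = []
  for (let i = 0; i < parts.length; i += 2) {
    const chunk = (parts[i] + (parts[i + 1] ?? '')).trim()
    if (chunk) blocks.push(chunk)
  }
  return blocks
}

// Split plain text into sentences at whitespace that follows ".", "!" or "?".
// The separating whitespace is dropped, so it must be re-added when rendering.
const splitIntoSentences = (text: string): string[] =>
  text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0)
```

The arrays returned by either function could be passed to `diffArrays` in place of the word-level tokens, with the word-level comparison then run only inside blocks or sentences that differ. Whether that gives better grouping in practice is untested here.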

Any guidance or references would be really appreciated.

Thanks.
