How to do structured HTML-aware comparison (e.g., with Diff subclass or tokenizer)? #606

Open
AliRezaBeitari opened this issue Apr 29, 2025 · 0 comments

Hi,

I’m currently using diffArrays with a custom tokenizer to compare two HTML documents, treating tags and text separately. Here’s a simplified version of my setup:

  • I tokenize the HTML into words and tags (preserving spaces).
  • I use diffArrays() with a custom comparator that tries to ignore formatting-only changes.
  • I wrap added/removed tokens in <span> tags to render diffs inline in the browser.

However, I’m running into a few key issues:

  1. Formatting-only changes (like switching from <b> to <strong>) are still flagged as additions/removals, even when they are semantically equivalent.
  2. Sentence-level changes are broken down into many small word-level diffs instead of being treated as grouped phrases.
  3. I saw in the README that extending the Diff class could allow deeper customization, but I couldn’t find any examples or guidance on how to use it in this context.
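
To illustrate the first issue, here is a minimal sketch of the kind of tag normalization that would make <b> and <strong> compare as equal. The alias table and the names TAG_ALIASES, canonicalizeTag, and tokensEqual are hypothetical and not part of jsdiff; attributes are kept as-is, so tags that differ in attributes still count as different.

```typescript
// Hypothetical equivalence table; extend it to suit the documents being compared.
const TAG_ALIASES = new Map<string, string>([
  ['b', 'strong'],
  ['i', 'em'],
])

// Reduce a tag token to a canonical form: lowercase tag name with aliases applied.
// Attributes are preserved unchanged. Non-tag tokens are returned untouched.
const canonicalizeTag = (token: string): string => {
  const match = /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)\b([^>]*?)(\/?)>$/.exec(token)
  if (!match) return token
  const [, closing, rawName, attributes, selfClosing] = match
  const name = rawName.toLowerCase()
  const canonicalName = TAG_ALIASES.get(name) ?? name
  return `<${closing}${canonicalName}${attributes}${selfClosing}>`
}

// Equality check that could replace the exact tag comparison in a comparator.
const tokensEqual = (left: string, right: string): boolean =>
  canonicalizeTag(left) === canonicalizeTag(right)
```

In a comparator passed to diffArrays(), the tag branch could return tokensEqual(left, right) instead of left === right. Whether treating these tags as equivalent is acceptable depends on the documents, so the table is only a starting point.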

Here’s the code I’m using:

import * as jsdiff from 'diff'

export const compareTwoDocuments = (original: string, modified: string) => {
  // Minify HTML by removing extra whitespace and normalizing newlines
  const minifyHtml = (html: string): string => {
    return html
      .replace(/&nbsp;/g, ' ') // Replace &nbsp; with space
      .replace(/\s+/g, ' ') // Collapse each whitespace run (including newlines) into one space
      .replace(/>\s+</g, '><') // Remove spaces between tags
      .replace(/\s+>/g, '>') // Remove whitespace before any '>'
      .replace(/<\s+/g, '<') // Remove whitespace after any '<'
      .trim()
  }

  // Custom tokenizer that separates HTML tags and content while preserving spaces
  const tokenizeHtml = (text: string): string[] => {
    const tokens: string[] = []
    let currentIndex = 0
    const tagRegex = /<[^>]+>/g // Matches opening, closing, and self-closing tags
    let match: RegExpExecArray | null

    while ((match = tagRegex.exec(text)) !== null) {
      // Add content before the tag if it exists
      if (match.index > currentIndex) {
        const content = text.slice(currentIndex, match.index)
        if (content) {
          // Split content into words and spaces
          const words = content.split(/(\s+)/)
          tokens.push(...words.filter((w) => w.length > 0))
        }
      }
      // Add the tag
      tokens.push(match[0])
      currentIndex = match.index + match[0].length
    }

    // Add any remaining content after the last tag
    if (currentIndex < text.length) {
      const remainingContent = text.slice(currentIndex)
      if (remainingContent) {
        // Split remaining content into words and spaces
        const words = remainingContent.split(/(\s+)/)
        tokens.push(...words.filter((w) => w.length > 0))
      }
    }

    return tokens
  }

  const originalTokens = tokenizeHtml(minifyHtml(original))
  const modifiedTokens = tokenizeHtml(minifyHtml(modified))

  const differences = jsdiff.diffArrays(originalTokens, modifiedTokens, {
    comparator: (left, right) => {
      // Two single-space tokens always match (there are many of them)
      if (left === ' ' && right === ' ') return true

      // Compare HTML tags exactly
      if (left.startsWith('<') && right.startsWith('<')) {
        return left === right
      }

      // For content, compare case-insensitively and normalize whitespace
      const normalize = (str: string) => str.replace(/\s+/g, ' ').trim().toLowerCase()
      return normalize(left) === normalize(right)
    },
  })

  const finalHtmlResult = differences
    .map((part) => {
      const value = part.value.join('')
      if (part.added) return `<span class="text-green-600 bg-green-100">${value}</span>`
      if (part.removed) return `<span class="text-red-600 bg-red-100 line-through">${value}</span>`
      return value
    })
    .join('')

  return finalHtmlResult
}
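
One possible way to reduce the word-level fragmentation is to post-process the result before rendering, merging changes that are separated only by a short unchanged gap (typically a single space). The sketch below is an assumption, not a jsdiff feature: the Part shape is inferred from how the code above reads each part, and coalesceParts and maxGap are hypothetical names.

```typescript
// Shape assumed from how the rendering code reads each part; not imported from the library.
interface Part {
  value: string[]
  added?: boolean
  removed?: boolean
}

// Merge runs of changes separated by short unchanged gaps into one removed part
// followed by one added part. `maxGap` is the largest unchanged gap, in tokens,
// that is absorbed into the surrounding change.
const coalesceParts = (parts: Part[], maxGap = 1): Part[] => {
  const result: Part[] = []
  let i = 0
  while (i < parts.length) {
    const part = parts[i]
    if (!part.added && !part.removed) {
      // Unchanged part outside any changed run: keep it as-is.
      result.push(part)
      i++
      continue
    }
    // Start of a changed run: collect the old-side and new-side tokens.
    const oldTokens: string[] = []
    const newTokens: string[] = []
    while (i < parts.length) {
      const current = parts[i]
      if (current.removed) {
        oldTokens.push(...current.value)
      } else if (current.added) {
        newTokens.push(...current.value)
      } else {
        // Absorb a short unchanged gap only if another change follows it.
        const next = parts[i + 1]
        const nextIsChange = Boolean(next && (next.added || next.removed))
        if (current.value.length > maxGap || !nextIsChange) break
        oldTokens.push(...current.value)
        newTokens.push(...current.value)
      }
      i++
    }
    if (oldTokens.length > 0) result.push({ value: oldTokens, removed: true })
    if (newTokens.length > 0) result.push({ value: newTokens, added: true })
  }
  return result
}
```

With this in place, the rendering step would map over coalesceParts(differences) rather than differences, so a rewritten phrase shows up as one removed span and one added span. Raising maxGap groups more aggressively at the cost of marking some unchanged words as changed.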

My questions:

  • Is there a recommended way to improve grouping of similar phrases/sentences instead of word-by-word diffs?
  • Is there any documentation or example for subclassing Diff to improve diff scoring or heuristics?
  • Would it make sense to preprocess HTML into block-level elements and diff those instead? Or is there a more robust way to make jsdiff HTML-aware?
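
To make the block-level idea concrete, here is a rough sketch of the preprocessing step, assuming the HTML has already been minified as in the code above. BLOCK_TAGS and splitIntoBlocks are hypothetical names, and the tag list is only an example. The resulting array could be passed to diffArrays() first, with a word-level diff run only on blocks that changed.

```typescript
// Hypothetical list of block-level tag names; adjust to the documents being compared.
const BLOCK_TAGS = ['p', 'div', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote', 'tr']

// Split HTML after each closing block-level tag, so each chunk is roughly one
// paragraph, heading, or list item. Nested blocks are split at every closing
// block tag, so this is a flat approximation rather than a real DOM walk.
const splitIntoBlocks = (html: string): string[] => {
  const closingBlock = new RegExp(`(</(?:${BLOCK_TAGS.join('|')})\\s*>)`, 'gi')
  // With a capturing group, split() returns content at even indexes and the
  // matched closing tags at odd indexes.
  const pieces = html.split(closingBlock)
  const blocks: string[] = []
  for (let i = 0; i < pieces.length; i += 2) {
    const block = (pieces[i] + (pieces[i + 1] ?? '')).trim()
    if (block.length > 0) blocks.push(block)
  }
  return blocks
}
```

A regex split like this does not understand nesting or malformed markup; parsing into a DOM and walking block elements would likely be more robust, but the sketch shows the shape of the input a two-pass diff would need.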

Any guidance or references would be really appreciated.

Thanks.
