I’m currently using diffArrays with a custom tokenizer to compare two HTML documents, treating tags and text separately. Here’s a simplified version of my setup:
I tokenize the HTML into words and tags (preserving spaces).
I use diffArrays() with a custom comparator that tries to ignore formatting-only changes.
I wrap added/removed tokens in <span> tags to render diffs inline in the browser.
However, I’m running into a few key issues:
Formatting-only changes (like switching from <b> to <strong>) are still flagged as additions/removals, even when they are semantically equivalent.
Sentence-level changes are broken down into many small word-level diffs instead of treating them as grouped phrases.
I saw in the README that extending the Diff class could allow deeper customization, but I couldn’t find any examples or guidance on how to use it in this context.
Here’s the code I’m using:
    export const compareTwoDocuments = (original: string, modified: string) => {
      // Minify HTML by removing extra whitespace and normalizing newlines
      const minifyHtml = (html: string): string => {
        return html
          .replace(/&nbsp;/g, ' ') // Replace &nbsp; with space
          .replace(/\s+/g, ' ') // Replace multiple spaces with single space
          .replace(/>\s+</g, '><') // Remove spaces between tags
          .replace(/\s+>/g, '>') // Remove spaces before closing tags
          .replace(/<\s+/g, '<') // Remove spaces after opening tags
          .trim()
      }

      // Custom tokenizer that separates HTML tags and content while preserving spaces
      const tokenizeHtml = (text: string): string[] => {
        const tokens: string[] = []
        let currentIndex = 0
        const tagRegex = /<[^>]+>|<\/[^>]+>/g
        let match

        while ((match = tagRegex.exec(text)) !== null) {
          // Add content before the tag if it exists
          if (match.index > currentIndex) {
            const content = text.slice(currentIndex, match.index)
            if (content) {
              // Split content into words and spaces
              const words = content.split(/(\s+)/)
              tokens.push(...words.filter((w) => w.length > 0))
            }
          }
          // Add the tag
          tokens.push(match[0])
          currentIndex = match.index + match[0].length
        }

        // Add any remaining content after the last tag
        if (currentIndex < text.length) {
          const remainingContent = text.slice(currentIndex)
          if (remainingContent) {
            // Split remaining content into words and spaces
            const words = remainingContent.split(/(\s+)/)
            tokens.push(...words.filter((w) => w.length > 0))
          }
        }

        return tokens
      }

      const originalTokens = tokenizeHtml(minifyHtml(original))
      const modifiedTokens = tokenizeHtml(minifyHtml(modified))

      const differences = jsdiff.diffArrays(originalTokens, modifiedTokens, {
        comparator: (left, right) => {
          // Skip spaces (cause we have so many of them)
          if (left === ' ' && right === ' ') return true
          // Compare HTML tags exactly
          if (left.startsWith('<') && right.startsWith('<')) {
            return left === right
          }
          // For content, compare case-insensitively and normalize whitespace
          const normalize = (str: string) => str.replace(/\s+/g, ' ').trim().toLowerCase()
          return normalize(left) === normalize(right)
        },
      })

      const finalHtmlResult = differences
        .map((part) => {
          const value = part.value.join('')
          if (part.added) return `<span class="text-green-600 bg-green-100">${value}</span>`
          if (part.removed) return `<span class="text-red-600 bg-red-100 line-through">${value}</span>`
          return value
        })
        .join('')

      return finalHtmlResult
    }
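For the first issue, one direction I've considered is mapping equivalent tags to a canonical name before comparing them, instead of comparing tags exactly. Here's a minimal sketch of that idea. The alias table and the names `TAG_ALIASES`, `canonicalTag`, and `tokensEqual` are my own hypothetical helpers, not part of jsdiff, and I haven't checked how well this interacts with the rest of the diff output:

```typescript
// Hypothetical alias table: tags I would treat as equivalent when diffing.
const TAG_ALIASES: Record<string, string> = { b: 'strong', i: 'em' }

// Looks up the canonical name for a lowercased tag name.
const canonicalName = (name: string): string =>
  Object.prototype.hasOwnProperty.call(TAG_ALIASES, name) ? TAG_ALIASES[name] : name

// Rewrites only the tag name of a token such as "<b>" or "</B>";
// attributes are kept, and non-tag tokens are returned unchanged.
const canonicalTag = (token: string): string =>
  token.replace(
    /^<(\/?)([a-zA-Z][a-zA-Z0-9]*)/,
    (_m: string, slash: string, name: string) => `<${slash}${canonicalName(name.toLowerCase())}`,
  )

// Comparator with the (left, right) => boolean shape used in the code above.
const tokensEqual = (left: string, right: string): boolean => {
  if (left.startsWith('<') && right.startsWith('<')) {
    return canonicalTag(left) === canonicalTag(right)
  }
  return left === right
}
```

Something like `tokensEqual` could replace the exact tag comparison in the `comparator` I pass to `diffArrays`, but I'm not sure whether this is the intended way to handle it.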
My questions:
Is there a recommended way to improve grouping of similar phrases/sentences instead of word-by-word diffs?
Is there any documentation or example for subclassing Diff to improve diff scoring or heuristics?
Would it make sense to preprocess HTML into block-level elements and diff those instead? Or is there a more robust way to make jsdiff HTML-aware?
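To make the questions above more concrete, here is a rough sketch of the coarser tokenization I have in mind: split the HTML at block-level opening tags first, then split text into sentences rather than words. This is only an illustration under my own assumptions. `BLOCK_TAGS`, `splitIntoBlocks`, and `splitIntoSentences` are hypothetical helpers, the splitting is regex-based (so it ignores nesting and malformed markup), and the sentence rule will break on abbreviations like "e.g.":

```typescript
// Hypothetical list of tags treated as block boundaries; extend as needed.
const BLOCK_TAGS = ['p', 'div', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'blockquote', 'tr']

// Splits HTML into chunks, each starting at a block-level opening tag.
// The lookahead keeps the tag itself at the start of the next chunk.
const splitIntoBlocks = (html: string): string[] => {
  const boundary = new RegExp(`(?=<(?:${BLOCK_TAGS.join('|')})\\b[^>]*>)`, 'i')
  return html.split(boundary).filter((chunk) => chunk.trim().length > 0)
}

// Splits plain text into sentence-sized tokens, keeping the terminating
// punctuation and trailing whitespace attached to each sentence.
// Joining the result reproduces the input string.
const splitIntoSentences = (text: string): string[] =>
  text.match(/[^.!?]*[.!?]+\s*|[^.!?]+/g) || []
```

My thinking is that the block arrays from both documents could be passed to `diffArrays` first, and only blocks reported as changed would need a second, finer-grained pass at sentence or word level. I don't know whether that is a sensible use of the library.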
Any guidance or references would be really appreciated.
Thanks.