Skip to content
Permalink

Comparing changes

Choose two branches to see what’s changed or to start a new pull request. If you need to, you can also or learn more about diff comparisons.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also . Learn more about diff comparisons here.
base repository: matthewwithanm/python-markdownify
Failed to load repositories. Confirm that selected base ref is valid, then try again.
Loading
base: 0.13.1
Choose a base ref
...
head repository: matthewwithanm/python-markdownify
Failed to load repositories. Confirm that selected head ref is valid, then try again.
Loading
compare: 0.14.0
Choose a head ref
  • 13 commits
  • 9 files changed
  • 3 contributors

Commits on Apr 9, 2024

  1. More carefully separate inline text from block content

    There are various cases in which inline text fails to be separated by
    (sufficiently many) newlines from adjacent block content.  A paragraph
    needs a blank line (two newlines) separating it from prior text, as
    does an underlined header; an ATX header needs a single newline
    separating it from prior text.  A list needs at least one newline
    separating it from prior text, but in general two newlines (for an
    ordered list starting other than at 1, which will only be recognized
    given a blank line before).
    
    To avoid accumulation of more newlines than necessary, take care when
    concatenating the results of converting consecutive tags to remove
    redundant newlines (keeping the greater of the number ending the prior
    text and the number starting the subsequent text).
    
    This is thus an alternative to #108 that tries to avoid the excess
    newline accumulation that was a concern there, as well as fixing more
    cases than just paragraphs, and updating tests.
    
    Fixes #92
    
    Fixes #98
    jsm28 committed Apr 9, 2024
    Configuration menu
    Copy the full SHA
    60d8666 View commit details
    Browse the repository at this point in the history

Commits on Sep 30, 2024

  1. Configuration menu
    Copy the full SHA
    4399ee7 View commit details
    Browse the repository at this point in the history

Commits on Oct 2, 2024

  1. More selective escaping of -#.) (alternative approach)

    This is a partial alternative to #122 (open since April) for more
    selective escaping of some special characters.
    
    Here, we fix the test function naming (as noted in that PR) so the
    tests are actually run (and fix some incorrect test assertions so they
    pass).  We also make escaping of `-#.)` (the most common cases of
    unnecessary escaping in my use case) more selective, while still being
    conservatively safe in escaping all cases of those characters that
    might have Markdown significance (including in the presence of
    wrapping, unlike in #122).  (Being conservatively safe doesn't include
    the cases where `.` or `)` start a fragment, where the existing code
    already was not conservatively safe.)
    
    There are certainly more cases where the code could also be made more
    selective while remaining conservatively safe (including in the
    presence of wrapping), so this is not a complete replacement for #122,
    but by fixing some of the most common cases in a safe way, and getting
    the tests actually running, I hope this allows progress to be made
    where the previous attempt appears to have stalled, while still
    allowing further incremental progress with appropriately safe logic
    for other characters where useful.
    jsm28 committed Oct 2, 2024
    Configuration menu
    Copy the full SHA
    a369e07 View commit details
    Browse the repository at this point in the history

Commits on Oct 3, 2024

  1. Fix whitespace issues around wrapping

    This fixes various issues relating to how input whitespace is handled
    and how wrapping handles whitespace resulting from hard line breaks.
    
    This PR uses a branch based on that for #120 to avoid conflicts with
    the fixes and associated test changes there.  My suggestion is thus
    first to merge #120 (which fixes two open issues), then to merge the
    remaining changes from this PR.
    
    Wrapping paragraphs has the effect of losing all newlines including
    those from `<br>` tags, contrary to HTML semantics (wrapping should be
    a matter of pretty-printing the output; input whitespace from the HTML
    input should be normalized, but `<br>` should remain as a hard line
    break).  To fix this, we need to wrap the portions of a paragraph
    between hard line breaks separately.  For this to work, ensure that
    when wrapping, all input whitespace is normalized at an early stage,
    including turning newlines into spaces.  (Only ASCII whitespace is
    handled this way; `\s` is not used as it's not clear Unicode
    whitespace should get such normalization.)
    
    When not wrapping, there is still too much input whitespace
    preservation.  If the input contains a blank line, that ends up as a
    paragraph break in the output, or breaks the header formatting when
    appearing in a header tag, though in terms of HTML semantics such a
    blank line is no different from a space.  In the case of an ATX
    header, even a single newline appearing in the output breaks the
    Markdown.  Thus, when not wrapping, arrange for input whitespace
    containing at least one `\r` or `\n` to be normalized to a single
    newline, and in the ATX header case, normalize to a space.
    
    Fixes #130
    (probably, not sure exactly what the HTML input there is)
    
    Fixes #88
    (a related case, anyway; the actual input in #88 has already been fixed)
    jsm28 committed Oct 3, 2024
    Configuration menu
    Copy the full SHA
    c2ffe46 View commit details
    Browse the repository at this point in the history
  2. More thorough cleanup of input whitespace

    This improves the markdownify logic for cleaning up input whitespace
    that has no semantic significance in HTML.
    
    This PR uses a branch based on that for #150 (which in turn is based
    on that for #120) to avoid conflicts with those fixes.  The suggested
    order of merging is just first to merge #120, then the rest of #150,
    then the rest of this PR.
    
    Whitespace in HTML input isn't generally significant before or after
    block-level elements, or at the start of end of such an element other
    than `<pre>`.  There is some limited logic in markdownify for removing
    it, (a) for whitespace-only nodes in conjunction with a limited list
    of elements (and with questionable logic that ony removes whitespace
    adjacent to such an element when also inside such an element) and (b)
    only for trailing whitespace, in certain places in relation to lists.
    
    Replace both those places with more thorough logic using a common list
    of block-level elements (which could be expanded more).
    
    In general, this reduces the number of unnecessary blank lines in
    output from markdownify (sometimes lines with just a newline,
    sometimes lines containing a space as well as that newline).  There
    are open issues about cases where propagating such input whitespace to
    the output actually results in badly formed Markdown output (wrongly
    indented output), but #120 (which this builds on) fixes those issues,
    sometimes leaving unnecessary lines with just a space on them in the
    output, which are dealt with fully by the present PR.
    
    There are a few testcases that are affected because they were relying
    on such whitespace for good output from bad HTML input that used `<p>`
    or `<blockquote>` inside header tags.  To keep reasonable output in
    those cases of bad input now input whitespace adjacent to those two
    tags is ignored, make the `<p>` and `<blockquote>` output explicitly
    include leading and trailing spaces if `convert_as_inline`; such
    explicit spaces seem the best that can be done for such bad input.
    Given those fixes, all the remaining changes needed to the
    expectations of existing tests seem like improvements (removing
    useless spaces or newlines from the output).
    jsm28 committed Oct 3, 2024
    Configuration menu
    Copy the full SHA
    340aecb View commit details
    Browse the repository at this point in the history
  3. Fix logic for indentation inside list items

    This fixes problems with the markdownify logic for indentation inside
    list items.
    
    This PR uses a branch building on that for #120, #150 and #151, so
    those three PRs should be merged first before merging this one.
    
    There is limited logic in markdownify for handling indentation in the
    case of nested lists.  There are two major problems with this logic:
    
    * As it's in `convert_list`, causing a list to be indented when inside
      another list, it does not add indentation for any other elements
      such as paragraphs that may be found inside list items (or `<pre>`,
      `<blockquote>`, etc.), so such elements are wrongly not indented and
      terminate the list in the output.
    
    * It uses fixed indentation of one tab.  Following CommonMark, a tab
      in Markdown is considered equivalent to four spaces, which is not
      sufficient indentation in ordered list items with a number of three
      or more digits.
    
    Fix both of these issues by making `convert_li` handle indentation for
    the contents of `<li>`, based on the length of the list item marker,
    rather than doing it in `convert_list` at all.
    jsm28 committed Oct 3, 2024
    Configuration menu
    Copy the full SHA
    c13bdd5 View commit details
    Browse the repository at this point in the history

Commits on Oct 9, 2024

  1. Configuration menu
    Copy the full SHA
    7ff4d83 View commit details
    Browse the repository at this point in the history

Commits on Nov 20, 2024

  1. Configuration menu
    Copy the full SHA
    9bf4ff1 View commit details
    Browse the repository at this point in the history

Commits on Nov 24, 2024

  1. ignore bs4 warnings in tests

    AlexVonB committed Nov 24, 2024
    Configuration menu
    Copy the full SHA
    9202027 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    1978083 View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    54c7ca9 View commit details
    Browse the repository at this point in the history
  4. bump to version v0.14.0

    AlexVonB committed Nov 24, 2024
    Configuration menu
    Copy the full SHA
    fe8a821 View commit details
    Browse the repository at this point in the history
  5. Merge branch 'develop'

    AlexVonB committed Nov 24, 2024
    Configuration menu
    Copy the full SHA
    e935ce8 View commit details
    Browse the repository at this point in the history
Loading