
#66 allow more characters when transforming "names" to "ids".

Default
pending
nobody
None
5
2025-04-29
2019-09-30
No
`` encloses a role. There is a default role, else :<role>:`text`

_ in front, is the special target role. For one word the backtick can be dropped.

_`__init__` should produce a target named "__init__".

But instead the produced target is "init".
The backtick avoids ambiguity. There is no need for this behavior.

Discussion

  • David Goodger

    David Goodger - 2019-09-30
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,7 +1,8 @@
    -`` encloses a role. There is a default role, else :<role>:`text`
    +    `` encloses a role. There is a default role, else :<role>:`text`
    
    -_ in front, is the special target role. For one word the backtick can be dropped.
    +    _ in front, is the special target role. For one word the backtick can be dropped.
    
    -_`__init__` should produce a target named "__init__".
    +    _`__init__` should produce a target named "__init__".
    +
     But instead the produced target is "init".
     The backtick avoids ambiguity. There is no need for this behavior.
    

    Please be careful with using raw markup in a web form like this. SourceForge expects Markdown, which has enough similarities to reStructuredText that the markup will be interpreted/misinterpreted. Use Markdown to quote any markup, and check that the result makes sense when rendered (use the preview function).

     

    Last edit: Günter Milde 2025-04-29
  • David Goodger

    David Goodger - 2019-09-30

    When you say, "There is no need for this behavior", what behavior do you mean, exactly?

    It works fine for me. This input:

    $ rst2pseudoxml.py<<'EOF'
    a target _`__init__` in a paragraph
    
    EOF
    

    Produces this output:

    <document source="<stdin>">
        <paragraph>
            a target
            <target ids="init" names="__init__">
                __init__
             in a paragraph
    

    The target name is __init__. The ID drops the underscores, for the reasons explained in docutils.nodes.Element and docutils.nodes.make_id, e.g.:

    Docutils identifiers will conform to the regular expression
    [a-z](-?[a-z0-9]+)*. For CSS compatibility, identifiers (the "class"
    and "id" attributes) should have no underscores, colons, or periods.
    Hyphens may be used.
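
The reduction described in that docstring can be approximated with the standard library alone. The following is a minimal sketch, not the actual `docutils.nodes.make_id` code: the function name `sketch_make_id` is hypothetical, and the real function may handle additional cases that are omitted here.

```python
import re
import unicodedata


def sketch_make_id(name):
    """Approximate how a reference name is reduced to an ID.

    Assumption: simplified illustration, not the Docutils implementation.
    """
    # Lowercase, then strip accents by decomposing and dropping non-ASCII.
    text = unicodedata.normalize("NFKD", name.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace every run of characters outside [a-z0-9] with one hyphen.
    text = re.sub(r"[^a-z0-9]+", "-", text)
    # Drop leading hyphens/digits and trailing hyphens so the result
    # conforms to [a-z](-?[a-z0-9]+)*.
    return re.sub(r"^[-0-9]+|-+$", "", text)


print(sketch_make_id("__init__"))               # init
print(sketch_make_id("first explicit target"))  # first-explicit-target
```

Under these assumptions the underscores in `__init__` are first collapsed to hyphens and then trimmed from both ends, which is why only `init` survives as the ID while the name stays `__init__`.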

     

    Last edit: Günter Milde 2025-04-29
  • David Goodger

    David Goodger - 2019-09-30
    • status: open --> pending-works-for-me
    • assigned_to: David Goodger
     
  • Roland Puntaier

    Roland Puntaier - 2019-10-01

    According to
    https://www.w3.org/TR/CSS21/syndata.html#characters
    an identifier can start with two underscores in CSS.

    HTML5 allows the id value to start with two underscores (https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute).

    HTML5 id is specified to not contain spaces, but some browsers do support spaces nevertheless.
    HTML5 does not specify why it disallows spaces. It should therefore allow spaces.

    I made a related post about docutils changing IDs in 11/2018: https://sourceforge.net/p/docutils/mailman/message/36453416/

    My position is this:

    • id is used to name targets in the RST.
    • id is the identifier of a content item.
    • The same content item should have the same id independent of format (rst, html, pdf, ...)
    • How should a user target his content item, if every formatter chooses to modify his chosen id?

    The id should not be changed.

    Docutils should even keep spaces despite HTML5 disallowing them.
    If the user runs into a problem with a browser, he will change the id himself and know about it.
    Maybe he converts to just pdf, anyway.

    To summarize:

    RST is not html and does not need restrictions from HTML (or CSS) altogether.

    Docutils should develop in that direction.
    Relaxing rules does not produce backward incompatibility, either.

     
  • Günter Milde

    Günter Milde - 2019-10-01

    Ticket moved from /p/docutils/bugs/379/

     
  • Günter Milde

    Günter Milde - 2019-10-12

    My position is this:

    • id is used to name targets in the RST.
    • id is the identifier of a content item.

    In rST/Docutils, it is a bit more complicated:

    Docutils doctree elements may have multiple ids and names.

    In the reStructuredText source, only reference names are used for naming
    elements as well as referring to them. IDs are only used in generated documents.

    Reference names may be auto-derived from the content (e.g. section
    titles) or specified by the author via rST syntax (:name: option of
    directives, content of hyperlink targets, label of footnotes or citations).

    IDs_ are generated by Docutils (sometimes using names as base) when
    parsing rST or in transformations.

    • The same content item should have the same id independent of format
      (rst, html, pdf, ...)

    To achieve this, the id must be valid in all output formats supported by
    Docutils (HTML4.1/XHTML1, HTML5, LaTeX, troff (manpage), XML, ODF/ODT).

    HTML4.1:
    IDs must begin with a letter [A-Za-z] and may be followed by
    any number of letters, digits [0-9], or any of the characters -_:.

    HTML5:
    no whitespace

    LaTeX:
    only ASCII characters (32-127) except "%~#{}"
    * "{" and "}" might be used if balanced but this is not recommended.
    * Use of certain LaTeX packages results in more exceptions.
    * Spaces are allowed.
    * With XeTeX/LuaTeX, Unicode characters are allowed, too.

    https://tex.stackexchange.com/questions/18311/what-are-the-valid-names-as-labels

    ODT/ODF: <to be completed>

    troff: <to be completed>

    • How should a user target his content item, if every formatter chooses
      to modify his chosen id?

    Internal (rST source and included files/parent documents):
    use the reference name. This works independent of the output format.

    External:
    HTML: Use the generated id (when unsure about the transformation of a
    given name to id, look it up in the output).

    LaTeX: Use the id as label (e.g. in \ref{}). This works only if the
    external LaTeX source is combined with the Docutils-generated
    LaTeX source (i.e. one must include the other or both included in a
    common parent).

    PDF: named destinations are currently not supported in PDFs
    generated from Docutils-generated LaTeX.
    https://tex.stackexchange.com/questions/213860/how-to-generate-a-named-destination-in-pdf

    The id should not be changed.

    The id is (currently) generated once and used unchanged by the writers.

    Docutils should even keep spaces despite HTML5 disallowing them.

    Docutils policy is to create valid output. Until this restriction is
    lifted in the HTML5 standard, Docutils will not use spaces in HTML-IDs.
    Spaces are allowed in reference names.

    If the user runs into a problem with a browser, he will change the id
    himself and know about it.

    The author cannot change IDs or implicit reference names directly. If we
    kept spaces, any document with a section title containing whitespace
    would also contain spaces in the id of the corresponding section element.

    Maybe he converts to just pdf, anyway.

    Even worse: Accented characters, Umlauts, Greek, Cyrillic, etc. in section
    titles would lead to compilation errors with pdflatex.

    To summarize:

    RST is not html and does not need restrictions from HTML (or CSS) altogether.

    This is why the internal identifiers (reference names) don't
    have these limitations. The rules for reference names (whitespace
    normalization and downcasing) are solely based on practicability for rST.

    Identifiers in the generated documents must comply with the restrictions of
    the output document format.

    Docutils should develop in that direction.

    There are two alternatives:

    a) Keep ids identical across output formats. This would allow only the
    intersection of valid element identifiers.

    We could lift the restrictions of CSS1, as generated documents would
    still be valid XHTML1 and CSS selectors may use escaping or
    [argument] syntax.
    This would relax the requirements to complying with the regexp
    [A-Za-z][-_:.A-Za-z0-9]* (i.e. also allow underscore, colon,
    and full stop).

    b) Allow less restrictive identifiers in some formats:

    HTML is the format most probably linked to.

    The "html5" writer could use the name as ID, just replacing spaces.
    This would allow external links like
    http://example.com/parrot.html#1.Ιανουάριος.

    Or the restriction on the first character may be dropped with an exception
    for "html4css1".
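
To make alternative a) concrete, the following sketch checks a few candidate identifiers against the current pattern (as quoted from the Docutils docstring earlier in this thread) and against the relaxed pattern given above. The helper name `classify` and the sample identifiers are illustrative assumptions, not part of Docutils.

```python
import re

# Current pattern, as quoted from the Docutils docstring.
CURRENT = re.compile(r"[a-z](-?[a-z0-9]+)*")
# Relaxed pattern from alternative a): underscore, colon, full stop allowed.
RELAXED = re.compile(r"[A-Za-z][-_:.A-Za-z0-9]*")


def classify(identifier):
    """Return (valid under current rule, valid under relaxed rule)."""
    return (CURRENT.fullmatch(identifier) is not None,
            RELAXED.fullmatch(identifier) is not None)


for candidate in ["init", "some_title_id", "sec:1.2", "__init__"]:
    print(candidate, classify(candidate))
```

In this sketch `some_title_id` and `sec:1.2` become valid only under the relaxed pattern, while `__init__` is rejected by both because the first character must still be a letter. Accepting it unchanged would require dropping the first-character restriction as well, as considered in alternative b).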

    Relaxing rules does not produce backward incompatibility, either.

    No problem for internal links (unless we also change the rules for reference names).

    However, external links adapted to the current rules may break.

    Example: a document, parrot.rst contains::

    Schöner Titel: warum nicht?
    =====================
    

    and I link to this section from somewhere on the net with the URL
    http://example.org/parrot.html#schoner-titel-warum-nicht.
    This link will be broken after re-processing the unchanged source with a Docutils
    version with relaxed id-rules.

    Therefore, I would only change the rules after careful consideration and an
    advance warning. Possibly with an opt-in setting.

     

    Last edit: Günter Milde 2025-04-29
  • Roland Puntaier

    Roland Puntaier - 2019-10-13

    I've abbreviated the general concept of identifier with ID.
    In this general meaning a reference name is an ID,
    because you reference something by uniquely identifying it.

    If in docutils there are more reference names and ids
    then there are more ways to reference an item.
    That is OK.

    I was referring only to user chosen reference name_.
    Let's keep out IDs generated from headers or from :name:.
    I personally never rely on these generated IDs,
    because I don't know them.
    Instead I put .. _`some_title_id`: in front of a header.

    User chosen target IDs (reference name_ in rst) should not be changed.

    How are more reference names translated to html,
    e.g. for the above additional some_title_id?
    More IDs would allow to keep the legacy ID and add
    the unchanged user reference name as additional ID.
    Else one could add a docutils.conf setting to tell docutils which method to use.

    About multiple IDs in html:
    https://stackoverflow.com/questions/192048/can-an-html-element-have-multiple-ids
    See comment by BoltClock or the answer by tvanfosson.

     
  • Günter Milde

    Günter Milde - 2019-10-13

    User chosen target IDs (reference name in rst) should not be changed.

    • If the user confines herself to valid names, no change is done.
    • If the user uses invalid names, the output would be buggy in some output formats. If we want
      consistent identifiers, the same rules must apply to all output formats.

    Anchors with an unchecked, user-specified ID value could be specified using raw input, but this is not recommended.

    How are more reference names translated to html?

    Try it yourself:

    .. _first explicit target:
    .. _other explicit target:
    
    .. note::
       :name: refname from directive option
    
       the object
    

    If you export to Docutils-XML or ~pseudoxml, you will see the three names and ids of the note element. In the HTML, spans are used as anchors for the additional identifiers.

     

    Last edit: Günter Milde 2025-04-29
  • Roland Puntaier

    Roland Puntaier - 2019-10-15

    Docutils has versions.
    A new version is allowed to behave differently, according to semantic versioning.
    Everyone knows that.
    If someone uses a new version of docutils,
    it is that one's responsibility to integrate it into its context.
    Docutils should develop with the associated standards.
    HTML has standard 5 now.
    IDs should be modified only according to standard 5.
    This means that only spaces can be replaced
    when deriving HTML IDs.

     
  • Günter Milde

    Günter Milde - 2019-10-30

    If someone uses a new version of docutils,
    it is that one's responsibility to integrate it into its context.

    There is one problem, though: "Cool URIs don't change"
    (https://www.w3.org/Provider/Style/URI.html).
    When a new Docutils version produces different URIs for the same input, we
    should offer users a way to keep the old URIs.

    Docutils should develop with the associated standards.
    HTML has standard 5 now.

    HTML comes in many different versions. Docutils supports HTML5 with the
    "html5_polyglot" writer and XHTML1.1/transitional with the default writer
    "html4css1". The default may change in future.

    IDs should be modified only according to standard 5.
    This means that only spaces can be replaced
    when deriving HTML IDs.

    Identifier keys must be valid in all supported output formats.
    Therefore, they must comply with restrictions in the
    respective output formats (HTML4.1, HTML5, polyglot HTML,
    LaTeX, ODT, troff (manpage), XML).

    __ http://www.w3.org/TR/html401/types.html#type-name
    __ https://www.w3.org/TR/html50/dom.html#the-id-attribute
    __ https://www.w3.org/TR/html-polyglot/#id-attribute
    __ https://tex.stackexchange.com/questions/18311/what-are-the-valid-names-as-labels
    __ https://help.libreoffice.org/6.3/en-US/text/swriter/01/04040000.html?DbPAR=WRITER#bm_id4974211
    __ https://www.w3.org/TR/REC-xml/#id

    We may want to keep the "one ID format for all output formats". Then only
    the underscore (_) may be allowed in addition to the current
    transformation.

    +1 one rule is easier to remember than a set of different rules.
    -1 IDs must keep to a restrictive rule even in more relaxed output formats.

    Alternatively, we may allow different identifier transformations for each
    output format:

    +1 ID-transformation follows (almost) the relaxed rules of the output format.
    -1 More complex setup.
    -1 ID value used in the output is even harder to predict.

    A possible implementation would be via a new "identifier_restrictions"
    configuration setting that takes a list of rule sets (CSS1, HTML4, HTML5,
    XML, LaTeX, XeTeX/LuaTeX, ODT, troff) and combines them to form the required
    transformation.

    Examples:
    The current transformation would be identifier_restrictions: HTML4,CSS1.

    The "html5polyglot" section could use identifier_restrictions: XML, as
    polyglot HTML requires valid XML identifiers.

    A user may override this in a config file or with
    rst2html5 --identifier-restrictions=HTML5.
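
How combining rule sets might work can be sketched as follows. This is only an illustration under stated assumptions: `RULE_SETS` and `is_valid` are hypothetical names, each pattern encodes only the restrictions mentioned in this thread, and the sketch validates identifiers rather than transforming them as the proposed setting would.

```python
import re

# Hypothetical rule sets; each pattern covers only what this thread states.
RULE_SETS = {
    # Letter first, then letters, digits, or any of "-_:."
    "HTML4": re.compile(r"[A-Za-z][-_:.A-Za-z0-9]*"),
    # No underscores, colons, or periods.
    "CSS1": re.compile(r"[^_:.]*"),
    # Non-empty, no whitespace.
    "HTML5": re.compile(r"\S+"),
    # ASCII 32-127 except the characters % ~ # { }
    "LaTeX": re.compile(r"(?:(?![%~#{}])[\x20-\x7f])*"),
}


def is_valid(identifier, restrictions):
    """True if the identifier satisfies every named rule set."""
    return all(RULE_SETS[name].fullmatch(identifier)
               for name in restrictions)


print(is_valid("some_title_id", ["HTML4"]))          # True
print(is_valid("some_title_id", ["HTML4", "CSS1"]))  # False
print(is_valid("__init__", ["HTML5"]))               # True
```

Listing several rule sets yields their intersection, so `HTML4,CSS1` is stricter than either set alone. Note that this simplified combination is still looser than the current Docutils pattern, which additionally lowercases and forbids consecutive hyphens.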

     

    Last edit: Günter Milde 2025-04-29
  • Roland Puntaier

    Roland Puntaier - 2019-10-31

    This is a nice solution. I would also have a special --identifier-restrictions=none to turn off all ID mappings.

     
  • Günter Milde

    Günter Milde - 2020-03-25

    I attach an experimental implementation draft and tests for exploration.

     
  • Roland Puntaier

    Roland Puntaier - 2020-03-26

    I like this:

    • It allows using the same ID for all output formats that support it,
      which are many, considering HTML5, ODT, XeTeX and XML.

    • It also means that the generated documents of these formats all have the same ID for the same content,
      including the RST source

    • It stores the ID language restrictions of different target formats within docutils

    Regarding API, I would make your trim_name() the new make_id():

    ```python
    # make_id_legacy = make_id
    def make_id(string, language="legacy"):
        ...
    ```
    

    The has_prefix argument shouldn't be needed because it is determined by the ID format data id_start and id_char.

    In the command line interface I would also default to legacy,
    because of "Cool URIs don't change" and to avoid the necessity to change people's scripts.

    I did not compare the ID language data in your py file with the documentation of the corresponding formats.

     
  • Günter Milde

    Günter Milde - 2020-08-24
    • summary: .. ___init__: becomes <p id="init"> instead of <p id="__init__"> --> allow more characters when transforming "names" to "ids".
    • status: pending-works-for-me --> pending
     
  • Günter Milde

    Günter Milde - 2020-08-24

    See also bug #207 (closed as a duplicate of this request).

     
  • Günter Milde

    Günter Milde - 2024-03-28

    On 2024-03-25, a user posted a request to docutils-develop with a use case where the identifier-normalization of class names stands in the way:

    I've recently started using Plausible for analytics tracking. Unfortunately
    their naming schema for custom events is incompatible with docutils CSS
    class normalization, since Plausible relies on either "=" or "--" for
    naming custom events (plausible docs).

    Reading through the rationale for the chosen normalization,
    why is "=" excluded? Would an update ever be considered where "=" was
    allowed in class names for docutils?

    One may also consider uncoupling the handling of class names and
    identifiers, as restrictions in the output formats may differ and
    class names are more directly user-set, so there may be less side-effects.

     

    Last edit: Günter Milde 2024-03-28
  • Günter Milde

    Günter Milde - 2025-04-29
    • assigned_to: David Goodger --> nobody
     
