Docutils: Documentation Utilities / Feature Requests / #66 allow more characters when transforming "names" to "ids".

Description has changed:

Diff:

--- old
+++ new
@@ -1,7 +1,8 @@
-`` encloses a role. There is a default role, else :<role>:`text`
+    `` encloses a role. There is a default role, else :<role>:`text`

-_ in front, is the special target role. For one word the backtick can be dropped.
+    _ in front, is the special target role. For one word the backtick can be dropped.

-_`__init__` should produce a target named "__init__".
+    _`__init__` should produce a target named "__init__".
+
 But instead the produced target is "init".
 The backtick avoids ambiguity. There is no need for this behavior.

Please be careful when using raw markup in a web form like this. SourceForge expects Markdown, which is similar enough to reStructuredText that the markup will be interpreted (or misinterpreted). Use Markdown to quote any markup, and check that the result makes sense when rendered (use the preview function).

Last edit: Günter Milde 2025-04-29

David Goodger - 2019-09-30

When you say, "There is no need for this behavior", what behavior do you mean, exactly?

It works fine for me. This input:

$ rst2pseudoxml.py <<'EOF'
a target _`__init__` in a paragraph
EOF

Produces this output:

<document source="<stdin>">
    <paragraph>
        a target
        <target ids="init" names="__init__">
            __init__
         in a paragraph

The target name is __init__. The ID drops the underscores, for the reasons explained in docutils.nodes.Element and docutils.nodes.make_id, e.g.:

Docutils identifiers will conform to the regular expression
[a-z](-?[a-z0-9]+)*. For CSS compatibility, identifiers (the "class"
and "id" attributes) should have no underscores, colons, or periods.
Hyphens may be used.
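
The effect of this rule on names that appear in this ticket can be reproduced with a simplified sketch. This is not the actual `docutils.nodes.make_id` (which additionally applies translation tables for characters that have no ASCII decomposition); the function name `simple_make_id` is made up for illustration.

```python
import re
import unicodedata

_NON_ID_CHARS = re.compile(r"[^a-z0-9]+")
_NON_ID_AT_ENDS = re.compile(r"^[-0-9]+|-+$")


def simple_make_id(name):
    """Approximate the name-to-id transformation (illustrative only)."""
    ident = name.lower()
    # Decompose accented characters, then drop everything that is not ASCII.
    ident = unicodedata.normalize("NFKD", ident).encode("ascii", "ignore").decode("ascii")
    # Normalize whitespace, then turn every run of other characters into "-".
    ident = _NON_ID_CHARS.sub("-", " ".join(ident.split()))
    # Strip leading hyphens/digits and trailing hyphens.
    return _NON_ID_AT_ENDS.sub("", ident)


print(simple_make_id("__init__"))                     # init
print(simple_make_id("Schöner Titel: warum nicht?"))  # schoner-titel-warum-nicht
```

Under these assumptions the underscores of `__init__` first become hyphens and are then stripped at both ends, which is why only `init` remains.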

Last edit: Günter Milde 2025-04-29

David Goodger - 2019-09-30

status: open --> pending-works-for-me

assigned_to: David Goodger

Roland Puntaier - 2019-10-01

According to
https://www.w3.org/TR/CSS21/syndata.html#characters
an identifier can start with two underscores in CSS.

HTML5 allows the id value to start with two underscores (https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute).

HTML5 id is specified to not contain spaces, but some browsers do support spaces nevertheless.
HTML5 does not specify why it disallows spaces. It should therefore allow spaces.

I made a related post about docutils changing IDs in 11/2018: https://sourceforge.net/p/docutils/mailman/message/36453416/

My position is this:

id is used to name targets in the RST.

id is the identifier of a content item.

The same content item should have the same id independent of format (rst, html, pdf, ...)

How should a user target his content item, if every formatter chooses to modify his chosen id?

The id should not be changed.

Docutils should even keep spaces despite HTML5 disallowing them.
If the user runs into a problem with a browser, he will change the id himself and know about it.
Maybe he converts to just pdf, anyway.

To summarize:

RST is not html and does not need restrictions from HTML (or CSS) altogether.

Docutils should develop in that direction.
Relaxing rules does not produce backward incompatibility, either.

Günter Milde - 2019-10-01

Ticket moved from /p/docutils/bugs/379/


Günter Milde - 2019-10-12

My position is this:

id is used to name targets in the RST.

id is the identifier of a content item.

In rST/Docutils, it is a bit more complicated:

Docutils doctree elements may have multiple ids and names.

In the reStructuredText source, only reference names are used for naming
elements as well as referring to them. IDs are only used in generated documents.

Reference names may be auto-derived from the content (e.g. section
titles) or specified by the author via rST syntax (:name: option of
directives, content of hyperlink targets, label of footnotes or citations).

IDs_ are generated by Docutils (sometimes using names as base) when
parsing rST or in transformations.

The same content item should have the same id independent of format
(rst, html, pdf, ...)

To achieve this, the id must be valid in all output formats supported by
Docutils (HTML4.1/XHTML1, HTML5, LaTeX, troff (manpage), XML, ODF/ODT).

HTML4.1:
IDs must begin with a letter [A-Za-z] and may be followed by
any number of letters, digits [0-9], or any of the characters -_:.

HTML5:
no whitespace

LaTeX:
only ASCII characters (32-127) except "%~#{}"
* "{" and "}" might be used if balanced but this is not recommended.
* Use of certain LaTeX packages results in more exceptions.
* Spaces are allowed.
* With XeTeX/LuaTeX, Unicode characters are allowed, too.

https://tex.stackexchange.com/questions/18311/what-are-the-valid-names-as-labels

ODT/ODF: <to be completed>

troff: <to be completed>
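
To see how small the intersection of these rules is, the three rules that are spelled out above can be written as patterns and applied to a few candidate identifiers. This is only a sketch: the names `ID_PATTERNS` and `valid_formats` are hypothetical, "latex" stands for plain pdflatex without additional packages, and ODT and troff are left out because their rules are not stated here.

```python
import re

# Patterns transcribed from the rules listed above (illustrative, not exhaustive).
ID_PATTERNS = {
    # A letter, followed by letters, digits, or any of "-_:."
    "html4": re.compile(r"[A-Za-z][-_:.A-Za-z0-9]*"),
    # Anything non-empty that contains no whitespace.
    "html5": re.compile(r"\S+"),
    # ASCII 32-127, except the characters % ~ # { }
    "latex": re.compile(r"(?:(?![%~#{}])[\x20-\x7f])+"),
}


def valid_formats(identifier):
    """Return the formats whose rule accepts the identifier unchanged."""
    return [fmt for fmt, pattern in ID_PATTERNS.items() if pattern.fullmatch(identifier)]


for candidate in ("init", "__init__", "some title", "Ιανουάριος"):
    print(repr(candidate), valid_formats(candidate))
```

Of the four candidates, only `init` is accepted by all three patterns, which illustrates why a single identifier shared by all output formats has to be quite restrictive.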

How should a user target his content item, if every formatter chooses
to modify his chosen id?

Internal (rST source and included files/parent documents):
use the reference name. This works independent of the output format.

External:
HTML: Use the generated id (when unsure about the transformation of a
given name to id, look it up in the output).

LaTeX: Use the id as label (e.g. in \ref{}). This works only if the
external LaTeX source is combined with the Docutils-generated
LaTeX source (i.e. one must include the other or both included in a
common parent).

PDF: named destinations are currently not supported in PDFs
generated from Docutils-generated LaTeX.
https://tex.stackexchange.com/questions/213860/how-to-generate-a-named-destination-in-pdf

The id should not be changed.

The id is (currently) generated once and used unchanged by the writers.

Docutils should even keep spaces despite HTML5 disallowing them.

Docutils policy is to create valid output. Until this restriction is
lifted in the HTML5 standard, Docutils will not use spaces in HTML-IDs.
Spaces are allowed in reference names.

If the user runs into a problem with a browser, he will change the id
himself and know about it.

The author cannot change IDs nor implicit reference names directly. If we
would keep spaces, any document with a section title containing whitespace
would also contain spaces in the id of the corresponding section element.

Maybe he converts to just pdf, anyway.

Even worse: Accented characters, Umlauts, Greek, Cyrillic, etc. in section
titles would lead to compilation errors with pdflatex.

To summarize:

RST is not html and does not need restrictions from HTML (or CSS) altogether.

This is why the internal identifiers (reference names) don't
have these limitations. The rules for reference names (whitespace
normalization and downcasing) are solely based on practicability for rST.

Identifiers in the generated documents must comply with the restrictions of
the output document format.

Docutils should develop in that direction.

There are two alternatives:

a) Keep ids identical across output formats. This would allow only the
intersection of valid element identifiers.

We could lift the restrictions of CSS1, as generated documents would
still be valid XHTML1 and CSS selectors may use escaping or
[argument] syntax.
This would relax the requirements to complying with the regexp
[A-Za-z][-_:.A-Za-z0-9]* (i.e. also allow underscore, colon,
and full stop).

b) Allow less restrictive identifiers in some formats:

HTML is the format most probably linked to.

The "html5" writer could use the name as ID, just replacing spaces.
This would allow external links like
http://example.com/parrot.html#1.Ιανουάριος.

Or the restriction on the first character may be dropped with an exception
for "html4css1".
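
A minimal sketch of alternative b) for HTML5, assuming that each run of whitespace is replaced by a single hyphen (the replacement character is not fixed in this discussion, and the function name `html5_id_from_name` is hypothetical):

```python
def html5_id_from_name(name, replacement="-"):
    """Replace each run of whitespace and keep every other character."""
    return replacement.join(name.split())


print(html5_id_from_name("schöner titel: warum nicht?"))
```

Non-ASCII letters and punctuation survive unchanged in this variant, so the resulting id stays close to the reference name, at the cost of no longer being valid for the "html4css1" writer or pdflatex.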

Relaxing rules does not produce backward incompatibility, either.

No problem for internal links (unless we also change the rules for reference names).

However, external links adapted to the current rules may break.

Example: a document, parrot.rst contains::

    Schöner Titel: warum nicht?
    ===========================

and I link to this section from somewhere on the net with the URL
http://example.org/parrot.html#schoner-titel-warum-nicht.
This link will be broken after re-processing the unchanged source with a Docutils
version with relaxed id-rules.

Therefore, I would only change the rules after careful consideration and an
advance warning. Possibly with an opt-in setting.

Last edit: Günter Milde 2025-04-29

Roland Puntaier - 2019-10-13

I've abbreviated the general concept of identifier with ID.
In this general meaning a reference name is an ID,
because you reference something by uniquely identifying it.

If in docutils there are more reference names and ids
then there are more ways to reference an item.
That is OK.

I was referring only to user chosen reference name_.
Let's leave out IDs generated from headers or from :name:.
I personally never rely on these generated IDs,
because I don't know them.
Instead I put .. _`some_title_id`: in front of a header.

User chosen target IDs (reference name_ in rst) should not be changed.

How are more reference names translated to html,
e.g. for the above additional some_title_id?
More IDs would allow to keep the legacy ID and add
the unchanged user reference name as additional ID.
Else one could add a docutils.conf setting to tell docutils which method to use.

About multiple IDs in html:
https://stackoverflow.com/questions/192048/can-an-html-element-have-multiple-ids
See comment by BoltClock or the answer by tvanfosson.


Günter Milde - 2019-10-13

User chosen target IDs (reference name in rst) should not be changed.

If the user confines herself to valid names, no change is done.

If the user uses invalid names, the output would be buggy in some output formats. If we want
consistent identifiers, the same rules must apply to all output formats.

Anchors with an unchecked, user-specified ID value could be created using raw input, but this is not recommended.

How are more reference names translated to html?

Try yourself:

.. _first explicit target:
.. _other explicit target:
.. note::
   :name: refname from directive option

   the object

If you export to Docutils-XML or ~pseudoxml, you will see the three names and ids of the note element. In the HTML, spans are used as anchors for the additional identifiers.

Last edit: Günter Milde 2025-04-29

Roland Puntaier - 2019-10-15

Docutils has versions.
A new version is allowed to behave differently, according to semantic versioning.
Everyone knows that.
If someone uses a new version of docutils,
it is that one's responsibility to integrate it into its context.
Docutils should develop with the associated standards.
HTML has standard 5 now.
IDs should be modified only according to standard 5.
This means that only spaces can be replaced
when deriving HTML IDs.


Günter Milde - 2019-10-30

If someone uses a new version of docutils,
it is that one's responsibility to integrate it into its context.

There is one problem, though: "Cool URIs don't change"
(https://www.w3.org/Provider/Style/URI.html).
When a new Docutils version produces different URIs for the same input, we
should offer users a way to keep the old URIs.

Docutils should develop with the associated standards.
HTML has standard 5 now.

HTML comes in many different versions. Docutils supports HTML5 with the
"html5_polyglot" writer and XHTML1.1/transitional with the default writer
"html4css1". The default may change in future.

IDs should be modified only according to standard 5.
This means that only spaces can be replaced
when deriving HTML IDs.

Identifier keys must be valid in all supported output formats.
Therefore, they must comply with restrictions in the
respective output formats (HTML4.1, HTML5, polyglot HTML,
LaTeX, ODT, troff (manpage), XML).

__ http://www.w3.org/TR/html401/types.html#type-name
__ https://www.w3.org/TR/html50/dom.html#the-id-attribute
__ https://www.w3.org/TR/html-polyglot/#id-attribute
__ https://tex.stackexchange.com/questions/18311/what-are-the-valid-names-as-labels
__ https://help.libreoffice.org/6.3/en-US/text/swriter/01/04040000.html?DbPAR=WRITER#bm_id4974211
__ https://www.w3.org/TR/REC-xml/#id

We may want to keep the "one ID format for all output formats". Then only
the underscore (_) may be allowed in addition to the current
transformation.

+1 one rule is easier to remember than a set of different rules.
-1 IDs must keep to a restrictive rule even in more relaxed output formats.

Alternatively, we may allow different identifier transformations for each
output format:

+1 ID-transformation follows (almost) the relaxed rules of the output format.
-1 More complex setup.
-1 ID value used in the output is even harder to predict.

A possible implementation would be via a new "identifier_restrictions"
configuration setting that takes a list of rule sets (CSS1, HTML4, HTML5,
XML, LaTeX, XeTeX/LuaTeX, ODT, troff) and combines them to form the required
transformation.

Examples:
The current transformation would be identifier_restrictions: HTML4,CSS1.

The "html5polyglot" section could use identifier_restrictions: XML, as
polyglot HTML requires valid XML identifiers.

A user may override this in a config file or with
rst2html5 --identifier-restrictions=HTML5.
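
How such a setting might combine rule sets can be sketched as follows. Everything here is an assumption made for illustration and not an existing Docutils API: the names `RULE_SETS` and `restrict_identifier`, the simplified character classes, and the choice to replace each run of rejected characters with one hyphen. Downcasing, which the current transformation also performs, is left out.

```python
import re

# Hypothetical rule sets: (pattern for the first character, pattern for later ones).
RULE_SETS = {
    "HTML4": (re.compile(r"[A-Za-z]"), re.compile(r"[-_:.A-Za-z0-9]")),
    "CSS1": (re.compile(r"[A-Za-z]"), re.compile(r"[-A-Za-z0-9]")),
    "HTML5": (re.compile(r"\S"), re.compile(r"\S")),
}


def restrict_identifier(name, restrictions):
    """Keep only characters that every selected rule set accepts."""
    rules = [RULE_SETS[key] for key in restrictions]

    def allowed(char, position):
        # position 0: first character, position 1: any later character
        return all(rule[position].fullmatch(char) for rule in rules)

    kept = []
    for char in name:
        if allowed(char, 1):
            kept.append(char)
        elif kept and kept[-1] != "-":
            kept.append("-")  # one hyphen per run of rejected characters
    ident = "".join(kept).rstrip("-")
    # Drop leading characters that may not start an identifier.
    while ident and not allowed(ident[0], 0):
        ident = ident[1:]
    return ident


print(restrict_identifier("__init__", ["HTML4", "CSS1"]))  # init
print(restrict_identifier("__init__", ["HTML5"]))          # __init__
```

Because a character survives only if all selected rule sets accept it, listing more rule sets can only make the result more restrictive, which matches the idea of combining restrictions.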

Last edit: Günter Milde 2025-04-29


Roland Puntaier - 2019-10-31

This is a nice solution. I would also have a special --identifier-restrictions=none to turn off all ID mappings.


Günter Milde - 2020-03-25

I attach an experimental implementation draft and tests for exploration.

id_chars.py


Roland Puntaier - 2020-03-26

I like this:

It allows to use the same ID for output formats that support it,
which are a lot considering HTML5, ODT, XeTeX and XML.

It also means that the generated documents of these formats all have the same ID for the same content,
including the RST source

It stores the ID language restrictions of different target formats within docutils

Regarding API, I would make your trim_name() the new make_id():

```python
# make_id_legacy = make_id
def make_id(string, language="legacy"):
    ...
```

The has_prefix shouldn't be needed because determined by the ID format data id_start and id_char.

In the command line interface I would also default to legacy,
because of "Cool URIs don't change" and to avoid the necessity to change people's scripts.

I did not compare the ID language data in your py file with the documentation of the according formats.

Günter Milde - 2020-08-24

summary: .. ___init__: becomes <p id="init"> instead of <p id="__init__"> --> allow more characters when transforming "names" to "ids".

status: pending-works-for-me --> pending

Günter Milde - 2020-08-24

See also bug #207 (closed as a duplicate of this request).


Günter Milde - 2024-03-28

On 2024-03-25, a user posted a request to docutils-develop with a use case where the identifier-normalization of class names stands in the way:

I've recently started using Plausible for analytics tracking. Unfortunately
their naming schema for custom events is incompatible with docutils CSS
class normalization, since Plausible relies on either "=" or "--" for
naming custom events (plausible docs).

Reading through the rationale for the chosen normalization,
why is "=" excluded? Would an update ever be considered where "=" was
allowed in class names for docutils?

One may also consider uncoupling the handling of class names and
identifiers, as restrictions in the output formats may differ and
class names are more directly user-set, so there may be less side-effects.

Last edit: Günter Milde 2024-03-28


Günter Milde - 2025-04-29

assigned_to: David Goodger --> nobody