[lex] Reorganize contents to follow grammar and phases of translation

AlisdairM · AlisdairM · commit 417dbb49a651 · 2024-09-27T15:37:17.000-04:00
This PR purely moves existing words around, and does not create any
new content.  It would be the precursor to a larger change set that
might integrate [cpp] into lex, or move it adjacent, and similarly
move [modules] adjacent.

This may be more significant than we want to land in C++26, but I
offer it while time is still available, and as inspiration for the
C++29 reorg otherwise.
diff --git a/source/lex.tex b/source/lex.tex
@@ -465,30 +465,48 @@
 for the respective Unicode encoding form.
 \indextext{character set|)}
 
+\rSec1[lex.comment]{Comments}
+
+\pnum
+\indextext{comment|(}%
+\indextext{comment!\tcode{/*} \tcode{*/}}%
+\indextext{comment!\tcode{//}}%
+The characters \tcode{/*} start a comment, which terminates with the
+characters \tcode{*/}. These comments do not nest.
+\indextext{comment!\tcode{//}}%
+The characters \tcode{//} start a comment, which terminates immediately before the
+next new-line character. If there is a form-feed or a vertical-tab
+character in such a comment, only whitespace characters shall appear
+between it and the new-line that terminates the comment; no diagnostic
+is required.
+\begin{note}
+The comment characters \tcode{//}, \tcode{/*},
+and \tcode{*/} have no special meaning within a \tcode{//} comment and
+are treated just like other characters. Similarly, the comment
+characters \tcode{//} and \tcode{/*} have no special meaning within a
+\tcode{/*} comment.
+\end{note}
+\indextext{comment|)}
+
 \rSec1[lex.pptoken]{Preprocessing tokens}
 
 \indextext{token!preprocessing|(}%
 \begin{bnf}
 \nontermdef{preprocessing-token}\br
-    header-name\br
     import-keyword\br
     module-keyword\br
     export-keyword\br
-    identifier\br
+    header-name\br
     pp-number\br
+    preprocessing-op-or-punc\br
+    identifier\br
     character-literal\br
     user-defined-character-literal\br
     string-literal\br
     user-defined-string-literal\br
-    preprocessing-op-or-punc\br
     \textnormal{each non-whitespace character that cannot be one of the above}
 \end{bnf}
 
-\pnum
-Each preprocessing token that is converted to a token\iref{lex.token}
-shall have the lexical form of a keyword, an identifier, a literal,
-or an operator or punctuator.
-
 \pnum
 A preprocessing token is the minimal lexical element of the language in translation
 phases 3 through 6.
@@ -523,6 +541,22 @@
 between the quotation characters in a character literal or
 string literal.
 
+\pnum
+Each preprocessing token that is converted to a token\iref{lex.token}
+shall have the lexical form of a keyword, an identifier, a literal,
+or an operator or punctuator.
+
+\pnum
+The \grammarterm{import-keyword} is produced
+by preprocessing an \keyword{import} directive\iref{cpp.import},
+the \grammarterm{module-keyword} is produced
+by preprocessing a \keyword{module} directive\iref{cpp.module}, and
+the \grammarterm{export-keyword} is produced
+by preprocessing either of the previous two directives.
+\begin{note}
+None has any observable spelling.
+\end{note}
+
 \pnum
 If the input stream has been parsed into preprocessing tokens up to a
 given character:
@@ -562,24 +596,14 @@
 \end{itemize}
 \end{itemize}
 
+\pnum
 \begin{example}
 \begin{codeblock}
 #define R "x"
 const char* s = R"y";           // ill-formed raw string, not \tcode{"x" "y"}
 \end{codeblock}
 \end{example}
 
-\pnum
-The \grammarterm{import-keyword} is produced
-by preprocessing an \keyword{import} directive\iref{cpp.import},
-the \grammarterm{module-keyword} is produced
-by preprocessing a \keyword{module} directive\iref{cpp.module}, and
-the \grammarterm{export-keyword} is produced
-by preprocessing either of the previous two directives.
-\begin{note}
-None has any observable spelling.
-\end{note}
-
 \pnum
 \begin{example}
 The program fragment \tcode{0xe+foo} is parsed as a
@@ -602,106 +626,6 @@
 \end{example}
 \indextext{token!preprocessing|)}
 
-\rSec1[lex.digraph]{Alternative tokens}
-
-\pnum
-\indextext{token!alternative|(}%
-Alternative token representations are provided for some operators and
-punctuators.
-\begin{footnote}
-\indextext{digraph}%
-These include ``digraphs'' and additional reserved words. The term
-``digraph'' (token consisting of two characters) is not perfectly
-descriptive, since one of the alternative \grammarterm{preprocessing-token}s is
-\tcode{\%:\%:} and of course several primary tokens contain two
-characters. Nonetheless, those alternative tokens that aren't lexical
-keywords are colloquially known as ``digraphs''.
-\end{footnote}
-
-\pnum
-In all respects of the language, each alternative token behaves the
-same, respectively, as its primary token, except for its spelling.
-\begin{footnote}
-Thus the ``stringized'' values\iref{cpp.stringize} of
-\tcode{[} and \tcode{<:} will be different, maintaining the source
-spelling, but the tokens can otherwise be freely interchanged.
-\end{footnote}
-The set of alternative tokens is defined in
-\tref{lex.digraph}.
-
-\begin{tokentable}{Alternative tokens}{lex.digraph}{Alternative}{Primary}
-\tcode{<\%}             &   \tcode{\{}         &
-\keyword{and}           &   \tcode{\&\&}       &
-\keyword{and_eq}        &   \tcode{\&=}        \\ \rowsep
-\tcode{\%>}             &   \tcode{\}}         &
-\keyword{bitor}         &   \tcode{|}          &
-\keyword{or_eq}         &   \tcode{|=}         \\ \rowsep
-\tcode{<:}              &   \tcode{[}          &
-\keyword{or}            &   \tcode{||}         &
-\keyword{xor_eq}        &   \tcode{\caret=}    \\ \rowsep
-\tcode{:>}              &   \tcode{]}          &
-\keyword{xor}           &   \tcode{\caret}     &
-\keyword{not}           &   \tcode{!}          \\ \rowsep
-\tcode{\%:}             &   \tcode{\#}         &
-\keyword{compl}         &   \tcode{\~}         &
-\keyword{not_eq}        &   \tcode{!=}         \\ \rowsep
-\tcode{\%:\%:}          &   \tcode{\#\#}       &
-\keyword{bitand}        &   \tcode{\&}         &
-                        &                      \\
-\end{tokentable}%
-\indextext{token!alternative|)}
-
-\rSec1[lex.token]{Tokens}
-
-\indextext{token|(}%
-\begin{bnf}
-\nontermdef{token}\br
-    identifier\br
-    keyword\br
-    literal\br
-    operator-or-punctuator
-\end{bnf}
-
-\pnum
-\indextext{\idxgram{token}}%
-There are five kinds of tokens: identifiers, keywords, literals,%
-\begin{footnote}
-Literals include strings and character and numeric literals.
-\end{footnote}
-operators, and other separators.
-\indextext{whitespace}%
-Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments
-(collectively, ``whitespace''), as described below, are ignored except
-as they serve to separate tokens.
-\begin{note}
-Whitespace can separate otherwise adjacent identifiers, keywords, numeric
-literals, and alternative tokens containing alphabetic characters.
-\end{note}
-\indextext{token|)}
-
-\rSec1[lex.comment]{Comments}
-
-\pnum
-\indextext{comment|(}%
-\indextext{comment!\tcode{/*} \tcode{*/}}%
-\indextext{comment!\tcode{//}}%
-The characters \tcode{/*} start a comment, which terminates with the
-characters \tcode{*/}. These comments do not nest.
-\indextext{comment!\tcode{//}}%
-The characters \tcode{//} start a comment, which terminates immediately before the
-next new-line character. If there is a form-feed or a vertical-tab
-character in such a comment, only whitespace characters shall appear
-between it and the new-line that terminates the comment; no diagnostic
-is required.
-\begin{note}
-The comment characters \tcode{//}, \tcode{/*},
-and \tcode{*/} have no special meaning within a \tcode{//} comment and
-are treated just like other characters. Similarly, the comment
-characters \tcode{//} and \tcode{/*} have no special meaning within a
-\tcode{/*} comment.
-\end{note}
-\indextext{comment|)}
-
 \rSec1[lex.header]{Header names}
 
 \indextext{header!name|(}%
@@ -791,6 +715,96 @@
 a \grammarterm{floating-point-literal} token.%
 \indextext{number!preprocessing|)}
 
+\rSec1[lex.operators]{Operators and punctuators}
+
+\pnum
+\indextext{operator|(}%
+\indextext{punctuator|(}%
+The lexical representation of \Cpp{} programs includes a number of
+preprocessing tokens that are used in the syntax of the preprocessor or
+are converted into tokens for operators and punctuators:
+
+\begin{bnf}
+\nontermdef{preprocessing-op-or-punc}\br
+    preprocessing-operator\br
+    operator-or-punctuator
+\end{bnf}
+
+\begin{bnf}
+%% Ed. note: character protrusion would misalign various operators.
+\microtypesetup{protrusion=false}\obeyspaces
+\nontermdef{preprocessing-operator} \textnormal{one of}\br
+    \terminal{\#        \#\#       \%:       \%:\%:}
+\end{bnf}
+
+\begin{bnf}
+\microtypesetup{protrusion=false}\obeyspaces
+\nontermdef{operator-or-punctuator} \textnormal{one of}\br
+    \terminal{\{        \}        [        ]        (        )}\br
+    \terminal{<:       :>       <\%       \%>       ;        :        ...}\br
+    \terminal{?        ::       .        .*       ->       ->*      \~}\br
+    \terminal{!        +        -        *        /        \%        \caret{}        \&        |}\br
+    \terminal{=        +=       -=       *=       /=       \%=       \caret{}=       \&=       |=}\br
+    \terminal{==       !=       <        >        <=       >=       <=>      \&\&       ||}\br
+    \terminal{<<       >>       <<=      >>=      ++       --       ,}\br
+    \terminal{\keyword{and}      \keyword{or}       \keyword{xor}      \keyword{not}      \keyword{bitand}   \keyword{bitor}    \keyword{compl}}\br
+    \terminal{\keyword{and_eq}   \keyword{or_eq}    \keyword{xor_eq}   \keyword{not_eq}}
+\end{bnf}
+
+Each \grammarterm{operator-or-punctuator} is converted to a single token
+in translation phase 7\iref{lex.phases}.%
+\indextext{punctuator|)}%
+\indextext{operator|)}
+
+\rSec1[lex.digraph]{Alternative tokens}
+
+\pnum
+\indextext{token!alternative|(}%
+Alternative token representations are provided for some operators and
+punctuators.
+\begin{footnote}
+\indextext{digraph}%
+These include ``digraphs'' and additional reserved words. The term
+``digraph'' (token consisting of two characters) is not perfectly
+descriptive, since one of the alternative \grammarterm{preprocessing-token}s is
+\tcode{\%:\%:} and of course several primary tokens contain two
+characters. Nonetheless, those alternative tokens that aren't lexical
+keywords are colloquially known as ``digraphs''.
+\end{footnote}
+
+\pnum
+In all respects of the language, each alternative token behaves the
+same, respectively, as its primary token, except for its spelling.
+\begin{footnote}
+Thus the ``stringized'' values\iref{cpp.stringize} of
+\tcode{[} and \tcode{<:} will be different, maintaining the source
+spelling, but the tokens can otherwise be freely interchanged.
+\end{footnote}
+The set of alternative tokens is defined in
+\tref{lex.digraph}.
+
+\begin{tokentable}{Alternative tokens}{lex.digraph}{Alternative}{Primary}
+\tcode{<\%}             &   \tcode{\{}         &
+\keyword{and}           &   \tcode{\&\&}       &
+\keyword{and_eq}        &   \tcode{\&=}        \\ \rowsep
+\tcode{\%>}             &   \tcode{\}}         &
+\keyword{bitor}         &   \tcode{|}          &
+\keyword{or_eq}         &   \tcode{|=}         \\ \rowsep
+\tcode{<:}              &   \tcode{[}          &
+\keyword{or}            &   \tcode{||}         &
+\keyword{xor_eq}        &   \tcode{\caret=}    \\ \rowsep
+\tcode{:>}              &   \tcode{]}          &
+\keyword{xor}           &   \tcode{\caret}     &
+\keyword{not}           &   \tcode{!}          \\ \rowsep
+\tcode{\%:}             &   \tcode{\#}         &
+\keyword{compl}         &   \tcode{\~}         &
+\keyword{not_eq}        &   \tcode{!=}         \\ \rowsep
+\tcode{\%:\%:}          &   \tcode{\#\#}       &
+\keyword{bitand}        &   \tcode{\&}         &
+                        &                      \\
+\end{tokentable}%
+\indextext{token!alternative|)}
+
 \rSec1[lex.name]{Identifiers}
 
 \indextext{identifier|(}%
@@ -912,6 +926,34 @@
 \end{itemize}%
 \indextext{identifier|)}
 
+\rSec1[lex.token]{Tokens}
+
+\indextext{token|(}%
+\begin{bnf}
+\nontermdef{token}\br
+    identifier\br
+    keyword\br
+    literal\br
+    operator-or-punctuator
+\end{bnf}
+
+\pnum
+\indextext{\idxgram{token}}%
+There are five kinds of tokens: identifiers, keywords, literals,%
+\begin{footnote}
+Literals include strings and character and numeric literals.
+\end{footnote}
+operators, and other separators.
+\indextext{whitespace}%
+Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments
+(collectively, ``whitespace''), as described below, are ignored except
+as they serve to separate tokens.
+\begin{note}
+Whitespace can separate otherwise adjacent identifiers, keywords, numeric
+literals, and alternative tokens containing alphabetic characters.
+\end{note}
+\indextext{token|)}
+
 \rSec1[lex.key]{Keywords}
 
 \begin{bnf}
@@ -1036,47 +1078,6 @@
 \indextext{keyword|)}%
 
 
-\rSec1[lex.operators]{Operators and punctuators}
-
-\pnum
-\indextext{operator|(}%
-\indextext{punctuator|(}%
-The lexical representation of \Cpp{} programs includes a number of
-preprocessing tokens that are used in the syntax of the preprocessor or
-are converted into tokens for operators and punctuators:
-
-\begin{bnf}
-\nontermdef{preprocessing-op-or-punc}\br
-    preprocessing-operator\br
-    operator-or-punctuator
-\end{bnf}
-
-\begin{bnf}
-%% Ed. note: character protrusion would misalign various operators.
-\microtypesetup{protrusion=false}\obeyspaces
-\nontermdef{preprocessing-operator} \textnormal{one of}\br
-    \terminal{\#        \#\#       \%:       \%:\%:}
-\end{bnf}
-
-\begin{bnf}
-\microtypesetup{protrusion=false}\obeyspaces
-\nontermdef{operator-or-punctuator} \textnormal{one of}\br
-    \terminal{\{        \}        [        ]        (        )}\br
-    \terminal{<:       :>       <\%       \%>       ;        :        ...}\br
-    \terminal{?        ::       .        .*       ->       ->*      \~}\br
-    \terminal{!        +        -        *        /        \%        \caret{}        \&        |}\br
-    \terminal{=        +=       -=       *=       /=       \%=       \caret{}=       \&=       |=}\br
-    \terminal{==       !=       <        >        <=       >=       <=>      \&\&       ||}\br
-    \terminal{<<       >>       <<=      >>=      ++       --       ,}\br
-    \terminal{\keyword{and}      \keyword{or}       \keyword{xor}      \keyword{not}      \keyword{bitand}   \keyword{bitor}    \keyword{compl}}\br
-    \terminal{\keyword{and_eq}   \keyword{or_eq}    \keyword{xor_eq}   \keyword{not_eq}}
-\end{bnf}
-
-Each \grammarterm{operator-or-punctuator} is converted to a single token
-in translation phase 7\iref{lex.phases}.%
-\indextext{punctuator|)}%
-\indextext{operator|)}
-
 \rSec1[lex.literal]{Literals}%
 \indextext{literal|(}