| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <!-- Created by GNU Texinfo 5.2, http://www.gnu.org/software/texinfo/ --> |
| <head> |
| <title>The GNU C Preprocessor Internals: Lexer</title> |
| |
| <meta name="description" content="The GNU C Preprocessor Internals: Lexer"> |
| <meta name="keywords" content="The GNU C Preprocessor Internals: Lexer"> |
| <meta name="resource-type" content="document"> |
| <meta name="distribution" content="global"> |
| <meta name="Generator" content="makeinfo"> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <link href="index.html#Top" rel="start" title="Top"> |
| <link href="Concept-Index.html#Concept-Index" rel="index" title="Concept Index"> |
| <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents"> |
| <link href="index.html#Top" rel="up" title="Top"> |
| <link href="Hash-Nodes.html#Hash-Nodes" rel="next" title="Hash Nodes"> |
| <link href="Conventions.html#Conventions" rel="prev" title="Conventions"> |
| <style type="text/css"> |
| <!-- |
| a.summary-letter {text-decoration: none} |
| blockquote.smallquotation {font-size: smaller} |
| div.display {margin-left: 3.2em} |
| div.example {margin-left: 3.2em} |
| div.indentedblock {margin-left: 3.2em} |
| div.lisp {margin-left: 3.2em} |
| div.smalldisplay {margin-left: 3.2em} |
| div.smallexample {margin-left: 3.2em} |
| div.smallindentedblock {margin-left: 3.2em; font-size: smaller} |
| div.smalllisp {margin-left: 3.2em} |
| kbd {font-style:oblique} |
| pre.display {font-family: inherit} |
| pre.format {font-family: inherit} |
| pre.menu-comment {font-family: serif} |
| pre.menu-preformatted {font-family: serif} |
| pre.smalldisplay {font-family: inherit; font-size: smaller} |
| pre.smallexample {font-size: smaller} |
| pre.smallformat {font-family: inherit; font-size: smaller} |
| pre.smalllisp {font-size: smaller} |
| span.nocodebreak {white-space:nowrap} |
| span.nolinebreak {white-space:nowrap} |
| span.roman {font-family:serif; font-weight:normal} |
| span.sansserif {font-family:sans-serif; font-weight:normal} |
| ul.no-bullet {list-style: none} |
| --> |
| </style> |
| |
| |
| </head> |
| |
| <body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000"> |
| <a name="Lexer"></a> |
| <div class="header"> |
| <p> |
| Next: <a href="Hash-Nodes.html#Hash-Nodes" accesskey="n" rel="next">Hash Nodes</a>, Previous: <a href="Conventions.html#Conventions" accesskey="p" rel="prev">Conventions</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p> |
| </div> |
| <hr> |
| <a name="The-Lexer"></a> |
| <h2 class="unnumbered">The Lexer</h2> |
| <a name="index-lexer"></a> |
| <a name="index-newlines"></a> |
| <a name="index-escaped-newlines"></a> |
| |
| <a name="Overview"></a> |
| <h3 class="section">Overview</h3> |
| <p>The lexer is contained in the file <samp>lex.c</samp>. It is a hand-coded |
| lexer, and not implemented as a state machine. It can understand C, C++ |
| and Objective-C source code, and has been extended to allow reasonably |
| successful preprocessing of assembly language. The lexer does not make |
| an initial pass to strip out trigraphs and escaped newlines, but handles |
| them as they are encountered in a single pass of the input file. It |
| returns preprocessing tokens individually, not a line at a time. |
| </p> |
| <p>It is mostly transparent to users of the library, since the library’s |
| interface for obtaining the next token, <code>cpp_get_token</code>, takes care |
| of lexing new tokens, handling directives, and expanding macros as |
| necessary. However, the lexer does expose some functionality so that |
| clients of the library can easily spell a given token, such as |
| <code>cpp_spell_token</code> and <code>cpp_token_len</code>. These functions are |
| useful when generating diagnostics, and for emitting the preprocessed |
| output. |
| </p> |
| <a name="Lexing-a-token"></a> |
| <h3 class="section">Lexing a token</h3> |
| <p>Lexing of an individual token is handled by <code>_cpp_lex_direct</code> and |
| its subroutines. In its current form the code is quite complicated, |
| with read ahead characters and such-like, since it strives to not step |
| back in the character stream in preparation for handling non-ASCII file |
| encodings. The current plan is to convert any such files to UTF-8 |
| before processing them. This complexity is therefore unnecessary and |
| will be removed, so I’ll not discuss it further here. |
| </p> |
| <p>The job of <code>_cpp_lex_direct</code> is simply to lex a token. It is not |
| responsible for issues like directive handling, returning lookahead |
| tokens directly, multiple-include optimization, or conditional block |
| skipping. It necessarily has a minor rôle to play in memory |
| management of lexed lines. I discuss these issues in a separate section |
| (see <a href="#Lexing-a-line">Lexing a line</a>). |
| </p> |
| <p>The lexer places the token it lexes into storage pointed to by the |
| variable <code>cur_token</code>, and then increments it. This variable is |
| important for correct diagnostic positioning. Unless a specific line |
| and column are passed to the diagnostic routines, they will examine the |
| <code>line</code> and <code>col</code> values of the token just before the location |
| that <code>cur_token</code> points to, and use that location to report the |
| diagnostic. |
| </p> |
| <p>The lexer does not consider whitespace to be a token in its own right. |
| If whitespace (other than a new line) precedes a token, it sets the |
| <code>PREV_WHITE</code> bit in the token’s flags. Each token has its |
| <code>line</code> and <code>col</code> variables set to the line and column of the |
| first character of the token. This line number is the line number in |
| the translation unit, and can be converted to a source (file, line) pair |
| using the line map code. |
| </p> |
| <p>The first token on a logical, i.e. unescaped, line has the flag |
| <code>BOL</code> set for beginning-of-line. This flag is intended for |
| internal use, both to distinguish a ‘<samp>#</samp>’ that begins a directive |
| from one that doesn’t, and to generate a call-back to clients that want |
| to be notified about the start of every non-directive line with tokens |
| on it. Clients cannot reliably determine this for themselves: the first |
| token might be a macro, and the tokens of a macro expansion do not have |
| the <code>BOL</code> flag set. The macro expansion may even be empty, and the |
| next token on the line certainly won’t have the <code>BOL</code> flag set. |
| </p> |
| <p>New lines are treated specially; exactly how the lexer handles them is |
| context-dependent. The C standard mandates that directives are |
| terminated by the first unescaped newline character, even if it appears |
| in the middle of a macro expansion. Therefore, if the state variable |
| <code>in_directive</code> is set, the lexer returns a <code>CPP_EOF</code> token, |
| which is normally used to indicate end-of-file, to indicate |
| end-of-directive. In a directive a <code>CPP_EOF</code> token never means |
| end-of-file. Conveniently, if the caller was <code>collect_args</code>, it |
| already handles <code>CPP_EOF</code> as if it were end-of-file, and reports an |
| error about an unterminated macro argument list. |
| </p> |
| <p>The C standard also specifies that a new line in the middle of the |
| arguments to a macro is treated as whitespace. This white space is |
| important in case the macro argument is stringified. The state variable |
| <code>parsing_args</code> is nonzero when the preprocessor is collecting the |
| arguments to a macro call. It is set to 1 when looking for the opening |
| parenthesis to a function-like macro, and 2 when collecting the actual |
| arguments up to the closing parenthesis, since these two cases need to |
| be distinguished sometimes. One such time is here: the lexer sets the |
| <code>PREV_WHITE</code> flag of a token if it meets a new line when |
| <code>parsing_args</code> is set to 2. It doesn’t set it if it meets a new |
| line when <code>parsing_args</code> is 1, since then code like |
| </p> |
| <div class="smallexample"> |
| <pre class="smallexample">#define foo() bar |
| foo |
| baz |
| </pre></div> |
| |
| <p>would be output with an erroneous space before ‘<samp>baz</samp>’: |
| </p> |
| <div class="smallexample"> |
| <pre class="smallexample">foo |
| baz |
| </pre></div> |
| |
| <p>This is a good example of the subtlety of getting token spacing correct |
| in the preprocessor; there are plenty of tests in the testsuite for |
| corner cases like this. |
| </p> |
| <p>The lexer is written to treat each of ‘<samp>\r</samp>’, ‘<samp>\n</samp>’, ‘<samp>\r\n</samp>’ |
| and ‘<samp>\n\r</samp>’ as a single new line indicator. This allows it to |
| transparently preprocess MS-DOS, Macintosh and Unix files without their |
| needing to pass through a special filter beforehand. |
| </p> |
| <p>We also decided to treat a backslash, either ‘<samp>\</samp>’ or the trigraph |
| ‘<samp>??/</samp>’, separated from one of the above newline indicators by |
| non-comment whitespace only, as intending to escape the newline. It |
| tends to be a typing mistake, and cannot reasonably be mistaken for |
| anything else in any of the C-family grammars. Since handling it this |
| way is not strictly conforming to the ISO standard, the library issues a |
| warning wherever it encounters it. |
| </p> |
| <p>Handling newlines like this is made simpler by doing it in one place |
| only. The function <code>handle_newline</code> takes care of all newline |
| characters, and <code>skip_escaped_newlines</code> takes care of arbitrarily |
| long sequences of escaped newlines, deferring to <code>handle_newline</code> |
| to handle the newlines themselves. |
| </p> |
| <p>The most painful aspect of lexing ISO-standard C and C++ is handling |
| trigraphs and backlash-escaped newlines. Trigraphs are processed before |
| any interpretation of the meaning of a character is made, and unfortunately |
| there is a trigraph representation for a backslash, so it is possible for |
| the trigraph ‘<samp>??/</samp>’ to introduce an escaped newline. |
| </p> |
| <p>Escaped newlines are tedious because theoretically they can occur |
| anywhere—between the ‘<samp>+</samp>’ and ‘<samp>=</samp>’ of the ‘<samp>+=</samp>’ token, |
| within the characters of an identifier, and even between the ‘<samp>*</samp>’ |
| and ‘<samp>/</samp>’ that terminates a comment. Moreover, you cannot be sure |
| there is just one—there might be an arbitrarily long sequence of them. |
| </p> |
| <p>So, for example, the routine that lexes a number, <code>parse_number</code>, |
| cannot assume that it can scan forwards until the first non-number |
| character and be done with it, because this could be the ‘<samp>\</samp>’ |
| introducing an escaped newline, or the ‘<samp>?</samp>’ introducing the trigraph |
| sequence that represents the ‘<samp>\</samp>’ of an escaped newline. If it |
| encounters a ‘<samp>?</samp>’ or ‘<samp>\</samp>’, it calls <code>skip_escaped_newlines</code> |
| to skip over any potential escaped newlines before checking whether the |
| number has been finished. |
| </p> |
| <p>Similarly code in the main body of <code>_cpp_lex_direct</code> cannot simply |
| check for a ‘<samp>=</samp>’ after a ‘<samp>+</samp>’ character to determine whether it |
| has a ‘<samp>+=</samp>’ token; it needs to be prepared for an escaped newline of |
| some sort. Such cases use the function <code>get_effective_char</code>, which |
| returns the first character after any intervening escaped newlines. |
| </p> |
| <p>The lexer needs to keep track of the correct column position, including |
| counting tabs as specified by the <samp>-ftabstop=</samp> option. This |
| should be done even within C-style comments; they can appear in the |
| middle of a line, and we want to report diagnostics in the correct |
| position for text appearing after the end of the comment. |
| </p> |
| <a name="Invalid-identifiers"></a><p>Some identifiers, such as <code>__VA_ARGS__</code> and poisoned identifiers, |
| may be invalid and require a diagnostic. However, if they appear in a |
| macro expansion we don’t want to complain with each use of the macro. |
| It is therefore best to catch them during the lexing stage, in |
| <code>parse_identifier</code>. In both cases, whether a diagnostic is needed |
| or not is dependent upon the lexer’s state. For example, we don’t want |
| to issue a diagnostic for re-poisoning a poisoned identifier, or for |
| using <code>__VA_ARGS__</code> in the expansion of a variable-argument macro. |
| Therefore <code>parse_identifier</code> makes use of state flags to determine |
| whether a diagnostic is appropriate. Since we change state on a |
| per-token basis, and don’t lex whole lines at a time, this is not a |
| problem. |
| </p> |
| <p>Another place where state flags are used to change behavior is whilst |
| lexing header names. Normally, a ‘<samp><</samp>’ would be lexed as a single |
| token. After a <code>#include</code> directive, though, it should be lexed as |
| a single token as far as the nearest ‘<samp>></samp>’ character. Note that we |
| don’t allow the terminators of header names to be escaped; the first |
| ‘<samp>"</samp>’ or ‘<samp>></samp>’ terminates the header name. |
| </p> |
| <p>Interpretation of some character sequences depends upon whether we are |
| lexing C, C++ or Objective-C, and on the revision of the standard in |
| force. For example, ‘<samp>::</samp>’ is a single token in C++, but in C it is |
| two separate ‘<samp>:</samp>’ tokens and almost certainly a syntax error. Such |
| cases are handled by <code>_cpp_lex_direct</code> based upon command-line |
| flags stored in the <code>cpp_options</code> structure. |
| </p> |
| <p>Once a token has been lexed, it leads an independent existence. The |
| spelling of numbers, identifiers and strings is copied to permanent |
| storage from the original input buffer, so a token remains valid and |
| correct even if its source buffer is freed with <code>_cpp_pop_buffer</code>. |
| The storage holding the spellings of such tokens remains until the |
| client program calls cpp_destroy, probably at the end of the translation |
| unit. |
| </p> |
| <a name="Lexing-a-line"></a><a name="Lexing-a-line-1"></a> |
| <h3 class="section">Lexing a line</h3> |
| <a name="index-token-run"></a> |
| |
| <p>When the preprocessor was changed to return pointers to tokens, one |
| feature I wanted was some sort of guarantee regarding how long a |
| returned pointer remains valid. This is important to the stand-alone |
| preprocessor, the future direction of the C family front ends, and even |
| to cpplib itself internally. |
| </p> |
| <p>Occasionally the preprocessor wants to be able to peek ahead in the |
| token stream. For example, after the name of a function-like macro, it |
| wants to check the next token to see if it is an opening parenthesis. |
| Another example is that, after reading the first few tokens of a |
| <code>#pragma</code> directive and not recognizing it as a registered pragma, |
| it wants to backtrack and allow the user-defined handler for unknown |
| pragmas to access the full <code>#pragma</code> token stream. The stand-alone |
| preprocessor wants to be able to test the current token with the |
| previous one to see if a space needs to be inserted to preserve their |
| separate tokenization upon re-lexing (paste avoidance), so it needs to |
| be sure the pointer to the previous token is still valid. The |
| recursive-descent C++ parser wants to be able to perform tentative |
| parsing arbitrarily far ahead in the token stream, and then to be able |
| to jump back to a prior position in that stream if necessary. |
| </p> |
| <p>The rule I chose, which is fairly natural, is to arrange that the |
| preprocessor lex all tokens on a line consecutively into a token buffer, |
| which I call a <em>token run</em>, and when meeting an unescaped new line |
| (newlines within comments do not count either), to start lexing back at |
| the beginning of the run. Note that we do <em>not</em> lex a line of |
| tokens at once; if we did that <code>parse_identifier</code> would not have |
| state flags available to warn about invalid identifiers (see <a href="#Invalid-identifiers">Invalid identifiers</a>). |
| </p> |
| <p>In other words, accessing tokens that appeared earlier in the current |
| line is valid, but since each logical line overwrites the tokens of the |
| previous line, tokens from prior lines are unavailable. In particular, |
| since a directive only occupies a single logical line, this means that |
| the directive handlers like the <code>#pragma</code> handler can jump around |
| in the directive’s tokens if necessary. |
| </p> |
| <p>Two issues remain: what about tokens that arise from macro expansions, |
| and what happens when we have a long line that overflows the token run? |
| </p> |
| <p>Since we promise clients that we preserve the validity of pointers that |
| we have already returned for tokens that appeared earlier in the line, |
| we cannot reallocate the run. Instead, on overflow it is expanded by |
| chaining a new token run on to the end of the existing one. |
| </p> |
| <p>The tokens forming a macro’s replacement list are collected by the |
| <code>#define</code> handler, and placed in storage that is only freed by |
| <code>cpp_destroy</code>. So if a macro is expanded in the line of tokens, |
| the pointers to the tokens of its expansion that are returned will always |
| remain valid. However, macros are a little trickier than that, since |
| they give rise to three sources of fresh tokens. They are the built-in |
| macros like <code>__LINE__</code>, and the ‘<samp>#</samp>’ and ‘<samp>##</samp>’ operators |
| for stringification and token pasting. I handled this by allocating |
| space for these tokens from the lexer’s token run chain. This means |
| they automatically receive the same lifetime guarantees as lexed tokens, |
| and we don’t need to concern ourselves with freeing them. |
| </p> |
| <p>Lexing into a line of tokens solves some of the token memory management |
| issues, but not all. The opening parenthesis after a function-like |
| macro name might lie on a different line, and the front ends definitely |
| want the ability to look ahead past the end of the current line. So |
| cpplib only moves back to the start of the token run at the end of a |
| line if the variable <code>keep_tokens</code> is zero. Line-buffering is |
| quite natural for the preprocessor, and as a result the only time cpplib |
| needs to increment this variable is whilst looking for the opening |
| parenthesis to, and reading the arguments of, a function-like macro. In |
| the near future cpplib will export an interface to increment and |
| decrement this variable, so that clients can share full control over the |
| lifetime of token pointers too. |
| </p> |
| <p>The routine <code>_cpp_lex_token</code> handles moving to new token runs, |
| calling <code>_cpp_lex_direct</code> to lex new tokens, or returning |
| previously-lexed tokens if we stepped back in the token stream. It also |
| checks each token for the <code>BOL</code> flag, which might indicate a |
| directive that needs to be handled, or require a start-of-line call-back |
| to be made. <code>_cpp_lex_token</code> also handles skipping over tokens in |
| failed conditional blocks, and invalidates the control macro of the |
| multiple-include optimization if a token was successfully lexed outside |
| a directive. In other words, its callers do not need to concern |
| themselves with such issues. |
| </p> |
| <hr> |
| <div class="header"> |
| <p> |
| Next: <a href="Hash-Nodes.html#Hash-Nodes" accesskey="n" rel="next">Hash Nodes</a>, Previous: <a href="Conventions.html#Conventions" accesskey="p" rel="prev">Conventions</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p> |
| </div> |
| |
| |
| |
| </body> |
| </html> |