| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> |
| <html> |
| <!-- Copyright (C) 1987-2015 Free Software Foundation, Inc. |
| |
| Permission is granted to copy, distribute and/or modify this document |
| under the terms of the GNU Free Documentation License, Version 1.3 or |
| any later version published by the Free Software Foundation. A copy of |
| the license is included in the |
| section entitled "GNU Free Documentation License". |
| |
| This manual contains no Invariant Sections. The Front-Cover Texts are |
| (a) (see below), and the Back-Cover Texts are (b) (see below). |
| |
| (a) The FSF's Front-Cover Text is: |
| |
| A GNU Manual |
| |
| (b) The FSF's Back-Cover Text is: |
| |
| You have freedom to copy and modify this GNU Manual, like GNU |
| software. Copies published by the Free Software Foundation raise |
| funds for GNU development. --> |
| <!-- Created by GNU Texinfo 5.2, http://www.gnu.org/software/texinfo/ --> |
| <head> |
| <title>The C Preprocessor: Tokenization</title> |
| |
| <meta name="description" content="The C Preprocessor: Tokenization"> |
| <meta name="keywords" content="The C Preprocessor: Tokenization"> |
| <meta name="resource-type" content="document"> |
| <meta name="distribution" content="global"> |
| <meta name="Generator" content="makeinfo"> |
| <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| <link href="index.html#Top" rel="start" title="Top"> |
| <link href="Index-of-Directives.html#Index-of-Directives" rel="index" title="Index of Directives"> |
| <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents"> |
| <link href="Overview.html#Overview" rel="up" title="Overview"> |
| <link href="The-preprocessing-language.html#The-preprocessing-language" rel="next" title="The preprocessing language"> |
| <link href="Initial-processing.html#Initial-processing" rel="prev" title="Initial processing"> |
| <style type="text/css"> |
| <!-- |
| a.summary-letter {text-decoration: none} |
| blockquote.smallquotation {font-size: smaller} |
| div.display {margin-left: 3.2em} |
| div.example {margin-left: 3.2em} |
| div.indentedblock {margin-left: 3.2em} |
| div.lisp {margin-left: 3.2em} |
| div.smalldisplay {margin-left: 3.2em} |
| div.smallexample {margin-left: 3.2em} |
| div.smallindentedblock {margin-left: 3.2em; font-size: smaller} |
| div.smalllisp {margin-left: 3.2em} |
| kbd {font-style:oblique} |
| pre.display {font-family: inherit} |
| pre.format {font-family: inherit} |
| pre.menu-comment {font-family: serif} |
| pre.menu-preformatted {font-family: serif} |
| pre.smalldisplay {font-family: inherit; font-size: smaller} |
| pre.smallexample {font-size: smaller} |
| pre.smallformat {font-family: inherit; font-size: smaller} |
| pre.smalllisp {font-size: smaller} |
| span.nocodebreak {white-space:nowrap} |
| span.nolinebreak {white-space:nowrap} |
| span.roman {font-family:serif; font-weight:normal} |
| span.sansserif {font-family:sans-serif; font-weight:normal} |
| ul.no-bullet {list-style: none} |
| --> |
| </style> |
| |
| |
| </head> |
| |
| <body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000"> |
| <a name="Tokenization"></a> |
| <div class="header"> |
| <p> |
| Next: <a href="The-preprocessing-language.html#The-preprocessing-language" accesskey="n" rel="next">The preprocessing language</a>, Previous: <a href="Initial-processing.html#Initial-processing" accesskey="p" rel="prev">Initial processing</a>, Up: <a href="Overview.html#Overview" accesskey="u" rel="up">Overview</a> [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index-of-Directives.html#Index-of-Directives" title="Index" rel="index">Index</a>]</p> |
| </div> |
| <hr> |
| <a name="Tokenization-1"></a> |
| <h3 class="section">1.3 Tokenization</h3> |
| |
| <a name="index-tokens"></a> |
| <a name="index-preprocessing-tokens"></a> |
| <p>After the textual transformations are finished, the input file is |
| converted into a sequence of <em>preprocessing tokens</em>. These mostly |
| correspond to the syntactic tokens used by the C compiler, but there are |
| a few differences. White space separates tokens; it is not itself a |
| token of any kind. Tokens do not have to be separated by white space, |
| but white space is often needed to avoid ambiguities. |
| </p> |
| <p>When faced with a sequence of characters that has more than one possible |
| tokenization, the preprocessor is greedy. It always makes each token, |
| starting from the left, as big as possible before moving on to the next |
| token. For instance, <code>a+++++b</code> is interpreted as |
| <code>a ++ ++ + b<!-- /@w --></code>, not as <code>a ++ + ++ b<!-- /@w --></code>, even though the |
| latter tokenization could be part of a valid C program and the former |
| could not. |
| </p> |
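| <p>A minimal C sketch of the same greedy rule, using the shorter sequence |
| <code>a+++b</code>, which does compile. The function name and its parameters are |
| illustrative assumptions, not anything defined by the preprocessor. |
| </p> |

```c
/* Greedy tokenization: "a+++b" is split as  a ++ + b,  never as  a + ++ b. */
int post_increment_then_add(int a, int b, int *a_after)
{
    int sum = a+++b;   /* parsed as (a++) + b */
    *a_after = a;      /* a was incremented; b was not touched */
    return sum;
}
```

| <p>Called with <code>a</code> equal to 1 and <code>b</code> equal to 2, the function |
| returns 3 and stores 2 through <code>a_after</code>, which is what the |
| left-to-right, longest-match rule predicts. |
| </p> |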
| <p>Once the input file is broken into tokens, the token boundaries never |
| change, except when the ‘<samp>##</samp>’ preprocessing operator is used to paste |
| tokens together. See <a href="Concatenation.html#Concatenation">Concatenation</a>. Macro expansion alone never merges adjacent tokens. For example, |
| </p> |
| <div class="smallexample"> |
| <pre class="smallexample">#define foo() bar |
| foo()baz |
| → bar baz |
| <em>not</em> |
| → barbaz |
| </pre></div> |
| |
| <p>The compiler does not re-tokenize the preprocessor’s output. Each |
| preprocessing token becomes one compiler token. |
| </p> |
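| <p>The fixed token boundaries can be observed from C itself. The sketch below |
| assumes, purely for illustration, that <code>bar</code> is a typedef name, so that |
| the two tokens <code>bar baz</code> form a declaration. <code>PASTE</code>, |
| <code>barbaz</code>, and both function names are hypothetical. |
| </p> |

```c
typedef int bar;            /* assumption: bar names a type */
#define foo() bar
#define PASTE(a, b) a##b

static int barbaz = 1;      /* reachable only through an explicit paste */

int adjacent_tokens(void)
{
    foo()baz = 7;           /* expands to: bar baz = 7;  (two tokens) */
    return baz;
}

int pasted_token(void)
{
    return PASTE(bar, baz); /* ## forms the single token barbaz */
}
```

| <p>If expansion merged <code>bar</code> and <code>baz</code>, the first function would |
| assign to <code>barbaz</code> instead of declaring a new variable. Only the |
| explicit paste in the second function produces that identifier. |
| </p> |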
| <a name="index-identifiers"></a> |
| <p>Preprocessing tokens fall into five broad classes: identifiers, |
| preprocessing numbers, string literals, punctuators, and other. An |
| <em>identifier</em> is the same as an identifier in C: any sequence of |
| letters, digits, or underscores, which begins with a letter or |
| underscore. Keywords of C have no significance to the preprocessor; |
| they are ordinary identifiers. You can define a macro whose name is a |
| keyword, for instance. The only identifier which can be considered a |
| preprocessing keyword is <code>defined</code>. See <a href="Defined.html#Defined">Defined</a>. |
| </p> |
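| <p>As a hedged illustration of a macro whose name is a keyword, the sketch |
| below defines <code>restrict</code> away, as one might for a compiler that predates |
| C99. The scenario and the helper <code>copy_bytes</code> are assumptions. The |
| macro is defined only after the standard header has been included, and is |
| removed again afterwards. |
| </p> |

```c
#include <stddef.h>

/* Assumption for illustration: the compiler in use does not understand the
   C99 keyword `restrict`, so it is defined away as an empty macro. */
#define restrict

size_t copy_bytes(char *restrict dst, const char *restrict src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
    return n;
}

#undef restrict   /* keep the keyword intact for code that follows */
```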
| <p>The same is mostly true of other languages that use the C preprocessor. |
| However, a few of the keywords of C++ are significant even in the |
| preprocessor. See <a href="C_002b_002b-Named-Operators.html#C_002b_002b-Named-Operators">C++ Named Operators</a>. |
| </p> |
| <p>In the 1999 C standard, identifiers may contain letters which are not |
| part of the “basic source character set”, at the implementation’s |
| discretion (such as accented Latin letters, Greek letters, or Chinese |
| ideograms). This may be done with an extended character set, or the |
| ‘<samp>\u</samp>’ and ‘<samp>\U</samp>’ escape sequences. GCC only accepts such |
| characters in the ‘<samp>\u</samp>’ and ‘<samp>\U</samp>’ forms. |
| </p> |
| <p>As an extension, GCC treats ‘<samp>$</samp>’ as a letter. This is for |
| compatibility with some systems, such as VMS, where ‘<samp>$</samp>’ is commonly |
| used in system-defined function and object names. ‘<samp>$</samp>’ is not a |
| letter in strictly conforming mode, or if you specify the <samp>-$</samp> |
| option. See <a href="Invocation.html#Invocation">Invocation</a>. |
| </p> |
| <a name="index-numbers"></a> |
| <a name="index-preprocessing-numbers"></a> |
| <p>A <em>preprocessing number</em> has a rather bizarre definition. The |
| category includes all the normal integer and floating point constants |
| one expects of C, but also a number of other things one might not |
| initially recognize as a number. Formally, preprocessing numbers begin |
| with an optional period, a required decimal digit, and then continue |
| with any sequence of letters, digits, underscores, periods, and |
| exponents. Exponents are the two-character sequences ‘<samp>e+</samp>’, |
| ‘<samp>e-</samp>’, ‘<samp>E+</samp>’, ‘<samp>E-</samp>’, ‘<samp>p+</samp>’, ‘<samp>p-</samp>’, ‘<samp>P+</samp>’, and |
| ‘<samp>P-</samp>’. (The exponents that begin with ‘<samp>p</samp>’ or ‘<samp>P</samp>’ are new |
| to C99. They are used for hexadecimal floating-point constants.) |
| </p> |
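| <p>The formal rule can be sketched as a small scanner. The helper |
| <code>pp_number_length</code> is hypothetical and is not taken from GCC. It |
| returns the length of the longest prefix of its argument that forms a |
| preprocessing number, or zero if there is none. |
| </p> |

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the longest preprocessing-number prefix of s, or 0 if none. */
size_t pp_number_length(const char *s)
{
    size_t i = 0;

    if (s[i] == '.')                        /* optional leading period */
        i++;
    if (!isdigit((unsigned char)s[i]))      /* required decimal digit */
        return 0;
    i++;

    for (;;) {
        char c = s[i];
        if ((c == 'e' || c == 'E' || c == 'p' || c == 'P')
            && (s[i + 1] == '+' || s[i + 1] == '-'))
            i += 2;                         /* exponent: letter plus sign */
        else if (isalnum((unsigned char)c) || c == '_' || c == '.')
            i++;                            /* letter, digit, underscore, period */
        else
            break;
    }
    return i;
}
```

| <p>Under this sketch, <code>0xE+12</code> is consumed whole, while for |
| <code>1+2</code> only the leading <code>1</code> is consumed, because a sign is |
| accepted only directly after one of the exponent letters. |
| </p> |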
| <p>The purpose of this unusual definition is to isolate the preprocessor |
| from the full complexity of numeric constants. It does not have to |
| distinguish between lexically valid and invalid floating-point numbers, |
| which is complicated. The definition also permits you to split an |
| identifier at any position and get exactly two tokens, which can then be |
| pasted back together with the ‘<samp>##</samp>’ operator. |
| </p> |
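| <p>A short sketch of such a split, with hypothetical names. The identifier |
| <code>ab12cd</code> is divided after its second character. The right-hand piece, |
| <code>12cd</code>, begins with a digit and is therefore a single preprocessing |
| number, not an identifier. |
| </p> |

```c
#define PASTE(a, b) a##b

static int ab12cd = 42;

int read_pasted(void)
{
    /* "ab" is an identifier and "12cd" is a preprocessing number;
       pasting them yields the single identifier ab12cd. */
    return PASTE(ab, 12cd);
}
```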
| <p>It’s possible for preprocessing numbers to cause programs to be |
| misinterpreted. For example, <code>0xE+12</code> is a preprocessing number |
| which does not translate to any valid numeric constant and is therefore a |
| syntax error. It does not mean <code>0xE + 12<!-- /@w --></code>, which is what you |
| might have intended. |
| </p> |
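| <p>The usual remedy is to put white space around the operator. The sketch |
| below uses illustrative function names. It also shows that the problem is |
| specific to the exponent letters: <code>F+</code> is not an exponent, so |
| <code>0xF+12</code> needs no spaces. |
| </p> |

```c
int spaced_sum(void)
{
    return 0xE + 12;   /* three tokens: 0xE, +, 12 */
}

int unspaced_sum(void)
{
    return 0xF+12;     /* also three tokens: "F+" is not an exponent */
}
```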
| <a name="index-string-literals"></a> |
| <a name="index-string-constants"></a> |
| <a name="index-character-constants"></a> |
| <a name="index-header-file-names"></a> |
| <p><em>String literals</em> are string constants, character constants, and |
| header file names (the argument of ‘<samp>#include</samp>’).<a name="DOCF2" href="#FOOT2"><sup>2</sup></a> String constants and character |
| constants are straightforward: <tt>"…"</tt> or <tt>'…'</tt>. In |
| either case embedded quotes should be escaped with a backslash: |
| <tt>'\''</tt> is the character constant for ‘<samp>'</samp>’. There is no limit on |
| the length of a character constant, but the value of a character |
| constant that contains more than one character is |
| implementation-defined. See <a href="Implementation-Details.html#Implementation-Details">Implementation Details</a>. |
| </p> |
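| <p>A brief sketch of escaped quotes in both kinds of constant. The function |
| names are illustrative only. |
| </p> |

```c
#include <string.h>

char single_quote(void)
{
    return '\'';                    /* the character ' */
}

size_t quoted_length(void)
{
    return strlen("say \"hi\"");    /* s a y space " h i " */
}
```

| <p>Each escaped quote counts as one character, so the string has length 8. |
| On an ASCII system the character constant has the value 39. |
| </p> |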
| <p>Header file names either look like string constants, <tt>"…"</tt>, or are |
| written with angle brackets instead, <tt><…></tt>. In either case, |
| backslash is an ordinary character. There is no way to escape the |
| closing quote or angle bracket. The preprocessor looks for the header |
| file in different places depending on which form you use. See <a href="Include-Operation.html#Include-Operation">Include Operation</a>. |
| </p> |
| <p>No string literal may extend past the end of a line. Older versions |
| of GCC accepted multi-line string constants. You may use continued |
| lines instead, or string constant concatenation. See <a href="Differences-from-previous-versions.html#Differences-from-previous-versions">Differences from previous versions</a>. |
| </p> |
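| <p>Both alternatives can be sketched as follows. The variable and function |
| names are hypothetical. With a continued line, the text after the |
| backslash-newline has to start in the first column, since any indentation |
| would become part of the string. |
| </p> |

```c
#include <string.h>

/* String constant concatenation: adjacent literals are joined. */
static const char concatenated[] = "first line\n"
                                   "second line\n";

/* Continued line: the backslash-newline is deleted before tokenization. */
static const char continued[] = "first line\n\
second line\n";

int same_text(void)
{
    return strcmp(concatenated, continued) == 0;
}
```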
| <a name="index-punctuators"></a> |
| <a name="index-digraphs"></a> |
| <a name="index-alternative-tokens"></a> |
| <p><em>Punctuators</em> are all the usual bits of punctuation which are |
| meaningful to C and C++. All but three of the punctuation characters in |
| ASCII are C punctuators. The exceptions are ‘<samp>@</samp>’, ‘<samp>$</samp>’, and |
| ‘<samp>`</samp>’. In addition, all the two- and three-character operators are |
| punctuators. There are also six <em>digraphs</em>, which the C++ standard |
| calls <em>alternative tokens</em>; they are merely alternate ways to spell |
| other punctuators. Digraphs are a second attempt, after trigraphs, to work |
| around missing punctuation in obsolete systems. They have no negative side |
| effects, unlike trigraphs, but do not cover as much ground. The digraphs and |
| their corresponding normal punctuators are: |
| </p> |
| <div class="smallexample"> |
| <pre class="smallexample">Digraph:        <%  %>  <:  :>  %:  %:%: |
| Punctuator:      {   }   [   ]   #    ## |
| </pre></div> |
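| <p>A small sketch written with digraphs, assuming a compiler mode in which |
| they are enabled (they were added to C by the 1994 amendment). The macro |
| <code>COUNT</code> and the function name are illustrative. |
| </p> |

```c
%:define COUNT 3

int sum_first_three(void)
<%
    int v<:COUNT:> = <% 1, 2, 3 %>;
    return v<:0:> + v<:1:> + v<:2:>;
%>
```

| <p>Each digraph is simply another spelling of its punctuator, so the function |
| behaves exactly as if it had been written with braces and brackets. |
| </p> |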
| |
| <a name="index-other-tokens"></a> |
| <p>Any other single character is considered “other”. It is passed on to |
| the preprocessor’s output unchanged. The C compiler will almost |
| certainly reject source code containing “other” tokens. In ASCII, the |
| only other characters are ‘<samp>@</samp>’, ‘<samp>$</samp>’, ‘<samp>`</samp>’, and control |
| characters other than NUL (all bits zero). (Note that ‘<samp>$</samp>’ is |
| normally considered a letter.) All characters with the high bit set |
| (numeric range 0x80–0xFF) are also “other” in the present |
| implementation. This will change when proper support for international |
| character sets is added to GCC. |
| </p> |
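| <p>One way to observe an “other” token without handing it to the compiler |
| proper is to stringify it. This is a sketch that assumes GCC's behavior; |
| <code>STR</code> and the function name are hypothetical. |
| </p> |

```c
#include <string.h>

#define STR(x) #x

int at_sign_survives(void)
{
    /* '@' is an "other" token; stringifying it turns it into a string
       constant, so the compiler never sees a bare '@'. */
    return strcmp(STR(@), "@") == 0;
}
```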
| <p>NUL is a special case because of the high probability that its |
| appearance is accidental, and because it may be invisible to the user |
| (many terminals do not display NUL at all). Within comments, NULs are |
| silently ignored, just as any other character would be. In running |
| text, NUL is considered white space. For example, these two directives |
| have the same meaning. |
| </p> |
| <div class="smallexample"> |
| <pre class="smallexample">#define X^@1 |
| #define X 1 |
| </pre></div> |
| |
| <p>(where ‘<samp>^@</samp>’ is ASCII NUL). Within string or character constants, |
| NULs are preserved. The preprocessor emits a warning for NULs in running |
| text and in string or character constants, but not for NULs in comments. |
| </p> |
| <div class="footnote"> |
| <hr> |
| <h4 class="footnotes-heading">Footnotes</h4> |
| |
| <h3><a name="FOOT2" href="#DOCF2">(2)</a></h3> |
| <p>The C |
| standard uses the term <em>string literal</em> to refer only to what we are |
| calling <em>string constants</em>.</p> |
| </div> |
| |
| |
| |
| </body> |
| </html> |