Wednesday, October 14, 2009

Spaces in TeX

Spaces appear all over .tex files but only some of them appear as actual spaces in the output. To understand this, we first need to understand how TeX reads lines of input.

When reading input, TeX is in one of three states: state N, when TeX is at the beginning of a new line; state M, when TeX is in the middle of a line; and state S, when TeX is skipping spaces. TeX discards space characters it sees in any state except M. TeX starts in state N and, on the first nonspace character, transitions into state M. (Actually, it's slightly more complicated, but for the purposes of this post, just consider tabs as spaces.) While in state M, each character that is read is turned into a token, except that a control sequence is turned into a single token (again, it's more complicated than that, but this will suffice). When a space is encountered in state M, a space token is created and TeX enters state S. Again, a nonspace character brings TeX back into state M. As an example, consider the line of input:
Hello      \TeX!
TeX begins in state N and then upon reading the H transitions into state M and produces an H token. Then e, l, l, and o tokens are produced in turn.

Upon reading the first space, TeX produces a space token and then enters state S. The rest of the spaces up to the \ are ignored. Once TeX reads the \, it scans the rest of the control sequence and produces a single \TeX token. Finally, TeX produces a ! token.

I haven't said what state TeX enters when it scans a control sequence. The answer depends on what type of control sequence it is. If the first character after the \ is not a letter, for example if it's a symbol like @ or #, then TeX produces a token consisting of the control symbol (for example, the token \@ or \#). In this case, TeX enters (or remains in) state M. If instead the first character after the \ is a letter, then TeX reads a control word consisting of the \ and all following letters. TeX then enters state S. This explains why TeX ignores spaces after control words like \TeX or \bf. So in the example above, since \TeX is a control word, TeX enters state S after reading it and then immediately enters state M when it reads the !.
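To make the control word versus control symbol rule concrete, here is a minimal sketch in Python. It is an illustration of the rule as described in this post, not TeX's actual implementation. The names is_letter and scan_control_sequence are made up for this example, and the sketch assumes the default setup in which only the ASCII letters A-Z and a-z count as letters.

```python
def is_letter(ch):
    # Assumption: default TeX setup, where only A-Z and a-z count as letters.
    return ch.isascii() and ch.isalpha()


def scan_control_sequence(line, i):
    """Scan the control sequence whose backslash sits at line[i].

    Returns (token, next_index, next_state), where next_state is the
    state ("M" or "S") the reader is in after the control sequence.
    """
    if line[i] != "\\":
        raise ValueError("expected a backslash at position %d" % i)
    j = i + 1
    if j < len(line) and is_letter(line[j]):
        # Control word: the backslash plus all following letters.
        while j < len(line) and is_letter(line[j]):
            j += 1
        return line[i:j], j, "S"  # spaces after a control word get skipped
    # Control symbol: the backslash plus exactly one non-letter character.
    token = line[i:j + 1]
    # The one exception (covered further down in this post): a backslash
    # followed by a space leaves the reader in state S rather than M.
    state = "S" if token == "\\ " else "M"
    return token, min(j + 1, len(line)), state
```

With this sketch, scan_control_sequence(r"\TeX!", 0) returns the token \TeX, the index of the !, and state S, while scan_control_sequence(r"\#1", 0) returns the token \# and state M.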

Before we can move on, there are two points I skipped. Before TeX starts processing a line of input, it deletes all space characters at the right end of the line and appends a carriage return character, which, by default, is the end of line character. So to conclude the discussion of a single line, we need to know what happens with comment characters and end of line characters. For a comment character, everything on the rest of the input line is thrown away and TeX starts on the next line of input in state N. For an end of line character, TeX throws away all remaining input on the line (just like a comment) and then does one of three things. If TeX is in state N, it produces a \par token. If TeX is in state M, it produces a space token. If TeX is in state S, it ignores the end of line character.
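The line preparation and the three-way end of line rule are small enough to write down directly. The following Python sketch is only an illustration. The names prepare_line and end_of_line_token are hypothetical, the input is assumed to be a single line with its newline already removed, and tokens are represented as plain strings (with None meaning that no token is produced).

```python
def prepare_line(raw):
    """Drop trailing spaces, then append the end of line character.

    Assumption: the end of line character is a carriage return,
    which this post describes as the default.
    """
    return raw.rstrip(" ") + "\r"


def end_of_line_token(state):
    """Token produced when the end of line character is read in `state`."""
    if state == "N":
        return "\\par"  # nothing but spaces on the line: new paragraph
    if state == "M":
        return " "      # middle of a line: the line end acts as a space
    if state == "S":
        return None     # already skipping spaces: the line end is ignored
    raise ValueError("unknown state: %r" % (state,))
```

For example, prepare_line("Hello   ") gives "Hello" followed by a carriage return, so the trailing spaces never get a chance to produce a space token; the space at the end of that line comes from end_of_line_token("M") instead.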

Let's consider the implications of how the end of line character is handled. In state N, it produces a \par token, which is why a blank line in your TeX source gives you a new paragraph. In state M, it produces a space token, which is why we can sprinkle newlines (almost) anywhere we like in our source and get spaces. If spaces are being skipped, for example after a control word (but not a control symbol!), then the end of line does nothing. [Okay, one final lie above: after a control symbol consisting of \ and a space, TeX enters state S. This is so that \ followed by two spaces does not produce two space tokens.] To summarize, when TeX reads a line of input, it
1. removes trailing spaces and adds a carriage return,
2. enters state N,
3. reads characters, creating tokens and changing states as described above, until it
4. reaches the end of line character, which is turned into a \par token, turned into a space token, or ignored, depending on the current state.
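The four steps above can be strung together into a small simulation. This is a rough Python sketch under the simplifications used in this post, not a faithful model of TeX. The function name read_lines is made up, tokens are plain strings, % is assumed to be the only comment character, only A-Z and a-z count as letters, and every other character (braces, dollar signs, and so on) is treated as an ordinary one-character token.

```python
def is_letter(ch):
    # Assumption: default TeX setup, where only A-Z and a-z count as letters.
    return ch.isascii() and ch.isalpha()


def read_lines(lines):
    """Turn input lines (without their newlines) into a flat list of tokens."""
    tokens = []
    for raw in lines:
        line = raw.rstrip(" ") + "\r"   # step 1: trim, add carriage return
        state = "N"                     # step 2: every line starts in state N
        i = 0
        while i < len(line):            # step 3: read characters
            ch = line[i]
            if ch == "\r":              # step 4: end of line character
                if state == "N":
                    tokens.append("\\par")
                elif state == "M":
                    tokens.append(" ")
                break                   # in state S the line end is ignored
            if ch == "%":
                break                   # comment: drop the rest of the line
            if ch in " \t":             # tabs are treated as spaces here
                if state == "M":
                    tokens.append(" ")
                    state = "S"
                i += 1                  # in states N and S the space is dropped
                continue
            if ch == "\\":
                j = i + 1               # line[j] exists: the line ends in "\r"
                if is_letter(line[j]):
                    while is_letter(line[j]):
                        j += 1
                    state = "S"         # control word
                else:
                    # Control symbol; backslash-space is the one that skips.
                    state = "S" if line[j] in " \t" else "M"
                    j += 1
                tokens.append(line[i:j])
                i = j
                continue
            tokens.append(ch)           # ordinary character
            state = "M"
            i += 1
    return tokens
```

Feeding it the three lines `Hello      \TeX!`, an empty line, and `\TeX` yields the tokens H, e, l, l, o, one space, \TeX, !, a space (from the line end in state M), \par (from the blank line), and \TeX, with nothing produced for the final line end because the reader is in state S after the control word.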

This is not the end of the story, as there are situations where TeX ignores space tokens and \par tokens, a.k.a. "why don't I get more blank lines when I enter more blank lines in my source?" However, this post is long enough, so I'll put that off for now and discuss modes next time.