LET'S BUILD A COMPILER!
By
Jack W. Crenshaw, Ph.D.
5 March 1994
Part 15: BACK TO THE FUTURE
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1994 Jack W. Crenshaw. All rights reserved. *
* *
*****************************************************************
INTRODUCTION
Can it really have been four years since I wrote installment
fourteen of this series? Is it really possible that six long
years have passed since I began it? Funny how time flies when
you're having fun, isn't it?
I won't spend a lot of time making excuses; only point out that
things happen, and priorities change. In the four years since
installment fourteen, I've managed to get laid off, get divorced,
have a nervous breakdown, begin a new career as a writer, begin
another one as a consultant, move, work on two real-time systems,
and raise fourteen baby birds, three pigeons, six possums, and a
duck. For awhile there, the parsing of source code was not high
on my list of priorities. Neither was writing stuff for free,
instead of writing stuff for pay. But I do try to be faithful,
and I do recognize and feel my responsibility to you, the reader,
to finish what I've started. As the tortoise said in one of my
son's old stories, I may be slow, but I'm sure. I'm sure that
there are people out there anxious to see the last reel of this
film, and I intend to give it to them. So, if you're one of those
who's been waiting, more or less patiently, to see how this thing
comes out, thanks for your patience. I apologize for the delay.
Let's move on.
NEW STARTS, OLD DIRECTIONS
Like many other things, programming languages and programming
styles change with time. In 1994, it seems a little anachronistic
to be programming in Turbo Pascal, when the rest of the world
seems to have gone bananas over C++. It also seems a little
strange to be programming in a classical style when the rest of
the world has switched to object-oriented methods. Still, in
spite of the four-year hiatus, it would be entirely too wrenching
a change, at this point, to switch to, say, C++ with object-
orientation . Anyway, Pascal is still not only a powerful
programming language (more than ever, in fact), but it's a
wonderful medium for teaching. C is a notoriously difficult
language to read ... it's often been accused, along with Forth, of
being a "write-only language." When I program in C++, I find
myself spending at least 50% of my time struggling with language
syntax rather than with concepts. A stray "&" or "*" can not only
change the functioning of the program, but its correctness as
well. By contrast, Pascal code is usually quite transparent and
easy to read, even if you don't know the language. What you see is
almost always what you get, and we can concentrate on concepts
rather than implementation details. I've said from the beginning
that the purpose of this tutorial series was not to generate the
world's fastest compiler, but to teach the fundamentals of
compiler technology, while spending the least amount of time
wrestling with language syntax or other aspects of software
implementation. Finally, since a lot of what we do in this course
amounts to software experimentation, it's important to have a
compiler and associated environment that compiles quickly and with
no fuss. In my opinion, by far the most significant time measure
in software development is the speed of the edit/compile/test
cycle. In this department, Turbo Pascal is king. The compilation
speed is blazing fast, and continues to get faster in every
release (how do they keep doing that?). Despite vast improvements
in C compilation speed over the years, even Borland's fastest
C/C++ compiler is still no match for Turbo Pascal. Further, the
editor built into their IDE, the make facility, and even their
superb smart linker, all complement each other to produce a
wonderful environment for quick turnaround. For all of these
reasons, I intend to stick with Pascal for the duration of this
series. We'll be using Turbo Pascal for Windows, one of the
compilers provided Borland Pascal with Objects, version 7.0. If
you don't have this compiler, don't worry ... nothing we do here
is going to count on your having the latest version. Using the
Windows version helps me a lot, by allowing me to use the
Clipboard to copy code from the compiler's editor into these
documents. It should also help you at least as much, copying the
code in the other direction.
I've thought long and hard about whether or not to introduce
objects to our discussion. I'm a big advocate of object-oriented
methods for all uses, and such methods definitely have their place
in compiler technology. In fact, I've written papers on just this
subject (Refs. 1-3). But the architecture of a compiler which is
based on object-oriented approaches is vastly different than that
of the more classical compiler we've been building. Again, it
would seem to be entirely too much to change these horses in mid-
stream. As I said, programming styles change. Who knows, it may
be another six years before we finish this thing, and if we keep
changing the code every time programming style changes, we may
NEVER finish.
So for now, at least, I've determined to continue the classical
style in Pascal, though we might indeed discuss objects and object
orientation as we go. Likewise, the target machine will remain
the Motorola 68000 family. Of all the decisions to be made here,
this one has been the easiest. Though I know that many of you
would like to see code for the 80x86, the 68000 has become, if
anything, even more popular as a platform for embedded systems,
and it's to that application that this whole effort began in the
first place. Compiling for the PC, MSDOS platform, we'd have to
deal with all the issues of DOS system calls, DOS linker formats,
the PC file system and hardware, and all those other complications
of a DOS environment. An embedded system, on the other hand, must
run standalone, and it's for this kind of application, as an
alternative to assembly language, that I've always imagined that a
language like KISS would thrive. Anyway, who wants to deal with
the 80x86 architecture if they don't have to?
The one feature of Turbo Pascal that I'm going to be making heavy
use of is units. In the past, we've had to make compromises
between code size and complexity, and program functionality. A
lot of our work has been in the nature of computer
experimentation, looking at only one aspect of compiler technology
at a time. We did this to avoid to avoid having to carry around
large programs, just to investigate simple concepts. In the
process, we've re-invented the wheel and re-programmed the same
functions more times than I'd like to count. Turbo units provide
a wonderful way to get functionality and simplicity at the same
time: You write reusable code, and invoke it with a single line.
Your test program stays small, but it can do powerful things.
One feature of Turbo Pascal units is their initialization block.
As with an Ada package, any code in the main begin-end block of a
unit gets executed as the program is initialized. As you'll see
later, this sometimes gives us neat simplifications in the code.
Our procedure Init, which has been with us since Installment 1,
goes away entirely when we use units. The various routines in the
Cradle, another key features of our approach, will get distributed
among the units.
The concept of units, of course, is no different than that of C
modules. However, in C (and C++), the interface between modules
comes via preprocessor include statements and header files. As
someone who's had to read a lot of other people's C programs, I've
always found this rather bewildering. It always seems that
whatever data structure you'd like to know about is in some other
file. Turbo units are simpler for the very reason that they're
criticized by some: The function interfaces and their
implementation are included in the same file. While this
organization may create problems with code security, it also
reduces the number of files by half, which isn't half bad.
Linking of the object files is also easy, because the Turbo
compiler takes care of it without the need for make files or other
mechanisms.
STARTING OVER?
Four years ago, in Installment 14, I promised you that our days of
re-inventing the wheel, and recoding the same software over and
over for each lesson, were over, and that from now on we'd stick
to more complete programs that we would simply add new features
to. I still intend to keep that promise; that's one of the main
purposes for using units. However, because of the long time since
Installment 14, it's natural to want to at least do some review,
and anyhow, we're going to have to make rather sweeping changes in
the code to make the transition to units. Besides, frankly, after
all this time I can't remember all the neat ideas I had in my head
four years ago. The best way for me to recall them is to retrace
some of the steps we took to arrive at Installment 14. So I hope
you'll be understanding and bear with me as we go back to our
roots, in a sense, and rebuild the core of the software,
distributing the routines among the various units, and
bootstrapping ourselves back up to the point we were at lo, those
many moons ago. As has always been the case, you're going to get
to see me make all the mistakes and execute changes of direction,
in real time. Please bear with me ... we'll start getting to the
new stuff before you know it.
Since we're going to be using multiple modules in our new
approach, we have to address the issue of file management. If
you've followed all the other sections of this tutorial, you know
that, as our programs evolve, we're going to be replacing older,
more simple-minded units with more capable ones. This brings us to
an issue of version control. There will almost certainly be times
when we will overlay a simple file (unit), but later wish we had
the simple one again. A case in point is embodied in our
predilection for using single-character variable names, keywords,
etc., to test concepts without getting bogged down in the details
of a lexical scanner. Thanks to the use of units, we will be
doing much less of this in the future. Still, I not only suspect,
but am certain that we will need to save some older versions of
files, for special purposes, even though they've been replaced by
newer, more capable ones.
To deal with this problem, I suggest that you create different
directories, with different versions of the units as needed. If
we do this properly, the code in each directory will remain self-
consistent. I've tentatively created four directories: SINGLE
(for single-character experimentation), MULTI (for, of course,
multi-character versions), TINY, and KISS.
Enough said about philosophy and details. Let's get on with the
resurrection of the software.
THE INPUT UNIT
A key concept that we've used since Day 1 has been the idea of an
input stream with one lookahead character. All the parsing
routines examine this character, without changing it, to decide
what they should do next. (Compare this approach with the C/Unix
approach using getchar and unget, and I think you'll agree that
our approach is simpler). We'll begin our hike into the future by
translating this concept into our new, unit-based organization.
The first unit, appropriately called Input, is shown below:
{--------------------------------------------------------------}
unit Input;
{--------------------------------------------------------------}
interface
var Look: char; { Lookahead character }
procedure GetChar; { Read new character }
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Unit Initialization }
begin
GetChar;
end.
{--------------------------------------------------------------}
As you can see, there's nothing very profound, and certainly
nothing complicated, about this unit, since it consists of only a
single procedure. But already, we can see how the use of units
gives us advantages. Note the executable code in the
initialization block. This code "primes the pump" of the input
stream for us, something we've always had to do before, by
inserting the call to GetChar in line, or in procedure Init. This
time, the call happens without any special reference to it on our
part, except within the unit itself. As I predicted earlier, this
mechanism is going to make our lives much simpler as we proceed.
I consider it to be one of the most useful features of Turbo
Pascal, and I lean on it heavily.
Copy this unit into your compiler's IDE, and compile it. To test
the software, of course, we always need a main program. I used
the following, really complex test program, which we'll later
evolve into the Main for our compiler:
{--------------------------------------------------------------}
program Main;
uses WinCRT, Input;
begin
WriteLn(Look);
end.
{--------------------------------------------------------------}
Note the use of the Borland-supplied unit, WinCRT. This unit is
necessary if you intend to use the standard Pascal I/O routines,
Read, ReadLn, Write, and WriteLn, which of course we intend to do.
If you forget to include this unit in the "uses" clause, you will
get a really bizarre and indecipherable error message at run time.
Note also that we can access the lookahead character, even though
it's not declared in the main program. All variables declared
within the interface section of a unit are global, but they're
hidden from prying eyes; to that extent, we get a modicum of
information hiding. Of course, if we were writing in an object-
oriented fashion, we should not allow outside modules to access
the units internal variables. But, although Turbo units have a
lot in common with objects, we're not doing object-oriented design
or code here, so our use of Look is appropriate.
Go ahead and save the test program as Main.pas. To make life
easier as we get more and more files, you might want to take this
opportunity to declare this file as the compiler's Primary file.
That way, you can execute the program from any file. Otherwise,
if you press Cntl-F9 to compile and run from one of the units,
you'll get an error message. You set the primary file using the
main submenu, "Compile," in the Turbo IDE.
I hasten to point out, as I've done before, that the function of
unit Input is, and always has been, considered to be a dummy
version of the real thing. In a production version of a compiler,
the input stream will, of course, come from a file rather than
from the keyboard. And it will almost certainly include line
buffering, at the very least, and more likely, a rather large text
buffer to support efficient disk I/O. The nice part about the
unit approach is that, as with objects, we can modify the code in
the unit to be as simple or as sophisticated as we like. As long
as the interface, as embodied in the public procedures and the
lookahead character, don't change, the rest of the program is
totally unaffected. And since units are compiled, rather than
merely included, the time required to link with them is virtually
nil. Again, the result is that we can get all the benefits of
sophisticated implementations, without having to carry the code
around as so much baggage.
In later installments, I intend to provide a full-blown IDE for
the KISS compiler, using a true Windows application generated by
Borland's OWL applications framework. For now, though, we'll obey
my #1 rule to live by: Keep It Simple.
THE OUTPUT UNIT
Of course, every decent program should have output, and ours is no
exception. Our output routines included the Emit functions. The
code for the corresponding output unit is shown next:
{--------------------------------------------------------------}
unit Output;
{--------------------------------------------------------------}
interface
procedure Emit(s: string); { Emit an instruction }
procedure EmitLn(s: string); { Emit an instruction line }
{--------------------------------------------------------------}
implementation
const TAB = ^I;
{--------------------------------------------------------------}
{ Emit an Instruction }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Emit an Instruction, Followed By a Newline }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
end.
{--------------------------------------------------------------}
(Notice that this unit has no initialization clause, so it needs
no begin-block.)
Test this unit with the following main program:
{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Scanner, Parser;
begin
WriteLn('MAIN:");
EmitLn('Hello, world!');
end.
{--------------------------------------------------------------}
Did you see anything that surprised you? You may have been
surprised to see that you needed to type something, even though
the main program requires no input. That's because of the
initialization in unit Input, which still requires something to
put into the lookahead character. Sorry, there's no way out of
that box, or rather, we don't _WANT_ to get out. Except for simple
test cases such as this, we will always want a valid lookahead
character, so the right thing to do about this "problem" is ...
nothing.
Perhaps more surprisingly, notice that the TAB character had no
effect; our line of "instructions" begins at column 1, same as the
fake label. That's right: WinCRT doesn't support tabs. We have a
problem.
There are a few ways we can deal with this problem. The one thing
we can't do is to simply ignore it. Every assembler I've ever
used reserves column 1 for labels, and will rebel to see
instructions starting there. So, at the very least, we must space
the instructions over one column to keep the assembler happy. .
That's easy enough to do: Simply change, in procedure Emit, the
line:
Write(TAB, s);
by:
Write(' ', s);
I must admit that I've wrestled with this problem before, and find
myself changing my mind as often as a chameleon changes color.
For the purposes we're going to be using, 99% of which will be
examining the output code as it's displayed on a CRT, it would be
nice to see neatly blocked out "object" code. The line:
SUB1: MOVE #4,D0
just plain looks neater than the different, but functionally
identical code,
SUB1:
MOVE #4,D0
In test versions of my code, I included a more sophisticated
version of the procedure PostLabel, that avoids having labels on
separate lines, but rather defers the printing of a label so it
can end up on the same line as the associated instruction. As
recently as an hour ago, my version of unit Output provided full
support for tabs, using an internal column count variable and
software to manage it. I had, if I do say so myself, some rather
elegant code to support the tab mechanism, with a minimum of code
bloat. It was awfully tempting to show you the "prettyprint"
version, if for no other reason than to show off the elegance.
Nevertheless, the code of the "elegant" version was considerably
more complex and larger. Since then, I've had second thoughts. In
spite of our desire to see pretty output, the inescapable fact is
that the two versions of the MAIN: code fragment shown above are
functionally identical; the assembler, which is the ultimate
destination of the code, couldn't care less which version it gets,
except that the prettier version will contain more characters,
therefore will use more disk space and take longer to assemble.
but the prettier one not only takes more code to generate, but
will create a larger output file, with many more space characters
than the minimum needed. When you look at it that way, it's not
very hard to decide which approach to use, is it?
What finally clinched the issue for me was a reminder to consider
my own first commandment: KISS. Although I was pretty proud of
all my elegant little tricks to implement tabbing, I had to remind
myself that, to paraphrase Senator Barry Goldwater, elegance in
the pursuit of complexity is no virtue. Another wise man once
wrote, "Any idiot can design a Rolls-Royce. It takes a genius to
design a VW." So the elegant, tab-friendly version of Output is
history, and what you see is the simple, compact, VW version.
THE ERROR UNIT
Our next set of routines are those that handle errors. To refresh
your memory, we take the approach, pioneered by Borland in Turbo
Pascal, of halting on the first error. Not only does this greatly
simplify our code, by completely avoiding the sticky issue of
error recovery, but it also makes much more sense, in my opinion,
in an interactive environment. I know this may be an extreme
position, but I consider the practice of reporting all errors in a
program to be an anachronism, a holdover from the days of batch
processing. It's time to scuttle the practice. So there.
In our original Cradle, we had two error-handling procedures:
Error, which didn't halt, and Abort, which did. But I don't think
we ever found a use for the procedure that didn't halt, so in the
new, lean and mean unit Errors, shown next, procedure Error takes
the place of Abort.
{--------------------------------------------------------------}
unit Errors;
{--------------------------------------------------------------}
interface
procedure Error(s: string);
procedure Expected(s: string);
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Write error Message and Halt }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, 'Error: ', s, '.');
Halt;
end;
{--------------------------------------------------------------}
{ Write "<something> Expected" }
procedure Expected(s: string);
begin
Error(s + ' Expected');
end;
end.
{--------------------------------------------------------------}
As usual, here's a test program:
{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Errors;
begin
Expected('Integer');
end.
{--------------------------------------------------------------}
Have you noticed that the "uses" line in our main program keeps
getting longer? That's OK. In the final version, the main program
will only call procedures in our parser, so its use clause will
only have a couple of entries. But for now, it's probably best to
include all the units so we can test procedures in them.
SCANNING AND PARSING
The classical compiler architecture consists of separate modules
for the lexical scanner, which supplies tokens in the language,
and the parser, which tries to make sense of the tokens as syntax
elements. If you can still remember what we did in earlier
installments, you'll recall that we didn't do things that way.
Because we're using a predictive parser, we can almost always tell
what language element is coming next, just by examining the
lookahead character. Therefore, we found no need to prefetch
tokens, as a scanner would do.
But, even though there is no functional procedure called
"Scanner," it still makes sense to separate the scanning functions
from the parsing functions. So I've created two more units
called, amazingly enough, Scanner and Parser. The Scanner unit
contains all of the routines known as recognizers. Some of these,
such as IsAlpha, are pure boolean routines which operate on the
lookahead character only. The other routines are those which
collect tokens, such as identifiers and numeric constants. The
Parser unit will contain all of the routines making up the
recursive-descent parser. The general rule should be that unit
Parser contains all of the information that is language-specific;
in other words, the syntax of the language should be wholly
contained in Parser. In an ideal world, this rule should be true
to the extent that we can change the compiler to compile a
different language, merely by replacing the single unit, Parser.
In practice, things are almost never this pure. There's always a
small amount of "leakage" of syntax rules into the scanner as
well. For example, the rules concerning what makes up a legal
identifier or constant may vary from language to language. In
some languages, the rules concerning comments permit them to be
filtered by the scanner, while in others they do not. So in
practice, both units are likely to end up having language-
dependent components, but the changes required to the scanner
should be relatively trivial.
Now, recall that we've used two versions of the scanner routines:
One that handled only single-character tokens, which we used for a
number of our tests, and another that provided full support for
multi-character tokens. Now that we have our software separated
into units, I don't anticipate getting much use out of the single-
character version, but it doesn't cost us much to provide for
both. I've created two versions of the Scanner unit. The first
one, called Scanner1, contains the single-digit version of the
recognizers:
{--------------------------------------------------------------}
unit Scanner1;
{--------------------------------------------------------------}
interface
uses Input, Errors;
function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlNum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;
procedure Match(x: char);
function GetName: char;
function GetNumber: char;
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Numeric Character }
function IsDigit(c: char): boolean;
begin
IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }
function IsAlnum(c: char): boolean;
begin
IsAlnum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addition Operator }
function IsAddop(c: char): boolean;
begin
IsAddop := c in ['+','-'];
end;
{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }
function IsMulop(c: char): boolean;
begin
IsMulop := c in ['*','/'];
end;
{--------------------------------------------------------------}
{ Match One Character }
procedure Match(x: char);
begin
if Look = x then GetChar
else Expected('''' + x + '''');
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
if not IsAlpha(Look) then Expected('Name');
GetName := UpCase(Look);
GetChar;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNumber: char;
begin
if not IsDigit(Look) then Expected('Integer');
GetNumber := Look;
GetChar;
end;
end.
{--------------------------------------------------------------}
The following code fragment of the main program provides a good
test of the scanner. For brevity, I'll only include the
executable code here; the rest remains the same. Don't forget,
though, to add the name Scanner1 to the "uses" clause.
Write(GetName);
Match('=');
Write(GetNumber);
Match('+');
WriteLn(GetName);
This code will recognize all sentences of the form:
x=0+y
where x and y can be any single-character variable names, and 0
any digit. The code should reject all other sentences, and give a
meaningful error message. If it did, you're in good shape and we
can proceed.
THE SCANNER UNIT
The next, and by far the most important, version of the scanner is
the one that handles the multi-character tokens that all real
languages must have. Only the two functions, GetName and
GetNumber, change between the two units, but just to be sure there
are no mistakes, I've reproduced the entire unit here. This is
unit Scanner:
{--------------------------------------------------------------}
unit Scanner;
{--------------------------------------------------------------}
interface
uses Input, Errors;
function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlNum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;
procedure Match(x: char);
function GetName: string;
function GetNumber: longint;
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in ['A'..'Z'];
end;
{--------------------------------------------------------------}
{ Recognize a Numeric Character }
function IsDigit(c: char): boolean;
begin
IsDigit := c in ['0'..'9'];
end;
{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }
function IsAlnum(c: char): boolean;
begin
IsAlnum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addition Operator }
function IsAddop(c: char): boolean;
begin
IsAddop := c in ['+','-'];
end;
{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }
function IsMulop(c: char): boolean;
begin
IsMulop := c in ['*','/'];
end;
{--------------------------------------------------------------}
{ Match One Character }
procedure Match(x: char);
begin
if Look = x then GetChar
else Expected('''' + x + '''');
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var n: string;
begin
n := '';
if not IsAlpha(Look) then Expected('Name');
while IsAlnum(Look) do begin
n := n + Look;
GetChar;
end;
GetName := n;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNumber: string;
var n: string;
begin
n := '';
if not IsDigit(Look) then Expected('Integer');
while IsDigit(Look) do begin
n := n + Look;
GetChar;
end;
GetNumber := n;
end;
end.
{--------------------------------------------------------------}
The same test program will test this scanner, also. Simply change
the "uses" clause to use Scanner instead of Scanner1. Now you
should be able to type multi-character names and numbers.
DECISIONS, DECISIONS
In spite of the relative simplicity of both scanners, a lot of
thought has gone into them, and a lot of decisions had to be made.
I'd like to share those thoughts with you now so you can make your
own educated decision, appropriate for your application. First,
note that both versions of GetName translate the input characters
to upper case. Obviously, there was a design decision made here,
and this is one of those cases where the language syntax splatters
over into the scanner. In the C language, the case of characters
in identifiers is significant. For such a language, we obviously
can't map the characters to upper case. The design I'm using
assumes a language like Pascal, where the case of characters
doesn't matter. For such languages, it's easier to go ahead and
map all identifiers to upper case in the scanner, so we don't have
to worry later on when we're comparing strings for equality.
We could have even gone a step further, and map the characters to
upper case right as they come in, in GetChar. This approach works
too, and I've used it in the past, but it's too confining.
Specifically, it will also map characters that may be part of
quoted strings, which is not a good idea. So if you're going to
map to upper case at all, GetName is the proper place to do it.
Note that the function GetNumber in this scanner returns a string,
just as GetName does. This is another one of those things I've
oscillated about almost daily, and the last swing was all of ten
minutes ago. The alternative approach, and one I've used many
times in past installments, returns an integer result.
Both approaches have their good points. Since we're fetching a
number, the approach that immediately comes to mind is to return
it as an integer. But bear in mind that the eventual use of the
number will be in a write statement that goes back to the outside
world. Someone -- either us or the code hidden inside the write
statement -- is going to have to convert the number back to a
string again. Turbo Pascal includes such string conversion
routines, but why use them if we don't have to? Why convert a
number from string to integer form, only to convert it right back
again in the code generator, only a few statements later?
Furthermore, as you'll soon see, we're going to need a temporary
storage spot for the value of the token we've fetched. If we treat
the number in its string form, we can store the value of either a
variable or a number in the same string. Otherwise, we'll have to
create a second, integer variable.
On the other hand, we'll find that carrying the number as a string
virtually eliminates any chance of optimization later on. As we
get to the point where we are beginning to concern ourselves with
code generation, we'll encounter cases in which we're doing
arithmetic on constants. For such cases, it's really foolish to
generate code that performs the constant arithmetic at run time.
Far better to let the parser do the arithmetic at compile time,
and merely code the result. To do that, we'll wish we had the
constants stored as integers rather than strings.
What finally swung me back over to the string approach was an
aggressive application of the KISS test, plus reminding myself
that we've studiously avoided issues of code efficiency. One of
the things that makes our simple-minded parsing work, without the
complexities of a "real" compiler, is that we've said up front
that we aren't concerned about code efficiency. That gives us a
lot of freedom to do things the easy way rather than the efficient
one, and it's a freedom we must be careful not to abandon
voluntarily, in spite of the urges for efficiency shouting in our
ear. In addition to being a big believer in the KISS philosophy,
I'm also an advocate of "lazy programming," which in this context
means, don't program anything until you need it. As P.J. Plauger
says, "Never put off until tomorrow what you can put off
indefinitely." Over the years, much code has been written to
provide for eventualities that never happened. I've learned that
lesson myself, from bitter experience. So the bottom line is: We
won't convert to an integer here because we don't need to. It's
as simple as that.
For those of you who still think we may need the integer version
(and indeed we may), here it is:
{--------------------------------------------------------------}
{ Get a Number (integer version) }
function GetNumber: longint;
var n: longint;
begin
n := 0;
if not IsDigit(Look) then Expected('Integer');
while IsDigit(Look) do begin
n := 10 * n + (Ord(Look) - Ord('0'));
GetChar;
end;
GetNumber := n;
end;
{--------------------------------------------------------------}
You might file this one away, as I intend to, for a rainy day.
PARSING
At this point, we have distributed all the routines that made up
our Cradle into units that we can draw upon as we need them.
Obviously, they will evolve further as we continue the process of
bootstrapping ourselves up again, but for the most part their
content, and certainly the architecture that they imply, is
defined. What remains is to embody the language syntax into the
parser unit. We won't do much of that in this installment, but I
do want to do a little, just to leave us with the good feeling
that we still know what we're doing. So before we go, let's
generate just enough of a parser to process single factors in an
expression. In the process, we'll also, by necessity, find we
have created a code generator unit, as well.
Remember the very first installment of this series? We read an
integer value, say n, and generated the code to load it into the
D0 register via an immediate move:
MOVE #n,D0
Shortly afterwards, we repeated the process for a variable,
MOVE X(PC),D0
and then for a factor that could be either constant or variable.
For old times sake, let's revisit that process. Define the
following new unit:
{--------------------------------------------------------------}
unit Parser;
{--------------------------------------------------------------}
interface
uses Input, Scanner, Errors, CodeGen;
procedure Factor;
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Parse and Translate a Factor }
procedure Factor;
begin
LoadConstant(GetNumber);
end;
end.
{--------------------------------------------------------------}
As you can see, this unit calls a procedure, LoadConstant, which
actually effects the output of the assembly-language code. The
unit also uses a new unit, CodeGen. This step represents the last
major change in our architecture, from earlier installments: The
removal of the machine-dependent code to a separate unit. If I
have my way, there will not be a single line of code, outside of
CodeGen, that betrays the fact that we're targeting the 68000 CPU.
And this is one place I think that having my way is quite
feasible.
For those of you who wish I were using the 80x86 architecture (or
any other one) instead of the 68000, here's your answer: Merely
replace CodeGen with one suitable for your CPU of choice.
So far, our code generator has only one procedure in it. Here's
the unit:
{--------------------------------------------------------------}
unit CodeGen;
{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }
procedure LoadConstant(n: string);
begin
EmitLn('MOVE #' + n + ',D0' );
end;
end.
{--------------------------------------------------------------}
Copy and compile this unit, and execute the following main
program:
{--------------------------------------------------------------}
program Main;
uses WinCRT, Input, Output, Errors, Scanner, Parser;
begin
Factor;
end.
{--------------------------------------------------------------}
There it is, the generated code, just as we hoped it would be.
Now, I hope you can begin to see the advantage of the unit-based
architecture of our new design. Here we have a main program
that's all of five lines long. That's all of the program we need
to see, unless we choose to see more. And yet, all those units
are sitting there, patiently waiting to serve us. We can have our
cake and eat it too, in that we have simple and short code, but
powerful allies. What remains to be done is to flesh out the
units to match the capabilities of earlier installments. We'll do
that in the next installment, but before I close, let's finish out
the parsing of a factor, just to satisfy ourselves that we still
know how. The final version of CodeGen includes the new
procedure, LoadVariable:
{--------------------------------------------------------------}
unit CodeGen;
{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);
procedure LoadVariable(Name: string);
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }
procedure LoadConstant(n: string);
begin
EmitLn('MOVE #' + n + ',D0' );
end;
{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }
procedure LoadVariable(Name: string);
begin
EmitLn('MOVE ' + Name + '(PC),D0');
end;
end.
{--------------------------------------------------------------}
The parser unit itself doesn't change, but we have a more complex
version of procedure Factor:
{--------------------------------------------------------------}
{ Parse and Translate a Factor }
procedure Factor;
begin
if IsDigit(Look) then
LoadConstant(GetNumber)
else if IsAlpha(Look)then
LoadVariable(GetName)
else
Error('Unrecognized character ' + Look);
end;
{--------------------------------------------------------------}
Now, without altering the main program, you should find that our
program will process either a variable or a constant factor. At
this point, our architecture is almost complete; we have units to
do all the dirty work, and enough code in the parser and code
generator to demonstrate that everything works. What remains is
to flesh out the units we've defined, particularly the parser and
code generator, to support the more complex syntax elements that
make up a real language. Since we've done this many times before
in earlier installments, it shouldn't take long to get us back to
where we were before the long hiatus. We'll continue this process
in Installment 16, coming soon. See you then.
REFERENCES
1. Crenshaw, J.W., "Object-Oriented Design of Assemblers and
Compilers," Proc. Software Development '91 Conference, Miller
Freeman, San Francisco, CA, February 1991, pp. 143-155.
2. Crenshaw, J.W., "A Perfect Marriage," Computer Language, Volume
8, #6, June 1991, pp. 44-55.
3. Crenshaw, J.W., "Syntax-Driven Object-Oriented Design," Proc.
1991 Embedded Systems Conference, Miller Freeman, San
Francisco, CA, September 1991, pp. 45-60.
*****************************************************************
* *
* COPYRIGHT NOTICE *
* *
* Copyright (C) 1994 Jack W. Crenshaw. All rights reserved. *
* *
* *
*****************************************************************
--
--/*\ Jon A. Lambert - TychoMUD Internet:jlsysinc@ix.netcom.com /*\--
--/*\ Mud Server Developer's Page <
http://www.netcom.com/~jlsysinc> /*\--
--/*\ "Everything that deceives may be said to enchant" - Plato /*\--