Machine Code Optimization
This is a Web edition of my 1986 Ph.D. dissertation “Machine Code Optimization –
Improving Executable Object Code”. The original dissertation was published
as Technical Report #246 by the Computer Science Department of
Courant Institute of Mathematical Sciences, New York University, June 1986.
The text for this Web edition was extracted from a PDF file made available courtesy of
The Internet Archive and reformatted for the web in January 2011.
The original version was authored on a CDC 6600 using TROFF with TBL and EQN pre-processor directives for tables and equations. Diagrams were redrawn using Corel Draw.
I am releasing this under the Creative Commons Attribution-Noncommercial 3.0 license. This license lets you distribute and build upon this work non-commercially as long as you credit me as the author of this original work. Please see the full legal license for details.
If you have any comments, corrections, or suggestions, please contact me at clint@goss.com.
–– Clint Goss [clint@goss.com], initially published on-line March 2011, revised August 2013
Machine Code Optimization -
Improving Executable Object Code
Author: Clinton F. Goss
Advisor: Edmond Schonberg
Institution: Courant Institute of Mathematical Sciences, New York University
Date: June 1986
Abstract
This dissertation explores classes of compiler optimization techniques that are
applicable late in the compilation process, after all executable code for a program has
been linked. I concentrate on techniques which, for various reasons, cannot be applied
earlier in the compilation process. In addition to a theoretical treatment of this class of
optimization techniques, this dissertation reports on an implementation of these
techniques in a production environment. I describe the details of the implementation
which allows these techniques to be re-targeted easily and report on improvements
gained when optimizing production software.
I begin by demonstrating the need for optimizations at this level in the UNIX
programming environment. I then describe a Machine Code Optimizer that improves
code in executable task files in that environment. The specific details of certain
algorithms are then described: code elimination to remove unreachable code, code
distribution to re-order sections of code, operand reduction to convert operands to use
more advantageous addressing modes available on the target architecture, and macro
compression to collapse common sequences of instructions. I show that the problem of
finding optimal solutions for code distribution is NP-Complete and discuss heuristics
for practical solutions.
I then describe the implementation of a Machine Code Optimizer containing the
code elimination, code distribution, and operand reduction algorithms. This optimizer
operates in a production environment and incorporates a machine-independent
architecture representation that allows it to be ported across a large class of machines.
I demonstrate the portability of the Machine Code Optimizer to the Motorola
MC68000 and the Digital VAX-11 instruction sets. Finally, metrics on the
improvements obtained across architectures and across the optimization techniques are
provided along with proposed lines of further research. The metrics demonstrate that
substantial reductions in code space and more modest improvements in execution speed
can be obtained using these techniques.
Table of Contents
Trademarks
1. Introduction
1.1. Background
1.2. Related Work
1.3. Organization of the Dissertation
2. Design of the MCO
2.1. Input
2.2. Instruction Parsing and Internal Representations
2.3. Text Blocking
2.4. Operand Linking
2.5. Code Elimination
2.6. Code Distribution
2.7. Operand Reduction
2.8. Code Relocation
3. Code Elimination
3.1. Restrictions on the Input Program
3.2. Current Code Elimination Techniques
3.3. Subprogram Elimination
4. Code Distribution
4.1. Data Distribution
4.2. Complexity of Code Distribution
4.3. Heuristics for Code Distribution
5. Operand Reduction
5.1. Static Data Structures
5.2. Minimize and Lengthen
5.3. Register Tracking
6. Macro Compression
6.1. Background
6.2. Assembly Code Compression
6.3. A First Attempt
6.4. Macro Compression in the MCO
6.5. G-Code and G-Compression
7. Measurement and Evaluation of Performance
7.1. Test Input
7.2. Speed of the MCO
7.3. Space Requirements of the MCO
7.4. Effect on Program Space
7.5. Effect on Program Speed
8. Conclusions
8.1. Review of Work Done
8.2. Proposals for Further Investigation
Appendix A. Definition of Text and Block Node Fields
Appendix B. Low Level Implementation of Data Structures
Appendix C. Low Level Implementation of Algorithms
References and Bibliography
Trademarks
- PHILON® is registered in the U.S. Patent and Trademark Office.
- Philon FAST/C®, Philon FAST/COBOL®, Philon FAST/FORTRAN®, and Philon FAST/BASIC-C® are trademarks of Philon, Inc.
- MC68000® and MC68020® are trademarks of Motorola, Inc.
- UNIX® is a trademark of Bell Laboratories (a division of AT&T).
- DEC®, VAX®, and PDP-11® are trademarks of Digital Equipment Corporation.
1. Introduction
The topic of compiler optimization covers a wide range of program analysis
methods and program transformations which are applied primarily to improve the speed
or space efficiency of a target program. These techniques are typically applied to a
representation of the target program which is, to some degree, removed from the
program representation executed by the hardware. The representations on which
optimization techniques are applied range from source-to-source transformations ([Parts 83], [Schn 73]) down to optimizations on assembly code ([Fras 84], [Lower 69], [McKee 65]).
However, in many program development environments, some significant optimization
techniques cannot be performed on any program representation prior to the
representation executed by the hardware, without general re-design of that environment.
It is my thesis that the application of optimization techniques at this level is
warranted and can be shown to yield a significant decrease in code space and a more
modest improvement in execution speed. As a result, this dissertation describes two
aspects of this area in parallel. I explore analysis and optimization techniques from a
theoretical viewpoint. Some of these are new and some are extensions of techniques
which have been applied in other phases of the compilation process. In addition, I
report on the implementation of a production quality Machine Code Optimizer. This
optimizer represents the first time that these techniques have been brought together in
this fashion and at this level. The performance of this optimizer substantiates the
expected speed and space improvements on two target architectures.
In this chapter, I demonstrate the need for optimizations at this level and suggest
that such optimizations can be carried out despite the lack of auxiliary information which
would normally be available to an optimizer. I also survey existing work in closely
related areas and outline the remainder of this dissertation.
1.1. Background
As a working example throughout this dissertation, I will consider the compilation
and optimization of programs under UNIX and UNIX-like program development
environments. These environments will be considered specifically for machines based on
architectures such as the Digital Equipment Corporation VAX-11 ([DEC 77]), Motorola
MC68000 ([Motor 84a]), and the Texas Instruments TI32000 ([Texas 85]). I will, unless
specifically noted, rely only on features of UNIX which are generally available in
program development environments. The scope of architectures considered in this
research is discussed later in this section.
In source form, a program consists of a number of modules, each containing one or
more subprograms (subroutines, functions, etc.). A compiler for the given high level
language reads the source code of a single module, possibly translating into one or more
internal forms over which optimization techniques are performed, and produces a single
object file. This file contains machine code consisting of instructions to be executed by the
hardware, data objects which are operated on by the instructions, and other information.
The architectures to be considered here have a memory area consisting of locations with
an associated linear ordering. The locations are numbered sequentially by addresses that
follow the ordering. Each instruction on these architectures has an opcode which names
the operation to be performed and a number of operands which yield values or give the
address of data objects or memory locations to be operated on. Each operand requires
an area, called an extension word, in the instruction to hold information on the value or
machine address which the operand represents.
The bit representations of the instructions and data objects in an object file are
identical to those which will appear in memory when the program is executed, except that
references to code and data objects in other modules as well as references to absolute
addresses in the current object file are not set. Such references are called relocatable
references. Any operand of an instruction containing a relocatable reference is called a
relocatable operand and the address it references is called the effective address.
Information regarding where relocatable references are and what they refer to is
contained in the relocation information in each object file. The location of a relocatable
reference, as specified by the relocation information, is called its relocation point.
When all modules of a program are compiled, the object files are supplied to a
system linker. The linker produces a task file which can be directly loaded into memory
and executed by the hardware. Such a task file contains areas of code, data objects, and
optionally, relocation information. Each area, or segment, is formed by catenating the
corresponding areas from each object file, in the order they were supplied to the linker,
and resolving relocatable references by installing the actual machine address in each
reference.
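The linker's resolution step can be sketched in C. The sketch below is illustrative only: the record layout, field names, and the assumption of 32-bit pointer-sized references patched with memcpy() are mine, not those of any particular UNIX linker.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical relocation record: the byte offset of one
       pointer-sized reference within the task image. */
    struct reloc {
        uint32_t where;
    };

    /* Resolve relocatable references: the unresolved value stored at
       each relocation point is an offset from the base of the segment
       it refers to, so adding that segment's load base yields the
       final machine address.  (Byte-order issues are ignored here.) */
    static void resolve(uint8_t *image, const struct reloc *rel,
                        int nrel, uint32_t seg_base)
    {
        for (int i = 0; i < nrel; i++) {
            uint32_t v;
            memcpy(&v, image + rel[i].where, sizeof v);
            v += seg_base;
            memcpy(image + rel[i].where, &v, sizeof v);
        }
    }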
For a particular high level language, it is typical to organize a set of object files that
implement the primitives of the language (e.g. SIN(x) in BASIC, Indexed Read in
COBOL, and printf() in C) into a library. Such libraries can be given to the linker,
which selects only those object files that contain code or data referenced by other
modules already linked into the task file.
This general approach reduces the compilation work necessary to effect small
changes in a program: only the affected modules need be recompiled. Since linking
object files is far faster than re-compiling the whole program from source, this system
greatly speeds development of highly modularized programs.
However, this general approach results in a number of inefficiencies in the code in
the task files. Furthermore, the optimization techniques that might remove these
inefficiencies must be performed after the link phase on the given architecture.
The first inefficiency arises when linking an object file which contains code for
several subprograms. If any of the subprograms or data objects in such an object file is
referenced, the entire object file is linked into the task file. This situation frequently
arises in UNIX environments where large libraries implementing primitives of various
types are linked into an application. Furthermore, it cannot be avoided prior to the link
phase, except by restructuring the offending object files.
Another inefficiency deals with the use of the instruction set itself. The given
architectures all have a set of addressing modes which can be used to represent the
semantics of instruction operands. Included in this set are a number of location-relative
modes which yield an effective address by giving an offset from the location of the
operand or the start of the instruction. Often, the location-relative modes require less
space and yield an effective address that is decoded (by the hardware) faster than
absolute modes which simply name the effective address. However, the short offset
employed limits the effective address to within a specified distance from the operand.
This limitation is called the span of the mode.
For example, many machines have several addressing modes for operands of branch
instructions. Each addressing mode has its own span restrictions. Often, the most
general form allows an arbitrary branch target, but is the most expensive in terms of
space and execution speed. Shorter and more efficient forms of branch operands
compute their target address relative to the memory address of the branch instruction
itself, together with an offset from the start of the instruction. However, the offset must
be small, allowing only relatively local branch targets.
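As a concrete check, deciding whether a self-relative displacement fits a given mode's span is a small signed-range test. The sketch below is mine, not the MCO's code; the 8- and 16-bit widths correspond to the short and word branch displacements of the MC68000.

    #include <stdbool.h>
    #include <stdint.h>

    /* Does a self-relative displacement fit in a signed field of
       'bits' bits?  bits = 8 gives -128..+127 (short branch);
       bits = 16 gives -32768..+32767 (word branch). */
    static bool fits_span(int32_t disp, int bits)
    {
        int32_t lo = -(1 << (bits - 1));
        int32_t hi =  (1 << (bits - 1)) - 1;
        return disp >= lo && disp <= hi;
    }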
Often, these location-relative modes cannot be used in UNIX task files due to the
method of linking. After an object file is produced, the sizes of instructions do not
change; the linker merely fills in resolved addresses in relocatable operands. Again, this
allows a fast linker to perform minimal work when the code in a single module is changed.
However, relocatable effective addresses which cannot be confined within the span of a
location-relative mode must be implemented using the most general and usually most
expensive addressing mode. Under the UNIX scheme, this includes references to code
in other modules as well as all references to static data, since the data appears in memory
after all code and can be arbitrarily far away from an operand.
Finally, the installation of location-relative modes is itself limited by the order in
which object files are linked. Typically, libraries containing code for primitives are
linked in at the end of the text segment. Thus, they tend to be far away from the code
which uses them. In many high level languages, primitives tend to be the most
frequently referenced routines, and locating them at the end of the text segment may
significantly reduce an optimizer's ability to install location-relative modes in operands. Such high-use subprograms need to be placed near their references.
Conversely, code generated from the user's source appears at the beginning, far
removed from the data segment containing referenced global variables. This code should
appear near the end of the text segment, as close to the data segment as possible.
The inefficiencies described thus far are common across a variety of architectures. These generally include machines with a linear address space which provide several
interchangeable addressing modes to access this space. Other than the MC68000,
VAX-11, and TI32000 mentioned earlier, the Digital PDP-11 ([DEC 75]), Interdata 8/32,
IBM 1130, CDC 6600 Peripheral Processor, and the Prime 400 (see [Bell 71] for a
general discussion) are in this class. The above remarks do not apply to machines
with a purely segmented memory architecture, such as the Intel 8086 ([Russ 80]).
The generic techniques for handling the inefficiencies described above are
applicable across this full class of architectures. However, the implementation of these
techniques for a given architecture is highly dependent on the specifics of the
instruction set, addressing structure, and memory model of the target machine. A
straightforward implementation of these techniques will be riddled with specific
references to the architecture. Therefore, it is of special interest to develop, in
conjunction with generic techniques for handling these inefficiencies, a technology for
instantiating those techniques in an architecture-independent fashion.
1.2. Related Work
The general topic of compiler optimization has received much attention, with the
bulk of the work concentrating on transformations applicable to some intermediate
representation between source code and executable code. I have borrowed high-level
concepts and techniques from a number of sources, applying them with greater
effectiveness at the machine code level. While the specific work in each area is
reviewed in detail in the relevant sections which follow, I outline the major references
in each area:
A number of the inefficiencies described above can be partially removed by means
of techniques applied at a higher program level. These include the implementation of
code elimination of various forms, as described in [Aho 77], the handling of span-dependent
instructions at the assembly code level in [Szym 78], and the compression of
repeated code sequences at the assembly code level in [Fras 84].
Many techniques in the literature deal with low-level constructs but are not
applicable to machine code optimization because they are effective on an intra-module
basis; they are thus more efficiently applied before the machine code level, so that the
work is not repeated on a given module. These techniques include ordering basic
blocks to minimize the number of branches [Raman 84], chaining span-dependent jumps
[Lever 80], and peephole optimization [McKee 65].
Comparatively little work has been done on optimizations at the machine code level. The works known to the author are those by [Dewar 79a] on compressing an interpretive
byte stream and [Rober 79] on distribution of data throughout the code segment as
described in Chapter 4.
1.3. Organization of the Dissertation
In response to the inefficiencies described in Section 1.1 and building on the work
surveyed in the last Section, a Machine Code Optimizer (MCO) was constructed to make
machine code smaller and faster. The remainder of this dissertation deals mainly with
the design, implementation, and performance of the MCO. I concentrate on
describing those techniques which, for various reasons, cannot be applied before the link
phase of compilation.
The next Chapter gives an overview of the organization of the MCO and outlines
the design of particular areas.
Chapters 3, 4, and 5 expand on the specific techniques for removing the
inefficiencies in Section 1.1. References to related work in each area are given as well as
the specific algorithms and their analysis.
Chapter 6 describes a number of techniques relating to recognizing and compressing
common sequences of code. These are not used in the current MCO for various
implementation or efficiency reasons, but the experience gained is of interest to compiler
constructors.
Chapter 7 presents statistics on the space and speed improvements gained by the
MCO on VAX-11 and MC68000 code for various high level languages. Statistics on
the space and time cost of the MCO itself are also presented as well as the effort of
re-targeting the MCO from the 68000 to the VAX architecture.
Finally, Chapter 8 reviews the work, summarizes the results obtained, and proposes
lines of future research.
2. Design of the MCO
The MCO reads an input task file containing executable machine code, data, and
relocation information for a given architecture, applies various techniques for
improving machine code, and outputs another task file which is semantically equivalent.
Briefly, the MCO operates in the following sequential phases:
The input task file is read and augmented by a set of dynamic data structures which
hold information about the instructions and data of the program. These are built up
during instruction parsing in which the byte stream containing program code is partitioned
into machine instructions and data areas. This list of instructions and data areas is then
partitioned into subprograms during text blocking.
The next phase, called operand linking, is responsible for identifying all relocatable
operands and determining what they refer to.
The first optimization performed is code elimination in which unreferenced areas of
code and subprograms are removed from the dynamic data structures.
Then, code distribution is performed. Sections of code and data are re-ordered to
reduce the average distance between instruction operands and the effective addresses they
reference. This transformation by itself does not improve the code, but makes the next
technique more effective.
Operand reduction converts each instruction operand to use the least expensive
addressing mode which can represent the operand on the given architecture. This
operates in two sub-phases: MINIMIZE contracts all operands to use the least expensive
applicable addressing mode and LENGTHEN expands minimized operands as necessary to
satisfy constraints on the addressing modes.
Finally, the code relocation phase installs changes in the bit patterns of instructions
as a result of the improvement techniques applied and writes the output task file.
One of the design goals of the MCO is to ease the onus of re-targeting the MCO to
various architectures. Toward this goal, most of the relevant information about the
target architecture is kept in a set of static data structures. They describe the details of
the instruction set and addressing modes of the target architecture which are needed by
the MCO, especially during Operand Reduction. The static data structures allow the
MCO to be largely table driven in areas where re-targeting is an issue.
In this chapter, I give a more detailed description of the dynamic data structures
and the phases of the MCO I have just outlined. Particular attention is given to how
the phases interface and what their effects are on the dynamic data structures. Certain
algorithms as well as the static data structures are described and analyzed in later
chapters.
2.1. Input
Input to the MCO consists of a single task file. This file contains the following areas
of information:
The Header: A fixed-size structure containing the sizes of the other areas, the
program load address where the program is loaded into memory, and the entry point
giving the location of the first instruction to be executed.
The Text Segment: A byte stream containing the machine code instructions of the
program, possibly interspersed with areas of program data. At execution time, the byte
stream is loaded into memory (possibly in a virtual fashion) by a system loader,
beginning at the program load address specified in the header. The address at which an
instruction is loaded into memory is called that instruction's load address.
The Data Segment: A byte stream similar to the text segment, but containing only
program data. It is loaded by the system loader either directly after the text segment or
at some address specified in the header. The rules as to where the data segment may be
loaded in relation to the text segment (e.g. at the next 64-byte boundary) vary depending
on the environment. The location where the data is loaded into memory is the data load
address.
The Symbol Table: A list of structures which map symbolic names onto symbol
types and machine addresses.
Relocation Information: A list of locations in the text and data segments which
reference machine addresses. These may be instruction operands which specify the
address of an object in the data segment or a pointer in the data segment initialized to
point to another piece of data or an instruction. Each such area specified is the size of a
pointer on the target architecture.
Except for the last area, the information required by the MCO in a task file is
standard in that such information is logically required for a system loader to be able to
load a program into memory.
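The header described above might be rendered in C as in the sketch below. The field names and widths are invented for illustration; actual UNIX a.out headers differ from system to system.

    #include <stdint.h>

    /* Hypothetical task-file header (field names illustrative). */
    struct task_header {
        uint32_t text_size;   /* bytes in the text segment        */
        uint32_t data_size;   /* bytes in the data segment        */
        uint32_t sym_size;    /* bytes of symbol table            */
        uint32_t reloc_size;  /* bytes of relocation information  */
        uint32_t load_addr;   /* program load address             */
        uint32_t entry;       /* address of the first instruction */
    };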
The Relocation Information is optionally provided by the UNIX linker, which
links together object files. On some systems, the linker cannot provide this information
in the task file. However, the relocation information is simply distilled by the UNIX
linker from similar information in each of the object files it links together. This
information must be present in some form in object files in order for a linker to assign
proper values to pointers. In this case, the MCO can extract and distill it in the same way
that the UNIX linker does.
2.2. Instruction Parsing and Internal Representations
After opening the input task file and reading the header, the MCO begins parsing
the text and data segments to build an internal representation of the program. First, the
contents of the text and data segments are read into buffers in memory. Then the MCO
creates a list of text and data nodes in memory to hold relevant information about the
program.
Each text node describes a single machine instruction and each data node is
associated with a single area of contiguous data. The last node on the list is always a data
node which represents the data segment.
Appendix A provides a detailed description of the fields in a text node and what
they represent. Figure 2.1 summarizes the fields of a text node using an example of an
instruction on the 68000 architecture as they would appear after instruction parsing. In
this example, the target address, S, is represented using the absolute long mode.
Instruction:

    0A00    jsr S
    0A06    ...
    0C20    S:

Text Node:

    OPC:    o_jsr
    SIZE:   sz_none
    IADDR:  0A00
    FADDR:  (NULL)
    INSTR:  4EB900000C20
    NEXT:   (next node)
    IBYTES: 6
    NBYTES: 6
    REF:    0
    JSR:    0
    OP[0]:
        ADDR:   0
        TARGET: (not yet set)
        MODE:   am_abs
        OFFSET: 2
        REG:    NULL

Figure 2.1 Example Text Node
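For readers who prefer code to field lists, a C rendering of the text node might resemble the following sketch. The types and array bounds are guesses on my part; only the field names and their roles come from Figure 2.1 and Appendix A.

    #include <stdint.h>

    #define MAXOPS 2               /* operands kept per instruction (assumed) */

    struct operand {
        uint32_t      addr;        /* ADDR: effective address, if relocatable */
        struct tnode *target;      /* TARGET: node containing that address    */
        int           mode;        /* MODE: addressing mode, e.g. am_abs      */
        int           offset;      /* OFFSET: extension-word offset in INSTR  */
        int           reg;         /* REG: register used by the mode, if any  */
    };

    struct tnode {
        int            opc;        /* OPC: opcode, e.g. o_jsr                 */
        int            size;       /* SIZE: operation size, e.g. sz_none      */
        uint32_t       iaddr;      /* IADDR: load address in the input file   */
        uint32_t       faddr;      /* FADDR: load address in the output file  */
        uint8_t        instr[12];  /* INSTR: instruction bit pattern          */
        struct tnode  *next;       /* NEXT: following text or data node       */
        int            ibytes;     /* IBYTES: input length in bytes           */
        int            nbytes;     /* NBYTES: current length in bytes         */
        int            ref;        /* REF: relocatable references to here     */
        int            jsr;        /* JSR: known call references to here      */
        struct operand op[MAXOPS]; /* OP: operand descriptors                 */
    };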
Each data node holds information pertaining to a single contiguous block of data. The data may be in the text or data segments. Data nodes have IADDR, FADDR, NEXT, NBYTES, and REF fields which are identical to those of text nodes. They also have a DATA field which serves the same purpose as the INSTR field in text nodes.
The dynamic data structures are built up by an instruction parsing routine. This routine is given a pointer to a location in the input text segment and determines the information needed to initialize a single text or data node for the instruction or data area beginning at that location. The instruction parsing routine depends heavily on the architecture and takes a significant portion of the processing time of the MCO.
For the 68000, the logic of parsing instructions is embedded in a large routine (28 pages of C source) which was tightly coded for speed. When re-targeting to the VAX architecture, a data-driven scheme was used instead. The VAX routine is small (2 pages of C source) and was developed and debugged quickly, yet runs about as fast as the 68000 version. This was possible due to the greater orthogonality of the VAX instruction set.
2.3. Text Blocking
Instruction parsing organizes the text and data nodes into a single linked list, in the
order they were read in. This single list is broken down into a two-level data structure
during text blocking.
The text and data nodes are partitioned into blocks, each of which is assigned a
block node.
Text blocking is performed in a single pass over the code. A pair of text nodes
containing an unconditional branch or subprogram return followed by an instruction with
its JSR field set constitutes a partition point. At these points, a new block is formed.
Thus, a block is typically one or several subprograms in the text segment where
each block is independent and linked only via the block nodes. After text blocking, the
two-level data structure is processed by all subsequent algorithms, rather than the initial
single list of text nodes. Figure 2.2 demonstrates this transformation. The specific fields
of a block node are described in detail in Appendix A.
Text blocking is done for two reasons. First, the code distribution algorithm
reorders sections of code. After text blocking, it simply deals with block nodes rather
than lists of text and data nodes. Second, several of the algorithms performed on the
dynamic data structures have a worst-case performance which is quadratic in the number
of text nodes since they have to perform linear searches for a node with a given IADDR.
In these cases, we search through the list of block nodes to find the correct block and
then examine the text and data nodes in that block. In this way, the quadratic algorithms
run in reasonable time for all but pathological or contrived input.
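A sketch of the two-level search follows, using the struct tnode above and assuming (hypothetically) that each block node records the IADDR range it covers.

    /* Hypothetical block node: a list of blocks, each holding its own
       chain of text/data nodes and the IADDR range they occupy. */
    struct bnode {
        struct bnode *next;
        struct tnode *nodes;
        uint32_t      lo_iaddr, hi_iaddr;
    };

    /* Find the node loaded at a given input address: locate the block
       covering iaddr, then scan only that block's short node list. */
    struct tnode *find_node(struct bnode *blocks, uint32_t iaddr)
    {
        for (struct bnode *b = blocks; b != NULL; b = b->next)
            if (iaddr >= b->lo_iaddr && iaddr <= b->hi_iaddr) {
                for (struct tnode *t = b->nodes; t != NULL; t = t->next)
                    if (iaddr >= t->iaddr && iaddr < t->iaddr + t->ibytes)
                        return t;
                break;
            }
        return NULL;    /* no node at this address */
    }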
Figure 2.2 Text Blocking
2.4. Operand Linking
After the instructions have been parsed and the dynamic data structures built, the
relocatable operands are identified and linked to their targets. This operand linking is
done in two passes over the dynamic data structures.
The first pass identifies all relocatable operands. This is done by a pair of co-routines
which pass over the text and data nodes and over the relocation information in
the input task file. The first co-routine processes instructions up to the next relocation
point specified by the second co-routine. The second co-routine processes the relocation
information to determine the next text or data node which has a relocatable operand.
The first co-routine marks any operand which uses some location-relative addressing
mode as relocatable and the effective address is stored in the ADDR field of the operand.
At a relocation point, we determine whether the address specified as being relocatable is
in a text node or a data node. If it is in a text node, the operand of the instruction
specified by the relocation information is identified and again marked as relocatable by
setting its ADDR field. Since no relocation information is kept in data nodes, relocation
directives specifying relocatable references in data nodes are ignored.
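In C, the two co-routines collapse into a single merge-style loop over the node list and the address-ordered relocation records. All helper functions named below are hypothetical stand-ins for the corresponding MCO logic, not its actual routines.

    /* Hypothetical helpers for the first operand linking pass. */
    extern void mark_location_relative(struct tnode *t);
    extern int  reloc_in_node(const struct reloc *r, const struct tnode *t);
    extern struct operand *operand_at(struct tnode *t, uint32_t where);
    extern uint32_t stored_address(const struct tnode *t,
                                   const struct operand *op);

    /* Pass 1 of operand linking: set the ADDR field of every
       relocatable operand. */
    void link_pass1(struct tnode *nodes, const struct reloc *rel, int nrel)
    {
        int i = 0;
        for (struct tnode *t = nodes; t != NULL; t = t->next) {
            mark_location_relative(t);   /* location-relative modes     */
            while (i < nrel && reloc_in_node(&rel[i], t)) {
                struct operand *op = operand_at(t, rel[i].where);
                if (op != NULL)          /* relocation point falls in   */
                    op->addr = stored_address(t, op);   /* a text node  */
                i++;                     /* data-node refs are ignored  */
            }
        }
    }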
Note that, except for location-relative addressing modes, it is crucial to use the
relocation information to identify relocatable operands. If we relied on the apparent nature
of the operand, based on its addressing mode, operands could not be unambiguously
identified as relocatable. Consider an instruction which loads the address of its operand.
Although this operand appears to be relocatable, the idioms
lea val,An
moval val,Rn
are often used to load constant (non-relocatable) values. Conversely, a comparison with
an immediate operand may be a constant, but could also be comparing a value with the
(relocatable) address of a routine.
After the first operand linking pass, all operands which are relocatable have their
ADDR field set. The second pass sets the TARGET field for all such operands. This
field is set to point to the text or data node containing the code or data that will be loaded
at the ADDR address. Note that the ADDR field need not refer to the start of the code
or data in the referenced node; the referenced node must simply contain the target.
It is this second pass which runs in quadratic time in the number of block nodes.
However, due to the blocked data structure employed, the second pass runs with
reasonable speed (see
Section §7.2).
In addition to setting the TARGET field, the REF field of any text node referenced
by a relocatable operand anywhere in a text or data node is incremented during pass 2. Also, the JSR field is incremented if the relocatable operand is the operand of some
subprogram call instruction.
Thus, at the end of operand linking, the ADDR and TARGET fields are set for all
and only those operands which are relocatable. Also, the REF field contains an exact
count of the number of relocatable references to a text or data node. The JSR field
contains a lower bound on the number of call instructions which refer to a given text
node. Due to indirect calls through pointers to procedures, some text nodes which may
be the target of a call instruction at execution time cannot be identified. However, such
nodes will always have a non-zero REF field.
Figure 2.3 depicts the text nodes from Figure 2.1 after the operand linking phase.
Instruction:

    0A00    jsr S
    0A06    ...
    0C20    S:

Text Node:

    OPC:    o_jsr
    SIZE:   sz_none
    IADDR:  0A00
    FADDR:  (NULL)
    INSTR:  4EB900000C20
    NEXT:   (next node)
    IBYTES: 6
    NBYTES: 6
    REF:    0
    JSR:    0
    OP[0]:                          Target node (at S):
        ADDR:   0C20                    OPC:    ??
        TARGET: ──→                     IADDR:  0C20
        MODE:   am_abs                  ...
        OFFSET: 2                       REF:    1
        REG:    NULL                    JSR:    1

Figure 2.3 Example Text Node After Operand Linking
2.5. Code Elimination
The first code improvement performed by the MCO is the elimination of code which can never be executed. As previously described, we wish to eliminate sections of code as well as entire subprograms which are never referenced.
The code elimination algorithm is an augmented version of unreferenced code elimination ([Aho 77]). It relies on the REF and JSR fields set during the previous phase to determine what code can safely be eliminated. This is done in a single forward pass over the code. The algorithm removes unreferenced code as well as certain referenced sections which are not reachable from the program entry.
The algorithm and the restrictions on what input programs it operates on are given in Chapter 3.
2.6. Code Distribution
As mentioned in the introduction, one of the inefficiencies in the way UNIX task images are linked arises from the order in which subprograms in the text segments are arranged. Both the order of subprograms in a module and the order in which modules
are supplied to the linker give no consideration to placing span-dependent operands near
their targets.
The code distribution phase re-orders subprograms in the text segment to place
span-dependent operands near their targets. The target might be in another subprogram
or in the data segment at the end of the program.
In Chapter 4, I discuss the general problems of re-ordering sections of the data
segment as well as the text segment from a theoretical viewpoint and show these
problems to be difficult to solve. Due to these results, I employ efficient heuristics to
distribute the blocks. These heuristics are described in Chapter 4, along with other
related optimization techniques not employed in the current MCO.
2.7. Operand Reduction
After code elimination and distribution, the operand reduction algorithm is
employed. This algorithm makes aggressive use of the addressing modes available on the
target architecture to transform existing instruction operands to make them smaller and
faster.
The general operand reduction algorithm is loosely based on the one proposed by
[Szym 78] for assembling code for architectures with span-dependent instructions. The
correctness and termination arguments in that paper apply in a similar fashion to operand
reduction.
As with the assembler algorithm of [Szym 78], we perform operand reduction in
two phases. The first phase, MINIMIZE, makes a single pass over the code. For
instructions and operands which can potentially be reduced (all relocatable operands and
certain others with specified addressing modes), we form the set of all legal
opcode/addressing mode pairs which can yield a semantically equivalent instruction. We
then choose the shortest combination and install it.
After MINIMIZE, the LENGTHEN phase iterates over the code identifying operands
which employ addressing modes that are unsuitable due to some span-dependent
constraint on the mode. Again, we form the set of possible opcode/addressing mode
substitutions. We now choose the least expensive one which satisfies all semantic as well
as span-dependent constraints.
Again, no change is made to instruction bit patterns, but sufficient space is
maintained in the text node to store the full instruction.
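The shape of the two sub-phases can be outlined as below. This is a schematic of my own, with hypothetical helpers standing in for the table-driven machinery described in Chapter 5.

    #include <stdbool.h>

    /* Hypothetical helpers: consult the static data structures for
       the least expensive semantically equivalent form. */
    extern int  cheapest_form(const struct tnode *t);
    extern int  cheapest_form_within_span(const struct tnode *t);
    extern bool spans_ok(const struct tnode *t);
    extern void install(struct tnode *t, int form);

    void operand_reduction(struct tnode *nodes)
    {
        /* MINIMIZE: one pass; contract every reducible operand to
           its shortest legal opcode/addressing-mode combination. */
        for (struct tnode *t = nodes; t != NULL; t = t->next)
            install(t, cheapest_form(t));

        /* LENGTHEN: expanding one operand can push another operand's
           target out of range, so iterate to a fixed point. */
        bool changed = true;
        while (changed) {
            changed = false;
            for (struct tnode *t = nodes; t != NULL; t = t->next)
                if (!spans_ok(t)) {
                    install(t, cheapest_form_within_span(t));
                    changed = true;
                }
        }
    }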
The operand reduction algorithm is implemented in a largely machine-independent
fashion using static data structures to describe the necessary attributes of the instruction
set and addressing modes of the target architecture. The details of these static data
structures and the associated algorithms outlined above are given in Chapter 5.
2.8. Code Relocation
The final phase of the MCO, code relocation, installs the changes made during
earlier code improvements in the bit patterns of each instruction and data area and
produces an output task file.
First, a single pass is made over the code to install new bit patterns in instruction
opcodes and operands which were subject to operand reduction. The lengths of
instructions are correctly maintained by MINIMIZE and LENGTHEN, so no re-allocation
of buffers to hold instruction bytes is needed during code relocation.
Next, the relocation information in the input task file is re-scanned to find any
relocation directives referring to relocatable addresses in the data nodes. We then
modify the pointer value in the referenced data node to contain the FADDR of the node
whose IADDR was equal to the input pointer value. If the input pointer referenced an
instruction, we make sure that it points to the start of the instruction. An input pointer to
the middle of a data area can safely be translated since data nodes are never contracted or
expanded. We simply add the same offset to the FADDR address that the original
pointer was offset from the IADDR of the target data node.
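The translation itself is one line of address arithmetic; a sketch, reusing the hypothetical find_node() from Section 2.3:

    /* Rebase an input pointer value: keep its offset within the
       target node, but move from the node's input address (IADDR)
       to its final address (FADDR). */
    uint32_t translate(struct bnode *blocks, uint32_t old_ptr)
    {
        struct tnode *t = find_node(blocks, old_ptr);
        return t->faddr + (old_ptr - t->iaddr);
    }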
Finally, a file header for the output task file is written, followed by the contents of
each text and data node. To satisfy the requirements of UNIX debuggers, a copy of the
input symbol table modified to reflect the changes in the machine address for each
symbol is also output.
3. Code Elimination
The code elimination phase of the MCO removes subprograms which can never be
invoked. It uses an augmented version of a simple single-pass code elimination algorithm
that employs a good heuristic to test which sections of the flow graph have circularities
but are not connected to the program entry point. In this way, entire subprograms,
especially ones with loops, can be eliminated in a single pass.
This chapter begins by describing the limitations to which the input program must
be subject in order for this technique to be applicable. I then give a classification of
existing code elimination techniques and present my algorithm in light of these.
3.1. Restrictions on the Input Program
In general, the problem of code elimination on machine code is complicated by two
considerations:
First, since we are not improving code from a given high-level source language, we
cannot rely on any rules of program structure (e.g. a task image compiled from PASCAL
source would never have a jump into the middle of a subprogram). Rather, we must
accept any valid machine code generated from any high-level language.
This problem is handled by the MCO by using the relocation information to set the
REF field of all instructions whose IADDR is referenced by some operand. Hence, each
instruction is treated separately and no assumptions regarding program structure need be
made.
The second problem arises since arbitrary machine code can appear. For example, it
is possible to compute a jump address from non-relocatable operands without having the
computed address named in the relocation information.
The problem of identifying these situations is undecidable, since the expression that
computes a referenced address may be arbitrarily complex and take arbitrary inputs.
Consider the following pseudo-machine code:
load addr(X), reg1
add  17, reg1
jump *reg1

In this scheme, the node containing X will have its REF field set since the load
instruction has a relocatable reference to it. However, the node at X + 17 will not be
marked as referenced, and could be erroneously subject to code elimination.
Hence, I restrict the target of all (direct and indirect) control transfers to conform
to the following definition: A branch address is simple if it is identical to the initial
address of some operand or data area specified as relocatable by the relocation
information.
This requires that the compiler generate only simple branch addresses in order for
the REF field of nodes to be accurate.
Note that the definition of a simple branch target does not rule out constructs such
as:
- Jumping to an address which was extracted from an array. Such code is typically
generated for the BASIC ON-GOTO, Fortran Computed-GOTO, and C switch
statements. Since each entry in the array referring to an address is specified as a
relocatable data area, all possible targets of such high-level statements will be
marked as referenced.
- Pointers to code and procedure parameters. Such code appears in C procedure
pointers and Fortran ASSIGN statements. The pointer values are generally loaded
from a data area, as above, or by some code such as

load addr(X), reg

in which case the first operand of the load will be marked as relocatable.
- Interrupts and service routines. Although these routines are called asynchronously
in response to some event, their address appears somewhere in the task file.
Typically, some interrupt vector or table needs to be initialized when the program
begins. This is done either by installing the address of the routine in a table using
executable code (the address would then appear as a relocatable operand of an
instruction), by passing the address of the routine to a system function (again, the
parameter passing mechanism would contain the relocatable operand), or by
initializing the table directly in the data segment (a relocatable data item would
appear in the data segment).
Hence, I do not feel that the requirement for simple branch operands is a practical
restriction on compilers for most high-level languages. In fact, the MCO has been used
with production compilers for full ANSI COBOL ([Phil 85co]), FORTRAN 77
([Phil 85ft]), C ([Phil 85c]), and two versions of BASIC ([Phil 84cb], [Phil 85mb]).
None of the code generators or any of the library code for the language primitives had
to be modified to accommodate this restriction.
3.2. Current Code Elimination Techniques
I define several methods for performing code elimination, with increasing degrees
of effectiveness:
Unlabeled code elimination removes code which follows an unconditional branch and
is not labeled. This can be done using a single forward pass over the code.
Unreferenced code elimination eliminates code following an unconditional branch
which is either unlabeled or is prefaced with a label which is not referred to. After
reference counts are tabulated on labels, unreferenced code is eliminated by a series of
converging forward passes over the code.
Unreachable code elimination eliminates code to which there is no flow path from the
entry point of the program. It is capable of eliminating, for example, mutually recursive
subprograms whereas unlabeled and unreferenced code elimination are not. Typically,
this technique is implemented by building a flow graph for the program and removing
disconnected subgraphs which do not contain the entry point.
3.3. Subprogram Elimination
For purposes of the MCO, unlabeled code elimination is not effective since every
instruction is potentially labeled.
Unreachable code elimination, although the most aggressive technique, is also not
applicable. In the presence of an indirect jump through a quantity in a register, the MCO
would need to trace all possible values in the register to determine the possible successors
to a flow graph node. Failing this, all nodes would need to be labeled as successors,
rendering the entire flow graph connected. Such jumps often arise when generating code
for high level constructs such as C switch statements and COBOL PERFORM statements, so this
technique would yield poor results.
Due to the presence of reliable reference counts, unreferenced code elimination is
most suitable for the MCO. However, it fails to fully eliminate a subprogram that has a
loop in it. Consider the following code:
        return
SUB1:   instr
        instr
LOOP:   instr
        instr
        jump LOOP
        instr
        return
SUB2:   instr
where SUB1 is an unreferenced instruction, LOOP is referenced due to the loop, and
SUB2 is the beginning of the next subprogram which is referenced.
Basic unreferenced code elimination removes code from SUB1 up to, but not
including, LOOP (denoted SUB1~LOOP). However, LOOP up to SUB2 (LOOP~SUB2) is not eliminated since LOOP is referenced.
In order to eliminate such routines, I implement an augmented version of
unreferenced code elimination called subprogram elimination. This eliminates likely
sections of code on a trial basis and checks the resulting program for consistency. When
a candidate for unreferenced code elimination is detected, we perform the following
algorithm:
1. Let SUB1 be an unreferenced instruction following an unconditional branch, return,
etc. Let LOOP be the first referenced instruction following SUB1 and let SUB2 be
the first instruction with its JSR field set at or after LOOP. Perform basic
unreferenced code elimination and remove SUB1~LOOP.
2. Decrement the reference count of any instruction which is the target of an operand
of an instruction in LOOP~SUB2.
3. Scan LOOP~SUB2 and determine whether any instructions are still referenced.
4. If no instructions in LOOP~SUB2 are referenced, then any instructions in
LOOP~SUB2 which were referenced before step 2 were the targets of operands
within LOOP~SUB2. The instructions in LOOP~SUB2 can be eliminated and
further code elimination continues at SUB2.
5. Otherwise, some operand outside the range LOOP~SUB2 has the referenced
instruction found in step 3 as a target. The instructions in LOOP~SUB2 cannot be
eliminated and we must repair the damage done to the reference counts in step 2.
Further code elimination proceeds from the jump to LOOP.
Whenever an instruction is removed during the subprogram elimination algorithm,
we decrement the reference count of any targets of operands of the instruction. (This is
handled differently in steps 2, 4, and 5 above, but the net effect is the same). If the
resulting reference count goes to zero and the target is not currently being eliminated
(precedes LOOP in the above algorithm), another opportunity for code elimination has
occurred. However, since subprogram elimination works by forward passes only, this
opportunity will not be caught on this forward pass. Hence, we repeat subprogram
elimination until no such situations arise.
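The driver for this process can be pictured as follows. The sketch is schematic only; candidate() and try_eliminate() are invented names for the test and for steps 1 through 5 above.

    #include <stdbool.h>

    /* Is t an unreferenced instruction following an unconditional
       branch or return?  (Hypothetical predicate.) */
    extern bool candidate(const struct tnode *t);

    /* Perform steps 1-5 on the candidate at t; set *removed if any
       code was eliminated, and return the node where the forward
       pass should continue (SUB2 on success, else past the jump). */
    extern struct tnode *try_eliminate(struct tnode *t, bool *removed);

    void subprogram_elimination(struct tnode *nodes)
    {
        bool removed;
        do {                              /* repeat whole forward pass */
            removed = false;
            for (struct tnode *t = nodes; t != NULL; )
                t = candidate(t) ? try_eliminate(t, &removed) : t->next;
        } while (removed);                /* until nothing is removed  */
    }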
4. Code Distribution
The code distribution phase of the MCO re-orders sections of a program to improve
the effectiveness of operand reduction.
I divide the task of re-ordering a program into two problems:
Problem 4.1 (Data Distribution)
Partition the data segment into independent data objects, each of which can be
moved without regard for the load location of other independent data objects. Then reallocate
these objects in slots in the code segment which are not on an execution path (e.g.
following a return or unconditional branch) in order to place them closer to operands
which reference them. ¤
Problem 4.2 (Code Distribution)
Re-order the code blocks created during text blocking to reduce the distance
between inter-block branches and their targets. ¤
This division corresponds to improving the effectiveness of operand reduction as it
deals with two distinct types of operands: those which reference targets in the data
segment and those which reference other code blocks.
In this chapter, I examine both these problems as implemented in [Rober 79] and
the MCO, respectively.
4.1. Data Distribution
Given the class of code improvements to which the MCO is addressed (those which
can only be done at or after the link phase), an algorithm for data distribution would be
appropriate in the MCO. However, a number of problems prevent it from being
implemented in this application.
The first problem is the lack of a reliable way of partitioning the data segment into
independent data objects that preserves the semantics of the input program. The input
data segment is seen as a single block of data. No information is provided regarding what
areas of the data segment must remain in a fixed position relative to other areas. For
example, it is unclear where one array ends and another begins.
To obtain a complete partitioning, information as to the layout of the data segment
would have to be provided by the compiler for each module in the task file. This is
feasible, but is outside the current design of the MCO.
The second problem with data distribution concerns the ramifications of placing
modifiable program data in the same area in memory as program code or constant data.
All code currently generated by the compilers with which the MCO operates is reentrant,
thus allowing the text segment to be shared in a multi-task environment. Data distribution
renders the code non-reentrant since data would be interleaved in the text segment. Thus
it could not be used where text is shared or on a system where the text segment is
protected by hardware support.
Another problem with data distribution concerns the expectation, on the part of the
programmer, that the data segment will be laid out in the order that static data is declared
in the source code. Although the layout of static data is usually unspecified in language
standards, compilers have had no reason to allocate data in other than the input order.
As a result, the folklore for certain languages dictates that certain programming
constructs which rely on the order of static data are acceptable.
For example, a well-known technique in FORTRAN for building a zero-based array
of integers is to declare as follows:
INTEGER DUMMY
INTEGER A(99)
where DUMMY becomes an alias for A(0). Although illegal, this usage is not detectable
in general, is not flagged as an error even in specific cases where it is detectable (e.g. a
constant subscript), and actually works on all FORTRAN compilers known to the author
(FTN [Contr 75], FORTRAN System/370 [IBM 74], and Philon FAST/FORTRAN [Phil 85ft]).
Finally, there are the issues of actually performing data distribution in reasonable
time and space. The problem of data distribution was first examined in [Rober 79] from
a theoretical viewpoint. He showed that the problem of finding an optimal solution to
Problem 4.1 is NP-Complete [Garey 79]. Furthermore, the problem of finding a solution
which is within a (non-trivial) constant factor of the optimal solution is also NP-Complete.
Thus, the best we could hope for is a well-tuned heuristic which places variables
well. In the UNIX environment, where data resides at the end of the text segment,
even a simple heuristic could improve the code substantially. For example, one might go
through the independent data objects in order, placing each in the slot which maximizes
the number of references to it which can be made short at the time. In the absence of the
problems already mentioned (e.g. on a single-user dedicated machine with no memory
protection), such a heuristic might be worthwhile.
4.2. Complexity of Code Distribution
The problem of code distribution as stated at the beginning of this Chapter is
characterized in graph theoretic terms as follows:
A directed graph G = (V, E) consists of a finite set V of vertices and a finite
collection E of edges over V × V, where each edge connects a pair of distinct vertices in
V. The collection of edges of a graph may contain duplicates (parallel edges), but the
set of vertices may not. Edges are denoted μ→ν, where μ, ν ∈ V. If μ→ν ∈ E, then
vertex μ is adjacent to ν. The set of all vertices adjacent to vertex ν is denoted adj(ν).
A weighted graph W = (V, E, Wv, We) is a directed graph with weight functions
Wv: V → N and We: E → N × N × N × N, where N = {0, 1, …}.
In characterizing code distribution, we map the code blocks of the text segment onto
the vertices of a weighted graph and use the edges to represent the inter-block references.
We allow parallel edges since a block may reference another block many times, but we
allow no self-loops (edges must connect distinct vertices) since intra-block references are
not considered.
The weight function on vertices, Wv, gives the size, in bytes, of the code block as
read in from the text segment. The weight function on edges is formulated from the
position of the source and destination of the inter-block reference within their respective
blocks. In Figure 4.1, the source of the reference is offset s bytes into code block α, with
s′ bytes remaining to the end of α, and the target is at position t in code block γ, with t′
bytes remaining. Thus, We(α→γ) = (s, s′, t, t′).
Figure 4.1 Inter-block Reference
The goal of code distribution is to find a permutation of the vertices, ψ:
V↔{1, 2, …, |V|}, which keeps the number of edges requiring a long addressing mode to a
minimum. Given a permutation, ψ, for each edge μ→ν, we define:
span(μ→ν) = endpoints(μ→ν) + interposed(μ→ν)

endpoints(μ→ν) = We(μ→ν)(2) + We(μ→ν)(3)   if ψ(μ) < ψ(ν)
endpoints(μ→ν) = We(μ→ν)(1) + We(μ→ν)(4)   otherwise

interposed(μ→ν) = Σ Wv(ω) over all ω with min(ψ(μ), ψ(ν)) < ψ(ω) < max(ψ(μ), ψ(ν))

The span of an inter-block reference must account for the location of the source and
destination of the reference in their respective blocks (the endpoints() function) as well as
the size of all blocks interposed between them in the ordering (the interposed() function).
Given a weighted graph W = (V, E, Wv, We), a permutation ψ:
V↔{1, 2, …, |V|}, and a
threshold T, we define the threshold cost function:
TCF(W, ψ, T) = |{μ→ν ∈ E : span(μ→ν) ≥ T}|
The problem of code distribution is analogous to the problem MINLTA:
Problem 4.3 (MINLTA - Minimum Linear Threshold Arrangement)
Given a weighted graph W = (V, E, Wv, We), we wish to find a permutation ψ:
V↔{1, 2, …, |V|} which orders the vertices such that the threshold cost function, TCF(W, ψ, T), for a given threshold T, is minimized. ¤
MINLTA relates to code distribution as follows: We wish to order the code blocks
in the text segment to minimize the number of inter-block references whose span exceeds
a certain threshold T.
Figure 4.2 Computing the Span of a Reference
For example, in Figure 4.2, if blocks α and γ are ordered with block β between them,
then the span of α→γ is the sum of s′, r, and t. In MINLTA, the s′ and t are
incorporated in endpoints(α→γ) while r is represented in interposed(α→γ).
I now show that the decision version of MINLTA is NP-Complete [Garey 79] by
(1) showing that the problem can be solved non-deterministically in polynomial time
and (2) polynomially reducing instances of a related problem, MINLA, to instances of
MINLTA such that MINLTA yields the same answer as MINLA would have.
Problem 4.4 (Decision Version of MINLTA)
Given a weighted graph W = (V, E, Wv, We), a threshold T, and an integer k, is
there a permutation ψ: V↔{1, 2, …, |V|} which orders the vertices such that
TCF(W, ψ, T) ≤ k? ¤
Problem 4.5 (MINLA - Minimum Linear Arrangement)
Given a directed graph G = (V, E) and a positive integer k, is there a permutation
ψ: V↔{1, 2, …, |V|} which orders the vertices such that the additive cost function
ACF(G, ψ) ≤ k, where:

ACF(G, ψ) = Σ |ψ(μ) – ψ(ν)| over all μ→ν ∈ E

This cost function is identical to the span(μ→ν) function given for MINLTA with
Wv(ν) = 1 and We(μ→ν) = (0, 0, 0, 0) for all vertices and edges, respectively. ¤
A simpler version of Problem 4.5, in which the graph was undirected, was shown to
be NP-Complete in [Even 75] and [Even 79] by a two-stage reduction from the maximum
cut set problem on graphs.
Theorem 4.1
The decision version of MINLTA is NP-Complete.
Proof: First, we assert that MINLTA can be solved non-deterministically in
polynomial time. This is done by non-deterministically choosing the appropriate
permutation, Π, from the |V|! permutations of the vertices and evaluating
TCF(W, Π, T).
Next, we reduce instances of MINLA to instances of MINLTA: Given an instance
of MINLA consisting of G = (V′, E′) and an integer k, we define an instance of Problem
4.4 as follows: The vertices V of W are the same as the vertices V′ of G. For each edge
e ∈ E′, E contains a bundle of |V| – 1 edges e₁, e₂, …, e|V|–1, where the weight of edge
eᵢ is We(eᵢ) = (i, i, i, i). The weight of every vertex is Wv(v) = 2 (see Figure 4.3).
Figure 4.3 Mapping MINLA onto MINLTA
I propose that any ordering function, Π, on V will yield the same value for
TCF(W, Π, 2|V|–2) under MINLTA as ACF(G, Π) under MINLA.
Case 1
Consider two vertices, α and β, which are placed sequentially by Π. Under
MINLA, an edge α→β between them contributes one to ACF(G, Π). Under
MINLTA, exactly one of the edges between α and β from among those generated
from α→β yields a span ≥ 2|V|–2 (the edge with weight (|V|–1, |V|–1, |V|–1, |V|–1)), so
one is added to TCF(W, Π, 2|V|–2).
Case 2
Under MINLA, if there are n vertices interposed between the endpoints of α→β,
then the edge α→β adds n+1 to ACF(G, Π). Likewise, under MINLTA, exactly
n+1 edges from the bundle of edges generated from α→β would be included in
TCF(W, Π, 2|V|–2). Those are the edges with weights (|V|–n–1, |V|–n–1, |V|–n–1,
|V|–n–1) through (|V|–1, |V|–1, |V|–1, |V|–1). Hence, an ordering, Π, of V′ yields
ACF(G, Π) ≤ k if and only if that ordering of V yields TCF(W, Π, 2|V|–2) ≤ k.
Q.E.D.
In the light of these results and the expectation that the number of code blocks in a
text segment is on the order of the number of subprograms, an algorithm for code
distribution which yields an optimal solution is not likely to run in polynomial or
reasonable time on a deterministic processor. However, it should be noted that there are
some differences between MINLTA and the problem of code distribution:
- Since current architectures often have location-relative modes using byte, word, and
long offsets, the real-world code distribution problem could be required to deal with
several thresholds rather than just one.
- In code distribution, the size of a code block is not fixed, but depends on the
placement function itself. MINLTA simplifies this by assigning a span value for an
edge which is determined solely from the initial conditions of the problem, while the
operand reduction algorithm must be applied for each placement function to
determine the span of an edge.
- In MINLTA, we allow the weights on edges to be arbitrary positive numbers, while
in practice they would be limited by the weights of their corresponding vertices. It
is not known whether this more restrictive version of MINLTA is NP-Complete.
4.3. Heuristics for Code Distribution
Since the possibilities for an efficient optimal algorithm for code distribution are
dim, the MCO applies a heuristic to order the code blocks.
The basic approach is to build a tuple of code blocks starting at the end nearest the
data segment. At each step, we choose the best block from among those yet to be
placed, according to a heuristic which evaluates unplaced blocks. This block is added to
the start of the tuple. This basic scheme is summarized in the algorithm:
proc basic_code_distribution();
    unplaced := {set of blocks};
    set_of_spans := {spans for addressing modes of this architecture};
    placed := [];
    while unplaced ≠ {} do
        bestworth := -1;
        (∀ bl ∈ unplaced)
            w := 0;
            (∀ span ∈ set_of_spans)
                w +:= worth(bl, unplaced, placed, span);
            end ∀;
            if w > bestworth then
                bestworth := w;
                bestbl := bl;
            end if;
        end ∀;
        placed := [bestbl] + placed;
        unplaced less:= bestbl;
    end while;
end proc;
Of course, the effectiveness of this algorithm depends on the worth(bl, unplaced,
placed, span) function. The MCO currently uses two heuristic functions in combination:
- σ0: This function evaluates references in bl to the data segment. The following multiplicative factors constitute σ0:
  - σ10: The number of references in bl to the data segment.
  - σ20: The fraction of the data segment that the average reference (i.e. one at the center of bl) would reach under the given span if bl were placed at the head of the placed list.
  - σ30: The number of bytes saved by installing the addressing mode associated with span over the addressing mode with no span restrictions.
  - σ40: The inverse of the size of bl (larger blocks are penalized). This may be thought of in combination with σ10 to produce a single factor which denotes the density of data references per byte of code.
- σ1: This function evaluates references between bl and blocks already placed in the list. The following factors constitute σ1:
  - σ11: The number of code references in bl which would reach their targets under the given span in the placed list if bl were placed at the head of the list.
  - σ21: The number of code references in the list which would reach the average target in bl under the given span.
These functions are designed to choose heuristically what would seem to be the best block from among the remaining unplaced blocks when running the inner loop of basic_code_distribution(). The σ0 function accounts for expected gains from operand reduction due to references to data. Likewise, σ1 predicts gains from references to the code already in the list. Within each of these functions, σ10 through σ40, and σ11 and σ21, can be balanced to give the best results.
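As a concrete illustration, a per-span worth computation along these lines can be sketched in C as follows. The struct fields and helper parameters are hypothetical stand-ins for the placement state the production routine consults; only the shape of the computation is intended to match the text above.

    /* A minimal sketch of the per-span worth heuristic; field and
       parameter names are hypothetical, not from the MCO sources. */
    struct block {
        long size;      /* bytes of code in the block             */
        long dreloc;    /* operands in the block referencing data */
    };

    /* sigma0: expected operand-reduction gain from data references.
       reach_frac and bytes_saved stand in for the sigma20 and sigma30
       factors, which depend on the span and the code placed so far. */
    static double sigma0(const struct block *bl, double reach_frac,
                         double bytes_saved)
    {
        return (double)bl->dreloc   /* sigma10 */
             * reach_frac           /* sigma20 */
             * bytes_saved          /* sigma30 */
             / (double)bl->size;    /* sigma40: penalize large blocks */
    }

    /* sigma1: code references satisfied between the block and the
       placed list under this span (the sigma11 and sigma21 counts). */
    static double sigma1(long refs_from_block, long refs_to_block)
    {
        return (double)(refs_from_block + refs_to_block);
    }

    /* Per-span worth; basic_code_distribution() sums this over all
       spans. How sigma0 and sigma1 are balanced is a tuning choice. */
    static double worth(const struct block *bl, double reach_frac,
                        double bytes_saved, long refs_from, long refs_to)
    {
        return sigma0(bl, reach_frac, bytes_saved)
             + sigma1(refs_from, refs_to);
    }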
These functions are implemented efficiently by attaching the following information
to each block node, bl:
REF
For each span, the number of references to text nodes in bl from blocks in the
placed list which reach bl under the span.
RELOC
The number of relocatable operands in text nodes in bl which reach nodes in blocks in the placed list under each span.
DRELOC
The number of relocatable operands in text nodes in bl which reach nodes in the
data segment.
These fields are maintained by the following expanded algorithm:
proc code_distribution();
    unplaced := {set of blocks};
    set_of_spans := {spans for addressing modes of this architecture};
    $ Set the DRELOC field of each block.
    set_dreloc(unplaced);
    placed := [];
    plsize := 0;
    while unplaced ≠ {} do
        bestworth := -1;
        (∀ bl ∈ unplaced)
            $ Modify REF and RELOC to account for the most recent
            $ block added to the list and references which are
            $ now out of range.
            update_ties(bl, placed);
            w := 0;
            (∀ span ∈ set_of_spans)
                w +:= worth(bl, unplaced, placed, span, plsize);
            end ∀;
            if w > bestworth then
                bestworth := w;
                bestbl := bl;
            end if;
        end ∀;
        placed := [bestbl] + placed;
        plsize +:= size(bestbl);
        unplaced less:= bestbl;
    end while;
end proc;
The MCO allows any combination of σ0 and σ1 to be used during a run. The relative effectiveness of these heuristics is reported in Chapter 7.
5. Operand Reduction
As described in
Section §2.7, operand reduction installs, in each operand, the least
expensive addressing mode which satisfies all constraints imposed by the architecture.
This Chapter begins by describing the data structures which represent the attributes of the
target architecture needed for operand reduction. This is followed by a discussion and
analysis of the algorithms which implement the two phases of operand reduction.
5.1. Static Data Structures
At the heart of the MINIMIZE and LENGTHEN phases, the following problem
arises:
Problem 5.1. (Build Translation Class)
Given an instruction, i, and an operand of that instruction, op, form the set of
(opcode, addressing mode) pairs which can be used in place of the existing opcode of i and addressing mode of op (OPC(i), MODE(op)). This set is called the
TRANSLATE_CLASS(i, op). ¤
The remainder of this section describes how the TRANSLATE_CLASS is built.
First I describe what types of restrictions the target architecture places on membership in this set. Then I give a set-theoretic description of how the TRANSLATE_CLASS set is formed. Finally, I discuss a space- and speed-efficient implementation of the set formers.
The set-formers and algorithms in this Chapter are presented in the set-theoretic
language SETL ([Dewar 79b]) to elucidate the concepts involved. Lower level versions,
coded in C ([Kern 78]), may be found in Appendices B and C.
For an opcode/addressing mode pair (opc, am) to belong to the set
TRANSLATE_CLASS(i, op), it must satisfy the following restrictions:
Addressing Restrictions: Under the rules of the target architecture, am must be a legal addressing mode for an operand of opcode opc in operand position OPNUM(op). Furthermore, the new opcode must accept the same number of operands as the existing opcode and, for each operand, op′, of i other than the one being considered, MODE(op′) must also be legal for the new opcode in that operand position.
Semantic Restrictions: Each addressing mode on the given architecture performs a
function such as yielding a value of some type or operating on a register. The function of
am must be equivalent to that of MODE(op). Likewise, the function of the new opcode,
opc, must be equivalent to the existing opcode, OPC(i).
Span Restrictions: If am is a location-relative mode, then the effective address which
op yields must be within the span of am.
To form a TRANSLATE_CLASS which complies with the addressing and semantic
restrictions, we begin with the following sets defined across all opcodes, opc, and
addressing modes, am:
ADDRESSING_CLASS(opc, opnum)
For each operand position, opnum, corresponding to an operand of opc, the set of
addressing modes which are legal on the target architecture.
OPERAND_EQUIV_CLASS(am)
The set of addressing modes which perform an equivalent function to am.
OPCODE_EQUIV_CLASS(opc)
The set of opcodes which perform an operation which is equivalent to opc.
For a given instruction, i, and operand, op, the TRANSLATE_CLASS(i, op) is
formed, as needed, by the following set constructors:
The first, OPERAND_TRANSLATE_CLASS, is, for the given instruction and operand of that instruction, the set of addressing modes satisfying the addressing restrictions of the machine for that instruction and operand and the semantic restrictions imposed by the existing addressing mode.
The second, OPCODE_TRANSLATE_CLASS, is the set of opcodes which are equivalent to the current opcode and which allow at least one addressing mode in the OPERAND_TRANSLATE_CLASS for the given opcode. Also, the current addressing modes in the operands we are not scrutinizing must be allowed in the corresponding operand positions of each opcode in this set.
Finally, we combine the intermediate sets to form the TRANSLATE_CLASS() as
defined above, satisfying addressing and semantic restrictions, but not span restrictions.
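In the set-theoretic notation used in this chapter, the three set formers can be written, in outline, as follows. This is a sketch consistent with the FORM_TC algorithm below, not the exact production formulation (which appears, in C, in Appendix C):

    OPERAND_TRANSLATE_CLASS(op, opc) :=
        ADDRESSING_CLASS(opc, OPNUM(op)) ∩ OPERAND_EQUIV_CLASS(MODE(op))

    OPCODE_TRANSLATE_CLASS(i, op) :=
        { opc ∈ OPCODE_EQUIV_CLASS(OPC(i)) |
            OPERAND_TRANSLATE_CLASS(op, opc) ≠ {} and
            (∀ op′ ∈ OP(i) | op′ ≠ op)
                MODE(op′) ∈ ADDRESSING_CLASS(opc, OPNUM(op′)) }

    TRANSLATE_CLASS(i, op) :=
        { [opc, am] : opc ∈ OPCODE_TRANSLATE_CLASS(i, op),
                      am ∈ OPERAND_TRANSLATE_CLASS(op, opc) }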
In practice, we do not form the OPERAND_TRANSLATE_CLASS and
OPCODE_TRANSLATE_CLASS sets, but construct TRANSLATE_CLASS directly.
The following algorithm presents a high level view of how FORM_TC is implemented:
proc FORM_TC(i, op);
    TRANSLATE_CLASS := {};
    (∀ opc ∈ OPCODE_EQUIV_CLASS(OPC(i)))
        if opc ≠ OPC(i) then
            $ Check that the operands of opc other than op accept the current
            $ addressing modes in i.
            (∀ opnum ∈ [1..NOPER(opc)] | opnum ≠ OPNUM(op))
                if MODE(OP(i)(opnum)) ∉ ADDRESSING_CLASS(opc, opnum) then
                    continue opc;
                end if;
            end ∀;
        end if;
        (∀ am ∈ ADDRESSING_CLASS(opc, OPNUM(op)))
            $ Check that this new mode is semantically equivalent to the
            $ existing mode.
            if am ∈ OPERAND_EQUIV_CLASS(MODE(op)) then
                TRANSLATE_CLASS with:= [opc, am];
            end if;
        end ∀;
    end ∀;
    return TRANSLATE_CLASS;
end proc;
I now present the data structures and algorithms for FORM_TC in a lower level implementation. The data structures were designed to conserve space while remaining accessible with reasonable speed. The version of FORM_TC as coded in C is presented in Appendix C.
In order to represent the ADDRESSING_CLASS, OPERAND_EQUIV_CLASS,
and OPCODE_EQUIV_CLASS sets, a set of static data structures are built for the given
architecture. The static data structures consist of a pair of tables, one for addressing
modes and one for opcodes, and various arrays as described below.
First, we examine the addressing mode table. This is an array of addressing mode
descriptors, one for each distinct addressing mode on the target architecture. Two
addressing modes in two different instructions are considered distinct if they are
represented differently in the two instructions or are not semantically equivalent. In
particular, modes which are represented using different bit patterns or the same pattern
in different locations in instructions must be distinct.
Consider the Data Register Direct addressing mode on the 68000 mov instruction. As a source operand, this mode is the same for mov as for the first operand of a cmp instruction. However, a distinct addressing mode must be used for a destination operand of mov which uses Data Register Direct, since the bits which specify the mode and register lie in a different location in the instruction.
Each addressing mode descriptor contains the following fields which are relevant to
this discussion:
SIZE, SPEED
Values used to evaluate the cost of using an addressing mode. These are relative
values used for purposes of evaluating cost functions and are related to the clock
cycles and size in bytes above a basic opcode for the use of the addressing mode.
OEC
Pointer to an array of nodes, each containing the code of an addressing mode in the same OPERAND_EQUIV_CLASS as this mode. All modes in the same OEC have the same effect on the relevant aspects of the machine state when evaluated.
SPAN_OK
A pointer to a predicate which determines, given an instruction and an operand of
the instruction, whether the addressing mode would satisfy span restrictions if
installed.
INSTALL
A pointer to a routine to install the addressing mode in a given instruction and
operand. This routine is invoked during code relocation.
The opcode table contains a single opcode descriptor for each distinct opcode on the
target architecture. As with addressing modes, a single operator is sometimes broken
down into several opcodes for purposes of operand reduction even though the bit patterns
of the instructions may be identical. This occurs in multi-operand operators since the addressing modes in ADDRESSING_CLASS(opc, opnum) must all be valid regardless of the addressing modes employed in other operands.
Operators such as the 68000 sub instruction must be broken down into two opcodes:
a sub_d opcode which allows a large class (source class) of addressing modes as a first
operand and a data register for a second operand and a sub_m opcode whose first
operand is a data register and whose second operand can be represented using another set
of addressing modes (memory alterable class). To implement these using a single opcode
would imply that the sub instruction allows any source class mode and any memory
alterable mode in its two operands, which is not the case.
An opcode descriptor contains the following relevant fields:
NOPER
Number of operands accepted by this instruction.
SPEED
Used in evaluating opcode/addressing mode pairs. This is the speed relative to other opcodes in the same opcode equivalence class.
OPEC
An opcode which is in the same OPCODE_EQUIV_CLASS as this opcode. The
OPEC fields of all opcodes in a non-singleton OPCODE_EQUIV_CLASS set form
a circular linked list using this field.
CLASS
An array of pointers, one for each operand of the opcode. Each pointer names an
array of nodes containing the codes of addressing modes in the
ADDRESSING_CLASS of this opcode and operand.
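For concreteness, the shape of these two descriptors can be sketched in C roughly as follows. The field names follow the text, but the declarations are illustrative, not copies of the production versions in Appendix B.

    /* Illustrative C shapes for the two descriptor tables; the
       production declarations appear in Appendix B. */
    struct instr;                            /* a text node (see Appendix A) */

    typedef int  (*span_pred)(struct instr *i, int opnum);
    typedef void (*installer)(struct instr *i, int opnum);

    struct mode_desc {                       /* one distinct addressing mode */
        short      size, speed;              /* SIZE, SPEED: relative costs  */
        short     *oec;                      /* OEC: codes of the modes in   */
                                             /* the same OPERAND_EQUIV_CLASS */
        span_pred  span_ok;                  /* SPAN_OK: span predicate      */
        installer  install;                  /* INSTALL: rewrite the operand */
    };

    struct opcode_desc {                     /* one distinct opcode          */
        short   noper;                       /* NOPER: number of operands    */
        short   speed;                       /* SPEED: relative speed        */
        short   opec;                        /* OPEC: next opcode on the     */
                                             /* OPCODE_EQUIV_CLASS ring      */
        short **class;                       /* CLASS: per-operand arrays of */
                                             /* ADDRESSING_CLASS mode codes  */
    };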
The structure of these tables is summarized in Figure 5.1. I show an example of the static data structures for two instructions on the Motorola 68000: jmp and bra. These
instructions are semantically equivalent, so their OPEC fields form a ring. However, the
sets of addressing modes allowed for their respective operands are disjoint. Each of the
addressing modes is described by an addressing mode descriptor. Finally, the semantic
meanings of addressing modes are related in the operand equivalence classes. Through
this data structure, a jmp using the absolute long addressing mode can be converted to a
bra using the disp8 mode.
Figure 5.1 Static Data Structures for 68000 jmp and bra
5.2. Minimize and Lengthen
The purpose of operand reduction is to find an optimal solution to the following
problem:
Problem 5.2. (Operand Reduction)
Install the least expensive addressing mode in each operand of each instruction so
that all addressing, semantic, and span restrictions are satisfied.¤
In the last section, I presented a general algorithm to find all opcode/addressing
mode substitutes for a given instruction and operand that satisfy addressing and semantic
constraints. The remaining problem of operand reduction is to satisfy span constraints.
This problem is examined in [Rich 71] and [Fried 76].
In [Szym 78], two algorithms
are presented which produce optimal solutions. I will briefly describe the
requirements and complexity of each before presenting my solution.
The first, which I call Algorithm Sz1, builds a graph to represent the program. Each operand of each instruction which can employ a location-relative mode is represented by a node in the graph. A directed arc A→B is installed if the instruction for B lies between A and a target which an operand of A references. In each node, information similar to our own text node is maintained. In addition, for each operand, the distance from the instruction to the target of the operand (the operand's range) is maintained.
All operands represented by nodes are initially assigned a minimum length
location-relative addressing mode. We then process nodes in the graph whose range
exceeds the span of the current addressing mode. A longer addressing mode with a
larger span is then installed and all predecessors of such nodes in the graph (i.e. nodes
whose range depends on the size of the expanded instruction) have their ranges increased
to accommodate the longer addressing mode. The node may then be removed from the
graph if a maximum-length addressing mode has been installed. The algorithm
terminates when no more nodes need to be expanded.
Algorithm Sz1 produces an optimal assignment of addressing modes using a graph with O(n) nodes and O(n²) arcs. [Szym 78] claims that the running time, with suitable low-level data structures, is at worst O(n) since each node must be visited at most once for each addressing mode.
In practice, Algorithm Sz1 is useful for the application described in [Szym 78]: jump or subprogram call operands on the Digital Equipment Corporation PDP-11 ([DEC 75]).
Under this instruction set, a single location-relative addressing mode whose span is
approximately ±256 bytes is available for such operands. This limits the out-degree of
nodes in the dependency graph to 255 for contrived pathological cases. In practice, the
average out-degree is 3.5 (across a large sample of application code) which allows Sz1 to
operate rapidly in practical cases.
However, each of the target architectures, in addition to the PDP-11 mode
described above, has location-relative modes with spans of approximately ±32,767 bytes. This allows the out-degree of nodes to be at most 16,381 (assuming a minimum of two
bytes per instruction) and an average of 896 in practice. These figures render Sz1
impractical for our use, especially since we wish to process not only branch operands, but
all relocatable operands.
Algorithm Sz2 is similar to Sz1, except that the arcs are not represented in the
graph. Instead of adjusting the range of predecessors in the graph, whenever an operand
is expanded, a brute-force scan of the instructions is made to find operands whose range
need adjustment. This reduces the space requirements to O(n), but the running time goes to O(n²).
Again, since the maximum span of an addressing mode is ±254 bytes on the PDP-11, only a small area of code needs to be scanned when an operand is expanded.
However, for this application, the re-scanning often requires a large portion of the
program, thus rendering the running time quadratic in practice.
My algorithm builds on Sz2 with the same worst-case space and time complexity,
but runs in linear time in practice. Rather than maintaining the range of an operand, the
range value is computed as necessary. This can be done since the TARGET field has
been set for all such operands during operand linking.
As with Sz1 and Sz2, the operand reduction algorithm begins with MINIMIZE, which performs a single pass over the code. For each instruction, i, and relocatable
operand, op, we change the opcode and addressing mode to the pair from
TRANSLATE_CLASS(i, op) which yields the shortest instruction:
proc MINIMIZE();
    (∀ b ∈ BLOCK_LIST)
        (∀ tx ∈ TEXT(b) | tx is a text node)
            (∀ op ∈ OP(tx))
                tc := FORM_TC(tx, op);
                bestcost := MAXCOST;
                (∀ [opc, am] ∈ tc)
                    c := cost(tx, op, opc, am);
                    if c < bestcost then
                        bestcost := c;
                        newpair := [opc, am];
                    end if;
                end ∀;
                if newpair ≠ [OPC(tx), MODE(op)] then
                    contract(tx, op, newpair);
                end if;
            end ∀;
        end ∀;
    end ∀;
end proc;
After MINIMIZE, the LENGTHEN phase installs larger addressing modes in operands using a series of passes over the code. The first step in each pass is to set the FADDR field of each text and data node to reflect its current load location based on the sizes of all instructions before it. This is the field we will later use to determine the ranges of operands.
We then process each relocatable operand of each text node. If a location-relative
addressing mode, am, is currently in use, the range of the operand is computed using the
FADDR field of the instruction and the FADDR field of the TARGET node of the
operand. The predicate SPAN_OK(am) is then evaluated for the range to determine if
the operand needs expansion. If so, we compute the TRANSLATE_CLASS(instruction,
operand). From this we choose the least-cost opcode/addressing mode pair for which
SPAN_OK(am), evaluated for the range, indicates that the new mode satisfies all span
restrictions.
This phase is summarized in the following algorithm:
proc LENGTHEN();
    change := true;
    while (change) do
        change := false;
        $ Set FADDR fields of all nodes.
        addr := IADDR(TEXT(BLOCK_LIST(1))(1));
        (∀ b ∈ BLOCK_LIST)
            (∀ tx ∈ TEXT(b))
                FADDR(tx) := addr;
                addr +:= NBYTES(tx);
            end ∀;
        end ∀;
        $ Expand operands as necessary.
        (∀ b ∈ BLOCK_LIST)
            (∀ tx ∈ TEXT(b) | tx is a text node)
                $ Get all relocatable operands (ones with TARGET set) which
                $ might need expansion.
                (∀ op ∈ OP(tx) | TARGET(op) ≠ Ω and loc_relative(MODE(op)))
                    range := FADDR(TARGET(op)) - FADDR(tx);
                    if SPAN_OK(MODE(op))(range) then
                        continue;
                    end if;
                    tc := FORM_TC(tx, op);
                    bestcost := MAXCOST;
                    (∀ [opc, am] ∈ tc)
                        c := cost(tx, op, opc, am);
                        if c < bestcost then
                            bestcost := c;
                            newpair := [opc, am];
                        end if;
                    end ∀;
                    if [OPC(tx), MODE(op)] ≠ newpair then
                        expand(tx, op, newpair);
                        change := true;
                    end if;
                end ∀;
            end ∀;
        end ∀;
    end while;
end proc;
This algorithm performs well in practice since range values are changed only at the start of each pass and are computed through the TARGET pointer rather than maintained explicitly in operand descriptors. The TARGET fields would generally require O(n²) time to compute but, during operand linking, we compute them efficiently using the blocked dynamic data structure (see Section §2.3). Likewise, O(n) passes could be made through the code during LENGTHEN, giving an O(n²) worst case. In practice, the algorithm converges in 2–5 iterations (see Section §7.2).
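As an illustration of the span test, a SPAN_OK predicate for a PC-relative mode with a signed 8-bit displacement (such as the 68000 bra with disp8) might look like the following sketch. The range argument mirrors the FADDR subtraction computed in LENGTHEN; the bias and the excluded zero encoding are 68000-specific details, and the helper itself is illustrative rather than MCO code.

    #include <stdint.h>

    /* Sketch: span predicate for a PC-relative mode with a signed
       8-bit displacement (e.g. the 68000 bra with disp8). 'range'
       is FADDR(TARGET(op)) - FADDR(tx) as computed in LENGTHEN. */
    static int span_ok_disp8(int32_t range)
    {
        int32_t disp = range - 2;  /* 68000 PC-relative displacements are
                                      measured from the word after the opcode */
        return disp >= -128 && disp <= 127
            && disp != 0;          /* a zero byte signals the word form */
    }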
5.3. Register Tracking
The design of operand reduction, as described thus far, falls short in one major
area: it utilizes index modes which use only the program counter, while many
architectures allow indexing off other registers. For example, on the Motorola 68000,
even if a target is not within the span of a PC-indexed mode, if an address register points
in the vicinity of the target, an address register-indexed mode is available which costs the
same space and time as the PC-indexed mode.
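Since the register-indexed mode in question carries a signed 16-bit displacement on the 68000, the reachability test for a known register value is simple; the following helper is a hypothetical sketch, not MCO code.

    #include <stdint.h>

    /* Sketch: can 'target' be addressed as d16(An) when register An
       is known to hold 'base'? The 68000 displacement is a signed
       16-bit quantity. (hypothetical helper, not from the MCO) */
    static int reachable_via_base(uint32_t target, uint32_t base)
    {
        int32_t disp = (int32_t)(target - base);
        return disp >= -32768 && disp <= 32767;
    }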
Hence, an improvement to the current operand reduction algorithm would be to
provide a data structure which maintains the known values in all registers which can be
indexed. In addition, known values in non-indexable registers may be useful since such
registers can replace addressing modes which yield constant values.
A number of approaches can be taken in handling this data structure:
- Have the compiler set aside a single address register as a base register, thus mimicking segmented architectures such as the Intel 8086. This register could be initialized by the MCO to point to an advantageous location, and references to all targets which fall within the span of this location could be improved. This is essentially the scheme taken in the Macintosh operating system ([Apple 85]) for the Macintosh 68000-based computer. However, on the Macintosh, all data references must be made using the base register; this limits global data to 32,767 bytes on this machine.
- Allocate base registers on a less global level. Information as to which registers are unused over ranges in the code would have to be obtained. These could then be initialized and used as local base registers if there were sufficient references in the range which could index off the address register.
- Information as to which registers have known values in ranges of the code could be obtained by techniques similar to constant propagation [Aho 77]. These registers could be used as base registers in the proper ranges without initialization. On architectures such as the Motorola MC68020 [Motor 84b], where a number of registers can be combined with scaling factors and constant offsets, registers could be used in linear combinations to produce the least expensive addressing mode.
6. Macro Compression
Until this point, I have described optimizations and techniques that are employed
in the MCO and that are, to varying degrees, successful toward the goals of optimizing
task files and furthering this research. In this Chapter I reflect upon a class of
techniques that are also consonant with this research but which did not yield satisfactory
results in some dimension of performance and were removed from the production version
of the MCO.
6.1. Background
Common code compression is a class of optimization techniques in which common
sequences of code are identified by various analysis methods and removed by altering the
code or providing information to a translator that is converting the code to a lower level.
This class of techniques includes common subexpression elimination, available expression
elimination, very busy expression hoisting, and code hoisting and sinking (see [Aho 86]
for a general discussion of these). These techniques are generally more suitable to
earlier phases of the compilation process than the link phase.
The technique of macro compression recognizes common code sequences and
replaces each occurrence of the common code with a call to a code macro or subprogram
containing the common code.
This space optimization was first used in [Dewar 79a] to conserve space in an
interpretive byte stream. The language used an 8-bit opcode but only had 80 operators.
The remaining 176 opcodes were used to represent frequently occurring byte sequences
beginning on instruction boundaries. In practice, only multi-byte instructions or partial instructions were subject to macro compression, but the savings remained substantial.
The theoretical aspects of this problem were studied in [Golum 80]. The assumptions were:
- A byte stream was to be minimized;
- A macro call consisted of a single byte;
- Exactly m macros of length ≤ k were to be chosen.
Optimal polynomial-time solutions were obtained which characterized potential
macro choices within the byte sequence using an interval or overlap graph (depending on
two slight variations of the problem). However, these algorithms were very costly in
practice.
6.2. Assembly Code Compression
A more recent approach [Fras 84] has been to apply pattern matching techniques to
assembly code to identify repeated subsequences. A suffix tree ([McCr 76]) is built for the
input code to be compressed. The suffix tree for a list of instructions, i, is a tree whose
|i| leaf nodes are labeled with the locations in i and whose arcs are labeled with
subsequences of i. For example, if a, b, and c are instructions and $ is the unique end
marker, the instruction list abcab$ would have the tree:
Figure 6.1 Example Suffix Tree
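One way to draw this tree (a sketch; each leaf is labeled with the position in abcab$ where its suffix begins):

    root
    ├─ ab ──┬─ cab$   leaf 1  (abcab$)
    │       └─ $      leaf 4  (ab$)
    ├─ b ───┬─ cab$   leaf 2  (bcab$)
    │       └─ $      leaf 5  (b$)
    ├─ cab$           leaf 3  (cab$)
    └─ $              leaf 6  ($)

The internal nodes for ab and b correspond to the two repeated subsequences, each occurring twice.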
This data structure allows us to find the subsequence beginning at any position and
ending at $ by following the path of edges from the root to the leaf with the proper label.
More importantly for macro compression, each non-leaf (internal) node represents a
common subsequence: the text of the subsequence is found by following the edges from
the root to the internal node and the number and location of the subsequences are
represented by all leaves whose path to the root goes through the internal node.
Once the suffix tree is built (in linear time — see [McCr 76]) the internal nodes of
the suffix tree are evaluated for validity under the semantic rules of macro compression
for the given assembly language and for payoff if they were replaced. Valid
subsequences are ordered in a priority queue by some criterion and the items of the
queue are processed in order, installing a code macro and calls to it at each step.
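The payoff evaluation is essentially arithmetic: replacing m occurrences of an n-byte sequence with calls of c bytes each, plus one macro body of n + r bytes (r being call/return overhead), saves m·n − m·c − (n + r) bytes. A sketch, with hypothetical names:

    /* Sketch: net bytes saved by turning a repeated sequence into a
       macro. n = bytes in the sequence, m = occurrences, c = bytes
       per macro call, r = overhead bytes in the macro body. */
    static long macro_savings(long n, long m, long c, long r)
    {
        return m * n - m * c - (n + r);
    }

A sequence is worth replacing only when this quantity is positive, which is why short sequences must occur many times to qualify.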
An optimizer for assembly code was built by [Fras 84] and was reported to run
efficiently and perform well. However, no statistics were given on the amount of
compression achieved.
6.3. A First Attempt
A preliminary optimizer for assembly language was built along these lines for the
purpose of gathering statistics. As expected, the effectiveness of macro compression
heavily depended on the size of the assembly code file. In assembly files generated from
languages such as COBOL ([ANSI 74]), a good deal of compression was obtained since the
entire user program is generated in a single assembly code file. However, for languages
such as C where a high degree of modularity tends to be observed, almost no
compression was obtained. Furthermore, code for language primitives, consisting of relatively small and selectively linked modules, was never compressed.
The next stage was to build an analyzer which maintained statistics on common
sequences across assembly language files. If, for a given language and compiler, many
common sequences appeared repeatedly across different programs, a database of those
sequences could be made available to a peephole optimizer [McKee 65]. It would replace
them in linear time [Knuth 77] and with small space overhead. A call to a macro body
would be substituted on the expectation that the sequence would appear enough times in
the various modules of the program to make substitution worthwhile, on the average.
The macro bodies would then be selectively linked in from a large library of these
subprograms.
However, it was found that, while a single program may have many common
subsequences within itself, the same sequences were, for the most part, not shared
between programs. Table 6.1 gives a summary of common sequences in two test
programs I will describe in detail in Chapter 7. These programs are called p1.68 and
p2.68 when compiled for the 68000. A breakdown is given for various sized
subsequences for p1.68, p2.68, and sequences which appeared in both. In each case, we
report the number of common sequences as well as the average number of occurrences of
each sequence in the programs. In the last column, I report the average occurrences in
both programs combined.
Table 6.1 Tabulation of Common Sequences

Size  | p1.68 # | occur | p2.68 # | occur | both # | occur
50-60 |    1    |  2    |    2    |  2    |    0   |
40-48 |    1    |  2    |    2    |  2    |    0   |
30-38 |    3    |  2    |    2    |  2    |    0   |
20-28 |    4    |  2.25 |    4    |  3.25 |    0   |
18    |    3    |  2.00 |    4    |  7.00 |    0   |
16    |    6    |  3.33 |    4    |  2.75 |    0   |
14    |    9    |  4.00 |    7    |  5.71 |    1   |  4.00
12    |   12    |  3.17 |   13    |  6.07 |    2   |  4.50
10    |   17    |  6.65 |   13    |  7.30 |    0   |
8     |   32    |  6.00 |   48    |  7.96 |   10   | 17.80
6     |   41    | 13.46 |   61    | 16.84 |   14   | 52.29
Total |  129    |       |  160    |       |   27   |
These figures show that even though the same compiler was used and the same code for language primitives was linked in, few sequences were common to both programs, compared with the number found in each program treated separately. This happens because many of the common sequences contain code which refers to program-specific global data or subprograms.
6.4. Macro Compression in the MCO
The results of the assembly code macro compressor indicated that a macro compressor which operated on all the modules of a single program at once would obtain the greatest compression. Thus, the macro compressor was recoded to operate on task files, and this became the first version of the MCO. It operated with essentially the same instruction parser and code relocation algorithm described earlier, but without any of the other optimizations described so far.
There were a number of significant additions to the MCO implementation beyond
that of [Fras 84]:
- The user had the choice of two priorities when inserting sequences into the priority queue: they could be inserted in order of the number of bytes saved by the substitution of the sequence (assuming no overlap with earlier substitutions), or they could be inserted based on the number of bytes in the sequence ('value' priority versus 'length' priority).
- The MCO was more aggressive in salvaging sequences which would have been discarded as invalid: if the sequence referenced the stack, the macro body was constructed so that the return location was not stored on the stack; if the sequence modified the stack in certain simple ways, similar transformations were applied to the macro body; if some code near the end of the sequence would have caused it to be discarded, the sequence was shortened and re-inserted into the priority queue.
- Since [Fras 84] does not describe the data structures in detail, it is not clear how their suffix tree was represented internally. The MCO maintained the tree in virtual memory. The initial implementation had a pointer from each internal node to its first child and a pointer from each node to its sibling. This simple data structure was very compute intensive during the construction of the suffix tree (48:07 for p1.68), and a hash table was installed to represent the parent-child relation between the root node and the second level of the tree (reducing the time to 3:22 for p1.68).
The statistics relating to this version of the MCO are reported in Tables 7.2 for the
execution time of macro compression, 7.5 for the size improvement in the text segment,
and 7.6 for the degradation in the target program's execution speed. From these results
it was decided that macro compression was not desirable in the production version of the
MCO because of the costs in the following areas:
Compressor Speed
The macro compressor required several times the compute resources of the other
optimizations combined. The total speed of about 125 bytes/sec was very close to
the speed of the [Fras 84] implementation, but we were dealing with the entire task
file on each run of the optimizer.
Dynamic Memory Requirements
The suffix tree nearly doubled the dynamic memory required by the MCO.
Execution Speed Degradation
When running a program after macro compression, the task image is smaller, but
the CPU must spend time executing the macro calls. On p1.68, the compressed
program took 15% more CPU time than the original (with macro compression and
operand reduction). In real time, the compressed file took 7% more time -
probably due to the fact that the operating system treats smaller task files with a
higher priority. This speed degradation is not as severe if the length priority is used rather than value priority (see Section §7.5). With value priority, 913 macro calls were installed, whereas with length priority, 821 were installed. Thus, fewer macro calls are made and execution speed is not affected as much with length priority. In addition, value priority saved a total of 2,310 bytes while length priority saved 2,384 bytes.
6.5. G-Code and G-Compression
G-compression takes the concept of macro compression to extremes. First, the text
segment is converted into a very compact Generative code or G-code. At execution time
this is loaded into memory along with the data segment for the program, a decoder, and
an execution buffer. The decoder is responsible for re-constituting sections of G-code
into their original native code and placing them in the execution buffer. Code is executed
in the execution buffer until a new section of code needs to be re-constituted, at which
time control returns to the decoder. If enough space is allocated for the execution buffer
and a good allocation algorithm is used, the decoder will be called infrequently compared
to execution of native code in the buffer. Even if the processor spends half its time in
the decoder, this is substantially better than the 10 to 40 times speed degradation
experienced in typical interpretive systems, with the potential for a greater savings of
space.
This approach is similar in concept to 'throw-away compiling', a technique which compiles frequently interpreted sections of code at run-time ([Brown 76], [Brown 79], [Hans 74]). However, these systems employed full code generation at run-time to produce object code, rather than a straightforward decoding. Hence, the translation was slow, required a very large 'decoder', and could not achieve a high level of object code optimization.
To elaborate on these ideas, I describe the items in memory at execution time in
more detail:
G-Code
The generative code is a representation of the text segment of the original program in which a series of transformations have been applied to translate original instructions and sequences of instructions into G-code instructions. The first set of transformations
modifies certain types of instruction operands. In general, references to registers or
registers with displacements are unaffected. However, references to instructions or data
in the original code are converted to references to the corresponding instructions or data
in the G-code.
Each reference to an instruction in the original text segment (original text reference) is replaced by the bit address of the start of the corresponding G-code instruction relative to the start of the sequence of G-code instructions (soft text reference). Thus the number of bits needed to represent a soft text reference depends on the size of the G-code.
Each reference to a byte in the original data segment (original data reference) is
replaced by the byte offset of the referenced data byte from the start of the data segment
(soft data reference). The number of bits for a soft data reference depends on the size of
the data segment.
Any immediate operand which refers to the address of an instruction in the original
text segment is replaced by the soft text address of the corresponding G-code instruction.
An immediate operand which refers to an original data item is converted to a byte offset
from the start of the data segment.
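As a sketch of this transformation, suppose the compressor keeps, for each original instruction, the bit offset of its G-code translation; converting references is then a lookup. All names below are hypothetical:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: translating original references into soft references. */
    struct gmap {
        const uint32_t *orig_addr;  /* original address of instr n (sorted) */
        const uint32_t *gcode_bit;  /* bit offset of its G-code translation */
        size_t          n;
    };

    /* Soft text reference: bit address of the corresponding G-code
       instruction, found by binary search on the original address.
       Assumes addr names some instruction present in the map. */
    static uint32_t soft_text_ref(const struct gmap *m, uint32_t addr)
    {
        size_t lo = 0, hi = m->n - 1;
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            if (m->orig_addr[mid] < addr) lo = mid + 1; else hi = mid;
        }
        return m->gcode_bit[lo];
    }

    /* Soft data reference: byte offset from the start of the data segment. */
    static uint32_t soft_data_ref(uint32_t addr, uint32_t data_base)
    {
        return addr - data_base;
    }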
In general, the set of operand addressing modes defined for native code and G-code
differ in order to accommodate these transformations. Also, since there are no byte-boundary
limitations imposed by the decoder, operands can occupy any number of bits
and can even vary in size. Native code designs incorporating some of these features such
as bit-aligned instructions and variable sized operands have been developed in the Intel
432 architecture ([Tyner 81]) and the design of the Burroughs B1700 ([Wiln 72]).
The second transformation applied to the original text segment is solely for the
purpose of compressing the instruction sequence. The criteria are that it must be
decodable starting at any instruction boundary and that the decoding must be done in real
time (requiring no look-ahead) [Peter 61]. This can be accomplished by one or a
combination of:
- A straightforward Huffman encoding ([Knuth 73], p. 402) of the G-code.
- A partial Huffman encoding in which the main instruction word is encoded but words which contain addresses and displacements are not affected.
- A macro compression scheme in which common code sequences are collapsed into a macro table (which becomes part of the G-code) and replaced with short non-instructions from a Huffman encoding list. First, the priority queue is ordered by the number of times a sequence appears, rather than by the length or value of a sequence. Then all sequences which appear more than once are inserted into the queue. When processing the queue, sequences are replaced by bit encodings of increasing length. These encodings are assigned in the way that Huffman codes are built [Gall 78] so that minimum space is required.
Data Segment
This is identical to the data segment which would be loaded with the original version
of the program, except that it is not necessarily loaded at the same address. Pointers in
the data segment to other data addresses are relocated based on the new base address for
the data segment. Text addresses are translated to their corresponding soft address.
Decoder and Execution Buffer
The decoder is a fixed section of code which runs on the native hardware. Its job is
to re-constitute G-code into machine code. It takes sections of G-code which need to be executed and decodes them into a variable-sized execution buffer. The decoded native
code is essentially the same as the original native code in the text segment. The
differences are that addresses which refer to the data segment are adjusted to point to the
new data segment and text addresses are converted back from soft addresses into the
hard addresses of the decoded section, if the code at the target text address is already in
the execution buffer. Otherwise, a branch to a non-decoded code section consists of a
push of the soft address and a call back to the decoder.
The decoder has the following entry points:
MAIN: This is the entry point from outside the program. The first section of code
is re-constituted, the user's stack, registers, and arguments are initialized, and the first
section of code is called.
TRANSFER: Branch to this entry point to re-constitute and execute code beginning
at the soft address which is on top of the stack. All branches to TRANSFER which are
preceded by a push of this soft address are then converted to a branch directly to the
newly re-constituted native code.
EXTEND: Append a new section of native code after the last executed block. Two
items are on the stack: the soft address of the new code and the hard address to begin
placing the hard-code. Accessed from unconditional branches in the original program
code which are at the end of a re-constituted block, this is really a special case of
TRANSFER which can optimize speed by eliminating unconditional branches.
CALL: Same as TRANSFER, except that a subprogram call has been made.
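In outline, TRANSFER behaves like the following C sketch; every routine here is a hypothetical stand-in for decoder internals.

    #include <stdint.h>

    void *lookup_decoded(uint32_t soft_addr);   /* NULL if not yet decoded   */
    void *decode_section(uint32_t soft_addr);   /* re-constitute into buffer */
    void  patch_branch(void *site, void *hard); /* make the branch direct    */

    /* Sketch of the TRANSFER entry point: the branch at 'branch_site'
       pushed 'soft_addr' and called the decoder. */
    void *transfer(void *branch_site, uint32_t soft_addr)
    {
        void *hard = lookup_decoded(soft_addr);
        if (hard == NULL)
            hard = decode_section(soft_addr);
        patch_branch(branch_site, hard);  /* later executions skip the decoder */
        return hard;                      /* execution resumes here */
    }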
The contents of memory during program execution are summarized in Figure 6.2.
Figure 6.2 Memory Organization for G-Compression
Aside from producing a straight executable program, the G-code scheme can be used in ways more closely tied to the machine. For example, G-code might be the actual language of the machine, with the decoder residing in microcode. The macro bodies themselves would be read in when the program is loaded. The rudiments of such a scheme are employed in the VAX-11 architecture ([DEC 77]), which allows the microcode for certain instructions to be read in by the user during system initialization.
7. Measurement and Evaluation of Performance
The current implementation of the MCO is written in C ([Kern 78]) and runs on a
VAX-11/750 ([DEC 77]) under Berkeley UNIX Version 4.1c ([UNIX 80]). It optimizes
task files containing 68000 and UNIX machine code generated from C source code
compiled with the Philon FAST/C compilers [Phil 85c]. These files run under Uniplus-UNIX [Instr 81] on the 68000 and Berkeley UNIX on the VAX-11, respectively.
This chapter reports on performance measurements taken on the MCO. Figures do
not include measurements of macro compression, unless specifically noted. I give
statistics in five areas: the running speed and size of the MCO, the space and speed
improvements gained for each target machine, and the programmer time required to retarget
from the 68000 to the VAX.
7.1. Test Input
For purposes of the statistics in this chapter, two sample input files were used,
compiled for the 68000 and the VAX. They are production versions of two passes of a
compiler for a dialect of BASIC [Phil 84cb]. They are called p1.68 and p2.68 when
compiled for the 68000 and p1.vx and p2.vx when compiled for the VAX. They are
ideal for statistical purposes since they are production programs which execute a mix of
computation- and I/O-bound code in batch mode. Also, p1 contains most of its code within the 32,767-byte span limitation of these machines and p2 exceeds that limit by almost a
factor of 2. The sizes of the text and data segments for the four task images are given in
Table 7.1.
Table 7.1 Sizes of the Test Programs

Program | Text bytes | Data bytes
p1.68   | 33,684     | 12,664
p2.68   | 57,482     | 11,054
p1.vx   | 29,696     | 13,800
p2.vx   | 47,104     | 11,496
7.2. Speed of the MCO
First I report on the time required to run the various phases of the MCO on the sample input programs. Table 7.2 gives this information in terms of CPU time on a VAX-11/750. This is a measure, by the operating system, of how much time the CPU spent executing instruction in that phase. In parenthesis, I/O time is given for phases which had significant I/O usage. These figures give the amount of time that the operating spent performing I/O operations on behalf of that phase.
Table 7.2 Execution time for the MCO on a VAX-11/750 (min:sec; I/O time in parentheses)

Phase               | p1.68       | p2.68       | p1.vx       | p2.vx
Input & Instr Parse | 0:24 (0:02) | 0:46 (0:04) | 0:50 (0:02) | 1:30 (0:05)
Text Blocking       | 0:00        | 0:01        | 0:00        | 0:01
Operand Linking     | 0:11 (0:17) | 0:20 (0:21) | 0:07 (0:03) | 0:13 (0:05)
Code Elimination    | 0:02        | 0:05        | 0:01        | 0:02
Code Distribution:  |             |             |             |
  σ0                | 0:06        | 0:10        | 0:05        | 0:08
  σ1                | 1:29        | 3:01        | 1:04        | 2:03
  σ0 + σ1           | 1:24        | 3:07        | 1:07        | 2:16
Minimize            | 0:12        | 0:19        | 0:37        | 0:54
Lengthen            | 0:21        | 0:39        | 0:35        | 0:42
Code Relocation     | 0:02        | 0:03        | 0:02        | 0:03
Output              | 0:12 (0:29) | 0:16 (0:46) | 0:08 (0:17) | 0:10 (0:23)
Total (σ0 + σ1)     | 2:48 (0:48) | 5:36 (1:14) | 3:27 (0:22) | 5:51 (0:33)
Macro Compression (p1.68 only):
  Build suffix tree | 3:22        |             |             |
  Build prio queue  | 0:15        |             |             |
  Modify code       | 0:52        |             |             |
Table 7.2 is clarified by the following points:
- The task of parsing the instruction sequence is, by far, the most time-consuming aspect of the first phase. On the 68000, this is done by a large routine (28 pages of source code) which disassembles the byte sequence. For the VAX, parsing is done by a tiny routine which relies almost entirely on the static tables describing the architecture. The interpretation of those tables greatly speeded development of the instruction parser for the VAX and did not slow the routine. The execution time of that phase for VAX input doubled because instructions on the VAX are parsed twice (see Section §2.2).
- The I/O time required for operand linking is spent reading the relocation information from the input task file.
- Most of the time spent in the MINIMIZE phase of operand reduction is in building and processing the translate class. The VAX, which has larger translate classes for instructions due to its more orthogonal addressing mode structure, requires about three times longer in this phase.
- Conversely, the LENGTHEN phase builds a translate class only if a span restriction is exceeded. Most of the time here is spent passing over the code until all span restrictions are satisfied. While this convergence could require many passes, in practice few passes are needed. These test cases required three passes for p1.68 and four passes for the others. No program run through the MCO during testing or production use has ever required more than five passes.
7.3. Space Requirements of the MCO
The size of the MCO is reported in two aspects: the static space needed for program
code and data and the dynamic space required for the dynamic data structures as a
function of the input program size.
Table 7.3 lists the number of bytes used for the 68000 and VAX versions of the MCO.
Table 7.3 Static Space Required by the MCO

Size of …           | 68.mco | vx.mco
Text bytes          | 34,816 | 31,744
Data bytes – tables | 4,832  | 15,496
Other data bytes    | 10,992 | 9,608
Total static size   | 50,656 | 56,864
Note that the text segment of the VAX version is smaller due to the table-driven instruction parser described in the last section. This is reflected in the substantially larger space required to store the static data structures.
Table 7.4 reports on the space required to represent all the dynamic data structures which are built.
Table 7.4 Dynamic Space Required by the MCO

Bytes for …                  | p1.68 (68.mco) | p2.68 (68.mco) | p1.vx (vx.mco) | p2.vx (vx.mco)
Total mem required           | 515,032        | 854,592        | 420,264        | 639,704
Size of task image           | -50,656        | -50,656        | -56,864        | -56,864
Dynamic memory required      | 464,376        | 803,936        | 363,400        | 582,840
Text in test prog            | 33,684         | 57,482         | 29,696         | 47,104
Dynamic data per target byte | 13.79          | 13.98          | 12.24          | 12.37
These figures represent the total number of bytes for dynamic data, with no effort made to free this space; for example, they reflect no savings from space freed by code elimination. The figures show that the MCO requires 12-14 times as much memory as the text segment of the input program. Of course, the MCO could be modified to maintain these structures on secondary storage.
7.4. Effect on Program Space
This section presents statistics on the reduction in the size of the text segment for
the input programs.
Table 7.5 gives the number of bytes saved by each phase of the MCO. For code
elimination, I give the savings with unlabeled elimination and the additional savings
with subprogram elimination.
Table 7.5 Effect of MCO on Size of Text Segment (bytes saved; percentage of initial text in parentheses)

Phase                        | p1.68         | p2.68         | p1.vx          | p2.vx
Initial text bytes           | 33,684        | 57,482        | 29,696         | 47,104
Unlabeled Elimination        | 2,458 (-7.3%) | 4,900 (-8.5%) | 2,639 (-8.9%)  | 5,154 (-10.9%)
Added Subprogram Elimination | 682 (-2.0%)   | 596 (-1.0%)   | 799 (-2.7%)    | 931 (-2.0%)
Code Distribution (σ0 + σ1)  | 104 (-0.3%)   | 652 (-1.1%)   | 46 (-0.2%)     | 486 (-1.0%)
  σ0                         | -58           | 428           | -36            | 374
  σ1                         | 24            | 470           | 43             | 383
Operand Reduction            | 2,624 (-7.8%) | 3,476 (-6.1%) | 3,632 (-12.2%) | 4,207 (-8.9%)
Total (σ0 + σ1)              | 5,868         | 9,624         | 7,116          | 10,778
Text segment reduced         | 17.4%         | 16.7%         | 24.0%          | 22.9%
Macro Compression            | 2,310 (-6.9%) |               |                |
The following points should be noted in reference to these statistics:
- The code elimination statistics depend on the way the runtime libraries are structured for a given language and compiler, and can vary greatly.
- As expected, code distribution is far more useful on programs whose text segment exceeds a span restriction imposed by the architecture.
- The savings of about 1% for code distribution on files which exceed 32K bytes of text does not seem to justify the time required for this phase of the MCO (see Table 7.2). However, given that C code is typically written with heavy reliance on stack-based data rather than static data, task images generated from other source languages would probably benefit more from this optimization.
7.5. Effect on Program Speed
To test the speed of the original and optimized versions of the test programs, they were run on their target machines and timed. The VAX target machine was a VAX-11/750 and the 68000 was a Pixel 100/AP [Instr 81] with a 10 MHz CPU. Since the input
code was the first two passes of a BASIC compiler, each was run on the same 123 line,
3,988 character source file.
Table 7.6 reports the CPU time statistics returned by UNIX as described earlier.
The I/O time was affected more by system load than by any optimization performed, and
is not reported in this table.
Table 7.6 Effect of the MCO on Program Speed (CPU seconds; percentage improvement in parentheses)

Program run            | p1.68         | p2.68        | p1.vx       | p2.vx
Original program       | 6.73          | 2.85         | 6.08        | 2.62
MCO with no code distr | 6.41 (4.8%)   | 2.46 (13.7%) | 5.95 (2.1%) | 2.55 (2.7%)
MCO with σ0 + σ1       | 6.32 (6.1%)   | 2.48 (13.0%) | 5.92 (2.6%) | 2.50 (4.6%)
Macro Compression      | 7.73 (neg. 14.9%) |          |             |
The basic thing to note about these figures is that the improvement on the 68000 version was much greater than on the VAX. I conjecture that, due to an instruction buffer maintained by the VAX instruction decoder, the processing of semantically equivalent memory references is no faster for shorter operands. This is supported by the survey of VAX instruction timings reported in [Shiel 84].
8. Conclusions
It is my thesis that a class of optimization techniques, which can be performed only at the machine code level, is effective toward the goals of program optimization. Furthermore, these techniques can be implemented in a straightforward manner and in a reasonably machine-independent fashion. I begin by reviewing the implementation and theoretical work done on the MCO, then describe other proposed ideas for optimizations at this level and suggest areas for future research.
8.1. Review of Work Done
The core of this work has been to define a class of inefficiencies which exist on
certain architectures and environments and build an optimizer, the MCO, to remove
these inefficiencies.
The inefficiencies relate mainly to programs which consist of many modules and
which are linked using a linker which cannot resolve inter-module references efficiently.
Generally, when a single module is compiled, the most general and most costly
addressing mode must be used for inter-module references since no information as to the
relative or absolute location of the target is available. Hence, I deal with inefficiencies
which can only be removed during or after the link phase of compilation.
A basic inefficiency is the presence of unreferenced subprograms in the task file.
I review existing techniques for eliminating such code and develop and implement an
augmented version of one of them, called subprogram elimination.
Another inefficiency concerns the order in which code and data appear in a task file.
I review the problem of data distribution, which places data objects throughout the
code segment of the program and cite earlier work which shows the problem to be NP-Complete.
I also review the problem of code distribution, which shuffles the
subprograms of the code segment to reduce the distance between operands and their
targets in the code segment. I show this problem to be NP-Complete also. I then
implement efficient heuristics for code distribution which improve the ordering of
subprograms in the code segment.
I then approach the problem of installing addressing modes in operands of
instructions which take advantage of the proximity of targets. I develop an algorithm,
called operand reduction, for installing the minimum sized addressing mode for any given
operand. This algorithm is largely machine-independent; it relies almost entirely on a set
of data structures which describe the machine architecture.
I then discuss a technique, macro compression, which reduces the storage
requirements of a program, but which carries an associated speed penalty. I describe
earlier work and report on the results of a trial implementation of macro compression. A
more aggressive technique, called G-compression, which carries a larger speed penalty
but offers the possibility of much greater code compaction, is also described, although no
attempt at implementation was made.
Finally, statistics are reported on the performance of the MCO and its effect on
target programs. The results indicate that the MCO yields a substantial space
improvement and smaller speed improvements.
I conclude that the techniques applied by the MCO attain many of the
performance advantages of segmented architectures on linear address space machines
without imposing restrictions on addressing.
8.2. Proposals for Further Investigation
In the light of the effectiveness of the current MCO, a number of areas deserve
further investigation:
- Investigate improved algorithms and heuristics for code distribution.
- Implement some version of register tracking as described in
Section §5.3. Also, the
performance improvement from adding register tracking to operand reduction
should be measured.
- Implement a G-code scheme, as described in
Section §6.5, to determine the space
savings and speed degradation involved. This scheme could be useful in
applications where interpreters are currently used to deal with severe memory
restrictions.
- Investigate algorithms for improved recognition of common subsequences. These
algorithms could relax the definition of ‘common subsequence’ to allow instructions
which are out of order, renaming of registers, etc.
- Investigate a macro compression scheme which would allow code macros to take
parameters. This could be used to allow non-conforming subsequences to be
replaced by macro calls by supplying an argument to the macro body.
In addition, a number of the following techniques may be applicable to the MCO:
- The full implementation of register tracking implies the need for algorithms similar
to data-flow algorithms on higher-level program representations. The
implementation of such algorithms on machine code to track live-dead information
on registers should be investigated. Also, such algorithms can be used to implement
other transformations. For example, a register does not need to be saved and
restored in a subprogram if no call to that subprogram needs that register as live.
- If a constant operand is used often enough, space can be saved on some
architectures by building local tables of these constants which can be accessed by
some span-dependent addressing mode. However, this degrades execution speed on
many architectures.
- Subprograms which are called once can be moved in-line. Local repair and
optimization can then be done at the entry and exit points to save stack
manipulation.
Appendix A. Definition of Text and Block Node Fields
This Appendix provides a description of the fields in text nodes and block nodes
and their contents.
For each instruction, the text node contains the following fields:
OPC
The opcode for this instruction. This number is independent of the actual bit
pattern for the instruction: it is an ordinal index into the static data structures which
describe the instruction set on the target architecture.
SIZE
A code which denotes the number of bytes being operated on by this instruction. This field is used to reduce the number of opcodes by combining instructions which
perform similar operations on different sized objects into a single opcode.
IADDR, FADDR
The initial and final addresses for the instruction. The IADDR field gives the load
address at which this instruction would have been loaded as specified in the input
task file. FADDR specifies the load address for the instruction in the output task
file; it is initialized to IADDR and gets incrementally changed as the code is
improved.
INSTR
A pointer to the bytes of the instruction. As instructions are parsed, this field is
initialized to point directly into the image of the text segment read into memory. However, if an instruction is ever expanded past its original length, the bytes must
be stored elsewhere (in a dynamically allocated buffer).
IBYTES, NBYTES
The initial and current number of bytes in the instruction. IBYTES is needed so
that the INSTR field can be reset properly if the instruction needs to be expanded
beyond its original length.
NEXT
A pointer to the next text or data node on the list. Initially, all the text and data
nodes are linked together in a single list in the order they appear in the input file. During code distribution, the original list is partitioned into a set of lists which are
re-ordered.
REF, JSR
The number of times the instruction is referenced and the number of those
references that are subprogram call instructions. A reference could
consist of a jump to or a call of the instruction, a pointer to the instruction in some
data area, or a constant in an instruction operand or data area which names the
instruction.
OP
An array of operand descriptors. Each operand descriptor holds information for a
single operand of the instruction and has the following fields:
ADDR
For relocatable operands, this field holds the effective address which this operand
referenced when it was initially read in. This reference might be to another
instruction or to a data item in some data area. For non-relocatable operands, this
field is used during operand reduction to preserve the value specified by the
operand while the addressing mode of the operand is being altered. Refer to
Chapter 5 for the specifics.
TARGET
A pointer to a text or data node that contains the object referenced by a relocatable
operand. This field is NULL if the operand is non-relocatable.
MODE
A code giving the addressing mode used by the operand. Like the OPC field, this
code is used to index into the static data structures which describe the addressing
modes on this architecture.
OFFSET
An index into the bytes of the instruction telling where any extension word
associated with this operand begins. This field is updated whenever some code
improvement changes the addressing mode of the operand.
REG
An array of register descriptors giving the machine registers used by this operand. The significance of each element of the array depends on the addressing mode in
use.
In addition, each operand has an operand position, OPNUM(op), associated with it
which is simply that operand's position in the OP array of descriptors.
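Before turning to block nodes, the following self-contained sketch (with invented
table contents, not the MCO's actual descriptors) shows how these fields are
typically consulted: OPC and MODE act as ordinal indices into static descriptor
tables, and OPNUM is simply an operand's index in OP:

#include <stdio.h>

/* Simplified stand-ins for the static descriptor tables. */
typedef struct { char *id_name; int id_noper; } id_entry;
typedef struct { char *am_name; } am_entry;

static id_entry id_table[] = { { "move", 2 }, { "jsr", 1 } };
static am_entry am_table[] = { { "register" }, { "absolute long" } };

/* Simplified text node: an opcode and its operand descriptors. */
typedef struct { int op_mode; } operand;
typedef struct { int tx_opc; operand tx_op[2]; } tx_node;

/* Print each operand's position (its OPNUM) and addressing mode. */
static void print_operands(tx_node *tx)
{
    int i;
    for (i = 0; i < id_table[tx->tx_opc].id_noper; i++)
        printf("%s operand %d: %s\n",
               id_table[tx->tx_opc].id_name, i,
               am_table[tx->tx_op[i].op_mode].am_name);
}

int main(void)
{
    tx_node mv = { 0, { { 0 }, { 1 } } };   /* a two-operand "move" */
    print_operands(&mv);
    return 0;
}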
Each block node has the following fields:
SADDR, EADDR
The IADDR of the first node in the subprogram and the SADDR of the next block
(zero for the last block).
TEXT
A pointer to a linked list of text and data nodes for this block. The last node on
this list has a NULL NEXT field.
INEXT, NEXT
Pointers to successor block nodes. INEXT records the successor as the blocks
appear textually in the input; NEXT records the successor as modified by later
optimizations.
REF, RELOC, DRELOC
Fields used during the code distribution algorithm.
Appendix B. Low Level Implementation of Data Structures
In this appendix, I present the low level implementation of the data structures
involved in forming the TRANSLATE_CLASS and the operand reduction algorithm.
They are coded in C ([Kern 78]) and appear as they do in the production version, except
for the following modifications, which hold for this and following appendices:
- Certain type declarations have been simplified for ease of reading this section of
code independently from the rest of the MCO.
- All debugging, tracing, and much of the assertion checking has been removed.
- This code actually appears in several separate modules in the production version.
- The comment conventions have been altered as well as other cosmetic and
typographic changes.
#if TM68000 or TVAX11 or TTI32000
typedef long m_addr;
#endif
#if TM68000 or TTI32000
#define m_opcount 2
#endif
#if TVAX11
#define m_opcount 6
#endif
#if TM68000
#define m_maxreg 3
#endif
#if TTI32000 or TVAX11
#define m_maxreg 2
#endif
#if TM68000 or TVAX11 or TTI32000
typedef byte m_reg;
#endif
#if TM68000 or TTI32000
typedef byte m_opc;
#endif
#if TVAX11
typedef short m_opc;
#endif
-- The type of an addressing mode descriptor.
#if TM68000 or TTI32000 or TVAX11
typedef byte m_mode;
#endif
#if TM68000
#define span8min (-126)
#define span8max 129
#define span16min (-32766)
#define span16max 32769
#endif
#if TVAX11
#define span8min (-128+2)
#define span8max 127
#define span16min (-32768+4)
#define span16max 32767
#define span6min 0
#define span6max 63
#endif
typedef struct {
m_addr op_addr;
struct tx_tag *op_target;
m_mode op_mode;
byte op_offset;
m_reg op_reg[m_maxreg];
} operand;
typedef struct tx_tag {
m_opc tx_opc;
#if TM68000
byte tx_size;
#define siz_byte 0
#define siz_word 1
#define siz_long 2
#define siz_illegal 3
#endif
byte *tx_instr;
struct tx_tag *tx_next;
m_addr tx_iaddr;
m_addr tx_faddr;
byte tx_ibytes;
byte tx_nbytes;
byte tx_ref;
byte tx_jsr;
#if OPSYMBOL
char *tx_label;
#endif
operand tx_op[1];
} tx_node;
typedef struct dt_tag {
m_opc dt_opc;
byte *dt_data;
struct dt_tag *dt_next;
m_addr dt_iaddr;
m_addr dt_faddr;
long dt_nbytes;
} dt_node;
#define nbytes_of(tx) (((tx)→tx_opc == o_data) ? (tx)→dt_nbytes : (tx)→tx_nbytes)
#define ibytes_of(tx) (((tx)→tx_opc == o_data) ? (tx)→dt_nbytes : (tx)→tx_ibytes)
#define is_text(tx) ((tx)→tx_opc != o_data)
#define is_data(dt) ((dt)→dt_opc == o_data)
#if TM68000
#define dalign(a) (a)
#endif
#if TVAX11
#define dalign(a) (((a) + 0x03FF) bitand 0xFFFFFC00)
#endif
typedef struct bl_tag {
tx_node *bl_text;
struct bl_tag *bl_inext;
struct bl_tag *bl_next;
m_addr bl_saddr;
m_addr bl_eaddr;
long bl_ubreloc;
long bl_ubref;
long bl_ref;
long bl_reloc;
long bl_dreloc;
} bl_node;
#define for_all_text(bl,tx) \
for(bl=bl_first; bl; bl = bl→bl_next) for(tx=bl→bl_text; tx; tx = tx→tx_next)
typedef struct am_tag {
byte am_size;
byte am_speed;
long am_icount;
long am_fcount;
text *am_name;
m_mode *am_oec;
bool (*am_span_ok)();
} am_node;
typedef struct id_tag {
char *id_name;
byte id_noper;
byte id_speed;
short id_icount;
short id_fcount;
m_opc id_iec;
m_mode *id_class[m_opcount];
bool id_source[m_opcount];
bool id_dest[m_opcount];
} id_node;
typedef struct tc_tag {
m_opc tc_opc;
m_mode tc_mode;
m_reg tc_reg[m_maxreg];
} tc_node;
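The id_iec field above links each opcode to the next opcode in its chain of
interchangeable instructions; FORM_TC in Appendix C walks this chain until it
returns to the starting opcode or reaches a zero link. The following
self-contained sketch (opcode numbers and names are invented) shows the chain
discipline in isolation:

#include <stdio.h>

/* Simplified descriptor: only the name and the id_iec link. */
typedef struct { char *id_name; unsigned char id_iec; } id_entry;

static id_entry id_table[] = {
    /* 0 */ { "none", 0 },
    /* 1 */ { "jmp",  2 },   /* jmp and bra are interchangeable */
    /* 2 */ { "bra",  1 },   /* chain loops back to jmp */
    /* 3 */ { "nop",  0 },   /* no equivalents: chain terminates */
};

int main(void)
{
    unsigned char first = 1, opc = first;

    /* Visit every opcode interchangeable with "jmp".  The two exit
     * tests mirror those at the bottom of form_tc's main loop. */
    do {
        printf("candidate opcode: %s\n", id_table[opc].id_name);
        opc = id_table[opc].id_iec;
    } while (opc != 0 && opc != first);
    return 0;
}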
Appendix C. Low Level Implementation of Algorithms
In this appendix, I present the low-level implementation of the routines
FORM_TC and LENGTHEN.
tc_node tc[max_tc];
form_tc(tx, i)
tx_node *tx;
long i;
{
tc_node *tcptr;
m_opc firstopc;
m_opc opc;
m_mode am;
m_mode *oec;
m_mode *oecptr;
m_mode *acptr;
bool found;
long j;
tcptr = &tc[0];
firstopc = tx→tx_opc;
opc = firstopc;
oec = am_table[tx→tx_op[i].op_mode].am_oec;
forever {
if (opc != firstopc) {
for (j = 0; j < id_table[opc].id_noper; ++j) {
if (i == j) {
continue;
}
am = tx→tx_op[j].op_mode;
found = false;
for (acptr = id_table[opc].id_class[j]; *acptr; ++acptr) {
if (*acptr == am) {
found = true;
break;
}
}
if (not found) {
goto next_instr;
}
}
}
for (acptr = id_table[opc].id_class[i]; *acptr; ++acptr) {
oecptr = oec;
found = false;
if (oecptr) {
while (*oecptr) {
if (*oecptr++ == *acptr) {
found = true;
break;
}
}
}
else if (tx→tx_op[i].op_mode == *acptr) {
found = true;
}
if (not found) {
continue;
}
tcptr→tc_opc = opc;
tcptr→tc_mode = *acptr;
tcptr++;
}
next_instr:
if (id_table[opc].id_iec == NULL) {
break;
}
opc = id_table[opc].id_iec;
if (opc == firstopc) {
break;
}
}
tcptr→tc_opc = o_none;
}
lengthen()
{
bool change;
tx_node *tx;
bl_node *bl;
m_mode am;
short alter;
short i, j, k;
long span;
byte cond;
tc_node *tcptr;
tc_node *besttc;
long bestcost;
long newcost;
short oldsize;
change = true;
while (change) {
change = false;
++st_npasses;
set_faddr();
for_all_text(bl, tx) {
if (is_data(tx)) {
continue;
}
alter = 0;
for (i = 0; i < id_table[tx→tx_opc].id_noper; ++i) {
if (not tx→tx_op[i].op_addr) {
continue;
}
if ((*am_table[tx→tx_op[i].op_mode].am_span_ok)(tx, i)) {
continue;
}
am = tx→tx_op[i].op_mode;
form_tc(tx, i);
besttc = NULL;
bestcost = 99999;
for (tcptr = &tc[0]; tcptr→tc_opc != o_none; ++tcptr) {
newcost = id_table[tcptr→tc_opc].id_speed +
am_table[tcptr→tc_mode].am_speed +
am_table[tcptr→tc_mode].am_size;
if (newcost ≥ bestcost) {
continue;
}
if (tcptr→tc_opc == tx→tx_opc and tcptr→tc_mode == am) {
continue;
}
else if (span_ok(tx, i, tcptr, span)) {
bestcost = newcost;
besttc = tcptr;
}
}
tx→tx_opc = besttc→tc_opc;
tx→tx_op[i].op_mode = besttc→tc_mode;
change = true;
}
if (change) {
oldsize = tx→tx_nbytes;
expand_offsets(tx);
alter = tx→tx_nbytes - oldsize;
assert(2791, alter ≥ 0);
st_lengthen += alter;
}
}
}
}
Bibliography
[Aho 77]
A. V. Aho and J. D. Ullman, Principles of Compiler Design. Addison-Wesley, Reading, Mass., 1977, 604 p.
[Apple 85]
Inside Macintosh. Apple Computer, Inc., 1985.
[Ankl 82]
P. Anklam, D. Cutler, R. Heinen, and M. D. MacLaren, Engineering a Compiler.
Digital Press, 1982, 269 p.
[ANSI 74]
American National Standard Programming Language COBOL, ANSI X3.23-1974.
[Bell 71]
C. G. Bell and A. Newell, Computer Structures: Readings and Examples. McGraw-
Hill Book Co., 1971, 668 p.
[Brown 76]
P. J. Brown, “Throw-away Compiling”, Software-Practice and Experience, Vol. 6,
No. 3 (1976), pp. 423-434.
[Brown 79]
P. J. Brown, Writing Interactive Compilers and Interpreters. John Wiley and Sons,
Ltd., 1979, 265 p.
[Contr 75]
Fortran Extended, Version 4, Reference Manual. Control Data Corp., 1975.
[DEC 75]
PDP-11 Processor Handbook. Digital Equipment Corporation, Maynard, Mass.,
1975.
[DEC 77]
VAX 11/780 Architecture Handbook. Digital Equipment Corporation, Maynard,
Mass., 1977.
[Dewar 79a]
R. B. K. Dewar, M. C. Golumbic, and C. F. Goss, “MICRO SPITBOL”, Computer
Science Technical Report No. 11, New York University, 1979, 19 p.
[Dewar 79b]
R. B. K. Dewar, A. Grand, S. Liu, J. T. Schwartz, and E. Schonberg, “Programming
by Refinement, as Exemplified by the SETL Representation Sublanguage”, ACM
Transactions on Programming Languages and Systems, Vol. 1, No. 1 (July 1979), pp.
27-49.
[Even 75]
S. Even and Y. Shiloach, “NP-Completeness of Several Arrangement Problems”,
Technical Report #43, Dept. of Comp. Sci., Technion, Haifa, Israel, Jan 1975.
[Even 79]
S. Even, Graph Algorithms. Computer Science Press, 1979, 249 p.
[Fras 84]
C. W. Fraser, E. W. Myers, and A. L. Wendt, “Analyzing and Compressing
Assembly Code”, Proceedings of the SIGPLAN '84 Symposium on Compiler
Construction, SIGPLAN Notices, Vol. 19, No. 6 (June 1984), pp. 117-121.
[Fried 76]
G. Frieder and H. J. Saal, “A Process for the Determination of Addresses in
Variable Length Addressing”, Communications of the ACM, Vol. 19, No. 6 (June
1976), pp. 335-338.
[Gall 78]
R. G. Gallager, “Variations on a Theme of Huffman”, IEEE Transactions on
Information Theory, Vol. IT-24, No. 6 (Nov 1978), pp. 668-674.
[Garey 76]
M. R. Garey, D. S. Johnson, and L. J. Stockmeyer, “Some Simplified NP-Complete
Graph Theory Problems”, Theoretical Computer Science, Vol. 1, 1976, pp. 237-267.
[Garey 79]
M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory
of NP-Completeness. W. H. Freeman, San Francisco, 1979.
[Gieg 83]
R. Giegerich, “A Formal Framework for the Derivation of Machine-Specific
Optimizers”, ACM Transactions on Programming Languages and Systems, Vol. 5, No.
3 (July 1983), pp. 478-498.
[Golum 80]
M. C. Golumbic, C. F. Goss, and R. B. K. Dewar, “Macro Substitutions in MICRO
SPITBOL - A Combinatorial Analysis”, Congressus Numerantium, Vol. 29 (1980),
pp. 485-495.
[Hans 74]
G. J. Hansen, Adaptive Systems for the Dynamic Run Time Optimization of Programs.
Ph. D. Dissertation, Carnegie-Mellon University, Pittsburgh, Pa., 1974.
[IBM 74]
IBM System/360 and System/370 FORTRAN IV Language, Eleventh Edition, May 1974,
169 p.
[Instr 81]
The Pixel 100:AP User's Manual. Instrumentation Laboratory, 1981.
[Kern 78]
B. W. Kernighan and D. M. Ritchie, The C Programming Language. Prentice-Hall,
Inc., 1978, 228 p.
[Knuth 73]
D. E. Knuth, The Art of Computer Programming, Vol. 1: Fundamental Algorithms.
Second Edition, Addison-Wesley, Reading, Mass., 1973.
[Knuth 77]
D. E. Knuth, J. H. Morris, Jr., and V. R. Pratt, “Fast Pattern Matching in Strings”,
SIAM Journal on Computing, Vol. 6, No. 2 (1977), pp. 323-350.
[Lever 80]
B. Leverett and T. G. Szymanski, “Chaining Span-Dependent Jump Instructions”,
ACM Transactions on Programming Languages and Systems, Vol. 2, No. 3 (July
1980), pp. 274-289.
[Lower 69]
E. S. Lowery and C. W. Medlock, “Object Code Optimization”, Communications of
the ACM, Vol. 12, No. 1 (Jan 1969), pp. 13-22.
[McCr 76]
E. M. McCreight, “A Space-Economical Suffix Tree Construction Algorithm”,
Journal of the ACM, Vol. 23, No. 2 (April 1976), pp. 262-272.
[McKee 65]
W. M. McKeeman, “Peephole Optimization”, Communications of the ACM, Vol. 8,
No. 7 (July 1965), pp. 443-444.
[Motor 84a]
M68000 16/32-Bit Microprocessor - Programmer's Reference Manual, Motorola, Inc.,
Fourth Edition, 1984, 217 p.
[Motor 84b]
M68000 32-Bit Microprocessor User's Manual. Motorola, Inc., Prentice-Hall, 1984.
[Parts 83]
H. Partsch and R. Steinbruggen, “Program Transformation Systems”, ACM
Computing Surveys, Vol. 15, No. 3 (Sept 1983), pp. 199-236.
[Peter 61]
W. W. Peterson, Error-Correcting Codes, The M. I. T. Press and John Wiley &
Sons, 1961, 285 p.
[Phil 85c]
Philon FAST/C for the MC68000 Under Unix. Philon, Inc., New York, N.Y., 1985.
[Phil 85co]
Philon FAST/COBOL for the MC68000 Under Unix. Philon, Inc., New York, N.Y., 1985.
[Phil 85ft]
Philon FAST/FORTRAN for the MC68000 Under Unix. Philon, Inc., New York,
N.Y., 1985.
[Phil 84cb]
Philon FAST/BASIC-C for the MC68000 Under Unix. Philon, Inc., New York, N.Y.,
1984.
[Raman 84]
M. V. S. Ramanath and M. Solomon, “Jump Minimization in Linear Time”, ACM
Transactions on Programming Languages and Systems, Vol. 6, No. 4 (October 1984),
pp. 527-545.
[Rich 71]
D. L. Richards, “How to Keep the Addresses Short”, Communications of the ACM,
Vol. 14, No. 5 (May 1971), pp. 346-349.
[Rober 79]
E. L. Robertson, “Code Generation and Storage Allocation for Machines with Span-
Dependent Instructions”, ACM Transactions on Programming Languages and Systems,
Vol. 1, No. 1 (July 1979), pp. 71-83.
[Russ 80]
R. Rector and G. Alexy, The 8086 Book. Osborne/McGraw-Hill, 1980.
[Schn 73]
P. B. Schneck and E. Angel, “A Fortran to Fortran Optimizing Compiler”, Computer
Journal, Vol. 16, No. 4 (April 1973), pp. 322-330.
[Shiel 84]
L. D. Shields, Measuring SETL Performance. Ph. D. Dissertation, September 1983,
173 p.
[Szym 78]
T. G. Szymanski, “Assembling Code for Machines with Span-Dependent
Instructions”, Communications of the ACM, Vol. 21, No. 4 (April 1978), pp. 300-308.
[Texas 85]
TI32000 Programmer's Reference Manual. Texas Instruments, 1985.
[Tyner 81]
P. Tyner, iAPX 432 General Data Processor Architecture Reference Manual. Intel
Corp., 1981.
[UNIX 80]
UNIX Programmer's Manual, Seventh Edition, Virtual VAX-11 Version, November
1980.
[Wiln 72]
W. T. Wilner, “Design of the Burroughs B1700”, Proceedings of the AFIPS 1972 Fall
Joint Computer Conference, Vol. 41, AFIPS Press, Montvale, N. J., pp. 489-497.
[Wulf 75]
W. Wulf, R. K. Johnsson, C. B. Weinstock, S. O. Hobbs, and C. M. Geschke, The
Design of an Optimizing Compiler. Elsevier North-Holland, 1975, 165 p.
Addenda
This section contains additional references that were inadvertently omitted from the original 1986 publication.
[Aho 86]
A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools.
Addison-Wesley, Reading, Mass., 1986.
[Phil 84mb]
Philon FAST/BASIC-M for the MC68000 Under Unix. Philon, Inc., New York, N.Y.,
1984.
In addition, the following editorial changes were made:
- The incorrect references to [Lowry 69] and [McKee 67] in Section 1, ¶1 in the original were corrected to [Lower 69] and [McKee 65].
- Commas were added to some numbers to enhance readability. For example: 33,684 rather than 33684.
- The font was changed from Times Roman to Verdana to enhance readability on computer displays.
- Expressions that are inline in the text were typeset in italics to improve readability.
- Footnotes were re-numbered and were placed in-line in the text, immediately below the paragraph that references them.
- Some minor spelling corrections were made (“targetted” ⇒ “targeted”,
“ellucidate” ⇒ “elucidate”,
“exsiting” ⇒ “existing”).
- References to the author were changed from plural to singular.