Last update July 31, 2012

DMD Source Guide

If it's wrong, please correct it. If it's not here, please add it.

Front end

File                Function
access.c            Access check (private, public, package ...)
aliasthis.c         Implements the alias this D symbol.
argtypes.c          Convert types for argument passing (e.g. char is passed as ubyte).
arrayop.c           Array operations (e.g. a[] = b[] + c[]).
attrib.c            Attributes i.e. storage class (const, @safe ...), linkage (extern(C) ...), protection (private ...), alignment (align(1) ...), anonymous aggregate, pragma, static if and mixin.
bit.c               Generate bit-level read/write code. Requires backend support.
builtin.c           Identify and evaluate built-in functions (e.g. std.math.sin)
cast.c              Implicit cast, implicit conversion, and explicit cast (cast(T)), combining type in binary expression, integer promotion, and value range propagation.
class.c             Class declaration
clone.c             Define the implicit opEquals, opAssign, post blit and destructor for struct if needed, and also define the copy constructor for struct.
cond.c              Evaluate compile-time conditionals, i.e. debug, version, and static if.
constfold.c         Constant folding
cppmangle.c         Mangle D types according to Intel's Itanium C++ ABI.
declaration.c       Miscellaneous declarations, including typedef, alias, variable declarations including the implicit this declaration, type tuples, ClassInfo, ModuleInfo and various TypeInfos.
delegatize.c        Convert an expression expr to a delegate { return expr; } (e.g. in lazy parameter).
doc.c               Ddoc documentation generator (NG:digitalmars.D.announce/1558)
dsymbol.c           D symbols (i.e. variables, functions, modules, ... anything that has a name).
dump.c              Defines the Expression::dump method to print the content of the expression to console. Mainly for debugging.
e2ir.c              Expression to Intermediate Representation; requires backend support
eh.c                Generate exception handling tables
entity.c            Defines the named entities to support the "\&Entity;" escape sequence.
enum.c              Enum declaration
expression.h        Defines the bulk of the classes which represent the AST at the expression level.
func.c              Function declaration, also includes function/delegate literals, function alias, (static/shared) constructor/destructor/post-blit, invariant, unittest and allocator/deallocator.
glue.c              Generate the object file for function declarations and critical sections; convert between backend types and frontend types
hdrgen.c            Generate headers (*.di files)
iasm.c              Inline assembler
identifier.c        Identifier (just the name).
idgen.c             Make id.h and id.c for defining built-in Identifier instances. Compile and run this before compiling the rest of the source. (NG:digitalmars.D/17157)
impcnvgen.c         Make impcnvtab.c for the implicit conversion table. Compile and run this before compiling the rest of the source.
imphint.c           Import hint, e.g. prompting to import std.stdio when using writeln.
import.c            Import.
inifile.c           Read .ini file
init.c              Initializers (e.g. the 3 in int x = 3).
inline.c            Compute the cost and perform inlining.
interpret.c         All the code which performs CTFE (compile-time function evaluation)
irstate.c           Intermediate Representation state; requires backend support
json.c              Generate JSON output
lexer.c             Lexically analyzes the source (such as separating keywords from identifiers)
libelf.c            ELF object format functions
libmach.c           Mach-O object format functions
libomf.c            OMF object format functions
link.c              Call the linker
macro.c             Expand DDoc macros
mangle.c            Mangle D types and declarations
mars.c              Analyzes the command line arguments (also displays command-line help)
module.c            Read modules.
msc.c               ?
mtype.c             All D types.
opover.c            Apply operator overloading
optimize.c          Optimize the AST
parse.c             Parse tokens into AST
ph.c                Custom allocator to replace malloc/free
root/aav.c          Associative array
root/array.c        Dynamic array
root/async.c        Asynchronous input
root/dchar.c        Convert UTF-32 character to UTF-8 sequence
root/gnuc.c         Implements functions missing from GCC, specifically stricmp and memicmp.
root/lstring.c      Length-prefixed UTF-32 string.
root/man.c          Start the internet browser.
root/port.c         Portable wrapper around compiler/system specific things. The idea is to minimize #ifdef's in the app code.
root/response.c     Read the response file.
root/rmem.c         Implementation of the storage allocator, using the standard C allocation package.
root/root.c         Basic functions (deal mostly with strings, files, and bits)
root/speller.c      Spellchecker
root/stringtable.c  String table
s2ir.c              Statement to Intermediate Representation; requires backend support
scope.c             Scope
statement.c         Handles while, do, for, foreach, if, pragma, staticassert, switch, case, default, break, return, continue, synchronized, try/catch/finally, throw, volatile, goto, and label
staticassert.c      static assert.
struct.c            Aggregate (struct and union) declaration.
template.c          Everything related to template.
tk/                 ?
tocsym.c            To C symbol
toctype.c           Convert D type to C type for debug symbol
tocvdebug.c         CodeView4 debug format.
todt.c              ?; requires backend support
toelfdebug.c        Emit symbolic debug info in Dwarf2 format. Currently empty.
toir.c              To Intermediate Representation; requires backend support
toobj.c             Generate the object file for Dsymbol and declarations except functions.
traits.c            __traits.
typinf.c            Get TypeInfo from a type.
unialpha.c          Check if a character is a Unicode alphabetic character.
unittests.c         Run functions related to unit test.
utf.c               UTF-8.
version.c           Handles version
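
As an illustration of the delegatize.c entry in the table, here is a minimal C++ sketch of what wrapping an expression as { return expr; } buys for a lazy parameter. It is an analogy only, not code from DMD: pick and evaluations_when are hypothetical names, and std::function stands in for a D delegate.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a D function taking a `lazy int` parameter:
// the argument arrives as a callable and is evaluated only if needed.
int pick(bool use_it, const std::function<int()> &lazy_value) {
    assert(lazy_value);                 // a delegate must have been supplied
    return use_it ? lazy_value() : 0;   // evaluated on demand, at most once
}

// Counts how often the wrapped expression is actually evaluated.
int evaluations_when(bool use_it) {
    int count = 0;
    // Call site after "delegatizing": the argument expression has been
    // wrapped into a callable instead of being computed up front.
    pick(use_it, [&count] { ++count; return 42; });
    return count;
}
```

Calling evaluations_when(false) leaves the counter at zero, which is the point of the transformation: the expression is never computed unless the callee asks for it.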

Back end

File                Function
html.c              Extracts D source code from .html files
...                 ...

A few observations

  • idgen.c is not part of the compiler source at all. It is the source of a code generator which creates id.h and id.c, which define a whole lot of Identifier instances (presumably used to represent the various 'builtin' symbols that the language defines).
  • impcnvgen.c follows the same pattern as idgen.c. It creates impcnvtab.c, which appears to describe casting rules between primitive types.
  • Unsurprisingly, the code is highly D-like in methodology. For instance, root.h defines an Object class which serves as a base class for most, if not all, of the other classes used. Class instances are always passed by pointer and allocated on the heap.
  • root.h also defines String, Array, and File classes, as opposed to using STL. Curious. (A relic from the days when templates weren't as reliable as they are now?)
  • Lots of files with .c suffixes contain C++ code. Very confusing.
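
The generator pattern described for idgen.c can be sketched roughly as follows. This is a minimal illustration, assuming a small table of names; IdEntry, make_header and make_source are hypothetical, and the emitted text is an invented layout rather than the actual contents of id.h or id.c.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One entry of the hypothetical identifier table: the C++ member name and
// the spelling of the identifier as it appears in D source.
struct IdEntry {
    std::string member;
    std::string spelling;
};

// Produce the text of a generated header declaring one Identifier per entry.
std::string make_header(const std::vector<IdEntry> &table) {
    std::string out = "struct Id\n{\n";
    for (const IdEntry &e : table) {
        assert(!e.member.empty());      // every entry needs a member name
        out += "    static Identifier *" + e.member + ";\n";
    }
    out += "};\n";
    return out;
}

// Produce the matching definitions for the generated source file,
// recording the D spelling in a trailing comment.
std::string make_source(const std::vector<IdEntry> &table) {
    std::string out;
    for (const IdEntry &e : table)
        out += "Identifier *Id::" + e.member + "; // \"" + e.spelling + "\"\n";
    return out;
}
```

A real generator would write these strings to files in a build step that runs before the rest of the compiler is compiled, which is why the table above says to compile and run idgen.c first.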

Abbreviations

STC - STorage Class
ILS - InLine State
IR - Intermediate Representation

Abbreviations (Back end)

VBE - Very Busy Expression (http://web.cs.wpi.edu/~kal/PLT/PLT9.6.html)
CP - Copy Propagation info (?)
AE - Arithmetic Expression?

Whirlwind tour of the AST

The AST built by the compiler is composed of classes of three main kinds, all of which inherit from Object:
  • Expression - Nodes for operations, assignments, and the like derive from Expression. All expressions have an interpret method which does CTFE.
  • Statement - Base class for top-level function statements. Among these is ExpStatement, a statement which consists of an expression.
  • Dsymbol - A "D symbol". Serves as an abstract base for anything which is declared, such as classes/structs and variable declarations. Most (all?) objects which inherit from Dsymbol wind up getting written to the output object file.
    • AttribDeclaration - Base class for things like access modifiers, pragma, debug (which is in turn the base class of version)
    • Import, enum, and static assert are also subclasses of Dsymbol
    • ScopeDsymbol - A symbol which creates a scope for its children. Base class of with blocks, enum declarations, and templates.
      • Class, struct, and interface declarations also inherit ScopeDsymbol. (actually, InterfaceDeclaration extends ClassDeclaration)
    • Declaration - Base class for pretty much all declarations
      • typedef, alias,
      • variables, typeinfo,
      • functions, function literals,
      • ctor, dtor, invariant, new, unittest, etc
      • Not class or struct (above)
  • Other notable AST types which extend Object directly:
    • Type, Argument, Initializer, Identifier
    • Catch
    • StringTable (a precursor to our associative arrays, I think)
    • DsymbolTable - "a table of Dsymbols".

How to make the thing compile

There are a number of types that are stored in various nodes that are never actually used in the front end. They are merely stored and passed around as pointers.

  • Symbol - Appears to have something to do with the names used by the linker. Appears to be used by Dsymbol and its subclasses.
  • dt_t - "Data to be added to the data segment of the output object file" source: todt.c
  • elem - A node in the internal representation.

The code generator is split among the various AST nodes. Certain methods of almost every AST node are part of the code generator.

(It's an interesting solution to the problem. It would never have occurred to a Java programmer.)

Most notably:

  • All Statement subclasses must define a toIR method
  • All Expression subclasses must define a toElem method
  • Initializers and certain Expression subclasses must define toDt
  • Declarations must define toObjFile
  • Dsymbol subclasses must define toSymbol
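
One way to picture the list above is as a set of pure virtual methods over opaque back-end types, with do-nothing stubs supplied to get a front-end-only build to link. The sketch below is an assumption-laden illustration: the method names come from the list, but the parameter lists are simplified guesses, and the Stub* classes and emit are hypothetical.

```cpp
#include <cassert>

// Back-end types. The front end only stores and passes pointers to them,
// so forward declarations are all it needs.
struct Symbol;
struct elem;
struct dt_t;
struct IRState;

struct Expression {
    virtual ~Expression() {}
    virtual elem *toElem(IRState *irs) = 0;   // simplified signature
};
struct Statement {
    virtual ~Statement() {}
    virtual void toIR(IRState *irs) = 0;      // simplified signature
};
struct Initializer {
    virtual ~Initializer() {}
    virtual dt_t *toDt() = 0;                 // simplified signature
};
struct Dsymbol {
    virtual ~Dsymbol() {}
    virtual Symbol *toSymbol() = 0;
    virtual void toObjFile() = 0;             // simplified signature
};

// Hypothetical stubs: enough to compile and link without a back end.
struct StubExpression : Expression {
    elem *toElem(IRState *) override { return nullptr; }
};
struct StubStatement : Statement {
    int lowered = 0;
    void toIR(IRState *) override { ++lowered; }
};
struct StubDsymbol : Dsymbol {
    int emitted = 0;
    Symbol *toSymbol() override { return nullptr; }
    void toObjFile() override { ++emitted; }
};

// Driver side: ask a symbol for its back-end handle, then emit it.
void emit(Dsymbol *s) {
    assert(s != nullptr);
    s->toSymbol();
    s->toObjFile();
}
```

Because Symbol, elem and dt_t are never dereferenced here, swapping in a different back end only requires new method bodies, not changes to the node classes.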

Other things

Floating point libraries seem to be atrociously incompatible between compilers. Replacing strtold with strtod may be necessary, for instance. (this does "break" the compiler, however: it will lose precision on literals of type 'real')
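
The precision loss mentioned above can be checked with a small C++ sketch. loses_precision is a hypothetical helper, not something from the DMD source; it simply compares the two standard parsing functions on the same text.

```cpp
#include <cassert>
#include <cstdlib>

// Returns true when parsing `text` with strtod instead of strtold changes
// the value, i.e. when a literal of type 'real' would lose precision.
bool loses_precision(const char *text) {
    assert(text != nullptr);
    long double wide = std::strtold(text, nullptr);
    double narrow = std::strtod(text, nullptr);
    return wide != static_cast<long double>(narrow);
}
```

Values that are exactly representable in a double, such as 0.5, never differ. For a literal like 0.1 the answer depends on the platform: where long double is wider than double the function should report a loss, and where the two types have the same width it cannot.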

 -- AndyFriesen

Intermediate Representation

From NG:D.gnu/762

I've been looking at trying to hook the DMD frontend up to LLVM (www.llvm.org), but I've been having some trouble. The LLVM IR (Intermediate Representation) is very well documented, but I'm having a rough time figuring out how DMD holds its IR. Since at least three people (David, Ben, and Walter) seem to understand it, I thought I'd ask for guidance.

What's the best way to traverse the DMD IR once I've run the three semantic phases? As far as I can tell it's all held in the SymbolTable as a bunch of Symbols. Is there a good way to traverse that and reconstruct it into another IR?


From NG:D.gnu/764

There isn't a generic visitor interface. Instead, there are several methods which are responsible for emitting code/data and then calling that method for child objects. Start by implementing Module::genobjfile and loop over the 'members' array, calling each Dsymbol object's toObjFile method. From there, you will need to implement these methods:

Dsymbol (and descendents) ::toObjFile -- Emits code and data for objects that generally have a symbol name and storage in memory. Containers like ClassDeclaration also have a 'members' array with child Dsymbols. Most of these are descendents of the Declaration class.

Statement (and descendents) ::toIR -- Emits instructions. Usually, you just call toObjFile, toIR, toElem, etc. on the statement's fields and string the results together in the IR.

Expression (and descendents) ::toElem -- Returns a back end representation of numeric constants, variable references, and operations that expression trees are composed of. This was very simple for GCC because the back end already had the code to convert expression trees to ordered instructions. If LLVM doesn't do this, I think you could generate the instructions here since LLVM has SSA.

Type (and descendents) ::toCtype -- Returns the back end representation of the type. Note that a lot of classes don't override this -- you just need to do a switch on the 'ty' field in Type::toCtype.

Dsymbol (and descendents) ::toSymbol -- returns the back end reference to the object. For example, FuncDeclaration::toSymbol could return a llvm::Function. These are already implemented in tocsym.c, but you will probably rewrite them to create LLVM objects.


(Thread: DigitalMars:d/archives/D/gnu/762.html)
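
The traversal described in the reply can be sketched in C++ as below. This is an illustration under assumptions, not the real interface: the actual toObjFile and genobjfile take different parameters, and the `out` vector exists only so that the example can record the order in which symbols are emitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Dsymbol {
    std::string name;
    explicit Dsymbol(const std::string &n) : name(n) { assert(!n.empty()); }
    virtual ~Dsymbol() {}
    // Emit this symbol. Here "emitting" just records the name.
    virtual void toObjFile(std::vector<std::string> &out) {
        out.push_back(name);
    }
};

// A container symbol: emits itself, then each child in 'members'.
struct ClassDeclaration : Dsymbol {
    std::vector<Dsymbol *> members;
    using Dsymbol::Dsymbol;
    void toObjFile(std::vector<std::string> &out) override {
        out.push_back(name);
        for (Dsymbol *m : members)
            m->toObjFile(out);
    }
};

struct Module {
    std::vector<Dsymbol *> members;
    // Entry point: loop over the top-level members of the module.
    void genobjfile(std::vector<std::string> &out) {
        for (Dsymbol *m : members)
            m->toObjFile(out);
    }
};
```

There is no separate visitor object: each node decides how to emit itself and when to recurse, which is why a new back end has to reimplement these methods rather than plug in a walker.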

The Back End

DMD's internal representation uses expression trees with 'elem' nodes (defined in el.h). The "Rosetta Stone" for understanding the backend is enum OPER in oper.h. This lists all the types of nodes which can be in an expression tree.
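
To make the idea of an expression tree of operator-tagged nodes concrete, here is a minimal C++ sketch. MiniOper and MiniElem are hypothetical, heavily reduced stand-ins: the real node type is elem in el.h and the real operator list is enum OPER in oper.h, both of which carry far more than is shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical three-operator subset, for illustration only.
enum MiniOper { MINI_CONST, MINI_ADD, MINI_MUL };

// A tree node: an operator tag plus either a constant or two children.
struct MiniElem {
    MiniOper op;
    long value;        // used when op == MINI_CONST
    MiniElem *left;    // used by binary operators
    MiniElem *right;   // used by binary operators
};

// Render a tree in prefix form, e.g. "(mul (add 2 3) 4)".
std::string render(const MiniElem *e) {
    assert(e != nullptr);
    if (e->op == MINI_CONST)
        return std::to_string(e->value);
    const char *name = (e->op == MINI_ADD) ? "add" : "mul";
    return std::string("(") + name + " " + render(e->left) + " " +
           render(e->right) + ")";
}
```

Everything that walks such a tree starts by looking at the operator tag, which is why the operator enumeration is the best place to begin reading.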

If you compile dmd with debug on, and compile with:

  -O --c

you'll get reports of the various optimizations done.

Other useful undocumented flags:

 --b  show block optimisation
 --f  full output
 --r  show register allocation
 --x  suppress predefined C++ stuff
 --y  show output to Intermediate Language (IL) buffer

Others which are present in the back-end but not exposed as DMD flags are:
 debuge show exception handling info
 debugs show common subexpression eliminator

The most important entry point from the front-end to the backend is writefunc() in out.c, which optimises a function, and then generates code for it.

writefunc() sets up the parameters, then calls codgen() to generate the code inside the function. The sequence is:
  • generate code for each block
  • put variables in registers
  • generate the function start code
  • do peephole optimisation (cod3.pinholeopt())
  • do jump optimisation
  • emit the generated code in codout()
  • write switch tables
  • write exception tables (nteh_gentables() or except_gentables())

In cgcod.c, blcodgen() generates code for a basic block. Deals with the way the block ends (return, switch, if, etc).

cod1.gencodelem() does the codegen inside the block. It just calls codelem().

cgcod.codelem() generates code for an elem. This distributes code generation depending on elem type.
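
Dispatching code generation on the elem type, as described for codelem(), can be sketched with a table of generator functions indexed by operator. This is one plausible shape only: the node type, the operator names, the cd_* functions and the stack-machine output are all hypothetical, and whether the real codelem() uses a table or a switch is not stated here.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum MiniOper { MINI_CONST, MINI_ADD, MINI_MUL, MINI_OPER_COUNT };

struct MiniElem {
    MiniOper op;
    long value;
    MiniElem *left;
    MiniElem *right;
};

typedef std::vector<std::string> Code;

void codelem(const MiniElem *e, Code &out);   // defined below

// One generator per operator; binary ones recurse into their operands.
static void cd_const(const MiniElem *e, Code &out) {
    out.push_back("push " + std::to_string(e->value));
}
static void cd_add(const MiniElem *e, Code &out) {
    codelem(e->left, out);
    codelem(e->right, out);
    out.push_back("add");
}
static void cd_mul(const MiniElem *e, Code &out) {
    codelem(e->left, out);
    codelem(e->right, out);
    out.push_back("mul");
}

typedef void (*Generator)(const MiniElem *, Code &);
static const Generator generators[MINI_OPER_COUNT] = {
    cd_const, cd_add, cd_mul
};

// Pick the generator that matches the node's operator and run it.
void codelem(const MiniElem *e, Code &out) {
    assert(e != nullptr);
    generators[e->op](e, out);
}
```

Adding a new operator in this scheme means adding one enumerator and one table entry, leaving the dispatcher itself untouched.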

Most x86 integer code generation happens in cod1.c, cod2.c, cod3.c, cod4.c, and cod5.c. Floating-point code generation happens in cg87. Compared to the integer code generation, the x87 code generator is extremely simple. Most importantly, it cannot cope with common subexpressions. This is the primary reason why it is less efficient than compilers from many other vendors.

Optimiser

The main optimiser is in go.c, optfunc(). This calls:
  • blockopt.c blockopt(iter) -- branch optimisation on basic blocks, iter = 0 or 1.
  • gother.c constprop() -- constant propagation
  • gother.c copyprop() -- copy propagation
  • gother.c rmdeadass() -- remove dead assignments
  • gother.c verybusyexp() -- very busy expressions
  • gother.c deadvar() -- eliminate dead variables
  • gloop.c loopopt() -- remove loop invariants and induction vars. Do loop rotation
  • gdag.c boolopt() -- optimize booleans.
  • gdag.c builddags() -- common subexpressions
  • el.c el_convert() -- Put float and string literals into the data segment
  • el.c el_combine() -- merges two expressions (uses a comma-expression to join them).
  • glocal.c localize() -- improve expression locality
  • cod3.c pinholeopt() -- Performs peephole optimisation. Doesn't do much, could do a lot more.
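
As an example of the kind of rewrite a peephole pass performs, here is a minimal C++ sketch. It is not derived from pinholeopt(): the Insn type, the textual opcodes and the two rules are hypothetical, chosen only because they are safe in isolation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical textual instruction: opcode plus up to two operands.
// For "jmp" and "label", dst holds the label name.
struct Insn {
    std::string op;
    std::string dst;
    std::string src;
};

// A single pass with two rules:
//   1. drop "mov r, r" (a register moved onto itself),
//   2. drop "jmp L" when the very next instruction is "label L".
std::vector<Insn> peephole(const std::vector<Insn> &in) {
    std::vector<Insn> out;
    for (std::size_t k = 0; k < in.size(); ++k) {
        const Insn &i = in[k];
        assert(!i.op.empty());
        if (i.op == "mov" && i.dst == i.src)
            continue;
        if (i.op == "jmp" && k + 1 < in.size() &&
            in[k + 1].op == "label" && in[k + 1].dst == i.dst)
            continue;
        out.push_back(i);
    }
    return out;
}
```

Each rule looks at a window of one or two neighbouring instructions, which is what makes the technique cheap. A fuller pass would repeat until nothing changes, since one removal can expose another.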

Code generation

The code generation for each function is done individually. Each function is placed into its own COMDAT segment in the obj file. The function is divided into blocks, which are linear sections of code ending with a jump or other control instruction (http://en.wikipedia.org/wiki/Basic_block).
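
Dividing a function into blocks can be sketched as follows. This is a generic illustration of basic-block splitting, not the back end's own algorithm: the instruction spelling, the set of control opcodes, and the rule that a label also starts a new block are assumptions made for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef std::vector<std::string> Block;

// Hypothetical opcode spellings for instructions that end a block.
static bool is_control(const std::string &op) {
    return op == "jmp" || op == "jcc" || op == "ret";
}

// A label is written with a trailing colon, e.g. "L1:".
static bool is_label(const std::string &op) {
    return !op.empty() && op.back() == ':';
}

// Split a linear instruction list into blocks: a block ends after a
// control instruction, and a label always begins a new block.
std::vector<Block> split_blocks(const std::vector<std::string> &ops) {
    std::vector<Block> blocks;
    Block current;
    for (const std::string &op : ops) {
        assert(!op.empty());
        if (is_label(op) && !current.empty()) {
            blocks.push_back(current);
            current.clear();
        }
        current.push_back(op);
        if (is_control(op)) {
            blocks.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        blocks.push_back(current);
    return blocks;
}
```

For the input cmp, jcc, add, L1:, sub, ret this yields three blocks: the compare and branch, the fall-through add, and the labelled tail ending in ret.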

Scheduler (cgsched.c)

Pentium only

