Old-school assemblers were typically hand-coded in assembler and used adhoc parsing techniques to process assembly source lines to produce actual assembler code. When assembler syntax is simple (e.g. Always OPCODE REG, OPERAND) this worked well enough.
Old-school assemblers were typically hand-coded in assembler and used adhoc parsing techniques to process assembly source lines to produce actual assembler code. When assembler syntax is simple (e.g. Always OPCODE REG, OPERAND) this worked well enough. Modern machines have messy, nasty instruction sets with lots of instruction variations and operands, which may be expressed with complex syntax allowing multiple index registers to participate in the operand expression.
Allowing sophisticated assembly-time expressions with fixed and relocatable constants with various types of addition operators complicates this. Sophisticated assemblers allowing conditional compilation, macros, structured data declarations, etc.All add new demands on syntax. Processing all this syntax by ad hoc methods is very hard and is the reason that parser generators were invented.
Using a BNF and a parser generator is very reasonable way to build a modern assembler, even for a legacy processor such as the Z80. I have built such assemblers for Motorola 8 bit machines such as the 6800/6809, and am getting ready to do the same for a modern x86. I think you're headed down exactly the right path.
****** EDIT ************ The OP asked for example lexer and parser definitions. I've provided both here. These are excerpts from real specifications for a 6809 asssembler.
The complete definitions are 2-3x the size of the samples here.To keep space down, I have edited out much of the dark-corner complexity which is the point of these definitions. One might be dismayed by the apparant complexity; the point is that with such definitions, you are trying to describe the shape of the language, not code it procedurally. You will pay a significantly higher complexity if you code all this in an ad hoc manner, and it will be far less maintainable.It will also be of some help to know that these definitions are used with a high-end program analysis system that has lexing/parsing tools as subsystems, called the The DMS Software Reengineering Toolkit.
DMS will automatically build ASTs from the grammar rules in the parser specfication, which makes it a lot easier to buid parsing tools. Lastly, the parser specification contains so-called "prettyprinter" declarations, which allows DMS to regenreate source text from the ASTs.(The real purpose of the grammer was to allow us to build ASTs representing assembler instructions, and then spit them out to be fed to a real assembler!) One thing of note: how lexemes and grammar rules are stated (the metasyntxax! ) varies somewhat between different lexer/parser generator systems.
The syntax of DMS-based specifications is no exception. DMS has relatively sophisticated grammar rules of its own, that really aren't practical to explain in the space available here. You'll have to live with idea that other systems use similar notations, for EBNF for rules and and regular expression variants for lexemes.
Given the OP's interests, he can implement similar lexer/parsers with any lexer/parser generator tool, e.g. , FLEX/YACC, JAVACC, ANTLR, ... ****** LEXER ********** -- M6809. Lex: Lexical Description for M6809 -- Copyright (C) 1989,1999-2002 Ira D. Baxter %% #mainmode Label #macro digit "0-9" #macro hexadecimaldigit "|a-fA-F" #macro comment_body_character "\u0009 \u0020-\u007E" -- does not include NEWLINE #macro blank "\u0000 \ \u0009" #macro hblanks "+" #macro newline "\u000d \u000a?
\u000c? | \u000a \u000c? " -- form feed allowed only after newline #macro bare_semicolon_comment "\; * " #macro bare_asterisk_comment "\* * " ...snip #macro hexadecimal_digit " | a-fA-F" #macro binary_digit "01" #macro squoted_character "\' \u0021-\u007E" #macro string_character "\u0009 \u0020-\u007E" %%Label -- (First mode) processes left hand side of line: labels, opcodes, etc.#skip "(*)+" #skip "(*)*+" > #precomment "" #preskip "(*)+" #preskip "(*)*+" > -- Note that an apparant register name is accepted as a label in this mode #token LABEL STRING "" = ThisCharacterCode Ordinala) (-= ThisCharacterCode #20) ; fold to upper case )ifthen (= (@ Result) (append (@ Result) (coerce character ThisCharacterCode)))= );; )while );; )local (=?
:Lexeme:Literal:String:Format (LiteralFormat:MakeCompactStringLiteralFormat 0)) ; nothing interesting in string (GotoLabelList? ) >> %%OpcodeField #skip "" > #ifnotoken > -- Opcode field tokens #token 'ABA' "aAbBaA" > #token 'ABX' "aAbBxX" > #token 'ADC' "aAdDcC" > #token 'ADCA' "aAdDcCaA" > #token 'ADCB' "aAdDcCbB" > #token 'ADCD' "aAdDcCdD" > #token 'ADD' "aAdDdD" > #token 'ADDA' "aAdDdDaA" > #token 'ADDB' "aAdDdDbB" > #token 'ADDD' "aAdDdDdD" > #token 'AND' "aAnNdD" > #token 'ANDA' "aAnNdDaA" > #token 'ANDB' "aAnNdDbB" > #token 'ANDCC' "aAnNdDcCcC" > ...long list of opcodes snipped #token IDENTIFIER STRING "" = ThisCharacterCode Ordinala) (-= ThisCharacterCode #20) ; fold to upper case )ifthen (= (@ Result) (append (@ Result) (coerce character ThisCharacterCode)))= );; )while );; )local (=? :Lexeme:Literal:String:Format (LiteralFormat:MakeCompactStringLiteralFormat 0)) ; nothing interesting in string (GotoOperandField?
) >> #token '#' "\#" -- special constant introduction (FDB) > #token NUMBER NATURAL "" > #token NUMBER NATURAL "\$ +" > #token NUMBER NATURAL "\% +" > #token CHARACTER CHARACTER "" > %%OperandField #skip "" > #ifnotoken > -- Tokens signalling switch to index register modes #token ',' "\," > #token '' "\" > -- Operators for arithmetic syntax #token '! ' "\! \!
" #token '! ' "\!" #token '##' "\#\#" #token '#' "\#" #token '&' "\&" #token '(' "\(" #token ')' "\)" #token '*' "\*" #token '+' "\+" #token '-' "\-" #token '/' "\/" #token '//' "\/\/" #token '' "\>" #token '>' "\>" #token '>=' "\>\=" #token '>>' "\>\>" #token '>/' "\>\/" #token '\\' "\\" #token '|' "\|" #token '||' "\|\|" #token NUMBER NATURAL "" > #token NUMBER NATURAL "\$ +" > #token NUMBER NATURAL "\% +" > -- Notice that an apparent register is accepted as a label in this mode #token IDENTIFIER STRING "" = ThisCharacterCode Ordinala) (-= ThisCharacterCode #20) ; fold to upper case )ifthen (= (@ Result) (append (@ Result) (coerce character ThisCharacterCode)))= );; )while );; )local (=? :Lexeme:Literal:String:Format (LiteralFormat:MakeCompactStringLiteralFormat 0)) ; nothing interesting in string >> %%Register -- operand field for TFR, ANDCC, ORCC, EXG opcodes #skip "" #ifnotoken > %%RegisterField -- handles registers and indexing mode syntax -- In this mode, names that look like registers are recognized as registers #skip "" > #ifnotoken > #token '' "\" #token '' "\" #token '--' "\-\-" #token '++' "\+\+" #token 'A' "aA" #token 'B' "bB" #token 'CC' "cCcC" #token 'DP' "dDpP | dDpPrR" -- DPR shouldnt be needed, but found one instance #token 'D' "dD" #token 'Z' "zZ" -- Index register designations #token 'X' "xX" #token 'Y' "yY" #token 'U' "uU" #token 'S' "sS" #token 'PCR' "pPcCrR" #token 'PC' "pPcC" #token ',' "\," -- Operators for arithmetic syntax #token '!
' "\! \! " #token '!
' "\!" #token '##' "\#\#" #token '#' "\#" #token '&' "\&" #token '(' "\(" #token ')' "\)" #token '*' "\*" #token '+' "\+" #token '-' "\-" #token '/' "\/" #token '' "\>" #token '>' "\>" #token '>=' "\>\=" #token '>>' "\>\>" #token '>|' "\>\|" #token '\\' "\\" #token '|' "\|" #token '||' "\|\|" #token NUMBER NATURAL "" > ... snip %% -- end M6809. Lex ************ PARSER ********** -- M6809. ATG: Motorola 6809 assembly code parser -- (C) Copyright 1989;1999-2002 Ira D.
Baxter; All Rights Reserved m6809 = sourcelines ; sourcelines = ; sourcelines = sourcelines sourceline EOL ; >: { V(CV(sourcelines1),H(sourceline,A(EOL))); } -- leading opcode field symbol should be treated as keyword. Sourceline = ; sourceline = labels ; sourceline = optional_labels 'EQU' expression ; >: { H(optional_labels,A('EQU'),A(expression)); } sourceline = LABEL 'SET' expression ; >: { H(A(LABEL),A('SET'),A(expression)); } sourceline = optional_label instruction ; >: { H(optional_label,instruction); } sourceline = optional_label optlabelleddirective ; >: { H(optional_label,optlabelleddirective); } sourceline = optional_label implicitdatadirective ; >: { H(optional_label,implicitdatadirective); } sourceline = unlabelleddirective ; sourceline = '? ERROR' ; >: { A('?
ERROR'); } optional_label = labels ; optional_label = LABEL ':' ; >: { H(A(LABEL),':'); } optional_label = ; optional_labels = ; optional_labels = labels ; labels = LABEL ; >: { A(LABEL); } labels = labels ',' LABEL ; >: { H(labels1,',',A(LABEL)); } unlabelleddirective = 'END' ; >: { A('END'); } unlabelleddirective = 'END' expression ; >: { H(A('END'),A(expression)); } unlabelleddirective = 'IF' expression EOL conditional ; >: { V(H(A('IF'),H(A(expression),A(EOL))),CV(conditional)); } unlabelleddirective = 'IFDEF' IDENTIFIER EOL conditional ; >: { V(H(A('IFDEF'),H(A(IDENTIFIER),A(EOL))),CV(conditional)); } unlabelleddirective = 'IFUND' IDENTIFIER EOL conditional ; >: { V(H(A('IFUND'),H(A(IDENTIFIER),A(EOL))),CV(conditional)); } unlabelleddirective = 'INCLUDE' FILENAME ; >: { H(A('INCLUDE'),A(FILENAME)); } unlabelleddirective = 'LIST' expression ; >: { H(A('LIST'),A(expression)); } unlabelleddirective = 'NAME' IDENTIFIER ; >: { H(A('NAME'),A(IDENTIFIER)); } unlabelleddirective = 'ORG' expression ; >: { H(A('ORG'),A(expression)); } unlabelleddirective = 'PAGE' ; >: { A('PAGE'); } unlabelleddirective = 'PAGE' HEADING ; >: { H(A('PAGE'),A(HEADING)); } unlabelleddirective = 'PCA' expression ; >: { H(A('PCA'),A(expression)); } unlabelleddirective = 'PCC' expression ; >: { H(A('PCC'),A(expression)); } unlabelleddirective = 'PSR' expression ; >: { H(A('PSR'),A(expression)); } unlabelleddirective = 'TABS' numberlist ; >: { H(A('TABS'),A(numberlist)); } unlabelleddirective = 'TITLE' HEADING ; >: { H(A('TITLE'),A(HEADING)); } unlabelleddirective = 'WITH' settings ; >: { H(A('WITH'),A(settings)); } settings = setting ; settings = settings ',' setting ; >: { H*; } setting = 'WI' '=' NUMBER ; >: { H*; } setting = 'DE' '=' NUMBER ; >: { H*; } setting = 'M6800' ; setting = 'M6801' ; setting = 'M6809' ; setting = 'M6811' ; -- collects lines of conditional code into blocks conditional = 'ELSEIF' expression EOL conditional ; >: { V(H(A('ELSEIF'),H(A(expression),A(EOL))),CV(conditional1)); } conditional = 'ELSE' EOL else ; >: { V(H(A('ELSE'),A(EOL)),CV(else)); } conditional = 'FIN' ; >: { A('FIN'); } conditional = sourceline EOL conditional ; >: { V(H(sourceline,A(EOL)),CV(conditional1)); } else = 'FIN' ; >: { A('FIN'); } else = sourceline EOL else ; >: { V(H(sourceline,A(EOL)),CV(else1)); } -- keyword-less directive, generates data tables implicitdatadirective = implicitdatadirective ',' implicitdataitem ; >: { H*; } implicitdatadirective = implicitdataitem ; implicitdataitem = '#' expression ; >: { A(H('#',expression)); } implicitdataitem = '+' expression ; >: { A(H('+',expression)); } implicitdataitem = '-' expression ; >: { A(H('-',expression)); } implicitdataitem = expression ; >: { A(expression); } implicitdataitem = STRING ; >: { A(STRING); } -- instructions valid for m680C (see Software Dynamics ASM manual) instruction = 'ABA' ; >: { A('ABA'); } instruction = 'ABX' ; >: { A('ABX'); } instruction = 'ADC' 'A' operandfetch ; >: { H(A(H('ADC','A')),A(operandfetch)); } instruction = 'ADC' 'B' operandfetch ; >: { H(A(H('ADC','B')),A(operandfetch)); } instruction = 'ADCA' operandfetch ; >: { H(A('ADCA'),A(operandfetch)); } instruction = 'ADCB' operandfetch ; >: { H(A('ADCB'),A(operandfetch)); } instruction = 'ADCD' operandfetch ; >: { H(A('ADCD'),A(operandfetch)); } instruction = 'ADD' 'A' operandfetch ; >: { H(A(H('ADD','A')),A(operandfetch)); } instruction = 'ADD' 'B' operandfetch ; >: { H(A(H('ADD','B')),A(operandfetch)); } instruction = 'ADDA' operandfetch ; >: { H(A('ADDA'),A(operandfetch)); } ..snip... -- condition code mask for ANDCC and ORCC conditionmask = '#' expression ; >: { H*; } conditionmask = expression ; target = expression ; operandfetch = '#' expression ; --immediate >: { H*; } operandfetch = memoryreference ; operandstore = memoryreference ; memoryreference = '' indexedreference '' ; >: { H*; } memoryreference = indexedreference ; indexedreference = offset ; indexedreference = offset ',' indexregister ; >: { H*; } indexedreference = ',' indexregister ; >: { H*; } indexedreference = ',' '--' indexregister ; >: { H*; } indexedreference = ',' '-' indexregister ; >: { H*; } indexedreference = ',' indexregister '++' ; >: { H*; } indexedreference = ',' indexregister '+' ; >: { H*; } offset = '>' expression ; -- page zero ref >: { H*; } offset = ': { H*; } offset = expression ; offset = 'A' ; offset = 'B' ; offset = 'D' ; registerlist = registername ; registerlist = registerlist ',' registername ; >: { H*; } registername = 'A' ; registername = 'B' ; registername = 'CC' ; registername = 'DP' ; registername = 'D' ; registername = 'Z' ; registername = indexregister ; indexregister = 'X' ; indexregister = 'Y' ; indexregister = 'U' ; -- not legal on M6811 indexregister = 'S' ; indexregister = 'PCR' ; indexregister = 'PC' ; expression = sum '=' sum ; >: { H*; } expression = sum ': { H*; } expression = sum ': { H*; } expression = sum ': { H*; } expression = sum ': { H*; } expression = sum '>>' sum ; >: { H*; } expression = sum '>/' sum ; >: { H*; } expression = sum '>=' sum ; >: { H*; } expression = sum '>' sum ; >: { H*; } expression = sum '#' sum ; >: { H*; } expression = sum ; sum = product ; sum = sum '+' product ; >: { H*; } sum = sum '-' product ; >: { H*; } sum = sum '! ' product ; >: { H*; } sum = sum '! ' product ; >: { H*; } product = term '*' product ; >: { H*; } product = term '||' product ; -- wrong?
>: { H*; } product = term '/' product ; >: { H*; } product = term '//' product ; >: { H*; } product = term '&' product ; >: { H*; } product = term '##' product ; >: { H*; } product = term ; term = '+' term ; >: { H*; } term = '-' term ; >: { H*; } term = '\\' term ; -- complement >: { H*; } term = '&' term ; -- not term = IDENTIFIER ; term = NUMBER ; term = CHARACTER ; term = '*' ; term = '(' expression ')' ; >: { H*; } numberlist = NUMBER ; numberlist = numberlist ',' NUMBER ; >: { H*; }.
Ira, thank you for good feedback(I tried to rank this up but I'm not experienced enough lol). I wondered if you had a lexer and/or parser example which uses this approach so that I might be able to take a look or if not, any chance of some pseudo code to get me moving forward with this :). Best Regards – Gary Paluk Aug 23 '09 at 9:40 1 @Gary: Beware what you ask for :} See edits to my small answer turning it into a giant one.
– Ira Baxter Aug 23 '09 at 17:44 @Gary: PS, to "rank this up", simply click the upward facing triangle above the score next to the start of the answer. :-} – Ira Baxter Aug 23 '09 at 17:46 This is frickin awesome...gonna go and read it thoroughly! Thanks Sadly, I don't have a high enough personal ranking to allow me to rank this up!
Could someone please do it for me! :D – Gary Paluk Aug 23 '09 at 19:53 OK...Starting with the lexer, I know your code is doing what I want to acheive, I have no idea how to represent it in my target language ECMA Script. Staring at the very begining I come stuck nearly instantly with how to represent the macros, I wrote some methods to do checks such as: // Hexidecimal Digit public function hexidecimalDigit( str: String ): Boolean { return decimalDigit( str ) || "A" || "B" || "C" || "D" || "E" || "F"; } or checks but once I get to multiple character inputs it goes terribly wrong lol... public function opCode( str: String ): Boolean { //?
} GP – Gary Paluk Aug 23 '097 at 2:07.
BNF is more generally used for structured, nested languages like Pascal, C++, or really anything derived from the Algol family (which includes modern languages like C#). If I were implementing an assembler, I might use some simple regular expressions to pattern-match the opcode and operands. It's been a while since I've used Z80 assembly language, but you might use something like: /\s*(\w{2,3})\s+((\w+)(,\w+)?)?
/ This would match any line which consists of a two- or three-letter opcode followed by one or two operands separated by a comma. After extracting an assembler line like this, you would look at the opcode and generate the correct bytes for the instruction, including the values of the operands if applicable. The type of parser I've outlined above using regular expressions would be called an "ad hoc" parser, which essentially means you split and examine the input on some kind of block basis (in the case of assembly language, by text line).
I don't think you need overthink it. There's no point making a parser that takes apart “LD A,A� Into a load operation, destination and source register, when you can just string match the whole thing (modulo case and whitespace) into one opcode directly.
There aren't that many opcodes, and they aren't arranged in such a way that you really get much benefit from parsing and understanding the assembler IMO. Obviously you'd need a parser for the byte/address/indexing arguments, but other than that I'd just have a one-to-one lookup.
Thanks for your feedback... I agree that going down a more simplistic path but also I'm interested in extending this out into a more complex language and figure I would like make use of the features from the beginning... Also there are differences in some other parts of ASM such as equ, . Db, . Ds, .
Dw, #include, () and then we start to get into simple case statements IF ELSE. Furthermore, this is also an excercise in learning the concepts of BNF and use it for this more simplistic implementation. – Gary Paluk Aug 23 '09 at 1:51.
I cant really gove you an answer,but what I can give you is a way to a solution, that is you have to find the anglde that you relate to or peaks your interest. A good paper is one that people get drawn into because it reaches them ln some way.As for me WW11 to me, I think of the holocaust and the effect it had on the survivors, their families and those who stood by and did nothing until it was too late.