Antlr Lexer 无法识别实数

Antlr Lexer cant recognize Real Number

提问人:Jacob Bauer 提问时间:10/29/2023 最后编辑:FrantJacob Bauer 更新时间:10/29/2023 访问量:49

问:

所以我一直在从头开始开发自己的编译器。我完成了汇编程序,并且我让它与整数一起工作。我想添加对组装成 binary32 或标准浮点格式的实数的支持。

我不断收到如下错误

line 194:9 mismatched input '8.0' expecting NUMBER
line 200:10 mismatched input '1.0' expecting NUMBER
line 203:10 mismatched input '2.0' expecting NUMBER
line 217:10 mismatched input '5.0' expecting NUMBER
line 221:10 mismatched input '3.1415' expecting NUMBER
line 222:10 mismatched input '12.0' expecting NUMBER

我正在使用的语法发布在下面:

请特别注意 REAL_NUMBER 和 NUMBER 词法分析器规则。 我的理解是,所有这些数字都应该与REAL_NUMBER的较长规则相匹配,但我想状态机只是卡在 NUMBER 状态内并给了我上述所有错误。

grammar ArmAssembler;

program: instructionOrDirective+;

instructionOrDirective: LABEL? (instruction | 
                                wordDirective | 
                                byteDirective);

instruction: bInstr
       | blInstr
           | bxInstr
       | ldmInstr
       | ldrSignedInstr
       | ldrDefInstr
       | mlaInstr
       | mrsInstr
       | msrDefInstr
       | msrPrivInstr
       | mulInstr
       | stmInstr
       | strSignedInstr
       | strDefInstr
       | swiInstr
       | swpInstr
       | addInstr
       | andInstr
       | eorInstr
       | subInstr
       | rsbInstr
       | adcInstr
       | sbcInstr
       | rscInstr
       | tstInstr
       | teqInstr
       | cmpInstr
       | cmnInstr
       | orrInstr
       | movInstr
       | bicInstr
       | mvnInstr
       | stopInstr
       ;

wordDirective: DOT_WORD number;
byteDirective: DOT_BYTE number;

bInstr : BRANCH expression;
blInstr : BRANCH_WITH_LINK expression;
bxInstr : BRANCH_WITH_EXCHANGE REG;
ldmInstr : LOAD_MEMORY REG EXP? COMMA rList BXOR?;
ldrSignedInstr : LOAD_SIGNED_REGISTER REG COMMA address;
ldrDefInstr : LOAD_REGISTER REG COMMA address;
mlaInstr : MULTIPLY_AND_ACUMULATE REG COMMA REG COMMA REG COMMA REG;
mrsInstr : MRS_INSTR REG COMMA psr;
msrDefInstr : MSR_INSTR psr COMMA REG;
msrPrivInstr : MSR_INSTR (psrf COMMA (REG | poundExpression));
mulInstr : MULTIPLY REG COMMA REG COMMA REG;
stmInstr : STORE_MEMORY REG EXP? COMMA rList BXOR?;
strSignedInstr : STORE_SIGNED_REGISTER REG COMMA address;
strDefInstr : STORE_REGISTER REG COMMA address;
swiInstr : SOFTWARE_INTERRUPT expression;
swpInstr : SWAP REG COMMA REG COMMA LBRACK REG RBRACK;
addInstr : ADDITION REG COMMA REG COMMA op2;
andInstr : LOGICAL_AND REG COMMA REG COMMA op2;
eorInstr : EXCLUSIVE_OR REG COMMA REG COMMA op2;
subInstr : SUBTRACTION REG COMMA REG COMMA op2;
rsbInstr : REVERSE_SUBTRACTION REG COMMA REG COMMA op2;
adcInstr : ADDITION_WITH_CARRY REG COMMA REG COMMA op2;
sbcInstr : SUBTRACTION_WITH_CARRY REG COMMA REG COMMA op2;
rscInstr : REVERSE_SUBTRACTION_WITH_CARRY REG COMMA REG COMMA op2;
orrInstr : LOGICAL_OR_INSTRUCTION REG COMMA REG COMMA op2;
bicInstr : BIT_CLEAR_INSTRUCTION REG COMMA REG COMMA op2;
tstInstr : TEST_BITS REG COMMA op2;
teqInstr : TEST_EQUALITY REG COMMA op2;
cmpInstr : COMPARE REG COMMA op2;
cmnInstr : COMPARE_NEGATIVE REG COMMA op2;
movInstr : MOVE REG COMMA op2;
mvnInstr : MOVE_NEGATIVE REG COMMA op2;
stopInstr: STOP;

op2 : REG (COMMA shift)?
    | poundExpression
    ;

shift : shiftName REG
      | shiftName poundExpression
      | RPX
      ;

rList : LCURL rValue (COMMA rValue)*  RCURL;
rValue : REG (MINUS REG)?;

/*
 * Below is code for dealing with expressions
 */
poundExpression: HASH expression;
expression : andExpr (LOR expression)?;
andExpr : relational (LAND andExpr)?;
relational : primary ((REQ|RNE|RLT|RGT|RLE|RGE) primary)?;
primary : bitwise ((PLUS|MINUS) primary)?;
bitwise : term ((BOR|BAND|BXOR) bitwise)?;
term: unary ((TIMES|DIV|MOD|LSHIFT|RSHIFT) term)?;
unary: (PLUS|MINUS)? single;
single: realNumber
      | number
      | identifier
      ;
identifier: IDENT
          | BRANCH
          | BRANCH_WITH_LINK
          | BRANCH_WITH_EXCHANGE
      | LOAD_MEMORY
      | LOAD_SIGNED_REGISTER
      | LOAD_REGISTER
      | MULTIPLY_AND_ACUMULATE
      | MRS_INSTR
      | MSR_INSTR
      | MULTIPLY
      | STORE_MEMORY
      | STORE_SIGNED_REGISTER
      | STORE_REGISTER
      | SOFTWARE_INTERRUPT
      | SWAP
      | ADDITION
      | LOGICAL_AND
      | EXCLUSIVE_OR
      | SUBTRACTION
      | REVERSE_SUBTRACTION
      | ADDITION
      | ADDITION_WITH_CARRY
      | SUBTRACTION_WITH_CARRY
      | REVERSE_SUBTRACTION_WITH_CARRY
      | LOGICAL_OR_INSTRUCTION
      | BIT_CLEAR_INSTRUCTION
      | TEST_BITS
      | TEST_EQUALITY
      | COMPARE
      | COMPARE_NEGATIVE
      | MOVE
      | MOVE_NEGATIVE
      | STOP
      | shiftName
      | psr
      | psrf
      | REG
      | RPX
      ;

realNumber: REAL_NUMBER;
number: NUMBER;


/*
 * Below is the code for dealing with addresses
 */

address : expression
    | preIndexedAddressing
    | postIndexedAddressing
    ;

preIndexedAddressing: LBRACK REG RBRACK
            | LBRACK REG COMMA poundExpression RBRACK EXP?
            | LBRACK REG COMMA (PLUS | MINUS)? REG 
                      (COMMA shift)? RBRACK EXP?
                    ;

postIndexedAddressing: LBRACK REG RBRACK COMMA poundExpression
                 | LBRACK REG RBRACK COMMA (PLUS | MINUS)? REG
                       (COMMA shift)?
             ;

shiftName: LSL
     | LSR
     | ASR
     | ROR
     ;

psr: CPSR
   | CPSR_ALL
   | SPSR
   | SPSR_ALL
   ;

psrf: CPSR_FLG
    | SPSR_FLG
    ;

/*
 * Below are the shift name variables
 */

ASL : A S L;
LSL : L S L;
LSR : L S R;
ASR : A S R;
ROR : R O R;

RPX : R P X;

/*
 * Below is code for condition codes
 */

BRANCH: B CONDITION_CODE?;
BRANCH_WITH_LINK: BL CONDITION_CODE?;
BRANCH_WITH_EXCHANGE: BX CONDITION_CODE?;
LOAD_MEMORY: LDM CONDITION_CODE? ADDRESSING_MODE;
LOAD_REGISTER: LDR CONDITION_CODE? B? T?;
LOAD_SIGNED_REGISTER: LDR CONDITION_CODE? TRANSFER_TYPE;
MULTIPLY_AND_ACUMULATE: MLA CONDITION_CODE? S?;
MRS_INSTR: MRS CONDITION_CODE?;
MSR_INSTR: MSR CONDITION_CODE?; 
MULTIPLY: MUL CONDITION_CODE? S?;
STORE_MEMORY: STM CONDITION_CODE? ADDRESSING_MODE?;
STORE_REGISTER: STR CONDITION_CODE? B? T?;
STORE_SIGNED_REGISTER: STR CONDITION_CODE? TRANSFER_TYPE;
SOFTWARE_INTERRUPT: SWI CONDITION_CODE?;
SWAP: SWP CONDITION_CODE? B?;
ADDITION: ADD CONDITION_CODE? S?;
LOGICAL_AND: AND CONDITION_CODE? S?;
EXCLUSIVE_OR: EOR CONDITION_CODE? S?;
SUBTRACTION: SUB CONDITION_CODE? S?;
REVERSE_SUBTRACTION: RSB CONDITION_CODE? S?;
ADDITION_WITH_CARRY: ADC CONDITION_CODE? S?;
SUBTRACTION_WITH_CARRY: SBC CONDITION_CODE? S?;
REVERSE_SUBTRACTION_WITH_CARRY: RSC CONDITION_CODE? S?;
LOGICAL_OR_INSTRUCTION: ORR CONDITION_CODE? S?;
BIT_CLEAR_INSTRUCTION: BIC CONDITION_CODE? S?;
TEST_BITS: TST CONDITION_CODE?;
TEST_EQUALITY: TEQ CONDITION_CODE?;
COMPARE: CMP CONDITION_CODE?;
COMPARE_NEGATIVE: CMN CONDITION_CODE?;
MOVE: MOV CONDITION_CODE? S?;
MOVE_NEGATIVE: MVN CONDITION_CODE? S?;
STOP: STP;

REG: R (([1][0-5]?)|[02-9]);
LABEL: IDENT WS* COLON;
IDENT: LETTER LETTER_OR_UNDERSCORE_OR_NUMBER*;

DOT_WORD: '.' WORD;
DOT_BYTE: '.' BYTE;

REAL_NUMBER: [0-9]+ '.' [0-9]+;
NUMBER: [+-]?(([1-9] [0-9]*)|[0]);

CPSR: C P S R;
CPSR_ALL: C P S R '_' A L L;
CPSR_FLG: C P S R '_' F L G;
SPSR: S P S R;
SPSR_ALL: S P S R '_' A L L;
SPSR_FLG: S P S R '_' F L G;    

EXP : '!';

WS : [ \t\r\n]+ -> skip;

//And here are some operators

COMMA : ',';
LCURL : '{';
RCURL : '}';
LBRACK: '[';
RBRACK: ']';
REQ : '==';
RNE : '!=';
RLE : '<=';
RLT : '<';
RGE : '>=';
RGT : '>';
TIMES : '*';
MINUS : '-';
PLUS : '+';
MOD : '%';
DIV : '/';
LSHIFT : '<<';
RSHIFT : '>>';
BAND : '&';
BOR : '|';
BXOR : '^';
LAND : '&&';
LOR : '||';
HASH : '#';
COLON: ':';

/*
 * The following are used for ldm and store memory instructions 
 */

 /*
 * Below is definitions of all of the tokens to be used
 * B is also declared bit it is declared later at the bottom
 */

fragment STP: S T P;
fragment ADC : A D C;
fragment ADD : A D D;
fragment AND : A N D;
fragment BIC : B I C;
fragment BL  : B L;
fragment BX  : B X;
fragment CMP : C M P;
fragment CMN : C M N;
fragment EOR : E O R;
fragment LDC : L D C;
fragment LDM : L D M;
fragment LDR : L D R;
fragment MCR : M C R;
fragment MLA : M L A;
fragment MOV : M O V;
fragment MRC : M R C;
fragment MRS : M R S;
fragment MSR : M S R;
fragment MUL : M U L;
fragment MVN : M V N;
fragment ORR : O R R;
fragment RSB : R S B;
fragment RSC : R S C;
fragment SBC : S B C;
fragment STC : S T C;
fragment STM : S T M;
fragment STR : S T R;
fragment SUB : S U B;
fragment SWI : S W I;
fragment SWP : S W P;
fragment TEQ : T E Q;  
fragment TST : T S T;
fragment BYTE: B Y T E;
fragment WORD: W O R D;

fragment SB: S B;
fragment SH: S H;

fragment FD: F D;
fragment ED: E D;
fragment FA: F A;
fragment EA: E A;
fragment IA: I A;
fragment IB: I B;
fragment DA: D A;
fragment DB: D B;

fragment TRANSFER_TYPE: H
              | SB
              | SH
              ;

fragment ADDRESSING_MODE: FD
            | ED
            | FA
            | EA
            | IA
            | IB
            | DA
            | DB
            ;

fragment CONDITION_CODE: EQ 
               | NE 
               | CS 
               | CC 
               | MI 
               | PL 
               | VS 
               | VC 
               | HI 
               | LS 
               | GE 
               | LT 
               | GT 
               | LE 
               | AL
               ;

fragment EQ : E Q;
fragment NE : N E;
fragment CS : C S;
fragment CC : C C;
fragment MI : M I;
fragment PL : P L;
fragment VS : V S;
fragment VC : V C;
fragment HI : H I;
fragment LS : L S;
fragment GE : G E;
fragment LT : L T;
fragment GT : G T;
fragment LE : L E;
fragment AL : A L;

fragment LETTER_OR_UNDERSCORE_OR_NUMBER: LETTER | '_' | [0-9];
fragment LETTER: (A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z);

fragment A : ('A'|'a');
fragment B : ('B'|'b');
fragment C : ('C'|'c');
fragment D : ('D'|'d');
fragment E : ('E'|'e');
fragment F : ('F'|'f');
fragment G : ('G'|'g');
fragment H : ('H'|'h');
fragment I : ('I'|'i');
fragment J : ('J'|'j');
fragment K : ('K'|'k');
fragment L : ('L'|'l');
fragment M : ('M'|'m');
fragment N : ('N'|'n');
fragment O : ('O'|'o');
fragment P : ('P'|'p');
fragment Q : ('Q'|'q');
fragment R : ('R'|'r');
fragment S : ('S'|'s');
fragment T : ('T'|'t');
fragment U : ('U'|'u');
fragment V : ('V'|'v');
fragment W : ('W'|'w');
fragment X : ('X'|'x');
fragment Y : ('Y'|'y');
fragment Z : ('Z'|'z');

我尝试将 realNumber 解析器规则更改为 DIGIT+ DOT DIGIT+,但它会抛出解析器错误。另外,我不希望它与 12 .10 之类的东西匹配(包括空格)。

因此,词法分析器规则应该是合适的解决方案。

这可能只是 Antlr 中的一个错误。 我正在使用 Antlr 4.3 作为 apache maven 插件。

antlr antlr4 语法

评论


答:

1赞 Mike Cargal 10/29/2023 #1

简短的回答...您的词法分析器规则确实将该输入识别为令牌,但您处于需要令牌的解析器规则中;因此,错误。REAL_NUMBERNUMBER

错误消息基本上说解析器正在寻找令牌,但是,它找到了文本为“8.0”的令牌。默认错误处理程序不会标识它遇到的输入的令牌类型,因为它是针对分析器用户的错误,因此文本输入对他们更有意义。您可以重写该错误处理程序,并吐出一条消息,该消息确实命名了在开发编译器时使用的令牌类型。NUMBER

您在示例中提供的示例输入(例如:)将按照语法中的以下词法分析器规则被标记为标记(而不是标记)。3.1415REAL_NUMBERNUMBER

REAL_NUMBER: [0-9]+ '.' [0-9]+;
NUMBER:      [+-]? (([1-9] [0-9]*) | [0]);

错误消息将指示分析器正在查找令牌,因此这将是预期行为。NUMBER

顺便说一句......有很多可能应该重新设计,特别是在你对 Lexer 规则的方法中,但我要指出的解析器规则是,以下规则对你没有任何好处。(并且可能会让您感到困惑)

realNumber: REAL_NUMBER;
number:     NUMBER;

从您的问题中,确实不可能确定哪个解析器规则正在寻找您拥有的位置(似乎它需要是一个正在寻找或规则的规则,因此最佳猜测是规则。但是,如果您提供测试输入以及您正在使用的起始规则,那将非常有帮助)。NUMBERREAL_NUMBERwordDirectivebyteDirectiveinstructionOrDirective

也。。转储输入的令牌流总是一个好主意,以确保您首先了解 Lexer 规则识别并传递给解析器的令牌流(with flag 是实现此目的的一种方法(这也是 ANTLR 插件中的常见功能)grun-tokens

评论

0赞 Jacob Bauer 10/29/2023
感谢您的回复。我简直不敢相信我错过了。老实说,这是我第一次使用堆栈溢出。感谢您的快速回复。