gavr-vlad-s / myauka

Lexical analyzer generator

Introduction

Project Myauka is a generator of lexical analyzers that produces the source code of a lexical analyzer in C++. There are by now quite a few such generators, for example Coco/R, flex, flex++, and flexc++, and this list is far from complete. However, all these generators share a common drawback: they essentially automate only checking that a lexeme is written correctly and detecting where it begins, while computing the value of the lexeme from its string representation must be done by a user-written function that is called after the token has been verified. As a result, firstly, the corresponding fragment of the input text is traversed twice, and secondly, part of the finite automaton constructed by the generator has to be reimplemented by hand. The proposed generator aims to eliminate this drawback.

Input file format

The input file with the description of a lexical analyzer consists of a sequence of the following commands (of which only the command %codes is required), which may appear in any order:
%scaner_name name_of_scaner
%codes_type name_of_type_of_lexeme_codes
%ident_name name_of_identifier
%token_fields added_fields_of_lexeme
%class_members added_members_of_scaner_class
%header_additions additions_to_the_header_file
%impl_additions additions_to_the_implementation_file
%lexem_info_name name_of_the_type_of_a_lexeme_information
%newline_is_lexem
%codes name_of_lexeme_code {, name_of_lexeme_code}
%keywords [actions_after_finishing:] string_representing_the_keyword : code_of_the_keyword {, string_representing_the_keyword : code_of_the_keyword}
%delimiters [actions_after_finishing:] string_representing_the_delimiter : code_of_the_delimiter {, string_representing_the_delimiter : code_of_the_delimiter}
%idents '{'description_of_the_identifier_begin'}' '{'description_of_the_identifier_body'}'
%numbers [initialization_actions]:[actions_after_finishing] {%action name_of_the_action action_definition} '{'expression'}'
%strings [initialization_actions]:[actions_after_finishing] {%action name_of_the_action action_definition} '{'expression'}'
%comments [%single_lined begin_of_a_single-line_comment] [%multilined [%nested] begin_of_multi-line_comment : end_of_multi-line_comment]

Before explaining the meaning of each of the above constructions, we agree that everything enclosed in square brackets is optional, and everything enclosed in braces can be repeated any number of times, including zero. In addition, '{' and '}' denote the curly braces themselves.

Further, note that a Myauka string literal (hereinafter simply a string literal) is any (possibly empty) character sequence enclosed in double quotes. To include a double quote in such a sequence, the quote must be doubled.

Let's now turn to the explanation of the commands describing the lexical analyzer (hereinafter, for brevity, the scanner).

First of all, if the command

%scaner_name name_of_scaner

is specified, then an entry of the form

class name_of_scaner {
    ...
};

appears in the generated header file.

This header file will be named name_of_scaner'.h, and the corresponding implementation file name_of_scaner'.cpp, where name_of_scaner' is name_of_scaner converted to lowercase. The default name_of_scaner is Scaner.
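
For example, with a purely illustrative scanner name, the command

%scaner_name Test_scaner

would produce a class Test_scaner declared in the file test_scaner.h and implemented in the file test_scaner.cpp.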

Further, if the command

%codes_type name_of_type_of_lexeme_codes

is specified, then the command %codes generates an entry of the form

enum name_of_type_of_lexeme_codes : unsigned short {  
    NONE,  
    UNKNOWN,  
    name_of_lexeme_code1,  
    ...  
    name_of_lexeme_codeN  
};

in the file name_of_scaner'.h, where name_of_lexeme_code1, ..., name_of_lexeme_codeN are the names of the lexeme codes defined in the section %codes. The default name of the type of lexeme codes is Lexem_code.

The command

%ident_name name_of_identifier

specifies the name of the lexeme code used for the lexeme 'identifier'. If there are no identifiers in the language for which the scanner is written, then the command %ident_name is optional.
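
For example, if identifiers should receive the (hypothetical) lexeme code Ident, then, assuming as with keywords and delimiters that this code is listed in the section %codes, one writes

%ident_name Ident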

If you need to add some fields into the lexeme description, then you need to write the command

%token_fields added_fields_of_lexeme

where added_fields_of_lexeme is a string literal with the description of the needed fields. For example, if a lexeme can hold both values of type __float128 and values of type __int128, and (according to the conditions of the problem) the field of type __float128 must be named x while the field of type __int128 must be named y, then the string literal with the fields added to the lexeme can look, for example, like this:

"__float128 x;
__int128 y;"

In addition, if you need to add members to the scanner class that are necessary for some calculations, then you need to write

%class_members added_members_of_scaner_class

where added_members_of_scaner_class is a string literal containing the list of members added to the scanner class. For example, if you need to add

__int128 integer_value;
__float128 integer_part;
__float128 fractional_part;
__float128 exponent;

then instead of added_members_of_scaner_class you need to write

"__int128 integer_value;
__float128 integer_part;
__float128 fractional_part;
__float128 exponent;"

If it is necessary for the character '\n' (the newline character) to be a separate lexeme rather than a whitespace character, then you need to specify the command

%newline_is_lexem

The required section %codes contains a comma-separated list of identifiers. These identifiers are the names of lexeme codes. The rules for constructing these identifiers are the same as in C++. For instance, if the name of the enumeration with the token codes is not specified by the command %codes_type, and the section %codes has the form

%codes
Kw_if, Kw_then, Kw_else, Kw_endif

then the enumeration

enum Lexem_code : unsigned short {  
    NONE,    UNKNOWN,  
    Kw_if,   Kw_then,   
    Kw_else, Kw_endif  
};

will be generated. In other words, two special lexeme codes are always defined: NONE, denoting the end of the processed text, and UNKNOWN, which denotes the unknown lexeme.

If you need to insert some text at the beginning of the header file of the generated scanner, then you need to write

%header_additions additions_to_the_header_file

where additions_to_the_header_file is the string literal with the inserted text.

Similarly, if you need to insert some text at the beginning of the implementation file of the generated scanner, then you need to write

%impl_additions additions_to_the_implementation_file

where additions_to_the_implementation_file is the string literal with the inserted text.
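
For example, to make an extra standard header available in the generated files (the particular header here is just an illustration), one could write

%header_additions "#include <cstdint>"

If the inserted text itself contains double quotes, they must be doubled according to the string literal rule above; for instance, the directive #include "number_utils.h" (a hypothetical header name) would be written as "#include ""number_utils.h""".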

The command

%lexem_info_name name_of_the_type_of_a_lexeme_information

defines the name of the type holding the information about a lexeme; name_of_the_type_of_a_lexeme_information is an identifier that becomes the name of this type. The default name of this type is Lexem_info.

In the section %keywords, the keywords of the language for which the scanner is written and the corresponding lexeme codes are specified. The lexeme codes are taken from the section %codes. For example, if the keywords are if, then, else, endif, and the corresponding lexeme codes are Kw_if, Kw_then, Kw_else, Kw_endif, then the section %keywords should have the following form:

%keywords
...
"if" : Kw_if,
"then" : Kw_then,
"else" : Kw_else,
"endif" : Kw_endif
...

Here the ellipsis indicates (possibly present) descriptions of other keywords.

The %idents section defines the structure of the identifier of the language for which the scanner is written. More precisely, description_of_the_identifier_begin defines what can be at the beginning of the identifier, and description_of_the_identifier_body defines the structure of the identifier body.
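
For example, a hypothetical description of C-style identifiers (a Latin letter or underscore at the beginning, followed by Latin letters, decimal digits, and underscores) could look as follows, assuming that the body expression lists the characters that may follow the first one:

%idents {[:Latin:]|[:latin:]|_}{[:Latin:]|[:latin:]|[:digits:]|_}

The character classes [:Latin:], [:latin:], and [:digits:] are described below.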

In the section %delimiters, the operation signs and delimiters of the language for which the scanner is written and the corresponding lexeme codes are specified. The lexeme codes are taken from the section %codes. For example, if the language has the delimiters <, >, <=, >=, =, !=, and the corresponding lexeme codes are del_LT, del_GT, del_LEQ, del_GEQ, del_EQ, del_NEQ, then the section %delimiters should have the form

%delimiters
...
"<" : del_LT,
">" : del_GT,
"<=" : del_LEQ,
">=" : del_GEQ,
"=" : del_EQ,
"!=" : del_NEQ
...

Here the ellipsis indicates (possibly present) descriptions of other delimiters and operation signs.

The section %numbers specifies the regular expression that defines the structure of numbers. Character-processing actions are embedded in this regular expression. Each of these actions must be described by the command

%action name_of_the_action action_definition

where name_of_the_action is a C++ identifier that is the name of the action being defined, and action_definition is a string literal containing the C++ code that performs the required action.
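
As a purely illustrative sketch, unsigned decimal integers could be described as follows; the member integer_value is assumed to have been added through %class_members, and the bodies of the actions are left as C++ comments because the exact interface available inside an action (for instance, how the current character is accessed) is not described here:

%numbers "integer_value = 0;" : "/* store integer_value in the lexeme information */"
%action add_digit "/* multiply integer_value by 10 and add the decimal digit just read */"
{[:digits:]$add_digit+}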

The section %strings describes the structure of character literals and string literals (if the language has such literals). The section %strings is organized in the same way as the section %numbers. If the section %strings is specified, then the scanner class contains the members std::string buffer and int char_code.

Finally, the section %comments describes the structure of comments.

Here the command

%single_lined begin_of_a_single-line_comment

defines the structure of the single-line comment, where begin_of_a_single-line_comment is a string literal representing the sequence of characters that is the beginning of a single-line comment.

The command

%multilined [%nested] begin_of_multi-line_comment : end_of_multi-line_comment

defines the structure of multi-line comments. Namely, begin_of_multi-line_comment and end_of_multi-line_comment are string literals giving the character sequences that begin and end a multi-line comment, respectively. If the keyword %nested is specified, then multi-line comments can be nested.
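
For example, C++-style comments, with the multi-line form additionally allowed to nest, could be described as

%comments %single_lined "//" %multilined %nested "/*" : "*/"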

We now explain what the lexical analyzer generator Myauka means by the description of the beginning of an identifier, the description of the body of an identifier, and the regular expression:

description_of_the_identifier_begin → expr
description_of_the_identifier_body → expr
expr → expr0 {'|' expr0}
expr0 → expr1 {expr1}
expr1 → expr2[?|*|+]
expr2 → character | character_class
character_class → [:Latin:] | [:latin:] | [:Russian:] | [:russian:] | [:bdigits:] | [:odigits:] | [:digits:] | [:xdigits:] | [:Letter:] | [:letter:] | [:nsq:] | [:ndq:]

expression → expression0 {'|' expression0}
expression0 → expression1 {expression1}
expression1 → expression2[?|*|+]
expression2 → expression3[$name_of_the_action]
expression3 → character | character_class | (expression)

In these grammars, the word 'character' means the following: any non-whitespace character other than '|', '*', '+', '?', '$', '\', '"' stands for itself. If these characters, or a newline, need to appear in the regular expression, they should be written as '\|', '\*', '\+', '\?', '\$', '\\', '\"', '\n', respectively. All whitespace characters (that is, characters whose codes do not exceed the code of the space character) are ignored by Myauka.
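
For instance, the expression

(\+|-)?[:digits:]+

could describe an optionally signed decimal integer (using the parenthesized grouping available in the %numbers and %strings expressions): the character '+' has to be escaped because it is an operator of the expression language, while '-' stands for itself.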

Below is the list of allowed character classes.

  • [:Latin:] Uppercase Latin letters from 'A' to 'Z'.
  • [:latin:] Lowercase Latin letters from 'a' to 'z'.
  • [:Russian:] Uppercase Russian letters from 'А' to 'Я' (including the letter 'Ё').
  • [:russian:] Lowercase Russian letters from 'а' to 'я' (including the letter 'ё').
  • [:bdigits:] Characters of binary digits, i.e. characters '0' and '1'.
  • [:odigits:] Characters of octal digits, i.e. characters '0', '1', '2', '3', '4', '5', '6', '7'.
  • [:digits:] Characters of decimal digits, i.e. characters '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'.
  • [:xdigits:] Characters of hexadecimal digits, i.e. characters '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'a', 'b', 'c', 'd', 'e', 'f'.
  • [:Letter:] Uppercase Latin letters from 'A' to 'Z' and uppercase Russian letters from 'А' to 'Я' (including the letter 'Ё').
  • [:letter:] Lowercase Latin letters from 'a' to 'z' and lowercase Russian letters from 'а' to 'я' (including the letter 'ё').
  • [:nsq:] Characters other than the single quote (').
  • [:ndq:] Characters other than the double quote (").

Here character classes [:nsq:] and [:ndq:] are allowed only in the section %strings.
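
Putting the pieces together, a minimal input file for a hypothetical toy language might look as follows; all names, codes, and delimiters below are invented for this illustration, and it is assumed that the identifier code named by %ident_name is also listed in %codes:

%scaner_name Toy_scaner
%ident_name Ident

%codes
Ident, Kw_if, Kw_then, Kw_else, Kw_endif,
del_LT, del_GT, del_EQ

%idents {[:Latin:]|[:latin:]|_}{[:Latin:]|[:latin:]|[:digits:]|_}

%keywords
"if" : Kw_if, "then" : Kw_then, "else" : Kw_else, "endif" : Kw_endif

%delimiters
"<" : del_LT, ">" : del_GT, "=" : del_EQ

%comments %single_lined "//" %multilined "/*" : "*/"

Since %codes_type and %lexem_info_name are omitted, the generated enumeration is called Lexem_code and the lexeme information type Lexem_info; the scanner class Toy_scaner is declared in toy_scaner.h and implemented in toy_scaner.cpp.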

Building

To build the generator Myauka, you need to use the build system Murlyka. The only external dependencies of the project Myauka are boost::system and boost::filesystem.
