A program is a set of instructions we give to the computer to obtain the output we want. These instructions are generally written in a high-level language.
When we code, the compiler (aided by the debugger) points out the mistakes we have made so that we can correct them, while also converting the code from the high-level programming language into a lower-level language. This produces an executable file with which we can proceed to the successful execution of our code. This step matters because the computer does not understand our language and needs the program in machine or assembly language. The conversion is completed in several steps, each of which takes the output of the previous step as its input.
Lexical analysis and syntax analysis are both part of these steps. These two steps discover the syntactic structure of a program, acting as the scanner and the parser respectively. It is important to understand their features to see why they are important to the compilation process.
Lexical and syntax analyzers are not only needed in compiler design; they are also used in various other applications, such as programs that compute the complexity of other programs and programs that must analyze the contents of a configuration file.
Lexical analysis vs Syntax analysis
| Lexical analysis | Syntax analysis |
| --- | --- |
| Responsible for converting a sequence of characters into a sequence of tokens. | The process of analyzing a string of symbols, in a natural or computer language, to check that it conforms to the rules of a formal grammar. |
| Reads the program one character at a time; the output is meaningful lexemes. | Takes tokens as input and generates a parse tree as output. |
| It is the first phase of the compilation process. | It is the second phase of the compilation process. |
| Also referred to as lexing or tokenization. | Also referred to as syntactic analysis or parsing. |
| A lexical analyzer is a pattern matcher. | A syntax analyzer builds a parse tree to detect defects in the syntax of the program. |
| Relatively simple approaches suffice for lexical analysis. | Syntax analysis requires a much more complex approach. |
| The lexical analyzer may not be portable. | The parser is always portable. |
What is lexical analysis?
This is the first phase of compilation, in which the source program is scanned one character at a time and grouped into meaningful lexemes, which are emitted as tokens.
One might wonder what a lexeme or a token is. A lexeme is a sequence of characters in the source program with the lowest level of syntactic meaning. A token, on the other hand, is a category of lexemes and is the basic building block of programs; a lexeme is an instance of a token. For example, in the statement `count = count + 1`, the lexemes are `count`, `=`, `count`, `+`, and `1`, and their tokens are identifier, assignment operator, identifier, addition operator, and integer literal respectively.
The input is the source code, which is processed so that ignorable text such as spaces, newlines, and comments is thrown away. The output goes to the syntax analysis phase for further processing. Each token is formatted as <token-name, attribute-value>.
A lexical analyzer performs lexical analysis and acts as a pattern matcher. Using the lexical syntax of the language, a sequence of characters can be split into a sequence of lexemes, while whitespace and comments are separated out and discarded. A lexical analyzer is also used to tokenize sources (identify all lexemes and their categories), report lexical errors if any, save the text of interesting tokens, record source locations, and implement preprocessor functions.
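To make this concrete, below is a minimal sketch of a regex-based lexical analyzer in Python. The token names and the toy language are assumptions chosen for illustration, not the specification of any particular compiler.

```python
import re

# Token specification for a toy expression language (names are illustrative).
TOKEN_SPEC = [
    ("NUMBER",   r"\d+"),           # integer literals
    ("ID",       r"[A-Za-z_]\w*"),  # identifiers
    ("ASSIGN",   r"="),             # assignment operator
    ("OP",       r"[+\-*/]"),       # arithmetic operators
    ("SKIP",     r"[ \t]+"),        # whitespace: discarded
    ("COMMENT",  r"#[^\n]*"),       # comments: discarded
    ("NEWLINE",  r"\n"),            # line breaks: discarded here
    ("MISMATCH", r"."),             # anything else is a lexical error
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Yield <token-name, attribute-value> pairs, skipping whitespace and comments."""
    for match in MASTER_RE.finditer(source):
        kind, value = match.lastgroup, match.group()
        if kind in ("SKIP", "COMMENT", "NEWLINE"):
            continue  # ignorable text is thrown away, as described above
        if kind == "MISMATCH":
            raise SyntaxError(f"Unexpected character {value!r}")  # lexical error report
        yield (kind, value)

print(list(tokenize("rate = rate + 60  # update")))
# [('ID', 'rate'), ('ASSIGN', '='), ('ID', 'rate'), ('OP', '+'), ('NUMBER', '60')]
```

A single alternation of named groups keeps the pattern matcher simple; in practice, scanners are often generated from similar specifications by tools such as Lex or Flex.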
Separating the steps of lexical and syntax analysis allows the lexical analyzer to be optimized, which improves the efficiency of the process. It also simplifies the parser and keeps it portable, since a lexical analyzer may not always be portable.
What is syntax analysis?
The second phase of the compilation process is syntax analysis, which takes the tokens produced by lexical analysis as input and generates a parse tree. Here, the arrangement of tokens (lexemes) is checked against the grammar of the source language, and the parser performs this syntax analysis. Syntax analyzers are based directly on the grammar and are responsible for producing a structure-rich representation of the input that is convenient to process.
The parse tree, also called a syntax tree, shows whether the expression formed by the tokens is syntactically correct, ensuring that no defects are present in the structure of the program. (Beyond this, the compilation process also involves phases such as semantic analysis, intermediate code generation, code optimization, and code generation.) There are many parsing algorithms; the two main classes are top-down and bottom-up parsing.
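As a sketch of top-down parsing, here is a minimal recursive-descent parser in Python that consumes the (kind, value) tokens from the lexer sketch above. The grammar, `expr : term (("+" | "-") term)*` with `term : NUMBER | ID`, and the nested-tuple tree shape are assumptions for illustration.

```python
# A minimal recursive-descent (top-down) parser sketch for a toy grammar:
#   expr : term (("+" | "-") term)*
#   term : NUMBER | ID

def parse_expr(tokens):
    """Consume (kind, value) tokens and return a nested-tuple parse tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def term():
        nonlocal pos
        kind, value = peek()
        if kind in ("NUMBER", "ID"):
            pos += 1
            return (kind, value)            # leaf node of the parse tree
        raise SyntaxError(f"Expected a number or identifier, got {value!r}")

    node = term()
    while peek()[0] == "OP" and peek()[1] in "+-":
        op = tokens[pos][1]
        pos += 1
        node = ("BINOP", op, node, term())  # grow a left-associative tree
    if pos != len(tokens):
        raise SyntaxError(f"Unexpected token {tokens[pos][1]!r}")
    return node

tree = parse_expr([("ID", "rate"), ("OP", "+"), ("NUMBER", "60")])
print(tree)  # ('BINOP', '+', ('ID', 'rate'), ('NUMBER', '60'))
```

A bottom-up parser would accept the same grammar but build the tree from the leaves upward instead of starting from the top-level rule.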
A syntax analyzer produces a complete parse tree, or at least traces the structure of one; parsers that cannot handle the complete grammar work on subsets of it, which restricts the parsing problem they can solve. Even so, syntax analysis is more powerful than lexical analysis.
Author
Shriya Upasani
MIT World Peace University