References
Kaleidoscope: Kaleidoscope Introduction and the Lexer
Kaleidoscope: Implementing a Parser and AST
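1. Introduction
Kaleidoscope is a toy language that the official LLVM tutorial implements for teaching purposes. Many implementation choices deliberately ignore good software-engineering practice and aim for simplicity, so that LLVM's features and usage stand out. For example, every value in Kaleidoscope is a 64-bit floating-point number (so the language needs no type declarations), and only four binary operators are supported: addition, subtraction, multiplication, and less-than.
2. Tokens
The Kaleidoscope lexer defines only five token codes (end of file, the two keywords def and extern, identifiers, and numbers); any other character is returned as its own character code: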
enum Token {
tok_eof = -1,
// commands
tok_def = -2,
tok_extern = -3,
// primary
tok_identifier = -4,
tok_number = -5,
};
static std::string IdentifierStr; // global; filled in if tok_identifier
static double NumVal; // global; filled in if tok_number
With the tokens defined, next comes the function that reads them:
// For an unrecognized character, returns that character's value (0-255); otherwise returns one of the enum Token values.
static int gettok() {
static int LastChar = ' ';
// Skip any whitespace.
while (isspace(LastChar))
LastChar = getchar();
if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
IdentifierStr = LastChar;
while (isalnum((LastChar = getchar())))
IdentifierStr += LastChar;
if (IdentifierStr == "def")
return tok_def;
if (IdentifierStr == "extern")
return tok_extern;
return tok_identifier;
}
// Input such as 1.23.45.67 is consumed as a single token, and strtod converts it to 1.23.
if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
std::string NumStr;
do {
NumStr += LastChar;
LastChar = getchar();
} while (isdigit(LastChar) || LastChar == '.');
NumVal = strtod(NumStr.c_str(), nullptr);
return tok_number;
}
if (LastChar == '#') {
// Comment until end of line.
do
LastChar = getchar();
while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
if (LastChar != EOF)
return gettok();
}
// Check for end of file. Don't eat the EOF.
if (LastChar == EOF)
return tok_eof;
// Otherwise, just return the character as its ascii value.
int ThisChar = LastChar;
LastChar = getchar();
return ThisChar;
}
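To see what the lexer produces without typing into stdin, the same rules can be applied to an in-memory string. The following is a minimal sketch, not the tutorial's code: the Lexeme struct and the LexString function are hypothetical names introduced here for illustration, and the token codes are copied from the enum above.

```cpp
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Token codes copied from the enum above.
enum Token { tok_eof = -1, tok_def = -2, tok_extern = -3,
             tok_identifier = -4, tok_number = -5 };

// Hypothetical helper type: one token plus its payload.
struct Lexeme {
  int Kind;         // a Token value, or the character code of an unknown character
  std::string Text; // identifier text, empty otherwise
  double Value;     // numeric value, 0 otherwise
};

// Hypothetical helper: lex a whole string with the same rules as gettok().
std::vector<Lexeme> LexString(const std::string &Src) {
  std::vector<Lexeme> Out;
  size_t I = 0;
  // Current character, or EOF once the input is exhausted.
  auto Peek = [&]() -> int {
    return I < Src.size() ? static_cast<unsigned char>(Src[I]) : EOF;
  };
  while (true) {
    while (Peek() != EOF && std::isspace(Peek()))
      ++I; // skip whitespace
    int C = Peek();
    if (C == EOF) {
      Out.push_back({tok_eof, "", 0.0});
      return Out;
    }
    if (std::isalpha(C)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
      std::string Id;
      while (Peek() != EOF && std::isalnum(Peek()))
        Id += Src[I++];
      int Kind = Id == "def" ? tok_def
                 : Id == "extern" ? tok_extern : tok_identifier;
      Out.push_back({Kind, Id, 0.0});
    } else if (std::isdigit(C) || C == '.') { // number: [0-9.]+
      std::string Num;
      while (Peek() != EOF && (std::isdigit(Peek()) || Peek() == '.'))
        Num += Src[I++];
      Out.push_back({tok_number, "", std::strtod(Num.c_str(), nullptr)});
    } else if (C == '#') { // comment until end of line
      while (Peek() != EOF && Peek() != '\n' && Peek() != '\r')
        ++I;
    } else { // unknown character: return it as its own code
      Out.push_back({C, "", 0.0});
      ++I;
    }
  }
}
```

Calling LexString("def foo(x) x+1.23.45 # c") should yield nine lexemes: tok_def, the identifier foo, '(', x, ')', x, '+', one tok_number, and tok_eof. The malformed literal 1.23.45 is swallowed as a single number token, which is the lenient behavior the comment in gettok describes.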
3. Abstract Syntax Tree (AST)
We want one object for each construct in the language, and an AST is a natural way to model the language. Kaleidoscope has several expression classes, one prototype class, and one function class.
3.1. Expression AST
/// ExprAST - Base class for all expression nodes.
class ExprAST {
public:
virtual ~ExprAST() {}
};
/// NumberExprAST - Expression class for numeric literals like "1.0".
class NumberExprAST : public ExprAST {
double Val;
public:
NumberExprAST(double Val) : Val(Val) {}
};
/// VariableExprAST - Expression class for referencing a variable, like "a".
class VariableExprAST : public ExprAST {
std::string Name;
public:
VariableExprAST(const std::string &Name) : Name(Name) {}
};
/// BinaryExprAST - Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
char Op;
std::unique_ptr<ExprAST> LHS, RHS;
public:
BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
std::unique_ptr<ExprAST> RHS)
: Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
};
/// CallExprAST - Expression class for function calls.
class CallExprAST : public ExprAST {
std::string Callee;
std::vector<std::unique_ptr<ExprAST>> Args;
public:
CallExprAST(const std::string &Callee,
std::vector<std::unique_ptr<ExprAST>> Args)
: Callee(Callee), Args(std::move(Args)) {}
};
Because every value is a double-precision float, the expression nodes carry no field that records a type.
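As an illustration of how these nodes nest, the sketch below builds the tree for x + 2.5 * y by hand. It uses trimmed-down copies of the classes above and adds a str() method, which is an assumption made purely for inspection and is not part of the tutorial's classes; BuildExample is likewise a hypothetical name.

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Trimmed-down copies of the node classes above. The str() method is a
// hypothetical addition used only to print the tree.
class ExprAST {
public:
  virtual ~ExprAST() = default;
  virtual std::string str() const = 0;
};

class NumberExprAST : public ExprAST {
  double Val;
public:
  NumberExprAST(double Val) : Val(Val) {}
  std::string str() const override {
    std::ostringstream OS;
    OS << Val; // default stream formatting of a double
    return OS.str();
  }
};

class VariableExprAST : public ExprAST {
  std::string Name;
public:
  VariableExprAST(const std::string &Name) : Name(Name) {}
  std::string str() const override { return Name; }
};

class BinaryExprAST : public ExprAST {
  char Op;
  std::unique_ptr<ExprAST> LHS, RHS;
public:
  BinaryExprAST(char Op, std::unique_ptr<ExprAST> LHS,
                std::unique_ptr<ExprAST> RHS)
      : Op(Op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
  // Print the node fully parenthesized so the tree shape is visible.
  std::string str() const override {
    return "(" + LHS->str() + " " + Op + " " + RHS->str() + ")";
  }
};

// Hypothetical helper: the tree a parser would produce for "x + 2.5 * y".
// The multiplication is the right child of the addition.
std::unique_ptr<ExprAST> BuildExample() {
  auto Mul = std::make_unique<BinaryExprAST>(
      '*', std::make_unique<NumberExprAST>(2.5),
      std::make_unique<VariableExprAST>("y"));
  return std::make_unique<BinaryExprAST>(
      '+', std::make_unique<VariableExprAST>("x"), std::move(Mul));
}
```

Each node owns its children through std::unique_ptr, so destroying the root releases the whole tree.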
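3.2. Function AST
Next we need a way to describe a function prototype and a function itself. The former stores the function name and the list of argument names; the latter stores the prototype together with the function body, which is one of the expressions above.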
/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its name, and its argument names (thus implicitly the number
/// of arguments the function takes).
class PrototypeAST {
std::string Name;
std::vector<std::string> Args;
public:
PrototypeAST(const std::string &name, std::vector<std::string> Args)
: Name(name), Args(std::move(Args)) {}
const std::string &getName() const { return Name; }
};
/// FunctionAST - This class represents a function definition itself.
class FunctionAST {
std::unique_ptr<PrototypeAST> Proto;
std::unique_ptr<ExprAST> Body;
public:
FunctionAST(std::unique_ptr<PrototypeAST> Proto,
std::unique_ptr<ExprAST> Body)
: Proto(std::move(Proto)), Body(std::move(Body)) {}
};
4. Parser
The Kaleidoscope parser combines recursive descent parsing with operator-precedence parsing: the latter handles binary expressions, the former everything else. The parser's output is an abstract syntax tree (AST).
First we add two helpers, CurTok and getNextToken:
/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
/// token the parser is looking at. getNextToken reads another token from the
/// lexer and updates CurTok with its results.
static int CurTok;
static int getNextToken() {
return CurTok = gettok();
}
4.1. Parsing primary expressions
4.1.1. Numeric literals
/// numberexpr ::= number
static std::unique_ptr<ExprAST> ParseNumberExpr() {
auto Result = std::make_unique<NumberExprAST>(NumVal);
getNextToken(); // consume the number
return std::move(Result);
}
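4.1.2. Parenthesized expressions
// ParseExpression is defined in section 4.2; it is declared here so that this function compiles.
static std::unique_ptr<ExprAST> ParseExpression();
/// parenexpr ::= '(' expression ')'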
static std::unique_ptr<ExprAST> ParseParenExpr() {
getNextToken(); // eat (.
auto V = ParseExpression();
if (!V)
return nullptr;
if (CurTok != ')')
return LogError("expected ')'");
getNextToken(); // eat ).
return V;
}
Two points to note:
- On a syntax error the function returns nullptr; the caller checks for nullptr and propagates the failure.
- It calls ParseExpression recursively; recursion keeps each grammar production simple.
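The parsing code calls LogError and LogErrorP, which are not shown in this text. A minimal sketch of the two helpers follows; the message format is chosen to match the "Error: ..." line in the session output of section 6. The two classes at the top are empty stand-ins so that the fragment compiles on its own; in the real program they are the classes from section 3.

```cpp
#include <cstdio>
#include <memory>

// Stand-ins for the AST classes of section 3 (assumption: only needed here
// so that the fragment is self-contained).
class ExprAST {
public:
  virtual ~ExprAST() = default;
};
class PrototypeAST {};

/// LogError - Print the message to stderr and return a null expression, so
/// callers can write `return LogError("...");`.
std::unique_ptr<ExprAST> LogError(const char *Str) {
  std::fprintf(stderr, "Error: %s\n", Str);
  return nullptr;
}

/// LogErrorP - Same, for functions that return a prototype.
std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) {
  LogError(Str);
  return nullptr;
}
```

Returning nullptr through the normal return type lets each parse function report and bail out in one statement, at the cost of very coarse error recovery.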
4.1.3. Variable references and function calls
/// identifierexpr
/// ::= identifier
/// ::= identifier '(' expression* ')'
static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
std::string IdName = IdentifierStr;
getNextToken(); // eat identifier.
if (CurTok != '(') // Simple variable ref.
return std::make_unique<VariableExprAST>(IdName);
// Call.
getNextToken(); // eat (
std::vector<std::unique_ptr<ExprAST>> Args;
if (CurTok != ')') {
while (1) {
if (auto Arg = ParseExpression())
Args.push_back(std::move(Arg));
else
return nullptr;
if (CurTok == ')')
break;
if (CurTok != ',')
return LogError("Expected ')' or ',' in argument list");
getNextToken();
}
}
// Eat the ')'.
getNextToken();
return std::make_unique<CallExprAST>(IdName, std::move(Args));
}
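4.1.4. Wrapper
With the three kinds of primary expression handled (identifiers and calls, numeric literals, and parenthesized expressions), a helper function wraps them behind a single entry point.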
/// primary
/// ::= identifierexpr
/// ::= numberexpr
/// ::= parenexpr
static std::unique_ptr<ExprAST> ParsePrimary() {
switch (CurTok) {
default:
return LogError("unknown token when expecting an expression");
case tok_identifier:
return ParseIdentifierExpr();
case tok_number:
return ParseNumberExpr();
case '(':
return ParseParenExpr();
}
}
4.2. Parsing binary expressions
Binary expressions are harder to parse, mainly because of operator precedence. Operator-precedence parsing is used here to resolve it.
/// BinopPrecedence - This holds the precedence for each binary operator that is
/// defined.
static std::map<char, int> BinopPrecedence;
/// GetTokPrecedence - Get the precedence of the pending binary operator token.
static int GetTokPrecedence() {
if (!isascii(CurTok))
return -1;
// Make sure it's a declared binop.
int TokPrec = BinopPrecedence[CurTok];
if (TokPrec <= 0) return -1;
return TokPrec;
}
int main() {
// Install standard binary operators.
// 1 is lowest precedence.
BinopPrecedence['<'] = 10;
BinopPrecedence['+'] = 20;
BinopPrecedence['-'] = 20;
BinopPrecedence['*'] = 40; // highest.
...
}
The code above defines the operator precedences and the function that looks them up.
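One detail of the lookup is that std::map::operator[] inserts a zero-valued entry for every character it is asked about, which is why the check is TokPrec <= 0. The sketch below shows an alternative that uses find and leaves the table untouched. LookupPrecedence and InstallStandardOperators are hypothetical names, and the token is passed as a parameter instead of being read from CurTok so that the fragment stands alone.

```cpp
#include <map>

// Same table as above.
static std::map<char, int> BinopPrecedence;

// Hypothetical helper: install the four standard operators.
void InstallStandardOperators() {
  BinopPrecedence['<'] = 10; // lowest
  BinopPrecedence['+'] = 20;
  BinopPrecedence['-'] = 20;
  BinopPrecedence['*'] = 40; // highest
}

// Hypothetical variant of GetTokPrecedence: returns the precedence of Tok,
// or -1 if Tok is not a declared binary operator. find() never inserts.
int LookupPrecedence(int Tok) {
  if (Tok < 0 || Tok > 127) // token codes and non-ASCII values are not operators
    return -1;
  auto It = BinopPrecedence.find(static_cast<char>(Tok));
  if (It == BinopPrecedence.end() || It->second <= 0)
    return -1;
  return It->second;
}
```

With this variant the map still has exactly four entries after any number of lookups, whereas the operator[] version grows by one entry per distinct non-operator character it sees. Both versions return the same precedence values.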
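Now consider how the expression "a+b+(c+d)*e*f+g" is parsed. Operator-precedence parsing treats an expression as a stream of primary expressions separated by binary operators. Starting from a, it sees the following sequence of pairs: [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. A parenthesized expression counts as a primary expression here.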
/// expression
/// ::= primary binoprhs
///
static std::unique_ptr<ExprAST> ParseExpression() {
auto LHS = ParsePrimary();
if (!LHS)
return nullptr;
return ParseBinOpRHS(0, std::move(LHS));
}
In the code above, ParseBinOpRHS is the function that parses the sequence of pairs.
/// binoprhs
/// ::= ('+' primary)*
// ExprPrec: the minimum operator precedence this call is allowed to consume
static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
std::unique_ptr<ExprAST> LHS) {
// If this is a binop, find its precedence.
while (true) {
int TokPrec = GetTokPrecedence();
// If this is a binop that binds at least as tightly as the current binop,
// consume it, otherwise we are done.
if (TokPrec < ExprPrec)
return LHS;
// Okay, we know this is a binop.
int BinOp = CurTok;
getNextToken(); // eat binop
// Parse the primary expression after the binary operator.
auto RHS = ParsePrimary();
if (!RHS)
return nullptr;
// If BinOp binds less tightly with RHS than the operator after RHS, let
// the pending operator take RHS as its LHS.
// In short: if the next operator binds more tightly, handle it first and
// use its result, as a whole, as the right operand of the current operator.
// Example: a+b*c.
// Note that in a+(b+c)*c the parenthesized b+c is a primary expression, so RHS is already (b+c) at this point.
int NextPrec = GetTokPrecedence();
if (TokPrec < NextPrec) {
RHS = ParseBinOpRHS(TokPrec + 1, std::move(RHS));
if (!RHS)
return nullptr;
}
// Merge LHS/RHS.
LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
std::move(RHS));
}
}
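To make the grouping visible, the same loop can be run on single-character operands and made to return a fully parenthesized string instead of an AST. This is a sketch under simplifying assumptions: MiniParser and Parenthesize are hypothetical names, operands are single letters, the input contains no whitespace, and there is no error handling.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical mini-parser: same control flow as ParseExpression and
// ParseBinOpRHS, but every "node" is just a parenthesized string.
struct MiniParser {
  std::string Src;
  std::size_t Pos = 0;
  std::map<char, int> Prec{{'<', 10}, {'+', 20}, {'-', 20}, {'*', 40}};

  // Current character, or -1 at end of input.
  int Cur() const { return Pos < Src.size() ? Src[Pos] : -1; }

  // Precedence of the pending operator, or -1 if it is not an operator.
  int TokPrec() const {
    auto It = Prec.find(static_cast<char>(Cur()));
    return It == Prec.end() ? -1 : It->second;
  }

  // primary ::= letter | '(' expression ')'
  std::string Primary() {
    if (Cur() == '(') {
      ++Pos; // eat '('
      std::string Inner = Expression();
      ++Pos; // eat ')' (assumed present in this sketch)
      return Inner;
    }
    return std::string(1, Src[Pos++]);
  }

  // binoprhs ::= (binop primary)*
  std::string BinOpRHS(int ExprPrec, std::string LHS) {
    while (true) {
      int P = TokPrec();
      if (P < ExprPrec) // pending operator binds too loosely: done
        return LHS;
      char Op = Src[Pos++]; // eat binop
      std::string RHS = Primary();
      // If the next operator binds more tightly, it takes RHS as its LHS.
      if (P < TokPrec())
        RHS = BinOpRHS(P + 1, RHS);
      LHS = "(" + LHS + Op + RHS + ")"; // merge LHS/RHS
    }
  }

  std::string Expression() { return BinOpRHS(0, Primary()); }
};

// Hypothetical entry point: return Src with every binary operation wrapped
// in parentheses.
std::string Parenthesize(const std::string &Src) {
  MiniParser P;
  P.Src = Src;
  return P.Expression();
}
```

For the expression discussed above, Parenthesize("a+b+(c+d)*e*f+g") gives (((a+b)+(((c+d)*e)*f))+g): operators of equal precedence group to the left, and the two multiplications are folded into the right operand of the second + before the final +g is merged.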
4.3. Parsing functions
In Kaleidoscope, function parsing covers both definitions and declarations (extern).
Parsing a prototype is comparatively simple:
/// prototype
/// ::= id '(' id* ')'
static std::unique_ptr<PrototypeAST> ParsePrototype() {
if (CurTok != tok_identifier)
return LogErrorP("Expected function name in prototype");
std::string FnName = IdentifierStr;
getNextToken();
if (CurTok != '(')
return LogErrorP("Expected '(' in prototype");
// Read the list of argument names.
std::vector<std::string> ArgNames;
while (getNextToken() == tok_identifier)
ArgNames.push_back(IdentifierStr);
if (CurTok != ')')
return LogErrorP("Expected ')' in prototype");
// success.
getNextToken(); // eat ')'.
return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
}
A function definition is parsed as follows:
/// definition ::= 'def' prototype expression
static std::unique_ptr<FunctionAST> ParseDefinition() {
getNextToken(); // eat def.
auto Proto = ParsePrototype();
if (!Proto) return nullptr;
if (auto E = ParseExpression())
return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
return nullptr;
}
extern is also supported, for declarations without a body and for forward declarations:
/// external ::= 'extern' prototype
static std::unique_ptr<PrototypeAST> ParseExtern() {
getNextToken(); // eat extern.
return ParsePrototype();
}
Finally, arbitrary top-level expressions are supported by wrapping each one in an anonymous function that takes no arguments.
/// toplevelexpr ::= expression
static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
if (auto E = ParseExpression()) {
// Make an anonymous proto.
auto Proto = std::make_unique<PrototypeAST>("", std::vector<std::string>());
return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
}
return nullptr;
}
5. The Driver
This is simply the loop that main runs:
/// top ::= definition | external | expression | ';'
static void MainLoop() {
while (1) {
fprintf(stderr, "ready> ");
switch (CurTok) {
case tok_eof:
return;
case ';': // ignore top-level semicolons.
getNextToken();
break;
case tok_def:
HandleDefinition();
break;
case tok_extern:
HandleExtern();
break;
default:
HandleTopLevelExpression();
break;
}
}
}
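MainLoop dispatches to HandleDefinition, HandleExtern, and HandleTopLevelExpression, which are not shown in this text. The sketch below illustrates the pattern they are assumed to follow: report success, or skip one token on failure so the loop can resynchronize. Everything other than the dispatch loop itself is a stand-in invented for this example: the token queue Pending replaces gettok(), ParseItem replaces the three real parse functions, and Log replaces the messages printed to stderr.

```cpp
#include <deque>
#include <string>
#include <utility>
#include <vector>

enum Token { tok_eof = -1, tok_def = -2, tok_extern = -3 };

// Stand-ins (assumptions for this sketch).
static std::deque<int> Pending;      // pre-lexed tokens instead of gettok()
static std::vector<std::string> Log; // what the handlers reported
static int CurTok;

static int getNextToken() {
  if (Pending.empty())
    return CurTok = tok_eof;
  CurTok = Pending.front();
  Pending.pop_front();
  return CurTok;
}

// Stand-in parser: consumes tokens up to ';' and succeeds, or fails as soon
// as it meets a '?' token.
static bool ParseItem() {
  while (CurTok != ';' && CurTok != tok_eof) {
    if (CurTok == '?')
      return false;
    getNextToken();
  }
  return true;
}

// Each handler reports success, or skips one token for error recovery.
static void HandleDefinition() {
  if (ParseItem()) Log.push_back("definition");
  else getNextToken();
}
static void HandleExtern() {
  if (ParseItem()) Log.push_back("extern");
  else getNextToken();
}
static void HandleTopLevelExpression() {
  if (ParseItem()) Log.push_back("top-level expr");
  else getNextToken();
}

/// top ::= definition | external | expression | ';'
static void MainLoop() {
  while (true) {
    switch (CurTok) {
    case tok_eof: return;
    case ';': getNextToken(); break; // ignore top-level semicolons
    case tok_def: HandleDefinition(); break;
    case tok_extern: HandleExtern(); break;
    default: HandleTopLevelExpression(); break;
    }
  }
}

// Hypothetical entry point: prime the first token, run the loop, return the log.
std::vector<std::string> Run(std::deque<int> Tokens) {
  Pending = std::move(Tokens);
  Log.clear();
  getNextToken();
  MainLoop();
  return Log;
}
```

Note that the first token has to be read before MainLoop starts, because the loop inspects CurTok before it ever calls getNextToken.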
6. Complete code and testing
The complete code is in the code listing.
Compile and run as follows:
# Compile
$ clang++ -g -O3 toy.cpp `llvm-config --cxxflags`
# Run
$ ./a.out
ready> def foo(x y) x+foo(y, 4.0);
Parsed a function definition.
ready> def foo(x y) x+y y;
Parsed a function definition.
Parsed a top-level expr
ready> def foo(x y) x+y );
Parsed a function definition.
Error: unknown token when expecting an expression
ready> extern sin(a);
ready> Parsed an extern
ready> ^D