Project 3: Query Execution

What I implemented:

  1. The low-level execution of the usual SQL operations (insert/delete/update/select, aggregation, joins, sorting) using the iterator model
  2. Several query-optimization rules over the plan tree (NLJ-to-hash-join, predicate pushdown, column pruning, common subexpression elimination, always-false filter pruning, index range scans)

It took me roughly ten days. Overall it felt better than the B+ tree project: the work here is mostly volume of code, and with no concurrency control to worry about it is nowhere near as tricky.

In this project, you will implement the components that allow BusTub to execute queries. You will create the operator executors that execute SQL queries and implement optimizer rules to transform query plans.

In short: implement the execution of SQL queries, plus plan optimization.

In this project, you will add new operator executors and query optimizations to BusTub.

BusTub uses the iterator (i.e., Volcano) query processing model, in which every executor implements a Next function to get the next tuple result.

When the DBMS invokes an executor’s Next function, the executor returns either:

(1) a single tuple

    (In BusTub's implementation of the iterator model, executors also return the tuple's record identifier (RID) along with the tuple itself.)

(2) an indicator that there are no more tuples.

With this approach, each executor implements a loop that continues calling Next on its children to retrieve tuples and process them one by one.

Background

BusTub Framework

(figure: BusTub framework overview)

AST

After introducing the BusTub framework, the handout walks through detailed examples of query optimization over the syntax tree.

First, a refresher on what an abstract syntax tree (AST) is.

A SQL statement:

Select `title`
From Books, Borrowers, Loans
Where Books.LC_NO = Loans.LC_NO and Borrowers.CARD_NO = Loans.CARD_NO and DATE <= 1/1/78

Its syntax-tree representation, together with the optimized result, is shown below:

(figure: the syntax tree and its optimized form)

The algorithm is shown below. The key idea: perform selections and projections as early as possible, pushing them as far down the tree as they will go.

(figure: the optimization algorithm)

What 15-445 presents here is this same flavor of syntax-tree optimization.

First, a glossary of the plan-node names and the relational operations they correspond to:

  1. Projection: projection
  2. Filter: selection
  3. MockScan: a scan over a (mock) table
  4. Aggregation: aggregate functions
  5. NestedLoopJoin: nested-loop join

Now look at the example plan trees the handout gives:

SELECT * FROM __mock_table_1;

=== PLANNER ===
Projection { exprs=[#0.0, #0.1] } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
=== OPTIMIZER ===
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
SELECT colA, MAX(colB) FROM
(SELECT * FROM __mock_table_1, __mock_table_3 WHERE colA = colE) GROUP BY colA;

=== OPTIMIZER ===
Agg { types=[max], aggregates=[#0.1], group_by=[#0.0] }
NestedLoopJoin { type=Inner, predicate=(#0.0=#1.0) }
MockScan { table=__mock_table_1 }
MockScan { table=__mock_table_3 }

(figure: plan tree for the join query above)

SELECT * FROM __mock_table_1 WHERE colA > 1;

=== OPTIMIZER ===
Filter { predicate=(#0.0>1) } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
values (1, 2, 'a'), (3, 4, 'b');

=== PLANNER ===
Projection { exprs=[#0.0, #0.1, #0.2] } | (__values#0.0:INTEGER, __values#0.1:INTEGER, __values#0.2:VARCHAR)
Values { rows=2 } | (__values#0.0:INTEGER, __values#0.1:INTEGER, __values#0.2:VARCHAR)
=== OPTIMIZER ===
Values { rows=2 } | (__values#0.0:INTEGER, __values#0.1:INTEGER, __values#0.2:VARCHAR)

As you can see, indentation expresses the parent-child relationships of the tree.

In the trees from class, every table reference corresponds to a MockScan here, and a Cartesian product plus a selection can be expressed as a single NestedLoopJoin.

The handout also explains what these printed plans mean in detail:

ColumnValueExpression

That is, in something like exprs=[#0.0, #0.1], the #0 means the first child node (not the first table!), and the second number is the column index within that child.

Volcano Model

introduction

The article 火山模型和优化(向量化执行、编译执行) (Volcano model and its optimizations: vectorized and compiled execution) covers this in detail; the text below is excerpted from that blog.

See also: 数据库内核通过 code-gen 提升性能的探索 (exploring how database kernels improve performance via code-gen).

The Volcano model is also known as the Pipeline Model or the iterator model. It abstracts each relational-algebra operation into an Operator, builds the whole SQL statement into an Operator tree, and recursively calls next() top-down from the root to the leaves.

An Operator's next() implementation generally has three steps:

  • call the child Operator's next() to fetch one row (a tuple);
  • apply this Operator's specific processing to the tuple (e.g., filter or project);
  • return the processed tuple.

So at execution time, next() calls flow top-down through the query tree while the data is pulled and processed bottom-up. This is why the Volcano model is also called a pull-based execution model.
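To make the pull-based flow concrete, here is a minimal, self-contained sketch of a Volcano-style operator in C++ (the Row/Operator/FilterOp types are illustrative inventions for this post, not BusTub's actual classes):

#include <memory>
#include <optional>
#include <vector>

struct Row {
  std::vector<int> cols;
};

struct Operator {
  virtual ~Operator() = default;
  virtual void Init() = 0;
  // returns the next row, or std::nullopt once exhausted
  virtual auto Next() -> std::optional<Row> = 0;
};

// a filter operator: pulls rows from its child and emits the matching ones
struct FilterOp : Operator {
  std::unique_ptr<Operator> child_;
  bool (*pred_)(const Row &);
  FilterOp(std::unique_ptr<Operator> child, bool (*pred)(const Row &))
      : child_(std::move(child)), pred_(pred) {}
  void Init() override { child_->Init(); }
  auto Next() -> std::optional<Row> override {
    while (auto row = child_->Next()) {  // (1) pull one tuple from the child
      if (pred_(*row)) {                 // (2) operator-specific processing
        return row;                      // (3) return the processed tuple
      }
    }
    return std::nullopt;  // exhaustion propagates up the tree
  }
};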

Most relational databases use the iterator model: SQLite, MongoDB, Impala, DB2, SQL Server, Greenplum, PostgreSQL, Oracle, MySQL, and so on.

The Volcano model's strength is that the processing logic is clear and simple: each Operator only cares about its own logic, so coupling is low. But the drawbacks are just as obvious:

  • Processing each row requires many next() calls, and next() is a virtual function, which is expensive.

    The compiler cannot inline virtual calls, and they add branch-prediction overhead; mispredictions are easy to trigger and scramble the CPU pipeline.

  • Data is processed row by row, which keeps the CPU cache from doing its job.

pipeline breaker

The Volcano model clearly runs as a top-to-bottom pipeline, and in the ideal case the tuple each pipeline node needs stays in registers. However, some operations, such as aggregation, must consume the entire input before they can produce the tuple currently being asked for (and the whole input can at best fit in memory). Such operations are called pipeline breakers.

In the implementations below, aggregation, sort, and the build phase of hash join are all pipeline breakers; these heavyweight phases all have to happen in Init().

ADDITIONAL INFORMATION

System Catalog

The entirety of the catalog implementation is in src/include/catalog/catalog.h. You should pay particular attention to the member functions Catalog::GetTable() and Catalog::GetIndex(). You will use these functions in the implementation of your executors to query the catalog for tables and indexes.

In other words: when implementing the executors you will likely need these two catalog functions.

GetTable returns a TableInfo:

struct TableInfo {
  /** The table schema */
  Schema schema_;
  /** The table name */
  const std::string name_;
  /** An owning pointer to the table heap */
  std::unique_ptr<TableHeap> table_;
  /** The table OID */
  const table_oid_t oid_;
};

For the table modification executors (InsertExecutor, UpdateExecutor, and DeleteExecutor) you must modify all indexes for the table targeted by the operation. You may find the Catalog::GetTableIndexes() function useful for querying all of the indexes defined for a particular table. Once you have the IndexInfo instance for each of the table’s indexes, you can invoke index modification operations on the underlying index structure.

In this project, we use your implementation of B+ Tree Index from Project 2 as the underlying data structure for all index operations. Therefore, successful completion of this project relies on a working implementation of the B+ Tree Index.

By "index" it means secondary indexes, right? I.e., each table can have several indexes, each built on some attribute — so one table may own n B+ trees.

GetIndex returns an IndexInfo:

struct IndexInfo {
  /** The schema for the index key */
  Schema key_schema_;
  /** The name of the index */
  std::string name_;
  /** An owning pointer to the index */
  std::unique_ptr<Index> index_;
  /** The unique OID for the index */
  index_oid_t index_oid_;
  /** The name of the table on which the index is created */
  std::string table_name_;
  /** The size of the index key, in bytes */
  const size_t key_size_;
};

Optimizer Rule Implementation Guide

The BusTub optimizer is a rule-based optimizer. Most optimizer rules construct optimized plans in a bottom-up way. Because the query plan has this tree structure, before applying the optimizer rules to the current plan node, you want to first recursively apply the rules to its children.

At each plan node, you should determine if the source plan structure matches the one you are trying to optimize, and then check the attributes in that plan to see if it can be optimized into the target optimized plan structure.

In the public BusTub repository, we already provide the implementation of several optimizer rules. Please take a look at them as reference.

Task #1 - Access Method Executors

In the background section above, we saw that BusTub can already retrieve data from mock tables in SELECT queries.

This is implemented without real tables by using a MockScan executor to always generate the same tuples using a predefined algorithm.

This is why you cannot update these tables.

In this task, you will implement executors that read from and write to tables in the storage system.

  • src/execution/seq_scan_executor.cpp
  • src/execution/insert_executor.cpp
  • src/execution/update_executor.cpp
  • src/execution/delete_executor.cpp
  • src/execution/index_scan_executor.cpp

And this lab is exactly that: implementing this whole pile of executors. Looks like another grind.

seq_scan

Some thoughts

C++ knowledge

(figure: prefix vs. postfix operator++ declarations)

As shown, the overload for prefix ++ is declared differently from postfix ++.

My understanding here used to be shallow. According to this article, ++i is declared in-class as T& T::operator++();, while i++ is declared as T T::operator++(int);[1]. The prefix form returns a reference, the postfix form returns a value; the int parameter of the postfix form is a dummy whose only purpose is to distinguish the two overloads. In theory i++ creates a temporary object; in practice compilers optimize this away for built-in types, but for user-defined types (like the Iterator here) ++i usually outperforms i++.
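A minimal illustration of the two overloads (my own toy example, not BusTub code):

struct Iter {
  int pos_{0};

  // prefix ++i: advance in place and return a reference — no temporary
  auto operator++() -> Iter & {
    ++pos_;
    return *this;
  }

  // postfix i++: the dummy int only distinguishes the overload; the old
  // value must be copied and returned, so a temporary object is created
  auto operator++(int) -> Iter {
    Iter old = *this;
    ++pos_;
    return old;
  }
};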

MockScan

Worth a mention is its relationship to MockScan. MockScan is a mock operation, so all its tables are hard-coded in mock_scan.h; SeqScan is the real traversal, and fetching a tuple means going step by step through the actual physical machinery and its wrappers.

The MockScan executor does not really query a table; it returns fixed tuples.

After reading the code once through, I think I get it. Look at its Next function:

auto MockScanExecutor::Next(Tuple *tuple, RID *rid) -> bool {
  if (cursor_ == size_) {
    // Scan complete
    return EXECUTOR_EXHAUSTED;
  }
  if (shuffled_idx_.empty()) {
    *tuple = func_(cursor_);
  } else {
    *tuple = func_(shuffled_idx_[cursor_]);
  }
  ++cursor_;
  *rid = MakeDummyRID();
  return EXECUTOR_ACTIVE;
}

Its core is calling func_ to produce the table's tuples.

So the picture is: each MockScanExecutor executes one plan and therefore corresponds to one specific table. Calling that table's own generator function yields its tuples.

For the table tas_2023, for instance, the generator function looks like this:

if (table == "__mock_table_tas_2023") {
  return [plan](size_t cursor) {
    std::vector<Value> values{};
    values.push_back(ValueFactory::GetVarcharValue(ta_list_2023[cursor]));
    values.push_back(ValueFactory::GetVarcharValue(ta_oh_2023[cursor]));
    return Tuple{values, &plan->OutputSchema()};
  };
}

So the MockScanExecutor manages the table handle while the function handles the physical access to the table — a clean decoupling.

physical layer

Implementing SeqScan gives us a first peek at how BusTub's physical layer fits together.

Just like the index entries in the previous project, the actual data tuples live in pages, whose class is TablePage, organized as a heap file:

(figure: heap file organization)

TablePage's layout deserves a mention.

Among its members we find two flexible array members:

char page_start_[0];
TupleInfo tuple_info_[0];

In Project 2 we only ever met one flexible array member per class; two works the same way: page_start_ and tuple_info_ both mark the start of the storage area that follows the fixed-size header.

TablePage's actual storage layout:

->  grows forward                                 grows backward  <-
| tuple_info_[0], tuple_info_[1], ... -> |  <- ...... tuple data |
↑
page_start_ / tuple_info_

That is, the tuple-info entries occupy the front of the page and the tuple data occupies the back, with the two growing toward each other.

Many TablePages then make up a TableHeap — the table's physical storage. Creating a table allocates its heap space and the associated metadata. TableHeap exposes tuple insert/delete/update/read methods, plus an iterator implementation, TableIterator, for traversing its contents.

Because tuples live on disk, reading their values requires deserializing first (DeserializeFrom). That needs the table's type and offset information, which is why Tuple::GetValue takes a schema argument.

Implementation

The basic principle is a sequential scan of the whole table; not much to it.

First, obtain the table's iterator:

// one giant chain of calls
table_iterator_ = std::make_unique<TableIterator>(
    exec_ctx_->GetCatalog()->GetTable(plan_->GetTableOid())->table_->MakeIterator());

Then keep advancing the iterator to fetch tuples.

One thing needs care: handling deleted tuples. Sequence scan is the foundation other operations build on, so deleted tuples must be skipped here. The logic, roughly (TableIterator accessor names as I recall them):

TupleMeta tuple_meta;
do {
  if (table_iterator_->IsEnd()) {
    return false;  // executor exhausted
  }
  // fetch the tuple and its meta, then advance the iterator
  std::tie(tuple_meta, *tuple) = table_iterator_->GetTuple();
  *rid = table_iterator_->GetRID();
  ++(*table_iterator_);
} while (tuple_meta.is_deleted_);  // skip tuples marked deleted
return true;

insert

Some thoughts

recursive execute

For nested SQL subqueries, BusTub uses recursion. Concretely, taking insertion as the example:

The outer call looks like this:

// Execute a query plan.
auto Execute(...) -> bool {
  // Construct the executor for the abstract plan node
  auto executor = ExecutorFactory::CreateExecutor(exec_ctx, plan);
  executor->Init();
  PollExecutor(executor.get(), plan, result_set);
  PerformChecks(exec_ctx);
}

CreateExecutor is a recursive function: it recursively creates an executor instance for each child plan and hands the child executor to its parent:

auto ExecutorFactory::CreateExecutor(...)
    -> std::unique_ptr<AbstractExecutor> {
  switch (plan->GetType()) {
    case PlanType::SeqScan: {
      return std::make_unique<SeqScanExecutor>(exec_ctx, dynamic_cast<const SeqScanPlanNode *>(plan.get()));
    }

    case PlanType::Insert: {
      auto insert_plan = dynamic_cast<const InsertPlanNode *>(plan.get());
      // recursively create an executor for each child plan
      auto child_executor = ExecutorFactory::CreateExecutor(exec_ctx, insert_plan->GetChildPlan());
      // hand the child executor to the parent
      return std::make_unique<InsertExecutor>(exec_ctx, insert_plan, std::move(child_executor));
    }
  }
}

The parent's Init then calls the child's Init (and, later, its Next):

void InsertExecutor::Init() {
  child_executor_->Init();
  ...
}

And that is how nested subqueries get executed recursively.

Implementation

The InsertExecutor inserts tuples into a table and updates any affected indexes.

The planner will ensure that the values have the same schema as the table. The executor will produce a single tuple of integer type as the output, indicating how many rows have been inserted into the table.

The values to insert are treated as an anonymous child "table": initialize the child executor and iterate it to obtain the rows. A sketch of the resulting Next follows.
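Here is what that Next can look like. The calls follow BusTub's Fall 2023 catalog/table-heap API as I remember it (InsertTuple returning std::optional<RID>, KeyFromTuple, GetTableIndexes), but treat the exact signatures and the done_/table_info_ members as assumptions of this sketch:

auto InsertExecutor::Next(Tuple *tuple, RID *rid) -> bool {
  if (done_) {
    return false;  // the count tuple is emitted exactly once
  }
  int32_t inserted = 0;
  Tuple child_tuple;
  RID child_rid;
  while (child_executor_->Next(&child_tuple, &child_rid)) {
    // insert into the table heap (returns std::optional<RID> in Fall 2023)
    auto new_rid = table_info_->table_->InsertTuple(TupleMeta{}, child_tuple);
    if (!new_rid.has_value()) {
      continue;
    }
    // keep every index on this table in sync with the new tuple
    for (auto *index_info : exec_ctx_->GetCatalog()->GetTableIndexes(table_info_->name_)) {
      auto key = child_tuple.KeyFromTuple(table_info_->schema_, index_info->key_schema_,
                                          index_info->index_->GetKeyAttrs());
      index_info->index_->InsertEntry(key, *new_rid, exec_ctx_->GetTransaction());
    }
    inserted++;
  }
  // emit a single integer tuple: the number of rows inserted
  *tuple = Tuple{{ValueFactory::GetIntegerValue(inserted)}, &GetOutputSchema()};
  done_ = true;
  return true;
}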

update

Some thoughts

expression

BusTub abstracts every expression into one of a handful of classes:

AbstractExpression      // base class
ConstantValueExpression // constant value
ColumnValueExpression   // column value: reads some column
ArithmeticExpression    // arithmetic: recursive tree; children are values or arithmetic exprs
ComparisonExpression    // comparison between two expressions
LogicExpression         // logical expression
StringExpression        // string expression: the raw string, upper(), and the like

From the UpdatePlanNode we can obtain all the expressions of the UPDATE clause:

/** The new expression at each column */
std::vector<AbstractExpressionRef> target_expressions_;

For example:

bustub> explain (o,s) update test_1 set colB = 15445;
=== OPTIMIZER ===
// note the value of target_exprs here
Update { table_oid=20, target_exprs=[#0.0, 15445, #0.2, #0.3] } | (__bustub_internal.update_rows:INTEGER)
SeqScan { table=test_1 } | (test_1.colA:INTEGER, test_1.colB:INTEGER, test_1.colC:INTEGER, test_1.colD:INTEGER)

Evaluating each expression then yields the updated tuple:

// insert again
std::vector<Value> insert_values;
for (auto exp : plan_->target_expressions_) {
  // `tuple` holds the old values
  insert_values.push_back(exp->Evaluate(&tuple, table_info->schema_));
}
// note: table_info must be the info of the table being inserted into;
// it is easy to mistakenly use the info of the update plan's child table here
table_heap->InsertTuple(TupleMeta(), Tuple(insert_values, &(table_info->schema_)));

lazy delete

Deleting a tuple seems to be just marking is_delete_ as true. But in the actual code (InsertTuple) I saw no compaction or reuse of deleted space: whenever a page fills up, it simply allocates a new page and never looks back. Presumably this is left unimplemented for simplicity.

An improvement would be simple enough: give each table a fixed page allocation (or a mechanism that grows it once occupancy passes some percentage), then link the pages into a circular list — deleted space gets fully reused while performance stays reasonable.

Implementation

update is not hard either: delete the old tuple, then insert the new one.

delete

delete can be copied wholesale from update; nothing to add.

index_scan

The IndexScanExecutor iterates over an index to retrieve RIDs for tuples. The operator then uses these RIDs to retrieve their tuples in the corresponding table. It then emits these tuples one at a time.

You can test your index scan executor by SELECT * FROM <table> ORDER BY <index column>. We will explain why ORDER BY can be transformed into IndexScan in Task #3.

If the ORDER BY target has no index, BusTub instead generates a SortPlanNode to sort on that key; see Task #3.

BusTub only supports indexes with a single, unique integer column. Our test cases will not contain duplicate keys. The type of the index object in the plan will always be BPlusTreeIndexForTwoIntegerColumn in this project. You can safely cast it and store it in the executor object:

using BPlusTreeIndexForTwoIntegerColumn = BPlusTreeIndex<IntegerKeyType, IntegerValueType, IntegerComparatorType>;

tree_ = dynamic_cast<BPlusTreeIndexForTwoIntegerColumn *>(index_info_->index_.get())

Though the tests seem to contain an index with two key columns?

You can then construct an index iterator from the index object, scan through all the keys and tuple IDs, lookup the tuple from the table heap, and emit all tuples in order.

Right — the Project 2 B+ tree stores only RIDs, and from a RID we can locate the actual physical position.

Some thoughts

index implementation

The index is organized as a B+ tree whose nodes store RIDs. A RID identifies a tuple's physical location, so the index lets us reach a tuple directly and saves disk I/O. The RID structure:

// RID: Record Identifier
// top 32 bits: page id; bottom 32 bits: slot number
class RID {
 public:
  explicit RID(int64_t rid) : page_id_(static_cast<page_id_t>(rid >> 32)), slot_num_(static_cast<uint32_t>(rid)) {}

  inline auto Get() const -> int64_t { return (static_cast<int64_t>(page_id_)) << 32 | slot_num_; }

 private:
  page_id_t page_id_{INVALID_PAGE_ID};
  uint32_t slot_num_{0};  // logical offset from 0, 1...
};

Also, BusTub guarantees that indexed tables contain no duplicate tuples, so the B+ tree is effectively a dense index.

(The duplicate case would get messy: physically, tuples are stored in insertion order, so duplicates might not sit together, and our B+ tree does not support duplicate keys either — so things would break. Supporting duplicates would mean one of two changes: modify the B+ tree to allow duplicate keys, keeping it a dense index; or switch to a chained storage layout that clusters duplicates together, making the B+ tree a sparse index.)

C++ knowledge

Utterly maddening: try as I might, I could not manage to store the index iterator:

// this doesn't work...
std::unique_ptr<BPlusTreeIndexIteratorForTwoIntegerColumn> iterator_;
iterator_ = std::make_unique<BPlusTreeIndexIteratorForTwoIntegerColumn>(std::move(tree_->GetBeginIterator()));

// neither does this...
BPlusTreeIndexIteratorForTwoIntegerColumn iterator_;
iterator_ = std::move(tree_->GetBeginIterator());

With no way out, I ended up storing the tree and constructing the iterator on the fly inside Next — exasperating. I'll revisit this once I've finished C++ Primer or my C++ has leveled up.

Eventually I did solve it. It takes this:

std::unique_ptr<BPlusTreeIndexIteratorForTwoIntegerColumn> current_iterator_;
current_iterator_ = std::make_unique<BPlusTreeIndexIteratorForTwoIntegerColumn>(tree->GetBeginIterator());

plus a move constructor added to the Iterator implementation:

IndexIterator(INDEXITERATOR_TYPE &&it) noexcept;

实现

难绷,本来以为这个index-scan应该是最简单的,毕竟只用调用现成索引接口,没想到居然写了最久,可能足足两三个小时。。。

首先的一个大难点就是如何保存迭代器了。在之前的seq-scan的时候,使用的是unique ptr,然而这里却不行会报一堆奇奇怪怪的错误(具体见一些想法-c++知识)。最后只能换一个思路,不保存迭代器而是保存next_key_了。然而又由于之前b+树的实现bug问题,导致对end iterator解引用是合法的,所以会产生各种奇奇怪怪的错误。解决了这个之后,之前写的insert、update、delete的更新索引部分又出了问题,rid和insert_key弄错了,诸如此类。

总之,解决了这一大堆小问题之后,才总算通过了index-scan的测试,真是令人南蚌。具体改了什么bug可以详情见b8d3ba546cfdea6fc576ad8d668322c87f6386c1这个commit。

同时,也跟上面的sequence scan一样,都需要对标识为deleted的元组进行跳过处理。

这里我也是没想太多……事实上,index scan无需实时检测is_deleted字段并做处理,因为索引是会随着修改实时更新的,被删除的tuple不会在索引中。

Task #2 - Aggregation & Join Executors

In this task you will add an aggregation executor, several join executors, and enable the optimizer to select between a nested loop join and hash join when planning a query.

You will complete your implementation in the following files:

  • src/execution/aggregation_executor.cpp
  • src/execution/nested_loop_join_executor.cpp
  • src/execution/hash_join_executor.cpp
  • src/optimizer/nlj_as_hash_join.cpp

aggregation

The AggregationPlanNode is used to support queries like the following:

EXPLAIN SELECT colA, MIN(colB) FROM __mock_table_1 GROUP BY colA;
EXPLAIN SELECT COUNT(colA), MIN(colB) FROM __mock_table_1;
EXPLAIN SELECT colA, MIN(colB) FROM __mock_table_1 GROUP BY colA HAVING MAX(colB) > 10;
EXPLAIN SELECT DISTINCT colA, colB FROM __mock_table_1;

I.e., aggregate functions plus things like DISTINCT and GROUP BY.

Note that DISTINCT is also implemented through aggregation:

EXPLAIN SELECT DISTINCT colA, colB FROM __mock_table_1;
=== OPTIMIZER ===
Agg { types=[], aggregates=[], group_by=[#0.0, #0.1] } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

The aggregation executor computes an aggregation function for each group of input.

It has exactly one child. The output schema consists of the group-by columns followed by the aggregation columns.

EXPLAIN SELECT colA, MIN(colB) FROM __mock_table_1 GROUP BY colA;
=== OPTIMIZER ===
// `types` marks the aggregate kinds, `aggregates` their targets; `group_by` records any grouping
Agg { types=[min], aggregates=[#0.1], group_by=[#0.0] } | (__mock_table_1.colA:INTEGER, <unnamed>:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)


EXPLAIN SELECT COUNT(colA), min(colB) FROM __mock_table_1
=== OPTIMIZER === // with no GROUP BY, the group_by field is empty
Agg { types=[count, min], aggregates=[#0.0, #0.1], group_by=[] } | (<unnamed>:INTEGER, <unnamed>:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

As discussed in class, a common strategy for implementing aggregation is to use a hash table, with the group-by columns as the key.

In this project, you may assume that the aggregation hash table fits in memory. This means that you do not need to implement a multi-stage, partition-based strategy, and the hash table does not need to be backed by buffer pool pages.

In other words, the implementation here is hash-based rather than merge-sort-based, and for simplicity the hash table lives in memory, so no multi-pass partitioned scans are needed.

We provide a SimpleAggregationHashTable data structure that exposes an in-memory hash table (std::unordered_map) but with an interface designed for computing aggregations. This class also exposes an SimpleAggregationHashTable::Iterator type that can be used to iterate through the hash table. You will need to complete the CombineAggregateValues function for this class.

Note: The aggregation executor itself won’t need to handle the HAVING predicate. The planner will plan aggregations with a HAVING clause as an AggregationPlanNode followed by a FilterPlanNode.

Hint: In the context of a query plan, aggregations are pipeline breakers. This may influence the way that you use the AggregationExecutor::Init() and AggregationExecutor::Next() functions in your implementation. Carefully decide whether the build phase of the aggregation should be performed in AggregationExecutor::Init() or AggregationExecutor::Next().

Some thoughts

countstar

Notably, the implementation distinguishes COUNT(*) from COUNT(column):

enum class AggregationType { CountStarAggregate, CountAggregate };

because the two differ semantically, roughly as follows:

  1. With no input rows, CountStar returns 0 while Count returns integer_null
  2. CountStar counts rows regardless of whether values are null; Count counts only the rows where the requested column is non-null
hash aggregation

For the principle of hash-based aggregation, with worked examples, see this article. The whole series is nicely written; worth a closer read if TiDB interests you.

  In SQL, an aggregate operation computes over a set of values and returns a single value. TiDB implements two aggregation algorithms: Hash Aggregation and Stream Aggregation.

  In Hash Aggregation we maintain a hash table whose keys are the Group-By values and whose values are the intermediate aggregate results (sum, count, ...).

  For each input row, compute its key and update the matching entry in the hash table. Once the input is exhausted, scan the hash table and finalize to obtain the result.

So the plan is clear: feed the child executor's rows into the hash table one by one, then iterate the hash table to produce the output. The key pieces to implement are the hash table's InsertCombine function and its iterator.
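As a sketch, the CombineAggregateValues that InsertCombine delegates to might look like this for a few aggregate kinds. The Value helpers (IsNull, Add, Max) are BusTub's, though the exact usage here — and the null-handling rules encoded in the comments — follow the summary in the next section and should be treated as my assumptions:

void CombineAggregateValues(AggregateValue *result, const AggregateValue &input) {
  for (uint32_t i = 0; i < agg_types_.size(); i++) {
    const Value &in = input.aggregates_[i];
    Value &out = result->aggregates_[i];
    switch (agg_types_[i]) {
      case AggregationType::CountStarAggregate:
        // COUNT(*) counts every row, nulls included
        out = out.Add(ValueFactory::GetIntegerValue(1));
        break;
      case AggregationType::CountAggregate:
        if (in.IsNull()) break;  // COUNT(col) skips nulls
        out = out.IsNull() ? ValueFactory::GetIntegerValue(1) : out.Add(ValueFactory::GetIntegerValue(1));
        break;
      case AggregationType::MaxAggregate:
        if (in.IsNull()) break;  // MAX skips nulls
        out = out.IsNull() ? in : out.Max(in);
        break;
      // MIN / SUM are analogous
      default:
        break;
    }
  }
}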

Implementation

Once the hash-aggregation algorithm is understood, the code logic is not hard; the main remaining difficulty is null handling.

To summarize, BusTub's null handling comes down to these points:

  1. Aggregates vs. null values

    COUNT(*): counts rows containing nulls

    COUNT/MAX/MIN/SUM(v1): skip nulls

  2. Nulls in expressions

    Any operation with a null operand yields null.

    Hence, when there is no GROUP BY (i.e., the hash-table key is empty), you cannot naively insert an empty AggregationKey — stuff some arbitrary value into it instead. Otherwise the hash table's internal comparator, which always returns false on nulls, makes every lookup fail.

  3. Empty-table handling

    When the table is empty, the expected output is:

    select COUNT(*), MAX(v1), COUNT(v1) from table_;
    # 0 integer_null integer_null
    select COUNT(*), MAX(v1), COUNT(v1) from table_ group by v2;
    # no-output

    I honestly don't understand why this is the requirement... so my final code just targets the test cases:

    if (!has_next && plan_->GetGroupBys().empty()) {
      // empty table and no GROUP BY: emit one default key/value pair
      AggregateKey agg_key;
      agg_key.group_bys_.push_back(Value(TypeId::INTEGER, 1));
      aht_->InsertCombine(agg_key, MakeAggregateValue(nullptr));
    }

NestedLoopJoin

The DBMS will use NestedLoopJoinPlanNode for all join operations, by default.

You will need to implement an inner join and left join for the NestedLoopJoinExecutor using the simple nested loop join algorithm from class.

The output schema of this operator is all columns from the left table followed by all columns from the right table.

For each tuple in the outer table, consider each tuple in the inner table and emit an output tuple if the join predicate is satisfied.

So, a join implemented with nested loops — like the sort-merge join from class, one of the classic join implementations.

Nested join is somewhat more involved than the earlier executors: we have to learn how to drive one nested loop through repeated Next calls. Roughly:

Init():
    initialize the left and right children
    advance left once to obtain current_left_tuple_
Next():
    loop forever:
        if advancing right succeeds:
            ;                        // we now hold a (left, right) candidate pair
        else:
            advance left
            re-init right
            continue
        if the join predicate holds for the pair:
            break                    // emit this pair

But several details need handling:

  1. LEFT JOIN

    Extra logic is needed: when the right side is exhausted and current_left_tuple_ was never assembled into any output, pad it with a null right tuple (see the sketch after this list).

  2. Empty tables

    In this branch:

    else:
        advance left
        re-init right
        continue

    you must not do this instead:

    else:
        advance left
        re-init right
        advance right

    This guards the empty-table case, where advancing right always returns false and the subsequent predicate check would hit a null-pointer error.

  3. The tests require left->Next() to be called exactly as many times as right->Init().

    This forces the NestedLoopJoin implementation to not be a pipeline breaker — which duly makes its performance terrible.
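For detail 1, a sketch of the null-padding step; the executor member names are assumptions, and ValueFactory::GetNullValueByType builds a typed NULL:

// pad the unmatched left tuple with typed NULLs for the right side
std::vector<Value> values;
const auto &left_schema = left_executor_->GetOutputSchema();
const auto &right_schema = right_executor_->GetOutputSchema();
for (uint32_t i = 0; i < left_schema.GetColumnCount(); i++) {
  values.push_back(current_left_tuple_.GetValue(&left_schema, i));
}
for (uint32_t i = 0; i < right_schema.GetColumnCount(); i++) {
  values.push_back(ValueFactory::GetNullValueByType(right_schema.GetColumn(i).GetType()));
}
*tuple = Tuple{values, &GetOutputSchema()};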

HashJoin

The DBMS can use HashJoinPlanNode if a query contains a join with a conjunction of equi-conditions between two columns (equi-conditions are separated by AND).

That is, when the join condition is one or more column equalities, a hash join applies — essentially an equi-join.

EXPLAIN SELECT * FROM __mock_table_1, __mock_table_3 WHERE colA = colE;
EXPLAIN SELECT * FROM __mock_table_1 INNER JOIN __mock_table_3 ON colA = colE;
EXPLAIN SELECT * FROM __mock_table_1 LEFT OUTER JOIN __mock_table_3 ON colA = colE;
EXPLAIN SELECT * FROM test_1 t1, test_2 t2 WHERE t1.colA = t2.colA AND t1.colB = t2.colC;
EXPLAIN SELECT * FROM test_1 t1 INNER JOIN test_2 t2 on t1.colA = t2.colA AND t2.colC = t1.colB;
EXPLAIN SELECT * FROM test_1 t1 LEFT OUTER JOIN test_2 t2 on t2.colA = t1.colA AND t2.colC = t1.colB;

You will need to implement the inner join and left join for HashJoinExecutor using the hash join algorithm from class.

The output schema of this operator is all columns from the left table followed by all columns from the right table.

As with aggregation, you may assume that the hash table used by the join fits entirely in memory.

Hint: Your implementation should correctly handle the case where multiple tuples have hash collisions (on either side of the join).

Hint: You will want to make use of the join key accessors functions GetLeftJoinKey() and GetRightJoinKey() in the HashJoinPlanNode to construct the join keys for the left and right sides of the join, respectively.

Hint: You will need a way to hash a tuple with multiple attributes in order to construct a unique key. As a starting point, take a look at how the SimpleAggregationHashTable in the AggregationExecutor implements this functionality.

Hint: As with aggregation, the build side of a hash join is a pipeline breaker. You should again consider whether the build phase of the hash join should be performed in HashJoinExecutor::Init() or HashJoinExecutor::Next().

For what a hash join is exactly, see this article.

(figure: hash join build/probe diagram)

The idea is simple. The hash table is essentially a map<key, vector<value>>; of the two input relations, the smaller is elected as Build (used to build the hash table) and the other as Probe (scanned, iterating over the matching hash-table bucket). It is really a scope-narrowed variant of nested loop join: the inner loop over the entire relation shrinks to a loop over one hash bucket.

Concretely, for simplicity I skip the small-table election and fix the right child as Build and the left child as Probe. For the build, brute-force scan the right table and insert entries keyed by right_key_expressions_ with the Tuple as the value (it all lives in memory anyway...). The probe then follows the earlier nested-loop pattern.
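A sketch of that structure. JoinKey/JoinKeyHash are my own illustrative types (BusTub leaves the hash-table design up to you), and the HashUtil helpers mirror what SimpleAggregationHashTable uses:

// JoinKey / JoinKeyHash are illustrative, not provided by BusTub.
struct JoinKey {
  std::vector<Value> keys_;
  auto operator==(const JoinKey &other) const -> bool {
    for (uint32_t i = 0; i < keys_.size(); i++) {
      if (keys_[i].CompareEquals(other.keys_[i]) != CmpBool::CmpTrue) {
        return false;
      }
    }
    return true;
  }
};

struct JoinKeyHash {
  auto operator()(const JoinKey &key) const -> std::size_t {
    std::size_t hash = 0;
    for (const auto &v : key.keys_) {
      if (!v.IsNull()) {
        hash = HashUtil::CombineHashes(hash, HashUtil::HashValue(&v));
      }
    }
    return hash;
  }
};

// Build phase (in Init): hash every tuple of the build (right) side.
std::unordered_map<JoinKey, std::vector<Tuple>, JoinKeyHash> ht_;
// Probe phase (in Next): compute the probe key from the left tuple, then
// iterate only over ht_[key] instead of the whole inner relation — tuples
// that merely collide still fail the CompareEquals check in operator==.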

Optimizing NestedLoopJoin to HashJoin

Hash joins usually yield better performance than nested loop joins. You should modify the optimizer to transform a NestedLoopJoinPlanNode into a HashJoinPlanNode when it is possible to use a hash join.

Specifically, the hash join algorithm can be used when a join predicate is a conjunction of equi-conditions between two columns. For the purpose of this project, handling a single equi-condition, and also two equi-conditions connected by AND, will earn full credit.

Consider the following example:

bustub> EXPLAIN (o) SELECT * FROM test_1 t1, test_2 t2 WHERE t1.colA = t2.colA AND t1.colB = t2.colC;

Without applying the NLJAsHashJoin optimizer rule, the plan may look like:

NestedLoopJoin { type=Inner, predicate=((#0.0=#1.0)and(#0.1=#1.2)) } 
SeqScan { table=test_1 }
SeqScan { table=test_2 }

After applying the NLJAsHashJoin optimizer rule, the left and right join key expressions will be extracted from the single join predicate in the NestedLoopJoinPlanNode. The resulting plan will look like:

HashJoin { type=Inner, left_key=[#0.0, #0.1], right_key=[#0.0, #0.2] } 
SeqScan { table=test_1 }
SeqScan { table=test_2 }

Note: Please check the Optimizer Rule Implementation Guide section for details on implementing an optimizer rule.

Hint: Make sure to check which table the column belongs to for each side of the equi-condition. It is possible that the column from outer table is on the right side of the equi-condition. You may find ColumnValueExpression::GetTupleIdx helpful.

Hint: The order to apply optimizer rules matters. For example, you want to optimize NestedLoopJoin into HashJoin after filters and NestedLoopJoin have merged. — I read this as being about rule priority: the NLJ-to-hash-join rule should generally run only after the filter has been merged into the NestedLoopJoin.

Hint: At this point, you should pass SQLLogicTests - #14 to #15.

Some thoughts

bustub optimizer

  The BusTub optimizer is a rule-based optimizer. Most optimizer rules construct optimized plans in a bottom-up way. Because the query plan has this tree structure, before applying the optimizer rules to the current plan node, you want to first recursively apply the rules to its children.

So: a rule-based optimizer whose rules are applied bottom-up over the tree — much like what we covered in class.

  In the public BusTub repository, we already provide the implementation of several optimizer rules. Please take a look at them as reference.

The syntax optimization taught in the course is also rule-based; see the figure below and the endless list of identities that follows it:

(figure: relational-algebra equivalence rules)

(figure: the optimization algorithm, as earlier)

(This figure adds one more rule: selection + nested Cartesian product = nested join.)

Looking at the directory src/optimizer/, we see:

$ tree ../src/optimizer/
../src/optimizer/
├── eliminate_true_filter.cpp # remove always-true filters
├── merge_filter_nlj.cpp      # merge filter + nested-loop join
├── merge_filter_scan.cpp     # merge filter + scan
├── merge_projection.cpp      # merge consecutive projections
├── nlj_as_hash_join.cpp      # nested-loop join -> hash join
├── nlj_as_index_join.cpp     # nested-loop join -> index join
├── optimizer.cpp
├── optimizer_custom_rules.cpp
├── optimizer_internal.cpp
├── order_by_index_scan.cpp
└── sort_limit_as_topn.cpp    # optimize top-N queries

Our task in this part is to implement nlj_as_hash_join by imitating the other rules. But first, let's study the overall architecture of the plan optimization.

auto Optimizer::OptimizeCustom(const AbstractPlanNodeRef &plan) -> AbstractPlanNodeRef {
  auto p = plan;
  p = OptimizeMergeProjection(p);     // first merge projections with identical effect
  p = OptimizeMergeFilterNLJ(p);      // then merge filters into nested-loop joins
  p = OptimizeNLJAsHashJoin(p);       // then rewrite nested-loop joins as hash joins
  p = OptimizeOrderByAsIndexScan(p);  // use the index for lookups where possible
  p = OptimizeSortLimitAsTopN(p);     // optimize top-N queries
  return p;
}

As you can see, the actual principle is simple: apply the rules to the plan tree in this priority order.

merge filter nlj

Using OptimizeMergeFilterNLJ as the example, let's examine the overall structure:

auto Optimizer::OptimizeMergeFilterNLJ(const AbstractPlanNodeRef &plan) -> AbstractPlanNodeRef {
  // first optimize all children bottom-up, via DFS
  std::vector<AbstractPlanNodeRef> children;
  for (const auto &child : plan->GetChildren()) {
    children.emplace_back(OptimizeMergeFilterNLJ(child));
  }

  auto optimized_plan = plan->CloneWithChildren(std::move(children));
  // rewrite only when the current node is a filter whose single child is an NLJ
  if (optimized_plan->GetType() == PlanType::Filter) {
    const auto &filter_plan = dynamic_cast<const FilterPlanNode &>(*optimized_plan);
    const auto &child_plan = optimized_plan->children_[0];  // Has exactly one child
    if (child_plan->GetType() == PlanType::NestedLoopJoin) {
      const auto &nlj_plan = dynamic_cast<const NestedLoopJoinPlanNode &>(*child_plan);
      // presumably for simplicity, merge only when the NLJ is a pure Cartesian product;
      // so several consecutive filters apparently can't be handled — maybe the planner
      // avoids producing that case?
      if (IsPredicateTrue(nlj_plan.Predicate())) {
        // rewrite this filter + NLJ pair into a new join node
        return std::make_shared<NestedLoopJoinPlanNode>(
            filter_plan.output_schema_, nlj_plan.GetLeftPlan(), nlj_plan.GetRightPlan(),
            RewriteExpressionForJoin(filter_plan.GetPredicate(),
                                     nlj_plan.GetLeftPlan()->OutputSchema().GetColumnCount(),
                                     nlj_plan.GetRightPlan()->OutputSchema().GetColumnCount()),
            nlj_plan.GetJoinType());
      }
    }
  }
  return optimized_plan;
}

So the merge-filter-NLJ rule is applied bottom-up over the tree, and it merges only filter-over-Cartesian-product nodes. Next, let's look at the implementation of RewriteExpressionForJoin.

First, be clear about how BusTub abstracts expressions. Taking #0.0=#1.0 as the example, the expression tree looks like:

(figure: expression tree for #0.0=#1.0)

Every leaf is a basic expression type (column value, constant value, and so on), and each subtree forms a composite expression type (comparison expr, arithmetic expr, and so on).

Before optimization we take the Cartesian product first and filter afterwards. So suppose t1 has 2 columns, t2 has 2 columns, and the condition is t1.col1 = t2.col4: the unoptimized filter's expr is #0.0=#0.3 (the two tables were glued together by the product). RewriteExpressionForJoin must therefore use the two tables' column counts to turn an expression like #0.0=#0.3 into one like #0.0=#1.1 (really, only the leaves of type column expr need handling). And since expressions are recursive structures, we process all children first. Hence the implementation:

auto Optimizer::RewriteExpressionForJoin(const AbstractExpressionRef &expr, size_t left_column_cnt,
                                         size_t right_column_cnt) -> AbstractExpressionRef {
  // first rewrite all children bottom-up, via DFS
  std::vector<AbstractExpressionRef> children;
  for (const auto &child : expr->GetChildren()) {
    children.emplace_back(RewriteExpressionForJoin(child, left_column_cnt, right_column_cnt));
  }
  // only leaves of type column expr need handling
  if (const auto *column_value_expr = dynamic_cast<const ColumnValueExpression *>(expr.get());
      column_value_expr != nullptr) {
    // in #0.1, "0" is the tuple_idx and "1" is the col_idx;
    // tuple_idx must still be 0 here, since the filter node has a single child
    BUSTUB_ENSURE(column_value_expr->GetTupleIdx() == 0, "tuple_idx cannot be value other than 0 before this stage.")
    auto col_idx = column_value_expr->GetColIdx();
    if (col_idx < left_column_cnt) {
      return std::make_shared<ColumnValueExpression>(0, col_idx, column_value_expr->GetReturnType());  // rewrite to #0.X
    }
    if (col_idx >= left_column_cnt && col_idx < left_column_cnt + right_column_cnt) {
      return std::make_shared<ColumnValueExpression>(1, col_idx - left_column_cnt,
                                                     column_value_expr->GetReturnType());  // rewrite to #1.X
    }
    throw bustub::Exception("col_idx not in range");
  }

  // xiunian: do nothing if the expression contains no column value expression
  return expr->CloneWithChildren(children);
}

Implementation

  Specifically, the hash join algorithm can be used when a join predicate is a conjunction of equi-conditions between two columns. For the purpose of this project, handling a single equi-condition, and also two equi-conditions connected by AND, will earn full credit.

Having read merge filter nlj, this task is no longer that hard.

When an NLJ's predicate is a pile of "=" exprs joined by AND, the NLJ can be turned into a hash join. And since OptimizeNLJAsHashJoin runs after OptimizeMergeFilterNLJ, we can simply test and rewrite every NLJ node directly.

Concretely, first implement a function CheckIfEquiConjunction that, given an expression tree, decides whether it consists only of AND, "=", and column exprs. Along the way it must also do one more thing: extract the key expressions the hash join needs. E.g., if the NLJ's join condition is #0.1=#1.2 AND #1.1=#0.2, the resulting hash join is left_key_expr=[#0.1, #0.2], right_key_expr=[#1.2, #1.1].

Then, in the body of OptimizeNLJAsHashJoin, simply walk all nodes of the tree, test each one, and convert the qualifying NLJs into hash joins.
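A sketch of that check/extraction (CheckIfEquiConjunction is my own helper, not a provided API; the expression members used here — logic_type_, comp_type_, GetTupleIdx — are BusTub's):

// Returns true iff expr is a conjunction of column-to-column equalities,
// appending each side's key expression along the way.
auto CheckIfEquiConjunction(const AbstractExpressionRef &expr,
                            std::vector<AbstractExpressionRef> *left_keys,
                            std::vector<AbstractExpressionRef> *right_keys) -> bool {
  if (const auto *logic = dynamic_cast<const LogicExpression *>(expr.get());
      logic != nullptr && logic->logic_type_ == LogicType::And) {
    return CheckIfEquiConjunction(expr->GetChildAt(0), left_keys, right_keys) &&
           CheckIfEquiConjunction(expr->GetChildAt(1), left_keys, right_keys);
  }
  if (const auto *cmp = dynamic_cast<const ComparisonExpression *>(expr.get());
      cmp != nullptr && cmp->comp_type_ == ComparisonType::Equal) {
    auto *lhs = dynamic_cast<const ColumnValueExpression *>(cmp->GetChildAt(0).get());
    auto *rhs = dynamic_cast<const ColumnValueExpression *>(cmp->GetChildAt(1).get());
    if (lhs == nullptr || rhs == nullptr) {
      return false;
    }
    // the outer-table column may sit on either side: route by tuple index
    if (lhs->GetTupleIdx() == 0 && rhs->GetTupleIdx() == 1) {
      left_keys->push_back(cmp->GetChildAt(0));
      right_keys->push_back(cmp->GetChildAt(1));
      return true;
    }
    if (lhs->GetTupleIdx() == 1 && rhs->GetTupleIdx() == 0) {
      left_keys->push_back(cmp->GetChildAt(1));
      right_keys->push_back(cmp->GetChildAt(0));
      return true;
    }
    return false;
  }
  return false;
}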

Task #3 - Sort + Limit Executors and Top-N Optimization

You will finally implement a few more common executors, completing your implementation in the following files:

  • src/execution/sort_executor.cpp
  • src/execution/limit_executor.cpp
  • src/execution/topn_executor.cpp
  • src/optimizer/sort_limit_as_topn.cpp

You must implement the IndexScanExecutor in Task #1 before starting this task. If there is an index over a table, the query processing layer will automatically pick it for sorting. In other cases, you will need a special sort executor to do this.

For all order by clauses, we assume every sort key will only appear once. You do not need to worry about ties in sorting.

Sort

If a query’s ORDER BY attributes don’t match the keys of an index, BusTub will produce a SortPlanNode for queries such as:

EXPLAIN SELECT * FROM __mock_table_1 ORDER BY colA ASC, colB DESC;

If the key being sorted on is not an index key, this sort executor gets used.

This plan node has the same output schema as its input schema. You can extract sort keys from order_bys, and then use std::sort with a custom comparator to sort the child node's tuples. You may assume that all entries in a table will fit entirely in memory.

If the query does not include a sort direction (i.e., ASC, DESC), then the sort mode will be default (which is ASC).

Some thoughts

comparator implementation

  I have a class Tuple and another class Executor. I want a comparison function over Tuples that needs member variables of Executor — how do I write a cmp usable with std::sort?

The answer it pointed me to: implement a function object.

struct CompareTuplesByOrder {
  Schema schema_;
  // add any members you need

  CompareTuplesByOrder(Schema schema, const std::vector<std::pair<OrderByType, AbstractExpressionRef>> &order_by)
      : schema_(schema) {}

  // overload operator()
  bool operator()(const Tuple &t1, const Tuple &t2) const {
    // any logic here
  }
};

// use in sort
std::sort(tuples_.begin(), tuples_.end(), CompareTuplesByOrder(GetOutputSchema(), plan_->GetOrderBy()));

So the essence is overloading operator() — a neat trick, I thought.

Implementation

The suggested approach is simple: extract all the keys from the sort plan node and call std::sort — ascending by default, with every entry in memory.

One thing worth noting: SQL sorting may specify multiple keys with different directions (key order gives the priority), e.g. order by col1 ASC, col3 DESC. So the comparator must compare key by key in priority order (i.e., the order of the order_by_ array).

The harder part is really the C++: writing a fairly involved comparator for std::sort. See "Some thoughts - comparator implementation" above; a possible operator() body is sketched below.
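One possible operator() body under those rules (sketch; order_bys_ is the stored order-by vector, and DEFAULT is treated like ASC):

bool operator()(const Tuple &t1, const Tuple &t2) const {
  for (const auto &[type, expr] : order_bys_) {  // keys in priority order
    Value v1 = expr->Evaluate(&t1, schema_);
    Value v2 = expr->Evaluate(&t2, schema_);
    if (v1.CompareEquals(v2) == CmpBool::CmpTrue) {
      continue;  // tie on this key: fall through to the next key
    }
    bool less = v1.CompareLessThan(v2) == CmpBool::CmpTrue;
    // DESC flips the comparison; DEFAULT behaves like ASC
    return (type == OrderByType::DESC) ? !less : less;
  }
  return false;  // all keys equal
}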

Limit

The LimitPlanNode specifies the number of tuples that query will generate. Consider the following example:

EXPLAIN SELECT * FROM __mock_table_1 LIMIT 10;

The LimitExecutor constrains the number of output tuples from its child executor. If the number of tuples produced by its child executor is less than the limit specified in the plan node, this executor has no effect and yields all of the tuples that it receives.

This plan node has the same output schema as its input schema. You do not need to support offsets.

Pretty simple — just cap the number of output tuples; nothing more to say.

Top-N Optimization Rule

Finally, you should modify BusTub’s optimizer to efficiently support top-N queries. (These were called top-K queries in class.) Consider the following:

EXPLAIN SELECT * FROM __mock_table_1 ORDER BY colA LIMIT 10;

By default, BusTub will plan this query as a SortPlanNode followed by a LimitPlanNode. This is inefficient because a heap can be used to keep track of the smallest 10 elements far more efficiently than sorting the entire table.

Implement the TopNExecutor and modify the optimizer to use it for queries containing ORDER BY and LIMIT clauses.

An example of the optimized plan of this query:

TopN { n=10, order_bys=[(Default, #0.0)]} | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)
MockScan { table=__mock_table_1 } | (__mock_table_1.colA:INTEGER, __mock_table_1.colB:INTEGER)

Hint: See OptimizeSortLimitAsTopN for more information, and check the Optimizer Rule Implementation Guide for details on implementing an optimizer rule.

Note: At this point, your implementation should pass SQLLogicTests #16 to #19. Integration-test-2 requires you to use release mode to run.

I suppose the textbook approach is quickselect-style partitioning. Here, though, I'll use an even simpler idea: maintain an ordered set of at most limit_ elements, compare each incoming element against the set's boundary element, and evict the worst one when the set overflows. That is O(n log k); when k is small, I expect it still beats sorting everything.
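A sketch of that idea (assuming the CompareTuplesByOrder functor from the sort executor and TopNPlanNode::GetN()):

// cmp is the CompareTuplesByOrder functor from the sort executor (assumption)
std::multiset<Tuple, CompareTuplesByOrder> top(cmp);
Tuple child_tuple;
RID child_rid;
while (child_executor_->Next(&child_tuple, &child_rid)) {
  top.insert(child_tuple);
  if (top.size() > plan_->GetN()) {
    top.erase(std::prev(top.end()));  // evict the current worst element: O(log N)
  }
}
// `top` now holds the N front-most tuples already in output order,
// for O(n log N) total instead of sorting the whole input.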

A side note

I was about to submit happily when I noticed my checkout didn't quite match the latest upstream. The rebase ate what felt like a whole afternoon (and even surfaced an old bug of mine) before I finally submitted and got full score...

(figure: Gradescope full score)

Every course term has its own little tag in the BusTub repository. I didn't notice at first and brute-force rebased onto the latest main — half a day down the drain.

(figure: BusTub release tags)

Leaderboard Task

Final optimization results:

(figure: leaderboard timing results)

Query 1: Where’s the Index?

Overview

Consider the following sample database:

CREATE TABLE t1(x INT, y INT, z INT);
CREATE INDEX t1xy ON t1(x, y);

Now a user comes along and executes the following query.

SELECT * FROM t1 WHERE x >= 90 AND y = 10

Even though there is an index on t1.x and t1.y, BusTub does not pick it for the scan!

Recommended Optimizations: Use index scan for this query. Create a clustered index to store the value within the index (you may want to change the value type of index being created). Furthermore, add functionality to the B+ tree iterator to quickly skip keys. You may add your optimizer rule to optimizer_custom_rules.cpp. You may also want to modify the index scan plan to save the predicate.

Experimentation shows that stock BusTub only rewrites ORDER BY on indexed keys into an index scan:

# not optimized to index-scan
bustub> explain (o) SELECT * FROM t1 WHERE x = 90 AND y = 10;
=== OPTIMIZER ===
Filter { predicate=((#0.0=90)and(#0.1=10)) }
SeqScan { table=t1 }
# optimized to index-scan
bustub> explain (o) select * from t1 order by x, y;
=== OPTIMIZER ===
IndexScan { index_oid=0 }

So our optimization task in this part: rewrite filter+seqscan into index-scan for single-table queries whose condition only compares indexed columns against constants. And since BusTub already guarantees that indexed attributes have no duplicate keys, what we most need to care about is the range lookup it mentions.

Moreover, BusTub actually simplifies the requirement: we only need to optimize the single leaderboard case:

21
create table t1(x int, y int, z int);
create index t1xy on t1(x, y);

INSERT INTO t1 SELECT * FROM __mock_t1;
----
1000000

explain (o) select * from t1 where x >= 90 and y = 10;

select * from t1 where x >= 90 and y = 10;
----
91 10 910010
92 10 920010
95 10 950010
93 10 930010
98 10 980010
96 10 960010
90 10 900010
99 10 990010
97 10 970010
94 10 940010

We can split this task into three parts:

  1. Add an optimizer rule that rewrites filter + seq scan nodes into an index scan

    The filter condition must consist of expressions of the form column <comparison> constant, possibly connected by AND.

  2. Extend the index executor to support range lookups (in this case, the range is [(90,10), (+∞,10)])

    So the index executor gains two new fields, current_iterator_ and end_iterator_: the former's initial value and the latter mark the range's left and right endpoints.

  3. Modify the underlying B+ tree so the iterator can locate those endpoints (see "modifying the underlying B+ tree" below)

This took a full day to write.

Implementation

optimizer rule

My very first step was to enable the dormant OptimizeMergeFilterScan, i.e., merging filter and sequence scan; this requires changing part of the seqscan executor to evaluate the predicate inside its loop. The merge genuinely seems to bring more benefit than harm.

Next, we add the custom rule. I placed its priority after OptimizeMergeFilterScan, so the check only needs to examine sequence scan nodes.

For the new rule I took roughly the same approach as nlj->hash join (really, the whole structure imitates OptimizeOrderByAsIndexScan): detect a sequence scan node, validate its filter expression, and convert it to an index scan if legal. The implementation here is considerably messier, though; the notable issues:

  1. Matching the index

    Creating an index scan plan node requires the index OID. Following OptimizeOrderByAsIndexScan, extract the column-value indexes from the condition, resolve the column names via the schema, and compare them one by one:

    const auto &columns = index->key_schema_.GetColumns();
    // check index key schema == the condition's columns
    bool valid = true;
    if (columns.size() == column_indexes.size()) {
      for (size_t i = 0; i < columns.size(); i++) {
        if (columns[i].GetName() != table_info->schema_.GetColumn(column_indexes[i]).GetName()) {
          valid = false;
          break;
        }
      }

      if (valid) {
        // do something and return
      }
    }
  2. The check function

    The goal here is to detect expressions of the form column <comparison> constant joined by AND, and along the way collect all the column values plus the lower/upper bound values (decided by inspecting "=", ">", "<", and so on). The recursion is not too bad. A simplified form of the code:

    auto Optimizer::CheckIfColumnConstantConjunction(const AbstractExpressionRef &expr,
                                                     std::vector<uint32_t> *column_indexes,
                                                     std::vector<Value> *first_values,
                                                     std::vector<Value> *last_values) -> bool {
      // if it is an "AND", just check whether all its children qualify
      if (expr->type == LogicType::And) {
        // check children
        // std::all_of courtesy of clang-format's nagging — learned something new
        return std::all_of(expr->GetChildren().begin(), expr->GetChildren().end(),
                           [this, column_indexes, first_values, last_values](const AbstractExpressionRef &child) {
                             return CheckIfColumnConstantConjunction(child, column_indexes, first_values, last_values);
                           });
      }

      // otherwise it may only be a comparison expr whose left child is a column value
      // and whose right child is a constant value
      if (expr->type == Comparison) {
        if (left_child is column) {
          if (right_child is constant) {
            // fill in column_indexes, first_values, last_values
            // one extra, more realistic touch: deduplicate column indexes,
            // to handle expressions like: v1 > 10 AND v1 < 20
            if (!exist) {
              // append only if unseen; both values default to 0
              idx = static_cast<int>(column_indexes->size());
              column_indexes->push_back(column_value_expr->GetColIdx());
              first_values->push_back(Value(TypeId::INTEGER, 0));
              last_values->push_back(Value(TypeId::INTEGER, 0));
            }

            std::vector<Column> columns;
            auto v = constant_value_expr->Evaluate(nullptr, Schema(columns));
            // ">" or ">=" may raise the lower bound
            if (comparison_expr->comp_type_ == ComparisonType::GreaterThan ||
                comparison_expr->comp_type_ == ComparisonType::GreaterThanOrEqual) {
              if ((*first_values)[idx].CompareLessThan(v) == CmpBool::CmpTrue) {
                (*first_values)[idx] = v;
              }
            }
            // likewise "<" or "<=" may lower the upper bound,
            // and "=" updates both bounds unconditionally
            ...
            return true;
          }
        }
        return false;
      }
      return false;
    }

Notice that I effectively defined a convention here: if all members of a values vector are 0, that side of the interval is unbounded. The main function has to handle this when it returns.

modifying the executor

This part is comparatively simple. We only need to add the initialization of current_iterator_ and end_iterator_:

// the end iterator is handled much the same way
if (plan_->first_key_.has_value()) {
  current_iterator_ = std::make_unique<BPlusTreeIndexIteratorForTwoIntegerColumn>(
      tree->GetBeginIterator(plan_->first_key_.value(), true, false));
} else {
  current_iterator_ = std::make_unique<BPlusTreeIndexIteratorForTwoIntegerColumn>(tree->GetBeginIterator());
}

Then adjust the iteration condition:

  if (current_iterator_->IsEnd() || *current_iterator_ == *end_iterator_) {
    return false;
  }

} while (tuple_meta.is_deleted_ ||
         (plan_->Predicate() != nullptr && !(plan_->Predicate()->Evaluate(&tuple, GetOutputSchema()).GetAs<bool>())));

modifying the underlying B+ tree

Here I went one step beyond what the lab requires and handled the case where a range endpoint is not an existing key, via "ambiguous" lookups:

  1. If the left endpoint is not a key, the index returns the position of the first element of the leaf node it would occupy;
  2. If the right endpoint is not a key, the index returns the position of the first element of the leaf node following the one it would occupy.

In effect: lower-bound and upper-bound.

Implementing this adds two parameters to the B+ tree lookup: is_ambiguous, enabling ambiguous lookup (return the leaf's starting position directly), and is_end_ambiguous, which when set returns the starting position of the following leaf.

if (root->IsLeafPage()) {
  if (is_ambiguous) {
    if (is_end_ambiguous) {
      return INDEXITERATOR_TYPE(bpm_, root->GetNextPageId());
    } else {
      return INDEXITERATOR_TYPE(bpm_, res_pgid);
    }
  }

  auto it = INDEXITERATOR_TYPE(bpm_, res_pgid);
  while (!(it.IsEnd() || comparator_((*it).first, key) == 0)) {
    ++it;
  }
  return INDEXITERATOR_TYPE(it);
}

With this, fuzzy range lookups work perfectly.

Other small issues

  1. Compatibility with the other executors

    As everyone knows, anything called an "iterator" shares one flaw: no modification while iterating. An index scan built on the B+ tree iterator is no different, so the "delete" and "update" operations need special treatment.

    For delete I first used a small trick: remember a last_tuple, and whenever the iteration reaches a leaf node, apply the modification to the previous position instead. That mostly avoids the case where the leaf currently being iterated and the leaf being deleted from are the same node.

    Later I decided that more cleverness meant more bugs, so I went blunt: simply do not apply the index-scan optimization under delete and update subtrees at all — in the optimizer rule, bail out immediately upon seeing an update or delete node. Graceful surrender is an art too.

  2. A deadlock bug

    My original implementation held a read lock for the iterator's entire lifetime, which deadlocks under modify-while-iterate. I then heard that iterators are not required to be concurrency-safe, so I simply dropped the lock — even though my current blunt implementation never modifies while iterating anyway.

Shortcomings

At the same time, my implementation has the following deficiencies, so for now it only serves the test cases:

  1. Fuzzy lookup

    I described the fuzzy-lookup implementation in detail above. Sadly, it requires changing function definitions in b_plus_index.h, which cannot be submitted to Gradescope, so regrettably I had to ban the feature and ship a version where both range endpoints must be existing keys.

  2. Partial index match

    Given a condition like v1 = XX AND v2 > XX AND v3 < 10 with an index only on (v1, v2), my implementation will not convert it to an index scan, because my index matching there is a full match. Doing this properly would need something like longest-prefix matching.

Query 2: Too Many Joins!

Overview

Consider the following sample database:

CREATE TABLE t4(x int, y int);
CREATE TABLE t5(x int, y int);
CREATE TABLE t6(x int, y int);

The user is not from CMU and they are writing terrible SQL. They forgot how to write queries with joins, so they put all predicates in the WHERE clause.

Lol, honestly that's just me... which shows it pays to actually write JOINs in everyday SQL.

SELECT * FROM t4, t5, t6
WHERE (t4.x = t5.x) AND (t5.y = t6.y) AND (t4.y >= 1000000)
AND (t4.y < 1500000) AND (t6.x >= 100000) AND (t6.x < 150000);

Recommended Optimizations: Decompose the filter condition to extract hash join keys, and push down the remaining filter conditions to be below the hash join.

This is the "do selections early" we learned in class; the more academic name is predicate pushdown. With the leaderboard example, the goal is to end up with this effect:

explain (o) select count(*), max(__mock_t4_1m.x), max(__mock_t4_1m.y), max(__mock_t5_1m.x), max(__mock_t5_1m.y), max(__mock_t6_1m.x), max(__mock_t6_1m.y)
from __mock_t4_1m, __mock_t5_1m, __mock_t6_1m
where (__mock_t4_1m.x = __mock_t5_1m.x)
and (__mock_t6_1m.y = __mock_t5_1m.y)
and (__mock_t4_1m.y >= 1000000)
and (__mock_t4_1m.y < 1500000)
and (__mock_t6_1m.x < 150000)
and (__mock_t6_1m.x >= 100000);
==============================================================
Agg { types=[count_star, max, max, max, max, max, max], aggregates=[1, #0.0, #0.1, #0.2, #0.3, #0.4, #0.5], group_by=[] }
HashJoin { type=Inner, left_key=[#0.3], right_key=[#1.1] }
HashJoin { type=Inner, left_key=[#0.0], right_key=[#1.0] }
Filter { predicate=((#0.1>=1000000)and(#0.1<1500000)) }
MockScan { table=__mock_t4_1m }
MockScan { table=__mock_t5_1m }
Filter { predicate=((#0.0<150000)and(#0.0>=100000)) }
MockScan { table=__mock_t6_1m }

This figure (from the linked article) shows it very clearly:

(figure: predicate pushdown — before and after plan trees)

Implementation

predicate push down

Subjectively, this task's conceptual difficulty and rule complexity are well above Q1's, but Q1 sprawled so widely that the two ended up costing similar time — roughly six or seven hours for this one.

My overall approach is a recursive driver of this shape:

auto Optimizer::PredicatePushdown(const AbstractPlanNodeRef &plan, AbstractExpressionRef &expr) -> AbstractPlanNodeRef

It takes a plan node and an expr: plan is the node currently being processed, and expr is the selection predicate pushed down from its ancestors.

Since I run OptimizePredicatePushdown after OptimizeMergeFilterNLJ and OptimizeMergeFilterScan, only nlj nodes, seq scan nodes, and (rarely) filter nodes need special treatment, as follows; a small classification helper used by the split step is sketched after the list.

  1. nlj nodes

    First AND together the ancestors' pushed-down expr and the node's own predicate, then use a custom utility function SplitExpression to split the result into three classes: expressions touching only the left child (like #0.0 = #0.3 or #0.0 > 100), expressions touching only the right child, and expressions touching both children (like #0.1 > #1.2).

    The first two classes become the arguments for recursing into the left and right children; the last class involves both children, so it stays behind as this nlj node's predicate.

  2. seq scan nodes

    Just recurse through the children with the pushdown, then AND the incoming expr onto the node's own original predicate.

  3. filter nodes

    A filter that survives both merge passes (OptimizeMergeFilterNLJ and OptimizeMergeFilterScan) is a rare tough case; I treat it the same way as seq scan.
That completes the overall pushdown logic. What remains is the rather tedious SplitExpression. Logically there isn't much to it — it's just dizzying amounts of creating pointer objects back and forth... e.g., this mouthful:

(*child_expressions)[left_tuple_idx] = std::make_shared<ComparisonExpression>(std::make_shared<ColumnValueExpression>(0, left_column_value_expr->GetColIdx(), left_column_value_expr->GetReturnType()), comparison_expr->GetChildAt(1), comparison_expr->comp_type_);

Also remember that SplitExpression must handle expressions matching none of the three shapes (such as v1 + 5 = c1 from other test cases): when all three matches fail, shove the expression into middle (the third class):

if (!(*middle)) {
  *middle = expr;
} else {
  *middle = std::make_shared<LogicExpression>(*middle, expr, LogicType::And);
}

Other small issues
  1. About MockScan

    Annoyingly, there seems to be no way to merge filters into it, because we are not allowed to modify the MockScan classes. So my implementation, upon meeting a mockscan node, manually grafts a filter parent onto it.

  2. When to run OptimizePredicatePushdown

    Like this blogger, I agonized for a while over OptimizePredicatePushdown's priority:

    When handling the predicate pushdown rule, the timing had me torn for a while:

    • before the OptimizeMergeFilterNLJ rule, the Filter's column indexes would still need reassigning
    • after the OptimizeMergeFilterNLJ rule, pushdown in turn produces Filter nodes that were never merged

    In the end, for simplicity, I do predicate pushdown after OptimizeMergeFilterNLJ, and before recursing into a child's pushdown I call OptimizeMergeFilterNLJ again on the newly produced children. Lazy, but it works.

    I did much the same, except I call that rule's helper RewriteExpressionForJoin to handle the unmerged filter expression handed down from above:

    if (nlj_plan.Predicate() != nullptr && nlj_plan.Predicate()->ToString() != "true") {
      SplitExpression(nlj_plan.Predicate(), &child_expressions, &middle, left_size, right_size);
    }

    if (expr != nullptr && expr->ToString() != "true") {
      // preprocess before split
      expr = RewriteExpressionForJoin(expr, left_size, right_size);
      SplitExpression(expr, &child_expressions, &middle, left_size, right_size);
    }

Query 3: The Mad Data Scientist

Overview

Consider the following example:

SELECT v, d1, d2 FROM (
SELECT v,
MAX(v1) AS d1, MIN(v1), MAX(v2), MIN(v2),
MAX(v1) + MIN(v1), MAX(v2) + MIN(v2),
MAX(v1) + MAX(v1) + MAX(v2) AS d2
FROM t7 LEFT JOIN (SELECT v4 FROM t8 WHERE 1 == 2) ON v < v4
GROUP BY v
)

(This is not the same as the actual leaderboard query; refer to the test file.)

Recommended Optimizations:

  1. Column pruning – you only need to compute v, d1, d2 from the left table in aggregation,
  2. common expression elimination,
  3. transform always false filter to dummy scan (values plan node of zero rows)

Hint: You do not need to implement a complete rule for optimizing these queries. (1) a complete predicate pushdown requires you to handle all plan nodes – limit, order by, etc. But to optimize for Q2, you only need to implement push down predicates over hash join / nested loop joins. (2) a complete join reordering requires you to handle predicates correctly (and maybe absorb filters in-between back to the join predicate), and you do not need to do that. Just make your optimizer work with those queries is enough.

So what this part asks for is always-false filter pruning and column pruning. Concretely, three steps:

  1. eliminate always-false conditions

  2. eliminate useless projections

  3. eliminate common subexpressions

Implementation

eliminate always false filter

This optimization must run before OptimizeMergeFilterScan.

Walk the tree top-down; when a filter node's condition is provably always false, replace its subtree with an empty values node.

For the always-false check, I only examine expressions built from AND or OR over constant-only comparisons: with AND, one provably-false child suffices; with OR, both children must be provably false. A sketch of such a check follows.
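A sketch of the check (IsAlwaysFalse is my own name; constant-only comparisons are folded by evaluating them directly, which ConstantValueExpression permits since it ignores the tuple):

auto IsAlwaysFalse(const AbstractExpressionRef &expr) -> bool {
  if (const auto *logic = dynamic_cast<const LogicExpression *>(expr.get()); logic != nullptr) {
    bool l = IsAlwaysFalse(expr->GetChildAt(0));
    bool r = IsAlwaysFalse(expr->GetChildAt(1));
    // AND is false if either side is provably false; OR needs both sides false
    return logic->logic_type_ == LogicType::And ? (l || r) : (l && r);
  }
  if (const auto *cmp = dynamic_cast<const ComparisonExpression *>(expr.get()); cmp != nullptr) {
    auto *lhs = dynamic_cast<const ConstantValueExpression *>(cmp->GetChildAt(0).get());
    auto *rhs = dynamic_cast<const ConstantValueExpression *>(cmp->GetChildAt(1).get());
    if (lhs == nullptr || rhs == nullptr) {
      return false;  // not constant-only: cannot prove anything
    }
    // fold the constant comparison directly, e.g. WHERE 1 = 2
    std::vector<Column> cols;
    Value v = cmp->Evaluate(nullptr, Schema(cols));
    return !v.IsNull() && !v.GetAs<bool>();
  }
  return false;  // anything else: conservatively not always-false
}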

column pruning

In BusTub, a complete projection pushdown feels rather more complex than predicate pushdown. Predicate pushdown only needs to handle the node types that carry predicates (filter, scan, join) and trim or concatenate the predicate, whereas projection pushdown would have to remember the currently needed columns and then walk every node pruning as it goes — quite involved. So, for simplicity, I only handle runs of consecutive projection nodes plus the projection-over-aggregation case.

I run this optimization after OptimizeMergeProjection and before the other merge rules.

For simplicity (read: tailored to the test cases...), I only handle at least two consecutive projections where the parent projection selects pure column values (i.e., no expressions like d1+d2), instead of handling every node the way predicate pushdown does. That is, turn this:

select d1, d2 from
select v1 as d1, v1 + v2 as d2, v3 as d3 from t1;

into this:

select v1, v1 + v2 from t1;

So: walk the parent projection node, keep only the columns it needs to construct a new child node, and then recurse on the result once more:

return OptimizeColumnPruning(std::make_shared<ProjectionPlanNode>(
    std::make_shared<Schema>(parent_projection_plan.OutputSchema()), expressions, child_plan.GetChildAt(0)));

This merges every qualifying run of consecutive projections into a single projection node.

common expression elimination

That is, an optimization like this:

========== before ==========
select `#0.0`, `#0.1`, `((#0.2+#0.3)+#0.4)`
Agg { types=[max, max, max, max], aggregates=[#0.1, #0.1, #0.1, #0.2], group_by=[#0.0] }

========== after ==========
select `#0.0`, `#0.1`, `((#0.1+#0.1)+#0.2)`
Agg { types=[max, max], aggregates=[#0.1, #0.2], group_by=[#0.0] }

Note: an aggregation's output schema is the group-bys followed by the aggregates; hence #0.0 is the group-by column and the aggregate columns start at #0.1.

My approach: after column pruning is done, special-case the local projection-over-aggregation shape (test-case-oriented again). Achieving the rewrite above takes two things:

  1. deduplicate the aggregation's aggregates and types
  2. rewrite the projection's column references

For both, I introduce a position map: for each column value, the key is its original position and the value is its position after deduplication. For the example above, the position map is:

// for the group-by value #0.0; we assume group-by columns always come first after processing
[0, 0],
[1, 1], [2, 1], [3, 1], [4, 2]

The position map is filled while deduplicating the aggregation's aggregates and types, and consulted while rewriting the projection's column references. And with that, common subexpression elimination is done.