bailaohe's blog


Learning Rust from the tikv Source Code (0): Introduction and Preparation

Posted on 2017-07-29   |   Category: programming

For a while now I have been paying close attention to the open-source project tidb. My personal take is that it and Ant Financial's OceanBase attack the same pain point, application scaling, at the database layer but from two different levels: OceanBase targets financial-grade applications and therefore puts more emphasis on cross-datacenter transactional consistency and high availability, while tidb feels more "approachable" by comparison. As a backend programmer, how great would it be to one day hand every persistence-scaling problem over to the database, write a single set of business code, and have it cruise along anywhere from dozens to hundreds of thousands of concurrent requests!

Embarrassingly, though, I have not even gotten started with Rust, the language tikv (tidb's storage service) is written in, so hoping to analyze the code smoothly and then contribute is a bit unrealistic. From my earlier brief attempts, Rust's learning curve is very steep: without a real project to work against, chewing through the rustbook over and over makes it hard to digest so many scattered, intricate features. So I decided to change tack and start from tikv, studying how a high-quality Rust project is put together and learning as I read; the lessons should sink in much deeper that way.

My initial plan is to start from a few tikv modules that interest me, analyze the code from the outside in, and, drawing on my bit of prior experience with database storage development, learn the principles and architecture of distributed database storage. For the Rust language features each piece of code uses, I will go back to the rustbook or other learning material to study and summarize them. I hope I can keep it up.

Learning / Development Environment

  • OS: macOS Sierra 10.12.5
  • IDE: Visual Studio Code 1.14.2 (plugins: rust 0.4.2 + racer)
  • Rust: nightly-2017-05-29-x86_64-apple-darwin managed via rustup (tikv is built and tested against this toolchain)

The First PR

To give myself a push to take the first step, I joined PingCAP's community event 十分钟成为Contributor (Become a Contributor in Ten Minutes) and submitted my first PR to tikv. The PR itself is nothing worth talking about, just a simple abs built-in function, but for someone with only theoretical knowledge of Rust, compiling tikv end to end for the first time meant stepping into a few pits and picking up plenty of hands-on experience.

Nightly toolchain, jemalloc and libc

Like most Rust projects, tikv is nightly-only. After upgrading to the latest nightly with rustup, building tikv produced the following error:

[image: jemalloc compile error]

Digging around in rustup's lib directory, sure enough there were two rlib files for libc; before 1.20 there seems to have been only one. I searched online for a long time without finding the cause, so for now I obediently build with the nightly-2017-05-29-x86_64 toolchain recommended by PingCAP.

Note: I later found out that ...

The librocksdb version

Under the hood tikv uses Facebook's RocksDB as its single-node KV store. RocksDB is a C++ project, so the version of its header files matters a great deal. As this post was being written, tikv had just bumped its RocksDB dependency from 5.5.1 to 5.6.1. Without the matching RocksDB headers installed, you get a compile error like this:

running: "c++" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-g" "-m64" "-std=c++11" "-o" "/Users/baihe/project/github/tikv/target/debug/build/librocksdb_sys-865a78dfa907ba49/out/crocksdb/c.o" "-c" "crocksdb/c.cc"
cargo:warning=crocksdb/c.cc:2115:12: error: no member named 'max_background_jobs' in 'rocksdb::Options'; did you mean 'max_background_flushes'?
cargo:warning= opt->rep.max_background_jobs = n;
cargo:warning= ^~~~~~~~~~~~~~~~~~~
cargo:warning= max_background_flushes
cargo:warning=/usr/local/include/rocksdb/options.h:506:7: note: 'max_background_flushes' declared here
cargo:warning= int max_background_flushes = 1;
cargo:warning= ^
cargo:warning=crocksdb/c.cc:3181:11: warning: 7 enumeration values not handled in switch: 'kColumnFamilyName', 'kFilterPolicyName', 'kComparatorName'... [-Wswitch]
cargo:warning= switch (prop) {
cargo:warning= ^
cargo:warning=crocksdb/c.cc:3210:11: warning: 10 enumeration values not handled in switch: 'kDataSize', 'kIndexSize', 'kFilterSize'... [-Wswitch]
cargo:warning= switch (prop) {
cargo:warning= ^
cargo:warning=2 warnings and 1 error generated.
exit code: 1

rustfmt issues

The PingCAP team uses rustfmt 0.6; with the latest version, the test cases fail to build.

rust-clippy

The tikv project uses rust-clippy, a popular Rust lint tool that helps developers keep code quality up and avoid bad coding practices. Since I currently know no Rust at all but still hope to do something with it in the future, this kind of tool is very valuable to me.

rust-clippy itself is a Rust compiler plugin. tikv declares it as an optional dependency and switches the plugin on or off by controlling the clippy feature at build time, via cargo or rustc.

With clippy switched on, it can catch code issues like the following:

src/main.rs:8:5: 11:6 warning: you seem to be trying to use match for destructuring a single type. Consider using `if let`, #[warn(single_match)] on by default
src/main.rs:8 match x {
src/main.rs:9 Some(y) => println!("{:?}", y),
src/main.rs:10 _ => ()
src/main.rs:11 }
src/main.rs:8:5: 11:6 help: Try
if let Some(y) = x { println!("{:?}", y) }

Very nice; I plan to make rust-clippy a standard dependency of my future Rust projects. It has other usage modes as well; see the documentation on its GitHub page. Worth mentioning: rust-clippy is also a nightly-only project.

Rust learning points

This series is meant to be my notes from learning Rust and database technology through the tikv source code. So at the end of each post I hope to summarize the key Rust points learned in that part of the work, and to number them so they can be referenced back later.

KP-01: conditional compilation and features

rust-clippy, mentioned above, is a compiler plugin whose on/off state is controlled by a feature, so I went and looked into how features and conditional compilation work.

Attributes

An attribute is a kind of annotation supported by Rust, usually attached to a declaration (a struct, a mod, and so on). The first edition of the rustbook has a description of attributes; the complete reference is here, which I will read when I get a chance (probably never...).

The cfg / cfg_attr attributes

Among Rust's many attributes there is a special family that controls whether code gets compiled at all, based on feature switches passed to the compiler. It comes in two forms, cfg and cfg_attr; cfg looks like this:

#[cfg(foo)]
struct Foo;
#[cfg(feature = "bar")]
struct Bar;
#[cfg(target_os = "macos")]
fn macos_only() {
// ...
}

In C terms these are preprocessor switches; the code explains it better than words:

#if foo == true
struct Foo;
#endif
#ifdef bar
struct Bar;
#endif
#if target_os == "macos"
void macos_only() {
// ...
}
#endif

The cfg attribute also supports boolean combinations, as shown below:

#[cfg(any(foo, bar))]
fn needs_foo_or_bar() {
// ...
}
#[cfg(all(unix, target_pointer_width = "32"))]
fn on_32bit_unix() {
// ...
}
#[cfg(not(foo))]
fn needs_not_foo() {
// ...
}

Without reading the docs you can guess that all, any, and not map to AND, OR, and NOT. These boolean expressions can also be nested to express more complex conditions. On the whole, though, I still prefer the C style.

The cfg_attr attribute takes two operands and sets another attribute based on a condition.

#[cfg_attr(a, b)]
struct Foo;

When condition a is satisfied, this is equivalent to

#[b]
struct Foo;

Otherwise it has no effect at all.

This article describes many interesting tricks based on cfg_attr, in particular conditional documentation and conditional macro definitions; they are worth trying out.

feature and plugin

#![feature(plugin)]
#![cfg_attr(feature = "dev", plugin(clippy))]

The code above appears at the top of tikv-server.rs in the tikv source. With the background on conditional compilation, we know it does the following:

  1. Enable the plugin feature so that compiler plugins can be loaded.
  2. If the feature set passed to the compiler includes dev, activate the clippy plugin.

Note: why #![cfg] / #![cfg_attr] rather than #[cfg] / #[cfg_attr]? The docs explain it: the #![...] form is an inner attribute that applies to the enclosing item, here the whole crate.

In tikv's cargo.toml file:

[features]
default = []
dev = ["clippy"]
...
[dependencies]
clippy = {version = "*", optional = true}
...

With this in place, running cargo build --features "dev" during the tikv build passes --cfg feature="dev" to rustc; this pulls in the optional dependency clippy and, per the cfg_attr attribute in the code, has the compiler load the plugin that clippy provides.

According to the crates.io document The Manifest Format, features are compiler flags that users define in cargo.toml:

[features]
# the default feature set, left empty here
default = []
# foo is a feature with no dependencies, used purely for conditional compilation, e.g. `#[cfg(feature = "foo")]`
foo = []
# dev is a feature that depends on the optional dependency clippy. On one hand dev acts as an alias that lets us
# describe the feature in a more readable way; on the other hand the optional dependency brings in the feature's
# extra functionality, such as the compiler plugin provided by clippy.
dev = ["clippy"]
# session is an alias for the `session` feature provided by the external dependency cookie
session = ["cookie/session"]
# a feature can also be a group of dependencies, including optional dependencies as well as other features such as session
group-feature = ["jquery", "uglifier", "session"]
[dependencies]
cookie = "1.2.0"
jquery = { version = "1.0.2", optional = true }
uglifier = { version = "1.5.3", optional = true }
clippy = { version = "*", optional = true }

The difference between stable and nightly

This was the part of learning Rust that confused me the most: seemingly every Rust project I came across declared itself nightly-only, so what on earth is stable good for? Then I saw this line in the article A tale of two Rusts:

Stable Rust is dead. Nightly Rust is the only Rust.

The gap between stable and nightly Rust is comparable to Python 2 vs. 3, if not bigger; the article argues that nightly Rust can be regarded as a different programming language. On one hand, many features can only be used on nightly, gated behind rustc's -Z flags. Use such a flag on stable and you will see:

> rustc -Z extra-plugins=clippy
error: the option `Z` is only accepted on the nightly compiler

These features must be proven out on nightly and stabilized before they have any chance of moving into stable.

On the other hand, there is one important feature that is unlikely ever to graduate from nightly to stable, namely the one rust-clippy relies on:

#![feature(plugin)]

You could say that any Rust program that needs code generation has to use this feature. As for why it can never become stable, I did not quite follow the original article's explanation, so I will leave it here and come back to it later:

Why compiler plugins can never be stable, though? It’s because the internal API they are coded against goes too deep into the compiler bowels to ever get stabilized. If it were, it would severely limit the ability to further develop the language without significant breakage of the established plugins.

References

  1. rustbook-1st: Conditional Compilation and Attributes
  2. Quick tip: the #[cfg_attr] attribute
  3. A tale of two Rusts
  4. The crates.io documentation: The Manifest Format

Some key points in Rust

Posted on 2017-07-28   |   Category: programming

Lifetime elision rules

In Rust, every reference has a lifetime. Rust's insistence on safety and on resolving as much as possible at compile time means that when we define functions or structs that use references, the references' lifetimes must be specified. Before 1.0, every reference's lifetime had to be written out explicitly by the developer as an annotation. Over time the Rust team identified a set of patterns in which lifetimes are predictable and built them into later versions, so that in many cases programmers no longer need to annotate every reference. These patterns are known as the lifetime elision rules.

In a Rust function definition, the lifetimes of the parameters and of the return value are called input lifetimes and output lifetimes respectively. Of the three rules below, rule 1 applies to input lifetimes and rules 2 and 3 to output lifetimes. If the compiler still cannot infer every reference's lifetime after trying these three rules, compilation fails with an error.

  1. Each parameter of a function gets its own independent input lifetime, e.g. fn foo<'a>(x: &'a i32) and fn foo<'a, 'b>(x: &'a i32, y: &'b i32);

  2. If the function has exactly one input lifetime, 'a, then all output lifetimes are 'a, i.e. fn foo<'a>(x: &'a i32) -> &'a i32;

  3. If the function has multiple input lifetimes but one of them is &self or &mut self (i.e. the function is a method), then the lifetime of the self reference is assigned to all output lifetimes, e.g. fn get<'a>(&'a self, key: &str) -> &'a i32.

Note that what these elision rules describe, or rather predict, is the binding between input and output lifetimes. Rust requires that at compile time the lifetime of every reference in the current scope is known, so it can run its safety checks and prevent dangling references. But if the current scope contains a call to a borrowing function whose return value itself contains a new reference x, how is x's lifetime determined?

First, Rust requires that a reference may not outlive the value it points to, so the lifetime of the returned reference x must correspond to the input lifetimes in some way;

At the same time, for the compiler implementation to stay feasible, we cannot analyze the definition of every reference-returning function at compile time, let alone definitions that nest further calls inside them...

So the Rust team settled for "do what can be done and accept the rest": for functions that match the elision rules above, the compiler infers the output lifetimes from the signature alone, without analyzing the function body; for everything that slips through the net, the programmer has to spell the lifetimes out and help the compiler along...

Use SparkSQL to build an OLAP database across different datasources

Posted on 2016-04-20

Spark is a large-scale data processing engine. SparkSQL, one of its important components, can access the Hive metastore service to work with Hive tables directly. Furthermore, SparkSQL also provides ways to use data from other external datasources (JDBC to relational databases, Mongo, HBase, etc.).

Original Target

In my work, I need to handle data from different datasources (mostly MySQL & Mongo) to generate the final OLAP query results. Our goal is to establish a universal data platform that can access them all and, in particular, process JOIN operations across schemas on multiple datasources.

Approach-1: Pandas ETL engine

We originally used pandas to load the required schemas as (pandas) DataFrames and then processed all data operations in memory. This approach, however, is:

  • Time-consuming: it takes considerable effort to load the dataframes into memory.
  • Not scalable: it cannot handle large-scale data well, since the entire platform resides on a single node.
  • Difficult to access: all data operations have to go through the pandas API. There are ways to query pandas DataFrames with SQL (e.g., sql4pandas), but the supported SQL syntax is limited.

Finally we came to Spark. In SparkSQL the basic unit of data is also the DataFrame, whether it is a table in an RDB, a collection in MongoDB, or a document in Elasticsearch. Moreover, DataFrames are evaluated lazily, so an ETL job is not processed until we actually need to access its result, which makes data handling efficient and keeps the platform aware of changes in the external datasources.

Approach-2: PySpark Jupyter Notebook

The idea is simple: we first register all DataFrames as temporary tables, then use SQL via the Spark SQLContext to operate on multiple datasources directly. It is easy to set up a Jupyter notebook environment with PySpark; you can check the demo notebook in my GitHub repository (here). The source code is posted below.

Initialize pySpark Environment

import os
import sys
# Add support to access mysql
SPARK_CLASSPATH = "./libs/mysql-connector-java-5.1.38-bin.jar"
# Add support to access mongo (from official)
SPARK_CLASSPATH += ":./libs/mongo-hadoop-core-1.5.2.jar"
SPARK_CLASSPATH += ":./libs/mongo-java-driver-3.2.2.jar"
# Add support to access mongo (from stratio) based on casbah libs
SPARK_CLASSPATH += ":./libs/casbah-commons_2.10-3.1.1.jar"
SPARK_CLASSPATH += ":./libs/casbah-core_2.10-3.1.1.jar"
SPARK_CLASSPATH += ":./libs/casbah-query_2.10-3.1.1.jar"
SPARK_CLASSPATH += ":./libs/spark-mongodb_2.10-0.11.1.jar"
# Set the environment variable SPARK_CLASSPATH
os.environ['SPARK_CLASSPATH'] = SPARK_CLASSPATH
# Add pyspark to sys.path
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
# Initialize spark conf/context/sqlContext
conf = SparkConf().setMaster("local[*]").setAppName('spark-etl')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

Initialize Data Access Drivers (Mysql/Mongo/…)

# 1. Initialize the mysql driver
mysql_host = "YOUR_MYSQL_HOST"
mysql_port = 3306
mysql_db = "YOUR_MYSQL_DB"
mysql_user = "YOUR_MYSQL_USER"
mysql_pass = "YOUR_MYSQL_PASS"
mysql_driver = "com.mysql.jdbc.Driver"
mysql_prod = sqlContext.read.format("jdbc").options(
url="jdbc:mysql://{host}:{port}/{db}".format(host=mysql_host, port=mysql_port, db=mysql_db),
driver = mysql_driver,
user=mysql_user,
password=mysql_pass)
# 2. Initialize the official mongo driver
mongo_user = "YOUR_MONGO_USER"
mongo_pass = "YOUR_MONGO_PASSWORD"
mongo_host = "127.0.0.1"
mongo_port = 27017
mongo_db = "test"

Register Temporary Tables from datasources (Mysql/Mongo/…)

# 1. Register mysql temporary tables
df_deal = mysql_prod.load(dbtable = "YOUR_MYSQL_TABLE")
df_deal.registerTempTable("mysql_table")
# 2. Register mongo temporary tables
sqlContext.sql("CREATE TEMPORARY TABLE mongo_table USING com.stratio.datasource.mongodb OPTIONS (host '{host}:{port}', database '{db}', collection '{table}')".format(
host=mongo_host,
port=mongo_port,
db=mongo_db,
table="demotbl"
))

Then we can use SparkSQL as follows:

df_mongo = sqlContext.sql("SELECT * FROM mongo_table limit 10")
df_mongo.collect()

Approach-3: OLAP SQL Database on SparkSQL Thrift

We take a further step: we want our platform to behave like a database, so that our programs can access it via a JDBC driver and it can serve various legacy BI applications (e.g., Tableau, QlikView).

As mentioned above, SparkSQL can use the Hive metastore directly. So we start the SparkSQL thriftserver together with the Hive metastore service, and set up the environment with a few SparkSQL DDL statements that create symbolic links to the external datasources.

The work is also easy: just share the same hive-site.xml between the Hive metastore service and the SparkSQL thriftserver. The content of hive-site.xml is posted below. It is only a toy configuration, without any Hadoop/HDFS/MapReduce settings, meant to highlight the key points; you can adapt it quickly for production use.

Configure hive-site.xml

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Bh@840922</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>localhost</value>
  </property>
</configuration>

Start the SparkSQL thriftserver with required jars

#!/bin/sh
# note: --jars takes a single comma-separated list, so there must be no spaces
# between entries; the backslash continuations keep the list as one argument
${SPARK_HOME}/sbin/start-thriftserver.sh \
  --jars ${WORKDIR}/libs/mongo-java-driver-3.2.2.jar,\
${WORKDIR}/libs/casbah-commons_2.10-3.1.1.jar,\
${WORKDIR}/libs/casbah-core_2.10-3.1.1.jar,\
${WORKDIR}/libs/casbah-query_2.10-3.1.1.jar,\
${WORKDIR}/libs/spark-mongodb_2.10-0.11.1.jar,\
${WORKDIR}/libs/mysql-connector-java-5.1.38-bin.jar

OK, everything is done! Now you can do the same thing as in approach-2 and create a symbolic link to an external mongo table in your beeline client as follows:

CREATE TEMPORARY TABLE mongo_table USING com.stratio.datasource.mongodb OPTIONS (host 'localhost:27017', database 'test', collection 'demotbl');

Then you can access it via normal query statements:

0: jdbc:hive2://localhost:10000> show tables;
+--------------+--------------+--+
|  tableName   | isTemporary  |
+--------------+--------------+--+
| mongo_table  | false        |
+--------------+--------------+--+
1 row selected (0.108 seconds)
0: jdbc:hive2://localhost:10000> select * from mongo_table;
+------+----+---------------------------+--+
|  x   | y  |            _id            |
+------+----+---------------------------+--+
| 1.0  | a  | 5715f227d2f82889971df7f1  |
| 2.0  | b  | 57170b5e582cb370c48f085c  |
+------+----+---------------------------+--+
2 rows selected (0.38 seconds)

JVM GC related stuff

Posted on 2016-02-24   |   Category: programming

Some good tutorials I have read; maybe I will translate them in the future.

http://javapapers.com/java/java-garbage-collection-introduction/

http://www.cubrid.org/blog/tags/Garbage%20Collection/

http://www.javaworld.com/article/2078623/core-java/jvm-performance-optimization-part-1-a-jvm-technology-primer.html

scala features to best practices [5]: implicits

Posted on 2016-02-24   |   Category: programming

Implicit conversions and implicit parameters are Scala’s power tools that do useful work behind the scenes. With implicits, you can provide elegant libraries that hide tedious details from library users.

FT-6: implicit conversion (via implicit method/class)

An implicit conversion from type S to type T is defined by an implicit value which has function type S => T, or by an implicit method convertible to a value of that type. Implicit conversions are applied in two situations:

  • If an expression e is of type S, and S does not conform to the expression’s expected type T.
  • In a selection e.m with e of type S, if the selector m does not denote a member of S.
implicit def double2Int(d: Double) = d.toInt
val x: Int = 42.0

SC-6-1: enrich an existing class

Rather than create a separate library of String utility methods, like a StringUtilities class, you want to add your own behavior(s) to the String class, so you can write code like this:

"HAL".increment

Instead of this:

StringUtilities.increment("HAL")

Then we can enrich the String class with an implicit method as follows:

// define a method named increment in a normal Scala class:
class StringImprovements(val s: String) {
def increment = s.map(c => (c + 1).toChar)
}
// define another method to handle the implicit conversion:
implicit def stringToString(s: String) = new StringImprovements(s)

When you call increment on a String, the String class does not have that method at all, so the compiler finds the compatible class StringImprovements and converts the string to it via the implicit method stringToString; this is the second situation mentioned above.

Scala 2.10 introduced a new feature called implicit classes. An implicit class is a class marked with the implicit keyword. This keyword makes the class’ primary constructor available for implicit conversions when the class is in scope. This is similar to monkey patching in Ruby, and Meta-Programming in Groovy.

implicit class StringImprovements(s: String) {
def increment = s.map(c => (c + 1).toChar)
}

In real-world code, this is just slightly more complicated. According to SIP-13, Implicit Classes

An implicit class must be defined in a scope where method definitions are allowed (not at the top level).

This means that your implicit class must be defined inside a class, object, or package object. You can also check some other restrictions of implicit class here: http://docs.scala-lang.org/overviews/core/implicit-classes.html

FT-7: implicit parameter

A method with implicit parameters can be applied to arguments just like a normal method. In this case the implicit label has no effect. However, if such a method misses arguments for its implicit parameters, such arguments will be automatically provided.

The actual arguments that are eligible to be passed to an implicit parameter fall into two categories:

  • First, eligible are all identifiers x that can be accessed at the point of the method call without a prefix and that denote an implicit definition or an implicit parameter.
  • Second, eligible are also all members of companion modules of the implicit parameter’s type that are labeled implicit.

SC-7-1: default parameter value

Implicits can be used to declare a value to be provided as a default, as long as an implicit value is set within the scope.

def howMuchCanIMake_?(hours: Int)(implicit dollarsPerHour: BigDecimal) = dollarsPerHour * hours
implicit var hourlyRate = BigDecimal(34.00)

What advantage does this solution have over a simple default value in the parameter definition? The implicit search also covers the companion object of the parameter's type, so the default value can be kept away from the caller, even private to your library.
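
A minimal sketch of that idea (the Rate type and the object names below are my own, not from the original example): the implicit default lives in the companion object of the parameter's type, so callers pick it up automatically without it ever appearing at the call site, yet they can still override it by bringing their own implicit into scope.

case class Rate(dollarsPerHour: BigDecimal)

object Rate {
  // the "default" lives here: when no implicit Rate is in the caller's scope,
  // the compiler falls back to the companion object of the expected type
  implicit val standard: Rate = Rate(BigDecimal(34.00))
}

object Payroll {
  def howMuchCanIMake_?(hours: Int)(implicit rate: Rate): BigDecimal =
    rate.dollarsPerHour * hours
}

object PayrollDemo extends App {
  // no implicit Rate in scope here, so Rate.standard is used
  println(Payroll.howMuchCanIMake_?(40)) // uses the 34.00/hour default

  locally {
    // a local implicit takes precedence over the companion-object default
    implicit val negotiated: Rate = Rate(BigDecimal(50))
    println(Payroll.howMuchCanIMake_?(40)) // uses the 50/hour rate
  }
}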

SC-7-2: implicit conversion via implicit parameter

An implicit function parameter is also usable as an implicit conversion, and it is more flexible than the traditional solution. Check the following code:

def smaller[T](a: T, b: T)(implicit order: T => Ordered[T])
= if (a < b) a else b // Calls order(a) < b if a doesn't have a < operator

Note that order is a function with a single parameter, is tagged implicit, and has a name that is a single identifier. Therefore it is an implicit conversion in addition to being an implicit parameter, and we can omit the call to order in the body of the function.
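
A quick usage sketch (my own example values, not from the original post), e.g. in the REPL with smaller defined as above. For Int and String the standard library already supplies implicit conversions to Ordered, so no extra setup is needed; for our own types we provide the conversion ourselves:

smaller(40, 2)            // 2  (Predef gives an implicit Int => RichInt, which is Ordered[Int])
smaller("Hello", "World") // "Hello"

// for a custom type, supply the implicit ordering conversion
case class Money(amount: Double)
implicit def moneyOrdered(m: Money): Ordered[Money] =
  new Ordered[Money] { def compare(other: Money): Int = m.amount compare other.amount }

smaller(Money(9.99), Money(3.50)) // Money(3.5)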

scala features to best practices [4]: closure

Posted on 2016-02-24   |   Category: programming

FT-5: closure

You want to pass a function around like a variable, and while doing so, you want that function to be able to refer to one or more fields that were in the same scope as the function when it was declared.

In his excellent article, Closures in Ruby, Paul Cantrell states

A closure is a block of code which meets three criteria

He defines the criteria as follows:

  1. The block of code can be passed around as a value, and
  2. It can be executed on demand by anyone who has that value, at which time
  3. It can refer to variables from the context in which it was created (i.e., it is closed with respect to variable access, in the mathematical sense of the word “closed”).

The Scala Cookbook gives a more vivid metaphor:

I like to think of a closure as being like quantum entanglement, which Einstein referred to as “a spooky action at a distance.” Just as quantum entanglement begins with two elements that are together and then separated—but somehow remain aware of each other—a closure begins with a function and a variable defined in the same scope, which are then separated from each other. When the function is executed at some other point in space (scope) and time, it is magically still aware of the variable it referenced in their earlier time together, and even picks up any changes to that variable.

var votingAge = 18
val isOfVotingAge = (age: Int) => age >= votingAge
isOfVotingAge(16) // false
isOfVotingAge(20) // true

// a helper that receives the function as a value and applies it in another scope
def printResult(f: Int => Boolean, x: Int) = println(f(x))

// change votingAge in one scope
votingAge = 21
// the change to votingAge affects the result
printResult(isOfVotingAge, 20) // now false
// `printResult` and `votingAge` could be a light-year apart; the closure still sees the change

scala features to best practices [3]: case class

Posted on 2016-02-24   |   Category: programming

FT-4: case class

SC-4-1: build boilerplate code

You’re working with match expressions, actors, or other situations where you want to use the case class syntax to generate boilerplate code, including accessor and mutator methods, along with apply, unapply, toString, equals, and hashCode methods, and more.

Define your class as a case class, defining any parameters it needs in its constructor

// name and relation are 'val' by default
case class Person(name: String, relation: String)

Defining a class as a case class results in a lot of boilerplate code being generated, with the following benefits:

  • An apply method is generated, so you don’t need to use the new keyword to create a new instance of the class.
  • Accessor methods are generated for the constructor parameters because case class constructor parameters are val by default. Mutator methods are also generated for parameters declared as var.
  • A good, default toString method is generated.
  • An unapply method is generated, making it easy to use case classes in match expressions.
  • equals and hashCode methods are generated.
  • A copy method is generated.
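
A short sketch of what that generated code buys you in practice (my own example, not from the book):

case class Person(name: String, relation: String)

object CaseClassDemo extends App {
  val emily = Person("Emily", "niece")       // apply: no `new` required
  println(emily.name)                        // generated accessor (the parameter is a val)
  println(emily)                             // generated toString: Person(Emily,niece)
  println(emily == Person("Emily", "niece")) // generated equals/hashCode: true

  val hannah = emily.copy(name = "Hannah")   // generated copy, changing one field
  println(hannah)                            // Person(Hannah,niece)

  emily match {                              // generated unapply enables the constructor pattern
    case Person(n, r) => println(s"$n is my $r")
  }
}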

SC-4-2: pattern match via constructor pattern

trait Animal
case class Dog(name: String) extends Animal
case class Cat(name: String) extends Animal
case object Woodpecker extends Animal

object CaseClassTest extends App {
  def determineType(x: Animal): String = x match {
    case Dog(moniker) => "Got a Dog, name = " + moniker
    case _: Cat       => "Got a Cat (ignoring the name)"
    case Woodpecker   => "That was a Woodpecker"
    case _            => "That was something else"
  }
  println(determineType(Dog("Rocky")))   // Got a Dog, name = Rocky
  println(determineType(Woodpecker))     // That was a Woodpecker
}

Scala features to best practices [2]: companion object

Posted on 2016-02-23   |   Category: programming

FT-3: companion object

Define nonstatic (instance) members in your class, and define members that you want to appear as “static” members in an object that has the same name as the class, and is in the same file as the class. This object is known as a companion object.

// Pizza class
class Pizza (var crustType: String) {
override def toString = "Crust type is " + crustType
}
// companion object
object Pizza {
val CRUST_TYPE_THIN = "thin"
val CRUST_TYPE_THICK = "thick"
def getFoo = "Foo"
}

Although this approach is different than Java, the recipe is straightforward:

  • Define your class and object in the same file, giving them the same name.
  • Define members that should appear to be “static” in the object.
  • Define nonstatic (instance) members in the class.

SC-3-1: accessing private members

It’s also important to know that a class and its companion object can access each other’s private members. In the following code, the “static” method double in the object can access the private variable secret of the class Foo:

class Foo {
  private val secret = 2
}

object Foo {
  // access the private class field 'secret'
  def double(foo: Foo) = foo.secret * 2
}

object Driver extends App {
  val f = new Foo
  println(Foo.double(f)) // prints 4
}

Similarly, in the following code, the instance member printObj can access the private field obj of the object Foo:

class Foo {
// access the private object field 'obj'
def printObj { println(s"I can see ${Foo.obj}") }
}
object Foo {
private val obj = "Foo's object"
}
object Driver extends App {
val f = new Foo
f.printObj
}

SC-3-2: private primary constructor

A simple way to enforce the Singleton pattern in Scala is to make the primary constructor private, then put a getInstance method in the companion object of the class:

class Brain private {
  override def toString = "This is the brain."
}

object Brain {
  val brain = new Brain
  def getInstance = brain
}

object SingletonTest extends App {
  // this won't compile
  // val brain = new Brain
  // this works
  val brain = Brain.getInstance
  println(brain)
}

SC-3-3: creating instances without new keyword

class Person {
  var name: String = _
}

object Person {
  def apply(name: String): Person = {
    val p = new Person
    p.name = name
    p
  }
}

// now an instance can be created without `new`:
val dawn = Person("Dawn")

The apply method in a companion object is treated specially by the Scala compiler and lets you create new instances of your class without requiring the new keyword.

The problem can also be addressed by declaring your class as a case class. This works because the case class generates an apply method in a companion object for you. However, it’s important to know that a case class creates much more code for you than just the apply method.

xiaomei BI design

Posted on 2016-02-22   |   Category: bigdata

Submodules

  • schema service (read-only)
    • scan a specified driver url to load a bunch of schemas
    • manually create a schema
    • query the schema types
    • guess the schema of some schemaless datastore
  • report service
    • a store system for widgets
    • layout placeholders
  • analyse service
    • workflow view
    • schema view
    • publish service

Concepts

A dataframe can be modeled as a sequence of transformations closing over an initial raw dataframe.

The sequence of ops can be computed lazily, and any dataframe can be identified as a snapshot at a position in that sequence.

The data content of a frame can only be generated by computing along the sequence, but the schema of the dataframe can be deduced without running it.
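
A minimal sketch of this model (the names and types below are mine, purely illustrative, not the actual design): a frame is an initial raw source plus the list of transformations applied so far; the schema is deduced eagerly by folding the transformations' schema functions, while the row content is only produced when the frame is materialized.

object FrameModel {
  type Row    = Map[String, Any]
  type Schema = List[String] // just column names, to keep the sketch small

  // a transformation knows how to rewrite the schema (cheap, can run eagerly)
  // and how to rewrite the data (expensive, deferred until materialization)
  case class Transform(onSchema: Schema => Schema, onRows: Seq[Row] => Seq[Row])

  // a frame is a snapshot at some position in the sequence of ops
  case class Frame(rawSchema: Schema, raw: () => Seq[Row], ops: List[Transform] = Nil) {
    def transform(t: Transform): Frame = copy(ops = ops :+ t)

    // the schema of the frame can be deduced without touching any data
    def schema: Schema = ops.foldLeft(rawSchema)((s, t) => t.onSchema(s))

    // the data content is only generated by computing along the whole sequence
    def materialize(): Seq[Row] = ops.foldLeft(raw())((rows, t) => t.onRows(rows))
  }

  // example op: drop a column
  def dropColumn(name: String): Transform = Transform(
    onSchema = s => s.filterNot(_ == name),
    onRows   = rows => rows.map(row => row - name)
  )
}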

widget includes:

  • config:
    • schema-key (from schema service)
    • access-rule
    • parameters
    • query
  • front-end:
    • JS (data-access api)
    • CSS
    • resources

FAQ

  1. to add

Scala features to best practices [1]: delayed evaluation

Posted on 2016-02-22   |   Category: programming

features: delayed evaluation

I'd like to use the term delayed evaluation to cover the following two features in Scala: lazy val and by-name parameters. They are not closely related to each other, but both postpone the evaluation of a given expression or block until the result is actually needed.

FT-1: lazy val

Defining a field as lazy is a useful approach when the field might not be accessed in the normal processing of your algorithms, or if running the algorithm will take a long time, and you want to defer that to a later time.

At present, I think it's useful in the following scenarios:

SC-1-1: field initialization takes great effort

It makes sense to use lazy on a class field if its initialization takes a long time to run and we don't want to do that work when we instantiate the class, but only when we actually use the field.

class Foo {
lazy val text = io.Source.fromFile("/etc/passwd").getLines.foreach(println)
}
object Test extends App {
val f = new Foo
}

In the above example, initializing text requires reading the contents of the file /etc/passwd. But when this code is compiled and run there is no output, because the text field isn't initialized until it is accessed. That's how a lazy field works.

SC-1-2: field-initialization has dependencies

Sometimes we need to initialize fields in a specific order because they depend on one another. Then we may end up with ugly code like the following:

class SparkStreamDemo extends Serializable {
  @transient private var conf: SparkConf = null
  @transient private var sc: SparkContext = null
  @transient private var ssc: StreamingContext = null

  def getConf() = {
    if (conf == null)
      conf = new SparkConf()
    conf
  }

  def getSC() = {
    if (sc == null)
      sc = new SparkContext(getConf)
    sc
  }

  def getSSC() = {
    if (ssc == null)
      ssc = new StreamingContext(getSC, Seconds(10))
    ssc
  }
}

In this spark-streaming demo, the initialization of ssc depends on that of sc, which in turn depends on conf. We perform the initialization manually, so we declare these fields as var and implement the lazy initialization in getters. The shortcoming is obvious: we have to restrict access to these fields to the getters, otherwise we may get null values! Moreover, declaring these fields as var is not best practice, since they are read-only after initialization. A modified version using lazy val is as follows:

class SparkStreamDemo extends Serializable {
@transient lazy val conf: SparkConf = new SparkConf()
@transient lazy val sc: SparkContext = new SparkContext(conf)
@transient lazy val ssc: StreamingContext = new StreamingContext(sc, Seconds(10))
}

What if a lazy field depends on a non-lazy var that is not properly initialized? Can the instance be reused after a NullPointerException-like error is raised? This seems to be no problem, thanks to a little-known trick of Scala, as @ViktorKlang posted on Twitter:

Little known Scala fact: if the initialization of a lazy val throws an exception, it will attempt to reinitialize the val at next access.

You can check the details here: http://scalapuzzlers.com/#pzzlr-012
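
A minimal sketch of that behavior (my own toy example, not the puzzler itself): the first access throws, so the lazy val stays uninitialized, and the second access re-runs the initializer and succeeds.

object LazyRetry extends App {
  var attempts = 0

  lazy val answer: Int = {
    attempts += 1
    if (attempts == 1) throw new RuntimeException("not ready yet")
    42
  }

  // first access: the initializer throws and `answer` remains uninitialized
  try println(answer) catch { case e: RuntimeException => println("failed: " + e.getMessage) }

  // second access: the initializer runs again and this time succeeds
  println(answer) // 42
}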

FT-2: by-name parameter

A by-name parameter such as => Int can be considered roughly equivalent to () => Int, a function type that takes no arguments and is re-evaluated each time it is referenced. Besides normal functions, it can also be used with an object and its apply method to make interesting block-like calls, as sketched below.
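
A minimal sketch of the object-plus-apply trick (the times object is my own example): because apply takes its body by name, calling the object looks like a built-in control structure.

object times {
  def apply(n: Int)(body: => Unit): Unit =
    for (_ <- 1 to n) body // the by-name block is re-evaluated on every iteration
}

object TimesDemo extends App {
  times(3) {
    println("hello") // printed three times
  }
}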

SC-2-1: wrapper function

// A benchmark construct:
def benchmark (body : => Unit) : Long = {
val start = java.util.Calendar.getInstance().getTimeInMillis()
body
val end = java.util.Calendar.getInstance().getTimeInMillis()
end - start
}
val time = benchmark {
var i = 0 ;
while (i < 1000000) {
i += 1 ;
}
}
println("while took: " + time)

SC-2-2: Add syntactic sugar

// While loops are syntactic sugar in Scala:
def myWhile (cond : => Boolean) (body : => Unit) : Unit =
if (cond) { body ; myWhile (cond) (body) } else ()
var i = 0 ;
myWhile (i < 4) { i += 1 ; println (i) }

Combined with currying, we re-implement a while loop in the above example.
