Spark Kryo Serialization Test

Data serialization is one of the tuning areas listed in the official Spark documentation.

1. Data serialization
1) Java serialization (slow and large)
2) Kryo serialization (fast and compact)
Kryo only pays off when the classes are registered; without registration the effect is reversed.
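
To make "large" concrete for Java serialization, here is a minimal sketch that serializes a single small object with plain java.io and prints the byte count next to a rough count of the field data. Person is a hypothetical class used only for this illustration, and the payload estimate is deliberately crude (string lengths plus 4 bytes for the Int).

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int, gender: String, addr: String)

object JavaSerSize {
  // Serializes obj with standard Java serialization and returns the byte count.
  def javaSerializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val p = Person("lsw", 30, "male", "beijing")
    // Rough size of the field data itself: string characters plus a 4-byte Int.
    val payload = p.name.length + 4 + p.gender.length + p.addr.length
    println(s"field data: about $payload bytes, Java-serialized: ${javaSerializedSize(p)} bytes")
  }
}
```

The gap between the two numbers comes from the stream header and the class descriptor (class name, field names and types) that Java serialization writes alongside the values.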
There are three ways to enable Kryo:

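1) Set it in code: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
2) Configure it in spark-defaults.conf
3) Pass it at submit time as --conf key=value, which takes precedence over spark-defaults.conf (a value set on SparkConf in code takes precedence over both)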
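
As a rough illustration of options 2 and 3, the fragment below shows what the setting could look like in each place. The jar name serialize-compare.jar is a placeholder, not something from the original post; the class name matches the test program later in this article.

```text
# Option 2: a line in conf/spark-defaults.conf (key and value separated by whitespace)
spark.serializer    org.apache.spark.serializer.KryoSerializer

# Option 3: passed on the command line at submit time
# (serialize-compare.jar is a placeholder jar name)
spark-submit --class SerializeCompare \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  serialize-compare.jar
```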

Once the serializer is specified, the classes to be serialized need to be registered:

	conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

Register every class that will be serialized; MyClass1 and MyClass2 stand for the names of the classes to register.
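
Since a forgotten registration fails silently (Kryo still works, just less efficiently), one possible safeguard is Spark's spark.kryo.registrationRequired setting, which makes Kryo throw an exception when it meets an unregistered class. The sketch below assumes a hypothetical MyRecord class in place of MyClass1 and MyClass2.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type standing in for MyClass1 / MyClass2.
case class MyRecord(id: Int, label: String)

object KryoConfSketch {
  // Builds a conf that uses Kryo, registers MyRecord, and rejects unregistered classes.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("KryoConfSketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail with an exception instead of silently falling back to writing class names.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(classOf[MyRecord]))
}
```

With this setting on, other classes that end up being serialized (for example arrays of the record type) may also have to be registered, so it is probably most useful while testing.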

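The following cases were tested; the figure after each one is the resulting size of the cached RDD.
1. No serialization: 34.3 MB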
2. Java serialization: 25.1 MB
3. Kryo serialization without registering the class: 40.2 MB
4. Kryo serialization with the class registered: 21.1 MB
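In summary, Kryo needs the classes to be registered. Without registration it gives the worst result of the four cases (40.2 MB), larger even than no serialization (34.3 MB), because Kryo then has to store the full class name with every object.
Apart from that case, serialized storage is smaller than unserialized storage.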
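
The post does not show how each of the four cases was configured, so the following is only a sketch of one plausible setup: case 1 caches deserialized objects, and cases 2 to 4 cache serialized bytes with the default (Java) serializer, unregistered Kryo, and registered Kryo respectively. SerializeCases and forCase are hypothetical names introduced here.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

object SerializeCases {
  private val kryo = "org.apache.spark.serializer.KryoSerializer"

  // Returns the conf and storage level assumed for test cases 1 to 4.
  def forCase(n: Int, classes: Array[Class[_]]): (SparkConf, StorageLevel) = {
    val base = new SparkConf().setMaster("local[2]").setAppName(s"SerializeCase$n")
    n match {
      // 1: no serialization, objects are cached as-is
      case 1 => (base, StorageLevel.MEMORY_ONLY)
      // 2: serialized cache with Spark's default Java serializer
      case 2 => (base, StorageLevel.MEMORY_ONLY_SER)
      // 3: Kryo, but no classes registered
      case 3 => (base.set("spark.serializer", kryo), StorageLevel.MEMORY_ONLY_SER)
      // 4: Kryo with the given classes registered
      case 4 =>
        (base.set("spark.serializer", kryo).registerKryoClasses(classes),
          StorageLevel.MEMORY_ONLY_SER)
      case _ => throw new IllegalArgumentException(s"unknown case $n")
    }
  }
}
```

The returned conf would be passed to new SparkContext(...) and the storage level to rdd.persist(...); the cached size can then be checked on the Storage tab of the Spark web UI while the application is still running.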

The test code follows; it is taken from https://blog.****.net/lsshlsw/article/details/50856842

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

case class Info(name: String ,age: Int,gender: String,addr: String)

object SerializeCompare {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[2]").setAppName("KyroTest")
    // Switch from the default Java serializer to Kryo
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Register the class that will be serialized
    conf.registerKryoClasses(Array(classOf[Info]))
    val sc = new SparkContext(conf)

    val arr = new ArrayBuffer[Info]()

    val nameArr = Array[String]("lsw","yyy","lss")
    val genderArr = Array[String]("male","female")
    val addressArr = Array[String]("beijing","shanghai","shengzhen","wenzhou","hangzhou")

    for(i <- 1 to 1000000){
      val name = nameArr(Random.nextInt(3))
      val age = Random.nextInt(100)
      val gender = genderArr(Random.nextInt(2))
      val address = addressArr(Random.nextInt(5))
      arr += Info(name, age, gender, address)
    }

    val rdd = sc.parallelize(arr)
    
    // Persist the RDD in memory in serialized form
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    rdd.count()
  }
}
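
The program above exits right after count(), which leaves little time to look at the web UI. One possible alternative, sketched here under the assumption that reading the number in code is acceptable, is SparkContext.getRDDStorageInfo (a developer API) and its memSize field. CachedSizeSketch and cachedBytes are hypothetical names, and a plain range of integers stands in for the Info records.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CachedSizeSketch {
  // Sums the in-memory bytes Spark reports for the given cached RDD.
  def cachedBytes(sc: SparkContext, rdd: RDD[_]): Long =
    sc.getRDDStorageInfo.filter(_.id == rdd.id).map(_.memSize).sum

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CachedSizeSketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100000).persist(StorageLevel.MEMORY_ONLY_SER)
      rdd.count() // an action is needed to materialize the cache
      println(s"cached size: ${cachedBytes(sc, rdd)} bytes")
    } finally sc.stop()
  }
}
```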