
Spark: A Powerful Tool for Statistical Data Analysis


Being familiar with Spark's data analysis APIs makes data analysis, feature analysis, and model evaluation much more convenient. Examples follow:

  • 1 DoubleRDDFunctions

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf)
    val rdd = sc.parallelize(Seq(1, 2, 3, 100)).map(_.toDouble)
    val stats = rdd.stats()
    println("max: " + stats.max, "min: " + stats.min, "mean: " + stats.mean,
      "variance: " + stats.variance, "stdev: " + stats.stdev, "count: " + stats.count)
    // (max: 100.0,min: 1.0,mean: 26.5,variance: 1801.25,stdev: 42.44113570582201,count: 4)
    // Build histogram data. Both calls below yield the same counts here (3 and 1),
    // but histogram(2) spaces its two buckets evenly between rdd.min and rdd.max,
    // while histogram(Array(...)) uses exactly the bucket boundaries you supply.
    rdd.histogram(2)                        // 2 evenly spaced buckets between rdd.min and rdd.max
    rdd.histogram(Array(0.0, 50.0, 100.0))  // manually specify the bucket boundaries
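
    Note that the two histogram variants also return different shapes: histogram(2) returns both the computed bucket boundaries and the counts, while histogram(Array(...)) returns only the counts for the boundaries you supplied. A minimal sketch, reusing the rdd defined above (the printed values are what this small example should produce):

    // histogram(n) returns (bucket boundaries, counts); with the values 1, 2, 3, 100
    // the boundaries are spaced evenly between min (1.0) and max (100.0)
    val (buckets, counts) = rdd.histogram(2)
    println(buckets.mkString(", "))      // 1.0, 50.5, 100.0
    println(counts.mkString(", "))       // 3, 1

    // histogram(Array(...)) returns only the counts for the supplied buckets
    val manualCounts = rdd.histogram(Array(0.0, 50.0, 100.0))
    println(manualCounts.mkString(", ")) // 3, 1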
  • 2 MultivariateStatisticalSummary

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
    import org.apache.spark.rdd.RDD

    // an RDD of Vectors, one per line of a "\001"-delimited text file
    val observations: RDD[Vector] = sc.textFile("feature.data")
      .map(_.split("\001")).map(x => Vectors.dense(x.map(_.toDouble)))
    val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
    println(summary.mean)        // a dense vector containing the mean value of each column
    println(summary.variance)    // column-wise variance
    println(summary.numNonzeros) // number of nonzero entries in each column
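
    If you want to try colStats without an input file, the same call works on an in-memory sample; the three vectors below are made up purely for illustration, and the summary also exposes count, max and min:

    // hypothetical in-memory sample instead of reading feature.data
    val sample: RDD[Vector] = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0, 100.0),
      Vectors.dense(2.0, 20.0, 200.0),
      Vectors.dense(3.0, 30.0, 300.0)
    ))
    val s = Statistics.colStats(sample)
    println(s.count) // 3
    println(s.max)   // [3.0,30.0,300.0]
    println(s.min)   // [1.0,10.0,100.0]
    println(s.mean)  // [2.0,20.0,200.0]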
  • 3 BinaryClassificationMetrics

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.tree.GradientBoostedTrees
    import org.apache.spark.mllib.tree.configuration.BoostingStrategy
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.{SparkConf, SparkContext}

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf)
      val data = MLUtils.loadLibSVMFile(sc, "libsvm.data")
      val splits = data.randomSplit(Array(0.7, 0.3))
      val (trainingData, testData) = (splits(0), splits(1))

      val boostingStrategy = BoostingStrategy.defaultParams("Classification")
      boostingStrategy.numIterations = 60
      boostingStrategy.treeStrategy.numClasses = 2
      boostingStrategy.treeStrategy.maxDepth = 10
      // An empty categoricalFeaturesInfo indicates that all features are continuous
      boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
      val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

      // BinaryClassificationMetrics expects (score, label) pairs
      val predictionAndLabels = testData.map { point =>
        (model.predict(point.features), point.label)
      }
      // Area under PR, Area under ROC
      val metrics = new BinaryClassificationMetrics(predictionAndLabels)
      println(model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
      // (GradientBoostedTreesModel,0.6077438016528927,0.7024793388429752)
    }
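
    Besides the two scalar areas, BinaryClassificationMetrics also exposes per-threshold metrics and the underlying curves. A rough sketch, continuing from the metrics object above:

    // precision and recall at each score threshold, as RDD[(threshold, value)]
    metrics.precisionByThreshold().collect().foreach { case (t, p) =>
      println(s"threshold: $t, precision: $p")
    }
    metrics.recallByThreshold().collect().foreach { case (t, r) =>
      println(s"threshold: $t, recall: $r")
    }
    // the full curves, e.g. for plotting: (recall, precision) and (FPR, TPR) points
    val prCurve = metrics.pr().collect()
    val rocCurve = metrics.roc().collect()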

  • PS:

    The area under this PR curve is referred to as the average precision. Intuitively, an area under the PR curve of 1.0 equates to a perfect classifier that achieves 100 percent in both precision and recall.

    The area under the ROC curve (commonly referred to as AUC) also represents an average value. Again, an AUC of 1.0 represents a perfect classifier, while an area of 0.5 corresponds to random guessing.
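
    To make these two reference points concrete, here is a toy sketch reusing the sc from the example above (the scores and labels are invented for illustration): perfectly separating scores give an AUC of 1.0, while a constant, uninformative score lands at 0.5.

    // toy data: label 1.0 for the first five points, 0.0 for the last five
    val labels = Array(1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)

    // scores that perfectly separate the classes -> AUC = 1.0
    val perfect = sc.parallelize(labels.map(l => (if (l == 1.0) 0.9 else 0.1, l)))
    println(new BinaryClassificationMetrics(perfect).areaUnderROC())       // 1.0

    // a constant score carries no information -> AUC = 0.5
    val uninformative = sc.parallelize(labels.map(l => (0.5, l)))
    println(new BinaryClassificationMetrics(uninformative).areaUnderROC()) // 0.5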
