在Ubuntu上使用shadow$ocks

2017-09-11

tonglin0325

安装

1 2	pip install shadow$ocks

创建文件

1 2	touch /etc/shadow$ocks.json

{
"server":"服務器IP或域名",
"server_port":端口號,
"local_address": "127.0.0.1",
"local_port":1080,
"password":"密碼",
"timeout":300,
"method":"aes-256-cfb",
"fast_open": false
}

安装proxychains

1 2	sudo apt-get install proxychains

编辑/etc/proxychains.conf，最后一行改为socks5 127.0.0.1 1080

然后在root下运行

全文 >>

同时安装anaconda2和anaconda3

2017-07-02

tonglin0325

安装的过程请参考 Ubuntu14.04下同时安装Anaconda2与Anaconda3

启动的时候cd到$HOME/anaconda2/envs/py3k/bin下

1
2
3

source activate py3k #启动
deactivate py3k    #退出

然后记得在/etc/profile中加上

1
2
3

# added by Anaconda2 4.3.1 installer
export PATH="/home/common/anaconda2/bin:$PATH"

如果想安装包,直接pip install

Hive学习笔记——安装和内部表CRUD

2017-06-24

tonglin0325

1.首先需要安装Hadoop和Hive

安装的时候参考 http://blog.csdn.net/jdplus/article/details/46493553

安装的版本是apache-hive-2.1.1-bin.tar.gz,解压到/usr/local目录下

然后在/etc/profile文件中添加

1
2
3

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

2.修改配置文件

在bin/hive-config.sh文件中添加

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_121
export HIVE_HOME=/usr/local/hive
export HADOOP_HOME=/usr/local/hadoop

添加hive-env.sh文件

1 2	cp hive-env.sh.template hive-env.sh

全文 >>

Python爬虫——布隆过滤器

2017-06-18

tonglin0325

布隆过滤器的实现方法1:自己实现

参考 http://www.cnblogs.com/naive/p/5815433.html

bllomFilter两个参数分别代表,布隆过滤器的大小和hash函数的个数

#coding:utf-8
#!/usr/bin/env python

from bitarray import bitarray
# 3rd party
import mmh3
import scrapy
from BeautifulSoup import BeautifulSoup as BS
import os
ls = os.linesep

class BloomFilter(set):

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1

        return self

    def __contains__(self, item):
        out = True
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                out = False

        return out

class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):

        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership existence for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something is terribly went wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership existence for not inserted animals
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))

布隆过滤器的实现方法2:使用pybloom

参考 http://www.jianshu.com/p/f57187e2b5b9

#coding:utf-8
#!/usr/bin/env python

from pybloom import BloomFilter

import scrapy
from BeautifulSoup import BeautifulSoup as BS
import os
ls = os.linesep

class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):

        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # bloom = BloomFilter(100, 10)
        bloom = BloomFilter(1000, 0.001)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership existence for already inserted animals
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something is terribly went wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership existence for not inserted animals
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} is not in the bloom, but a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))

输出

全文 >>

Ubuntu下安装和使用zookeeper和kafka

2017-06-17

tonglin0325

1.下载 kafka和zookeeper

这里下载的是 kafka_2.10-0.10.0.0.tgz 和 zookeeper-3.4.10.tar.gz

可以在清华镜像站下载

1 2	https://mirrors.tuna.tsinghua.edu.cn/apache/

或者apache官网

1
2
3

https://kafka.apache.org/downloads
https://zookeeper.apache.org/releases.html

然后分别解压到/usr/local目录下

2.安装zookeeper

进入zookeeper目录,在conf目录下将zoo_sample.cfg文件拷贝,并更名为zoo.cfg

参考 https://my.oschina.net/phoebus789/blog/730787

zoo.cfg文件的内容

全文 >>

Ubuntu16.04安装xgboost

2017-06-10

tonglin0325

1.Python下安装方法

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j4
cd python-package/
sudo python setup.py install

如果在import xgboost后，遇到问题

1
2

OSError: /home/common/anaconda2/lib/python2.7/site-packages/scipy/sparse/../../../../libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/common/coding/coding/Scala/xgboost/python-package/xgboost/../../lib/libxgboost.so)

解决方法

1 2	conda install libgcc

2.Java下安装方法

请先在Python下安装好，因为上面的gcc版本问题会影响到java下xgboost的编译和安装

先更新

1 2	git pull && git submodule init && git submodule update && git submodule status

然后参考

1 2	http://xgboost.readthedocs.io/en/latest/jvm/

几个xgboost的Scala实现方法

1
2
3

https://www.elenacuoco.com/2016/10/10/scala-spark-xgboost-classification/
http://blog.csdn.net/luoyexuge/article/details/71422270&nbsp;

全文 >>

查看pip安装的Python库

2017-06-02

tonglin0325

查看安装的库

1 2	pip list或者pip freeze

查看过时的库

1 2	pip list --outdated

批量更新的Python脚本

import pip
from subprocess import call
 
for dist in pip.get_installed_distributions():
    call("sudo pip install --upgrade " + dist.project_name, shell=True)

更新pip

1 2	pip install --upgrade pip

Kaggle学习笔记——房屋价格预测

2017-05-28

tonglin0325

Kaggle的房价数据集使用的是Ames Housing dataset，是美国爱荷华州的艾姆斯镇2006-2010年的房价

1.特征探索和分析

1.了解特征的含义

首先使用Python的pandas加载一下训练样本和测试样本，数据的格式是csv格式的，且第一列是特征的名称

查看一下特征的维度

import pandas as pd

# 加载数据
train_data = pd.read_csv("./raw_data/train.csv")

# 去掉id和售价，只看特征
data: DataFrame = train_data.drop(columns=["Id", "SalePrice"])

print(data.shape)

输出如下，除去Id和SalePrice，总共有79维的特征

1 2	(1460, 79)

翻译一下给的房屋数据的特征，这里定义了一个dict，方便理解每个特征的含义

dict = {
    "MSSubClass": "参与销售住宅的类型:有年代新旧等信息",
    "MSZoning": "房屋类型:农用,商用等",
    "LotFrontage": "距离街道的距离",
    "LotArea": "房屋的面积",
    "Street": "通向房屋的Street是用什么铺的",
    "Alley": "通向房屋的Alley是用什么铺的",
    "LotShape": "房屋的户型,规整程度",
    "LandContour": "房屋的平坦程度",
    "Utilities": "设施,通不通水电气",
    "LotConfig": "死路,处于三岔口等",
    "LandSlope": "坡度",
    "Neighborhood": "邻居",
    "Condition1": "",
    "Condition2": "",
    "BldgType": "住宅类型,住的家庭数,是否别墅等",
    "HouseStyle": "住宅类型,隔断等",
    "OverallQual": "房屋的质量",
    "OverallCond": "房屋位置的质量",
    "YearBuilt": "建造的时间",
    "YearRemodAdd": "改造的时间",
    "RoofStyle": "屋顶的类型",
    "RoofMatl": "屋顶的材料",
    "Exterior1st": "外观覆盖的材质",
    "Exterior2nd": "如果超过一种,则有第二种材质",
    "MasVnrType": "表层砌体类型",
    "MasVnrArea": "表层砌体面积",
    "ExterQual": "外观材料质量",
    "ExterCond": "外观材料情况",
    "Foundation": "地基类型",
    "BsmtQual": "地下室质量",
    "BsmtCond": "地下室的基本情况",
    "BsmtExposure": "地下室采光",
    "BsmtFinType1": "地下室的完成情况比例",
    "BsmtFinSF1": "地下室的完成面积",
    "BsmtFinType2": "如果有多个地下室的话",
    "BsmtFinSF2": "如果有多个地下室的话",
    "BsmtUnfSF": "未完成的地下室面积",
    "TotalBsmtSF": "地下室面积",
    "Heating": "供暖类型",
    "HeatingQC": "供暖质量",
    "CentralAir": "是否有中央空调",
    "Electrical": "电气系统",
    "_1stFlrSF": "1楼面积",
    "_2ndFlrSF": "2楼面积",
    "LowQualFinSF": "低质量完成的面积(楼梯占用的面积)",
    "GrLivArea": "地面以上居住面积",
    "BsmtFullBath": "地下室都是洗手间",
    "BsmtHalfBath": "地下室一半是洗手间",
    "FullBath": "洗手间都在一层以上",
    "HalfBath": "一半洗手间在一层以上",
    "BedroomAbvGr": "卧室都在一层以上",
    "KitchenAbvGr": "厨房在一层以上",
    "KitchenQual": "厨房质量",
    "TotRmsAbvGrd": "所有房间都在一层以上",
    "Functional": "房屋的功能性等级",
    "Fireplaces": "壁炉位置",
    "FireplaceQu": "壁炉质量",
    "GarageType": "车库类型",
    "GarageYrBlt": "车库建造时间",
    "GarageFinish": "车库的室内装修",
    "GarageCars": "车库的汽车容量",
    "GarageArea": "车库面积",
    "GarageQual": "车库质量",
    "GarageCond": "车库情况",
    "PavedDrive": "铺路的材料",
    "WoodDeckSF": "木地板面积",
    "OpenPorchSF": "露天门廊面积",
    "EnclosedPorch": "独立门廊面积",
    "_3SsnPorch": "three season门廊面积",
    "ScreenPorch": "纱门门廊面积",
    "PoolArea": "游泳池面积",
    "PoolQC": "游泳池质量",
    "Fence": "栅栏质量",
    "MiscFeature": "上面不包含其他功能",
    "MiscVal": "上面不包含其他功能的价格",
    "MoSold": "月销量",
    "YrSold": "年销量",
    "SaleType": "销售方式",
    "SaleCondition": "销售情况"
}

2.查看特征是离散特征还是连续特征

1
2
3

# 查看特征是离散特征还是连续特征
train_data.info()

全文 >>

Spark学习笔记——泰坦尼克生还预测

2017-05-26

tonglin0325

package kaggle

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics


/**
  * Created by mi on 17-5-23.
  */


object Titanic {


  def main(args: Array[String]) {

    //    val sparkSession = SparkSession.builder.
    //      master("local")
    //      .appName("spark session example")
    //      .getOrCreate()
    //    val rawData = sparkSession.read.csv("/home/mi/下载/kaggle/Titanic/nohead-train.csv")
    //    val d = rawData.map{p => p.asInstanceOf[person]}
    //    d.show()

    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    //屏蔽日志
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // 读取数据
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/home/mi/下载/kaggle/Titanic/train.csv", "header" -> "true"))

    // 分析年龄数据
    val ageAnalysis = df.rdd.filter(d => d(5) != null).map { d =>
      val age = d(5).toString.toDouble
      Vectors.dense(age)
    }
    val ageMean = Statistics.colStats(ageAnalysis).mean(0)
    val ageMax = Statistics.colStats(ageAnalysis).max(0)
    val ageMin = Statistics.colStats(ageAnalysis).min(0)
    val ageDiff = ageMax - ageMin

    // 分析船票价格数据
    val fareAnalysis = df.rdd.filter(d => d(9) != null).map { d =>
      val fare = d(9).toString.toDouble
      Vectors.dense(fare)
    }
    val fareMean = Statistics.colStats(fareAnalysis).mean(0)
    val fareMax = Statistics.colStats(fareAnalysis).max(0)
    val fareMin = Statistics.colStats(fareAnalysis).min(0)
    val fareDiff = fareMax - fareMin


    // 数据预处理
    val trainData = df.rdd.map { d =>
      val label = d(1).toString.toInt
      val sex = d(4) match {
        case "male" => 0.0
        case "female" => 1.0
      }
      val age = d(5) match {
        case null => (ageMean - ageMin) / ageDiff
        case _ => (d(5).toString().toDouble - ageMin) / ageDiff
      }
      val fare = d(9) match {
        case null => (fareMean - fareMin) / fareDiff
        case _ => (d(9).toString().toDouble - fareMin) / fareDiff
      }

      LabeledPoint(label, Vectors.dense(sex, age, fare))
    }

    // 切分数据集和测试集
    val Array(trainingData, testData) = trainData.randomSplit(Array(0.8, 0.2))

    // 训练数据
    val numIterations = 8
    val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)
    //    val svmModel = SVMWithSGD.train(trainingData, numIterations)

    val nbTotalCorrect = testData.map { point =>
      if (lrModel.predict(point.features) == point.label) 1 else 0
    }.sum
    val nbAccuracy = nbTotalCorrect / testData.count

    println("SVM模型正确率：" + nbAccuracy)

    // 预测
    // 读取数据
    val testdf = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/home/mi/下载/kaggle/Titanic/test.csv", "header" -> "true"))

    // 分析测试集年龄数据
    val ageTestAnalysis = testdf.rdd.filter(d => d(4) != null).map { d =>
      val age = d(4).toString.toDouble
      Vectors.dense(age)
    }
    val ageTestMean = Statistics.colStats(ageTestAnalysis).mean(0)
    val ageTestMax = Statistics.colStats(ageTestAnalysis).max(0)
    val ageTestMin = Statistics.colStats(ageTestAnalysis).min(0)
    val ageTestDiff = ageTestMax - ageTestMin

    // 分析船票价格数据
    val fareTestAnalysis = testdf.rdd.filter(d => d(8) != null).map { d =>
      val fare = d(8).toString.toDouble
      Vectors.dense(fare)
    }
    val fareTestMean = Statistics.colStats(fareTestAnalysis).mean(0)
    val fareTestMax = Statistics.colStats(fareTestAnalysis).max(0)
    val fareTestMin = Statistics.colStats(fareTestAnalysis).min(0)
    val fareTestDiff = fareTestMax - fareTestMin

    // 数据预处理
    val data = testdf.rdd.map { d =>
      val sex = d(3) match {
        case "male" => 0.0
        case "female" => 1.0
      }
      val age = d(4) match {
        case null => (ageTestMean - ageTestMin) / ageTestDiff
        case _ => (d(4).toString().toDouble - ageTestMin) / ageTestDiff
      }
      val fare = d(8) match {
        case null => (fareTestMean - fareTestMin) / fareTestDiff
        case _ => (d(8).toString().toDouble - fareTestMin) / fareTestDiff
      }

      Vectors.dense(sex, age, fare)
    }

    val predictions = lrModel.predict(data).map(p => p.toInt)
    // 保存预测结果
    predictions.coalesce(1).saveAsTextFile("file:///home/mi/下载/kaggle/Titanic/test_predict")
  }
}

Spark学习笔记——手写数字识别

2017-05-25

tonglin0325

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, NaiveBayes, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.{DecisionTree, RandomForest}
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy

/**
  * Created by common on 17-5-17.
  */

case class LabeledPic(
                       label: Int,
                       pic: List[Double] = List()
                     )

object DigitRecognizer {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("DigitRecgonizer").setMaster("local")
    val sc = new SparkContext(conf)
    // 去掉第一行，sed 1d train.csv > train_noheader.csv
    val trainFile = "file:///media/common/工作/kaggle/DigitRecognizer/train_noheader.csv"
    val trainRawData = sc.textFile(trainFile)
    // 通过逗号对数据进行分割，生成数组的rdd
    val trainRecords = trainRawData.map(line => line.split(","))

    val trainData = trainRecords.map { r =>
      val label = r(0).toInt
      val features = r.slice(1, r.size).map(d => d.toDouble)
      LabeledPoint(label, Vectors.dense(features))
    }


    //    // 使用贝叶斯模型
    //    val nbModel = NaiveBayes.train(trainData)
    //
    //    val nbTotalCorrect = trainData.map { point =>
    //      if (nbModel.predict(point.features) == point.label) 1 else 0
    //    }.sum
    //    val nbAccuracy = nbTotalCorrect / trainData.count
    //
    //    println("贝叶斯模型正确率：" + nbAccuracy)
    //
    //    // 对测试数据进行预测
    //    val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    //    // 通过逗号对数据进行分割，生成数组的rdd
    //    val testRecords = testRawData.map(line => line.split(","))
    //
    //    val testData = testRecords.map { r =>
    //      val features = r.map(d => d.toDouble)
    //      Vectors.dense(features)
    //    }
    //    val predictions = nbModel.predict(testData).map(p => p.toInt)
    //    // 保存预测结果
    //    predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict")


    //    // 使用线性回归模型
    //    val lrModel = new LogisticRegressionWithLBFGS()
    //      .setNumClasses(10)
    //      .run(trainData)
    //
    //    val lrTotalCorrect = trainData.map { point =>
    //      if (lrModel.predict(point.features) == point.label) 1 else 0
    //    }.sum
    //    val lrAccuracy = lrTotalCorrect / trainData.count
    //
    //    println("线性回归模型正确率：" + lrAccuracy)
    //
    //    // 对测试数据进行预测
    //    val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    //    // 通过逗号对数据进行分割，生成数组的rdd
    //    val testRecords = testRawData.map(line => line.split(","))
    //
    //    val testData = testRecords.map { r =>
    //      val features = r.map(d => d.toDouble)
    //      Vectors.dense(features)
    //    }
    //    val predictions = lrModel.predict(testData).map(p => p.toInt)
    //    // 保存预测结果
    //    predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict1")


    //    // 使用决策树模型
    //    val maxTreeDepth = 10
    //    val numClass = 10
    //    val dtModel = DecisionTree.train(trainData, Algo.Classification, Entropy, maxTreeDepth, numClass)
    //
    //    val dtTotalCorrect = trainData.map { point =>
    //      if (dtModel.predict(point.features) == point.label) 1 else 0
    //    }.sum
    //    val dtAccuracy = dtTotalCorrect / trainData.count
    //
    //    println("决策树模型正确率：" + dtAccuracy)
    //
    //    // 对测试数据进行预测
    //    val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    //    // 通过逗号对数据进行分割，生成数组的rdd
    //    val testRecords = testRawData.map(line => line.split(","))
    //
    //    val testData = testRecords.map { r =>
    //      val features = r.map(d => d.toDouble)
    //      Vectors.dense(features)
    //    }
    //    val predictions = dtModel.predict(testData).map(p => p.toInt)
    //    // 保存预测结果
    //    predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict2")


//    // 使用随机森林模型
//    val numClasses = 30
//    val categoricalFeaturesInfo = Map[Int, Int]()
//    val numTrees = 50
//    val featureSubsetStrategy = "auto"
//    val impurity = "gini"
//    val maxDepth = 10
//    val maxBins = 32
//    val rtModel = RandomForest.trainClassifier(trainData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
//
//    val rtTotalCorrect = trainData.map { point =>
//      if (rtModel.predict(point.features) == point.label) 1 else 0
//    }.sum
//    val rtAccuracy = rtTotalCorrect / trainData.count
//
//    println("随机森林模型正确率：" + rtAccuracy)
//
//    // 对测试数据进行预测
//    val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
//    // 通过逗号对数据进行分割，生成数组的rdd
//    val testRecords = testRawData.map(line => line.split(","))
//
//    val testData = testRecords.map { r =>
//      val features = r.map(d => d.toDouble)
//      Vectors.dense(features)
//    }
//    val predictions = rtModel.predict(testData).map(p => p.toInt)
//    // 保存预测结果
//    predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict")


  }

}

tonglin0325的个人主页

在Ubuntu上使用shadow$ocks

同时安装anaconda2和anaconda3

Hive学习笔记——安装和内部表CRUD

Python爬虫——布隆过滤器

Ubuntu下安装和使用zookeeper和kafka

1.下载 kafka和zookeeper

2.安装zookeeper

Ubuntu16.04安装xgboost

查看pip安装的Python库

Kaggle学习笔记——房屋价格预测

1.特征探索和分析

1.了解特征的含义

2.查看特征是离散特征还是连续特征

Spark学习笔记——泰坦尼克生还预测

Spark学习笔记——手写数字识别

tonglin0325

标签

标签云

文章归档

最近文章