tonglin0325's Personal Homepage

Installing Anaconda2 and Anaconda3 Side by Side

For the installation procedure, see "Installing Anaconda2 and Anaconda3 Side by Side on Ubuntu 14.04".

To start the environment, cd into $HOME/anaconda2/envs/py3k/bin:

source activate py3k    # activate the environment
source deactivate       # deactivate it

Then remember to add the following to /etc/profile:

# added by Anaconda2 4.3.1 installer
export PATH="/home/common/anaconda2/bin:$PATH"

To install packages into the active environment, just use pip install.
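For example (the package name here is only an illustration):

pip install requests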

Hive Study Notes: Installation and Internal Table CRUD

1. First, install Hadoop and Hive

For the installation, refer to http://blog.csdn.net/jdplus/article/details/46493553

The version installed is apache-hive-2.1.1-bin.tar.gz, extracted into /usr/local.

Then add the following to /etc/profile (and run source /etc/profile afterwards to apply it):

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

2. Edit the configuration files

Add the following to the bin/hive-config.sh file:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_121
export HIVE_HOME=/usr/local/hive
export HADOOP_HOME=/usr/local/hadoop

Create the hive-env.sh file:

cp hive-env.sh.template hive-env.sh
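Then, in hive-env.sh, point Hive at the Hadoop installation and at its own config directory. A minimal sketch, reusing the paths from this setup (adjust if your layout differs):

export HADOOP_HOME=/usr/local/hadoop
export HIVE_CONF_DIR=/usr/local/hive/conf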

Read more >>

Python Crawler Notes: Bloom Filters

Bloom filter implementation, method 1: roll your own

Reference: http://www.cnblogs.com/naive/p/5815433.html

BloomFilter's two parameters are the size of the bit array and the number of hash functions.

#!/usr/bin/env python
# coding: utf-8

from bitarray import bitarray
# 3rd party
import mmh3
import scrapy


class BloomFilter(set):

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        # set one bit per hash function
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1
        return self

    def __contains__(self, item):
        # an item may be present only if every corresponding bit is set
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                return False
        return True


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):

        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership check for already inserted animals:
        # there should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership check for animals that were never inserted:
        # there could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} was not inserted, but reports as a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter, as expected'.format(other_animal))
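To judge whether the size and hash count are sensible, the usual approximation for the false-positive rate of a Bloom filter with m bits, k hash functions and n inserted items is (1 - e^(-kn/m))^k. A quick side calculation for the parameters used above (this helper is not part of the crawler, just a sanity check):

import math

def false_positive_rate(m, k, n):
    # approximate false-positive probability: (1 - e^(-kn/m))^k
    return (1 - math.exp(-float(k) * n / m)) ** k

# 1000 bits, 10 hash functions, 19 animals inserted
print(false_positive_rate(1000, 10, 19))  # about 2e-8, negligible here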

 

Bloom filter implementation, method 2: using pybloom

Reference: http://www.jianshu.com/p/f57187e2b5b9
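Note that pybloom's constructor takes a capacity and a target error rate rather than a bit-array size and hash count; the library derives the bit-array size and the number of hashes from those two values itself. A minimal standalone sketch of that API:

from pybloom import BloomFilter

b = BloomFilter(capacity=1000, error_rate=0.001)
b.add('dog')           # returns whether 'dog' was (possibly) already present
print('dog' in b)      # True
print('unicorn' in b)  # False, except with probability <= 0.001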

#!/usr/bin/env python
# coding: utf-8

from pybloom import BloomFilter

import scrapy


class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):

        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        # bloom = BloomFilter(100, 10)
        # pybloom takes (capacity, error_rate), not (size, hash_count)
        bloom = BloomFilter(1000, 0.001)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership check for already inserted animals:
        # there should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership check for animals that were never inserted:
        # there could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} was not inserted, but reports as a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter, as expected'.format(other_animal))

 

Output:

Read more >>

Installing and Using ZooKeeper and Kafka on Ubuntu

1. Download Kafka and ZooKeeper

The versions downloaded here are kafka_2.10-0.10.0.0.tgz and zookeeper-3.4.10.tar.gz.

They can be downloaded from the Tsinghua mirror:

https://mirrors.tuna.tsinghua.edu.cn/apache/

or from the Apache websites:

https://kafka.apache.org/downloads
https://zookeeper.apache.org/releases.html

Then extract each archive into /usr/local.

 

2. Install ZooKeeper

Enter the ZooKeeper directory and, in the conf directory, copy zoo_sample.cfg and rename it to zoo.cfg.

Reference: https://my.oschina.net/phoebus789/blog/730787

Contents of the zoo.cfg file:
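A minimal standalone configuration, following the defaults that ship in zoo_sample.cfg (dataDir is the value you will most likely want to change):

# basic time unit in milliseconds
tickTime=2000
# ticks a follower may take to connect and sync to the leader
initLimit=10
# ticks a follower may lag behind the leader before being dropped
syncLimit=5
# where snapshots (and, by default, transaction logs) are stored
dataDir=/tmp/zookeeper
# port on which clients connect
clientPort=2181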

Read more >>

Installing xgboost on Ubuntu 16.04

1. Installation for Python

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j4
cd python-package/
sudo python setup.py install

If, after import xgboost, you hit the following error:

OSError: /home/common/anaconda2/lib/python2.7/site-packages/scipy/sparse/../../../../libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/common/coding/coding/Scala/xgboost/python-package/xgboost/../../lib/libxgboost.so)

the fix is:

conda install libgcc
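After that, a quick sanity check that the package imports and can fit a model (a minimal sketch; the toy data is invented purely for illustration):

import numpy as np
import xgboost as xgb

# four points forming two tiny clusters: a toy binary classification task
X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.1]])
y = np.array([0, 0, 1, 1])

dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'max_depth': 2}
model = xgb.train(params, dtrain, num_boost_round=5)
print(model.predict(xgb.DMatrix(X)))  # probabilities close to 0, 0, 1, 1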

2. Installation for Java

Install the Python package first, because the gcc version issue above also affects compiling and installing xgboost for Java.

First update the repository:

git pull && git submodule init && git submodule update && git submodule status

Then follow:

http://xgboost.readthedocs.io/en/latest/jvm/

 

Some Scala examples of using xgboost:

https://www.elenacuoco.com/2016/10/10/scala-spark-xgboost-classification/
http://blog.csdn.net/luoyexuge/article/details/71422270 

Read more >>

Inspecting Python Libraries Installed with pip

List the installed packages (pip list prints names and versions; pip freeze prints them in requirements format):

pip list
pip freeze

List outdated packages:

pip list --outdated

A Python script to upgrade everything in bulk (note: pip.get_installed_distributions() only exists in pip versions before 10):

import pip
from subprocess import call

# upgrade every installed distribution in place
# (pip.get_installed_distributions() was removed in pip 10)
for dist in pip.get_installed_distributions():
    call("sudo pip install --upgrade " + dist.project_name, shell=True)

Upgrade pip itself:

pip install --upgrade pip

 

Kaggle Study Notes: House Price Prediction

Kaggle's house price competition uses the Ames Housing dataset: sale prices in the town of Ames, Iowa, USA, from 2006 to 2010.

1. Feature exploration and analysis

1. Understand what the features mean

First load the training and test samples with Python's pandas. The data is in CSV format, and the first row is a header holding the feature names.

Check the dimensionality of the features:

import pandas as pd

# load the training data
train_data = pd.read_csv("./raw_data/train.csv")

# drop Id and SalePrice, keeping only the features
data = train_data.drop(columns=["Id", "SalePrice"])

print(data.shape)

The output shows that, excluding Id and SalePrice, there are 79 feature dimensions in total:

(1460, 79)

To make sense of the housing features, here is a dict mapping each feature name to a short description:

feature_desc = {
    "MSSubClass": "type of dwelling involved in the sale (encodes age and style)",
    "MSZoning": "general zoning classification: agricultural, commercial, residential, etc.",
    "LotFrontage": "linear feet of street connected to the property",
    "LotArea": "lot size",
    "Street": "type of road access to the property (gravel or paved)",
    "Alley": "type of alley access to the property",
    "LotShape": "general shape of the property (how regular it is)",
    "LandContour": "flatness of the property",
    "Utilities": "utilities available: water, electricity, gas",
    "LotConfig": "lot configuration: cul-de-sac, corner lot, etc.",
    "LandSlope": "slope of the property",
    "Neighborhood": "physical location within Ames city limits",
    "Condition1": "proximity to a main road or railroad",
    "Condition2": "proximity to a second main road or railroad (if present)",
    "BldgType": "type of dwelling: number of families, townhouse, etc.",
    "HouseStyle": "style of dwelling: number of stories, split level, etc.",
    "OverallQual": "overall material and finish quality",
    "OverallCond": "overall condition rating",
    "YearBuilt": "original construction year",
    "YearRemodAdd": "remodel year",
    "RoofStyle": "type of roof",
    "RoofMatl": "roof material",
    "Exterior1st": "exterior covering material",
    "Exterior2nd": "second exterior covering material, if more than one",
    "MasVnrType": "masonry veneer type",
    "MasVnrArea": "masonry veneer area",
    "ExterQual": "exterior material quality",
    "ExterCond": "condition of the exterior material",
    "Foundation": "type of foundation",
    "BsmtQual": "basement quality (height)",
    "BsmtCond": "general condition of the basement",
    "BsmtExposure": "basement exposure (walkout or garden-level walls, daylight)",
    "BsmtFinType1": "rating of the finished basement area",
    "BsmtFinSF1": "finished basement area (type 1)",
    "BsmtFinType2": "rating of a second finished area, if present",
    "BsmtFinSF2": "finished basement area (type 2), if present",
    "BsmtUnfSF": "unfinished basement area",
    "TotalBsmtSF": "total basement area",
    "Heating": "type of heating",
    "HeatingQC": "heating quality and condition",
    "CentralAir": "whether there is central air conditioning",
    "Electrical": "electrical system",
    "_1stFlrSF": "first floor area",
    "_2ndFlrSF": "second floor area",
    "LowQualFinSF": "low-quality finished area (all floors)",
    "GrLivArea": "above-grade (ground) living area",
    "BsmtFullBath": "full bathrooms in the basement",
    "BsmtHalfBath": "half bathrooms in the basement",
    "FullBath": "full bathrooms above grade",
    "HalfBath": "half bathrooms above grade",
    "BedroomAbvGr": "bedrooms above grade",
    "KitchenAbvGr": "kitchens above grade",
    "KitchenQual": "kitchen quality",
    "TotRmsAbvGrd": "total rooms above grade (excluding bathrooms)",
    "Functional": "home functionality rating",
    "Fireplaces": "number of fireplaces",
    "FireplaceQu": "fireplace quality",
    "GarageType": "garage location/type",
    "GarageYrBlt": "year the garage was built",
    "GarageFinish": "interior finish of the garage",
    "GarageCars": "garage capacity in number of cars",
    "GarageArea": "garage area",
    "GarageQual": "garage quality",
    "GarageCond": "garage condition",
    "PavedDrive": "whether the driveway is paved",
    "WoodDeckSF": "wood deck area",
    "OpenPorchSF": "open porch area",
    "EnclosedPorch": "enclosed porch area",
    "_3SsnPorch": "three-season porch area",
    "ScreenPorch": "screened porch area",
    "PoolArea": "pool area",
    "PoolQC": "pool quality",
    "Fence": "fence quality",
    "MiscFeature": "miscellaneous feature not covered above",
    "MiscVal": "value of the miscellaneous feature",
    "MoSold": "month sold",
    "YrSold": "year sold",
    "SaleType": "type of sale",
    "SaleCondition": "condition of sale"
}
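One wrinkle worth noting: the actual CSV columns are named 1stFlrSF, 2ndFlrSF and 3SsnPorch, which are not valid Python identifiers, so the dict above prefixes them with an underscore. A small sketch (assuming the data frame loaded earlier) that looks up descriptions while handling that prefix:

# map a real column name to its dict key, then print the description
for col in data.columns:
    key = "_" + col if col[0].isdigit() else col
    print(col, "->", feature_desc.get(key, "(no description)"))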

2. Check whether each feature is categorical or continuous

# check whether each feature is categorical or continuous
train_data.info()
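A convenient follow-up is to split the columns by dtype: in this dataset the object columns are the categorical candidates and the numeric ones the continuous (or ordinal) candidates. A small sketch, assuming the same train_data as above:

# string-typed columns: categorical candidates
categorical_cols = train_data.select_dtypes(include=["object"]).columns
# numeric columns: continuous or ordinal candidates
numeric_cols = train_data.select_dtypes(include=["int64", "float64"]).columns

print(len(categorical_cols), len(numeric_cols))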

Read more >>

Spark Study Notes: Titanic Survival Prediction

package kaggle

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics


/**
  * Created by mi on 17-5-23.
  */


object Titanic {

  def main(args: Array[String]) {

    // val sparkSession = SparkSession.builder.
    //   master("local")
    //   .appName("spark session example")
    //   .getOrCreate()
    // val rawData = sparkSession.read.csv("/home/mi/下载/kaggle/Titanic/nohead-train.csv")
    // val d = rawData.map{p => p.asInstanceOf[person]}
    // d.show()

    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // silence the Spark and Jetty logs
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // read the training data
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/home/mi/下载/kaggle/Titanic/train.csv", "header" -> "true"))

    // basic statistics of the age column (non-null rows only)
    val ageAnalysis = df.rdd.filter(d => d(5) != null).map { d =>
      val age = d(5).toString.toDouble
      Vectors.dense(age)
    }
    val ageMean = Statistics.colStats(ageAnalysis).mean(0)
    val ageMax = Statistics.colStats(ageAnalysis).max(0)
    val ageMin = Statistics.colStats(ageAnalysis).min(0)
    val ageDiff = ageMax - ageMin

    // basic statistics of the fare column
    val fareAnalysis = df.rdd.filter(d => d(9) != null).map { d =>
      val fare = d(9).toString.toDouble
      Vectors.dense(fare)
    }
    val fareMean = Statistics.colStats(fareAnalysis).mean(0)
    val fareMax = Statistics.colStats(fareAnalysis).max(0)
    val fareMin = Statistics.colStats(fareAnalysis).min(0)
    val fareDiff = fareMax - fareMin


    // preprocessing: encode sex as 0/1, min-max scale age and fare,
    // and impute missing values with the scaled mean
    val trainData = df.rdd.map { d =>
      val label = d(1).toString.toInt
      val sex = d(4) match {
        case "male" => 0.0
        case "female" => 1.0
      }
      val age = d(5) match {
        case null => (ageMean - ageMin) / ageDiff
        case _ => (d(5).toString().toDouble - ageMin) / ageDiff
      }
      val fare = d(9) match {
        case null => (fareMean - fareMin) / fareDiff
        case _ => (d(9).toString().toDouble - fareMin) / fareDiff
      }

      LabeledPoint(label, Vectors.dense(sex, age, fare))
    }

    // split into training and test sets
    val Array(trainingData, testData) = trainData.randomSplit(Array(0.8, 0.2))

    // train the model
    val numIterations = 8
    val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)
    // val svmModel = SVMWithSGD.train(trainingData, numIterations)

    val lrTotalCorrect = testData.map { point =>
      if (lrModel.predict(point.features) == point.label) 1 else 0
    }.sum
    val lrAccuracy = lrTotalCorrect / testData.count

    println("Logistic regression model accuracy: " + lrAccuracy)

    // prediction: read the test data
    val testdf = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/home/mi/下载/kaggle/Titanic/test.csv", "header" -> "true"))

    // basic statistics of the test set age column
    val ageTestAnalysis = testdf.rdd.filter(d => d(4) != null).map { d =>
      val age = d(4).toString.toDouble
      Vectors.dense(age)
    }
    val ageTestMean = Statistics.colStats(ageTestAnalysis).mean(0)
    val ageTestMax = Statistics.colStats(ageTestAnalysis).max(0)
    val ageTestMin = Statistics.colStats(ageTestAnalysis).min(0)
    val ageTestDiff = ageTestMax - ageTestMin

    // basic statistics of the test set fare column
    val fareTestAnalysis = testdf.rdd.filter(d => d(8) != null).map { d =>
      val fare = d(8).toString.toDouble
      Vectors.dense(fare)
    }
    val fareTestMean = Statistics.colStats(fareTestAnalysis).mean(0)
    val fareTestMax = Statistics.colStats(fareTestAnalysis).max(0)
    val fareTestMin = Statistics.colStats(fareTestAnalysis).min(0)
    val fareTestDiff = fareTestMax - fareTestMin

    // preprocess the test set the same way
    val data = testdf.rdd.map { d =>
      val sex = d(3) match {
        case "male" => 0.0
        case "female" => 1.0
      }
      val age = d(4) match {
        case null => (ageTestMean - ageTestMin) / ageTestDiff
        case _ => (d(4).toString().toDouble - ageTestMin) / ageTestDiff
      }
      val fare = d(8) match {
        case null => (fareTestMean - fareTestMin) / fareTestDiff
        case _ => (d(8).toString().toDouble - fareTestMin) / fareTestDiff
      }

      Vectors.dense(sex, age, fare)
    }

    val predictions = lrModel.predict(data).map(p => p.toInt)
    // save the predictions
    predictions.coalesce(1).saveAsTextFile("file:///home/mi/下载/kaggle/Titanic/test_predict")
  }
}
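One dependency note: sqlContext.load("com.databricks.spark.csv", ...) relies on the external spark-csv package being on the classpath. In sbt that would be roughly the line below (the exact version is an assumption for this Spark 1.x era code; pick the one matching your Spark build):

libraryDependencies += "com.databricks" %% "spark-csv" % "1.5.0"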

 

Spark Study Notes: Handwritten Digit Recognition

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, NaiveBayes, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.{DecisionTree, RandomForest}
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy

/**
  * Created by common on 17-5-17.
  */

case class LabeledPic(
  label: Int,
  pic: List[Double] = List()
)

object DigitRecognizer {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("DigitRecgonizer").setMaster("local")
    val sc = new SparkContext(conf)
    // the header line was stripped beforehand: sed 1d train.csv > train_noheader.csv
    val trainFile = "file:///media/common/工作/kaggle/DigitRecognizer/train_noheader.csv"
    val trainRawData = sc.textFile(trainFile)
    // split each line on commas, giving an RDD of string arrays
    val trainRecords = trainRawData.map(line => line.split(","))

    // the first column is the label, the rest are the pixel features
    val trainData = trainRecords.map { r =>
      val label = r(0).toInt
      val features = r.slice(1, r.size).map(d => d.toDouble)
      LabeledPoint(label, Vectors.dense(features))
    }


    // // Naive Bayes model
    // val nbModel = NaiveBayes.train(trainData)
    //
    // val nbTotalCorrect = trainData.map { point =>
    //   if (nbModel.predict(point.features) == point.label) 1 else 0
    // }.sum
    // val nbAccuracy = nbTotalCorrect / trainData.count
    //
    // println("Naive Bayes model accuracy: " + nbAccuracy)
    //
    // // predict on the test data
    // val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    // // split each line on commas, giving an RDD of string arrays
    // val testRecords = testRawData.map(line => line.split(","))
    //
    // val testData = testRecords.map { r =>
    //   val features = r.map(d => d.toDouble)
    //   Vectors.dense(features)
    // }
    // val predictions = nbModel.predict(testData).map(p => p.toInt)
    // // save the predictions
    // predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict")


    // // Logistic regression model
    // val lrModel = new LogisticRegressionWithLBFGS()
    //   .setNumClasses(10)
    //   .run(trainData)
    //
    // val lrTotalCorrect = trainData.map { point =>
    //   if (lrModel.predict(point.features) == point.label) 1 else 0
    // }.sum
    // val lrAccuracy = lrTotalCorrect / trainData.count
    //
    // println("Logistic regression model accuracy: " + lrAccuracy)
    //
    // // predict on the test data
    // val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    // // split each line on commas, giving an RDD of string arrays
    // val testRecords = testRawData.map(line => line.split(","))
    //
    // val testData = testRecords.map { r =>
    //   val features = r.map(d => d.toDouble)
    //   Vectors.dense(features)
    // }
    // val predictions = lrModel.predict(testData).map(p => p.toInt)
    // // save the predictions
    // predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict1")


    // // Decision tree model
    // val maxTreeDepth = 10
    // val numClass = 10
    // val dtModel = DecisionTree.train(trainData, Algo.Classification, Entropy, maxTreeDepth, numClass)
    //
    // val dtTotalCorrect = trainData.map { point =>
    //   if (dtModel.predict(point.features) == point.label) 1 else 0
    // }.sum
    // val dtAccuracy = dtTotalCorrect / trainData.count
    //
    // println("Decision tree model accuracy: " + dtAccuracy)
    //
    // // predict on the test data
    // val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    // // split each line on commas, giving an RDD of string arrays
    // val testRecords = testRawData.map(line => line.split(","))
    //
    // val testData = testRecords.map { r =>
    //   val features = r.map(d => d.toDouble)
    //   Vectors.dense(features)
    // }
    // val predictions = dtModel.predict(testData).map(p => p.toInt)
    // // save the predictions
    // predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict2")


    // // Random forest model
    // val numClasses = 10  // one class per digit 0-9
    // val categoricalFeaturesInfo = Map[Int, Int]()
    // val numTrees = 50
    // val featureSubsetStrategy = "auto"
    // val impurity = "gini"
    // val maxDepth = 10
    // val maxBins = 32
    // val rtModel = RandomForest.trainClassifier(trainData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    //
    // val rtTotalCorrect = trainData.map { point =>
    //   if (rtModel.predict(point.features) == point.label) 1 else 0
    // }.sum
    // val rtAccuracy = rtTotalCorrect / trainData.count
    //
    // println("Random forest model accuracy: " + rtAccuracy)
    //
    // // predict on the test data
    // val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    // // split each line on commas, giving an RDD of string arrays
    // val testRecords = testRawData.map(line => line.split(","))
    //
    // val testData = testRecords.map { r =>
    //   val features = r.map(d => d.toDouble)
    //   Vectors.dense(features)
    // }
    // val predictions = rtModel.predict(testData).map(p => p.toInt)
    // // save the predictions
    // predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict")

  }

}

 

Stanford CoreNLP Study Notes: Part-of-Speech Tagging

Using Stanford CoreNLP to tag parts of speech in Chinese text.

The language is Scala, and the jar version used is 3.6.0. The jars were added manually, because pulling in other versions through sbt ran into all sorts of problems.

Five jars were added in total.

The code:

import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

/**
  * Created by common on 17-5-13.
  */
object NLPLearning {

  def main(args: Array[String]): Unit = {
    // the Chinese pipeline configuration is loaded from the classpath
    val props = "StanfordCoreNLP-chinese.properties"
    val pipeline = new StanfordCoreNLP(props)

    val annotation = new Annotation("这家酒店很好,我很喜欢。")

    pipeline.annotate(annotation)
    pipeline.prettyPrint(annotation, System.out)
  }

}
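Passing "StanfordCoreNLP-chinese.properties" as a plain string works because StanfordCoreNLP loads that properties file from the classpath, where the Chinese models jar provides it; that jar must therefore be among the five added by hand. For reference, a rough sbt equivalent of that setup would be the lines below (an assumption, since this post adds the jars manually precisely because sbt was troublesome):

libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0"
libraryDependencies += "edu.stanford.nlp" % "stanford-corenlp" % "3.6.0" classifier "models-chinese"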

 

On the part-of-speech tag set

Verbs and adjectives (4 tags): VA, VC, VE, VV

1. Predicative adjectives: VA

Predicative adjectives correspond roughly to adjectives in English and to stative verbs in Chinese grammar and literary works. Our predicative adjectives fall into two classes:

Class 1: predicates that take no object and can be modified by 很 ("very").

Class 2: predicates derived from class 1, either by reduplication (e.g. 红彤彤 "glowing red") or by a noun-plus-adjective pattern meaning "as A as N" (e.g. 雪白 "snow-white"). Predicative adjectives of this type take no object, but some of them cannot be modified by 很, because the intensifying meaning is already built into the word itself.

Read more >>