tonglin0325's Personal Homepage

Hive Study Notes: Installation and Internal (Managed) Table CRUD

1. First, install Hadoop and Hive

For the installation, refer to http://blog.csdn.net/jdplus/article/details/46493553

The version installed is apache-hive-2.1.1-bin.tar.gz; extract it into /usr/local.

Then add the following to /etc/profile:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

2. Modify the configuration files

Add the following to bin/hive-config.sh:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_121
export HIVE_HOME=/usr/local/hive
export HADOOP_HOME=/usr/local/hadoop

Create the hive-env.sh file:

cp hive-env.sh.template hive-env.sh

Modify hive-site.xml under the conf directory. This configuration is local mode and uses JDBC to connect to the metastore; for local mode, see pages 24-27 of Programming Hive.

The actual data still lives in HDFS; MySQL only holds the metastore tables, i.e. the schema information.

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>XXXX</value>
<description>password to use against metastore database</description>
</property>
<!-- Without the following settings, error 1 will occur. -->
<property>
<name>hive.exec.local.scratchdir</name>
<value>/usr/local/hive/tmp/local</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/usr/local/hive/tmp/downloaded</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/usr/local/hive/tmp/location</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/usr/local/hive/tmp/operation_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
</configuration>

Note that the four directories /usr/local/hive/tmp/local, /usr/local/hive/tmp/downloaded, /usr/local/hive/tmp/location, and /usr/local/hive/tmp/operation_logs must be created by hand.
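If you prefer to script that step, here is a minimal sketch; it assumes only the paths above and that the current user may write under /usr/local/hive:

import os

# Create the four local scratch directories referenced in hive-site.xml above
for name in ["local", "downloaded", "location", "operation_logs"]:
    path = os.path.join("/usr/local/hive/tmp", name)
    if not os.path.isdir(path):
        os.makedirs(path)   # behaves like mkdir -p
    print("ready: " + path)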

Modify hive-log4j.properties:

#cp hive-log4j.properties.template hive-log4j.properties
#vim hive-log4j.properties
hive.log.dir=<custom log directory>/log/

Create the /tmp and /user/hive/warehouse directories on HDFS and grant group write permission.

Note that /user/hive/warehouse here is the data-warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml.

hadoop fs -mkdir -p /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
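The article title also covers CRUD on an internal (managed) table; the excerpt ends before that part. Purely as an illustration (not the article's own approach), once the metastore is initialized and HiveServer2 is running, a managed table can be exercised from Python with the third-party PyHive package. The host, port, and table name below are assumptions:

from pyhive import hive   # pip install pyhive

# Assumed defaults: HiveServer2 on localhost:10000, database "default"
conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()

# Create an internal (managed) table; its data is stored under /user/hive/warehouse
cur.execute("CREATE TABLE IF NOT EXISTS student (id INT, name STRING)")
cur.execute("INSERT INTO student VALUES (1, 'tom')")   # INSERT ... VALUES is supported since Hive 0.14
cur.execute("SELECT * FROM student")
print(cur.fetchall())
cur.execute("DROP TABLE student")   # dropping a managed table also deletes its warehouse data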

Read more >>

Python Web Crawler: Bloom Filter

Bloom filter implementation, method 1: implement it yourself

Reference: http://www.cnblogs.com/naive/p/5815433.html

The two BloomFilter constructor parameters are the size of the bit array and the number of hash functions (a sizing sketch follows the code below).

#!/usr/bin/env python
# coding: utf-8

from bitarray import bitarray
# 3rd party
import mmh3
import scrapy
from BeautifulSoup import BeautifulSoup as BS
import os
ls = os.linesep

class BloomFilter(set):

    def __init__(self, size, hash_count):
        super(BloomFilter, self).__init__()
        self.bit_array = bitarray(size)
        self.bit_array.setall(0)
        self.size = size
        self.hash_count = hash_count

    def __len__(self):
        return self.size

    def __iter__(self):
        return iter(self.bit_array)

    def add(self, item):
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            self.bit_array[index] = 1

        return self

    def __contains__(self, item):
        out = True
        for ii in range(self.hash_count):
            index = mmh3.hash(item, ii) % self.size
            if self.bit_array[index] == 0:
                out = False

        return out

class DmozSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "http://baike.baidu.com/item/%E7%BA%B3%E5%85%B0%E6%98%8E%E7%8F%A0"
    ]

    def parse(self, response):

        # fname = "/media/common/娱乐/Electronic_Design/Coding/Python/Scrapy/tutorial/tutorial/spiders/temp"
        #
        # html = response.xpath('//html').extract()[0]
        # fobj = open(fname, 'w')
        # fobj.writelines(html.encode('utf-8'))
        # fobj.close()

        bloom = BloomFilter(1000, 10)
        animals = ['dog', 'cat', 'giraffe', 'fly', 'mosquito', 'horse', 'eagle',
                   'bird', 'bison', 'boar', 'butterfly', 'ant', 'anaconda', 'bear',
                   'chicken', 'dolphin', 'donkey', 'crow', 'crocodile']
        # First insertion of animals into the bloom filter
        for animal in animals:
            bloom.add(animal)

        # Membership test for the animals that were inserted
        # There should not be any false negatives
        for animal in animals:
            if animal in bloom:
                print('{} is in bloom filter as expected'.format(animal))
            else:
                print('Something went terribly wrong for {}'.format(animal))
                print('FALSE NEGATIVE!')

        # Membership test for animals that were not inserted
        # There could be false positives
        other_animals = ['badger', 'cow', 'pig', 'sheep', 'bee', 'wolf', 'fox',
                         'whale', 'shark', 'fish', 'turkey', 'duck', 'dove',
                         'deer', 'elephant', 'frog', 'falcon', 'goat', 'gorilla',
                         'hawk']
        for other_animal in other_animals:
            if other_animal in bloom:
                print('{} was not inserted, but tests positive: a false positive'.format(other_animal))
            else:
                print('{} is not in the bloom filter as expected'.format(other_animal))
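For choosing the two constructor arguments, the usual Bloom filter sizing formulas can be used. A small sketch; the capacity n = 1000 and false-positive rate p = 0.01 below are just example numbers:

import math

def bloom_parameters(n, p):
    """Return (size, hash_count) for n expected items and false-positive rate p."""
    size = int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))   # m = -n*ln(p) / (ln 2)^2
    hash_count = int(round((size / float(n)) * math.log(2)))       # k = (m/n) * ln 2
    return size, hash_count

# Example: 1000 items at ~1% false positives -> roughly 9586 bits and 7 hash functions
print(bloom_parameters(1000, 0.01))
# The result can be fed straight into the class above:
# bloom = BloomFilter(*bloom_parameters(1000, 0.01))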

Read more >>

Installing and Using ZooKeeper and Kafka on Ubuntu

1. Download Kafka and ZooKeeper

The versions downloaded here are kafka_2.10-0.10.0.0.tgz and zookeeper-3.4.10.tar.gz.

They can be downloaded from the Tsinghua mirror:

https://mirrors.tuna.tsinghua.edu.cn/apache/

or from the official Apache sites:

https://kafka.apache.org/downloads
https://zookeeper.apache.org/releases.html

Then extract each of them into /usr/local.
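Once both services are running (starting them is covered in the full article), a quick smoke test can be done from Python with the third-party kafka-python package; the broker address and topic name below are assumptions, not values from the article:

# pip install kafka-python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # assumed default broker address
TOPIC = "test"              # assumed topic name

# Produce a single message
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, b"hello kafka")
producer.flush()

# Read it back; give up after 5 seconds if nothing arrives
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for msg in consumer:
    print(msg.value)
    break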

Read more >>

Installing xgboost on Ubuntu 16.04

1. Installation under Python

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
make -j4
cd python-package/
sudo python setup.py install

If you hit the following problem after import xgboost:

OSError: /home/common/anaconda2/lib/python2.7/site-packages/scipy/sparse/../../../../libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/common/coding/coding/Scala/xgboost/python-package/xgboost/../../lib/libxgboost.so)

Solution:

conda install libgcc
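Once import xgboost succeeds, a minimal training run is a quick way to confirm the build works; the toy data below is made up purely for illustration:

import numpy as np
import xgboost as xgb

# Tiny synthetic binary-classification problem
X = np.random.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.3}
bst = xgb.train(params, dtrain, num_boost_round=10)

preds = bst.predict(dtrain)
print("train accuracy:", ((preds > 0.5).astype(int) == y).mean())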

2. Installation under Java

Finish the Python installation first, because the gcc version issue above also affects compiling and installing xgboost under Java.

First update the repository:

git pull && git submodule init && git submodule update && git submodule status

Then refer to:

http://xgboost.readthedocs.io/en/latest/jvm/

Read more >>

Viewing Python Packages Installed with pip

List the installed packages:

pip list    # or: pip freeze

List outdated packages:

pip list --outdated

A Python script to upgrade them all in one go (an alternative based on the pip CLI follows the script):

import pip
from subprocess import call

# Note: pip.get_installed_distributions() is an internal API (removed in pip 10+)
for dist in pip.get_installed_distributions():
    call("sudo pip install --upgrade " + dist.project_name, shell=True)

Read more >>

Spark Study Notes: Titanic Survival Prediction

package kaggle

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics


/**
  * Created by mi on 17-5-23.
  */


object Titanic {


  def main(args: Array[String]) {

    // val sparkSession = SparkSession.builder.
    //   master("local")
    //   .appName("spark session example")
    //   .getOrCreate()
    // val rawData = sparkSession.read.csv("/home/mi/下载/kaggle/Titanic/nohead-train.csv")
    // val d = rawData.map{p => p.asInstanceOf[person]}
    // d.show()

    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Silence the logs
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Read the data
    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/home/mi/下载/kaggle/Titanic/train.csv", "header" -> "true"))

    // Analyze the age column
    val ageAnalysis = df.rdd.filter(d => d(5) != null).map { d =>
      val age = d(5).toString.toDouble
      Vectors.dense(age)
    }
    val ageMean = Statistics.colStats(ageAnalysis).mean(0)
    val ageMax = Statistics.colStats(ageAnalysis).max(0)
    val ageMin = Statistics.colStats(ageAnalysis).min(0)
    val ageDiff = ageMax - ageMin

    // Analyze the fare column
    val fareAnalysis = df.rdd.filter(d => d(9) != null).map { d =>
      val fare = d(9).toString.toDouble
      Vectors.dense(fare)
    }
    val fareMean = Statistics.colStats(fareAnalysis).mean(0)
    val fareMax = Statistics.colStats(fareAnalysis).max(0)
    val fareMin = Statistics.colStats(fareAnalysis).min(0)
    val fareDiff = fareMax - fareMin


    // Preprocess the data
    val trainData = df.rdd.map { d =>
      val label = d(1).toString.toInt
      val sex = d(4) match {
        case "male" => 0.0
        case "female" => 1.0
      }
      val age = d(5) match {
        case null => (ageMean - ageMin) / ageDiff
        case _ => (d(5).toString().toDouble - ageMin) / ageDiff
      }
      val fare = d(9) match {
        case null => (fareMean - fareMin) / fareDiff
        case _ => (d(9).toString().toDouble - fareMin) / fareDiff
      }

      LabeledPoint(label, Vectors.dense(sex, age, fare))
    }

    // Split into training and test sets
    val Array(trainingData, testData) = trainData.randomSplit(Array(0.8, 0.2))

    // Train the model
    val numIterations = 8
    val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)
    // val svmModel = SVMWithSGD.train(trainingData, numIterations)

    val lrTotalCorrect = testData.map { point =>
      if (lrModel.predict(point.features) == point.label) 1 else 0
    }.sum
    val lrAccuracy = lrTotalCorrect / testData.count

    println("Logistic regression model accuracy: " + lrAccuracy)

    // Prediction
    // Read the test data
    val testdf = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/home/mi/下载/kaggle/Titanic/test.csv", "header" -> "true"))

    // Analyze the age column of the test set
    val ageTestAnalysis = testdf.rdd.filter(d => d(4) != null).map { d =>
      val age = d(4).toString.toDouble
      Vectors.dense(age)
    }
    val ageTestMean = Statistics.colStats(ageTestAnalysis).mean(0)
    val ageTestMax = Statistics.colStats(ageTestAnalysis).max(0)
    val ageTestMin = Statistics.colStats(ageTestAnalysis).min(0)
    val ageTestDiff = ageTestMax - ageTestMin

    // Analyze the fare column of the test set
    val fareTestAnalysis = testdf.rdd.filter(d => d(8) != null).map { d =>
      val fare = d(8).toString.toDouble
      Vectors.dense(fare)
    }
    val fareTestMean = Statistics.colStats(fareTestAnalysis).mean(0)
    val fareTestMax = Statistics.colStats(fareTestAnalysis).max(0)
    val fareTestMin = Statistics.colStats(fareTestAnalysis).min(0)
    val fareTestDiff = fareTestMax - fareTestMin

    // Preprocess the test data
    val data = testdf.rdd.map { d =>
      val sex = d(3) match {
        case "male" => 0.0
        case "female" => 1.0
      }
      val age = d(4) match {
        case null => (ageTestMean - ageTestMin) / ageTestDiff
        case _ => (d(4).toString().toDouble - ageTestMin) / ageTestDiff
      }
      val fare = d(8) match {
        case null => (fareTestMean - fareTestMin) / fareTestDiff
        case _ => (d(8).toString().toDouble - fareTestMin) / fareTestDiff
      }

      Vectors.dense(sex, age, fare)
    }

    val predictions = lrModel.predict(data).map(p => p.toInt)
    // Save the predictions
    predictions.coalesce(1).saveAsTextFile("file:///home/mi/下载/kaggle/Titanic/test_predict")
  }
}

Read more >>

Spark Study Notes: Handwritten Digit Recognition

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, NaiveBayes, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.{DecisionTree, RandomForest}
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy

/**
  * Created by common on 17-5-17.
  */

case class LabeledPic(
  label: Int,
  pic: List[Double] = List()
)

object DigitRecognizer {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("DigitRecgonizer").setMaster("local")
    val sc = new SparkContext(conf)
    // Drop the header line first: sed 1d train.csv > train_noheader.csv
    val trainFile = "file:///media/common/工作/kaggle/DigitRecognizer/train_noheader.csv"
    val trainRawData = sc.textFile(trainFile)
    // Split each line on commas to get an RDD of arrays
    val trainRecords = trainRawData.map(line => line.split(","))

    val trainData = trainRecords.map { r =>
      val label = r(0).toInt
      val features = r.slice(1, r.size).map(d => d.toDouble)
      LabeledPoint(label, Vectors.dense(features))
    }


    // // Naive Bayes model
    // val nbModel = NaiveBayes.train(trainData)
    //
    // val nbTotalCorrect = trainData.map { point =>
    //   if (nbModel.predict(point.features) == point.label) 1 else 0
    // }.sum
    // val nbAccuracy = nbTotalCorrect / trainData.count
    //
    // println("Naive Bayes model accuracy: " + nbAccuracy)
    //
    // // Predict on the test data
    // val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    // // Split each line on commas to get an RDD of arrays
    // val testRecords = testRawData.map(line => line.split(","))
    //
    // val testData = testRecords.map { r =>
    //   val features = r.map(d => d.toDouble)
    //   Vectors.dense(features)
    // }
    // val predictions = nbModel.predict(testData).map(p => p.toInt)
    // // Save the predictions
    // predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict")


    // // Logistic regression model
    // val lrModel = new LogisticRegressionWithLBFGS()
    //   .setNumClasses(10)
    //   .run(trainData)
    //
    // val lrTotalCorrect = trainData.map { point =>
    //   if (lrModel.predict(point.features) == point.label) 1 else 0
    // }.sum
    // val lrAccuracy = lrTotalCorrect / trainData.count
    //
    // println("Logistic regression model accuracy: " + lrAccuracy)
    //
    // // Predict on the test data
    // val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    // // Split each line on commas to get an RDD of arrays
    // val testRecords = testRawData.map(line => line.split(","))
    //
    // val testData = testRecords.map { r =>
    //   val features = r.map(d => d.toDouble)
    //   Vectors.dense(features)
    // }
    // val predictions = lrModel.predict(testData).map(p => p.toInt)
    // // Save the predictions
    // predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict1")


    // // Decision tree model
    // val maxTreeDepth = 10
    // val numClass = 10
    // val dtModel = DecisionTree.train(trainData, Algo.Classification, Entropy, maxTreeDepth, numClass)
    //
    // val dtTotalCorrect = trainData.map { point =>
    //   if (dtModel.predict(point.features) == point.label) 1 else 0
    // }.sum
    // val dtAccuracy = dtTotalCorrect / trainData.count
    //
    // println("Decision tree model accuracy: " + dtAccuracy)
    //
    // // Predict on the test data
    // val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    // // Split each line on commas to get an RDD of arrays
    // val testRecords = testRawData.map(line => line.split(","))
    //
    // val testData = testRecords.map { r =>
    //   val features = r.map(d => d.toDouble)
    //   Vectors.dense(features)
    // }
    // val predictions = dtModel.predict(testData).map(p => p.toInt)
    // // Save the predictions
    // predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict2")


    // // Random forest model
    // val numClasses = 30
    // val categoricalFeaturesInfo = Map[Int, Int]()
    // val numTrees = 50
    // val featureSubsetStrategy = "auto"
    // val impurity = "gini"
    // val maxDepth = 10
    // val maxBins = 32
    // val rtModel = RandomForest.trainClassifier(trainData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    //
    // val rtTotalCorrect = trainData.map { point =>
    //   if (rtModel.predict(point.features) == point.label) 1 else 0
    // }.sum
    // val rtAccuracy = rtTotalCorrect / trainData.count
    //
    // println("Random forest model accuracy: " + rtAccuracy)
    //
    // // Predict on the test data
    // val testRawData = sc.textFile("file:///media/common/工作/kaggle/DigitRecognizer/test_noheader.csv")
    // // Split each line on commas to get an RDD of arrays
    // val testRecords = testRawData.map(line => line.split(","))
    //
    // val testData = testRecords.map { r =>
    //   val features = r.map(d => d.toDouble)
    //   Vectors.dense(features)
    // }
    // val predictions = rtModel.predict(testData).map(p => p.toInt)
    // // Save the predictions
    // predictions.coalesce(1).saveAsTextFile("file:///media/common/工作/kaggle/DigitRecognizer/test_predict")


  }

}

Read more >>

Stanford CoreNLP Study Notes: Part-of-Speech Tagging

Using Stanford CoreNLP for Chinese part-of-speech tagging.

The language is Scala, and the jar version used is 3.6.0. The jars were added manually, because adding other versions via sbt ran into all kinds of problems.

Five jars were added.

Code:

import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

/**
  * Created by common on 17-5-13.
  */
object NLPLearning {

  def main(args: Array[String]): Unit = {
    val props = "StanfordCoreNLP-chinese.properties"
    val pipeline = new StanfordCoreNLP(props)

    val annotation = new Annotation("这家酒店很好,我很喜欢。")

    pipeline.annotate(annotation)
    pipeline.prettyPrint(annotation, System.out)

  }

}

Read more >>

Solr Study Notes: Importing JSON Data

1. There are two ways to import JSON data: through the web admin UI, or with the curl command.

curl http://localhost:8983/solr/baikeperson/update/json?commit=true --data-binary @/home/XXX/下载/person/test1.json -H 'Content-type:text/json; charset=utf-8'

2. Pay attention to the format when importing

A format that curl can import (a Python sketch for posting it follows the example):

{
  "add": {
    "overwrite": true,
    "doc": {
      "id": 1,
      "name": "Some book",
      "author": ["John", "Marry"]
    }
  },
  "add": {
    "overwrite": true,
    "boost": 2.5,
    "doc": {
      "id": 2,
      "name": "Important Book",
      "author": ["Harry", "Jane"]
    }
  },
  "add": {
    "overwrite": true,
    "doc": {
      "id": 3,
      "name": "Some other book",
      "author": "Marry"
    }
  }
}
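The same payload can also be posted from Python instead of curl; here is a sketch using the third-party requests package (the core name and file path are taken from the curl example above):

# pip install requests
import requests

url = "http://localhost:8983/solr/baikeperson/update/json?commit=true"
headers = {"Content-type": "application/json; charset=utf-8"}

# Post the same file that the curl command above sends
with open("/home/XXX/下载/person/test1.json", "rb") as f:
    resp = requests.post(url, data=f.read(), headers=headers)

print(resp.status_code, resp.text)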

Read more >>