tonglin0325's Personal Homepage

Paper Reading: Twitter's Logging System

1. The scale of data platform construction at companies in the industry

1. Twitter

Twitter has published the following two papers on its logging system:

"The Unified Logging Infrastructure for Data Analytics at Twitter" and "Scaling Big Data Mining Infrastructure: The Twitter Experience"

https://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf
https://www.kdd.org/exploration_files/V14-02-02-Lin.pdf

Related slides:

https://www.slideshare.net/Hadoop_Summit/scaling-big-data-mining-infrastructure-twitter-experience
https://slideplayer.com/slide/12451118/
https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
https://www.slideshare.net/kevinweil/hadoop-at-twitter-hadoop-summit-2010
https://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter/1-Twitter_Open_Source_Coming_soon

Of these, the 2012 paper "The Unified Logging Infrastructure for Data Analytics at Twitter" describes Twitter's production logging infrastructure and its evolution from application-specific logs to a unified "client event" log format, in which every message is a Thrift message.

Full text >>

Hive Study Notes: SerDe

SerDe is short for Serializer and Deserializer; it is the mechanism through which Hive reads and writes data in various formats.

Amazon Athena can be regarded as Amazon's counterpart to Hive; its documentation introduces SerDe here:

https://docs.aws.amazon.com/zh_cn/athena/latest/ug/serde-about.html

The SerDes commonly used in Hive are listed below, with reference to:

Hive_10. Hive中常用的 SerDe 和 当前社区的状态

1. LazySimpleSerDe, used to handle text-format files: TEXTFILE
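
To make the division of labor concrete, below is a minimal sketch of a custom SerDe that exposes each line of a text file as a single string column. It is an illustration only, assuming the Hive 2.x org.apache.hadoop.hive.serde2.AbstractSerDe API; the class name SingleColumnTextSerDe and the column name "line" are made up.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical SerDe: every line read from the file becomes one string field.
public class SingleColumnTextSerDe extends AbstractSerDe {

    private ObjectInspector inspector;
    private final List<Object> row = new ArrayList<>(1);

    @Override
    public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        // Declare a single column named "line" of type string.
        inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
                Collections.singletonList("line"),
                Collections.<ObjectInspector>singletonList(
                        PrimitiveObjectInspectorFactory.javaStringObjectInspector));
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // Called for every record read from the file: raw Writable -> row object.
        row.clear();
        row.add(blob.toString());
        return row;
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        // Tells Hive how to pull fields out of the row object returned above.
        return inspector;
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
        // Called when writing a row out: row object -> raw Writable.
        StructObjectInspector soi = (StructObjectInspector) objInspector;
        Object field = soi.getStructFieldsDataAsList(obj).get(0);
        return new Text(field == null ? "" : field.toString());
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null;
    }
}

LazySimpleSerDe plays the same role, but parses delimited, multi-column rows instead of a single column.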

Full text >>

Packaging a mixed Scala + Java project with Maven

When mixing Scala and Java in one project, some extra configuration has to be added to the pom so that the classes compiled from the Scala files make it into the final jar:

<build>
    <pluginManagement>
        <plugins>
            <!-- Note: plugins declared only under pluginManagement set defaults; they must
                 also be referenced under <build><plugins> (or in a child module) to run. -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <id>scala-compile</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <configuration>
                            <!-- includes lists the source files to compile -->
                            <includes>
                                <include>**/*.scala</include>
                            </includes>
                        </configuration>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <descriptorRefs>
                        <!-- build an additional jar that bundles all dependencies -->
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </pluginManagement>
</build>

Full text >>

InputFormat in MapReduce

In the Hadoop source code, InputFormat is an abstract class: public abstract class InputFormat<K, V>

https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/InputFormat.java

For reference, see this article:

https://cloud.tencent.com/developer/article/1043622

It declares two abstract methods:

public abstract List<InputSplit> getSplits(JobContext context)
    throws IOException, InterruptedException;

public abstract RecordReader<K, V> createRecordReader(InputSplit split,
    TaskAttemptContext context) throws IOException, InterruptedException;

What InputFormat does is split the input files into a List<InputSplit> with getSplits(), then use createRecordReader() to parse each split into records, and finally present each record to the mapper as a <K, V> pair.
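
As a concrete sketch (the class name is made up for illustration), the InputFormat below inherits getSplits() from FileInputFormat and hands each split to Hadoop's LineRecordReader, which turns it into <byte offset, line text> records:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat: reuses FileInputFormat.getSplits() to cut the input
// into InputSplits, then parses each split into <LongWritable, Text> records.
public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // One RecordReader per split; LineRecordReader emits
        // <offset of the line in the file, content of the line>.
        return new LineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Plain text can be split at arbitrary offsets; return false for
        // formats that must be read as a whole (e.g. gzip-compressed files).
        return true;
    }
}

A job would select it with job.setInputFormatClass(MyLineInputFormat.class), and each map task then receives the <K, V> pairs produced by the RecordReader.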

Full text >>

Installing protobuf on Ubuntu 16.04

1. proto2

1. The protobuf GitHub repository:

https://github.com/protocolbuffers/protobuf

Go to the releases page to download the version you need:

https://github.com/protocolbuffers/protobuf/releases

Choose version 2.5.0:

https://github.com/protocolbuffers/protobuf/releases/tag/v2.5.0

Download it:

wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz

Extract, build, and install:

tar -xzf protobuf-2.5.0.tar.gz && cd protobuf-2.5.0
./autogen.sh    # only needed for a git checkout; the release tarball already ships configure
./configure
make
make check
sudo make install
sudo ldconfig   # refresh the shared library cache so protoc can find libprotoc

Once installed, check the version:

protoc --version
libprotoc 2.5.0

See Google's Java tutorial:

https://developers.google.com/protocol-buffers/docs/javatutorial
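
As a quick taste of the generated Java API, the tutorial compiles an addressbook.proto into (among others) a Person message class. A minimal sketch of building, serializing, and parsing one message, assuming the tutorial's generated code (java_package com.example.tutorial) is on the classpath:

import com.example.tutorial.AddressBookProtos.Person;

public class ProtoDemo {
    public static void main(String[] args) throws Exception {
        // Build an immutable message via the generated builder.
        Person john = Person.newBuilder()
                .setId(1234)
                .setName("John Doe")
                .setEmail("jdoe@example.com")
                .build();

        // Serialize to the compact binary wire format ...
        byte[] bytes = john.toByteArray();

        // ... and parse it back.
        Person parsed = Person.parseFrom(bytes);
        System.out.println(parsed.getName());
    }
}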

The protobuf data types are documented here:

https://developers.google.com/protocol-buffers/docs/proto

Full text >>