tonglin0325's Personal Homepage

Paper Reading: Twitter's Logging System

1. The scale of data platform construction at companies in the industry

1. Twitter

Twitter has published the following two papers on its logging system:

"The Unified Logging Infrastructure for Data Analytics at Twitter" and "Scaling Big Data Mining Infrastructure: The Twitter Experience"

https://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf
https://www.kdd.org/exploration_files/V14-02-02-Lin.pdf

Related slides:

https://www.slideshare.net/Hadoop_Summit/scaling-big-data-mining-infrastructure-twitter-experience
https://slideplayer.com/slide/12451118/
https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
https://www.slideshare.net/kevinweil/hadoop-at-twitter-hadoop-summit-2010
https://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter/1-Twitter_Open_Source_Coming_soon

Of these, the 2012 paper "The Unified Logging Infrastructure for Data Analytics at Twitter" describes Twitter's production logging infrastructure and its evolution from application-specific logs to a unified "client event" log format, in which every message is a Thrift message.

Full text >>

Hive Study Notes: SerDe

SerDe is short for Serializer and Deserializer; it is the mechanism through which Hive reads and writes data in various formats.

Amazon Athena can be regarded as Amazon's counterpart to Hive; its documentation introduces SerDe here:

https://docs.aws.amazon.com/zh_cn/athena/latest/ug/serde-about.html

The SerDes commonly used in Hive are listed below, with reference to:

Hive_10. Hive中常用的 SerDe 和 当前社区的状态

1. LazySimpleSerDe, used to handle text-format files: TEXTFILE
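
To make the division of labor concrete, below is a minimal sketch of a custom SerDe that exposes each line of a text file as a single string column. It is an illustration only, assuming the Hive 2.x org.apache.hadoop.hive.serde2.AbstractSerDe API; the class name SingleColumnTextSerDe and the column name "line" are made up.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical SerDe: every line read from the file becomes one string field.
public class SingleColumnTextSerDe extends AbstractSerDe {

    private ObjectInspector inspector;
    private final List<Object> row = new ArrayList<>(1);

    @Override
    public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        // Declare a single column named "line" of type string.
        inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
                Collections.singletonList("line"),
                Collections.<ObjectInspector>singletonList(
                        PrimitiveObjectInspectorFactory.javaStringObjectInspector));
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // Called for every record read from the file: raw Writable -> row object.
        row.clear();
        row.add(blob.toString());
        return row;
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        // Tells Hive how to pull fields out of the row object returned above.
        return inspector;
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
        // Called when writing a row out: row object -> raw Writable.
        StructObjectInspector soi = (StructObjectInspector) objInspector;
        Object field = soi.getStructFieldsDataAsList(obj).get(0);
        return new Text(field == null ? "" : field.toString());
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null;
    }
}

LazySimpleSerDe plays the same role, but parses delimited, multi-column rows instead of a single column.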

Full text >>

Packaging a mixed Scala + Java project with Maven

When mixing Scala and Java in one project, some extra configuration has to be added to the pom so that the classes compiled from the Scala files make it into the final jar:

<build>
    <pluginManagement>
        <plugins>
            <!-- Note: plugins declared only under pluginManagement set defaults; they must
                 also be referenced under <build><plugins> (or in a child module) to run. -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <id>scala-compile</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <configuration>
                            <!-- includes lists the source files to compile -->
                            <includes>
                                <include>**/*.scala</include>
                            </includes>
                        </configuration>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <descriptorRefs>
                        <!-- build an additional jar that bundles all dependencies -->
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </pluginManagement>
</build>

Full text >>

InputFormat in MapReduce

In the Hadoop source code, InputFormat is an abstract class: public abstract class InputFormat<K, V>

https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/InputFormat.java

For reference, see this article:

https://cloud.tencent.com/developer/article/1043622

It declares two abstract methods:

public abstract List<InputSplit> getSplits(JobContext context)
    throws IOException, InterruptedException;

public abstract RecordReader<K, V> createRecordReader(InputSplit split,
    TaskAttemptContext context) throws IOException, InterruptedException;

What InputFormat does is split the input files into a List<InputSplit> with getSplits(), then use createRecordReader() to parse each split into records, and finally present each record to the mapper as a <K, V> pair.
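
As a concrete sketch (the class name is made up for illustration), the InputFormat below inherits getSplits() from FileInputFormat and hands each split to Hadoop's LineRecordReader, which turns it into <byte offset, line text> records:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat: reuses FileInputFormat.getSplits() to cut the input
// into InputSplits, then parses each split into <LongWritable, Text> records.
public class MyLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        // One RecordReader per split; LineRecordReader emits
        // <offset of the line in the file, content of the line>.
        return new LineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Plain text can be split at arbitrary offsets; return false for
        // formats that must be read as a whole (e.g. gzip-compressed files).
        return true;
    }
}

A job would select it with job.setInputFormatClass(MyLineInputFormat.class), and each map task then receives the <K, V> pairs produced by the RecordReader.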

Full text >>

Installing protobuf on Ubuntu 16.04

1. proto2

1. The protobuf GitHub repository:

https://github.com/protocolbuffers/protobuf

Go to the releases page to download the version you need:

https://github.com/protocolbuffers/protobuf/releases

Choose version 2.5.0:

https://github.com/protocolbuffers/protobuf/releases/tag/v2.5.0

Download it:

wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz

Extract, build, and install:

tar -xzf protobuf-2.5.0.tar.gz && cd protobuf-2.5.0
./autogen.sh    # only needed for a git checkout; the release tarball already ships configure
./configure
make
make check
sudo make install
sudo ldconfig   # refresh the shared library cache so protoc can find libprotoc

Once installed, check the version:

protoc --version
libprotoc 2.5.0

See Google's Java tutorial:

https://developers.google.com/protocol-buffers/docs/javatutorial
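
As a quick taste of the generated Java API, the tutorial compiles an addressbook.proto into (among others) a Person message class. A minimal sketch of building, serializing, and parsing one message, assuming the tutorial's generated code (java_package com.example.tutorial) is on the classpath:

import com.example.tutorial.AddressBookProtos.Person;

public class ProtoDemo {
    public static void main(String[] args) throws Exception {
        // Build an immutable message via the generated builder.
        Person john = Person.newBuilder()
                .setId(1234)
                .setName("John Doe")
                .setEmail("jdoe@example.com")
                .build();

        // Serialize to the compact binary wire format ...
        byte[] bytes = john.toByteArray();

        // ... and parse it back.
        Person parsed = Person.parseFrom(bytes);
        System.out.println(parsed.getName());
    }
}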

The protobuf data types are documented here:

https://developers.google.com/protocol-buffers/docs/proto

Full text >>