tonglin0325's personal homepage

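Machine Learning: Decision Trees

1. Building a decision tree

Advantages: low computational complexity, easily interpretable output, insensitive to missing intermediate values, and able to handle irrelevant features

Disadvantages: prone to overfitting

Applicable data types: numeric and nominal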


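ScyllaDB Basics

1. Deploying Scylla

Single-node deployment with Docker

ScyllaDB can be started from its Docker image

Cluster deployment with Docker

The Docker image can also be used to deploy a ScyllaDB cluster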

docker run --name scylla -p 9042:9042 -p 9160:9160 -p 10000:10000 -p 9180:9180 -v /var/lib/scylla:/var/lib/scylla -d scylladb/scylla

docker run --name scylla-node2 -p 8042:9042 -p 8160:9160 -p 1000:10000 -p 8180:9180 -v /var/lib/scylladb2:/var/lib/scylla -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' scylla)"

docker run --name scylla-node3 -p 10042:9042 -p 10160:9160 -p 1100:10000 -p 10180:9180 -v /var/lib/scylladb3:/var/lib/scylla -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' scylla)"

docker run --name scylla-node4 -p 11042:9042 -p 11160:9160 -p 1200:10000 -p 11180:9180 -v /var/lib/scylladb4:/var/lib/scylla -d scylladb/scylla --seeds="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' scylla)"

  

2. Working with ScyllaDB in cqlsh

In cqlsh you can use CQL (the Cassandra Query Language) to perform basic operations on ScyllaDB

sh-4.2# cqlsh
Connected to at 172.17.0.3:9042.
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh>

Reference: CQLSh: the CQL shell

  

3. ScyllaDB operations

Scylla stores data in tables, and tables are grouped into keyspaces

Creating a keyspace

The keyspace is named test

cqlsh> CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy','replication_factor':1};
cqlsh> describe keyspaces;

system_schema system_auth system system_distributed test system_traces

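The REPLICATION parameter specifies the replication strategy. When REPLICATION is used, class must be specified; the available classes are SimpleStrategy and NetworkTopologyStrategy. Because this is a single-node test, I set the replication factor to 1

Creating a table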

cqlsh> use test;
cqlsh:test>

CREATE TABLE demo (
user_id int,
str text,
mtime timestamp,
PRIMARY KEY (user_id, mtime)
) WITH CLUSTERING ORDER BY (mtime DESC);

cqlsh:test> DESCRIBE TABLES

demo

cqlsh:test>

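The PRIMARY KEY clause defines the primary key: user_id is the partition key and mtime is the clustering column, so rows are grouped by user_id and ordered by mtime within each group

The CLUSTERING ORDER BY clause specifies that mtime is sorted in descending order

Inserting data into a Scylla table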

INSERT INTO demo (
user_id,str,mtime
) VALUES
(6,'test','2021-10-09 07:00:00')
using TTL 86400;
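
The TTL in the statement above is given in seconds, so 86400 means the inserted row expires one day after it is written. A minimal Python sketch of that arithmetic follows; the function name and the write time are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def expiry_time(written_at: datetime, ttl_seconds: int) -> datetime:
    """Return the moment at which data written at `written_at` expires."""
    return written_at + timedelta(seconds=ttl_seconds)


# Hypothetical write time. The TTL clock starts when the row is written,
# not at the value stored in the mtime column.
written_at = datetime(2021, 10, 9, 8, 0, 0, tzinfo=timezone.utc)
print(expiry_time(written_at, 86400))  # 2021-10-10 08:00:00+00:00
```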

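As the query below shows, rows with the same user_id key are grouped together, and within each user_id the rows are sorted by mtime in descending order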

cqlsh:test> select * from demo;

 user_id | mtime                           | str
---------+---------------------------------+-------
       5 | 2021-10-09 02:00:00.000000+0000 | Panda
       1 | 2021-10-09 04:00:00.000000+0000 |   Kay
       1 | 2021-10-01 02:00:00.000000+0000 |   Kay
       2 | 2021-10-09 03:00:00.000000+0000 | Snail
       2 | 2021-10-01 02:00:00.000000+0000 | Snail
       6 | 2021-10-09 07:00:00.000000+0000 |  test

(6 rows)
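
The grouping and ordering in this result can be mimicked in plain Python. The sketch below is only an illustration and does not use Scylla itself: the `layout` function is a hypothetical name, and the sample rows are a subset of the output above. It groups rows by user_id in first-seen order, whereas a real cluster orders partitions by the token of the partition key, which is why the query above returns 5, 1, 2, 6 rather than ascending ids.

```python
from collections import defaultdict


def layout(rows):
    """Group rows by user_id, then sort each group by mtime, newest first."""
    partitions = defaultdict(list)
    for user_id, mtime, text in rows:
        partitions[user_id].append((mtime, text))
    # Timestamps in this fixed format sort correctly as plain strings.
    return {
        user_id: sorted(cells, key=lambda cell: cell[0], reverse=True)
        for user_id, cells in partitions.items()
    }


rows = [
    (1, "2021-10-01 02:00:00", "Kay"),
    (2, "2021-10-01 02:00:00", "Snail"),
    (1, "2021-10-09 04:00:00", "Kay"),
    (2, "2021-10-09 03:00:00", "Snail"),
]
for user_id, cells in layout(rows).items():
    for mtime, text in cells:
        print(user_id, mtime, text)
```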

Reference: Data Definition

Checking the Scylla cluster status

[root@6a30e1b8fc71 /]# nodetool status
Using /etc/scylla/scylla.yaml as the config file
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load     Tokens  Owns  Host ID                               Rack
UN  172.17.0.2  1.49 MB  256     ?     eaee8765-450d-4d4e-a7b5-2ed4c6b20df3  rack1
UN  172.17.0.5  1.04 MB  256     ?     17153bb7-f4f1-4436-bc49-f1eca3409040  rack1
UN  172.17.0.4  1.03 MB  256     ?     25fd6224-7edc-4161-bf49-ba6fe51c9f73  rack1
UN  172.17.0.6  1.04 MB  256     ?     3d5757d2-dc5f-4ba4-8d90-e02eb9d4c255  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless  
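
In this output, each node line starts with a two-letter code: the first letter is the status (Up/Down) and the second is the state (Normal/Leaving/Joining/Moving), so UN means the node is up and normal. If you want to check this from a script, one possible approach is to parse the text, as in the sketch below. The function names are hypothetical, and the parsing assumes the line layout shown above.

```python
import re

# A node line starts with a status letter (U/D) and a state letter (N/L/J/M),
# followed by whitespace and the node address.
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)\s")


def node_states(status_output: str) -> dict:
    """Map each node address to its two-letter code, e.g. 'UN'."""
    states = {}
    for line in status_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


def all_up_normal(status_output: str) -> bool:
    """Return True when at least one node is listed and every node is 'UN'."""
    states = node_states(status_output)
    return bool(states) and all(code == "UN" for code in states.values())


sample = """Datacenter: datacenter1
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 172.17.0.2 1.49 MB 256 ? eaee8765-450d-4d4e-a7b5-2ed4c6b20df3 rack1
UN 172.17.0.5 1.04 MB 256 ? 17153bb7-f4f1-4436-bc49-f1eca3409040 rack1
"""
print(node_states(sample))
print(all_up_normal(sample))
```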

Checking table statistics

nodetool tablestats my_ks.my_tb

Adding a column

ALTER TABLE xx.xx ADD col_name col_type;

Dropping a Scylla table


Hive Study Notes: beeline

Connecting to Hive with beeline

kinit -kt xxx.keytab xxx
beeline -u "jdbc:hive2://10.65.13.98:10000/default;principal=hive/_HOST@CLOUDERA.SITE"

Reference:

https://docs.cloudera.com/runtime/7.2.7/securing-hive/topics/hive_remote_data_access.html

To run a SQL statement directly from the command line, you can use

beeline -u "jdbc:hive2://10.65.13.98:10000/default;principal=hive/_HOST@CLOUDERA.SITE" --silent=true --outputformat=tsv2 --showHeader=false -e "select * from xxx.xxx"
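
With --silent=true, --outputformat=tsv2 and --showHeader=false, the command prints only tab-separated data rows, which makes the output easy to consume from a script. The sketch below shows one way this might look in Python. The helper names and the sample output are hypothetical; only the beeline flags come from the command above.

```python
import csv
import io


def beeline_args(jdbc_url: str, sql: str) -> list:
    """Build the argument list for the beeline invocation shown above."""
    return [
        "beeline", "-u", jdbc_url,
        "--silent=true", "--outputformat=tsv2", "--showHeader=false",
        "-e", sql,
    ]


def parse_tsv2(output: str) -> list:
    """Split tab-separated beeline output into rows, skipping empty lines."""
    reader = csv.reader(io.StringIO(output), delimiter="\t")
    return [row for row in reader if row]


# Hypothetical output for a two-column table.
sample_output = "1\tKay\n2\tSnail\n"
print(parse_tsv2(sample_output))  # [['1', 'Kay'], ['2', 'Snail']]
```

The argument list can be passed to `subprocess.run(..., capture_output=True, text=True)` after kinit has obtained a ticket, and the captured stdout handed to `parse_tsv2`.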

Exiting beeline

!quit