Hadoop Study Notes: HDFS

1. View the block info of an HDFS file

An abnormal file (its write was never properly closed):

hdfs fsck /logs/xxx/xxxx.gz.gz -files -blocks -locations
Connecting to namenode via http://xxx-01:50070/fsck?ugi=xxx&files=1&blocks=1&locations=1&path=%2Flogs%2Fxxx%2Fxxx%2F401294%2Fds%3Dxxxx-07-14%2Fxxx.gz.gz
FSCK started by xxxx (auth:KERBEROS_SSL) from /10.90.1.91 for path xxxxx.gz.gz at Mon Jul 15 11:44:13 CST 2019
Status: HEALTHY
Total size: 0 B (Total open files size: 194 B)
Total dirs: 0
Total files: 0
Total symlinks: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 1)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 99
Number of racks: 3
FSCK ended at Mon Jul 15 11:44:13 CST 2019 in 0 milliseconds
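
Note the telltale signs in this output: Total size is 0 B while the open files size is 194 B, and the only block is an open file block that was not validated ("Files currently being written: 1"). The writer died before closing the file, so the NameNode still holds its lease. To find every file stuck in this state under a path, fsck has an -openforwrite flag:

hdfs fsck /logs/xxx -openforwrite

(/logs/xxx is a placeholder; point it at the directory you want to scan.)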

A normal file:

Connecting to namenode via http://xxx:50070/fsck?ugi=xxx&files=1&blocks=1&locations=1&path=%2Fxxx%2Fxxx%2Fxxx%2F401294%2Fds%3Dxxx-07-14%2Fxx.gz
FSCK started by xxxx (auth:KERBEROS_SSL) from /10.90.1.91 for path /logs/xxxx.gz at Mon Jul 15 11:46:12 CST 2019
/logs/xxxx.gz 74745 bytes, 1 block(s): OK
0. BP-1760298736-10.90.1.6-1536234810107:blk_1392467116_318836510 len=74745 Live_repl=3 [DatanodeInfoWithStorage[10.90.1.99:1004,DS-9d465b1f-943f-4716-bce0-8b36e5631b4a,DISK], DatanodeInfoWithStorage[10.90.1.216:1004,DS-160924c6-4cd7-4822-93c0-9ac9cf9c5784,DISK], DatanodeInfoWithStorage[10.90.1.191:1004,DS-d0a2e418-610f-4bef-8f1d-4ce045533656,DISK]]

Status: HEALTHY
Total size: 74745 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 74745 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 99
Number of racks: 3
FSCK ended at Mon Jul 15 11:46:12 CST 2019 in 1 milliseconds

2. Command to repair an HDFS file

hdfs debug recoverLease -path /logs/xxxx.gz.gz -retries 3
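
recoverLease asks the NameNode to close the file and release the stale lease. When many files are stuck open at once, detection and repair can be combined. A rough sketch (the path is a placeholder, and it assumes open files are marked with the string OPENFORWRITE in the fsck output, as current Hadoop versions do):

#!/usr/bin/env bash
# recover the lease of every file still open for write under /logs;
# fsck prints the file path as the first field of OPENFORWRITE lines
for f in $(hdfs fsck /logs -openforwrite 2>/dev/null | grep OPENFORWRITE | awk '{print $1}'); do
    hdfs debug recoverLease -path "$f" -retries 3
done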

After the repair:

hdfs fsck /logs/xxx.gz.gz -files -blocks -locations
Connecting to namenode via http://xxx-01:50070/fsck?ugi=xxx&files=1&blocks=1&locations=1&path=%2Flogs%2Fnsh%2Fjson%2F401294%2Fds%3D2019-07-14%2Fxxx.gz.gz
FSCK started by xxx (auth:KERBEROS_SSL) from /10.90.1.91 for path /logs/xxxx.gz.gz at Mon Jul 15 11:48:01 CST 2019
/logs/xxxx.gz.gz 67157 bytes, 1 block(s): OK
0. BP-1760298736-10.90.1.6-1536234810107:blk_1392594522_319757834 len=67157 Live_repl=3 [DatanodeInfoWithStorage[10.90.1.213:1004,DS-6aee5c90-c834-475e-8f20-7a0f8bd8d315,DISK], DatanodeInfoWithStorage[10.90.1.207:1004,DS-cd79bacc-89ff-4fb3-82b5-79341391ae8d,DISK], DatanodeInfoWithStorage[10.90.1.97:1004,DS-ba5953f8-c0c3-444a-8996-3bcfa1bcf851,DISK]]

Status: HEALTHY
Total size: 67157 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 67157 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 99
Number of racks: 3
FSCK ended at Mon Jul 15 11:48:01 CST 2019 in 1 milliseconds

3. Related blog post: Hadoop学习笔记—HDFS

4. Some scripts

Iterate over an HDFS path and take the last path component. The expansion ${file##*/} removes the longest prefix matching */, leaving only the part after the last / (see: Shell当中的字符串切割); a small standalone demo follows the output below.

#!/usr/bin/env bash

function grep_dir(){
    # authenticate to Kerberos with the service keytab (placeholders)
    kinit -kt xxx.keytab xxx
    # iterate over the children of the HDFS path; column 8 of ls output is the path
    for file in $(hadoop fs -ls /logs/xxxx | awk '{print $8}'); do
        # print the full path and its last component
        echo "$file ${file##*/}"
    done
}

grep_dir

Output:

/logs/xxxx/ds=2019-04-01   ds=2019-04-01
/logs/xxxx/ds=2019-04-02 ds=2019-04-02
/logs/xxxx/ds=2019-04-03 ds=2019-04-03
/logs/xxxx/ds=2019-04-04 ds=2019-04-04
/logs/xxxx/ds=2019-04-05 ds=2019-04-05
/logs/xxxx/ds=2019-04-06 ds=2019-04-06
/logs/xxxx/ds=2019-04-07 ds=2019-04-07
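
For reference, the two complementary expansions behave like this (plain bash, no HDFS needed):

file=/logs/xxxx/ds=2019-04-01
echo "${file##*/}"   # ds=2019-04-01 : removes the longest prefix matching */
echo "${file%/*}"    # /logs/xxxx    : removes the shortest suffix matching /*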

Iterate over an HDFS path and extract an intermediate path component, where

logid=$(echo "$file" | awk -F "/" '{print $5}')

splits the path on / and takes field 5 of the result. Note that awk fields are 1-indexed and the leading / produces an empty field 1; a quick check follows the output below.

#!/usr/bin/env bash

function grep_dir(){
    # authenticate to Kerberos with the service keytab (placeholders)
    kinit -kt xxxx.keytab xxxx
    # iterate over the children of the HDFS path
    for file in $(hadoop fs -ls /logs/xxx/json/status-1 | awk '{print $8}'); do
        echo "$file ${file##*/}"
        # split the path on / and take field 5 (here: status-1)
        logid=$(echo "$file" | awk -F "/" '{print $5}')
        echo "$logid"
    done
}

grep_dir

Output:

/logs/xxx/json/status-1/ds=2021-04-18   ds=2021-04-18
status-1
/logs/xxx/json/status-1/ds=2021-04-19 ds=2021-04-19
status-1
/logs/xxx/json/status-1/ds=2021-04-20 ds=2021-04-20
status-1
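
The split can be checked without touching HDFS; for the sample path the fields are $1="" (before the leading /), $2=logs, $3=xxx, $4=json, $5=status-1:

echo "/logs/xxx/json/status-1/ds=2021-04-18" | awk -F "/" '{print $5}'
# prints: status-1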

5. WebHDFS

http://master:14000/webhdfs/v1/?op=liststatus&user.name=hadoop
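
The same call can be issued with curl. Port 14000 is the HttpFS gateway; the NameNode's own WebHDFS endpoint usually listens on 50070 (9870 in Hadoop 3). The host master and user hadoop are taken from the URL above:

# list the root directory via HttpFS (simple auth via user.name)
curl -s "http://master:14000/webhdfs/v1/?op=LISTSTATUS&user.name=hadoop"

# on a Kerberos-secured cluster, use SPNEGO negotiation instead
curl -s --negotiate -u : "http://master:14000/webhdfs/v1/?op=LISTSTATUS"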

Reference: WebHDFS与HttpFS的使用