There are two ways to use parquet-tools.
1. On a machine with CDH installed, the parquet-tools command is available automatically

    lintong@master:/opt/cloudera/parcels/CDH/bin$ ls | grep parquet-tools
    parquet-tools
    lintong@master:/opt/cloudera/parcels/CDH/bin$ parquet-tools

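As a minimal sketch of the check above, the following function looks for a parquet-tools launcher on `PATH` first and then in a list of candidate directories. The function name `find_parquet_tools` is hypothetical, and the default directory is simply the CDH parcel path shown in the listing; adjust it if your parcels live elsewhere.

```shell
# Hypothetical helper: print the path of a parquet-tools launcher, or return 1.
# Checks PATH first, then each directory passed as an argument.
# With no arguments it falls back to the CDH parcel bin directory from the listing above.
find_parquet_tools() {
  if command -v parquet-tools >/dev/null 2>&1; then
    command -v parquet-tools
    return 0
  fi
  if [ "$#" -eq 0 ]; then
    set -- /opt/cloudera/parcels/CDH/bin
  fi
  for dir in "$@"; do
    if [ -x "$dir/parquet-tools" ]; then
      printf '%s\n' "$dir/parquet-tools"
      return 0
    fi
  done
  return 1
}
```

Calling `find_parquet_tools || echo "parquet-tools not found"` on a CDH node should print the launcher path; on other machines it reports that the command is missing, in which case the jar can be built from source.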
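2. Build the jar yourself
Clone the repository with git clone and select a specific release with -b, because parquet-tools has been removed from the master branch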

    git clone git@github.com:apache/parquet-mr.git -b apache-parquet-1.10.1

Build

    cd parquet-mr/parquet-tools && mvn clean package -Plocal

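Once the build finishes, the jar ends up under `parquet-tools/target`, as the prompts in the later listings show. A small wrapper can save retyping `java -jar ...` each time. This is only a sketch: the function name `pq` and the variable `PARQUET_TOOLS_JAR` are assumptions made for illustration, not part of parquet-tools.

```shell
# Assumed variable: path to the locally built jar (relative to parquet-mr/parquet-tools).
PARQUET_TOOLS_JAR="${PARQUET_TOOLS_JAR:-target/parquet-tools-1.10.1.jar}"

# Hypothetical wrapper: run a parquet-tools subcommand through the built jar.
pq() {
  if [ ! -f "$PARQUET_TOOLS_JAR" ]; then
    echo "jar not found: $PARQUET_TOOLS_JAR (run the mvn build first)" >&2
    return 1
  fi
  java -jar "$PARQUET_TOOLS_JAR" "$@"
}
```

With this in place, `pq schema /xxx/avro_parquet/part-r-00000.snappy.parquet` would be equivalent to the full `java -jar parquet-tools-1.10.1.jar schema ...` invocation.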
For the commands that parquet-tools supports, see: How to build and use parquet-tools to read parquet files
1. View the schema of a parquet file
Schema of a parquet file written by AvroParquet

    lintong@lintongdeMacBook-Pro ~/coding/java/parquet-mr/parquet-tools/target $ java -jar parquet-tools-1.10.1.jar schema /xxx/avro_parquet/part-r-00000.snappy.parquet
    message com.linkedin.haivvreo.test_serializer {
      required binary string1 (UTF8);
      required int32 int1;
      required int32 tinyint1;
      required int32 smallint1;
      required int64 bigint1;
      required boolean boolean1;
      required float float1;
      required double double1;
      required group list1 (LIST) {
        repeated binary array (UTF8);
      }
      required group map1 (MAP) {
        repeated group map (MAP_KEY_VALUE) {
          required binary key (UTF8);
          required int32 value;
        }
      }
      required group struct1 {
        required int32 sInt;
        required boolean sBoolean;
        required binary sString (UTF8);
      }
      required binary enum1 (ENUM);
      optional int32 nullableint;
    }

Schema of a parquet file written by ThriftParquet

    lintong@lintongdeMacBook-Pro ~/coding/java/parquet-mr/parquet-tools/target $ java -jar parquet-tools-1.10.1.jar schema /xxx/thrift_parquet/part-r-00000.snappy.parquet
    message ParquetSchema {
      required binary string1 (UTF8) = 1;
      required int32 int1 = 2;
      required int32 tinyint1 = 3;
      required int32 smallint1 = 4;
      required int64 bigint1 = 5;
      required boolean boolean1 = 6;
      required double float1 = 7;
      required double double1 = 8;
      required group list1 (LIST) = 9 {
        repeated binary list1_tuple (UTF8);
      }
      required group map1 (MAP) = 10 {
        repeated group map (MAP_KEY_VALUE) {
          required binary key (UTF8);
          optional int32 value;
        }
      }
      required group struct1 = 11 {
        required int32 sInt = 1;
        required boolean sBoolean = 2;
        required binary sString (UTF8) = 3;
      }
      required binary enum1 (UTF8) = 12;
      optional int32 nullableint = 13;
    }

Schema of a parquet file written by a Hive job

    message hive_schema {
      optional binary appid (UTF8);
      optional int64 ts;
      optional int32 userid;
      optional binary countries (UTF8);
    }

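When comparing files produced by different writers, it can help to reduce the schema output to its top-level columns. The sketch below assumes the `schema` command prints one field per line, as in the listings above; the function name `top_level_fields` is hypothetical. It tracks brace depth with awk and prints the repetition, type, and name of each top-level field, dropping annotations and field ids.

```shell
# Hypothetical helper: read `parquet-tools schema` output on stdin and print
# "repetition type name" for each top-level field (nested fields are skipped).
top_level_fields() {
  awk '
    /^message / { depth = 1; next }   # opening line of the schema
    depth == 1 && $1 ~ /^(required|optional|repeated)$/ {
      name = $3
      sub(/;$/, "", name)             # primitive fields may end with ";"
      print $1, $2, name
    }
    index($0, "{") { depth++ }        # entering a nested group
    index($0, "}") { depth-- }        # leaving a group (or the message)
  '
}
```

Piping two schema outputs through this function and into `diff` would, for the Avro and Thrift files shown above, highlight that `float1` is stored as `float` in one and `double` in the other.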
2. View the head of a parquet file

    java -jar parquet-tools-1.10.1.jar head -n 1 /xxx/thrift_parquet/part-r-00000.snappy.parquet
    string1 = ecAsz6ca7E
    int1 = 64676
    tinyint1 = 8
    smallint1 = 0
    bigint1 = -9081354042296389692
    boolean1 = true
    float1 = 0.1271180510520935
    double1 = 0.011293589263621895
    list1:
    .list1_tuple = v8gCJFRBIb
    .list1_tuple = nfvrI1Rltp
    map1:
    .map:
    ..key = v8gCJFRBIb
    ..value = 428
    .map:
    ..key = nfvrI1Rltp
    ..value = 1257
    struct1:
    .sInt = 740564
    .sBoolean = true
    .sString = RuiVISF2BI
    enum1 = BLUE
    nullableint = 5559

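The `head` output is line-oriented: top-level fields appear as `name = value`, and nested fields are prefixed with dots. Assuming that layout, a single top-level field can be pulled out with awk. The function name `head_field` below is hypothetical and only meant as a sketch.

```shell
# Hypothetical helper: read `parquet-tools head` output on stdin and print the
# value of one top-level field. Nested fields (prefixed with ".") do not match.
head_field() {
  awk -v field="$1" '$1 == field && $2 == "=" {
    sub(/^[^=]*= /, "")   # drop "name = " and keep the value
    print
  }'
}
```

For example, `java -jar parquet-tools-1.10.1.jar head -n 1 /xxx/thrift_parquet/part-r-00000.snappy.parquet | head_field enum1` would print only the `enum1` value of the first record.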