Combining file storage formats with data compression
Comparing compression ratio and query speed:
TextFile:
Create a table whose storage format is TEXTFILE:
create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Load data into the table: load data local inpath '/export/servers/hivedatas/log.data' into table log_text;
Check the table's data size: dfs -du -h /user/hive/warehouse/myhive.db/log_text;
ORC:
Create a table whose storage format is ORC:
create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc;
Load data into the table: insert into table log_orc select * from log_text;
Check the table's data size: dfs -du -h /user/hive/warehouse/myhive.db/log_orc;
Parquet:
Create a table whose storage format is PARQUET:
create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET;
Load data into the table: insert into table log_parquet select * from log_text;
Check the table's data size: dfs -du -h /user/hive/warehouse/myhive.db/log_parquet;
Summary of the storage files' compression ratios:
ORC > Parquet > TextFile
Query speed test on the storage files:
TextFile:
hive (default)> select count(*) from log_text;
Time taken: 21.54 seconds, Fetched: 1 row(s)
ORC:
hive (default)> select count(*) from log_orc;
Time taken: 20.867 seconds, Fetched: 1 row(s)
Parquet:
hive (default)> select count(*) from log_parquet;
Time taken: 22.922 seconds, Fetched: 1 row(s)
Query speed of the storage files:
ORC > TextFile > Parquet
Specifying the compression codec for ORC storage:
Official docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
Compression for the ORC storage format:
Create an uncompressed ORC table:
DDL:
create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
Insert data: insert into table log_orc_none select * from log_text;
Check the data size after insertion:
dfs -du -h /user/hive/warehouse/myhive.db/log_orc_none;
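Note that when no orc.compress property is given (as in the earlier log_orc table), Hive's ORC format defaults to ZLIB compression, which is why a plain ORC table is already much smaller than this uncompressed variant. A minimal sketch that spells out the default explicitly (the table name log_orc_zlib is illustrative, not part of the test above):

```sql
-- Same schema as the tables above; "orc.compress"="ZLIB" simply makes
-- explicit the codec an ORC table uses when the property is omitted.
create table log_orc_zlib(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="ZLIB");

insert into table log_orc_zlib select * from log_text;
```

Its size should come out close to that of log_orc, since both use the same (default) codec.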
Create a SNAPPY-compressed ORC table:
DDL:
create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
Insert data: insert into table log_orc_snappy select * from log_text;
Check the data size after insertion: dfs -du -h /user/hive/warehouse/myhive.db/log_orc_snappy;
Summary of storage formats and compression:
In real-world project development, Hive table data is generally stored as ORC or Parquet, and Snappy is the usual choice of compression codec.
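The recommended combination can be expressed in a single table definition. A sketch of a Snappy-compressed Parquet table, mirroring the ORC example above (the parquet.compression table property is what Hive's Parquet SerDe reads; the table name log_parquet_snappy is illustrative):

```sql
-- Parquet storage with Snappy compression: the combination recommended above.
-- "parquet.compression" plays the same role for Parquet tables that
-- "orc.compress" plays for ORC tables.
create table log_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet tblproperties ("parquet.compression"="SNAPPY");

insert into table log_parquet_snappy select * from log_text;
```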