Hudi organizes a dataset into a directory structure under a base path. The dataset is divided into partitions, and each partition is a folder containing that partition's data files, much like a Hive table; partitions are distinguished by their partition paths. Within each partition, files are organized into file groups, each uniquely identified by a file ID. Every file group contains multiple file slices, where each slice consists of a base columnar file (*.parquet) generated at a certain commit/compaction instant, plus a set of log files (*.log*) that hold the inserts/updates applied to the base file since it was generated. The mapping from a set of records to a file group/file ID never changes; in short, a mapped file group contains all versions of its records.

Trade-off | Copy-on-Write | Merge-on-Read
Data latency | Higher | Lower
Update cost (I/O) | Higher (rewrites the entire Parquet file) | Lower (appends to the delta log)
Parquet file size | Smaller (high update cost (I/O)) | Larger (low update cost)
Write amplification | Higher | Lower (depends on the compaction strategy)
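The layout described above can be sketched as a toy model: one file group pinned to a fixed file ID, accumulating file slices as commits and compactions happen. File and instant names here are hypothetical illustrations, not Hudi's actual on-disk naming scheme.

```python
from collections import OrderedDict

class FileGroup:
    """A file group is pinned to one file ID; each commit/compaction
    instant adds a file slice (base file + delta logs)."""
    def __init__(self, file_id):
        self.file_id = file_id
        self.slices = OrderedDict()          # instant time -> file slice

    def new_slice(self, instant):
        # A commit/compaction instant produces a fresh base columnar file.
        self.slices[instant] = {
            "base": f"{self.file_id}_{instant}.parquet",
            "logs": [],
        }

    def append_log(self, instant, version):
        # Later inserts/updates are appended as *.log* files of the
        # newest slice instead of rewriting the base file.
        latest = next(reversed(self.slices))
        self.slices[latest]["logs"].append(
            f".{self.file_id}_{instant}.log.{version}")

fg = FileGroup("fg-0001")
fg.new_slice("20240101090000")      # first commit: base file only
fg.append_log("20240101100000", 1)  # updates land in delta logs
fg.append_log("20240101110000", 2)
fg.new_slice("20240101120000")      # compaction: new slice, new base file
print(len(fg.slices))               # 2
```

Note how the file ID stays constant across slices: readers can always find every version of a record set inside the same file group.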
cd /usr/local/service/hudi
ln -s /usr/local/service/spark/conf/spark-defaults.conf /usr/local/service/hudi/demo/config/spark-defaults.conf
hdfs dfs -mkdir -p /hudi/config
hdfs dfs -copyFromLocal demo/config/* /hudi/config/
Edit /usr/local/service/hudi/demo/config/kafka-source.properties and point it at your Kafka broker:
bootstrap.servers=kafka_ip:kafka_port
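As a concrete illustration of the substitution (the address below is a placeholder based on the example IP used later in this guide and Kafka's default port; keep any other keys in the shipped demo file unchanged):

```properties
# Replace with the private IP and port of a broker in your Kafka cluster.
bootstrap.servers=10.0.1.70:9092
```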
cat demo/data/batch_1.json | kafkacat -b [kafka_ip] -t stock_ticks -P
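The batch files are newline-delimited JSON, one record per line, which is the shape `kafkacat -P` publishes. A record carrying the fields the later queries read could be produced as below; the exact field set of demo/data/batch_1.json is an assumption here, inferred from the SQL and sync options used in this guide.

```python
import json

# Hypothetical stock-tick record; the real demo file may carry more fields.
# 'ts' is the ordering field passed to DeltaStreamer via
# --source-ordering-field, and 'dt' is the partition field synced to Hive.
record = {
    "symbol": "GOOG",
    "ts": "2018-08-31 10:29:00",
    "volume": 3391,
    "open": 1230.50,
    "close": 1230.02,
    "dt": "2018-08-31",
}
line = json.dumps(record)   # one JSON object per line for kafkacat -P
print(line)
```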
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --master yarn ./hudi-utilities-bundle_2.11-0.5.1-incubating.jar --table-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path /usr/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /hudi/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --master yarn ./hudi-utilities-bundle_2.11-0.5.1-incubating.jar --table-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path /usr/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /hudi/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
hdfs dfs -ls /usr/hive/warehouse/
bin/run_sync_tool.sh --jdbc-url jdbc:hive2://[hiveserver2_ip:hiveserver2_port] --user hadoop --pass [password] --partitioned-by dt --base-path /usr/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
bin/run_sync_tool.sh --jdbc-url jdbc:hive2://[hiveserver2_ip:hiveserver2_port] --user hadoop --pass [password] --partitioned-by dt --base-path /usr/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor --skip-ro-suffix
beeline -u jdbc:hive2://[hiveserver2_ip:hiveserver2_port] -n hadoop --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
spark-sql --master yarn --conf spark.sql.hive.convertMetastoreParquet=false
select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor where symbol = 'GOOG';
select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG';
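Before compaction, the query flavors above behave differently: the read-optimized merge-on-read table reads only the base Parquet files, while the _rt table merges delta-log updates on the fly. A toy sketch of that difference, in plain Python rather than Hudi's real reader, with made-up values:

```python
# Base file contents and pending delta-log updates, keyed by record key.
base = {"GOOG": {"symbol": "GOOG", "ts": "10:29", "close": 1230.02}}
logs = [{"symbol": "GOOG", "ts": "10:59", "close": 1227.21}]

def read_optimized(base):
    # _ro view: base parquet only; delta logs are ignored until compaction.
    return dict(base)

def realtime(base, logs):
    # _rt view: merge logs at read time, keeping the row with the
    # greatest ordering field ('ts' here).
    merged = dict(base)
    for rec in logs:
        cur = merged.get(rec["symbol"])
        if cur is None or rec["ts"] > cur["ts"]:
            merged[rec["symbol"]] = rec
    return merged

print(read_optimized(base)["GOOG"]["ts"])   # 10:29 (pre-update value)
print(realtime(base, logs)["GOOG"]["ts"])   # 10:59 (latest value)
```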
/usr/local/service/presto-client/presto --server localhost:9000 --catalog hive --schema default --user Hadoop
"_hoodie_commit_time",执行如下 sql 语句:select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_mor where symbol = 'GOOG';select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
cat demo/data/batch_2.json | kafkacat -b 10.0.1.70 -t stock_ticks -P
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --master yarn ./hudi-utilities-bundle_2.11-0.5.1-incubating.jar --table-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path /usr/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props /hudi/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --master yarn ./hudi-utilities-bundle_2.11-0.5.1-incubating.jar --table-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path /usr/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /hudi/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
cli/bin/hudi-cli.sh
connect --path /usr/hive/warehouse/stock_ticks_mor
compactions show all
compaction schedule
Run the scheduled compaction plan:
compaction run --compactionInstant [requestID] --parallelism 2 --sparkMemory 1G --schemaFilePath /hudi/config/schema.avsc --retry 1
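What the compaction run achieves can be sketched as folding the delta-log records into a new base file keyed by record key, keeping the row with the greatest ordering field, after which the read-optimized and realtime views agree. The values below are made up for illustration.

```python
# Pre-compaction state of one file slice: a base file plus pending logs.
base = {"GOOG": {"symbol": "GOOG", "ts": "10:29", "close": 1230.02}}
logs = [{"symbol": "GOOG", "ts": "10:59", "close": 1227.21}]

def compact(base, logs):
    # Merge log records into the base by key, newest 'ts' wins; the
    # result is the content of the new base file, and the logs are done.
    new_base = dict(base)
    for rec in logs:
        cur = new_base.get(rec["symbol"])
        if cur is None or rec["ts"] > cur["ts"]:
            new_base[rec["symbol"]] = rec
    return new_base, []               # new base file, emptied log set

new_base, new_logs = compact(base, logs)
print(new_base["GOOG"]["ts"])         # 10:59 now visible without merging
print(len(new_logs))                  # 0
```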
beeline -u jdbc:hive2://[hiveserver2_ip:hiveserver2_port] -n hadoop --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
set hive.execution.engine=tez;
set hive.execution.engine=spark;
To store the data on COS, replace the HDFS paths with cosn://[bucket] paths. Refer to the following operations:
bin/kafka-server-start.sh config/server.properties &
cat demo/data/batch_1.json | kafkacat -b kafkaip -t stock_ticks -P
cat demo/data/batch_2.json | kafkacat -b kafkaip -t stock_ticks -P
kafkacat -b kafkaip -L
hdfs dfs -mkdir -p cosn://[bucket]/hudi/config
hdfs dfs -copyFromLocal demo/config/* cosn://[bucket]/hudi/config/
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --master yarn ./hudi-utilities-bundle_2.11-0.5.1-incubating.jar --table-type COPY_ON_WRITE --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path cosn://[bucket]/usr/hive/warehouse/stock_ticks_cow --target-table stock_ticks_cow --props cosn://[bucket]/hudi/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --master yarn ./hudi-utilities-bundle_2.11-0.5.1-incubating.jar --table-type MERGE_ON_READ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path cosn://[bucket]/usr/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props cosn://[bucket]/hudi/config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider --disable-compaction
bin/run_sync_tool.sh --jdbc-url jdbc:hive2://[hiveserver2_ip:hiveserver2_port] --user hadoop --pass isd@cloud --partitioned-by dt --base-path cosn://[bucket]/usr/hive/warehouse/stock_ticks_cow --database default --table stock_ticks_cow
bin/run_sync_tool.sh --jdbc-url jdbc:hive2://[hiveserver2_ip:hiveserver2_port] --user hadoop --pass hive --partitioned-by dt --base-path cosn://[bucket]/usr/hive/warehouse/stock_ticks_mor --database default --table stock_ticks_mor --skip-ro-suffix
beeline -u jdbc:hive2://[hiveserver2_ip:hiveserver2_port] -n hadoop --hiveconf hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat --hiveconf hive.stats.autogather=false
spark-sql --master yarn --conf spark.sql.hive.convertMetastoreParquet=false
Hive SQL:
select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor where symbol = 'GOOG';
select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
select `_hoodie_commit_time`, symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG';
Presto SQL:
/usr/local/service/presto-client/presto --server localhost:9000 --catalog hive --schema default --user Hadoop
select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_cow where symbol = 'GOOG';
select symbol, max(ts) from stock_ticks_mor group by symbol HAVING symbol = 'GOOG';
select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_mor where symbol = 'GOOG';
select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG';
select "_hoodie_commit_time", symbol, ts, volume, open, close from stock_ticks_mor_rt where symbol = 'GOOG';
cli/bin/hudi-cli.sh
connect --path cosn://[bucket]/usr/hive/warehouse/stock_ticks_mor
compactions show all
compaction schedule
compaction run --compactionInstant [requestid] --parallelism 2 --sparkMemory 1G --schemaFilePath cosn://[bucket]/hudi/config/schema.avsc --retry 1