Exporting data from Hive to MySQL with Sqoop

https://blog.csdn.net/wiborgite/article/details/80958201

This author's worked example is very clear.

Here I mainly annotate the official documentation, partly to help myself remember it.

Sqoop official documentation: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_syntax_4

Argument	Description
`--columns <col,col,col…>`	Columns to export to table (which columns to export)
`--direct`	Use direct export fast path (whether to use direct mode)
`--export-dir <dir>`	HDFS source path for the export (where the data sits on HDFS)
`-m,--num-mappers <n>`	Use n map tasks to export in parallel (sets the number of map tasks)
`--table <table-name>`	Table to populate (the target table)
`--call <stored-proc-name>`	Stored Procedure to call
`--update-key <col-name>`	Anchor column to use for updates. Use a comma separated list of columns if there are more than one column. (!! Key point: the column used to match rows for updating, usually the primary key or another unique column; separate multiple columns with commas.)
`--update-mode <mode>`	Specify how updates are performed when new rows are found with non-matching keys in database. (the update mode)
	Legal values for `mode` include `updateonly` (default) and `allowinsert`.
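`--input-null-string <null-string>`	The string to be interpreted as null for string columns (!! Note: this is the text in the input files that Sqoop reads as NULL, not a replacement value; passing an arbitrary string changed nothing in my tests.)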
`--input-null-non-string <null-string>`	The string to be interpreted as null for non-string columns
`--staging-table <staging-table-name>`	The table in which data will be staged before being inserted into the destination table.
`--clear-staging-table`	Indicates that any data present in the staging table can be deleted.
`--batch`	Use batch mode for underlying statement execution.
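
As a rough illustration of the two staging options in the table, a plain insert-mode export through a staging table might look like the sketch below. The staging table name staff1_stage is hypothetical (per the Sqoop user guide it must be created beforehand with the same structure as the target table), the connection settings are borrowed from the script later in this post, and RUN with its show helper is only a dry-run device of this sketch that prints the command instead of executing it. The user guide states that staging is not available together with --update-key, so no update key is given here.

```shell
# Dry-run helper (an assumption of this sketch): print the command instead of running it.
show() { printf '%s ' "$@"; printf '\n'; }
# RUN defaults to the dry-run helper; set RUN="" to really invoke Sqoop.
RUN="${RUN-show}"

export_with_staging() {
  # staff1_stage is a hypothetical, pre-created table with the same structure as staff1.
  $RUN bin/sqoop export \
    --connect jdbc:mysql://hadoop102:3306/test \
    --username root \
    --password 123456 \
    --table staff1 \
    --staging-table staff1_stage \
    --clear-staging-table \
    --num-mappers 1 \
    --export-dir /user/hive/warehouse/gmall.db/staff \
    --input-fields-terminated-by '\t'
}

export_with_staging
```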

From the Sqoop official documentation:

Consider also a dataset in HDFS containing records like these:

0,this is a test,42
1,some more data,100
...

Running sqoop-export --table foo --update-key id --export-dir /path/to/data --connect … will run an export job that executes SQL statements based on the data like so:

UPDATE foo SET msg='this is a test', bar=42 WHERE id=0;
UPDATE foo SET msg='some more data', bar=100 WHERE id=1;
...

If an UPDATE statement modifies no rows, this is not considered an error; the export will silently continue. (In effect, this means that an update-based export will not insert new rows into the database.) Likewise, if the column specified with --update-key does not uniquely identify rows and multiple rows are updated by a single statement, this condition is also undetected.

The argument --update-key can also be given a comma separated list of column names. In which case, Sqoop will match all keys from this list before updating any existing record.

Depending on the target database, you may also specify the --update-mode argument with allowinsert mode if you want to update rows if they exist in the database already or insert rows if they do not exist yet.

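In other words, with --update-key id set, an UPDATE that matches no row in MySQL is not treated as an error; the export simply continues without reporting anything.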

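The --update-key column must be unique. If it is not, Sqoop does not report an error (the documentation says this condition goes undetected); in several of my tests the job just hung at map 100% with no error message.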

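With updateonly, running the Sqoop script is effectively a loop over every id in the Hive table: each Hive row whose id already exists in MySQL is updated, and rows with no matching id in MySQL are simply skipped.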
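
The loop described above can be imitated outside Sqoop. This is only a sketch: it reuses the table foo, the columns id, msg and bar, and the two sample records from the documentation excerpt, while the function name gen_updates and the temporary file are made up for illustration. It prints the kind of UPDATE statement an updateonly export issues for each input row; it does not show how Sqoop is implemented internally.

```shell
# Hypothetical sample file holding the two records from the documentation excerpt.
data_file="$(mktemp)"
printf '%s\n' '0,this is a test,42' '1,some more data,100' > "$data_file"

# Print one UPDATE per input row, using the first field (id) as the update key.
gen_updates() {
  awk -F',' '{ printf "UPDATE foo SET msg='\''%s'\'', bar=%s WHERE id=%s;\n", $2, $3, $1 }' "$1"
}

gen_updates "$data_file"
```

A row whose id does not exist in MySQL would still produce an UPDATE, but that statement would affect zero rows, which is why such rows are skipped silently instead of being inserted.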

With allowinsert, existing rows are updated and rows newly added in Hive are inserted into MySQL as well.
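
A minimal sketch of an allowinsert export, assuming the same host, database, table and HDFS path as the script in this post. RUN and the show helper are a dry-run device of this sketch that prints the command instead of executing it. As far as I know, for MySQL Sqoop implements this mode with INSERT ... ON DUPLICATE KEY UPDATE, so the target table would need a primary or unique key on id; treat that as an assumption to verify.

```shell
# Dry-run helper (an assumption of this sketch): print the command instead of running it.
show() { printf '%s ' "$@"; printf '\n'; }
# RUN defaults to the dry-run helper; set RUN="" to really invoke Sqoop.
RUN="${RUN-show}"

export_allowinsert() {
  # Same settings as the updateonly script, with only the update mode changed.
  $RUN bin/sqoop export \
    --connect jdbc:mysql://hadoop102:3306/test \
    --username root \
    --password 123456 \
    --table staff1 \
    --num-mappers 1 \
    --export-dir /user/hive/warehouse/gmall.db/staff \
    --update-key id \
    --update-mode allowinsert \
    --input-fields-terminated-by '\t'
}

export_allowinsert
```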

Sqoop script:
bin/sqoop export \
--connect jdbc:mysql://hadoop102:3306/test \
--username root \
--password 123456 \
--table staff1 \
--num-mappers 1 \
--export-dir /user/hive/warehouse/gmall.db/staff \
--update-key id \
--update-mode updateonly \
--input-fields-terminated-by "\t"

——————————————————————————————————————————————————

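I still have one open question. A NULL value in Hive is stored in the underlying files as \N.

When the Hive data is exported to MySQL, that NULL arrives as the literal string \N.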

--input-null-string "chenchi"

--input-null-non-string "chenchi"

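Many people say these two options replace NULL values, but in my tests that does not seem to work: I expected NULL to become chenchi, yet the result was still \N.

Later I also came across alter table ${table_name} SET SERDEPROPERTIES('serialization.null.format' = '\\N');

I set this value to chenchi, and it did not seem to have any effect either.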

If any expert passing by can explain this, please do; otherwise I will add more later.
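
One possible explanation, offered as a sketch and not as a fix verified on this cluster: on export, --input-null-string and --input-null-non-string name the text in the HDFS files that Sqoop should read as NULL; they are not replacement values. Passing chenchi tells Sqoop to treat the literal text chenchi as NULL, so \N still arrives as an ordinary string. The Hive notes in the Sqoop user guide suggest passing '\\N' for both options. The command below reuses the script from this post; RUN and the show helper are a dry-run device of this sketch that prints the command instead of executing it.

```shell
# Dry-run helper (an assumption of this sketch): print the command instead of running it.
show() { printf '%s ' "$@"; printf '\n'; }
# RUN defaults to the dry-run helper; set RUN="" to really invoke Sqoop.
RUN="${RUN-show}"

export_with_hive_nulls() {
  # '\\N' (single-quoted, two backslashes) asks Sqoop to read Hive's \N marker as SQL NULL.
  $RUN bin/sqoop export \
    --connect jdbc:mysql://hadoop102:3306/test \
    --username root \
    --password 123456 \
    --table staff1 \
    --num-mappers 1 \
    --export-dir /user/hive/warehouse/gmall.db/staff \
    --update-key id \
    --update-mode updateonly \
    --input-fields-terminated-by '\t' \
    --input-null-string '\\N' \
    --input-null-non-string '\\N'
}

export_with_hive_nulls
```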

Original link: https://blog.csdn.net/cclovezbf/article/details/100559554