Sqoop 2.x
In the previous section on Sqoop, we mentioned that the 1.x and 2.x lines differ substantially, so today we take a look at Sqoop 2.x. Note that the 1.99.x releases belong to the 2.x line.
Improvements of Sqoop2 over Sqoop1
- Introduces a Sqoop server that centralizes the management of connectors and related resources
- Multiple access methods: CLI, Web UI, and REST API
- Introduces role-based security
The basic architecture is shown in the figure below, next to the 1.x architecture for comparison; as you can see, the main addition is the server. A feature comparison follows.
Installation and configuration
Installation is not much different from the previous section, so we will not go into as much detail: just extract the tarball. The 1.99.x release installed here is the 2.x line this article covers.
Once it is extracted we move on to configuration. The old version was configured in sqoop-env.sh; in the new version we edit conf/sqoop.properties directly. We also need the MySQL JDBC driver on the server's classpath:
cp ~/Downloads/mysql-connector-java-5.1.49.jar $SQOOP_HOME/server/lib
Configure the installation path and the log directory
# The absolute path to the directory where system generated
# log files will be kept.
LOGDIR=/usr/local/sqoop-1.99.7-bin-hadoop200/logs
# The absolute path to the directory where Sqoop 2 is installed
BASEDIR=/usr/local/sqoop-1.99.7-bin-hadoop200
Any other settings that reference the log directory should be updated to match.
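If your copy of sqoop.properties still carries the stock @LOGDIR@/@BASEDIR@ placeholders (an assumption; check your file, and note the property names below are illustrative), a one-shot sed keeps every occurrence consistent instead of editing each one by hand. Demonstrated on a scratch file so it is safe to run anywhere; for a real install point it at $SQOOP_HOME/conf/sqoop.properties.

```shell
# Replace @LOGDIR@/@BASEDIR@ placeholders consistently across the config.
BASEDIR=/usr/local/sqoop-1.99.7-bin-hadoop200
LOGDIR=$BASEDIR/logs
conf=$(mktemp)
# Scratch stand-in for sqoop.properties (property names are illustrative):
printf '%s\n' 'org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log' \
              'org.apache.sqoop.auditlogger.default.file=@LOGDIR@/default.audit' > "$conf"
# -i.bak keeps a backup and works with both GNU and BSD sed
sed -i.bak -e "s|@LOGDIR@|$LOGDIR|g" -e "s|@BASEDIR@|$BASEDIR|g" "$conf"
grep 'File=' "$conf"
```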
Configure Hadoop
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=$HADOOP_HOME/etc/hadoop
Configuring this property with an environment variable does not work; it has to be an absolute path, as we will see below.
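With the path used later in this setup, the working line looks like this (substitute your own Hadoop conf directory):

```
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/usr/local/Cellar/hadoop/3.2.1/libexec/etc/hadoop
```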
Configure MySQL as the repository (the default is Derby)
# JDBC repository provider configuration
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.mysql.MySqlRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:mysql://localhost:3306/sqoop?useSSL=false
org.apache.sqoop.repository.jdbc.driver=com.mysql.jdbc.Driver
org.apache.sqoop.repository.jdbc.user=root
org.apache.sqoop.repository.jdbc.password=www1234
Configure user access permissions
Add the following to Hadoop's core-site.xml:
<property>
<name>hadoop.proxyuser.sqoop2.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.sqoop2.groups</name>
<value>*</value>
</property>
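After editing core-site.xml, Hadoop needs to pick the proxy-user change up. Restarting the cluster works; on a running cluster the standard refresh commands below may be enough (verify against your Hadoop version):

```
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration
```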
Starting the service
Initialization
sqoop2-tool upgrade
This fails with the following error:
Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Starting the Sqoop2 server...
0 [main] INFO org.apache.sqoop.core.SqoopServer - Initializing Sqoop server.
5 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.sqoop.security.authentication.SimpleAuthenticationHandler.secureLogin(SimpleAuthenticationHandler.java:37)
at org.apache.sqoop.security.AuthenticationManager.initialize(AuthenticationManager.java:98)
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:57)
at org.apache.sqoop.server.SqoopJettyServer.<init>(SqoopJettyServer.java:67)
at org.apache.sqoop.server.SqoopJettyServer.main(SqoopJettyServer.java:177)
This is caused by a Guava version mismatch; we borrow the newer jar that ships with Hive:
rm -f $SQOOP_HOME/tools/lib/guava-11.0.2.jar
cp $HIVE_HOME/lib/guava-27.0-jre.jar $SQOOP_HOME/tools/lib
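The same Guava problem resurfaces later for server/lib, so it can be worth swapping the jar in both lib directories at once. A small sketch, demonstrated on a throwaway sandbox layout so it is safe to run; point it at your real $SQOOP_HOME and Hive's guava jar instead.

```shell
# Swap the bundled guava-11.0.2.jar for a newer one in every lib directory
# the Sqoop2 tools and server load from (tools/lib and server/lib).
swap_guava() {
  sqoop_home=$1
  new_guava=$2
  for dir in "$sqoop_home/tools/lib" "$sqoop_home/server/lib"; do
    rm -f "$dir/guava-11.0.2.jar"
    cp "$new_guava" "$dir/"
  done
}

# Demo on a sandbox that mimics the real install layout:
sandbox=$(mktemp -d)
mkdir -p "$sandbox/tools/lib" "$sandbox/server/lib"
touch "$sandbox/tools/lib/guava-11.0.2.jar" "$sandbox/server/lib/guava-11.0.2.jar"
touch "$sandbox/guava-27.0-jre.jar"   # stand-in for $HIVE_HOME/lib/guava-27.0-jre.jar
swap_guava "$sandbox" "$sandbox/guava-27.0-jre.jar"
ls "$sandbox/server/lib"
```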
Running the initialization again fails differently:
Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
Version: 1.99.7
Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.UpgradeTool
0 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
Upgrade has failed, please check Server logs for further details.
Tool class org.apache.sqoop.tools.tool.UpgradeTool has failed.
Now we have to check the server logs:
2022-09-30 23:50:06,830 ERROR [org.apache.sqoop.repository.common.CommonRepositoryHandler.runQuery(CommonRepositoryHandler.java:2685)] Can't execute query: CREATE DATABASE IF NOT EXISTS"SQOOP"
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '"SQOOP"' at line 1
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:403)
at com.mysql.jdbc.Util.getInstance(Util.java:386)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
The upgrade SQL fails because MySQL rejects double-quoted identifiers by default, so we tell MySQL to allow them:
SET GLOBAL sql_mode = ANSI_QUOTES;
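Note that SET GLOBAL only lasts until mysqld restarts, and it does not affect sessions that are already open. To make the setting permanent, it can instead go in the MySQL server config (standard my.cnf section and key):

```
[mysqld]
sql_mode=ANSI_QUOTES
```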
Running the upgrade again succeeds:
Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
Version: 1.99.7
Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.UpgradeTool
0 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
Tool class org.apache.sqoop.tools.tool.UpgradeTool has finished correctly.
Verification
The following command checks whether our configuration is correct:
sqoop2-tool verify
The verification fails:
Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
Version: 1.99.7
Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.VerifyTool
0 [main] INFO org.apache.sqoop.core.SqoopServer - Initializing Sqoop server.
6 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
2022-10-01T00:04:07,707 WARN [main] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Verification has failed, please check Server logs for further details.
Tool class org.apache.sqoop.tools.tool.VerifyTool has failed.
The log suggests the Hadoop configuration is the problem: the environment variable is apparently not expanded, so we write the path out in full as /usr/local/Cellar/hadoop/3.2.1/libexec/etc/hadoop
org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): $HADOOP_HOME/etc/hadoop
at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
at org.apache.sqoop.driver.JobManager.initialize(JobManager.java:257)
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:64)
at org.apache.sqoop.tools.tool.VerifyTool.runTool(VerifyTool.java:36)
at org.apache.sqoop.tools.ToolRunner.main(ToolRunner.java:72)
2022-10-01 00:04:09,667 ERROR [org.apache.sqoop.tools.tool.VerifyTool.runTool(VerifyTool.java:41)] Got exception while initializing/destroying Sqoop server:
java.lang.RuntimeException: Failure in server initialization
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:68)
at org.apache.sqoop.tools.tool.VerifyTool.runTool(VerifyTool.java:36)
at org.apache.sqoop.tools.ToolRunner.main(ToolRunner.java:72)
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): $HADOOP_HOME/etc/hadoop
at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
at org.apache.sqoop.driver.JobManager.initialize(JobManager.java:257)
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:64)
Verifying again now works:
Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
Version: 1.99.7
Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.VerifyTool
0 [main] INFO org.apache.sqoop.core.SqoopServer - Initializing Sqoop server.
5 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
2022-10-01T00:21:25,831 WARN [main] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-10-01T00:21:28,074 INFO [main] org.apache.hadoop.conf.Configuration.deprecation - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
Verification was successful.
Tool class org.apache.sqoop.tools.tool.VerifyTool has finished correctly.
Starting and stopping the server
sqoop2-server start
The same error again; didn't we already fix this?
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.sqoop.security.authentication.SimpleAuthenticationHandler.secureLogin(SimpleAuthenticationHandler.java:37)
at org.apache.sqoop.security.AuthenticationManager.initialize(AuthenticationManager.java:98)
at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:57)
at org.apache.sqoop.server.SqoopJettyServer.<init>(SqoopJettyServer.java:67)
at org.apache.sqoop.server.SqoopJettyServer.main(SqoopJettyServer.java:177)
It turns out the server's lib directory also contains a jar with the wrong Guava version:
rm -f $SQOOP_HOME/server/lib/guava-11.0.2.jar
cp $HIVE_HOME/lib/guava-27.0-jre.jar $SQOOP_HOME/server/lib
Starting the server again now works:
Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Starting the Sqoop2 server...
0 [main] INFO org.apache.sqoop.core.SqoopServer - Initializing Sqoop server.
4 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
2022-10-01T00:24:01,221 WARN [main] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-10-01T00:24:03,447 INFO [main] org.apache.hadoop.conf.Configuration.deprecation - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
2022-10-01T00:24:03,709 INFO [main] org.eclipse.jetty.util.log - Logging initialized @3378ms
2022-10-01T00:24:03,772 INFO [main] org.eclipse.jetty.server.Server - jetty-9.2.13.v20150730
2022-10-01T00:24:03,931 INFO [main] org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager - Updating the current master key for generating delegation tokens
2022-10-01T00:24:03,933 INFO [Thread[Thread-5,5,main]] org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager - Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
2022-10-01T00:24:03,933 INFO [Thread[Thread-5,5,main]] org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager - Updating the current master key for generating delegation tokens
2022-10-01T00:24:03,949 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@342726f1{/sqoop,null,AVAILABLE}
2022-10-01T00:24:03,969 INFO [main] org.eclipse.jetty.server.ServerConnector - Started ServerConnector@37ce3644{HTTP/1.1}{0.0.0.0:12000}
2022-10-01T00:24:03,969 INFO [main] org.eclipse.jetty.server.Server - Started @3640ms
Sqoop2 server started.
jps shows a new Java process:
37721 SqoopJettyServer
To stop the server, use
sqoop2-server stop
Using the client
Starting the client
Start a client with the following command:
sqoop2-shell
jps shows that this too starts a Java process:
37945 SqoopShell
Connecting to the server
Connect to the server with:
set server --host localhost --port 12000 --webapp sqoop
host and port are self-explanatory (12000 is the default port). According to the official docs, --webapp is the name of the Sqoop Jetty web application, essentially an identifier you can use to tell servers apart.
Checking the connectors
List the connectors registered with the Sqoop server:
sqoop:000> show connector
+------------------------+---------+------------------------------------------------------------+----------------------+
| Name | Version | Class | Supported Directions |
+------------------------+---------+------------------------------------------------------------+----------------------+
| generic-jdbc-connector | 1.99.7 | org.apache.sqoop.connector.jdbc.GenericJdbcConnector | FROM/TO |
| kite-connector | 1.99.7 | org.apache.sqoop.connector.kite.KiteConnector | FROM/TO |
| oracle-jdbc-connector | 1.99.7 | org.apache.sqoop.connector.jdbc.oracle.OracleJdbcConnector | FROM/TO |
| ftp-connector | 1.99.7 | org.apache.sqoop.connector.ftp.FtpConnector | TO |
| hdfs-connector | 1.99.7 | org.apache.sqoop.connector.hdfs.HdfsConnector | FROM/TO |
| kafka-connector | 1.99.7 | org.apache.sqoop.connector.kafka.KafkaConnector | TO |
| sftp-connector | 1.99.7 | org.apache.sqoop.connector.sftp.SftpConnector | TO |
+------------------------+---------+------------------------------------------------------------+----------------------+
The jars these connectors depend on can be seen in the server/lib directory:
-rw-r--r--@ 1 liuwenqiang wheel 1.7M 7 20 2016 sqoop-connector-ftp-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang wheel 3.1M 7 20 2016 sqoop-connector-generic-jdbc-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang wheel 21M 7 20 2016 sqoop-connector-hdfs-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang wheel 13M 7 20 2016 sqoop-connector-kafka-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang wheel 12M 7 20 2016 sqoop-connector-kite-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang wheel 123K 7 20 2016 sqoop-connector-oracle-jdbc-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang wheel 1.7M 7 20 2016 sqoop-connector-sftp-1.99.7.jar
Creating and managing links
A link can be thought of as a connection; for example, once we create a MySQL link we can reuse it as many times as we like.
Creating a link for importing from MySQL
We create the link with create link -c generic-jdbc-connector and then fill in the prompted fields. Note that the link should get a meaningful business name, to ease later maintenance.
Before 1.99.7 every connector had an id, and links were created against the id of the connector they inherit from; in this version the id is gone and links are created directly with the connector name, here generic-jdbc-connector. (-c is short for --connector.)
sqoop:000> create link -c generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: mysql_localhost_world
Database connection
Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://localhost/world
Username: root
Password: *******
Fetch Size:
Connection Properties:
There are currently 0 values in the map:
entry#
SQL Dialect
Identifier enclose:
Identifier enclose specifies the delimiter for SQL identifiers. In some dialects identifiers are quoted with double quotes, as in select * from "table_name", which MySQL rejects by default.
The default value of this field is the double quote, so you cannot simply press Enter; it has to be overridden (I overrode it with a single space). Alternatively, MySQL itself can be configured to accept this quoting, as we covered back in sqoop 1.x.
Creating a link for exporting to HDFS
sqoop:000> create link -c hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: sink_to_hdfs
HDFS cluster
URI: hdfs://localhost:9000
Conf directory: /usr/local/Cellar/hadoop/3.2.1/libexec/etc/hadoop
Additional configs::
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and name sink_to_hdfs
Showing existing links
We can list all the links we have created:
sqoop:000> show link
+-----------------------+------------------------+---------+
| Name | Connector Name | Enabled |
+-----------------------+------------------------+---------+
| mysql_localhost_world | generic-jdbc-connector | true |
| sink_to_hdfs | hdfs-connector | true |
+-----------------------+------------------------+---------+
Adding the --all flag shows the details of each link:
sqoop:000> show link --all
2 link(s) to show:
link with name mysql_localhost_world (Enabled: true, Created by liuwenqiang at 22-10-1 上午10:15, Updated by liuwenqiang at 22-10-1 上午10:15)
Using Connector generic-jdbc-connector with name {1}
Database connection
Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://localhost/world
Username: root
Password:
Fetch Size:
Connection Properties:
SQL Dialect
Identifier enclose:
link with name sink_to_hdfs (Enabled: true, Created by liuwenqiang at 22-10-1 上午10:20, Updated by liuwenqiang at 22-10-1 上午10:20)
Using Connector hdfs-connector with name {1}
HDFS cluster
URI: hdfs://localhost:9000
Conf directory: /usr/local/Cellar/hadoop/3.2.1/libexec/etc/hadoop
Additional configs::
Creating and managing jobs
As noted above, a link is a connection. We have created two links; next we create a job that ties them together to actually move the data.
Creating a job
create job -f "mysql_localhost_world" -t "sink_to_hdfs"
- -f specifies from, i.e. the source of the data
- -t specifies to, i.e. the destination
In this example data moves from MySQL to HDFS, hence from mysql to HDFS. The parameter values are the Names given when the links were created.
sqoop:000> create job -f "mysql_localhost_world" -t "sink_to_hdfs"
Creating job for links with from name mysql_localhost_world and to name sink_to_hdfs
Please fill following values to create new job object
Name: mysql_localhost_world_2_hdfs_job
Database source
Schema name: world
Table name: city
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column:
Partition column nullable:
Boundary query:
Incremental read
Check column:
Last value:
Target configuration
Override null value:
Null value:
File format:
0 : TEXT_FILE
1 : SEQUENCE_FILE
2 : PARQUET_FILE
Choose:
Compression codec:
0 : NONE
1 : DEFAULT
2 : DEFLATE
3 : GZIP
4 : BZIP2
5 : LZO
6 : LZ4
7 : SNAPPY
8 : CUSTOM
Choose:
Custom codec:
Output directory:
Append mode:
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name mysql_localhost_world_2_hdfs_job
Showing existing jobs
sqoop:021> show job
2022-10-01T10:35:58,431 WARN [pool-5-thread-5]
+----+----------------------------------+------------------------------------------------+-------------------------------+---------+
| Id | Name | From Connector | To Connector | Enabled |
+----+----------------------------------+------------------------------------------------+-------------------------------+---------+
| 1 | mysql_localhost_world_2_hdfs_job | mysql_localhost_world (generic-jdbc-connector) | sink_to_hdfs (hdfs-connector) | true |
+----+----------------------------------+------------------------------------------------+-------------------------------+---------+
Alternatively, the following command shows the job details:
sqoop:021> show job --all
1 job(s) to show:
Job with name mysql_localhost_world_2_hdfs_job (Enabled: true, Created by liuwenqiang at 22-10-1 上午10:30, Updated by liuwenqiang at 22-10-1 上午10:30)
Throttling resources
Extractors:
Loaders:
Classpath configuration
Extra mapper jars:
From link: mysql_localhost_world
Database source
Schema name: world
Table name: city
SQL statement:
Column names:
Partition column:
Partition column nullable:
Boundary query:
Incremental read
Check column:
Last value:
To link: sink_to_hdfs
Target configuration
Override null value:
Null value:
File format:
Compression codec:
Custom codec:
Output directory: /tmp/city
Append mode:
Starting the job
Start the job with the following command:
start job -n mysql_localhost_world_2_hdfs_job
(-n is short for --name)
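Once the job is started, the same shell can report on it or stop it. These commands are part of the standard Sqoop2 shell; run help in your version to confirm the exact flags:

```
sqoop:000> status job -n mysql_localhost_world_2_hdfs_job
sqoop:000> stop job -n mysql_localhost_world_2_hdfs_job
```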
Summary
Sqoop1's strength is its simple architecture and deployment. Its weaknesses: the command-line interface is error-prone, the format is tightly coupled, not every data type is supported, and the security model is incomplete (passwords are exposed, for example); installation requires root privileges, and connectors must conform to the JDBC model.
Sqoop2's strengths are its multiple access methods (command line, Web UI, REST API), centralized connector management with all connections installed on the Sqoop server, a more complete permission model, and standardized connectors that are responsible only for reading and writing data. Its weaknesses are a somewhat more complex architecture and more involved configuration and deployment.
For more, see the official documentation.