Data Synchronization Tools: Sqoop 2.x

When we covered Sqoop in the previous article, we mentioned that the 1.x and 2.x lines differ considerably, so today we take a look at Sqoop 2.x. Note that the 1.99.x releases belong to the 2.x line.

Improvements in Sqoop2 over Sqoop1

  1. Introduces a Sqoop server that centralizes the management of connectors and other resources
  2. Offers multiple access methods: CLI, Web UI, and REST API
  3. Introduces role-based security

The basic architecture looks like this:

[Figure: Sqoop 2 architecture]

For comparison, the 1.x architecture:

[Figure: Sqoop 1.x architecture]

As you can see, the main addition is the server.

A feature comparison:

[Figure: Sqoop1 vs. Sqoop2 feature comparison]

Installation and Configuration

Installation here is much the same as in the previous article, so we will not go through it in detail: simply extract the archive. The 1.99.7 release is the 2.x version we are covering today.

[Screenshot: the Sqoop 1.99.7 release archive]
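A minimal sketch of the unpack step (the tarball name is inferred from the install directory seen in the logs; the download and the /usr/local prefix are simulated under a scratch directory here so the commands can be run safely):

```shell
# Simulate the downloaded tarball and the /usr/local prefix under a scratch dir;
# for a real install, tar -zxf the downloaded archive straight into /usr/local.
WORK=$(mktemp -d)
mkdir -p "$WORK/sqoop-1.99.7-bin-hadoop200/bin"
tar -C "$WORK" -czf "$WORK/sqoop-1.99.7-bin-hadoop200.tar.gz" sqoop-1.99.7-bin-hadoop200

PREFIX="$WORK/usr-local"            # stand-in for /usr/local
mkdir -p "$PREFIX"
tar -zxf "$WORK/sqoop-1.99.7-bin-hadoop200.tar.gz" -C "$PREFIX"

export SQOOP_HOME="$PREFIX/sqoop-1.99.7-bin-hadoop200"
export PATH="$PATH:$SQOOP_HOME/bin"
```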

Once it is extracted, we move on to configuration. The old version was configured in sqoop-env.sh; the new version is configured directly in conf/sqoop.properties.

[Screenshot: editing the configuration]

We also need to copy the MySQL JDBC driver into the server's classpath:

cp ~/Downloads/mysql-connector-java-5.1.49.jar $SQOOP_HOME/server/lib

Configure the installation path and the log directory:

# The absolute path to the directory where system generated
# log files will be kept.
LOGDIR=/usr/local/sqoop-1.99.7-bin-hadoop200/logs

# The absolute path to the directory where Sqoop 2 is installed
BASEDIR=/usr/local/sqoop-1.99.7-bin-hadoop200

Any other entries that reference these log and path variables should be updated consistently as well.
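In my reading of the 1.99.7 distribution, several other entries in sqoop.properties reference the log and base directories through @LOGDIR@/@BASEDIR@-style placeholders, so a single sed pass can substitute them all consistently. A sketch, exercised here on a throwaway file with two representative (illustrative, not verbatim) entries; point PROPS at $SQOOP_HOME/conf/sqoop.properties for the real run:

```shell
BASEDIR=/usr/local/sqoop-1.99.7-bin-hadoop200
LOGDIR=$BASEDIR/logs

# Throwaway stand-in for conf/sqoop.properties (illustrative entries):
PROPS=$(mktemp)
printf '%s\n' \
  'org.apache.sqoop.log4j.appender.file.File=@LOGDIR@/sqoop.log' \
  'org.apache.sqoop.auditlogger.default.file=@LOGDIR@/default.audit' > "$PROPS"

# -i.bak keeps a backup and works with both GNU and BSD sed
sed -i.bak -e "s|@LOGDIR@|$LOGDIR|g" -e "s|@BASEDIR@|$BASEDIR|g" "$PROPS"
cat "$PROPS"
```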

Configuring Hadoop

org.apache.sqoop.submission.engine.mapreduce.configuration.directory=$HADOOP_HOME/etc/hadoop 

Configuring this with an environment variable does not work; an absolute path is required, as we will see below.

Configuring MySQL as the repository (the default is Derby)

# JDBC repository provider configuration
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.mysql.MySqlRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED
org.apache.sqoop.repository.jdbc.maximum.connections=10
org.apache.sqoop.repository.jdbc.url=jdbc:mysql://localhost:3306/sqoop?useSSL=false
org.apache.sqoop.repository.jdbc.driver=com.mysql.jdbc.Driver
org.apache.sqoop.repository.jdbc.user=root
org.apache.sqoop.repository.jdbc.password=www1234

Configuring user access permissions

Add the following to Hadoop's configuration file core-site.xml:

<property>
  <name>hadoop.proxyuser.sqoop2.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.sqoop2.groups</name>
  <value>*</value>
</property>
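For the new proxy-user rules to take effect, the Hadoop daemons have to reread their configuration; either restart the cluster or push the change to the running daemons with the standard admin commands (a fragment only, since it requires a live cluster):

```
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration
```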

Starting the Service

Initialization

sqoop2-tool upgrade

This fails with the following error:

Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Starting the Sqoop2 server...
0    [main] INFO  org.apache.sqoop.core.SqoopServer  - Initializing Sqoop server.
5    [main] INFO  org.apache.sqoop.core.PropertiesConfigurationProvider  - Starting config file poller thread
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
	at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
	at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
	at org.apache.sqoop.security.authentication.SimpleAuthenticationHandler.secureLogin(SimpleAuthenticationHandler.java:37)
	at org.apache.sqoop.security.AuthenticationManager.initialize(AuthenticationManager.java:98)
	at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:57)
	at org.apache.sqoop.server.SqoopJettyServer.<init>(SqoopJettyServer.java:67)
	at org.apache.sqoop.server.SqoopJettyServer.main(SqoopJettyServer.java:177)

This is a Guava version conflict: the bundled Guava is too old for Hadoop 3.x, so we borrow the newer copy that ships with Hive:

rm -f  $SQOOP_HOME/tools/lib/guava-11.0.2.jar
cp $HIVE_HOME/lib/guava-27.0-jre.jar $SQOOP_HOME/tools/lib
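Since the distribution bundles more than one copy of Guava, a small loop can hunt down every stale jar under $SQOOP_HOME and swap it for Hive's newer one in place. A sketch, demonstrated against a throwaway directory tree so it can be checked safely; point SQOOP_HOME and HIVE_HOME at the real installations:

```shell
# Throwaway stand-ins for the real installations:
SQOOP_HOME=$(mktemp -d); HIVE_HOME=$(mktemp -d)
mkdir -p "$SQOOP_HOME/tools/lib" "$SQOOP_HOME/server/lib" "$HIVE_HOME/lib"
touch "$SQOOP_HOME/tools/lib/guava-11.0.2.jar" \
      "$SQOOP_HOME/server/lib/guava-11.0.2.jar" \
      "$HIVE_HOME/lib/guava-27.0-jre.jar"

# Replace every old Guava jar under $SQOOP_HOME with Hive's copy:
find "$SQOOP_HOME" -name 'guava-11.0.2.jar' | while read -r jar; do
  rm -f "$jar"
  cp "$HIVE_HOME/lib/guava-27.0-jre.jar" "$(dirname "$jar")/"
done
find "$SQOOP_HOME" -name 'guava-*.jar'
```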

Running the initialization again fails with a different error:

Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
	Version: 1.99.7
	Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
	Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.UpgradeTool
0    [main] INFO  org.apache.sqoop.core.PropertiesConfigurationProvider  - Starting config file poller thread
Upgrade has failed, please check Server logs for further details.
Tool class org.apache.sqoop.tools.tool.UpgradeTool has failed.

Now we have to check the server log:

2022-09-30 23:50:06,830 ERROR [org.apache.sqoop.repository.common.CommonRepositoryHandler.runQuery(CommonRepositoryHandler.java:2685)] Can't execute query: CREATE DATABASE IF NOT EXISTS"SQOOP"
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '"SQOOP"' at line 1
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at com.mysql.jdbc.Util.handleNewInstance(Util.java:403)
	at com.mysql.jdbc.Util.getInstance(Util.java:386)
	at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)

The SQL that Sqoop generates quotes identifiers with double quotes (CREATE DATABASE IF NOT EXISTS "SQOOP"), which MySQL rejects in its default mode. We configure MySQL to accept double-quoted identifiers:

SET GLOBAL sql_mode = 'ANSI_QUOTES';
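One caveat: the statement above overwrites the entire sql_mode, dropping any modes already enabled, and a GLOBAL change only applies to sessions opened afterwards. A safer variant (a sketch; run as a MySQL admin) appends ANSI_QUOTES to whatever is already set:

```sql
-- Inspect the current mode, then append ANSI_QUOTES instead of overwriting;
-- CONCAT_WS skips the old value cleanly if sql_mode happens to be empty.
SELECT @@GLOBAL.sql_mode;
SET GLOBAL sql_mode = CONCAT_WS(',', NULLIF(@@GLOBAL.sql_mode, ''), 'ANSI_QUOTES');
```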

Running it once more succeeds:

Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
	Version: 1.99.7
	Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
	Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.UpgradeTool
0    [main] INFO  org.apache.sqoop.core.PropertiesConfigurationProvider  - Starting config file poller thread
Tool class org.apache.sqoop.tools.tool.UpgradeTool has finished correctly.

Verification

The following command checks whether our configuration is correct:

sqoop2-tool verify

The verification fails:

Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
	Version: 1.99.7
	Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
	Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.VerifyTool
0    [main] INFO  org.apache.sqoop.core.SqoopServer  - Initializing Sqoop server.
6    [main] INFO  org.apache.sqoop.core.PropertiesConfigurationProvider  - Starting config file poller thread
2022-10-01T00:04:07,707 WARN [main] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Verification has failed, please check Server logs for further details.
Tool class org.apache.sqoop.tools.tool.VerifyTool has failed.

The log shows that the Hadoop configuration directory is the problem: the environment variable in the property is evidently not expanded, so we write it out as the absolute path /usr/local/Cellar/hadoop/3.2.1/libexec/etc/hadoop

org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): $HADOOP_HOME/etc/hadoop
	at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
	at org.apache.sqoop.driver.JobManager.initialize(JobManager.java:257)
	at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:64)
	at org.apache.sqoop.tools.tool.VerifyTool.runTool(VerifyTool.java:36)
	at org.apache.sqoop.tools.ToolRunner.main(ToolRunner.java:72)
2022-10-01 00:04:09,667 ERROR [org.apache.sqoop.tools.tool.VerifyTool.runTool(VerifyTool.java:41)] Got exception while initializing/destroying Sqoop server:
java.lang.RuntimeException: Failure in server initialization
	at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:68)
	at org.apache.sqoop.tools.tool.VerifyTool.runTool(VerifyTool.java:36)
	at org.apache.sqoop.tools.ToolRunner.main(ToolRunner.java:72)
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): $HADOOP_HOME/etc/hadoop
	at org.apache.sqoop.submission.mapreduce.MapreduceSubmissionEngine.initialize(MapreduceSubmissionEngine.java:97)
	at org.apache.sqoop.driver.JobManager.initialize(JobManager.java:257)
	at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:64)

With the absolute path in place, verification passes:

Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
	Version: 1.99.7
	Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
	Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.VerifyTool
0    [main] INFO  org.apache.sqoop.core.SqoopServer  - Initializing Sqoop server.
5    [main] INFO  org.apache.sqoop.core.PropertiesConfigurationProvider  - Starting config file poller thread
2022-10-01T00:21:25,831 WARN [main] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-10-01T00:21:28,074 INFO [main] org.apache.hadoop.conf.Configuration.deprecation - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
Verification was successful.
Tool class org.apache.sqoop.tools.tool.VerifyTool has finished correctly.

Starting and stopping the server

sqoop2-server start

The same error again; didn't we already fix this?

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
	at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
	at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
	at org.apache.sqoop.security.authentication.SimpleAuthenticationHandler.secureLogin(SimpleAuthenticationHandler.java:37)
	at org.apache.sqoop.security.AuthenticationManager.initialize(AuthenticationManager.java:98)
	at org.apache.sqoop.core.SqoopServer.initialize(SqoopServer.java:57)
	at org.apache.sqoop.server.SqoopJettyServer.<init>(SqoopJettyServer.java:67)
	at org.apache.sqoop.server.SqoopJettyServer.main(SqoopJettyServer.java:177)

It turns out there is another stale Guava jar under the server directory:

rm -f $SQOOP_HOME/server/lib/guava-11.0.2.jar
cp $HIVE_HOME/lib/guava-27.0-jre.jar $SQOOP_HOME/server/lib

This time the server starts:

Setting conf dir: /usr/local/sqoop-1.99.7-bin-hadoop200/bin/../conf
Sqoop home directory: /usr/local/sqoop-1.99.7-bin-hadoop200
Starting the Sqoop2 server...
0    [main] INFO  org.apache.sqoop.core.SqoopServer  - Initializing Sqoop server.
4    [main] INFO  org.apache.sqoop.core.PropertiesConfigurationProvider  - Starting config file poller thread
2022-10-01T00:24:01,221 WARN [main] org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-10-01T00:24:03,447 INFO [main] org.apache.hadoop.conf.Configuration.deprecation - No unit for dfs.client.datanode-restart.timeout(30) assuming SECONDS
2022-10-01T00:24:03,709 INFO [main] org.eclipse.jetty.util.log - Logging initialized @3378ms
2022-10-01T00:24:03,772 INFO [main] org.eclipse.jetty.server.Server - jetty-9.2.13.v20150730
2022-10-01T00:24:03,931 INFO [main] org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager - Updating the current master key for generating delegation tokens
2022-10-01T00:24:03,933 INFO [Thread[Thread-5,5,main]] org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager - Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
2022-10-01T00:24:03,933 INFO [Thread[Thread-5,5,main]] org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager - Updating the current master key for generating delegation tokens
2022-10-01T00:24:03,949 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@342726f1{/sqoop,null,AVAILABLE}
2022-10-01T00:24:03,969 INFO [main] org.eclipse.jetty.server.ServerConnector - Started ServerConnector@37ce3644{HTTP/1.1}{0.0.0.0:12000}
2022-10-01T00:24:03,969 INFO [main] org.eclipse.jetty.server.Server - Started @3640ms
Sqoop2 server started.

jps now shows an additional Java process:

37721 SqoopJettyServer

To stop the server, use:

sqoop2-server stop

Using the Client

Starting the client

Start a client with the following command:

sqoop2-shell

jps shows that this launched another Java process:

37945 SqoopShell

Connecting to the server

Connect to the server with the following command:

set server --host localhost --port 12000 --webapp sqoop

host and port need little explanation; 12000 is the server's default port.

The last option, --webapp, is described in the official documentation as the name of the Sqoop Jetty web application; presumably it is just an identifier for this server, and the default is sqoop.
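Once connected, a quick sanity check is to ask for the client and server versions; if the server answers, the connection settings are correct (a sketch assuming a running server):

```
sqoop:000> show version --all
sqoop:000> show server --all
```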

Checking the connectors

List the connectors registered with the Sqoop server:

sqoop:000> show connector
+------------------------+---------+------------------------------------------------------------+----------------------+
|          Name          | Version |                           Class                            | Supported Directions |
+------------------------+---------+------------------------------------------------------------+----------------------+
| generic-jdbc-connector | 1.99.7  | org.apache.sqoop.connector.jdbc.GenericJdbcConnector       | FROM/TO              |
| kite-connector         | 1.99.7  | org.apache.sqoop.connector.kite.KiteConnector              | FROM/TO              |
| oracle-jdbc-connector  | 1.99.7  | org.apache.sqoop.connector.jdbc.oracle.OracleJdbcConnector | FROM/TO              |
| ftp-connector          | 1.99.7  | org.apache.sqoop.connector.ftp.FtpConnector                | TO                   |
| hdfs-connector         | 1.99.7  | org.apache.sqoop.connector.hdfs.HdfsConnector              | FROM/TO              |
| kafka-connector        | 1.99.7  | org.apache.sqoop.connector.kafka.KafkaConnector            | TO                   |
| sftp-connector         | 1.99.7  | org.apache.sqoop.connector.sftp.SftpConnector              | TO                   |
+------------------------+---------+------------------------------------------------------------+----------------------+

The jars backing these connectors can be found in the server/lib directory:

-rw-r--r--@ 1 liuwenqiang  wheel   1.7M  7 20  2016 sqoop-connector-ftp-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang  wheel   3.1M  7 20  2016 sqoop-connector-generic-jdbc-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang  wheel    21M  7 20  2016 sqoop-connector-hdfs-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang  wheel    13M  7 20  2016 sqoop-connector-kafka-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang  wheel    12M  7 20  2016 sqoop-connector-kite-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang  wheel   123K  7 20  2016 sqoop-connector-oracle-jdbc-1.99.7.jar
-rw-r--r--@ 1 liuwenqiang  wheel   1.7M  7 20  2016 sqoop-connector-sftp-1.99.7.jar

Creating and Managing Links

A link can be thought of as a reusable connection: for example, once a MySQL link is created, it can be used again and again.

Creating a link for importing from MySQL

We create the link with the create link -c generic-jdbc-connector command and then fill in the values it prompts for. Make sure to give the link a name with business meaning, which makes later maintenance easier.

Before version 1.99.7, every connector had an id, and that id was used to specify the connector a link inherits from; in this version the id is gone and the link is created directly with the connector name, here generic-jdbc-connector.

-c is short for --connector

sqoop:000> create link -c generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: mysql_localhost_world

Database connection

Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://localhost/world
Username: root
Password: *******
Fetch Size:
Connection Properties:
There are currently 0 values in the map:
entry#

SQL Dialect

Identifier enclose:

Identifier enclose: specifies the delimiter for identifiers in generated SQL. Some SQL dialects wrap identifiers in double quotes (select * from "table_name"), a delimiter that MySQL rejects by default.

The default value of this property is the double quote, so simply pressing Enter is not enough; it has to be overridden, and I overrode it with a space. Alternatively, MySQL itself can be configured to accept double-quoted identifiers, as we covered in the Sqoop 1.x article.

Creating a link for exporting to HDFS

sqoop:000>  create link -c hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: sink_to_hdfs

HDFS cluster

URI: hdfs://localhost:9000
Conf directory: /usr/local/Cellar/hadoop/3.2.1/libexec/etc/hadoop
Additional configs::
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and name sink_to_hdfs

Listing existing links

We can list all the links we have created:

sqoop:000> show link
+-----------------------+------------------------+---------+
|         Name          |     Connector Name     | Enabled |
+-----------------------+------------------------+---------+
| mysql_localhost_world | generic-jdbc-connector | true    |
| sink_to_hdfs          | hdfs-connector         | true    |
+-----------------------+------------------------+---------+

We can add the --all flag to view the details of each link:

sqoop:000> show link --all
2 link(s) to show:
link with name mysql_localhost_world (Enabled: true, Created by liuwenqiang at 22-10-1 上午10:15, Updated by liuwenqiang at 22-10-1 上午10:15)
Using Connector generic-jdbc-connector with name {1}
  Database connection
    Driver class: com.mysql.jdbc.Driver
    Connection String: jdbc:mysql://localhost/world
    Username: root
    Password:
    Fetch Size:
    Connection Properties:
  SQL Dialect
    Identifier enclose:
link with name sink_to_hdfs (Enabled: true, Created by liuwenqiang at 22-10-1 上午10:20, Updated by liuwenqiang at 22-10-1 上午10:20)
Using Connector hdfs-connector with name {1}
  HDFS cluster
    URI: hdfs://localhost:9000
    Conf directory: /usr/local/Cellar/hadoop/3.2.1/libexec/etc/hadoop
    Additional configs::

Creating and Managing Jobs

As mentioned above, a link is a connection, and we have already created two. Next we create a job that wires the two links together to actually move the data. We enter the following command:

Creating a job

create job -f "mysql_localhost_world" -t "sink_to_hdfs"
  1. -f specifies from, i.e. the data source;
  2. -t specifies to, i.e. the destination.

In this example data is transferred from MySQL to HDFS, so it is from MySQL to HDFS. The parameter values are the Names specified when the links were created.

sqoop:000> create job -f "mysql_localhost_world" -t "sink_to_hdfs"
Creating job for links with from name mysql_localhost_world and to name sink_to_hdfs
Please fill following values to create new job object
Name: mysql_localhost_world_2_hdfs_job

Database source

Schema name: world
Table name: city
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column:
Partition column nullable:
Boundary query:

Incremental read

Check column:
Last value:

Target configuration

Override null value:
Null value:
File format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose:
Compression codec:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose:
Custom codec:
Output directory:
Append mode:

Throttling resources

Extractors:
Loaders:

Classpath configuration

Extra mapper jars:
There are currently 0 values in the list:
element#
2022-10-01T10:31:57
New job was successfully created with validation status OK  and name mysql_localhost_world_2_hdfs_job

Listing existing jobs

sqoop:021>  show job
+----+----------------------------------+------------------------------------------------+-------------------------------+---------+
| Id |               Name               |                 From Connector                 |         To Connector          | Enabled |
+----+----------------------------------+------------------------------------------------+-------------------------------+---------+
| 1  | mysql_localhost_world_2_hdfs_job | mysql_localhost_world (generic-jdbc-connector) | sink_to_hdfs (hdfs-connector) | true    |
+----+----------------------------------+------------------------------------------------+-------------------------------+---------+

Or run the following command to view the job details:

sqoop:021>  show job --all
1 job(s) to show:
Job with name mysql_localhost_world_2_hdfs_job (Enabled: true, Created by liuwenqiang at 22-10-1 上午10:30, Updated by liuwenqiang at 22-10-1 上午10:30)
  Throttling resources
    Extractors:
    Loaders:
  Classpath configuration
    Extra mapper jars:
From link: mysql_localhost_world
  Database source
    Schema name: world
    Table name: city
    SQL statement:
    Column names:
    Partition column:
    Partition column nullable:
    Boundary query:
  Incremental read
    Check column:
    Last value:
To link: sink_to_hdfs
  Target configuration
    Override null value:
    Null value:
    File format:
    Compression codec:
    Custom codec:
    Output directory: /tmp/city
    Append mode:

Starting a job

Start the job with the following command:

start job -n mysql_localhost_world_2_hdfs_job

-n is short for --name.
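A job started this way can be monitored and stopped from the same shell with the matching commands, which follow the same -n convention (a sketch; assumes the running job created above):

```
sqoop:000> status job -n mysql_localhost_world_2_hdfs_job
sqoop:000> stop job -n mysql_localhost_world_2_hdfs_job
```

After a successful run, the imported files appear under the output directory configured for the job (/tmp/city above):

```
hdfs dfs -ls /tmp/city
```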

Summary

Sqoop1's advantage is a simple architecture and deployment. Its drawbacks: the command-line interface is error-prone and its format tightly coupled; not every data type is supported; the security model is incomplete (passwords are exposed, for example); installation requires root; and connectors must conform to the JDBC model.

Sqoop2's advantages: multiple interaction modes (command line, Web UI, REST API); centralized connector management, with all connections installed on the Sqoop server; a more complete permission model; and standardized connectors that are responsible only for reading and writing data. Its drawbacks: a somewhat more complex architecture and a more involved configuration and deployment.

For more details, see the official documentation.


Copyright notice: this is an original article by king14bhhb, released under the CC 4.0 BY-SA license; please include a link to the original source and this notice when republishing.