scrapy mysql multithreading: problems encountered with a multithreaded crawler

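Problems I ran into while writing a crawler:

I. Fetching pages with HttpClient from multiple threads

1. Parsing the response with EntityUtils.toString frequently failed with errors saying the content could not be parsed. I suspected a thread-safety problem, so I switched to jsoup to parse the pages.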

java.nio.charset.IllegalCharsetNameException: UTF-8

at java.nio.charset.Charset.checkName(Charset.java:284)

at java.nio.charset.Charset.lookup2(Charset.java:458)

at java.nio.charset.Charset.lookup(Charset.java:437)

at java.nio.charset.Charset.isSupported(Charset.java:476)

at org.jsoup.helper.DataUtil.getCharsetFromContentType(DataUtil.java:132)
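
The trace above shows the charset name taken from the Content-Type header being rejected by Charset.isSupported. One defensive approach is to resolve the name yourself and fall back to a default when it is malformed. The sketch below is only an illustration: the class SafeCharset and the method resolve are hypothetical names, and the idea that the header value carried stray quotes or whitespace is an assumption, since the source does not show the raw header.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of jsoup or HttpClient.
final class SafeCharset {
    private SafeCharset() {}

    // Returns the charset named by rawName, or UTF-8 when the name is
    // missing, malformed, or not supported by this JVM.
    static Charset resolve(String rawName) {
        if (rawName == null) {
            return StandardCharsets.UTF_8;
        }
        // Assumption: some servers wrap the name in quotes or pad it with spaces.
        String cleaned = rawName.trim().replaceAll("^[\"']+|[\"']+$", "").trim();
        if (cleaned.isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(cleaned);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Unknown server, unknown content: fall back instead of failing the thread.
            return StandardCharsets.UTF_8;
        }
    }
}
```

The resolved Charset can then be used to decode the bytes yourself instead of letting the parser guess from the header.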

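2. Parsing pages with Jsoup.parse() from many concurrent threads easily consumes a large amount of memory.

I dumped the heap with jmap and analyzed it with MemoryAnalyzer. Several threads were reading data from the response, and when that data was subsequently written out, java.io.ByteArrayOutputStream had allocated a large amount of memory.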

Exception in thread "pool-1-thread-879" java.lang.OutOfMemoryError: Java heap space

at java.util.Arrays.copyOfRange(Arrays.java:2694)

at java.lang.String.<init>(String.java:203)

at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:561)

at java.nio.CharBuffer.toString(CharBuffer.java:1201)

at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:121)

at org.jsoup.helper.DataUtil.load(DataUtil.java:54)

at org.jsoup.Jsoup.parse(Jsoup.java:118)

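3. With many concurrent threads the program holds many HTTP connections and uses a lot of memory. If persistent connections are not needed, setting httpget.setHeader("Connection", "close"); makes the program release connections and memory much faster.

Summary: the servers being visited are completely unknown and may not even be web servers, so the returned content is unpredictable. That is why errors such as "cannot be parsed" and "cannot determine the content read from the stream" occur.

Solution: the business logic does not need the full page, so in the end I wrapped my own read interface with a fixed maximum read size, which sidesteps these problems.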
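
The source does not show the author's read interface. As a rough sketch of the idea, the helper below reads at most a fixed number of bytes from any InputStream and then stops, so one oversized or endless response cannot exhaust the heap. The class name BoundedReader, the method readAtMost, and the buffer sizes are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper illustrating a size-capped read.
final class BoundedReader {
    private BoundedReader() {}

    // Reads up to maxBytes from in and returns them; never buffers more than maxBytes.
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        if (maxBytes < 0) {
            throw new IllegalArgumentException("maxBytes must be >= 0");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.min(maxBytes, 8192));
        byte[] buf = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before the cap
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

In a crawler like this one, the stream would presumably come from the HTTP response entity, and the caller would close the stream and release the connection once the capped read returns.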

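II. Problems encountered with mongo 3.0

1. The mongo client object already has a built-in connection pool. When the pool cannot satisfy demand, mongo throws errors such as a wait timeout or too many waiting threads.

After I set the maximum number of connections to 100, along with the timeout, the wait queue size, and other options, the following errors appeared under system load.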

1.Exception in thread "pool-1-thread-199" com.mongodb.MongoTimeoutException: Timeout waiting for a pooled item after 120000 MILLISECONDS

2.com.mongodb.MongoTimeoutException: Timeout waiting for a pooled item after 120000 MILLISECONDS

3.Exception in thread "pool-1-thread-281" com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message

Caused by: java.net.SocketTimeoutException: Read timed out

4.Exception in thread "pool-1-thread-3657" com.mongodb.MongoWaitQueueFullException: Too many threads are already waiting for a connection. Max number of threads (maxWaitQueueSize) of 500 has been exceeded.

5.com.mongodb.MongoWaitQueueFullException: Too many threads are already waiting for a connection. Max number of threads (maxWaitQueueSize) of 500 has been exceeded.

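2. Keeping it simple: setting only the maximum number of connections to 200 was enough to meet my system's needs (by default there is also a wait queue of 5 times connectionsPerHost, a wait timeout, and so on).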

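import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

mongoClient = new MongoClient(new ServerAddress(host, port), new MongoClientOptions.Builder()
        .socketKeepAlive(true)      // keep the socket connection alive
        .connectionsPerHost(200)    // maximum number of connections
        .minConnectionsPerHost(20)  // minimum number of connections
        .build());
MongoCollection<Document> collection = mongoClient.getDatabase("mydb").getCollection("test");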
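
With 200 connections and the default multiplier of 5, up to 1000 threads can queue for a connection before MongoWaitQueueFullException is thrown. The thread names in the errors above (for example pool-1-thread-3657) suggest the crawler can run more threads than that. One way to stay under the limit, which is not something the source does, is to cap how many threads may enter the database layer at once. The class DbGate below is a hypothetical sketch of that idea using a plain Semaphore.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical gate that limits how many threads run a database call at the same time.
final class DbGate {
    private final Semaphore permits;

    DbGate(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Blocks until a permit is free, runs op, and always returns the permit.
    <T> T call(Callable<T> op) throws Exception {
        permits.acquire();
        try {
            return op.call();
        } finally {
            permits.release();
        }
    }

    // Number of permits currently free; useful for monitoring.
    int available() {
        return permits.availablePermits();
    }
}
```

A gate sized at or below connections plus wait queue capacity means excess threads block in the application instead of failing inside the driver.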

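3. The mongo log file needs to be rotated periodically; otherwise a single file grows very large. Running the following command from a scheduled task is enough.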

mongoClient.getDatabase("admin").runCommand(new Document("logRotate",1));
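
The source does not say which scheduler was used. A minimal sketch with the JDK's ScheduledExecutorService follows; the class LogRotateScheduler and its method names are hypothetical, and the rotation itself is passed in as a Runnable so the sketch stays independent of the driver. The task is wrapped in a try/catch because scheduleAtFixedRate cancels all later runs once one execution throws.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper that runs a rotation task at a fixed rate.
final class LogRotateScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // Schedules rotate to run every period, starting one period from now.
    ScheduledFuture<?> start(Runnable rotate, long period, TimeUnit unit) {
        Runnable safe = () -> {
            try {
                rotate.run();
            } catch (RuntimeException e) {
                // Swallow and report, so one failed rotation does not stop future runs.
                System.err.println("log rotation failed: " + e);
            }
        };
        return executor.scheduleAtFixedRate(safe, period, period, unit);
    }

    // Stops the scheduler thread.
    void stop() {
        executor.shutdownNow();
    }
}
```

The Runnable passed to start would be the logRotate command shown above, with a period such as one day.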

4. mongo usage examples

Document filterObject = new Document();
filterObject.put("list", new Document("$slice", -1)); // return only the last Document in the array
collection.find(new Document("key", value))  // query filter
        .projection(filterObject)            // fields to return
        .limit(1);

FindIterable<Document> findIterable = collection.find(new Document("lday", lToday)
                .append("lstatus", 1))
        .sort(new Document("_id", 1))
        .projection(new Document("_id", 0)
                .append("_id2", 1))
        .skip(i * 100000)
        .limit(100000);
MongoCursor<Document> cursor = findIterable.iterator();
try {
    while (cursor.hasNext()) {
        Document docItem = cursor.next();
        // The projection above excludes _id, so read the projected field _id2.
        value = docItem.get("_id2");
    }
} catch (Exception e) {
    log.error("Failed while iterating the cursor", e);
} finally {
    cursor.close();
}

III. c3p0 warnings

WARN [BasicResourcePool] com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@37bdccae -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (10). Last acquisition attempt exception:

com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.

The fix suggested online is to disable the statement cache by setting maxStatements=0, but it was already 0 at initialization.

Initializing c3p0 pool... com.mchange.v2.c3p0.ComboPooledDataSource [ acquireIncrement -> 3, acquireRetryAttempts -> 10, acquireRetryDelay -> 1000, autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false, checkoutTimeout -> 0, connectionCustomizerClassName -> null,

connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester, dataSourceName -> 1opjr8a9bj0w0iw1dvkk91|e3bc723, debugUnreturnedConnectionStackTraces -> false, description -> null, driverClass -> com.mysql.jdbc.Driver, factoryClassLocation -> null, forceIgnoreUnresolvedTransactions -> false,

identityToken -> 1opjr8a9bj0w0iw1dvkk91|e3bc723, idleConnectionTestPeriod -> 0, initialPoolSize -> 5, jdbcUrl -> jdbc:mysql://***?useUnicode=true&characterEncoding=UTF-8, lastAcquisitionFailureDefaultUser -> null, maxAdministrativeTaskTime -> 0, maxConnectionAge -> 0, maxIdleTime -> 60, maxIdleTimeExcessConnections -> 0,

maxPoolSize -> 20, maxStatements -> 100, maxStatementsPerConnection -> 0, minPoolSize -> 3, numHelperThreads -> 3, numThreadsAwaitingCheckoutDefaultUser -> 0, preferredTestQuery -> null, properties -> {user=******, password=******}, propertyCycle -> 0, testConnectionOnCheckin -> false, testConnectionOnCheckout -> false,

unreturnedConnectionTimeout -> 0, usesTraditionalReflectiveProxies -> false ]

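After reading the configuration documentation and going by the warning, I tried setting idleConnectionTestPeriod, and the warning has not appeared again so far.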

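import com.mchange.v2.c3p0.DataSources;
import javax.sql.DataSource;
import java.util.HashMap;
import java.util.Map;

DataSource ds_unpooled = DataSources.unpooledDataSource(url, userName, password);
Map<String, Object> pool_conf = new HashMap<>();
// maximum number of connections in the pool
pool_conf.put("maxPoolSize", 20);
// maximum idle time, in seconds, before a connection is discarded
pool_conf.put("maxIdleTime", 60);
// disable the statement cache
pool_conf.put("maxStatements", 0);
// test idle connections in the pool every 600 seconds
pool_conf.put("idleConnectionTestPeriod", 600);
ds_pooled = DataSources.pooledDataSource(ds_unpooled, pool_conf);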

Original article: https://blog.csdn.net/weixin_42510446/article/details/113998886