Crawling JD.com Search Results into a Local ElasticSearch

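1. Preface

Try to learn from the official documentation and other people's blog posts first, then reinforce that with video tutorials.

Environment: Linux (CentOS 7), JDK 1.8, ElasticSearch 6.8, Kibana 6.8, and the ik analyzer 6.8. (Ideally all three ElasticSearch components use the same version.)

Source code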

https://github.com/spreoW/ElasticSearch-Spider

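2. Create the Java project and add the dependencies

Note: the complete dependency list is in the Git repository linked above, if you are interested.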

<!-- jsoup: used to fetch and parse the web page -->
<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.10.2</version>
</dependency>

<!-- ElasticSearch client -->
<dependency>
	<groupId>org.elasticsearch.client</groupId>
	<artifactId>elasticsearch-rest-high-level-client</artifactId>
	<version>6.8.10</version>
</dependency>

<!-- JSON conversion utility -->
<dependency>
	<groupId>com.alibaba</groupId>
	<artifactId>fastjson</artifactId>
	<version>1.2.3</version>
</dependency>

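3. Write the HTML parsing utility class

Notes

  1. Annotate the class with @Component so that Spring registers it as a bean.
  2. Skip the current loop iteration when the image URL is empty (a small optimization).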
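import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Component;

@Component
public class HtmlParseUtil {

    // Fetches the JD search result page for the keyword and builds one Goods per list item
    public List<Goods> prase(String keyword) throws Exception {
        // URL-encode the keyword so that spaces and non-ASCII keywords are safe in the query string
        String url = "https://search.jd.com/Search?keyword=" + URLEncoder.encode(keyword, "UTF-8") + "&enc=utf-8";
        Document document = Jsoup.parse(new URL(url), 3000);
        Element goodsList = document.getElementById("J_goodsList");
        List<Goods> list = new ArrayList<>();
        // The container is absent if the page layout changed or the request was blocked
        if (goodsList == null) {
            return list;
        }
        Elements elements = goodsList.getElementsByTag("li");
        for (Element element : elements) {
            String image = element.getElementsByTag("img").eq(0).attr("src");
            // Skip this item when it has no image URL
            if (image == null || image.length() == 0) {
                continue;
            }
            Goods goods = new Goods();
            String price = element.getElementsByClass("p-price").eq(0).text();
            String title = element.getElementsByClass("p-name").text();
            goods.setImage(image);
            goods.setPrice(price);
            goods.setTitle(title);
            list.add(goods);
        }
        return list;
    }
}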
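
The article does not show the Goods class that HtmlParseUtil fills in. Judging only from the setters called above (setImage, setPrice, setTitle), a minimal sketch could look like the following. The String field types and the hand-written getters and setters are assumptions; the version in the repository may differ.

```java
// Hypothetical reconstruction of the Goods bean. The field names are inferred
// from the setters used in HtmlParseUtil; everything else is an assumption.
// In a real project this would normally be a public class in its own file.
class Goods {
    private String image; // URL of the product image
    private String price; // price text exactly as scraped from the page
    private String title; // product title

    // Public getters are needed so that a JSON serializer can read the fields
    public String getImage() { return image; }
    public void setImage(String image) { this.image = image; }

    public String getPrice() { return price; }
    public void setPrice(String price) { this.price = price; }

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public static void main(String[] args) {
        Goods goods = new Goods();
        goods.setTitle("Java in Action");
        goods.setPrice("99.00");
        goods.setImage("//img.example.com/1.jpg");
        System.out.println(goods.getTitle() + " / " + goods.getPrice());
    }
}
```

Keeping the price as a String matches what the parser stores (the raw text of the p-price element); converting it to a number would be a separate step.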

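4. Write the Service layer

Notes

bulkRequest.add(new IndexRequest("jd_goods").type("goods")
        .source(JSON.toJSONString(goodsList.get(i)), XContentType.JSON));

1. On 6.x the type must be specified; on 7.x it is no longer needed.
2. Always pass source(...) a JSON String or a Map. If you pass a plain object instead, the indexed document cannot be matched by either exact or fuzzy searches. That is why the code uses JSON.toJSONString() rather than JSON.toJSON().

The second point is an easy trap to fall into: always use JSON.toJSONString().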

Object json = JSON.toJSON(user);
String jsonString = JSON.toJSONString(user);
System.out.println(json);
System.out.println(jsonString);

// The return types differ (Object vs. String), but the printed output is identical.
{"age":33,"username":"zhangsan"}
{"age":33,"username":"zhangsan"}
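
One likely reason the distinction matters even though both values print the same: Java chooses an overload from the static type of the argument, and JSON.toJSON() is declared to return Object. IndexRequest offers source(String, XContentType) and source(Map, XContentType), but also a varargs source(Object...) that treats its arguments as alternating field names and values. The stand-in methods below are hypothetical and only mimic that overload shape using the JDK, so the sketch runs without ElasticSearch or fastjson. It shows which overload an Object-typed argument ends up in.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the overload shape of IndexRequest.source(...); this is NOT the real client.
// The second parameter is a plain String here instead of XContentType.
class SourceOverloadDemo {
    static String source(String json, String contentType) { return "String overload"; }

    static String source(Map<String, ?> fields, String contentType) { return "Map overload"; }

    static String source(Object... fieldNameValuePairs) { return "varargs overload"; }

    public static void main(String[] args) {
        Map<String, Object> user = new HashMap<>();
        user.put("username", "zhangsan");
        user.put("age", 33);

        // Same static type as the return value of JSON.toJSON(...)
        Object asObject = user;
        // Same static type as the return value of JSON.toJSONString(...)
        String asString = "{\"age\":33,\"username\":\"zhangsan\"}";

        System.out.println(source(asString, "json")); // String overload
        System.out.println(source(user, "json"));     // Map overload
        System.out.println(source(asObject, "json")); // varargs overload
    }
}
```

The third call compiles without any warning, which is what makes the mistake easy to miss: the object and the content type are simply taken as one field name and one field value.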
@Service
public class ContentsService {

    @Autowired
    private HtmlParseUtil htmlParseUtil;
    @Autowired
    private RestHighLevelClient restHighLevelClient;

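    /**
     * Crawls the search results and bulk-inserts them into ElasticSearch.
     * @param keyword search keyword
     * @return true if every item was indexed without failures
     * @throws Exception if crawling or the bulk request fails
     */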
    public Boolean parse(String keyword) throws Exception {
        // Crawl the data
        List<Goods> goodsList = htmlParseUtil.prase(keyword);
        // A bulk request without any actions fails validation, so stop early when nothing was crawled
        if (goodsList.isEmpty()) {
            return false;
        }
        BulkRequest bulkRequest = new BulkRequest();
        bulkRequest.timeout("2m");
        for (Goods goods : goodsList) {
            // 6.x requires a type; the source must be a JSON String (or a Map)
            bulkRequest.add(new IndexRequest("jd_goods").type("goods").source(JSON.toJSONString(goods), XContentType.JSON));
        }
        BulkResponse bulk = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
        return !bulk.hasFailures();
    }

    /**
     * Searches ElasticSearch for the keyword and returns one page of hits as a List<Map<>>.
     * @param keyword  search keyword
     * @param pageNo   page number, starting at 1
     * @param pageSize number of hits per page
     * @return the source of each hit on the requested page
     */
    public List<Map<String,Object>> searchPage(String keyword,int pageNo,int pageSize) throws IOException {

        SearchRequest searchRequest = new SearchRequest("jd_goods");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();

        // from() expects a document offset, not a page number, so convert the 1-based page number
        searchSourceBuilder.from(Math.max(pageNo - 1, 0) * pageSize);
        searchSourceBuilder.size(pageSize);

        // Build an exact (term) query
        TermQueryBuilder termQueryBuilder = QueryBuilders.termQuery("title", keyword);
        searchSourceBuilder.query(termQueryBuilder);
        searchSourceBuilder.timeout(new TimeValue(60, TimeUnit.SECONDS));

        // Execute the search
        searchRequest.source(searchSourceBuilder);
        SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

        SearchHit[] hits = searchResponse.getHits().getHits();
        List<Map<String,Object>> searchList = new ArrayList<>();
        for (SearchHit sh : hits){
            searchList.add(sh.getSourceAsMap());
        }
        return searchList;
    }
}
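
ElasticSearch's from parameter is a zero-based document offset rather than a page number, so a 1-based pageNo such as the one in /search/java/1/30 has to be converted before it is handed to searchSourceBuilder.from(...). A minimal sketch of that conversion, using a hypothetical helper class PageOffset that is not part of the original project:

```java
// Hypothetical helper; the class and method names are illustrative only.
class PageOffset {

    // Converts a 1-based page number into the zero-based offset expected by from().
    static int offset(int pageNo, int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be at least 1");
        }
        int page = Math.max(pageNo, 1); // treat 0 or negative page numbers as the first page
        return (page - 1) * pageSize;
    }

    public static void main(String[] args) {
        System.out.println(offset(1, 30)); // 0  -> hits 1..30
        System.out.println(offset(2, 30)); // 30 -> hits 31..60
        System.out.println(offset(3, 10)); // 20 -> hits 21..30
    }
}
```

Note that with default index settings ElasticSearch rejects requests where from + size exceeds 10000 (index.max_result_window), so this style of paging is meant for the first pages of a result set.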

5. Write the Controller

Note: the goal here is to keep it concise.

@RestController
public class ContentsController {

    @Autowired
    private ContentsService contentsService;

    @GetMapping("/parse/{keyword}")
    public Boolean parse(@PathVariable("keyword") String keyword) throws Exception {
        return contentsService.parse(keyword);
    }

    @GetMapping("/search/{keyword}/{pageNo}/{pageSize}")
    public List<Map<String,Object>> searchPage(@PathVariable("keyword") String keyword,
                                               @PathVariable("pageNo")int pageNo,
                                               @PathVariable("pageSize")int pageSize) throws IOException {
        return contentsService.searchPage(keyword,pageNo,pageSize);
    }
}

6. Testing

  1. Start the Spring Boot application.
  2. Request http://localhost:8080/parse/java to crawl the results and insert them into ElasticSearch.
  3. Run a paged search: http://localhost:8080/search/java/1/30
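
Because both endpoints take the keyword as a path variable, keywords that contain spaces or non-ASCII characters should be percent-encoded before being placed in the URL. The following sketch builds the two test URLs with the JDK only; the class EndpointUrls and its methods are hypothetical, and the base URL is the one used in the steps above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for building the test URLs; not part of the original project.
class EndpointUrls {

    // Percent-encodes a value for use as a single path segment.
    static String encodeSegment(String value) {
        try {
            // URLEncoder targets form data and writes a space as '+',
            // but inside a path segment a space has to be %20.
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every JVM
            throw new IllegalStateException(e);
        }
    }

    static String parseUrl(String base, String keyword) {
        return base + "/parse/" + encodeSegment(keyword);
    }

    static String searchUrl(String base, String keyword, int pageNo, int pageSize) {
        return base + "/search/" + encodeSegment(keyword) + "/" + pageNo + "/" + pageSize;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080";
        System.out.println(parseUrl(base, "java"));
        System.out.println(searchUrl(base, "java", 1, 30));
        System.out.println(searchUrl(base, "spring boot", 1, 30));
        // "\u624b\u673a" is the Chinese word for "mobile phone"
        System.out.println(searchUrl(base, "\u624b\u673a", 1, 30));
    }
}
```

For the keyword java the output is exactly the two URLs listed in the steps above; the last two lines show how a space and a Chinese keyword are encoded.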

7. Results

1. parse: http://localhost:8080/parse/java
[Screenshot of the response]
2. search: http://localhost:8080/search/java/1/30
[Screenshot of the response]


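Copyright notice: This is an original article by qq_40205337, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.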