Scraping Weibo Hot Search with Pure Shell and curl

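The Weibo hot search page has a simple, clear structure, so there is no need for Python or the like; plain shell with curl is enough. The script below fetches the 1 pinned entry and the top 10 entries, filtering out ads: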

#!/bin/sh
# Fetch the page and keep the 150 lines that follow the hot-search container div.
s=$(curl -s 'https://s.weibo.com/top/summary?cate=realtimehot' | grep -A 150 '<div class="data" id="pl_top_realtimehot">')

# Pinned entry: its title sits two lines below the icon-top cell.
top="Top.$(echo "$s" | grep -A 2 '<td class="td-01"><i class="icon-top"></i></td>' | tail -n 1 | grep -o '>.*<' | awk -F '[><]' '{print $2}')"
echo "$top"

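i=1   # rank as shown on the page
j=1   # rank printed once ads are dropped
while [ "$i" -le 10 ]
do
    # Keep lines 2-5 after the rank cell: the title comes first, the ad marker (if any) last.
    l=$(echo "$s" | grep -A 5 "<td class=\"td-01 ranktop\">$i</td>" | tail -n 4)
    # Print the entry only when the last line lacks the ad marker.
    # ('[ $? == 1 ]' is not portable: '==' is rejected by some /bin/sh implementations.)
    if ! echo "$l" | tail -n 1 | grep -q 'icon-txt-recommend'; then
        echo "$l" | head -n 1 | grep -o '>.*<' | awk -F '[><]' -v n="$j" '{print n "." $2}'
        j=$((j+1))
    fi
    i=$((i+1))
done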
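
The two building blocks of the script, pulling a title out of a line of HTML and detecting the ad marker, can be tried in isolation without hitting the network. The following is only a minimal sketch: the helper names extract_title and is_ad are hypothetical and not part of the script, and the sample lines are made up, so the real page markup may differ.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of the original script.

# Print the text between the first '>' on the line and the next '<' or '>'.
extract_title() {
    printf '%s\n' "$1" | grep -o '>.*<' | awk -F '[><]' '{print $2}'
}

# Succeed (exit status 0) when the line carries the recommendation icon
# class that the script treats as an ad marker.
is_ad() {
    printf '%s\n' "$1" | grep -q 'icon-txt-recommend'
}

# Made-up sample lines; the real page markup may differ.
title_line='<a href="/weibo?q=example" target="_blank">Example topic</a>'
ad_line='<td class="td-03"><i class="icon-txt icon-txt-recommend"></i></td>'

# A normal entry: not an ad, so its title is printed.
if ! is_ad "$title_line"; then
    extract_title "$title_line"
fi

# An ad entry: detected and skipped.
if is_ad "$ad_line"; then
    echo "skipped ad"
fi
```

Because grep -o '>.*<' is greedy, it captures everything from the first '>' to the last '<' on the line; awk then splits on both characters and keeps the second field, which is the link text. This only works while the title stays on a single line, so a change in the page layout would break it.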

Original link: https://blog.csdn.net/u012440550/article/details/106862539