Python Crawler: Multithreaded Scraping of Proxy Servers (2)

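# Start 20 validation threads: split the scraped proxies into 20 chunks, one chunk per thread
chunkSize = (len(rawProxyList) + 19) // 20  # ceiling division, so no proxy is left out
for i in range(20):
    t = ProxyCheck(rawProxyList[chunkSize * i:chunkSize * (i + 1)])
    checkThreads.append(t)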

# Start all validation threads
for t in checkThreads:
    t.start()

# Wait for every validation thread to finish
for t in checkThreads:
    t.join()

print('.' * 10 + "%s proxies passed validation in total" % len(checkedProxyList) + '.' * 10)

# Persist the results, sorted by response time (fastest first)
with open("proxy_list.txt", 'w') as f:
    for proxy in sorted(checkedProxyList, key=lambda p: p[3]):
        print("checked proxy is: %s:%s\t%s\t%s" % (proxy[0], proxy[1], proxy[2], proxy[3]))
        f.write("%s:%s\t%s\t%s\n" % (proxy[0], proxy[1], proxy[2], proxy[3]))

Partial log:

Target site:
Target site:
Target site:
Target site:
Target site:
Target site:
Target site:
Target site:
..........800 proxies scraped in total..........
..........478 proxies passed validation in total.........
173.213.113.111:8089    United States  0.341555833817
173.213.113.111:3128    United States  0.347477912903
210.101.131.232:8080    South Korea, Seoul  0.418715000153
.....
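
The chunk-and-join pattern in the listing can be tried in isolation. The following is only a minimal sketch, written for Python 3, and not the article's actual validator: `split_into_chunks`, `ChunkChecker`, `check_all` and the `check_fn` callback are hypothetical names introduced for illustration, standing in for the `ProxyCheck` class whose real validation logic is defined earlier in this series. A lock guards the shared result list so that the sketch does not rely on interpreter-specific behavior.

```python
import threading


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks consecutive slices (trailing slices may be empty)."""
    # Ceiling division: the same idea as (len + 19) / 20 in the listing.
    chunk_size = (len(items) + n_chunks - 1) // n_chunks
    return [items[chunk_size * i:chunk_size * (i + 1)] for i in range(n_chunks)]


class ChunkChecker(threading.Thread):
    """Hypothetical stand-in for ProxyCheck: keeps the items accepted by check_fn."""

    def __init__(self, proxies, check_fn, results, lock):
        threading.Thread.__init__(self)
        self.proxies = proxies
        self.check_fn = check_fn
        self.results = results
        self.lock = lock

    def run(self):
        for proxy in self.proxies:
            if self.check_fn(proxy):
                # Serialize appends to the list shared by all threads.
                with self.lock:
                    self.results.append(proxy)


def check_all(proxies, check_fn, n_threads=20):
    """Validate proxies with n_threads workers and return the ones that passed."""
    results = []
    lock = threading.Lock()
    threads = [ChunkChecker(chunk, check_fn, results, lock)
               for chunk in split_into_chunks(proxies, n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `check_all(rawProxyList, some_check)` with any function that returns True or False for a single proxy would play the role of the 20-thread validation step. Because threads finish in no particular order, the returned list should be sorted before it is written out, as the listing does.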
