Sharing a Simple, Easy-to-Use Node.js Crawler Framework (2)

$ crawl-pet --help

Crawl-pet options help:

    -u, --url       string          Destination address
    -o, --outdir    string          Save the directory, Default use pwd
    -r, --restart                   Reload all page
    --clear                         Clear queue
    --save          string          Save file rules following options
                                    = url: Save the path consistent with url
                                    = simple: Save file in the project path
                                    = group: Save 500 files in one folder
    --types         array           Limit download file type
    --limit         number=5        Concurrency limit
    --sleep         number=200      Concurrent interval
    --timeout       number=180000   Queue timeout
    --proxy         string          Set up proxy
    --parser        string          Set crawl rule, it's a js file path!
                                    The default load the parser.js file in the project path
    --maxsize       number          Limit the maximum size of the download file
    --minwidth      number          Limit the minimum width of the download file
    --minheight     number          Limit the minimum height of the download file
    -i, --info                      View the configuration file
    -l, --list      array           View the queue data
                                    e.g. [page/down/queue],0,-1
    -f, --find      array           Find the download URL of the local file
    --json                          Print result to json format
    -v, --version                   View version
    -h, --help                      View help

Finally, here is a sample configuration for crawling reddit:

$ crawl-pet -u https://www.reddit.com/r/funny/ -o reddit --save group

info.json

{
    "url": "https://www.reddit.com/r/funny/",
    "outdir": ".",
    "save": "group",
    "types": "",
    "limit": "5",
    "parser": "my_parser.js",
    "sleep": "200",
    "timeout": "180000",
    "proxy": "",
    "maxsize": 0,
    "minwidth": 0,
    "minheight": 0,
    "cookie": "over18=1"
}
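Note that info.json stores limit, sleep and timeout as strings, while maxsize, minwidth and minheight are numbers. If you read this file from your own script, a small normalization step may be convenient. The sketch below is only an illustration: normalizeConfig and NUMERIC_KEYS are hypothetical names that are not part of crawl-pet, and how crawl-pet itself handles these values is not covered in this post.

```javascript
// The same content as the info.json shown above, inlined so the sketch runs on its own.
const infoJson = `{
    "url": "https://www.reddit.com/r/funny/",
    "outdir": ".",
    "save": "group",
    "types": "",
    "limit": "5",
    "parser": "my_parser.js",
    "sleep": "200",
    "timeout": "180000",
    "proxy": "",
    "maxsize": 0,
    "minwidth": 0,
    "minheight": 0,
    "cookie": "over18=1"
}`

// Hypothetical list of keys expected to hold numbers (an assumption of this sketch).
const NUMERIC_KEYS = ['limit', 'sleep', 'timeout', 'maxsize', 'minwidth', 'minheight']

// Hypothetical helper: returns a copy of the config with numeric keys coerced to numbers.
function normalizeConfig(raw) {
    const config = Object.assign({}, raw)
    for (const key of NUMERIC_KEYS) {
        const value = Number(config[key])
        if (!Number.isFinite(value)) {
            throw new Error(`"${key}" is not numeric: ${config[key]}`)
        }
        config[key] = value
    }
    return config
}

const config = normalizeConfig(JSON.parse(infoJson))
console.log(config.limit, config.sleep, config.timeout)
```

All other keys (url, save, cookie and so on) are passed through unchanged.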

my_parser.js

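exports.body = function(url, body, response, crawler_handle) {
    // Collect the value of every data-url, href and src attribute in the page
    const re = /\b(data-url|href|src)\s*=\s*["']([^'"#]+)/ig
    var m = null
    while (m = re.exec(body)){
        let href = m[2]
        // Skip thumbnails, user pages, icons and static resources
        if (/thumb|user|icon|\.(css|json|js|xml|svg)\b/i.test(href)) {
            continue
        }
        // Images and videos go to the download queue
        if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
            crawler_handle.addDown(href)
            continue
        }
        // Other subreddit links go to the page queue
        if(/reddit\.com\/r\//i.test(href)){
            crawler_handle.addPage(href)
        }
    }
    // Tell the crawler that this page has been processed
    crawler_handle.over()
}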
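To check the filtering rules without hitting reddit, the same logic can be run against a small HTML string. This is a minimal sketch under a few assumptions: parseBody is just the function above under a local name, makeStubHandle is a hypothetical stand-in that only records calls to the three methods used in this post (addPage, addDown, over), and the example.com URLs are made up. It is not the real crawl-pet handle.

```javascript
// The parser logic from my_parser.js, under a local name so it can run offline.
function parseBody(url, body, response, crawler_handle) {
    const re = /\b(data-url|href|src)\s*=\s*["']([^'"#]+)/ig
    let m = null
    while ((m = re.exec(body)) !== null) {
        const href = m[2]
        if (/thumb|user|icon|\.(css|json|js|xml|svg)\b/i.test(href)) {
            continue
        }
        if (/\.(png|gif|jpg|jpeg|mp4)\b/i.test(href)) {
            crawler_handle.addDown(href)
            continue
        }
        if (/reddit\.com\/r\//i.test(href)) {
            crawler_handle.addPage(href)
        }
    }
    crawler_handle.over()
}

// Hypothetical stub: records what the parser asks for instead of crawling.
function makeStubHandle() {
    const calls = { pages: [], downs: [], over: 0 }
    return {
        calls,
        addPage(u) { calls.pages.push(u) },
        addDown(u) { calls.downs.push(u) },
        over() { calls.over += 1 }
    }
}

// Made-up page fragment covering each branch of the parser.
const sampleHtml = [
    '<a href="https://www.reddit.com/r/funny/top/">top</a>',
    '<img src="https://i.example.com/cat.jpg">',
    '<img src="https://i.example.com/thumb/cat_small.jpg">',
    '<link href="https://www.example.com/style.css">',
    '<a href="https://www.reddit.com/user/someone">profile</a>',
    '<div data-url="https://v.example.com/clip.mp4"></div>'
].join('\n')

const handle = makeStubHandle()
parseBody('https://www.reddit.com/r/funny/', sampleHtml, null, handle)
console.log(handle.calls)
```

With this input, the subreddit link is queued as a page, cat.jpg and clip.mp4 are queued as downloads, and the thumbnail, stylesheet and user profile links are skipped.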

If you are familiar with reddit, that is all there is to it.

The GitHub repository is here: https://github.com/wl879/Crawl-pet



When reposting, please cite the source: https://www.heiqu.com/wwyxdw.html