.NET Core 实现定时抓取网站文章并发送到邮箱

大家好,我是晓晨。许久没有更新博客了,今天给大家带来一篇干货型文章,一个每隔5分钟抓取博客园首页文章信息并在第二天的上午9点发送到你的邮箱的小工具。比如我在2018年2月14日,9点来到公司我就会收到一封邮件,是2018年2月13日的博客园首页的文章信息。写这个小工具的初衷是,一直有看博客的习惯,但是最近由于各种原因吧,可能几天都不会看一下博客,要是中途错过了什么好文可是十分心疼的哈哈。所以做了个工具,每天归档发到邮箱,妈妈再也不会担心我错过好的文章了。为什么只抓取首页?因为博客园首页文章的质量相对来说高一些。

准备

作为一个持续运行的工具,没有日志记录怎么行,我准备使用的是NLog来记录日志,它有个日志归档功能非常不错。在http请求中,由于网络问题吧可能会出现失败的情况,这里我使用Polly来进行Retry。使用HtmlAgilityPack来解析网页,需要对xpath有一定了解。下面是详细说明:

组件名 用途 github
NLog   记录日志   https://github.com/NLog/NLog  
Polly   当http请求失败,进行重试   https://github.com/App-vNext/Polly  
HtmlAgilityPack   网页解析   https://github.com/zzzprojects/html-agility-pack  
MailKit   发送邮件   https://github.com/jstedfast/MailKit  

有不了解的组件,可以通过访问github获取资料。

参考文章

https://www.jb51.net/article/112595.htm

获取&解析博客园首页数据

我是用的是HttpWebRequest来进行http请求,下面分享一下我简单封装的类库:

using System; using System.IO; using System.Net; using System.Text; namespace CnBlogSubscribeTool { /// <summary> /// Simple Http Request Class /// .NET Framework >= 4.0 /// Author:stulzq /// CreatedTime:2017-12-12 15:54:47 /// </summary> public class HttpUtil { static HttpUtil() { //Set connection limit ,Default limit is 2 ServicePointManager.DefaultConnectionLimit = 1024; } /// <summary> /// Default Timeout 20s /// </summary> public static int DefaultTimeout = 20000; /// <summary> /// Is Auto Redirect /// </summary> public static bool DefalutAllowAutoRedirect = true; /// <summary> /// Default Encoding /// </summary> public static Encoding DefaultEncoding = Encoding.UTF8; /// <summary> /// Default UserAgent /// </summary> public static string DefaultUserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36" ; /// <summary> /// Default Referer /// </summary> public static string DefaultReferer = ""; /// <summary> /// httpget request /// </summary> /// <param>Internet Address</param> /// <returns>string</returns> public static string GetString(string url) { var stream = GetStream(url); string result; using (StreamReader sr = new StreamReader(stream)) { result = sr.ReadToEnd(); } return result; } /// <summary> /// httppost request /// </summary> /// <param>Internet Address</param> /// <param>Post request data</param> /// <returns>string</returns> public static string PostString(string url, string postData) { var stream = PostStream(url, postData); string result; using (StreamReader sr = new StreamReader(stream)) { result = sr.ReadToEnd(); } return result; } /// <summary> /// Create Response /// </summary> /// <param></param> /// <param>Is post Request</param> /// <param>Post request data</param> /// <returns></returns> public static WebResponse CreateResponse(string url, bool post, string postData = "") { var httpWebRequest = WebRequest.CreateHttp(url); httpWebRequest.Timeout = DefaultTimeout; httpWebRequest.AllowAutoRedirect = DefalutAllowAutoRedirect; httpWebRequest.UserAgent = DefaultUserAgent; httpWebRequest.Referer = DefaultReferer; if (post) { var data = DefaultEncoding.GetBytes(postData); httpWebRequest.Method = "POST"; httpWebRequest.ContentType = "application/x-www-form-urlencoded;charset=utf-8"; httpWebRequest.ContentLength = data.Length; using (var stream = httpWebRequest.GetRequestStream()) { stream.Write(data, 0, data.Length); } } try { var response = httpWebRequest.GetResponse(); return response; } catch (Exception e) { throw new Exception(string.Format("Request error,url:{0},IsPost:{1},Data:{2},Message:{3}", url, post, postData, e.Message), e); } } /// <summary> /// http get request /// </summary> /// <param></param> /// <returns>Response Stream</returns> public static Stream GetStream(string url) { var stream = CreateResponse(url, false).GetResponseStream(); if (stream == null) { throw new Exception("Response error,the response stream is null"); } else { return stream; } } /// <summary> /// http post request /// </summary> /// <param></param> /// <param>post data</param> /// <returns>Response Stream</returns> public static Stream PostStream(string url, string postData) { var stream = CreateResponse(url, true, postData).GetResponseStream(); if (stream == null) { throw new Exception("Response error,the response stream is null"); } else { return stream; } } } }

获取首页数据

string res = HttpUtil.GetString(https://www.cnblogs.com);

.NET Core 实现定时抓取网站文章并发送到邮箱

解析数据

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:https://www.heiqu.com/wdssxz.html