Writing a Crawler with .NET Core to Scrape 电影天堂 (dy2018), Part 2

Getting the movie details

    private MovieInfo FillMovieInfoFormWeb(AngleSharp.Dom.IElement a, string onlineURL)
    {
        var movieHTML = HTTPHelper.GetHTMLByURL(onlineURL);
        var movieDoc = htmlParser.Parse(movieHTML);
        // Example page: http://www.dy2018.com/i/97462.html (analysed earlier, not repeated here)
        // The detailed description is in the element with id "Zoom"
        var zoom = movieDoc.GetElementById("Zoom");
        // Download links are in td elements with bgcolor='#fdfddf'; there may be several
        var lstDownLoadURL = movieDoc.QuerySelectorAll("[bgcolor='#fdfddf']");
        // The publish time is in a span with class='updatetime'
        var updatetime = movieDoc.QuerySelector("span.updatetime");
        var pubDate = DateTime.Now;
        if (updatetime != null && !string.IsNullOrEmpty(updatetime.InnerHtml))
        {
            // The text carries a "发布时间:" (publish time) prefix; strip it before parsing.
            // TryParse overwrites its out argument with DateTime.MinValue on failure,
            // so parse into a temporary and keep DateTime.Now as the fallback.
            DateTime parsed;
            if (DateTime.TryParse(updatetime.InnerHtml.Replace("发布时间:", ""), out parsed))
            {
                pubDate = parsed;
            }
        }
        var movieInfo = new MovieInfo()
        {
            // InnerHtml may still contain a font tag, so strip that as well
            MovieName = a.InnerHtml.Replace("<font color=\"#0c9000\">", "")
                                   .Replace("<font color=\" #0c9000\">", "")
                                   .Replace("</font>", ""),
            Dy2018OnlineUrl = onlineURL,
            // There may be no introduction, unlikely as that seems
            MovieIntro = zoom != null ? WebUtility.HtmlEncode(zoom.InnerHtml) : "暂无介绍...",
            // There may be no download links
            XunLeiDownLoadURLList = lstDownLoadURL != null
                ? lstDownLoadURL.Select(d => d.FirstElementChild.InnerHtml).ToList()
                : null,
            PubDate = pubDate,
        };
        return movieInfo;
    }
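One detail of the date handling is worth isolating: DateTime.TryParse writes DateTime.MinValue to its out argument when parsing fails, so a fallback value is safest kept in a separate variable. The following is a minimal sketch of that idea, not code from the crawler itself. The class name PubDateParser and the method ParseOrDefault are hypothetical, and handling the full-width colon variant of the label is my own assumption about what the page might contain.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of the original project).
public static class PubDateParser
{
    // Strips the "发布时间:" label and parses what remains.
    // Returns `fallback` when the text is empty or is not a valid date.
    public static DateTime ParseOrDefault(string rawText, DateTime fallback)
    {
        if (string.IsNullOrWhiteSpace(rawText))
        {
            return fallback;
        }

        // Assumption: the label may use either an ASCII or a full-width colon.
        string text = rawText.Replace("发布时间:", "")
                             .Replace("发布时间：", "")
                             .Trim();

        DateTime parsed;
        // Parse into a temporary so a failed parse cannot clobber the fallback.
        bool ok = DateTime.TryParse(text, CultureInfo.InvariantCulture,
                                    DateTimeStyles.None, out parsed);
        return ok ? parsed : fallback;
    }
}
```

Calling PubDateParser.ParseOrDefault(updatetime.InnerHtml, DateTime.Now) would then give either the parsed publish date or the current time, never DateTime.MinValue.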

HTTPHelper

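There is a small pitfall here: dy2018 pages are encoded as GB2312, and .NET Core does not support GB2312 out of the box, so calling Encoding.GetEncoding("GB2312") throws an exception.

The fix is to install the System.Text.Encoding.CodePages package manually (Install-Package System.Text.Encoding.CodePages) and then add Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) to the Configure method in Startup.cs. After that, Encoding.GetEncoding("GB2312") works as expected.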
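To see the registration step in isolation, here is a minimal sketch that registers the code pages provider and round-trips a string through GB2312. The class Gb2312Demo and its method RoundTrip are hypothetical names used only for illustration, and the sketch assumes the System.Text.Encoding.CodePages package is referenced by the project.

```csharp
using System;
using System.Text;

// Hypothetical demo class (not part of the original project).
public static class Gb2312Demo
{
    // Encodes `text` to GB2312 bytes and decodes it back.
    public static string RoundTrip(string text)
    {
        // Without this call, GetEncoding("GB2312") throws on .NET Core.
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding gb2312 = Encoding.GetEncoding("GB2312");
        byte[] bytes = gb2312.GetBytes(text);
        return gb2312.GetString(bytes);
    }
}
```

For text that GB2312 can represent, such as Gb2312Demo.RoundTrip("电影天堂"), the result equals the input. The registration only needs to happen once per process, before the first GetEncoding call.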

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;

    namespace Dy2018Crawler
    {
        public class HTTPHelper
        {
            public static HttpClient Client { get; } = new HttpClient();

            public static string GetHTMLByURL(string url)
            {
                try
                {
                    System.Net.WebRequest wRequest = System.Net.WebRequest.Create(url);
                    wRequest.ContentType = "text/html; charset=gb2312";
                    wRequest.Method = "GET";
                    wRequest.UseDefaultCredentials = true;
                    // Get the response instance.
                    var task = wRequest.GetResponseAsync();
                    System.Net.WebResponse wResp = task.Result;
                    System.IO.Stream respStream = wResp.GetResponseStream();
                    // dy2018 serves its pages encoded as GB2312
                    using (System.IO.StreamReader reader = new System.IO.StreamReader(respStream, Encoding.GetEncoding("GB2312")))
                    {
                        return reader.ReadToEnd();
                    }
                }
                catch (Exception ex)
                {
                    // Log the error and return an empty page so the caller can carry on
                    Console.WriteLine(ex.ToString());
                    return string.Empty;
                }
            }
        }
    }

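Implementing the scheduled task

For the scheduled task I use Pomelo.AspNetCore.TimedJob.

Pomelo.AspNetCore.TimedJob is a scheduled-job library implemented for .NET Core. It supports millisecond-level timers, reading schedule configuration from a database, and both synchronous and asynchronous jobs.

It is developed and maintained by AmamiyaYuuko, a well-known member of the .NET Core community and former Microsoft MVP (who stepped down as MVP after joining Microsoft). It does not appear to be open source, though; I will ask whether it can be opened up.

Various versions are available on NuGet; pick whichever you need. URL: https://www.nuget.org/packages/Pomelo.AspNetCore.TimedJob/1.1.0-rtm-10026

The author's own introduction: Timed Job - Pomelo扩展包系列 (Pomelo extension package series)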

Startup.cs code

To use it, first install the package: Install-Package Pomelo.AspNetCore.TimedJob -Pre

Then register the service in the ConfigureServices method of Startup.cs and enable it with a Use call in the Configure method.

    // This method gets called by the runtime. Use this method to add services to the container.
    public void ConfigureServices(IServiceCollection services)
    {
        // Add framework services.
        services.AddMvc();
        // Add TimedJob services
        services.AddTimedJob();
    }

    public void Configure(IApplicationBuilder app, IHostingEnvironment env, ILoggerFactory loggerFactory)
    {
        // Enable TimedJob
        app.UseTimedJob();

        if (env.IsDevelopment())
        {
            app.UseDeveloperExceptionPage();
            app.UseBrowserLink();
        }
        else
        {
            app.UseExceptionHandler("/Home/Error");
        }

        app.UseStaticFiles();

        app.UseMvc(routes =>
        {
            routes.MapRoute(
                name: "default",
                template: "{controller=Home}/{action=Index}/{id?}");
        });

        // Register the code pages provider so Encoding.GetEncoding("GB2312") works
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

Job code
