PHP main-content extraction: a demonstration of the principle

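Below is a simple demonstration of the principle behind main-content extraction in PHP. After studying how the Caijixia (采集侠) scraper works, I worked out its filtering rules. I will first write out the scraping part of the code, and of course there is a demo as well.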

For the demo, see "PHP main-content extraction demo: the extraction part".

Here is the PHP code:

<?php
include 'function.php'; // supplies is_utf8() and array_iconv()
header("Content-type: text/html; charset=utf-8");

$url = 'http://';
if (isset($_GET['q'])) {
    // $_GET is already URL-decoded; keep the raw URL for fetching and escape it only on output
    $url = trim($_GET['q']);
}
// Fetch only http(s) URLs so that local files cannot be read
if (preg_match('#^https?://.#i', $url)) {
    define('DOWN_TIME', microtime(true));
    $str = (string) @file_get_contents($url);
    $downtime = microtime(true) - DOWN_TIME;

    // Convert the encoding (assumes every non-UTF-8 page is GB2312)
    if (!is_utf8($str)) $str = array_iconv($str, 'gb2312', 'utf-8');

    define('START_TIME', microtime(true));
    // First remove JavaScript and style sheets
    $str = preg_replace('#<script[^>]*?>.*?</script>#si', '', $str);
    $str = preg_replace('#<style[^>]*?>.*?</style>#si', '', $str);
    // Strip all tags except <p> <br> <b> <strong> <img> <h1>~<h6> <i> <em> <span>
    $str = strip_tags($str, '<p><br><span><b><strong><img><h1><h2><h3><h4><h5><h6><i><em>');
    // Strip all tag attributes (<img> is skipped so that its src survives)
    //$str = preg_replace('#<([a-z1-6]+?)\s+?.*
    $str = preg_replace('#<(p|span|b|strong|h1|h2|h3|h4|h5|h6|i|em)\s+[^>]*>#i', '<$1>', $str);
    // Locate the main content
    /* to be studied */
}
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Main-content extraction principle - demo v1.0_enenba blog</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta content="main-content extraction principle, demo v1.0" />
<meta content="A study of the main-content extraction principle" />
</head>
<body>
<h3>This is a demo of the main-content extraction principle. It is not a real implementation and is for reference only. beta.1.0</h3>
<form method="get" action="">
<input type="text" name="q" size="120" value="<?= htmlspecialchars($url) ?>" /><input type="submit" value="Fetch content" />
</form>
<?php
if (isset($downtime)) {
    echo 'Download time: ' . $downtime . 's<br />';
    echo 'Processing time: ' . (microtime(true) - START_TIME) . 's';
    highlight_string($str);
}
?>
</body>
</html>
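The listing relies on two helpers, is_utf8() and array_iconv(), that live in function.php and are not shown in this post. The sketch below is only a guess at what they might look like, so the listing can be tried out. The function bodies, the default arguments, and the array handling are all assumptions, not the author's actual code.

```php
<?php
// Hypothetical stand-ins for the helpers in function.php (not shown in the post).

// Returns true when $str is valid UTF-8.
function is_utf8($str)
{
    // With the 'u' modifier, preg_match() returns false on invalid UTF-8 input.
    return preg_match('//u', $str) === 1;
}

// Converts a string, or every string inside a (possibly nested) array,
// from the $input encoding to the $output encoding.
function array_iconv($data, $input = 'gb2312', $output = 'utf-8')
{
    if (is_array($data)) {
        foreach ($data as $key => $value) {
            $data[$key] = array_iconv($value, $input, $output);
        }
        return $data;
    }
    if (!is_string($data)) {
        return $data;
    }
    // iconv() returns false when the input cannot be converted;
    // in that case the original bytes are returned unchanged.
    $converted = @iconv($input, $output, $data);
    return $converted === false ? $data : $converted;
}
```

The empty pattern with the 'u' modifier is a compact validity check that does not need the mbstring extension. Treating every non-UTF-8 page as GB2312, as the listing does, will mis-decode pages in other encodings.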

PS: This is not finished. It only does simple tag filtering. Next time I will write the main-content extraction part.
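To see what the simple tag filtering does, the steps from the listing can be wrapped in a function and run on a small string. This is a minimal sketch: the function name filter_tags() and the sample HTML are made up for illustration, the whitelist here includes <br> as the comment in the listing describes, and the attribute pattern uses [^>]* so that a match cannot run past the end of a tag.

```php
<?php
// Hypothetical wrapper around the filtering steps from the listing above.
function filter_tags($html)
{
    // Remove <script> and <style> blocks together with their contents.
    $html = preg_replace('#<script[^>]*>.*?</script>#si', '', $html);
    $html = preg_replace('#<style[^>]*>.*?</style>#si', '', $html);
    // Keep only a small whitelist of content tags.
    $html = strip_tags($html, '<p><br><span><b><strong><img><h1><h2><h3><h4><h5><h6><i><em>');
    // Drop attributes from the kept tags; <img> is left alone so that src survives.
    return preg_replace('#<(p|span|b|strong|h1|h2|h3|h4|h5|h6|i|em)\s+[^>]*>#i', '<$1>', $html);
}

// Made-up sample input.
$sample = '<div id="main"><script>var a = 1;</script>'
        . '<p class="intro" style="color:red">Hello <a href="/x">world</a></p>'
        . '<img src="a.png" alt="pic"></div>';

echo filter_tags($sample), "\n";
// prints: <p>Hello world</p><img src="a.png" alt="pic">
```

The wrapper div and the link are removed but their text is kept, the script block disappears entirely, and the paragraph loses its class and style attributes.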

end

Attachment download / demo source code:
【 zhengwentiqu.rar 】 1.77KB

Copyright notice: unless otherwise stated, all articles on this site are original.

When reposting, please credit the source: https://www.heiqu.com/7887.html