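As anyone who has written a web crawler will know, the hyperlinks a crawler collects often need to be normalized into absolute paths. This article presents a set of regular expressions for processing the collected links. Without further ado, here are the details.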
A crawler will typically come across links like the following:
<!-- Empty hyperlink -->
<a href=""></a>
<!-- Whitespace only -->
<a href=" " > </a>
<!-- a tag with other attributes -->
<a href="https://www.jb51.net/index.html" alt="超链接"> index.html </a>
<a href="https://www.jb51.net/" target="_blank"> / target="_blank" </a>
<a target="_blank" href="https://www.jb51.net/" alt="超链接" > target="_blank" / alt="超链接" </a>
<a target="_blank" title="超链接" href="https://www.jb51.net/" alt="超链接" > target="_blank" title="超链接" / alt="超链接" </a>
<!-- Root directory -->
<a href="https://www.jb51.net/" > / </a>
<a href="https://www.jb51.net/article/a" > a </a>
<!-- With query parameters -->
<a href="/index.html?id=1" > /index.html?id=1 </a>
<a href="?id=2" > ?id=2 </a>
<!-- // -->
<a href="https://www.jb51.net/index.html" > //index.html </a>
<a href="https://www.mafutian.net" > //www.mafutian.net </a>
<!-- Same-site link -->
<a href="https://www.hole_1.com/index.html" > </a>
<!-- External links -->
<a href="https://www.mafutian.net" > </a>
<a href="https://www.numberer.net" > </a>
<!-- Links to image and text files -->
<a href="https://www.jb51.net/article/1.jpg" > 1.jpg </a>
<a href="https://www.jb51.net/article/1.jpeg" > 1.jpeg </a>
<a href="https://www.jb51.net/article/1.gif" > 1.gif </a>
<a href="https://www.jb51.net/article/1.png" > 1.png </a>
<a href="https://www.jb51.net/article/1.txt" > 1.txt </a>
<!-- Ordinary links -->
<a href="https://www.jb51.net/index.html" > index.html </a>
<a href="https://www.jb51.net/index.html" > index.html </a>
<a href="https://www.jb51.net/article/index.html" > ./index.html </a>
<a href="https://www.jb51.net/index.html" > ../index.html </a>
<a href="https://www.jb51.net/article/.../" > .../ </a>
<a href="https://www.jb51.net/article/..." > ... </a>
<!-- Not real links, or containing a colon -->
<a href="javascript:void(0)" > javascript:void(0) </a>
<a href="https://www.jb51.net/article/a:b" > a:b </a>
<a href="/a#a:b" > /a#a:b </a>
<a href="mailto:'mafutian@126.com'" > mailto:'mafutian@126.com' </a>
<a href="/tencent://message/?uin=335134463" > /tencent://message/?uin=335134463 </a>
<!-- Relative paths -->
<a href="" > . </a>
<a href="" > .. </a>
<a href="https://www.jb51.net/" > ../ </a>
<a href="https://www.jb51.net/a/b/.." > /a/b/.. </a>
<a href="https://www.jb51.net/a" > /a </a>
<a href="https://www.jb51.net/article/b" > ./b </a>
<a href="https://www.jb51.net/article/././././b" > ./././././././././b </a> <!-- effectively just ./b -->
<a href="https://www.jb51.net/c" > ../c </a>
<a href="" > ../../d </a>
<a href="" > ../a/../b/c/../d </a>
<a href="https://www.jb51.net/e" > ./../e </a>
<a href="https://www.hole_1.org/../e" > </a>
<a href="https://www.jb51.net/./f" > ./.././f </a>
<a href="https://www.hole_1.org/../a/.../../b/c/../d/.." > </a>
<!-- With port numbers -->
<a href="https://www.jb51.net/:8081/index.html" > :8081/index.html </a>
<a href="https://www.mafutian.net:80/index.html" > :80/index.html </a>
<a href="https://www.mafutian.net:8081/index.html" > :8081/index.html </a>
<a href="https://www.mafutian.net:8082/index.html" > :8082/index.html </a>
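Several of the hrefs listed above (empty values, whitespace, `javascript:` and `mailto:` targets) are not worth resolving at all. A minimal sketch of a pre-filter follows; the helper name `is_crawlable_href` is hypothetical and not from the original article, and it assumes the crawler only follows `http` and `https` links.

```php
<?php
// Hypothetical helper (an assumption, not from the article): decide whether
// an href found by the crawler should be resolved into an absolute URL.
function is_crawlable_href(string $href): bool
{
    $href = trim($href);

    // Empty or whitespace-only href: nothing to follow.
    if ($href === '') {
        return false;
    }

    // A leading "scheme:" (letter, then letters, digits, '+', '.', '-').
    // Only http and https are kept; javascript:, mailto: etc. are dropped.
    // Note: under this rule a bare 'a:b' is read as scheme 'a' and rejected.
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $href, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https'], true);
    }

    // No scheme: a relative reference such as '/a#a:b', '?id=2' or '../c'.
    return true;
}
```

Because the scheme test is anchored at the start of the string, a colon that appears later, as in `/a#a:b` or `/tencent://message/?uin=335134463`, does not cause the link to be rejected.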
The first processing step is to convert each link into an absolute path:
... / ../ ../
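One way this first step might look is sketched below. The function name `join_url` and the base URL used in the example are assumptions for illustration, not the article's own code. The sketch only glues the href onto the page URL; any `./` and `../` segments are deliberately left in place for the later cleanup step.

```php
<?php
// Hypothetical helper (an assumption, not from the article): join the URL of
// the crawled page ($base) with an href found on it. Dot segments are kept.
function join_url(string $base, string $href): string
{
    $href = trim($href);

    // Already absolute (has a scheme): return unchanged.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';
    $query  = isset($parts['query']) ? '?' . $parts['query'] : '';

    // Empty href refers to the page itself.
    if ($href === '') {
        return $origin . $path . $query;
    }
    // Protocol-relative: '//host/path' reuses the base scheme.
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    switch ($href[0]) {
        case '/': return $origin . $href;                  // root-relative
        case '?': return $origin . $path . $href;          // query only
        case '#': return $origin . $path . $query . $href; // fragment only
    }
    // Plain relative path: append to the directory of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Example with an assumed base page:
// join_url('http://www.mytest.org/article/index.html', '../c')
// gives 'http://www.mytest.org/article/../c'
```

Keeping the dot segments at this stage means the joining logic stays simple, at the cost of producing URLs that still need normalizing.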
Next, here is the code that removes './', '../', and '/..' from the absolute path:
<?php
function url_to_absolute($relative)
{
    // Remove every './' that is not the tail of '../'
    $absolute = preg_replace('/(?<!\.)\.\//', '', $relative);

    // Repeatedly collapse every '/abc/../' into '/'
    $pattern = '/(?<!\/)\/([^\/]{1,}?)\/\.\.\//';
    do {
        $absolute = preg_replace($pattern, '/', $absolute);
        $count = preg_match_all($pattern, $absolute, $res);
    } while ($count >= 1);

    // Collapse a trailing '/abc/..' into '/', then drop any leftover trailing '/..'
    $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.$/', '/', $absolute);
    $absolute = preg_replace('/\/\.\.$/', '', $absolute);

    // Remove any '../' that still remains
    $absolute = preg_replace('/(?<!\.)\.\.\//', '', $absolute);

    return $absolute;
}

$relative = 'http://www.mytest.org/../a/.../../b/c/../d/..';
var_dump(url_to_absolute($relative));
// Output: string 'http://www.mytest.org/a/b/' (length=26)
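As a cross-check on the regex version, the same cleanup can also be sketched with a segment stack, similar in spirit to the "remove dot segments" procedure in RFC 3986. This is a swapped-in alternative technique, not the article's method; the function name `remove_dot_segments` is hypothetical, and the sketch assumes the input is already an absolute URL of the form `scheme://host/path`.

```php
<?php
// Hypothetical alternative (an assumption, not from the article): resolve
// '.' and '..' by walking the path segments and keeping a stack.
function remove_dot_segments(string $url): string
{
    // Split 'scheme://host[:port]' from the rest; leave other input untouched.
    if (!preg_match('~^([a-z][a-z0-9+.\-]*://[^/?#]*)(.*)$~i', $url, $m)) {
        return $url;
    }
    $origin = $m[1];

    // Separate the path from any query string or fragment.
    $cut  = strcspn($m[2], '?#');
    $path = (string) substr($m[2], 0, $cut);
    $tail = (string) substr($m[2], $cut);

    $segments = explode('/', $path === '' ? '/' : $path);
    array_shift($segments); // drop the empty piece before the leading '/'

    $stack = [];
    foreach ($segments as $seg) {
        if ($seg === '.') {
            continue;              // current directory: skip
        }
        if ($seg === '..') {
            array_pop($stack);     // parent directory: drop last segment, if any
            continue;
        }
        $stack[] = $seg;           // ordinary segment ('...' counts as one)
    }

    // A path ending in '.' or '..' names a directory, so keep a trailing '/'.
    $last = end($segments);
    if ($last === '.' || $last === '..') {
        $stack[] = '';
    }

    return $origin . '/' . implode('/', $stack) . $tail;
}

var_dump(remove_dot_segments('http://www.mytest.org/../a/.../../b/c/../d/..'));
// string(26) "http://www.mytest.org/a/b/"
```

For the article's sample input both approaches give the same result. The stack version handles all segments in a single pass, and it leaves the query string and fragment untouched.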
Summary