[Repost] Scraping pages with PHP - Source Code

Date: 2013-01-27

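Scraping a page means fetching someone else's page, parsing out the parts you need, and storing them in a database.
The simplest case is a plain GET request: no POST data and no login.
The simplest possible fetch: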
[php]
// Fetch a page with a plain GET request.
// Returns the response body, or false if the request fails.
function getHtml($url) {
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Content-Type:text/html; charset=utf-8"
        )
    );
    $context = stream_context_create($opts);
    return file_get_contents($url, false, $context);
}

echo getHtml("http://www.baidu.com");
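Since `file_get_contents` returns `false` on failure, a fetch helper is easier to use when it has a timeout and reports failure explicitly. The following is a minimal sketch, not the author's code: the name `fetchPage`, the 10-second default, and the User-Agent string are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: like getHtml(), but with a timeout and an explicit
// null return value when the request fails.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeoutSeconds, // give up on slow servers
            // Illustrative User-Agent; some sites reject requests without one.
            'header'  => "User-Agent: Mozilla/5.0 (compatible; ExampleFetcher/1.0)\r\n",
        ],
    ]);

    // "@" suppresses the PHP warning; failure is reported via the return value.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

A caller would then check for `null` before parsing, for example `if (($html = fetchPage($url)) === null) { /* retry or log */ }`.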
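What comes back is the page itself: HTML, CSS, and JS.
To get the data you actually need, analyze the page structure and then write regular expressions to extract it.
An example (fetching the "most clicked this week" list from the 163 finance channel, http://money.163.com/special/002526BH/rank.html):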
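To make the regex step concrete, here is a small sketch that pulls links out of an inline HTML string using named capture groups. The function name `extractLinks` and the sample markup are made up for illustration; a real page needs a pattern written against its own structure.

```php
<?php
// Hypothetical helper: collect the href and visible text of every <a> tag.
function extractLinks(string $html): array
{
    // Named groups make the result easier to read than numeric indexes.
    // The "s" flag lets "." match newlines inside the link text.
    $pattern = '/<a\s+href="(?P<url>[^"]*)"[^>]*>(?P<text>.*?)<\/a>/s';

    $links = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = [
                'url'  => $m['url'],
                'text' => trim(strip_tags($m['text'])), // drop nested markup
            ];
        }
    }
    return $links;
}

// Made-up sample markup.
$sample = '<ul>'
        . '<li><a href="/a.html" title="A">First</a></li>'
        . '<li><a href="/b.html">Second <b>item</b></a></li>'
        . '</ul>';

print_r(extractLinks($sample));
```

With `PREG_SET_ORDER`, each element of `$matches` is one complete match, which avoids indexing parallel arrays by position.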
[php]
// Fetch a page with a plain GET request.
// Returns the response body, or false if the request fails.
function getHtml($url) {
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Content-Type:text/html; charset=utf-8"
        )
    );
    $context = stream_context_create($opts);
    return file_get_contents($url, false, $context);
}

$url = 'http://money.163.com/special/002526BH/rank.html';
$file = getHtml($url);
// Lazily capture the body of every <div class="tabContents"> block.
$reg = '/<div class="tabContents">([\s\S]*?)<\/div>/';

// $matches[1] holds the captured body of every block that matched the pattern;
// index 1 selects the second block, so make sure it exists before using it.
if ($file !== false && preg_match_all($reg, $file, $matches) && isset($matches[1][1])) {
    $data = array();
    $news = $matches[1][1];
    // Capture the link URL, title attribute, and link text of each entry.
    $reg = '/<a href="(?P<url>.*?)"\s*title="(?P<title>.*?)"\s*>(?P<subjet>.*?)<\/a>/';
    if (preg_match_all($reg, $news, $matches)) {
        $subjet = $matches["subjet"];
        $urls = $matches["url"];
        foreach ($subjet as $i => $text) {
            $data[$i] = array("subjet" => $text, "url" => $urls[$i]);
            print_r($data[$i]);
        }
    }
} else {
    echo "Fetch failed";
}
Result:
Array ( [subjet] => 王石回应离婚传闻：我没有背叛家庭 [url] => http://money.163.com/12/1029/10/8EVOHHRV00253B0H.html )
Array ( [subjet] => 浙江楼市从领涨到领跌温州炒房客被套转投实业 [url] => http://money.163.com/12/1105/01/8FGR7QQK00253B0H.html )
Array ( [subjet] => 《时代周刊》评选2012最佳发明：谷歌眼镜上榜 [url] => http://money.163.com/12/1102/15/8FAM06V600253G87.html )
Array ( [subjet] => 网络盛传万科董事长王石已离婚 [url] => http://money.163.com/12/1029/00/8EUPDERR00253B0H.html )
Array ( [subjet] => 中粮上海楼盘单价21.6万元/平刷新汤臣一品纪 [url] => http://money.163.com/12/1030/10/8F2CUD73002534NU.html )
Array ( [subjet] => 中国近10年来涌现13位首富有人已锒铛入狱 [url] => http://money.163.com/12/1029/14/8F06QTD700253G87.html )
Array ( [subjet] => 2012全球25大最佳国家品牌：瑞士排名第一 [url] => http://money.163.com/12/1030/19/8F3CH5AJ00253B0H.html )
... (there are many more entries; the rest are omitted)
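The `$matches[1][1]` lookup in the example relies on how `preg_match_all` lays out its results by default: `$matches[0]` holds the full matches and `$matches[1]` holds every capture of the first group, in page order. A small sketch on made-up markup (the helper name `nthCapture` and the sample text are assumptions):

```php
<?php
// Hypothetical helper: return the N-th capture of group 1, or null if the
// page has fewer matching blocks than expected.
function nthCapture(string $pattern, string $html, int $index): ?string
{
    if (!preg_match_all($pattern, $html, $matches)) {
        return null;
    }
    // Default PREG_PATTERN_ORDER: $matches[1] lists all group-1 captures.
    return $matches[1][$index] ?? null;
}

// Made-up page with two blocks that share the same class.
$html = '<div class="tabContents">daily list</div>'
      . '<div class="tabContents">weekly list</div>';
$pattern = '/<div class="tabContents">([\s\S]*?)<\/div>/';

echo nthCapture($pattern, $html, 1), "\n"; // prints "weekly list"
var_dump(nthCapture($pattern, $html, 5));  // prints NULL
```

Returning `null` for a missing block lets the caller tell "the page layout changed" apart from "the request failed".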
The second kind of scraping is a bit more work: it has to POST data, which calls for cURL.
Some introductory articles on cURL:
http://developer.51cto.com/art/200904/121739.htm
http://www.cmx8.cn/curl.html
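As a rough sketch of what a cURL POST involves: the fields are first encoded into a URL-encoded form body, which is then handed to cURL as `CURLOPT_POSTFIELDS`. The helper names `buildFormBody` and `postForm`, the timeout, and the sample fields below are assumptions for illustration only.

```php
<?php
// Hypothetical helper: turn a JSON object into a URL-encoded form body.
function buildFormBody(string $json): string
{
    $fields = json_decode($json, true);
    if (!is_array($fields)) {
        throw new InvalidArgumentException('Expected a JSON object');
    }
    return http_build_query($fields);
}

// Hypothetical helper: POST a form body and return the response, or null on failure.
function postForm(string $url, string $body, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    return $response === false ? null : $response;
}

// Made-up fields, just to show the encoding.
echo buildFormBody('{"page":3,"keyword":"php curl"}'), "\n";
// prints: page=3&keyword=php+curl
```

Passing a string to `CURLOPT_POSTFIELDS` sends it as `application/x-www-form-urlencoded`; passing an array instead makes cURL send `multipart/form-data`.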
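An example (scraping front-end articles from cnblogs; the URL is http://www.cnblogs.com/mvc/AggSite/PostList.aspx):
Only a few pages are scraped successfully. cnblogs has many blog skins and the regex only handles a few of them, so the failure rate is high. P.S. cnblogs content is really hard to scrape.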
[php]
header("Content-type: text/html;charset=UTF-8");

// POST the fields described by a JSON string to $url and return the raw response.
function postHtml($json, $url) {
    // Convert the JSON object into a URL-encoded form body.
    $curlPost = http_build_query(json_decode($json, true));
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, 1);         // include response headers in the result
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $curlPost);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// Fetch a page with a plain GET request.
function getHtml($url) {
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Content-Type:text/html; charset=utf-8"
        )
    );
    $context = stream_context_create($opts);
    return file_get_contents($url, false, $context);
}

$url = 'http://www.cnblogs.com/mvc/AggSite/PostList.aspx';
$json = '{"CategoryType":"TopSiteCategory","ParentCategoryId":0,"CategoryId":108703,"PageIndex":3,"ItemListActionName":"PostList"}';
$html = postHtml($json, $url); // the raw list page
$reg = '/<a class="titlelnk" href="(?P<href>.*?)"[^>]+>(?P<subject>.*?)<\/a>[\s\S]*?<\/a>(?P<time>[\s\S]*?)<span class="article_comment">/';
// Matches the article body for the skins this pattern supports.
$bodyReg = '/<div class="postText">(?P<content>[\s\S]*?)<\/div>\s*<p class="postfoot">/';

// The list page only provides the title, time, and URL of each post,
// so each URL has to be fetched separately to get the article body.
if (preg_match_all($reg, $html, $matches)) {
    $data = array();
    foreach ($matches["href"] as $i => $href) {
        $page = getHtml($href);
        if ($page !== false && preg_match_all($bodyReg, $page, $rs)) {
            echo $href . "----------------success<br>";
            $data[$i] = array(
                "href" => $href,
                "subject" => $matches["subject"][$i],
                "time" => $matches["time"][$i],
                "body" => $rs["content"][0]
            );
        } else {
            echo $href . "----------------failed<br>";
        }
    }
}
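Because a single body pattern only fits a few skins, one way to lower the failure rate is to try several patterns in turn and take the first that matches. This is a sketch of that idea, not something the original code does. The helper name `extractPostBody` and the second pattern (a `post-body` class) are hypothetical; real skins would each need a pattern written against their own markup.

```php
<?php
// Hypothetical helper: try each pattern in order and return the first
// captured "content" group, or null when no pattern matches.
function extractPostBody(string $html, array $patterns): ?string
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $html, $m)) {
            return trim($m['content']);
        }
    }
    return null;
}

$patterns = [
    // Body pattern in the style of the example above.
    '/<div class="postText">(?P<content>[\s\S]*?)<\/div>\s*<p class="postfoot">/',
    // Hypothetical pattern for some other skin.
    '/<div class="post-body">(?P<content>[\s\S]*?)<\/div>/',
];

// Made-up pages, one per skin.
$skinA = '<div class="postText">Body A</div>' . "\n" . '<p class="postfoot">footer</p>';
$skinB = '<div class="post-body"> Hello <b>world</b> </div>';

var_dump(extractPostBody($skinA, $patterns));
var_dump(extractPostBody($skinB, $patterns));
var_dump(extractPostBody('<p>unknown skin</p>', $patterns));
```

Lazy patterns like these stop at the first closing `</div>`, so a body that contains nested `<div>` elements would be cut short; that is a limit of the regex approach in general.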
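The third kind of scraping requires a login: some data is only visible after signing in, so the login has to be simulated.
HTTP is stateless, so the server relies on cookies to tell whether the client making a request is logged in. First capture the login request with a packet sniffer to find its URL and the POST parameters it needs, then save the cookies returned by the login to a separate file, and send those cookies along when requesting pages that require a login. That simulates a logged-in session and lets you request what you need. (P.S. Some sites seem to have restrictions and return nothing no matter how the request is sent.)
An example: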
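Login forms often carry hidden fields that must be posted back along with the username and password. Instead of copying them by hand from a captured request, they can be read from the login page's HTML. The sketch below assumes double-quoted attributes; the helper name `extractHiddenFields` and the sample form are made up for illustration.

```php
<?php
// Hypothetical helper: collect name => value for every hidden <input>.
function extractHiddenFields(string $html): array
{
    $fields = [];
    if (preg_match_all('/<input\b[^>]*>/i', $html, $tags)) {
        foreach ($tags[0] as $tag) {
            // Skip inputs that are not hidden or have no name.
            if (!preg_match('/\btype="hidden"/i', $tag)) {
                continue;
            }
            if (!preg_match('/\bname="([^"]*)"/i', $tag, $name)) {
                continue;
            }
            $value = preg_match('/\bvalue="([^"]*)"/i', $tag, $v) ? $v[1] : '';
            // Attribute values are HTML-escaped in the page source.
            $fields[$name[1]] = html_entity_decode($value, ENT_QUOTES);
        }
    }
    return $fields;
}

// Made-up login form.
$form = '<form>'
      . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />'
      . '<input type="text" name="tbUserName" />'
      . '<input type="hidden" name="__EVENTVALIDATION" value="xyz" />'
      . '</form>';

print_r(extractHiddenFields($form));
```

The resulting array can be merged with the username and password fields before posting, so the hidden values always match the page that was just fetched.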
[php]
<?php

$url = "http://passport.cnblogs.com/login.aspx";
$login = "xxxxx"; // account name and password
$password = "xxxxx";

// Form fields as captured from a real login request.
$post_data = array(
    "__EVENTTARGET" => "",
    "__EVENTARGUMENT" => "",
    "__VIEWSTATE" => "/wEPDwULLTE1MzYzODg2NzZkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBQtjaGtSZW1lbWJlcm1QYDyKKI9af4b67Mzq2xFaL9Bt",
    "__EVENTVALIDATION" => "/wEWBQLWwpqPDQLyj/OQAgK3jsrkBALR55GJDgKC3IeGDE1m7t2mGlasoP1Hd9hLaFoI2G05",
    "tbUserName" => $login,
    "tbPassword" => $password,
    "btnLogin" => "登++录",
    "txtReturnUrl" => "http://home.cnblogs.com/"
);
$cookie_jar = tempnam('./temp', 'cookie'); // file that stores the cookies

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // do not print the login response; only the cookies matter
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar); // save the cookies to the file
curl_exec($ch);
curl_close($ch); // the cookie file is written when the handle is closed

// The request above is the login; the posted fields were captured from a real
// login, and its main purpose is to obtain the cookies.
// The request below uses those cookies to fetch a page that requires a login.

$url = "http://home.cnblogs.com/";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, $url);              // fake the Referer header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);          // return the data instead of printing it
curl_setopt($ch, CURLOPT_HEADER, 0);                  // 0 hides response headers, 1 shows them; default is 0
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_jar);    // send the cookies from the file
$output2 = curl_exec($ch);                            // send the HTTP request
curl_close($ch);

echo $output2;
?>
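When a simulated login does not seem to work, it helps to check which cookies actually ended up in the cookie file. The sketch below assumes the tab-separated Netscape cookie file format that cURL writes (domain, subdomain flag, path, secure flag, expiry, name, value). The helper name `parseCookieJar` and the sample cookie names and values are made up.

```php
<?php
// Hypothetical helper: parse the contents of a cURL cookie jar file into
// an array keyed by cookie name.
function parseCookieJar(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\r\n|\n|\r/', $contents) as $line) {
        // cURL marks HttpOnly cookies with a "#HttpOnly_" prefix;
        // any other line starting with "#" is a comment.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue;
        }

        $parts = explode("\t", $line);
        if (count($parts) !== 7) {
            continue; // not a cookie line
        }
        [$domain, , $path, , $expires, $name, $value] = $parts;
        $cookies[$name] = [
            'domain'  => $domain,
            'path'    => $path,
            'expires' => (int) $expires, // 0 means a session cookie
            'value'   => $value,
        ];
    }
    return $cookies;
}

// Made-up jar contents.
$jar = "# Netscape HTTP Cookie File\n\n"
     . ".example.com\tTRUE\t/\tFALSE\t1400000000\tauth_token\tABC123\n"
     . "#HttpOnly_login.example.com\tFALSE\t/\tFALSE\t0\tsession_id\txyz\n";

print_r(parseCookieJar($jar));
```

In the login example, the same function could be applied to `file_get_contents($cookie_jar)` after the first request to confirm that the login really set a session cookie.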
