[thelist] html page analyser program

Alex Beston alex at deltatraffic.co.uk
Wed Jul 14 06:14:01 CDT 2004


Hi all,

I've written a little PHP script which extracts useful info from a web page.

Here's the URL:

http://www.deltatraffic.co.uk/regexp/elements.php

Put in the full URL of any page and it will come back with the info.

Now the problem is that if I try www.photos.org, the headers shown in
the program are these:

Headers Content:

HTTP/1.1 400 Bad Request
Date: Wed, 14 Jul 2004 11:00:50 GMT
Server: Apache/1.2.6
Connection: close
Content-Type: text/html

Searching Google for "400 Bad Request" turned up this page:

http://www.codestyle.org/sitemanager/FAQ.shtml#why400

which suggests that their server isn't set up that well.

However, when I run Ethereal and load the page in a browser, the server
returns a 200 OK code:

GET / HTTP/1.1
Host: www.photos.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707
Accept: application/x-shockwave-flash,text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.1 200 OK
Date: Wed, 14 Jul 2004 11:05:46 GMT
Server: Apache/1.2.6
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html

Going back to my program, the code that gets the headers is this:

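function get_headers_php4($url)
{
   $url_info = parse_url($url);
   $header = array();   // header lines collected so far
   $fp = fsockopen($url_info['host'],80,$errno,$errstr,30);
   if (!$fp)
   {
       print("failed to get headers");
       exit;
   }
   else
   {
       // build the request line and Host header from the parsed url
       $head = "GET ".$url_info['path']."?".$url_info['query'];
       $head .= " HTTP/1.0\r\nHost: ".$url_info['host']."\r\n\r\n";
       fputs($fp,$head);
       echo "<pre>";
       while(!feof($fp))
       {
           $line = fgets($fp,1024);
           echo($line);
           // a bare CRLF marks the end of the response headers
           if (strpos($line,"\r\n",0) === 0)
           {
               fclose($fp); echo "</pre>";
               return $header;
           }
           else
           {
               $header[] = $line;
           }
       }
       // reached only if the connection closed before a blank line arrived
       fclose($fp); echo "</pre>";
       return $header;
   }
}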

So the question is: how can I modify this code so that I don't get a 400
response code?
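
One possible cause, offered as a guess rather than a diagnosis: parse_url() only sets the 'path' and 'query' keys when the URL actually contains them. Assuming the URL entered was http://www.photos.org with no trailing slash, the code above would send "GET ? HTTP/1.0", which a server may well reject with a 400. A minimal sketch of building the request line with a fallback path follows; build_request_line() is a hypothetical helper name, not part of the original script.

```php
<?php
// Hypothetical helper (not in the original script): builds the request
// line, falling back to "/" when the url has no path and adding "?"
// only when a query string is actually present.
function build_request_line($url)
{
    $url_info = parse_url($url);
    $path = isset($url_info['path']) ? $url_info['path'] : '/';
    if (isset($url_info['query'])) {
        $path .= '?' . $url_info['query'];
    }
    return "GET " . $path . " HTTP/1.0\r\n";
}

echo build_request_line("http://www.photos.org");
// prints: GET / HTTP/1.0
```

The isset() checks are used instead of newer operators so the sketch stays close to the PHP 4 style of the original function.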

Maybe I ought to add something like:

$head .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko";

I tried that, but it comes back with the same 400 code.
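
Another thing worth checking, assuming the User-Agent line was appended after the $head string shown earlier: that string already ends with "\r\n\r\n", the blank line that terminates the request headers, so anything added afterwards falls outside the header block. The added line also has no trailing "\r\n". Here is a sketch that assembles all the headers first and adds the terminating blank line last; build_request() and its parameters are hypothetical names, not part of the original script.

```php
<?php
// Hypothetical helper: joins the request line, the Host header and any
// extra headers with CRLF, then ends the header block with a blank line.
function build_request($path, $host, $extra_headers = array())
{
    $lines = array("GET " . $path . " HTTP/1.0", "Host: " . $host);
    foreach ($extra_headers as $name => $value) {
        $lines[] = $name . ": " . $value;
    }
    return implode("\r\n", $lines) . "\r\n\r\n";
}

$head = build_request("/", "www.photos.org", array(
    "User-Agent" => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707"
));
echo $head;
```

The resulting $head could be passed to fputs() in place of the hand-built string. Whether the server then answers 200 is not something this sketch can confirm.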

Thanks,
Alex



