[thelist] html page analyser program
Alex Beston
alex at deltatraffic.co.uk
Wed Jul 14 06:14:01 CDT 2004
Hi all,
I've written a little PHP script which extracts useful info from a web page.
Here's the URL:
http://www.deltatraffic.co.uk/regexp/elements.php
Put in the full URL of any page and it will come back at you with the info.
The problem is that if I try www.photos.org, the headers shown in
the program are these:
Headers Content:
HTTP/1.1 400 Bad Request
Date: Wed, 14 Jul 2004 11:00:50 GMT
Server: Apache/1.2.6
Connection: close
Content-Type: text/html
Searching Google for "400 Bad Request" turned up this page:
http://www.codestyle.org/sitemanager/FAQ.shtml#why400
which sheds some light, suggesting that their server isn't set up that well.
However, when I use a browser and capture the traffic with Ethereal, the
server returns a 200 OK response:
GET / HTTP/1.1
Host: www.photos.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707
Accept: application/x-shockwave-flash,text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
HTTP/1.1 200 OK
Date: Wed, 14 Jul 2004 11:05:46 GMT
Server: Apache/1.2.6
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html
Going back to my program, the code that gets the headers is this:
function get_headers_php4($url)
{
    $url_info = parse_url($url);
    $header = array();   // collected header lines

    // open a plain TCP connection to port 80 with a 30 second timeout
    $fp = fsockopen($url_info['host'], 80, $errno, $errstr, 30);
    if (!$fp)
    {
        print("failed to get headers");
        exit;
    }

    // send the request
    $head = "GET ".$url_info['path']."?".$url_info['query'];
    $head .= " HTTP/1.0\r\nHost: ".$url_info['host']."\r\n\r\n";
    fputs($fp, $head);

    echo "<pre>";
    while (!feof($fp))
    {
        $line = fgets($fp, 1024);
        echo($line);
        // a bare CRLF line marks the end of the headers
        if (strpos($line, "\r\n", 0) === 0)
        {
            break;
        }
        $header[] = $line;
    }
    fclose($fp);
    echo "</pre>";
    return $header;
}
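A note on the request line built in the function above: parse_url() leaves out the 'path' and 'query' keys when the URL has neither, so for a bare address such as http://www.photos.org the function would send "GET ? HTTP/1.0", which a server may well reject. The sketch below shows one possible guard. The helper name build_request() is hypothetical, the URL is assumed to include the http:// scheme, and none of this has been checked against www.photos.org itself.

```php
<?php
// Hypothetical helper (not part of the original script): builds the raw
// request text that would be written to the socket.
function build_request($url)
{
    $url_info = parse_url($url);

    // parse_url() omits 'path' and 'query' when the URL has none, so
    // fall back to "/" and only add "?" when there is a query string.
    $path = isset($url_info['path']) ? $url_info['path'] : '/';
    if (isset($url_info['query'])) {
        $path .= '?' . $url_info['query'];
    }

    $head  = "GET " . $path . " HTTP/1.0\r\n";
    $head .= "Host: " . $url_info['host'] . "\r\n";
    $head .= "\r\n"; // blank line ends the header block
    return $head;
}
```

With this, build_request('http://www.photos.org') starts with "GET / HTTP/1.0", the same request path the browser capture shows.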
So the question is: how can I modify this code so that I don't get a 400
response code?
Maybe I ought to add something like:
$head .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko";
I tried that, but it comes back with the same 400 code.
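On the User-Agent idea, one thing that may matter: in the function above, $head already ends with the blank line ("\r\n\r\n") that closes the header block, so text appended after it would presumably not be read as a header, and every header line needs its own "\r\n". A minimal sketch of placing extra headers before that blank line follows. The helper name build_request_with_headers() and its parameters are hypothetical, and whether a User-Agent changes the response from this particular server is not something the capture above settles.

```php
<?php
// Hypothetical helper: assembles a request whose extra headers sit
// between the Host line and the terminating blank line.
function build_request_with_headers($host, $path, $extra_headers)
{
    $head  = "GET " . $path . " HTTP/1.0\r\n";
    $head .= "Host: " . $host . "\r\n";

    // each extra header gets its own CRLF-terminated line
    foreach ($extra_headers as $name => $value) {
        $head .= $name . ": " . $value . "\r\n";
    }

    $head .= "\r\n"; // the blank line must come last
    return $head;
}
```

For example, passing array('User-Agent' => 'Mozilla/5.0') as the third argument puts the User-Agent line directly after the Host line.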
Thanks,
Alex