[thelist] PHP: getting domain names from a url - being foiled by .co.uk addresses
Dunstan Orchard
dunstan at 1976design.com
Sat Dec 6 13:29:30 CST 2003
Hi there,
This is a tricky one I think.
I have a collection of URLs, for example:
http://story.news.yahoo.com/news/
http://www.amazon.com/exec/obidos/ASIN/B0000C0FKA/
http://www.metafilter.com/mefi/28945
http://www.s-curverecords.com/joss/
http://www.tenaciousd.com/
I need to get only the main domain names from this list, eg:
yahoo.com
amazon.com
metafilter.com
s-curverecords.com
tenaciousd.com
I am doing this with PHP's parse_url():
********************************************
// eg: http://www.php.net/download-php.php3?csel=br
// $url['scheme'] = http
// $url['host'] = www.php.net
// $url['path'] = /download-php.php3
// $url['query'] = csel=br
$link = 'http://www.metafilter.com/mefi/28945';
$linkbits = parse_url($link);
$host = $linkbits['host'];
********************************************
So now $host = www.metafilter.com
All I have to do is remove the 'www.':
********************************************
// find pos of first dot
$dot_pos = strpos($host, '.', 0) + 1;
// make a new substring
$domain = substr($host, $dot_pos);
********************************************
So now $domain = metafilter.com
Perfect.
But what about when there are subdomains? That will ruin my 'get
position of first dot' approach.
Not a problem, I'll just loop up to the penultimate dot.
So for http://story.news.yahoo.com/news/
********************************************
// count dots
$dot_num = substr_count($host, '.');
// set initial search position
$dot_pos = 0;
// if subdomains exist
if ($dot_num > 1)
    {
    // step forward to just past the penultimate dot
    for ($i = 1; $i < $dot_num; $i++)
        {
        // move pos to just after the next dot
        $dot_pos = strpos($host, '.', $dot_pos) + 1;
        }
    }
// set domain (the whole host when there are no subdomains)
$domain = substr($host, $dot_pos);
********************************************
So now $domain = yahoo.com
Perfect.
_But_, what happens if my domain name isn't a .net, or .com, but a
.co.uk (or one of the many other '2-dot' names)?
I can't loop up until the penultimate dot now (and in fact my 'do
subdomains exist' statement messes up as well). If I did then:
http://www.pugh.co.uk/
Would become:
co.uk
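To make the failure concrete, here is a small sketch that wraps the penultimate-dot loop in a function and runs it on both kinds of host. The function name domain_by_penultimate_dot is made up for this illustration.

```php
<?php
// Hypothetical helper: keeps everything after the penultimate dot,
// i.e. the last two dot-separated parts of the host.
function domain_by_penultimate_dot($host)
{
    $dot_num = substr_count($host, '.');
    $dot_pos = 0;
    // runs ($dot_num - 1) times, so it stops just past the penultimate dot
    for ($i = 1; $i < $dot_num; $i++) {
        $dot_pos = strpos($host, '.', $dot_pos) + 1;
    }
    return substr($host, $dot_pos);
}

echo domain_by_penultimate_dot('story.news.yahoo.com'), "\n"; // yahoo.com
echo domain_by_penultimate_dot('www.pugh.co.uk'), "\n";       // co.uk (wrong)
```

The loop has no way of knowing that 'co.uk' is a suffix rather than a domain, so counting dots alone cannot fix this.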
So (yes, we're finally at the question), does anyone have any ideas as
to how to get around this problem?
:o)
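One possible direction, offered only as a rough sketch: check the last two parts of the host against a list of known two-part suffixes, and keep three parts instead of two when there is a match. The list below is an assumption made up for this example and is nowhere near complete, and the function name registered_domain is hypothetical too.

```php
<?php
// Hypothetical, deliberately incomplete list of two-part suffixes.
// A real list would need many more entries and regular upkeep.
$two_part_suffixes = array('co.uk', 'org.uk', 'com.au', 'co.nz');

function registered_domain($host, $two_part_suffixes)
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);

    // default: keep the last two parts (e.g. yahoo.com)
    $keep = 2;

    // if the last two parts form a known suffix, keep one more part
    if ($count >= 3) {
        $last_two = $labels[$count - 2] . '.' . $labels[$count - 1];
        if (in_array($last_two, $two_part_suffixes, true)) {
            $keep = 3;
        }
    }

    // array_slice with a negative offset takes parts from the end
    return implode('.', array_slice($labels, -$keep));
}

echo registered_domain('story.news.yahoo.com', $two_part_suffixes), "\n";
echo registered_domain('www.pugh.co.uk', $two_part_suffixes), "\n";
```

The obvious weakness is that any suffix missing from the list silently falls back to the two-part behaviour, so this only moves the problem into maintaining the list.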
Thanks very much - Dunstan (on digest, so I'd love a CC if pos)
-------------------------------------
Dorset, England
Work: http://www.1976design.com/
Play: http://www.1976design.com/blog/
Learn: http://webstandards.org/