[thelist] PHP: getting domain names from a url - being foiled by .co.uk addresses

Dunstan Orchard dunstan at 1976design.com
Sat Dec 6 13:29:30 CST 2003


Hi there,

This is a tricky one, I think.

I have a collection of URLs, for example:

http://story.news.yahoo.com/news/
http://www.amazon.com/exec/obidos/ASIN/B0000C0FKA/
http://www.metafilter.com/mefi/28945
http://www.s-curverecords.com/joss/
http://www.tenaciousd.com/

I need to get only the main domain names from this list, e.g.:

yahoo.com
amazon.com
metafilter.com
s-curverecords.com
tenaciousd.com


I am doing this with PHP's parse_url():

********************************************
// eg: http://www.php.net/download-php.php3?csel=br
// $url['scheme'] = http
// $url['host'] = www.php.net
// $url['path'] = /download-php.php3
// $url['query'] = csel=br

$link = 'http://www.metafilter.com/mefi/28945';
$linkbits = parse_url($link);
$host = $linkbits['host'];
********************************************

So now $host = www.metafilter.com
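Run over the whole list, that step might look something like the sketch below. The array name $links and the isset() check are my own additions for the example; the check is only there because parse_url() does not return a 'host' entry when a URL has no host part.

```php
<?php
// The example URLs from the top of this mail
$links = array(
    'http://story.news.yahoo.com/news/',
    'http://www.amazon.com/exec/obidos/ASIN/B0000C0FKA/',
    'http://www.metafilter.com/mefi/28945',
    'http://www.s-curverecords.com/joss/',
    'http://www.tenaciousd.com/',
);

$hosts = array();
foreach ($links as $link) {
    $linkbits = parse_url($link);

    // skip anything parse_url() could not find a host in
    if (isset($linkbits['host'])) {
        $hosts[] = $linkbits['host'];
    }
}

// e.g. $hosts[0] is 'story.news.yahoo.com'
// and  $hosts[2] is 'www.metafilter.com'
```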

All I have to do is remove the 'www.':

********************************************
// find pos of first dot
$dot_pos = strpos($host, '.', 0) + 1;

// make a new substring
$domain = substr($host, $dot_pos);
********************************************

So now $domain = metafilter.com

Perfect.

But what about when there are subdomains? That will ruin my 'get
position of first dot' approach.
Not a problem, I'll just loop up until the penultimate dot.

So for http://story.news.yahoo.com/news/

********************************************
// $host = 'story.news.yahoo.com' (from parse_url(), as before)

// count dots
$dot_num = substr_count($host, '.');

// set initial search position
$dot_pos = 0;

// if subdomains exist
if ($dot_num > 1)
{
    // find up to the penultimate dot
    for ($i = 1; $i < $dot_num; $i++)
    {
        // move the search position to just past the next dot
        $dot_pos = strpos($host, '.', $dot_pos) + 1;
    }
}

// set domain
$domain = substr($host, $dot_pos);
********************************************

So now $domain = yahoo.com

Perfect.

_But_, what happens if my domain name isn't a .net, or .com, but a 
.co.uk (or one of the many other '2-dot' names)?

I can't loop up until the penultimate dot now (and in fact my 'do 
subdomains exist' statement messes up as well). If I did then:

http://www.pugh.co.uk/

Would become:

co.uk
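To make the failure easy to reproduce, here is the same penultimate-dot loop wrapped in a function. The function name penultimate_dot_domain() is made up for this example, and the bare 'pugh.co.uk' host is my own test value to show the 'do subdomains exist' check going wrong too.

```php
<?php
// Hypothetical helper: the penultimate-dot loop from above as a function
function penultimate_dot_domain($host)
{
    // count dots
    $dot_num = substr_count($host, '.');

    // set initial search position
    $dot_pos = 0;

    // more than one dot is treated as "subdomains exist"
    if ($dot_num > 1) {
        // step past every dot except the last one
        for ($i = 1; $i < $dot_num; $i++) {
            $dot_pos = strpos($host, '.', $dot_pos) + 1;
        }
    }

    // everything after the penultimate dot
    return substr($host, $dot_pos);
}

echo penultimate_dot_domain('story.news.yahoo.com'), "\n"; // yahoo.com
echo penultimate_dot_domain('www.pugh.co.uk'), "\n";       // co.uk
echo penultimate_dot_domain('pugh.co.uk'), "\n";           // co.uk
```

The last call has no subdomain at all, but its two dots still pass the '$dot_num > 1' test, so 'pugh' gets cut off as well.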

So (yes, we're finally at the question), does anyone have any ideas as 
to how to get around this problem?

:o)

Thanks very much - Dunstan (on digest, so I'd love a CC if possible)

-------------------------------------
Dorset, England
Work: http://www.1976design.com/
Play: http://www.1976design.com/blog/
Learn: http://webstandards.org/



