[thelist] Office 2003, XML, and CMS

Burhan Khalid thelist at meidomus.com
Sun Aug 24 05:06:17 CDT 2003


Quoting GravyFace <gravyface at bmfsquad.com>:
[ snip ]

> I know very little practically about XML, but am familiar with XML
> conceptually -- to save me from wading through the W3C stew, can anyone give
> me an explanation of how I could/should Parse an XML file, dumping into a DB
> (SQL), and store the XML tags and content "as is":
>

[ /snip ]

Well, let me give you a quick rundown of XML parsing (as I'm swimming in it
every day on this current project).

There are basically two kinds of parsers: SAX and DOM. SAX parsers stream
through the XML file from start to finish and fire "events" such as an
opening tag, a closing tag, character data, end of document, etc. You register
handlers for those events and deal with each one in your code.
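
To make the event idea concrete, here is a minimal sketch using PHP's
expat-based xml extension. The XML string and the $events log are made up for
illustration, and the closures assume PHP 5.3 or later.

```php
<?php
// SAX-style parsing: the parser calls our handlers as it meets each event.
// The sample XML and the $events array are hypothetical.
$xml = '<article type="33"><title>Hello</title></article>';

$events = array();

$parser = xml_parser_create();

// Opening and closing tags. Names arrive upper-cased because
// case folding is on by default in this extension.
xml_set_element_handler(
    $parser,
    function ($p, $name, $attribs) use (&$events) {
        $events[] = "start:" . $name;
    },
    function ($p, $name) use (&$events) {
        $events[] = "end:" . $name;
    }
);

// Text between tags.
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$events) {
        $events[] = "text:" . $data;
    }
);

// Third argument true = this is the final (here: only) chunk.
if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

print_r($events);
```

Nothing is kept in memory except what your handlers choose to store, which is
why this style suits large files.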

The DOM parser (which I hear is becoming the "standard") reads your whole
XML file and returns a DOM-style tree of it. Each tag is a "node" (attributes
and text are nodes too), and you can traverse the tree with the familiar dot
(.) syntax, or -> in PHP.
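
For comparison, a rough sketch of the tree approach using PHP's DOM extension
(DOMDocument, so PHP 5 or later is assumed). The sample XML is invented for
the example.

```php
<?php
// DOM-style parsing: load everything, then walk the tree.
// The sample XML is hypothetical.
$xml = '<article type="33"><title>Hello</title><author>GravyFace</author></article>';

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    die("could not parse XML");
}

$article = $doc->documentElement;            // root node: <article>
$type    = $article->getAttribute("type");   // attribute value as a string
$title   = $article->getElementsByTagName("title")->item(0)->textContent;

// Collect the names of the element children of the root.
$names = array();
foreach ($article->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $names[] = $child->nodeName;
    }
}

echo $type, " ", $title, " ", implode(",", $names), "\n";
```

The trade-off is memory: the whole document lives in the tree, but you can
jump to any node without tracking state yourself.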

Once you understand the two kinds of parsers, it's easier to write your own
custom "parser" that builds on either of them. I have written a small example
(tutorial) in PHP that shows you how to parse an XML file using the built-in
parser, which uses the expat library.

In addition to all this goodness, there are plenty of XML parsing classes
available for your particular brand of programming language. For PHP lovers,
there are a few in PEAR, and a few floating around on phpclasses.org. If you
are a .NET developer, most of the XML parsing stuff is taken care of for
you...thanks to the almost religious support of XML in .NET.

Once you can parse the XML file, you can do anything you want with the 
information.  Storing it in a DB (like your example), would just be a matter of 
writing the appropriate SQL statements.
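
As a sketch of that last step, here is one way to do the insert with a
prepared statement, so quotes in a title or author name can't break the SQL.
It assumes the PDO extension with the SQLite driver; the articles table and
its columns are hypothetical, and an in-memory database stands in for yours.

```php
<?php
// Values as they might come out of the parsed XML (made up for illustration).
$article = array(
    ":type"   => 33,
    ":author" => "GravyFace",
    ":title"  => "It's a news flash",   // note the apostrophe
);

// In-memory SQLite database; swap the DSN for your real server.
$db = new PDO("sqlite::memory:");
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE articles (type INTEGER, author TEXT, title TEXT)");

// Placeholders keep the data separate from the SQL text.
$stmt = $db->prepare(
    "INSERT INTO articles (type, author, title) VALUES (:type, :author, :title)"
);
$stmt->execute($article);

// Read the row back to confirm it was stored unchanged.
$row = $db->query("SELECT type, author, title FROM articles")
          ->fetch(PDO::FETCH_ASSOC);
echo $row["title"], "\n";
```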

I can give you a quick PHP example that does this for you (given your sample 
XML file) :

<?php

   /* This class will hold all our information
      in addition to the database logic.
 
      We will populate this class from the XML
      file.
   */

   class article
   {
      var $type;
      var $author;
      var $title;
      var $summary;
      var $body;

      function __construct() /* PHP 5+ constructor name */
      {
         $this->type = "";
         /* set other variables to default values */
      }
      function setType($x)
      {
         $this->type = $x;
      }
      /* other set functions */
      function insert($dbinfo)
      {
        /* $dbinfo is an array that holds your
           database information.

           Once you have opened a connection,
           simply insert the values from the
           class.
         */

          $strQuery = "INSERT INTO [...] VALUES('$this->author',";
       }
   }
      
   $current = ""; //holds the current tag
   $obj = new article(); //our article object
   $dbinfo = array(); //db information
   $dbinfo['host'] = "localhost"; //etc
   

   function start_tag($parser, $name, $attribs) {
     
     global $current, $obj;
     $current = $name; //sets the current tag

     /* We check which tag the parser is processing.
        If it's on the article tag, then $attribs,
        which is an array that holds the attributes
        for the current tag, will be populated, and
        we set the type of our article in the class.
        (The parser upper-cases tag and attribute
        names by default.)
     */

     if ($current == "ARTICLE")
     {
        $obj->setType($attribs["TYPE"]);
     }
   }

   /* Our parser will call this function
      when it is reading the contents of a
      tag.

      We will check what tag we are on
      via $current and then call the
      appropriate method in the
      class.
   */

   function tag_contents($parser, $data) { 
       
       global $current, $obj; //current tag and our article object
       
       /* $data is what the contents of
          the tag are...stuff between
          the opening and closing tags
       */

       if ($current == "TITLE")
       { 
          $obj->setTitle($data);
       }

       /* more if checks */ 
   }

   /* our parser will call this function
      when it reaches an ending tag
   */

   function end_tag($parser,$name)
   {
      global $dbinfo, $obj;

      if($name == "ARTICLE")
      {

         /* we have reached the end of
            one article's information,
            so insert it into the database
         */
         $obj->insert($dbinfo);
       }
   }
   
   // Finally, we start up our parser.
   // We create it, then set the element handlers and the
   // character data handler.

   $xmlparser = xml_parser_create();
   xml_set_element_handler($xmlparser, "start_tag", "end_tag");
   xml_set_character_data_handler($xmlparser, "tag_contents");
   
   $filename = "sample.xml"; 
   if (!($fp = fopen($filename, "r"))) { die("cannot open ".$filename); } 
     
    while ($data = fread($fp, 4096)){
       //strip whitespace between tags (eregi_replace was removed in PHP 7)
       $data = preg_replace('/>\s+</', '><', $data);
       if (!xml_parse($xmlparser, $data, feof($fp))) {
           $reason = xml_error_string(xml_get_error_code($xmlparser));
           $reason .= " at line " . xml_get_current_line_number($xmlparser);
           die($reason);
       }
    }
    fclose($fp);
?>       
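
One caveat with a character data handler like tag_contents(): the parser may
call it more than once for a single tag (for instance around an entity such
as &amp;amp;), so assigning $data directly can leave you with only the last
piece. A safer pattern, sketched here with made-up names ($buffer, $fields)
and assuming PHP 5.3+ closures, is to append in the handler and store the
result when the closing tag arrives.

```php
<?php
// Hypothetical sample: the entity may cause several character-data calls.
$xml = '<article type="33"><title>Tom &amp; Jerry</title></article>';

$buffer = "";        // text collected for the tag currently open
$fields = array();   // finished values, keyed by upper-cased tag name

$parser = xml_parser_create();

xml_set_element_handler(
    $parser,
    // Opening tag: start with an empty buffer.
    function ($p, $name, $attribs) use (&$buffer) {
        $buffer = "";
    },
    // Closing tag: the buffer now holds the complete text.
    function ($p, $name) use (&$buffer, &$fields) {
        $fields[$name] = trim($buffer);
        $buffer = "";
    }
);

xml_set_character_data_handler(
    $parser,
    // Append instead of assign, since this may fire several times per tag.
    function ($p, $data) use (&$buffer) {
        $buffer .= $data;
    }
);

if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

echo $fields["TITLE"], "\n";
```

This simple buffer only works for tags that contain plain text; a tag with
child tags inside it (like the body in your sample) needs separate handling.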

> This...
> 
> <article Type=33>
>   <title>News Flash: Maple Leafs won the Stanley Cup!</title>
>    <summary>After an overwhelmingly-dominant season, the Maple Leafs have
> finally brought Lord Stanley home!</summary>
>    <author>GravyFace</author>
>    <body>blah blah blah <emphasis id=1>BLAH!</emphasis>. blah blah <emphasis
> id=2>BLAAAAAH!</emphasis></body>
> </article id=555>

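I'm assuming you mean a plain </article> there: a closing tag can't carry
attributes. The attribute values also need quotes (Type="33", id="1") for the
file to be well-formed, otherwise the parser will reject it.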

> Some of the nodes would be parsed of their content and inserted as plain
> text/integer (Type, Title etc), while others (body) would retain their
> "markup" for later XSLT/CSS presentation processing ('emphasis' becomes
> "font: verdana bold 12px;").

You can take care of all this in the sample article() class (in my example), or
in whatever object you choose to represent your XML data.
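
If you go the DOM route, keeping the body markup "as is" is fairly short. This
is only a sketch, assuming PHP's DOM extension and a well-formed version of
your sample (quoted attributes); the variable names are mine.

```php
<?php
// Hypothetical, well-formed cut-down of the sample article.
$xml = '<article type="33"><title>News Flash</title>'
     . '<body>blah <emphasis id="1">BLAH!</emphasis>. blah</body></article>';

$doc = new DOMDocument();
if (!$doc->loadXML($xml)) {
    die("could not parse XML");
}

// Plain-text field: textContent drops any markup.
$title = $doc->getElementsByTagName("title")->item(0)->textContent;

// Markup-preserving field: serialize each child of <body> back to XML.
$bodyNode = $doc->getElementsByTagName("body")->item(0);
$body = "";
foreach ($bodyNode->childNodes as $child) {
    $body .= $doc->saveXML($child);
}

echo $title, "\n", $body, "\n";
```

$title can go into a text column, while $body keeps its emphasis tags for
later XSLT/CSS processing.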

> 
> I do have some concerns that I may be approaching this wrong:
> 
> a) should all these articles be stored natively in an XML file and disregard
> the DB?  I still think that SQL would out-perform the "XML file server"
> thing, but I could be wrong.

The thing about XML is that it's not for data storage, more for data
description. Of course, if you can get them to somehow skip the XML part and
insert it into the database directly, you wouldn't have to deal with the XML
parsing headaches :)

> b) should I really be using XHTML as I'm primarily using this for the Web
> and if we decide to port it somewhere else, at least XHTML is well-formed.

XML is going to be better for this imo because you can use XSLT to apply style 
to your XML documents and have them displayed appropriately.

-- 
Burhan Khalid
thelist[at]meidomus[dot]com
http://www.meidomus.com

