Experienced XML Interview Questions and Answers (Part 5)

41. What is parsing and how do I do it in XML?
Parsing is the act of splitting up information into its component parts (schools used to teach this in language classes until the teaching profession collectively caught the anti-grammar disease).
‘Mary feeds Spot’ parses as
1. Subject = Mary, proper noun, nominative case
2. Verb = feeds, transitive, third person singular, present tense
3. Object = Spot, proper noun, accusative case
In computing, a parser is a program (or a piece of code or API that you can reference inside your own programs) which analyses files to identify the component parts. All applications that read input have a parser of some kind, otherwise they'd never be able to figure out what the information means. Microsoft Word contains a parser which runs when you open a .doc file and checks that it can identify all the hidden codes. Give it a corrupted file and you'll get an error message.
XML applications are just the same: they contain a parser which reads XML and identifies the function of each of the pieces of the document, and it then makes that information available in memory to the rest of the program.
While reading an XML file, a parser checks the syntax (pointy brackets, matching quotes, etc) for well-formedness, and reports any violations (reportable errors). The XML Specification lists what these are.
Validation is another stage beyond parsing. As the component parts of the document are identified, a validating parser can compare them with the pattern laid down by a DTD or a Schema, to check that they conform. In the process, default values and datatypes (if specified) can be added to the in-memory result of the validation that the validating parser gives to the application.

<person corpid="abc123" birth="1960-02-31" gender="female">
  <name><forename>Judy</forename><surname>O'Grady</surname></name>
</person>
The example above parses as:
1. Element person identified with Attribute corpid containing abc123 and Attribute birth containing 1960-02-31 and Attribute gender containing female containing ...
2. Element name containing ...
3. Element forename containing text ‘Judy’ followed by ...
4. Element surname containing text ‘O'Grady’
(and lots of other stuff too).
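As a rough illustration of the same walk-through in code, here is a minimal sketch using the parser in Python's standard library (xml.etree.ElementTree). The describe function and its output wording are made up for this example. Note that this parser is non-validating, so it accepts birth="1960-02-31" without complaint: spotting that this is not a real date is a job for validation against a Schema, not for parsing.

```python
import xml.etree.ElementTree as ET

# The person record from the example above, as a string.
PERSON_XML = """<person corpid="abc123" birth="1960-02-31" gender="female">
  <name><forename>Judy</forename><surname>O'Grady</surname></name>
</person>"""


def describe(element, depth=0):
    """Return one line per element: its name, attributes, and direct text."""
    text = (element.text or "").strip()
    line = "  " * depth + f"Element {element.tag}"
    for name, value in element.attrib.items():
        line += f", Attribute {name} containing {value}"
    if text:
        line += f", text {text!r}"
    lines = [line]
    # Recurse into child elements, indenting one level deeper.
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(PERSON_XML)
report = describe(root)
print("\n".join(report))
```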
As well as built-in parsers, there are also stand-alone parser-validators, which read an XML file and tell you if they find an error (like missing angle-brackets or quotes, or misplaced markup). This is essential for testing files in isolation before doing something else with them, especially if they have been created by hand without an XML editor, or by an API which may be too deeply embedded elsewhere to allow easy testing.
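A stand-alone well-formedness check can be sketched in a few lines. This is only an illustration using Python's standard library, and the function name check_well_formed is invented here. It checks well-formedness only: it does not validate against a DTD or Schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


good = "<note><to>Spot</to><from>Mary</from></note>"
bad = "<note><to>Spot</from></note>"  # end-tag does not match the start-tag

print(check_well_formed(good))
print(check_well_formed(bad))
```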

42. When should I use a CDATA Marked Section?
You should almost never need to use CDATA Sections. The CDATA mechanism was designed to let an author quote fragments of text containing markup characters (the open-angle-bracket and the ampersand), for example when documenting XML (this FAQ uses CDATA Sections quite a lot, for obvious reasons). A CDATA Section turns off markup recognition for the duration of the section (it gets turned on again only by the closing sequence of double end-square-brackets and a close-angle-bracket). Consequently, nothing in a CDATA section can ever be recognised as anything to do with markup: it's just a string of opaque characters, and if you use an XML transformation language like XSLT, any markup characters in it will get turned into their character entity equivalent.
If you try, for example, to use:
some text with <![CDATA[<em>markup</em>]]> in it.
in the expectation that the embedded markup would remain untouched, it won't: it will just output
some text with &lt;em&gt;markup&lt;/em&gt; in it.

In other words, CDATA Sections cannot preserve the embedded markup as markup. Normally this is exactly what you want because this technique was designed to let people do things like write documentation about markup. It was not designed to allow the passing of little chunks of (possibly invalid) unparsed HTML embedded inside your own XML through to a subsequent process—because that would risk invalidating the output.

As a result you cannot expect to keep markup untouched simply because it looked as if it was safely ‘hidden’ inside a CDATA section: it can't be used as a magic shield to preserve HTML markup for future use as markup, only as characters.
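The behaviour described in this answer can be observed with any XML parser. The sketch below uses Python's standard library rather than XSLT (a substitution made for this illustration), and the wrapping p element is an assumption added so that the text has a root element.

```python
import xml.etree.ElementTree as ET

source = "<p>some text with <![CDATA[<em>markup</em>]]> in it.</p>"
p = ET.fromstring(source)

# The parser hands over the CDATA content as ordinary characters:
# there is no em child element, only text.
children = list(p)
text = p.text

# Writing the element back out escapes the markup characters.
serialised = ET.tostring(p, encoding="unicode")

print(children)
print(text)
print(serialised)
```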

43. How can I handle embedded HTML in my XML?
Apart from using CDATA Sections, there are two common occasions when people want to handle embedded HTML inside an XML element:
1. when they have received (possibly poorly-designed) XML from somewhere else which they must find a way to handle;
2. when they have an application which has been explicitly designed to store a string of characters containing &lt; and &amp; character entity references with the objective of turning them back into markup in a later process (eg FreeMind, Atom).
Generally, you want to avoid this kind of trick, as it usually indicates that the document structure and design have been insufficiently thought out. However, there are occasions when it becomes unavoidable, so if you really need or want to use embedded HTML markup inside XML, and have it processable later as markup, there are a couple of techniques you may be able to use:
* Provide templates for the handling of that markup in your XSLT transformation (or whatever software you use) which simply replicate what was there, eg
<xsl:template match="b">
  <b><xsl:apply-templates/></b>
</xsl:template>
* Use XSLT's ‘deep copy’ instruction, which outputs nested well-formed markup verbatim, eg
<xsl:template match="ol">
  <xsl:copy-of select="."/>
</xsl:template>
* As a last resort, use the disable-output-escaping attribute on the xsl:text element of XSL[T] which is available in some processors, eg
<xsl:text disable-output-escaping="yes"><![CDATA[<b>Now!</b>]]></xsl:text>
* Some processors (eg JX) are now providing their own equivalents for disabling output escaping. Their proponents claim it is ‘highly desirable’ or ‘what most people want’, but it still needs to be treated with care to prevent unwanted (possibly dangerous) arbitrary code from being passed untouched through your system.
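For the second occasion above (markup stored as escaped characters and turned back into markup by a later process), the round trip might look like the following sketch. It uses Python's standard library instead of XSLT, the entry and content element names are hypothetical (loosely modelled on Atom), and it only works when the stored fragment is itself well-formed XML. Real-world HTML often is not, and anything from an untrusted source would need sanitising before being passed on.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the HTML is stored as escaped characters, not markup.
doc = ET.fromstring(
    '<entry><content type="html">&lt;b&gt;Now!&lt;/b&gt;</content></entry>'
)

# After parsing, the content is the plain string '<b>Now!</b>'.
payload = doc.find("content").text

# Later process: turn the characters back into markup by parsing them again.
# Wrapping in a dummy root lets the payload hold several sibling elements.
fragment = ET.fromstring(f"<wrapper>{payload}</wrapper>")
tags = [child.tag for child in fragment]

print(payload)
print(tags)
```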

44. What are the special characters in XML?
For normal text (not markup), there are no special characters: just make sure your document refers to the correct encoding scheme for the language and/or writing system you want to use, and that your computer correctly stores the file using that encoding scheme. See the question on non-Latin characters for a longer explanation.
If your keyboard will not allow you to type the characters you want, or if you want to use characters outside the limits of the encoding scheme you have chosen, you can use a symbolic notation called ‘entity referencing’. Entity references can either be numeric, using the decimal or hexadecimal Unicode code point for the character (eg if your keyboard has no Euro symbol (€) you can type &#8364;); or they can be character, using an established name which you declare in your DTD (eg <!ENTITY euro "&#8364;">) and then use as &euro; in your document.
If you use XML with no DTD, then these five character entities are assumed to be predeclared, and you can use them without declaring them:
* &lt;: The less-than character (<) starts element markup (the first character of a start-tag or an end-tag).
* &amp;: The ampersand character (&) starts entity markup (the first character of a character entity reference).
* &gt;: The greater-than character (>) ends a start-tag or an end-tag.
* &quot;: The double-quote character (") can be symbolised with this character entity reference when you need to embed a double-quote inside a string which is already double-quoted.
* &apos;: The apostrophe or single-quote character (') can be symbolised with this character entity reference when you need to embed a single-quote or apostrophe inside a string which is already single-quoted.
If you are using a DTD then you must declare all the character entities you need to use (if any). The XML Specification requires every processor to recognise the five above whether they are declared or not, but recommends that valid documents declare them anyway for interoperability. If you are using a Schema, you must use the numeric form for all except the five above because Schemas have no way to make character entity declarations.
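The different kinds of reference can be tried out with a short script. This is a sketch using Python's standard library. The price and t element names are invented for the example, and the euro entity is declared in an internal DTD subset so that the parser can resolve it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Numeric references (decimal and hexadecimal) need no declaration.
numeric = ET.fromstring("<price>&#8364;5 or &#x20AC;5</price>").text

# The five predeclared entities work without a DTD.
five = ET.fromstring("<t>&lt; &amp; &gt; &quot; &apos;</t>").text

# A named entity must be declared before use, here in an internal subset.
declared = ET.fromstring(
    '<!DOCTYPE price [<!ENTITY euro "&#8364;">]><price>&euro;5</price>'
).text

# Without the declaration the parser reports an error.
try:
    ET.fromstring("<price>&euro;5</price>")
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Going the other way: escape markup characters when writing text out.
escaped = escape("AT&T < IBM")
attr = quoteattr('say "hi"')

print(numeric, five, declared, undeclared_ok, escaped, attr, sep="\n")
```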

45. How can I include a conditional statement in my XML?
You can't: XML isn't a programming language, so you can't say things like
<google if {DB}="A">bar</google>
If you need to make an element optional, based on some internal or external criteria, you can do so in a Schema. DTDs have no internal referential mechanism, so it isn't possible to express this kind of conditionality in a DTD at the individual element level.
It is possible to express presence-or-absence conditionality in a DTD for the whole document, by using parameter entities as switches to include or ignore certain sections of the DTD based on settings either hardwired in the DTD or supplied in the internal subset. Both the TEI and DocBook DTDs use this mechanism to implement modularity.
Alternatively you can make the element entirely optional in the DTD or Schema, and provide code in your processing software that checks for its presence or absence. This defers the checking until the processing stage: one of the reasons for Schemas is to provide this kind of checking at the time of document creation or editing.
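Checking for an optional element in the processing software can be as simple as the sketch below. It uses Python's standard library, and the orders, order, item and discount names are hypothetical. The condition lives in the program, not in the XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the discount element is optional.
ORDERS = """<orders>
  <order id="1"><item>bar</item><discount>10</discount></order>
  <order id="2"><item>baz</item></order>
</orders>"""


def discount_for(order):
    """Return the discount as an int, or 0 when the optional element is absent."""
    node = order.find("discount")
    return int(node.text) if node is not None else 0


discounts = {o.get("id"): discount_for(o) for o in ET.fromstring(ORDERS)}
print(discounts)
```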

46. I have to do an overview of XML for my manager/client/investor/advisor. What should I mention?
* XML is not a markup language. XML is a ‘metalanguage’, that is, it's a language that lets you define your own markup languages (see definition).
* XML is a markup language [two (seemingly) contradictory statements one after another is an attention-getting device that I'm fond of], not a programming language. XML is data: it does not ‘do’ anything, it has things done to it.
* XML is non-proprietary: your data cannot be held hostage by someone else.
* XML allows multi-purposing of your data.
* Well-designed XML applications most often separate ‘content’ from ‘presentation’. You should describe what something is rather than what something looks like (the exception being data content which never gets presented to humans).
Saying ‘the data is in XML’ is a relatively useless statement, similar to saying ‘the book is in a natural language’. To be useful, the former needs to specify ‘we have used XML to define our own markup language’ (and say what it is), similar to specifying ‘the book is in French’.
A classic example of multipurposing and separation that I often use is a pharmaceutical company. They have a large base of data on a particular drug that they need to publish as:
* reports to the FDA;
* drug information for publishers of drug directories/catalogs;
* ‘prescribe me!’ brochures to send to doctors;
* little pieces of paper to tuck into the boxes;
* labels on the bottles;
* two pages of fine print to follow their ad in Reader's Digest;
* instructions to the patient that the local pharmacist prints out;
* etc.
Without separation of content and presentation, they need to maintain essentially identical information in 20 places. If they miss a place, people die, lawyers get rich, and the drug company gets poor. With XML (or SGML), they maintain one set of carefully validated information, and write 20 programs to extract and format it for each application. The same 20 programs can now be applied to all the hundreds of drugs that they sell.
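To make the ‘one source, many programs’ idea concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the drug name Examplol, the element names, and the two formatting functions. It is not a model of any real pharmaceutical markup language.

```python
import xml.etree.ElementTree as ET

# One carefully maintained source record (all names are made up).
DRUG = ET.fromstring(
    "<drug><name>Examplol</name><dose unit='mg'>50</dose>"
    "<warning>Do not take with alcohol.</warning></drug>"
)


def bottle_label(drug):
    """Short, upper-case text for the label on the bottle."""
    dose = drug.find("dose")
    return f"{drug.findtext('name').upper()} {dose.text}{dose.get('unit')}"


def patient_leaflet(drug):
    """Plain-language line for the leaflet in the box."""
    return f"{drug.findtext('name')}: {drug.findtext('warning')}"


print(bottle_label(DRUG))
print(patient_leaflet(DRUG))
```

Each function is one of the ‘programs’: both read the same record, so a correction to the source shows up in every output.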
In the Web development area, the biggest thing that XML offers is fixing what is wrong with HTML:
* browsers allow non-compliant HTML to be presented;
* HTML is restricted to a single set of markup (‘tagset’).
If you let broken HTML work (be presented), then there is no motivation to fix it. Web pages are therefore tag soup that is useless for further processing. XML specifies that processing must not continue if the XML is non-compliant, so you keep working at it until it complies. This is more work up front, but the result is not a dead-end.
If you wanted to mark up the names of things: people, places, companies, etc in HTML, you don't have many choices that allow you to distinguish among them. XML allows you to name things as what they are:
<person>Charles Goldfarb</person> worked
at <company>IBM</company>
gives you a flexibility that you don't have with HTML:
<B>Charles Goldfarb</B> worked at <B>IBM</B>
With XML you don't have to shoe-horn your data into markup that restricts your options.
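The practical difference shows up as soon as a program tries to pull the names out again. In this sketch (Python standard library; the wrapping p element is an assumption added to give each fragment a root), the XML version can be queried by what things are, while the bold version only tells you what was bold.

```python
import xml.etree.ElementTree as ET

named = ET.fromstring(
    "<p><person>Charles Goldfarb</person> worked at <company>IBM</company></p>"
)
bold = ET.fromstring("<p><B>Charles Goldfarb</B> worked at <B>IBM</B></p>")

# Descriptive markup: people and companies can be told apart.
people = [e.text for e in named.iter("person")]
companies = [e.text for e in named.iter("company")]

# Presentational markup: everything is just 'bold'.
bolded = [e.text for e in bold.iter("B")]

print(people, companies, bolded)
```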

47. What is the purpose of XML namespaces?
XML namespaces are designed to provide universally unique names for elements and attributes. This allows people to do a number of things, such as:
* Combine fragments from different documents without any naming conflicts. (See example below.)
* Write reusable code modules that can be invoked for specific elements and attributes. Universally unique names guarantee that such modules are invoked only for the correct elements and attributes.
* Define elements and attributes that can be reused in other schemas or instance documents without fear of name collisions. For example, you might use XHTML elements in a parts catalog to provide part descriptions. Or you might use the nil attribute defined in XML Schemas to indicate a missing value.
As an example of how XML namespaces are used to resolve naming conflicts in XML documents that contain element types and attributes from multiple XML languages, consider the following two XML documents:
<?xml version="1.0" ?>
<Address><Street>Apple 7</Street></Address>

<?xml version="1.0" ?>
Each document uses a different XML language and each language defines an Address element type. Each of these Address element types is different -- that is, each has a different content model, a different meaning, and is interpreted by an application in a different way. This is not a problem as long as these element types exist only in separate documents. But what if they are combined in the same document, such as a list of departments, their addresses, and their Web servers?

48. How does an application know which Address element type it is processing?
One solution is to simply rename one of the Address element types -- for example, we could rename the second element type IPAddress. However, this is not a useful long term solution. One of the hopes of XML is that people will standardize XML languages for various subject areas and write modular code to process those languages. By reusing existing languages and code, people can quickly define new languages and write applications that process them. If we rename the second Address element type to IPAddress, we will break any code that expects the old name. A better answer is to assign each language (including its Address element type) to a different namespace. This allows us to continue using the Address name in each language, but to distinguish between the two different element types. The mechanism by which we do this is XML namespaces. Note that by assigning each Address name to an XML namespace, we actually change the name to a two-part name consisting of the name of the XML namespace plus the name Address. This means that any code that recognizes just the name Address will need to be changed to recognize the new two-part name. However, this only needs to be done once, as the two-part name is universally unique.
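A combined document along these lines might look like the sketch below, which also shows how code can tell the two Address element types apart. The namespace URIs, the Department wrapper element and the sample IP address are all assumptions made for this illustration. The parser is the one in Python's standard library, which writes the two-part name as {namespace-name}local-name.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs below are made up for this example.
POSTAL = "http://example.com/ns/postal"
NET = "http://example.com/ns/network"

COMBINED = f"""<Department xmlns:addr="{POSTAL}" xmlns:net="{NET}">
  <addr:Address><addr:Street>Apple 7</addr:Street></addr:Address>
  <net:Address>192.0.2.10</net:Address>
</Department>"""

root = ET.fromstring(COMBINED)

# Look each Address up by its two-part name: namespace name plus local name.
postal = root.find(f"{{{POSTAL}}}Address")
server = root.find(f"{{{NET}}}Address")

# Code that only knows the bare name 'Address' no longer matches either one.
bare = root.find("Address")

print(postal.find(f"{{{POSTAL}}}Street").text, server.text, bare)
```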

49. What is an XML namespace?
An XML namespace is a collection of element type and attribute names. The collection itself is unimportant -- in fact, a reasonable argument can be made that XML namespaces don't actually exist as physical or conceptual entities. What is important is the name of the XML namespace, which is a URI. This allows XML namespaces to provide a two-part naming system for element types and attributes. The first part of the name is the URI used to identify the XML namespace -- the namespace name. The second part is the element type or attribute name itself -- the local part, also known as the local name. Together, they form the universal name. This two-part naming system is the only thing defined by the XML namespaces recommendation.
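One consequence of the two-part name is that the prefix used in a document is irrelevant: only the namespace name and the local part count. The sketch below illustrates this with Python's standard library. The URI is made up, the helper split_universal_name is hypothetical, and the {uri}local notation is this library's way of writing a universal name, not something the recommendation itself defines.

```python
import xml.etree.ElementTree as ET

URI = "http://example.com/ns/postal"  # made-up namespace name

# Same universal name, written with a prefix and with a default namespace.
a = ET.fromstring(f'<x:Address xmlns:x="{URI}"/>')
b = ET.fromstring(f'<Address xmlns="{URI}"/>')


def split_universal_name(tag):
    """Split ElementTree's '{uri}local' form into (namespace name, local part)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


print(a.tag == b.tag)
print(split_universal_name(a.tag))
```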

50. Does the XML namespaces recommendation define anything except a two-part naming system for element types and attributes?
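No. This is a very important point and a source of much confusion, so we will repeat it: the XML namespaces recommendation does not define anything except a two-part naming system for element types and attributes.
In particular, XML namespaces do not provide or define any of the following: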
* A way to merge two documents that use different DTDs.
* A way to associate XML namespaces and schema information.
* A way to validate documents that use XML namespaces.
* A way to associate element type or attribute declarations in a DTD with an XML namespace.
