My First Look At The XML ENTITY Tag In ColdFusion XML Documents

Posted by Ben Nadel
Tags: ColdFusion

Last year, in the comments of my blog post on cleaning high-ASCII values out of text, Eric Stevens suggested that one might use the XML ENTITY tag in an XML document to define named data variables. I had marked this as something I wanted to look into, but I had somewhat forgotten about it until last night, when Clarke Bishop asked about a similar topic. As such, I decided it was time to look at this XML ENTITY tag and see what it does.

In an XML document, the ampersand (&) is a special character. Just as in an HTML document, the ampersand is used to define placeholders for XML data entities. Out of the box, there are five predefined internal XML entities:

  • &lt;
  • &gt;
  • &amp;
  • &apos;
  • &quot;

These should look pretty familiar; I am sure that most of us have used at least one or two of these to define the output in our HTML documents.

If you try to parse an XML document in ColdFusion and the XML data contains entities outside this predefined set, you'll probably get an error that looks like this:

An error occured while Parsing an XML document. The entity "mdash" was referenced, but not declared.

In this case, the XML parser came across the m-dash entity (&mdash;) and could not find the data that this particular entity was representing. To fix this, we can add our own ENTITY definition to the XML document. This can be done either in an external DTD file or in an internal DOCTYPE tag.
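
To see the error first-hand, here is a minimal sketch that parses a string containing an undeclared &mdash; reference and catches the resulting exception. The `rawXml` and `parseError` variable names are made up for illustration, and the exact wording of the message may vary between ColdFusion versions.

```cfml
<!--- Hypothetical input: uses &mdash; without declaring it anywhere. --->
<cfset rawXml = "<data><p>Vice President &mdash; Sales</p></data>" />
<cfset parseError = "" />

<cftry>
    <!--- This call should fail because "mdash" is not a predefined entity. --->
    <cfset parsed = xmlParse( rawXml ) />

    <cfcatch type="any">
        <!--- Keep the message and detail so we can inspect them. --->
        <cfset parseError = cfcatch.message & " " & cfcatch.detail />
    </cfcatch>
</cftry>

<cfoutput>#htmlEditFormat( parseError )#</cfoutput>
```

If the same string used only &amp; (or any of the other four predefined entities), the parse should go through without a DOCTYPE at all.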

To demonstrate this concept, let's look at a single-file DOCTYPE example. In the following code, we are defining a DOCTYPE declaration which, itself, defines our &mdash; entity:

    <!--- Define our XML document. --->
    <cfxml variable="xhtml">

        <?xml version="1.0" encoding="UTF-8"?>

        <!---
            The document type definition can go inline or in a separate
            file. It has to start with the root element and can define
            entities that are used within the XML.

            NOTE: We are defining the entity "mdash" which will allow us
            to use the entity &mdash; in our XML document without
            getting any parsing errors.
        --->
        <!DOCTYPE data [
            <!ENTITY mdash "--">
        ]>


        <!--- The actual XML Data. --->
        <data>

            <h2 class="name">
                Tricia Smith
            </h2>

            <p class="position">
                Vice President &mdash; Sales &amp; Marketing
            </p>

        </data>

    </cfxml>


    <!--- Output the name and position of the person defined. --->
    <cfoutput>

        <!--- Name. --->
        #xhtml.data.h2.xmlText#

        <!--- Position. --->
        (#xhtml.data.p.xmlText#)

    </cfoutput>

As you can see, our DOCTYPE is defining the "mdash" entity as representing the double-dash (--). When the XML document gets parsed, all instances of the "&mdash;" entity will be replaced with our double-dash. And, when we run the above code, we get the following output:

Tricia Smith ( Vice President -- Sales & Marketing )

As you can see, the above text actually has two substitutions. The first is our explicitly-defined "mdash" entity; the second is the internally-defined "amp" entity.

At this time, that's about all I know about the XML ENTITY declaration. It seems really cool; and, while I was looking into this, I was definitely reminded that XML is a lot more powerful and more robust than I even realize. I wonder how many cool things I could do if I knew more about how XML worked. Oh well - one baby step at a time.



Reader Comments

An entity can have any name you specify in an XML document. HTML entities (such as &mdash;) have a standard set of definitions. See Appendix A on http://www.w3.org/TR/WD-entities-961125

But you can do some neat things with this. For example, let's say you have a standard inline HTML markup logo for a company called "RedBlue" which you want to be able to output without having to write the full markup every time. Define an entity like this:

<!ENTITY RedBlueLogo "&lt;span style='color:red'&gt;Red&lt;/span&gt;&lt;span style='color:blue'&gt;Blue&lt;/span&gt;">

Now you can just put &RedBlueLogo; in your output, and you'll get the fully marked up code! Basically it can be used as a shorthand for any markup, no matter how complex.

For some of our sites, we have the model generate XML, and the view is XSL. With these sorts of custom entities, you can do a lot of really interesting things.
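
The RedBlueLogo idea described above could be tried out with something like the following sketch. It assumes the entity is declared in an inline DOCTYPE, the same way the mdash entity was declared in the post. The `page` root element and the `logoDoc` variable are hypothetical names chosen for the example.

```cfml
<cfxml variable="logoDoc">
<!DOCTYPE page [
    <!ENTITY RedBlueLogo "&lt;span style='color:red'&gt;Red&lt;/span&gt;&lt;span style='color:blue'&gt;Blue&lt;/span&gt;">
]>
<page>
    <p>Welcome to &RedBlueLogo;!</p>
</page>
</cfxml>

<!--- The entity reference has been expanded by the time we read the text. --->
<cfoutput>#logoDoc.page.p.xmlText#</cfoutput>
```

Because the entity value is written with &lt; and &gt;, the parsed node should hold the span markup as plain text rather than as child elements. It only becomes live markup once that text is written, unescaped, into an HTML page.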

@Eric,

Yeah, it definitely seems that with XSLT, the really exciting stuff can happen! When it comes to parsing XHTML, it seems to be more about just avoiding parsing errors (hence the standard XHTML doc-type, I assume). But, when you're talking about generating content, this kind of entity substitution can be quite exciting.

Thanks, as always, for the great insights!

Ben - I have noticed something "interesting" about CF when dealing with entities in XML.

Recently, I have had to do fairly intensive processing of several XML files that involved replacing values, changing entire tags, etc. Some of these files contained custom entities as you noted above.

Without either adding the entity declaration directly to the header, or adding a reference to an external file containing those entity replacements, the file would not parse (as you mention.)

However, if I include those entities (or include a path to the external entity file) and then use ToString to write the results back to an XML file, the original entities end up being replaced in the resulting XML (i.e., the custom entity no longer appears in the XML, and has been permanently replaced with the value from the entity declaration).

This is not good. I want the resulting XML to retain the original entity references. I can, in theory, replace the ampersand with a unique combo of characters before parsing and then replace it with the ampersand again when I re-write the file, but I would think there should be a way to tell CF to simply not actually replace the custom entity.

Any ideas?

Thanks,

David

@David: this is likely not possible. Entities are just an encoding of (usually a single) character, and there are multiple ways to encode a given character. For example, the following are equivalent representations of the & character: &amp; &#38; &#x26;. The XML engine couldn't hope to know which one you wanted - what if all three were used individually in the document?

Entities being just an encoding, once the document is parsed, there's no reference kept to what their original encoding was, in the same way that with a multi-byte string (such as UTF-8 or UTF-16), the original byte encoding is lost once the string is parsed. You also can't tell the difference between a CDATA block and an entity-escaped block of text (this is a fourth way to encode an & in XML). It's just a way to represent the data in a file, and there are several lexically equivalent ways of doing so.

This is not something unique to ColdFusion, it is part of the XML specification, and every compliant DOM parser will behave the same way.

If you need to preserve the original format of the entities, a kludge you could use would be to replace all & characters with &amp; in the document before parsing. Do all your manipulation of the document (and double-encode any new entities you're adding to the document: &amp;amp;), and at the end after you produce the final output, replace &amp; back with &.
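
The kludge described above might look something like this in ColdFusion. This is only a sketch: the sample XML, the `reviewed` attribute, and the variable names are hypothetical, and it has not been checked against documents that contain CDATA sections or comments with ampersands in them.

```cfml
<!--- Hypothetical source that uses an undeclared &mdash; reference. --->
<cfset source = "<data><p>Vice President &mdash; Sales &amp; Marketing</p></data>" />

<!--- Step 1: escape every ampersand so the parser only ever sees &amp;. --->
<cfset escaped = replace( source, "&", "&amp;", "all" ) />

<!--- Step 2: parse and manipulate the document as usual. --->
<cfset doc = xmlParse( escaped ) />
<cfset doc.data.xmlAttributes[ "reviewed" ] = "true" />

<!--- Step 3: serialize, then undo the escaping from step 1. --->
<cfset result = replace( toString( doc ), "&amp;", "&", "all" ) />

<cfoutput>#htmlEditFormat( result )#</cfoutput>
```

After step 1, the parser treats "&mdash;" as ordinary text, so there is nothing for it to expand or reject. Any new text added in step 2 has to be double-encoded in the same way, as noted above, or the final replace will turn it into a raw ampersand.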

@Eric: Thanks. I am going ahead with the plan of replacing ampersands that are not already "&amp;" when the file is initially read with a unique string marker (like "{**!**}" or whatever,) before the file contents are parsed as XML. XMLParse will then NOT replace the original entity reference, and I will simply replace the marker string with ampersands after I have used ToString before I re-write the XML to file. Thanks again.
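
The marker-string variation of the plan could be sketched as follows. It assumes that reReplace() accepts a negative lookahead, and that the marker never occurs naturally in the file; the variable names and sample XML are made up for the example.

```cfml
<cfset marker = "{**!**}" />
<cfset source = "<data><p>Vice President &mdash; Sales &amp; Marketing</p></data>" />

<!--- Swap every ampersand that does not already start "&amp;" for the marker. --->
<cfset masked = reReplace( source, "&(?!amp;)", marker, "all" ) />

<!--- The parser now sees "{**!**}mdash;" as plain text, not as an entity. --->
<cfset doc = xmlParse( masked ) />

<!--- After serializing, put the ampersands back in place of the marker. --->
<cfset restored = replace( toString( doc ), marker, "&", "all" ) />

<cfoutput>#htmlEditFormat( restored )#</cfoutput>
```

Note that this also masks the other predefined entities, such as &lt; and &quot;, which is harmless as long as the marker is swapped back before the file is written.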

Ben (or Eric), et al...

I have another XML "entity declaration" question for you all:

Since any entity or notation declarations inside the DOCTYPE definition at the top of an XML file do not appear to be accessible once that file is parsed as an XML object in CF, what is the best way to add / edit such declarations in an XML file using CF?

My specific problem is that I am converting one XML file to another XML of a different format / schema. In the source XML, there are entity declarations that replace specific text in that doc with a directory path for a file:

<!ENTITY pic_56_1 SYSTEM "rpstl\MATV0730.EPS" NDATA EPS>

I need to parse that info from the source and do some magic to convert it to a similar, but altogether different entity declaration in the resulting XML file.

Is there a better / easier way to systematically do this without resorting to what I can only assume would be some seriously nasty string manipulation? (I.E. Reading the source XML as a string to parse out the info in question, doing some heavy duty string manipulation to generate the desired result, and then inserting that result into the resulting XML as a string prior to writing the whole thing out to a file?) Ugh.

Ideas?

Thanks,

David
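
For what it's worth, the string-based approach described in the question does not have to be too nasty. Here is a rough sketch that pulls the name, path, and notation out of declarations shaped like the one above. The pattern is an assumption: it only handles double-quoted SYSTEM paths followed by NDATA, and the surrounding `doc` markup and variable names are invented for the example.

```cfml
<!--- Hypothetical raw source, read as a string rather than parsed as XML. --->
<cfset source = '<!DOCTYPE doc [ <!ENTITY pic_56_1 SYSTEM "rpstl\MATV0730.EPS" NDATA EPS> ]><doc/>' />

<!--- Captures: (1) entity name, (2) system path, (3) notation name. --->
<cfset pattern = '<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]*)"\s+NDATA\s+(\w+)\s*>' />

<!--- Find every matching declaration in the raw text. --->
<cfset declarations = reMatch( pattern, source ) />

<cfloop array="#declarations#" index="declaration">
    <!--- Re-apply the pattern to one declaration to pick out each part. --->
    <cfset entityName = reReplace( declaration, pattern, "\1" ) />
    <cfset entityPath = reReplace( declaration, pattern, "\2" ) />
    <cfset entityNotation = reReplace( declaration, pattern, "\3" ) />

    <cfoutput>
        #htmlEditFormat( entityName )#:
        #htmlEditFormat( entityPath )#
        (#htmlEditFormat( entityNotation )#)<br />
    </cfoutput>
</cfloop>
```

From there, each set of parts can be used to build the new declaration string for the target format before it is written into the output file's DOCTYPE.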
