ToughQuiz V

A document uses XHTML 1.0 Strict. It contains a few <blockquote>s, and in Strict they are not allowed to have text nodes as children. Instead, any text in the element should be marked up in a block level element, for instance <p>. Initially the document satisfies this requirement.

After the document has loaded a script similar to Simon Willison's Blockquote Citations runs in the document and adds the content of the cite attribute of each <blockquote> to the visible text of the quote. Due to an oversight of the programmer the script does not put this text in a block level element of its own. Now the <blockquote> has a text node as a child.
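A minimal sketch of the oversight described above. To keep it runnable outside a browser it uses plain objects as stand-ins for DOM nodes; the node shapes and the names `addCitationNaive`, `addCitationWrapped` and `hasTextChild` are assumptions made for illustration and are not taken from Simon Willison's script.

```javascript
// Hypothetical stand-ins for DOM nodes: plain objects, not a real DOM API.
const text = (value) => ({ type: "text", value });
const element = (name, attrs = {}, children = []) =>
  ({ type: "element", name, attrs, children });

// The oversight: the cite value is appended as a bare text node.
function addCitationNaive(blockquote) {
  if (blockquote.attrs.cite) {
    blockquote.children.push(text("Source: " + blockquote.attrs.cite));
  }
}

// One possible fix: wrap the same text in a block-level <p>.
function addCitationWrapped(blockquote) {
  if (blockquote.attrs.cite) {
    blockquote.children.push(
      element("p", {}, [text("Source: " + blockquote.attrs.cite)])
    );
  }
}

// True if the element has non-whitespace text as a direct child.
function hasTextChild(node) {
  return node.children.some(
    (child) => child.type === "text" && child.value.trim() !== ""
  );
}

const makeQuote = () =>
  element("blockquote", { cite: "http://example.com/source" }, [
    element("p", {}, [text("Quoted words.")]),
  ]);

const naive = makeQuote();
addCitationNaive(naive);
const wrapped = makeQuote();
addCitationWrapped(wrapped);

console.log(hasTextChild(naive));   // true
console.log(hasTextChild(wrapped)); // false
```

Both versions show the same visible text; only the first leaves the <blockquote> with a text node as a child.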

Has the restructured document now become invalid XHTML 1.0 Strict? Or does only the initial state count, and does the document remain valid? If it were technically possible, should the validator run the script and judge the resulting document?

Discuss.

This is the blog of Peter-Paul Koch, mobile platform strategist, consultant, and trainer. You can also follow him on Twitter.

Comments

Comments are closed.

1 Posted by Pingu on 4 July 2005 | Permalink

Well, you could mess up the DOM in any number of ways. Does that stop the document being valid? Sure. Does that stop the originating XHTML markup being valid? No.

Syntactically correct.
Semantically wrong.

2 Posted by Tino Zijdel on 5 July 2005 | Permalink

I don't think of the DOM as (X)HTML anymore; it's just a representation after rendering and (for text/html) error corrections performed by the browser. In most cases it isn't 100% equivalent anymore; if you were to recreate (X)HTML from the DOM structure, in most cases you would get a different document.
Luckily browsers don't actually validate syntax rules to the DOM according to a DTD; if they would they probably should have not allowed such an operation.
I'd say in the DOM itself it's OK; the DOM allows it and doesn't have to follow syntax rules other than its own. But since you will not be able to recreate a document that follows the original syntax rules stated in the DTD, you could say that at least at some point it becomes invalid syntax.

3 Posted by Dante on 5 July 2005 | Permalink

The validity of a web page should have nothing to do with whether Javascript is on in the browser or not. In my view, only the *initial* state counts. This may change in the future though, in standards-compliant browsers.

4 Posted by Angus Turnbull on 5 July 2005 | Permalink

IMHO, the document is now invalid.

I don't see a substantial difference between the DOM and (X)HTML. One is just a serialised representation of the other, and it's trivial to convert between them using .innerHTML or fancier means like document fragments.

While I acknowledge the other posters above have a point in saying that it shouldn't matter what JS does to the document, imagine a situation in which a user script or bookmarklet obtains a portion of the DOM, serialises it, and transmits it to a third party server (e.g. a "blog this selection" utility). Whether or not it's valid (X)HTML does matter in that situation!

5 Posted by Dimitri Glazkov on 5 July 2005 | Permalink

There is no question, IMHO. The document now represents an invalid XHTML 1.0 Strict document.

However, where I am still undecided is whether this question is important or even valid to ask.

Since you left the serialized markup realm and modified the DOM, can you still be held liable to the same DOCTYPE declaration that was in the original document?

Here's an interesting read on this topic: http://www.w3.org/TR/2002/WD-DOM-Level-3-Val-20021008/DOM3-Val.html

6 Posted by Fuzztrek on 5 July 2005 | Permalink

Invalid.

I think you're asking the wrong question. I've always wondered what the point of the "cite" attribute was if it isn't going to be rendered. Just there for people snooping through the source?

7 Posted by kevin c smith on 5 July 2005 | Permalink

As I see it, the DTD exists solely to inform those writing document processors (user agents or scripts) as to what structures they might be dealing with. A document neither is nor has a DOM - that's something the UA constructs after it parses a given doc. Once that's happened, the notion of validity is moot because the DTD is no longer useful.

8 Posted by Cris Perdue on 5 July 2005 | Permalink

For me, concepts like XML validity do not make sense as ends in themselves. These are tools that have value as means to other ends.

As you put it, "judging the resulting document" after running the script is not something I see a point in. If there were not a reasonable interpretation for the DOM tree after running the script, that would be an issue -- or that sort of check could be a useful tool to help developers catch bugs. But "judging" the validity of a modified DOM tree is not a value in itself.

9 Posted by 4rn0 on 5 July 2005 | Permalink

Theoretically I would say that structural changes made through the DOM can render a document invalid. It is however such a nice and clean way to implement features into a document that are technically invalid.

For instance, one website I created required a JavaScript controlled Quicktime MP3 player. In order to make it work on both Trident and Gecko based browsers you need to wrap an embed tag in an object tag: Say goodbye to your validation! By inserting those things through the DOM you can have your pie and eat it too!

For Firefox users, check out this extension: http://jennifermadden.com/scripts/ViewRenderedSource.html

10 Posted by Tino Zijdel on 5 July 2005 | Permalink

4rn0: embed is not part of the (X)HTML rec. and doesn't have to be since all modern browsers support the object element (only IE's implementation is not fully according to spec).
If you search a little you will find cross-browser, (X)HTML-valid examples for embedding things like Flash or QuickTime.

11 Posted by zakoops on 5 July 2005 | Permalink

From a pure standpoint, such a document is invalid. It's just that current validation tools cannot differentiate between the pristine or virginal (wow!) document and its final state after JavaScript transformations.

We could reverse the situation: deliver an invalid XHTML document, and then change it into a valid Strict one through DOM manipulations. But validators are of no help in such a case.

12 Posted by Antonio Bueno on 5 July 2005 | Permalink

IMHO the document becomes invalid. For me validity means complying with a list of conditions. Some of them I agree with (e.g. excluding target="_blank" from link tags) and some of them I don't (what's the problem with having text directly in the blockquote?).

If you trick the validator into giving you the OK, that's just because validators don't execute JavaScript. Imagine that someday they do: your validated document suddenly becomes invalid! I'd love validators to process JavaScript. Right now I can validate my static (X)HTML but not my dynamic one.

Whether validity is a good or a bad thing or something in between is to be discussed somewhere else.

And as for Firefox Extensions to see modified source, I prefer this one:
https://addons.mozilla.org/extensions/moreinfo.php?id=697

13 Posted by Jeremy French on 5 July 2005 | Permalink

Invalid.
If the change were made in a server side script instead of a client side script there would be no argument.

How should the UA render this text? It may not know. It will have to make an assumption, which is something we don't want.

The source file is not invalid however, only the document after the script is run.

14 Posted by Maian on 5 July 2005 | Permalink

Although this doesn't answer the question, I should mention that XHTML 1.1 supports child text nodes.

15 Posted by Caliban Darklock on 5 July 2005 | Permalink

I think the biggest problem is that question of "if it were technically possible". How does the validator know what is possible? It would burn so many cycles trying to figure out whether something *should* be validated, validation would take forever.

Sometimes things are only acceptable within known boundaries, and the validator can't know them. Consider a recursive factorial function that doesn't validate its arguments; if you pass it a non-numeric string or call it with a number greater than (say) 14, it dies horribly. But if the input comes from an array of nine buttons, each labeled with a number from 1 to 9, do you really have a problem? How could a validator *know* your argument is properly restricted to this problem space?
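A sketch of the situation described above, assuming JavaScript and a hypothetical set of nine buttons represented by their labels. The function deliberately performs no argument checking; the point is that the only callers stay inside the safe range, which no static check of the function alone could know.

```javascript
// Recursive factorial with no argument validation, as in the comment above.
// A non-numeric string such as "abc" never reaches the base case:
// "abc" <= 1 is false, and "abc" - 1 is NaN, so the recursion never ends.
function factorial(n) {
  if (n <= 1) return 1;
  return n * factorial(n - 1);
}

// Hypothetical UI: the only possible inputs are the labels of nine buttons.
const buttonLabels = ["1", "2", "3", "4", "5", "6", "7", "8", "9"];
const results = buttonLabels.map((label) => factorial(Number(label)));

console.log(results[8]); // 362880
```

The function is unsafe in general, yet every call that can actually happen here is fine.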

16 Posted by Philip Hazelden on 5 July 2005 | Permalink

Invalid. The point of validity isn't so you can say to people, "Look! My code is valid! Worship me!", it's so the browser can understand your code. A validating browser is allowed to crash and burn if it sees a text node inside a blockquote (I don't get why, but that's another matter), and it won't care whether that was already there when it first saw the page or not.

As to the validator, the easiest thing to do would be to use a web browser which supports Javascript and can validate as it goes along. Getting a computer to work out what sequence of user events can make the page invalid would be possible, but hard and slow.

17 Posted by Joost Diepenmaat on 5 July 2005 | Permalink

Tino Zijdel said:

"Luckily browsers don't actually validate syntax rules to the DOM according to a DTD; if they would they probably should have not allowed such an operation."

But the question then becomes: can you count on the DOM not to validate? It's perfectly possible to create a DOM implementation that just disallows all invalid (according to the DTD) manipulation.

Personally, I would rather not take the risk. On the other hand, I don't know of any mechanisms to check the generated DTD, though it could be done in JavaScript, of course. Might be an interesting project for someone :-)
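A rough sketch of what such a validating append operation might look like, written in JavaScript over plain objects rather than a real DOM. The `strictAppend` function and the two-entry `TEXT_ALLOWED` table are hypothetical; a real implementation would need the full content model from the DTD.

```javascript
// Hypothetical stand-ins for DOM nodes: plain objects, not a real DOM API.
const text = (value) => ({ type: "text", value });
const element = (name, children = []) => ({ type: "element", name, children });

// Heavily reduced content model (an assumption for this sketch).
// false = no character data as a direct child; unlisted elements are not checked.
const TEXT_ALLOWED = { blockquote: false, p: true };

// Append a child only if the reduced content model permits it.
function strictAppend(parent, child) {
  const isRealText = child.type === "text" && child.value.trim() !== "";
  if (isRealText && TEXT_ALLOWED[parent.name] === false) {
    throw new Error("Text is not allowed directly inside <" + parent.name + ">");
  }
  parent.children.push(child);
  return child;
}

const quote = element("blockquote");
const para = strictAppend(quote, element("p"));
strictAppend(para, text("Quoted words.")); // accepted: text inside <p>

let rejected = false;
try {
  strictAppend(quote, text("Source: example")); // refused: text inside <blockquote>
} catch (err) {
  rejected = true;
}

console.log(rejected, quote.children.length); // true 1
```

With a check like this in place, the citation script from the quiz would fail at the moment it tried to add the bare text, instead of silently producing an invalid tree.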

18 Posted by Alex Lein on 5 July 2005 | Permalink

Wow, I thought this would be far less one sided.

We really need to be clear about the definitions of "document" and "DOM". Honestly, the document never changes, but your DOM tree does, and that's what the UserAgent displays. So if you want to be SUPER-technical, the document remains valid since the initial state is all that matters, but your DOM does contain invalid XHTML 1.0 Strict markup.

19 Posted by Ryan Cannon on 6 July 2005 | Permalink

As a general rule, any time I dynamically include content not edited by myself (user content, or blockquote content as in the example above), I always wrap it in an additional DIV element, which protects the validity of the document in most cases.

Not sure if this changes the context of the citation, but that would be transparent to the reader.

20 Posted by David on 6 July 2005 | Permalink

The restructured document is invalid. As others have stated, the point of validation is to help ensure that standards-compliant browsers can properly render the document. If it were to just be able to put up a "Valid XHTML 1.0 Strict" banner to trumpet how good of a webdesigner you are, and modifying the document through Javascript didn't invalidate the page, then you could put together a basic, empty page, and load all the content through Ajax (or some such). Because the basic page is "valid", you'd then be able to argue that all of your pages are "valid", because hey, the Javascript is only changing the DOM, right? Just ignore the unclosed tags, the improper overlapping of open/close tags, etc.; those were all added after the document was validated!

Obviously, this is an extreme example, but it should serve to illustrate my point--validity is a tool, not an end in and of itself. Any manipulation of the document, before or after it is loaded, should have the possibility of invalidating the document.

As for validators exercising Javascript--ideally, they would, but I can't see it happening. How would they traverse all menus to verify that they don't muck up the document? Simulate clicking on all elements, in all possible orders? And so on, and so forth? It's just not feasible. But the question said to assume it was possible--so yes, they should.

21 Posted by Jacob on 6 July 2005 | Permalink

IMO, the page becomes invalid. The script has changed the document structure such that it is no longer valid. You could use innerHTML to put in all sorts of rubbish that clearly wouldn't be valid - this isn't much different. The DOM remains correct but the page structure does not.

22 Posted by ppk on 11 July 2005 | Permalink

Well, the opinion of my readers is clear: the document is invalid. Only one clear vote was cast for a valid document, against 10 clear votes for an invalid document.

I myself just don't know, that's why I asked the question. I raised the same question on XHTML-L three or four years ago, and there valid and invalid supporters were about equal in number.

I'm going to leave this ToughQuiz open on the off chance that a new commenter might give a useful insight. I'm especially interested in views that say the document is valid.

23 Posted by Alex Lein on 12 July 2005 | Permalink

I did some more investigation into this and came up with an interesting test page.

http://www.baka.ca/test/xhtml_markup.htm

The page is valid XHTML Strict, but when you alert() the innerHTML, the elements are not returned as valid XHTML at all. In IE6, the quotation marks are removed from numeric attributes as well.
My point is that the document is still "Valid", but the DOM (what the user reads and we manipulate) is never valid.

24 Posted by Thomas Barnett on 13 July 2005 | Permalink

Look at it another way.
Assume the document is initially XHTML 1.0 strict. After the document loads a script runs and re-writes the whole document, declaring it as HTML 4.01, and making the necessary changes.
Is the restructured document still XHTML 1.0 Strict? Should user agents treat it as XHTML 1.0 Strict?
Answer, without doubt, is no.
By the way your home page http://www.quirksmode.org/home.shtml, declared to be valid XHTML 1.0, does not validate.

25 Posted by David on 13 July 2005 | Permalink

Thomas,

The validator balks on one anchor tag: http://www.theonion.com/news/index.php?issue=4127&n=1 . The tag looks correctly formed to me. The part of the href that seems to be causing the validator problems is the '&n=1' at the end of the url. (The first error it gives is, 'cannot generate system identifier for general entity "n"', as if it were trying to treat the '&n' similar to '&amp;', which doesn't make sense.) Color me clueless, but how is one supposed to put multiple arguments in the href in XHTML?

Unless I'm wrong, I'd say we're looking at a problem in the validator, rather than a problem in ppk's page.

Incidentally, while trying to compose this note, I had to struggle with the preview system converting my "&amp;"s into simple ampersands in the text box, which meant that the text in the box (where I'm typing) didn't match the previewed text. Rather annoying.

26 Posted by Alex Lein on 14 July 2005 | Permalink

@Thomas Barnett
You bring up an interesting issue. If the page is declared with an XHTML 1.0 Strict DTD, then even after the page was re-written to be valid HTML 4.01, the UserAgent would still consider the page to be an XHTML page, and some of the finer details of the HTML content would not be rendered correctly.

@David
In XHTML 1.0 Strict, when you pass arguments in the QueryString, you delimit them with "&amp;" and not "&". Your UserAgent should interpret the "&amp;" as an "&".
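As a small illustration of that rule, here is a sketch in JavaScript of escaping a URL before writing it into an href attribute. The helper name `escapeAttribute` is an assumption made for this example; the URL is the one from David's comment.

```javascript
// Escape characters that are special inside a double-quoted attribute value.
// The ampersand is replaced first so the entities added later are not re-escaped.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
}

const url = "http://www.theonion.com/news/index.php?issue=4127&n=1";
const anchor = '<a href="' + escapeAttribute(url) + '">link</a>';

console.log(anchor);
// <a href="http://www.theonion.com/news/index.php?issue=4127&amp;n=1">link</a>
```

The browser turns "&amp;" back into "&" when it reads the attribute, so the request still carries both arguments.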

27 Posted by Dante on 17 July 2005 | Permalink

I've read through the 'invalid' comments, and the reason I said it was valid wasn't true. The document is invalid, but unless someone attaches some kind of validate() function to the onchange event of the document, it doesn't matter. Validators only take the static source of the webpage.

28 Posted by Jim on 6 August 2005 | Permalink

The document is valid.

XHTML 1.0 is a specification that describes a document format. It doesn't describe an in-memory tree, it describes syntax and structure as if the two were inseparable.

Talking about whether a DOM tree is valid or invalid is a nonsensical question, because validity is tightly coupled to syntax. It's like asking what colour a song is - a nonsensical question that has no correct answer.

The only sensible questions to ask are:

1. Is the *document* valid (not an in-memory representation of the document after it has been altered, but the actual sequence of bits)?

Answer: Yes, it is.

2. Would a document produced by serialising the DOM be valid?

Answer: No, it wouldn't.
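To make the second question concrete, here is a sketch of serialising a modified tree back to markup, again in JavaScript over plain objects rather than a real DOM. The `serialize` function and the node shapes are assumptions for illustration; browsers have their own serialisers, whose output may differ.

```javascript
// Hypothetical stand-ins for DOM nodes: plain objects, not a real DOM API.
const text = (value) => ({ type: "text", value });
const element = (name, attrs = {}, children = []) =>
  ({ type: "element", name, attrs, children });

const escapeText = (s) => s.replace(/&/g, "&amp;").replace(/</g, "&lt;");
const escapeAttr = (s) => escapeText(s).replace(/"/g, "&quot;");

// Turn a node tree into an XML-style string with explicit close tags.
function serialize(node) {
  if (node.type === "text") return escapeText(node.value);
  const attrs = Object.entries(node.attrs)
    .map(([key, value]) => " " + key + '="' + escapeAttr(value) + '"')
    .join("");
  const inner = node.children.map((child) => serialize(child)).join("");
  return "<" + node.name + attrs + ">" + inner + "</" + node.name + ">";
}

// The tree as it stands after the citation script has run.
const quote = element("blockquote", { cite: "http://example.com/source" }, [
  element("p", {}, [text("Quoted words.")]),
  text("Source: http://example.com/source"),
]);

const markup = serialize(quote);
console.log(markup);
// <blockquote cite="http://example.com/source"><p>Quoted words.</p>Source: http://example.com/source</blockquote>
```

Fed to a validator as a document fragment, this string has text directly inside <blockquote>, which is exactly what the Strict DTD does not allow.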