Thursday, June 10, 2010

Simplifying HTML Part 2 of 4

In part one I called out a few of the more awful features of HTML for removal. Some of these removals make KISSML not quite compatible with HTML, and not quite a strict subset of it in the way that Crockford’s “good parts” Javascript is a subset of the full Javascript language. This was criticised by a coworker of mine, and quite rightly too! Here is my response to that criticism:

Two of my removals, the script tag and the style tag, would on their own leave a strictly compatible subset of HTML. To mean anything useful, though, they must be enforced in some way. The whole point of removing these tags is to prevent XSS-style attacks. Currently, if you want to eliminate XSS attacks on a specific site, you end up engaging in a rushed kind of language sub-setting exercise. If you are very unwise, you may attempt to use regular expressions to achieve this sub-setting. In this impromptu language design exercise there are two goals you can aim for, but you cannot achieve both simultaneously:

    1. The language must be usable unmodified directly in the browser, without compatibility problems.
    2. The language must be acceptable, unmodified, from user facing inputs.

After trying out both of these goals in various systems, I believe goal 2 is the more pragmatic, wiser choice. (Though please point out if I’ve presented a false dilemma!)

With HTML, to prevent XSS, it is already the case that we must use sanitisers such as HTML Purifier, or pseudo markup languages like Markdown. In both cases there is a transformation between what the user inputs and what ultimately gets served to the browser. The programmer must choose between storing the original user input, storing the transformed version, or possibly both. In addition, we already have template languages like PHP in common use, and these are interpreted and transformed before being sent to the browser. I would like to suggest that, since we are already transforming our inputs well before they get to the browser, making slightly incompatible changes to old versions of HTML, changes which can be transformed back into legacy HTML, is not all that bad a deal. The only other situation where it may become difficult is authoring static pages with no server side components and no requirements for user input.
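To make the "transformation between input and output" concrete, here is a minimal sketch of an allowlist-style sanitiser built on a real parser rather than regular expressions. It is not how HTML Purifier works internally, and the tag list, the URL rule, and the names AllowlistSanitiser and sanitise are all assumptions invented for this illustration. It is nowhere near a complete XSS defence.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy, chosen only for illustration.
ALLOWED_TAGS = {"p", "em", "strong", "a"}
DROPPED_WITH_CONTENT = {"script", "style"}
SAFE_URL_PREFIXES = ("http://", "https://")


class AllowlistSanitiser(HTMLParser):
    """Re-emits allowlisted tags only; everything else is dropped or escaped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.output = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = ""
        for name, value in attrs:
            # Only href on <a> survives, and only with an http(s) URL.
            url = value or ""
            if tag == "a" and name == "href" and url.lower().startswith(SAFE_URL_PREFIXES):
                kept += f' href="{escape(url)}"'
        self.output.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROPPED_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.output.append(f"</{tag}>")

    def handle_data(self, data):
        # Text outside script/style is kept, but re-escaped for legacy HTML.
        if not self.skip_depth:
            self.output.append(escape(data))


def sanitise(user_input):
    """Transform untrusted input into the restricted subset."""
    parser = AllowlistSanitiser()
    parser.feed(user_input)
    parser.close()
    return "".join(parser.output)
```

Whether you store the original input, the output of something like sanitise, or both, is exactly the choice described above.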

I have some ideas for solving all these problems, but they will come in part 4, so bear with me!

What to Generalise and Consolidate: The Incompatible Parts

Preformatted and Literal Text: I’ve already written that I’m removing HTML entities. I also said that, for the most part, using UTF-8 directly takes care of the need for funny characters like en-dashes and vowels with umlauts. However, there’s still one thing UTF-8 can’t quite do. Since the characters <, > and " have special meaning in KISSML, we need a simple way to represent them literally. If we want to display blocks of code in HTML today, we reach for the <pre> or <code> elements, but those only change how the text is displayed: markup inside them is still parsed, so every < and > has to be escaped by hand. Only the long-deprecated <xmp> element treats its contents as plain text until its closing </xmp> is encountered, and even then, what if we want to talk about xmp tags inside an xmp tag? My solution is that an “interpret my contents as plain text” property should be generalised and applicable to any element in KISSML. I will call this attribute “literal”. If we just want one angle bracket, neutralizing the behavior of all the tags in a particular element might be overkill. For the case where you want a one-off instance of a special character, we have the element types <lt/>, <gt/> and <q/>. These are plain inline elements with the cdata attribute, predefined to contain the text <, > and " respectively.

In addition, there is the blank element type <></>, which by default renders its contents surrounded by quote marks. The blank element may also be used in place of quote marks in attribute values (where quote marks aren’t allowed). Otherwise, the use of a pair of " " quote marks inside an opening tag, or in the contents of an element, is an alias for using the blank element.

I eliminated HTML entities by replacing them with equivalent functionality defined in terms of HTML elements and attributes. In other words, I consolidated entities and elements into one concept. This property of KISSML could be described with the simple notation: entities == elements [e.g. &lt; == <lt/> ]. I also consolidated quote marks, as used in attributes, with a particular kind of element. Thus, "foo" == <>foo</>. Here I will summarise the rest of KISSML’s generalisations in this equation style.

tag name == name == class == id == attribute name == css property name
and also:
element attribute == css property

This one is quite iconoclastic indeed. I’ve never understood why we need 6 drastically different ways to attach name/value pairs to elements. KISSML has only one way.
And so the following labyrinth of HTML:

<button name="mylink"><a href="http://example.com" id="mylink" class="buttonimage contentimage"><img src="button.jpg" style="display: block; width:100%;" /></a></button>

may become, in KISSML:

<button a img href="http://example.com" src="button.jpg" mylink buttonimage contentimage display="block" width="100%" />

In KISSML, we eliminate the specialness of “tag types” like “a” and “p”. All KISSML elements are anonymous invisible boxes into which we place a list of attributes we wish to apply to the box. We presume the existence of some kind of external “style” language similar to CSS that is capable of defining how these attributes affect the way the element is displayed. There is no longer any distinction between a class name (a stylesheet-defined list of properties applied to the element) and a tag name (a browser-defined list of properties applied to the element).

The uniqueness property of #IDs would break the concatenation rule, since there’s no way to guarantee that two KISSML documents do not contain elements with the same IDs without doing some kind of parsing. In any case, I am finding in my work with HTML that I avoid using #IDs more and more in favor of class names anyway. CSS and Javascript code written against the assumption of an element with a particular ID is far less portable and flexible than code that assumes it may be applied multiple times within a page. This also fits with the no-special-case pattern, since the logical consequence of this consolidation is the replacement of the DOM methods getElementsByTagName, getElementsByClassName, and getElementById with a single method, getElements, which returns an array of elements. The only result case you need to handle is iterating through an array of elements.

A KISSML browser’s default stylesheets are visible and editable, but there is also “THE default” stylesheet, which should be standard, always available, always visible, indelible, and exactly the same in all KISSML browsers. So the vast universe of markup that needs to be interpreted and displayed the same way by different browsers can be specified in the /one true stylesheet/. The only thing the different browsers need to match in native implementation is the relatively small set of primitive attributes.
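As a rough sketch of what a single getElements lookup could look like, here is a toy model in which every element is just a bag of attributes plus a list of children. The dictionary layout, the use of True for valueless attributes, and the helper names element and get_elements are assumptions made for this example; no real KISSML DOM exists to copy from.

```python
def element(attrs, children=()):
    """Build a toy element: an anonymous box with attributes and children."""
    return {"attrs": dict(attrs), "children": list(children)}


def get_elements(node, name):
    """Return every element in the tree that carries the attribute `name`.

    Former tag names, class names and ids are all plain attribute names here,
    so one lookup covers all three cases and always returns a list.
    """
    found = []
    if isinstance(node, dict):  # text nodes are plain strings and are skipped
        if name in node["attrs"]:
            found.append(node)
        for child in node["children"]:
            found.extend(get_elements(child, name))
    return found


# The flattened button from the KISSML example, modelled as one element.
button = element({
    "button": True, "a": True, "img": True,
    "href": "http://example.com", "src": "button.jpg",
    "mylink": True, "buttonimage": True, "contentimage": True,
    "display": "block", "width": "100%",
})
caption = element({"p": True, "contentimage": True}, ["Press me"])
page = element({}, [button, " ", caption])

links = get_elements(page, "a")                  # what used to be a tag name
images = get_elements(page, "contentimage")      # what used to be a class name
```

Because the result is always a list, calling code never needs a separate branch for "one element" versus "many elements" versus "none".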

DTD == Stylesheet
Doctype Declaration == Stylesheet link
Validator == KISSML-LINT

That default stylesheet in our theoretical style language should also be usable for validation purposes. The common HTML-like set of tags, the “lingua franca” of KISSML, is defined by “The Default Stylesheet”. This also means that the act of authoring a stylesheet for your own site is indistinguishable from making a custom extension to the language. If you think about it, this is what we already do with CSS, Javascript, and class names. This consolidation is only an acknowledgement of this fact, making this behaviour first class.

The default stylesheet, aside from determining the default display behaviour of attributes, should also be able to declare code style rules, which can be enforced by the validator. Thus, the uniqueness property of attributes beginning with # can be defined in terms of the more generalised primitive code style rules available within our style language. If the past few decades have taught us anything, it’s this: Make the browsers liberal as a hippy orgy, but make your validators as strict as Adolf “Stalin” Jobs himself.
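One way to picture such a code style rule is as a small lint check run by the validator. The sketch below assumes the same toy model of elements as dictionaries with "attrs" and "children" keys, and the function names iter_elements and lint_unique_hash_names are hypothetical. How a real style language would declare this rule is left open here.

```python
from collections import Counter


def iter_elements(node):
    """Yield every element in a toy tree; plain strings are text nodes."""
    if isinstance(node, dict):
        yield node
        for child in node["children"]:
            yield from iter_elements(child)


def lint_unique_hash_names(document):
    """Return the #-prefixed attribute names that occur on more than one element."""
    counts = Counter(
        name
        for elem in iter_elements(document)
        for name in elem["attrs"]
        if name.startswith("#")
    )
    return sorted(name for name, count in counts.items() if count > 1)


# Two elements both claim "#header", so the strict validator complains,
# even though a liberal browser would happily render the page.
document = {
    "attrs": {},
    "children": [
        {"attrs": {"#header": True}, "children": ["Title"]},
        {"attrs": {"#header": True, "p": True}, "children": ["Oops"]},
        {"attrs": {"#footer": True}, "children": []},
    ],
}
violations = lint_unique_hash_names(document)
```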

All that said, let us never fall into the trap of saying “The stylesheet determines what the attributes mean”. Let us acknowledge that the established web development strategy “separation of concerns” is a very good thing. Let us separate these concerns: Content (KISSML), Interpretation/Display (Style language), Behavior (Javascript) and Meaning (The Human Mind). Let us endeavour to avoid mixing these concerns, and let us not be so foolish as to think that a document full of computer code indicates community-wide agreement on the meaning of words, which rightfully should be determined by prose, debate and negotiation.

attribute value == element content == node list

and so:

<img src="example.png" title="here is some <strong>markup</strong> <q>language</q>" > But let’s also get rid of the alt attribute, because the img tag can <em>already</em> contain marked up content ! </img>

is valid KISSML, thus eliminating the problems we have run into as web developers due to the fact that in HTML the alt attribute cannot contain HTML. This makes the language more general and powerful, and also repairs the impedance mismatch between XML and JSON that I’ve talked about in previous blog posts. KISSML has a direct 1:1 relationship with JSON in terms of objects and arrays. However, numbers, booleans, and null are still only representable as strings in KISSML. The following 3 examples should result in the same internal “DOM” structure when interpreted by a KISSML browser. The first 2 examples are KISSML, and the third is JSON.

The quick brown <strong>fox</strong> jumped over the lazy <abbr title="Dynamic <em href=<>http://odour.net</> >Odour</em> Generator" >dog</abbr>.

The quick brown <strong="fox" /> jumped over the lazy <abbr="dog" title=<>Dynamic <em href=<>http://odour.net</> >Odour</em> Generator</> />.

["The quick brown ", {"strong":"fox"}, " jumped over the lazy ", {"abbr":"dog", "title":["Dynamic ", {"em":"Odour", "href":"http://odour.net"}, " Generator"]}, "."]

From this comparison, you can see KISSML as being in the same spirit as JSON, while addressing JSON’s weaknesses for representing documents. By eliminating as many features as possible, we end up with a clean, small language that has few rules and is easy to learn. The dictionary of words that you can use in KISSML is observable, editable, and public, and is not part of the core syntax and language, but rather more like a standard library. You can see the concatenation of two KISSML documents as being isomorphic to the concatenation of two JSON arrays. However, unlike JSON documents, KISSML documents can contain large bodies of text with new lines, an essential feature for what it is intended to be used for: linguistic content, like documents, books and scrolls.
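The concatenation claim can be tried out on the JSON form directly. This sketch splits the example sentence into two node lists, joins them with ordinary list concatenation, and flattens the result back to text. The helper names concatenate and text_content are made up for the illustration, and it assumes the first key of each object names the element and holds its content, which is how the example above reads but is not something the JSON format itself guarantees.

```python
import json

# The example sentence, split into two separate "documents".
first_doc = json.loads('["The quick brown ", {"strong": "fox"}, " jumped"]')
second_doc = json.loads(
    '[" over the lazy ", {"abbr": "dog", "title": ["Dynamic ", '
    '{"em": "Odour", "href": "http://odour.net"}, " Generator"]}, "."]'
)


def concatenate(first, second):
    """Joining two documents is just joining two node lists."""
    return first + second


def text_content(nodes):
    """Flatten a node list to plain text, ignoring every other attribute."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
        else:
            # Assumption: the first key is the element name, its value the content.
            content = next(iter(node.values()), "")
            parts.append(content if isinstance(content, str) else text_content(content))
    return "".join(parts)


combined = concatenate(first_doc, second_doc)
sentence = text_content(combined)
```

No wrapper element, doctype, or id bookkeeping is needed for the join, which is the property the uniqueness rule for IDs would otherwise break.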
