Some time ago, I had my first experience of “Why’s (Poignant) Guide to Ruby” courtesy of “The Best Software Writing” selected and introduced by Joel Spolsky. I was intrigued by the strange writing style, the cartoon foxes, and indeed the language Ruby. But honestly, for whatever reason, it was not an appropriate time for me to follow up.
When I recently saw the full-colour web version, I was wowed by the sheer extent of the effort that has been put into it. The site really appealed to me visually, but I had a strong feeling that such effort ‘should’ be reproducible in a book form. Of course, I could print out the separate chapters using my web browser, but that would lead to all sorts of pagination issues. Someone had produced a pdf format of the Guide as a whole, but as well as having some corruption in it, it too lacked proper pagination. I felt it deserved a paginated pdf… if you want the pdf I created, please click here. This page covers the story of that conversion.
I decided to try to convert it in such a way that it could be input to InDesign, laid out, and from there output to a pdf file.
My assumption was that some sort of html to xml to InDesign conversion could be done, using XSLT to do the key conversions.
Understanding the Options in InDesign
If you’re going to manipulate data on the move from one application to another, you had better understand the interfaces that are available to you. Our source data was ‘easy’ in broad terms, in the sense that I had access to the source html and css files. In detail it was somewhat harder as no single chapter appeared to use all of the styles.
The destination (InDesign CS) has two main ways of importing information:
- Tagged Text, an InDesign special format that includes formatting information about text. However, it can’t handle pictures, frames etc.
- XML. Very tightly specified XML can be imported in sensible ways if every single element is present in exactly the expected / specified place, and there are exactly the right number of them. There are some useful bits of functionality in terms of mapping xml tags to styles (character or paragraph), but there are no mechanisms for inspecting the attributes on a tag to help define the style that it should be mapped to. (This is important because a lot of html / css formatting will be done with <p class="xxx"> type settings.) Further, when we start to consider ‘loosely’ specified xml, i.e. the sort of thing we would be likely to get from a translation of the Guide, there is no way for InDesign to respond to certain tags with clever things like: “When you see an <img> tag, please import the document specified in the ‘src’ attribute.” And you certainly can’t create new frames and things like that. Or at least, you can’t do any of these things without using scripting… and honestly, I was not up for learning the appropriate scripting language and the InDesign interface.
In considering the downsides of InDesign, it became clear that I should consider one key alternative; that being XSL-FO. XSL-FO is all about print layout, it seems, but you need a software tool to take your XSL-FO file and convert it to PDF. After a reasonable investigation, I was unable to find a tool that was free, yet well specified.
So it is perhaps a little bit odd to start with something that is ‘Creative Commons’, use an XSLT tool that is ‘open source’ (Xalan, not Stylus Studio 2006), and then move to a proprietary application like InDesign. However, financial considerations (I already owned InDesign), power and personal experience won the day in this case.
Starting at the Destination
I decided that I was going to build an InDesign ‘book’ made up of several chapters, as this approach nicely echoes the original html documents. This led me to create a template for the book / chapter as the best place to start. The template:
- Set the basic page appearance, using page templates;
- Created Paragraph and Character styles to roughly map to those in the CSS / html source;
- Specified the mappings between xml tags and styles.
Prior to Conversion
Before the html could be converted, I needed to make some minor edits to it. The <!DOCTYPE and <html lines were stripped and replaced with an xml header, and the <html> tag and its closing tag were removed. I also had to search the document for and delete them, and replace <br /> elements with ‘REPLACEBR’. Finally, all ‘src’ and ‘href’ attributes had to be converted to fully specified URLs. I am not sure when it started, but at some point InDesign started complaining that any other type of locator was an invalid file. This is particularly idiotic, as in all other respects it appeared to ignore any attributes!
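The <br /> and URL edits above are mechanical enough to script. What follows is only a minimal sketch in Python, not the procedure I actually followed: the function name prepare_html and the base URL are made up for illustration, and it uses plain regular expressions rather than a real html parser.

```python
import re
from urllib.parse import urljoin

# Hypothetical base URL, for illustration only; substitute the real page address.
BASE_URL = "http://example.com/guide/chapter-1.html"

def prepare_html(html, base_url=BASE_URL):
    """Apply the <br /> and URL edits described above to one chapter's text."""
    # Swap each <br>, <br/> or <br /> for a marker that survives whitespace stripping.
    html = re.sub(r"<br\s*/?>", "REPLACEBR", html)

    # Rewrite every src="..." and href="..." value as a fully specified URL.
    def absolutise(match):
        attr, quote, value = match.group(1), match.group(2), match.group(3)
        return f"{attr}={quote}{urljoin(base_url, value)}{quote}"

    return re.sub(r'\b(src|href)=(["\'])(.*?)\2', absolutise, html)
```

Values that are already fully specified pass through urljoin unchanged, so the function can be run over the same chapter more than once.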
I used the Xalan processor to handle the XSLT transformations. Well, I did also play around with a demo of Stylus Studio 2006, which was very useful at times (debug traces and so on), but in the end, Xalan handled some complexities like special characters and whitespace stripping slightly more smoothly.
One of the main challenges was that InDesign (as has already been noted) does not really understand the concept of a ‘block level element’ in a loosely-formatted xml document. Sure, you can map an xml tag to a frame if you want, but it won’t use it unless it appears in exactly the right place in the file. Essentially then, InDesign did not understand blocks. As a result, one of the early decisions for the XSLT was that it would convert certain html tags to a new tag name that included the class attribute. For example, <div class="example"> in the source would get converted to <divexample> in the output. This would prove useful for some tags in mapping them to different styles.
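To make the renaming rule concrete, here is a rough Python equivalent using the standard library’s ElementTree. It is a stand-in for illustration, not the XSLT I wrote: the set of tags that get the treatment (FOLD_CLASS_INTO) and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of which tags get the class folded into their name.
FOLD_CLASS_INTO = {"div", "p", "span"}

def fold_class_into_tag(root):
    """Rename e.g. <div class="example"> to <divexample>, in place."""
    for elem in root.iter():
        css_class = elem.get("class")
        if elem.tag in FOLD_CLASS_INTO and css_class:
            # Keep only letters and digits so the result is a simple tag name.
            suffix = "".join(ch for ch in css_class if ch.isalnum())
            elem.tag = elem.tag + suffix
            del elem.attrib["class"]
    return root
```

Elements without a class attribute are left alone, so a plain <div> stays a <div>.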
I also had to create some rules for changing tags based on context. This is because the source document is html / css, where css is ‘Cascading Style Sheets’. The style sheet sets up a hierarchy of styles, where the effect of a style on a parent will cascade to a child element, unless it is overridden for the child. Unfortunately, InDesign CS does not understand that particular concept of ‘cascading’. It has something called ‘Nested styles’ which guide the use of character styles within a paragraph style, but again, this can only work for very specific situations; for example, ‘all the characters before the first tab should be formatted with the “chapterNumber” style’. Not the same thing as cascading styles at all. It therefore became necessary to re-tag some items based on the context in which they were found; <p> elements normally went unchanged, but when they were inside a <div class="sidebar"> element, they were output as <sidebar-p> elements. This allowed InDesign to differentiate the two with different mappings.
In other words, I needed to build a reasonable amount of context (or hierarchical) sensitivity into the XSLT transforms.
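The sidebar rule can be sketched in Python as a recursive walk that carries the context down the tree. Again, this is an illustration of the idea rather than my XSLT, and the function name is invented; it matches on the original class attribute of the <div>.

```python
import xml.etree.ElementTree as ET

def retag_sidebar_paragraphs(elem, in_sidebar=False):
    """Rename <p> to <sidebar-p> wherever it sits inside <div class="sidebar">."""
    # Entering a sidebar div switches the context on for everything below it.
    if elem.tag == "div" and elem.get("class") == "sidebar":
        in_sidebar = True
    for child in elem:
        retag_sidebar_paragraphs(child, in_sidebar)
    if elem.tag == "p" and in_sidebar:
        elem.tag = "sidebar-p"
    return elem
```

Because the flag is passed down rather than looked up, a <p> nested several levels inside the sidebar is re-tagged too, which is the ‘cascade’ that InDesign could not supply.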
The Leg-Work in InDesign
The manual labour really happened in InDesign. After many trials, it became apparent that the best way to import the content of the source was to ignore header and footer type data, and import just the newly created <divcontent> element into the main template text frame. Auto-flow created new pages as needed, and the tag to style mapping applied styles appropriately.
Some manual search-replace operations were required, for example to search for ‘REPLACEBR’ and replace each occurrence with a newline. A cack-handed approach to say the least, but a necessary consequence of some of the issues raised by whitespace handling throughout the process.
Many of the tags that were created in conversion, and the styles they mapped to, were designed to highlight areas of manual intervention. For example, <img> tags were adjusted to output in text the name and title of the graphic to be imported. The style that it mapped to highlighted this with a special format that was easy to see. InDesign CS2 has apparently improved in-line image handling substantially, but I think even with that application I would probably have had to import each image by hand, rescaling it suitably and ensuring that the line height was suitable to contain it. Similarly, whilst the <a> ‘anchor’ tags did not require any particular import, they were highlighted in such a way as to remind me to create an InDesign bookmark at that location (then I would delete the highlight text as appropriate). Other things like Sidebars had to be moved from the in-line position they imported with, into suitably formatted frames that I had created on my template’s pasteboard.
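As an illustration of the <img> handling, the Python sketch below replaces each image with a text placeholder. The <imgnote> tag name and the bracketed text format are invented for this example; my own conversion used XSLT and its exact output is not reproduced here.

```python
import xml.etree.ElementTree as ET

def image_placeholders(root):
    """Turn each <img> into a text-only element flagging a manual import."""
    for img in root.iter("img"):
        src = img.get("src", "")
        title = img.get("title", "")
        img.tag = "imgnote"   # hypothetical tag, to be mapped to a loud style
        img.attrib.clear()
        img.text = f"[IMAGE: {src} | {title}]"
    return root
```

Mapping the placeholder tag to a garish character style makes every spot that needs a hand-placed graphic easy to find after import.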
Even InDesign would sometimes blow a gasket on certain amounts of repagination, throwing images into header areas and so on… but normally these issues were recoverable with a little manual prod (even simply highlighting the image would normally do the trick).
Iterate, Iterate, Iterate
As each chapter of the source used different styles, and demonstrated different issues, it was not reasonable to identify at the start every circumstance that the XSLT conversion, and then the InDesign template, would come across. Each chapter normally led to a process (granted, a very frustrating one) of importing the converted xml, checking the broad layout that resulted, looking for new xml tags not mapped to a style, and then altering the mapping or the XSLT appropriately, before redoing the conversion and import. Unfortunately, InDesign seems to maintain an ‘embedded template’ system. That is to say, when you base a document on a template, you can override and add styles and settings within that document, but changes are not echoed back to the ‘real’ template. Conversely, changes made to the template are not echoed back into the documents that use it, though certain styles etc can be copied from the template (or any other document). This meant that on occasion I had to create a new style in one or several documents, and in the original template, ready for the work on future chapters.
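The ‘looking for new xml tags not mapped to a style’ step is the sort of check that could be run before each import. This is a small Python sketch of that idea, not something I used at the time; the function name is made up, and the list of mapped tags would have to be kept by hand alongside the InDesign template.

```python
import xml.etree.ElementTree as ET

def unmapped_tags(xml_text, mapped_tags):
    """Return the sorted tag names in xml_text that have no style mapping yet."""
    root = ET.fromstring(xml_text)
    seen = {elem.tag for elem in root.iter()}
    return sorted(seen - set(mapped_tags))
```

For example, calling it on a converted chapter with the set of tags already mapped in the template returns just the names that still need either a new style or a change to the XSLT.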
This is just one respect in which InDesign really fails the user, in my opinion, especially considering that it has facilities for ‘book’ creation where just such issues could apply. So does Word, for that matter. DreamWeaver is the only application that I have used, as far as I can recall, that enables a ‘linked template’ by default; in this system, changes made to the template can be auto-updated in the ‘children’ that use that template, and my experience has been that this works very well indeed for websites.
The front page and index were too much to be laid out using any automated process. I also added the indexing by hand, which again was an education in the effort required to build a sensible index; I do not pretend I achieved that, but I think I moved some way towards it.
One of the biggest issues encountered was that of white-space. Html essentially ignores multiple-repeating whitespace (with a few exceptions like non-breaking spaces). InDesign, however, did not ignore this whitespace. In the end, the solution was primarily concentrated on the xsl:strip-space and (for pre’s and p’s) xsl:preserve-space. Unfortunately, Stylus Studio 2006 did not support this functionality at all, and it was this key point that forced me into using Xalan.
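For readers unfamiliar with those two instructions: xsl:strip-space removes text nodes that contain only whitespace from the named elements, and xsl:preserve-space exempts the elements it names. The Python below is a rough analogue of that behaviour, written only to show the effect; it is not my stylesheet, and the PRESERVE set simply mirrors the pre and p elements mentioned above.

```python
import xml.etree.ElementTree as ET

PRESERVE = {"pre", "p"}  # elements whose whitespace-only text is kept

def strip_space(elem):
    """Drop whitespace-only text nodes, except directly inside PRESERVE elements."""
    keep = elem.tag in PRESERVE
    if not keep and elem.text is not None and not elem.text.strip():
        elem.text = None
    for child in elem:
        strip_space(child)
        # A child's tail is a text node belonging to this (parent) element.
        if not keep and child.tail is not None and not child.tail.strip():
            child.tail = None
    return elem
```

The indentation between block elements disappears, while the single spaces between inline elements inside a paragraph survive, which is what InDesign needed to see.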
As it was also preferable to be able to see at least a semblance of structure in the intermediate (converted) xml file, it also meant that a lot of the functionality in the XSLT stylesheet controlled when newlines were inserted into the converted file.
Well, it relied a lot on manual efforts, but I think I did a reasonable job of layout considering the confines of the applications and processes used, and also I would now generally acknowledge that layout of larger documents with lots of different graphics (that do not easily fall into a repeatable pre-definable layout) is generally a time-consuming business. The fact that the process is not repeatable (should large tracts of a chapter be changed, for example) without similar levels of manual intervention is a frustration; but this was clearly going to be the case after my initial investigations.
Anyway, it was an interesting project for me to start to learn XSLT (hence, I have not reproduced the file here because I fear it is poor quality), and I hope that the end result is of use to someone. I have offered the results to Why The Lucky Stiff, but as yet without response.