An Experiment in Transitioning to Open Document Formats

Recently I read an interesting article by Vint Cerf, mostly known as the man behind the TCP/IP protocol that underpins modern Internet communication, where he brought up a very scary problem with everything going digital. I’ll quote from the article (Cerf sees a problem: Today’s digital data could be gone tomorrow – posted June 4, 2013) to explain:

One of the computer scientists who turned on the Internet in 1983, Vinton Cerf, is concerned that much of the data created since then, and for years still to come, will be lost to time.

Cerf warned that digital things created today — spreadsheets, documents, presentations as well as mountains of scientific data — won’t be readable in the years and centuries ahead.

Cerf illustrated the problem in a simple way. He runs Microsoft Office 2011 on Macintosh, but it cannot read a 1997 PowerPoint file. “It doesn’t know what it is,” he said.

“I’m not blaming Microsoft,” said Cerf, who is Google’s vice president and chief Internet evangelist. “What I’m saying is that backward compatibility is very hard to preserve over very long periods of time.”

The data objects are only meaningful if the application software is available to interpret them, Cerf said. “We won’t lose the disk, but we may lose the ability to understand the disk.”

This is a well known problem for anyone who has used a computer for quite some time. Occasionally you’ll get sent a file that you simply can’t open because the modern application you now run has ‘lost’ the ability to read the format created by the (now) ‘ancient’ application. But beyond this minor inconvenience it also brings up the question of how future generations, specifically historians, will be able to look back on our time and make any sense of it. We’ve benefited greatly in the past by having mediums that allow us a more or less easy interpretation of written text and art. Newspaper clippings, personal diaries, heck even cave drawings are all relatively easy to translate and interpret when compared to unknown, seemingly random, digital content. That isn’t to say it is an impossible task, it is however one that has (perceivably) little market value (relatively speaking at least) and thus would likely be de-emphasized or underfunded.

A Solution?

So what can we do to avoid these long-term problems? Realistically probably nothing. I hate to sound so down about it but at some point all technology will yet again make its next leap forward and likely render our current formats completely obsolete (again) in the process. The only thing we can do today that will likely have a meaningful impact that far into the future is to make use of very well documented and open standards. That means transitioning away from so-called binary formats, like .doc and .xls, and embracing the newer open standards meant to replace them. By doing so we can ensure large scale compliance (today) and work toward a sort of saturation effect wherein the likelihood of a complete ‘loss’ of ability to interpret our current formats decreases. This solution isn’t just a nice pie in the sky pipe dream for hippies either. Many large multinational organizations, governments, scientific and statistical groups and individuals are also all beginning to recognize this same issue and many have begun to take action to counteract it.

Enter OpenDocument/Office Open XML

Back in 2005 the Organization for the Advancement of Structured Information Standards (OASIS) created a technical committee to help develop a completely transparent and open standardized document format the end result of which would be the OpenDocument standard. This standard has gone on to be the default file format in most open source applications (such as LibreOffice, OpenOffice.org, Calligra Suite, etc.) and has seen wide spread adoption by many groups and applications (like Microsoft Office). According to Wikipedia the OpenDocument is supported and promoted by over 600 companies and organizations (including Apple, Adobe, Google, IBM, Intel, Microsoft, Novell, Red Hat, Oracle, Wikimedia Foundation, etc.) and is currently the mandatory standard for all NATO members. It is also the default format (or at least a supported format) by more than 25 different countries and many more regions and cities.

Not to be outdone, and potentially lose their position as the dominant office document format creator, Microsoft introduced a somewhat competing format called Office Open XML in 2006. There is much in common between these two formats, both being based on XML and structured as a collection of files within a ZIP container. However they do differ enough that they are 1) not interoperable and 2) that software written to import/export one format cannot be easily made to support the other. While OOXML too is an open standard there have been some concerns about just how open it actually is. For instance take these (completely biased) comparisons done by the OpenDocument Fellowship: Part I / Part II. Wikipedia (Open Office XML – from June 9, 2013) elaborates in saying:

Starting with Microsoft Office 2007, the Office Open XML file formats have become the default file format of Microsoft Office. However, due to the changes introduced in the Office Open XML standard, Office 2007 is not entirely in compliance with ISO/IEC 29500:2008. Microsoft Office 2010 includes support for the ISO/IEC 29500:2008 compliant version of Office Open XML, but it can only save documents conforming to the transitional schemas of the specification, not the strict schemas.

It is important to note that OpenDocument is not without its own set of issues, however its (continuing) standardization process is far more transparent. In practice I will say that (at least as of the time of writing this article) only Microsoft Office 2007 and 2010 can consistently edit and display OOXML documents without issue, whereas most other applications (like LibreOffice and OpenOffice) have a much better time handling OpenDocument. The flip side of which is while Microsoft Office can open and save to OpenDocument format it constantly lags behind the official standard in feature compliance. Without sounding too conspiratorial this is likely due to Microsoft wishing to show how much ‘better’ its standard is in comparison. That said with the forthcoming 2013 version Microsoft is set to drastically improve its compatibility with OpenDocument so the overall situation should get better with time.

Current day however I think, technologically, both standards are now on more or less equal footing. Initially both standards had issues and were lacking some features however both have since evolved to cover 99% of what’s needed in a document format.

What to do?

As discussed above there are two different, some would argue, competing open standards for the replacement of the old closed formats. Ten years ago I would have said that the choice between the two is simple: Office Open XML all the way. However the landscape of computing has changed drastically in the last decade and will likely continue to diversify in the coming one. Cell phone sales have superseded computers and while Microsoft Windows is still the market leader on PCs, alternative operating systems like Apple’s Mac OS X and Linux have been gaining ground. Then you have the new cloud computing contenders like Google’s Google Docs which let you view and edit documents right within a web browser making the operating system irrelevant. All of this heterogeneity has thrown a curve ball into how standards are established and being completely interoperable is now key – you can’t just be the market leader on PCs and expect everyone else to follow your lead anymore. I don’t want to be limited in where I can use my documents, I want them to work on my PC (running Windows 7), my laptop (running Ubuntu 12.04), my cellphone (running iOS 5) and my tablet (running Android 4.2). It is because of these reasons that for me the conclusion, in an ideal world, is OpenDocument. For others the choice may very well be Office Open XML and that’s fine too – both attempt to solve the same problem and a little market competition may end up being beneficial in the short term.

Is it possible to transition to OpenDocument?

This is the tricky part of the conversation. Lets say you want to jump 100% over to OpenDocument… how do you do so? Converting between the different formats, like the old .doc or even the newer Office Open XML .docx, and OpenDocument’s .odt is far from problem free. For most things the conversion process should be as simple as opening the current format document and re-saving it as OpenDocument – there are even wizards that will automate this process for you on a large number of documents. In my experience however things are almost never quite as simple as that. From what I’ve seen any document that has a bulleted list ends up being converted with far from perfect accuracy. I’ve come close to re-creating the original formatting manually, making heavy use of custom styles in the process, but its still not a fun or straightforward task – perhaps in these situations continuing to use Microsoft formatting, via Office Open XML, is the best solution.

If however you are starting fresh or just converting simple documents with little formatting there is no reason why you couldn’t make the jump to OpenDocument. For me personally I’m going to attempt to convert my existing .doc documents to OpenDocument (if possible) or Office Open XML (where there are formatting issues). By the end I should be using exclusively open formats which is a good thing.

I’ll write a follow up post on my successes or any issues encountered if I think it warrants it. In the meantime I’m curious as to the success others have had with a process like this. If you have any comments or insight into how to make a transition like this go more smoothly I’d love to hear it. Leave a comment below.

This post originally appeared on my personal website here.



2 Comments

  1. Great and lengthy article Tyler. I’ve got a few additional thoughts and opinions about the entire open document format.

    First, to provide my own subjective experience to contrast what was reported about Vint Cerf’s PowerPoint file reading issues, I’ve found Office 2011 for OS X is very good at handling documents created in previous revisions of the suite, including Office for Windows. In my regular job I run OS X 10.6 with Office 2011, and encounter documents from Office ’97 and 2000 more frequently than I’d like to admit. It used to be everyone would create highly technical documentation and knowledge transfer slides, and those live on in various enterprisey content management systems. The most recent issue I can remember is Word 2011 failing to populate certain drop-down fields from a Word 2012 document, and that could be due to a wide variety of problems or way that the source file was created. Generally, I’ve found that reading Windows Office documents on OS X to be no big deal though.

    Having said that, the problem of version incompatibility is not something to shrug off entirely either as a WORKSFORME. Going the other way – reading a “created in” OS X Office document on a Windows Office version can be a crapshoot at best. Like you alluded to, bulletted lists are dicey when converting to an open format, but I can’t even trust that bullets won’t appear in different places and have different tab-stop settings when viewing that OSX Office document on Windows. This has caused me significant personal grief, especially when preparing a resume or legal form for submission.

    Speaking of resumes, that’s probably one of the most high-profile drawbacks to anything that doesn’t “just open” in a relatively recent version of Word. Send a typical recruiter or HR contact for a large corporation an “.odt” and they’ll come back with “we couldn’t read your file.” PDF is starting to be much more widely accepted, but you’re still likely to get explicit requests for a .doc or .docx.

    So, what have I personally done for documents lately? For most business tasks, I just save in the default .docx format that Word provides. If there’s any uncertainty as to the recipient, it goes back to a .doc. Anything that is non-confidential goes into Google Docs (Drive?) and comes out either as a printed version, shared with others when appropriate, or rendered to PDF as necessary.

    Anyway, this should also constitute my offer to run test cases through Office 2013 to see if compatibility or standards compliance is improved.

  2. @Jake B

    My thoughts exactly. It’s sad when PDF becomes the universal standard for making things display consistently (although I guess that is what it was designed for…).

    As for Office 2013 its major compatibility changes should be the ability to open version 1.2 of ODF in addition to formatting fixes across the board. Although after converting a bunch of documents between two Microsoft formats, .doc and .docx, and seeing formatting differences I’m not holding my breath for too much improvement.

Leave a Reply

Your email address will not be published.


*