Proprietary Binary Data Formats: Just Say No!

I shall NEVER keep my information in a proprietary binary format (such as Word, WordPerfect, Excel etc.). All my personal files are kept in plain text format. So please do not ask for my resume in Word format - such requests will be silently ignored.

On the other hand, if others (such as a client or an employer) want, I am willing to access their data in the format they want, if they provide me with the appropriate tools.

Why Not?

People create files for future retrieval and reuse. E.g., today you might write a paper, print it tomorrow, e-mail it to a friend the next day and incorporate his remarks the next week; later you might want to retrieve your original text without his changes. If your data is saved in a proprietary binary format, these simple tasks are complicated beyond your wildest imagination:

Printing:
Your proprietary tool has to include a driver for each printer you might want to use. What if the tool is withdrawn from the market? What about that latest and greatest printer you wanted to buy? PostScript, understood by all modern printers, is here to save your butt.
E-mail:
You will have to send the paper as an attachment, so your friend has to have a mailer which understands attachments (most mailers do). Now, if he wants to search all his e-mail for some phrase, he will have to start an application for each attachment - a hefty price to pay for a small search! He would also have to own a tool which understands your particular idiosyncratic proprietary binary format. What if his software version is different from yours? Do you realize that the binary file you send him contains much more than what your application shows you on the screen (as in MS Word and PDF files)?
Change Management:
The standard diff, patch and version control utilities (like CVS and RCS) do not work well with binary files. You are stuck with the application-specific functionality. Good luck there!
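
To make the change-management point concrete, here is a minimal sketch in Python of what a line-based diff gives you when the paper is plain text. The file labels and the two sample revisions are made up for illustration; the only library used is the standard difflib module.

```python
import difflib


def text_diff(old, new):
    """Return a unified diff (a list of lines) between two plain-text revisions."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="paper.txt (mine)",      # hypothetical label
            tofile="paper.txt (friend)",      # hypothetical label
            lineterm="",
        )
    )


# Two made-up revisions of the same paper.
original = "Binary formats are bad.\nPlain text is good.\n"
revised = "Binary formats are bad.\nPlain text is great.\n"

diff_lines = text_diff(original, revised)
print("\n".join(diff_lines))
```

The output marks exactly which line your friend changed, so keeping or discarding his remarks is a matter of applying or ignoring the diff. The same comparison on two binary files would, at best, tell you that they differ.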

Finally, who do you think will be able to read your Word97 files in 10 years? The computer architecture on which your current Word binary runs will be obsolete, today's computer will have broken down, and you will never be able to recover the information stored in your files.

Proprietary data formats DESTROY your information.

Additionally, binary formats stifle competition because a newcomer has to parse all those existing formats before he can hope to gain market share. This is why we have so many binary formats: Microsoft et al. change them all the time to preserve the barrier to entry.

What is under the hood?

Although you cannot fix your car, and probably do not have a very clear idea of how it works, it is still important to you that you can open the hood, look at the engine and maybe, with some help from a professional, find out what is wrong. The fact that the hood can be opened at any time is the best possible guarantee against fraud: it does not even occur to you that someone might have put something there that you do not want. If your car dealer or manufacturer had put something bad there - like an illegal drug dispenser or a radio transmitter that lets someone track your location - you can be confident that some mechanic or just a techno-geek would notice it (this is purely imaginary, of course, but it is imaginary precisely because your hood is not sealed by the manufacturer). This is why the world needs reverse engineers.


I use plain text-based formats for all my needs.

word processing:
TeX and LaTeX are your friends. Use LyX and switch from the layout-oriented world of traditional word processors to the content-oriented world of document (or text) processing. Your files are plain text, so all the usual tools will work with them.
personal contact database:
I use BBDB - a rolodex for GNU Emacs. It keeps the data in, you guessed it, text format.
personal finance:
I use Emacs again, but you might prefer GNUCash, Gnumeric, Check Book Balancer or XInvest.
web bookmarks:
Netscape keeps your bookmarks in a plain text HTML file.
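
Because such files are plain text, a few lines of script are enough to get the data back out without the application that wrote it. The sketch below, in Python, pulls titles and URLs out of a bookmark-style HTML fragment using the standard html.parser module. The sample fragment is a simplified, made-up imitation of a bookmarks file (with placeholder URLs), and the names BookmarkParser and find_bookmarks are hypothetical.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (title, url) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._href = None    # URL of the link we are currently inside, if any
        self._title = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.bookmarks.append(("".join(self._title).strip(), self._href))
            self._href = None


# Made-up sample; a real bookmarks file carries more attributes and nesting.
SAMPLE = """<DL><p>
<DT><A HREF="http://example.org/gnu/">GNU tools</A>
<DT><A HREF="http://example.org/xml/">XML notes</A>
</DL><p>
"""


def find_bookmarks(html_text, phrase):
    """Return the bookmarks whose title contains phrase (case-insensitive)."""
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    return [(t, u) for t, u in parser.bookmarks if phrase.lower() in t.lower()]


matches = find_bookmarks(SAMPLE, "xml")
print(matches)
```

Nothing here depends on the browser that wrote the file, which is the whole point: any tool that can read text can read your bookmarks.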

The long-term universal solution is XML - the Extensible Markup Language. It allows describing document types in a universal way, so that one does not have to write a new parser every time Microsoft changes the Word format. To describe a document type in XML, one has to specify two things:

Document Type Definition
This specifies the possible logical elements of a document, things like title, list, table etc. E.g., DocBook is one such document type. Another example is XHTML, which is a reformulation of HTML 4 in XML 1.0. Actually, this very document is a valid XHTML document!
Style Sheet
This specifies the physical layout of the document type, i.e., the way its elements are displayed or printed. Cascading Style Sheets offer a standard mechanism for that. XSL DocBook Stylesheets and Modular DSSSL DocBook Stylesheets are examples for DocBook.

Both the DTD and the style sheets are provided by the document type designer, i.e., your vendor, and made publicly available. Your document is kept in plain text format (with XML markup), so you can view and edit it with any tool you like. This way the software vendors will be making money on the quality of their software, not the obscurity of their data formats.

The fact that the logical structure is kept separate from the physical presentation is one of the most attractive features of this model.
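
A small sketch of that separation, in Python: one XML document is read with the standard xml.etree.ElementTree parser and then presented in two different ways. The element names (article, title, para) are invented for the example and do not come from DocBook or any real DTD, the two render functions merely stand in for style sheets, and ElementTree does not validate against a DTD.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up document type: an article with a title and paragraphs.
DOCUMENT = """<article>
  <title>Just Say No</title>
  <para>Keep your data in plain text.</para>
  <para>Let the style sheet decide how it looks.</para>
</article>"""


def render_text(root):
    """One presentation: an underlined upper-case title, then the paragraphs."""
    title = root.findtext("title", default="")
    lines = [title.upper(), "=" * len(title)]
    lines += [p.text or "" for p in root.findall("para")]
    return "\n".join(lines)


def render_html(root):
    """Another presentation of the same logical structure, as HTML."""
    parts = ["<h1>%s</h1>" % escape(root.findtext("title", default=""))]
    parts += ["<p>%s</p>" % escape(p.text or "") for p in root.findall("para")]
    return "\n".join(parts)


root = ET.fromstring(DOCUMENT)
text_view = render_text(root)
html_view = render_html(root)
print(text_view)
print(html_view)
```

The document itself is never touched when the presentation changes; only the rendering code (in real life, the style sheet) does.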


Sam Steingold
created: 2000-02-07