The PDF Problem

The latest Alert Box article tells us the evils of PDF files on the web. They are annoying, they need another app to open them, and, my favorite that isn't even mentioned, many PDF files can choke a PostScript printer. That being said, I post a ton of PDF files on the EFF site. Why?

I have lots of reasons. The easy ones are PDFs that other people have scanned and mailed to us, or things that we received as faxes. Also, when we get paper copies of documents, the OCR scan usually sucks rocks and isn't reliable. The other reason does get addressed in Alert Box:

Ideally, companies would reformat each type of information for online use. It's actually not very expensive to, say, create a set of Web pages for annual report information as long as the Web design is done while the annual report is being written. The cost comes when companies have a glossy annual report already finished and then say, "Webbify this."

Ideally. Ah yes, the mythical ideal world. I suppose this is the same ideal world where browsers support web standards. It's the world we all wish we lived in, and yet none of us live there. I agree that we should do what we can to get as close as we can to the ideal world without going crazy.

Lawyers work in word processing programs. They are never going to write legal documents in TeX or DocBook. It just won't happen. Ever. Get over it. Microsoft Word will always spit out horrid HTML. It just will. Get over it. Webmasters will always be handed documents 10 minutes before they just have to be up on the website. Adobe makes a really nice Word plug-in, and it creates very nice PDF files. They look right, they have hyperlinks if the Word document had hyperlinks, they are small, and most of the time they even print right.

So how do I handle PDF files? Well, as Alert Box says, we have a page that links to the PDF. For PDF files that also have a text layer (i.e., typically not OCR scans, but ones saved out from the original electronic document), I use pdftotext to make a text file. It's quick and dirty and fits right into the 10-minute window that I typically have to get stuff up on the site.
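For the curious, here is a rough sketch of what that step could look like as a script. This is not the exact tool I run, just one way to wrap it. It assumes pdftotext is on your PATH, and the function names, the -layout flag, and the 20-character cutoff for "this PDF has a real text layer" are all my own illustrative choices.

```python
import shutil
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def build_command(pdf_path: Path, txt_path: Path) -> List[str]:
    """Build the pdftotext call: pdftotext -layout input.pdf output.txt."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def looks_like_text_layer(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): image-only scans yield almost no text."""
    visible = "".join(text.split())  # drop spaces, newlines, form feeds
    return len(visible) >= min_chars


def convert(pdf_path: Path) -> Optional[Path]:
    """Write a .txt next to the PDF; return None if no usable text came out."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    txt_path = pdf_path.with_suffix(".txt")
    subprocess.run(build_command(pdf_path, txt_path), check=True)
    text = txt_path.read_text(encoding="utf-8", errors="replace")
    if not looks_like_text_layer(text):
        txt_path.unlink()  # nothing worth posting, so clean up
        return None
    return txt_path


if __name__ == "__main__":
    for name in sys.argv[1:]:
        result = convert(Path(name))
        print(f"{name}: {result if result else 'no text layer, skipped'}")
```

Run it with the PDFs as arguments and it leaves a .txt beside each one that has real text in it. Scanned or faxed documents get skipped, so those still only get the link to the PDF.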

As with any "tip," you need to put it into perspective of your own site and the needs of your users.

2 Comments

quaid said:

What you describe is definitely accurate, especially as things stand at this moment. However ...

Okay, so we know that the current beta Office versions of Word, Excel and PowerPoint all do their magic in XML, and we know that Microsoft is going to add all kinds of funkiness in there somehow to extend the eXtensibility without actually publishing the details.

But OpenOffice.org (and the associated StarOffice) now use XML for their native format; it's a set of XML files, stylesheet particulars and images, packed up in a zip archive. Wowzers!

Right now you can use the Mozilla Composer for WYSIWYG HTML editing that is actually good. And it doesn't throw in all sorts of secret formula tricks and custom tags about Big Software Company.

What we all forget in these days of monopolistic practices as the norm (tacitly approved by some DoJ personae), bad software decisions implemented en masse, proprietary formats and "extensions" of standards, no standard adherence, and so on, is that 10 years ago it was all different ... and yet the same. In 10 years, it will be a new set of companies, products, problems and practices. Some of the brand names may be the same, but the culture inside and outside of those companies will be vastly different.

So, I'm just saying, keep on pluggin' at it. The dike is bursting and all these little girly-man software makers can no longer hold the waters from gushing in. No, sir!

Most of the lawyers I know are so hooked into Word it will take more than that to pry them loose. Anything, and I do mean anything, that adds work to their already punishing schedule will be shouted down faster than you can say DocBook.

About this Entry

This page contains a single entry by Patrick published on July 30, 2003 11:15 PM.
