The dreaded redesign
My local public library has a survey up about its website. I took it. I was, shall we say, not complimentary (though I was constructive). My local public library's website requires at least four clicks just to get to a catalog search box.
I'm guessing your library's website isn't quite that bad. Still, one of these days you'll be reworking it, and that's a scary, scary project. How do you even get started?
This post isn't about the social process of reworking a website—the focus groups, the committee meetings, the CMS RFPs, and all that. This post is about how you figure out what you've got, and what you do with it to get it ready for the next steps toward a managed, standards-compliant, accessible redesign.
Unmanaged websites, as many library websites are, "just grow" to an astonishing number of pages. Many of the pages will look different from each other, sometimes wildly different. Many will be obsolete, in content or code or appearance or all of these. Some will be undiscovered gems that deserve inclusion or even more prominent placement in the new site. How do you know? How do you even find everything? You may not have FTP access to the site!
You can click links and do Save Page As… until your index finger wears out, but that's the hard way. Get a web crawler to collect everything for you. That's the easy way. I like HTTrack for Windows users; on my Mac, I use the wizzy and very cool SiteOrbiter. (Feel free to suggest others in the comments.)
You'll have to dig about in the preferences a bit to make sure the crawler doesn't suck down offsite links, but other than that, you can turn it loose and it will obediently recreate your library's site on your hard drive.
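If you're curious what that "stay on my own site" preference amounts to under the hood, here is a minimal sketch in Python. It is not how HTTrack or SiteOrbiter actually work; the function name onsite_links and the host library.example are made up for illustration. It pulls the links out of one page and keeps only the ones that point back at the same host.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def onsite_links(page_url, html, site_host):
    """Return absolute, fragment-free links that stay on site_host.

    Hypothetical helper: a real crawler would also fetch each link,
    respect robots.txt, and remember which pages it has already seen.
    """
    collector = LinkCollector()
    collector.feed(html)
    kept = []
    for href in collector.hrefs:
        # Resolve relative links against the page, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.hostname == site_host:
            if absolute not in kept:
                kept.append(absolute)
    return kept
```

A crawler repeats this for every page it fetches, following only the links that come back, which is why offsite pages never land on your hard drive.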
Next step? Weeding. Just like collection weeding. Go through every single page and wipe out the ones you know you don't want any more. Don't worry about dead links. Don't worry about reorganization. Just kill obsolete and not-useful information. Also kill "link-farm" pages, pages that exist only to link to other pages on your site. That's site organization, information architecture, and you'll deal with that much later in the redesign process. (I would be tempted to kill link-farms to external sites also, but that's a judgment call you'll have to make for yourself. Just keep in mind that static pages are probably the worst way to manage linklists; they get stale quickly and they're annoying to re-edit. Use a wiki, bookmarking service, or database-based web application instead!)
With any luck, the weeding process will leave you with a pile of pages that almost starts to look manageable. Make a list of the remaining pages with brief descriptions of what's on them; this will come in useful for card-sorts and other information-architecture techniques that you will use to put your new site together.
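You can get a head start on that list with a few lines of script. The following is a rough sketch, assuming your weeded mirror lives in a folder on disk and that most pages have a title element; the function names and the CSV layout are my own invention. It writes one row per page with an empty description column for you to fill in by hand.

```python
import csv
import re
from pathlib import Path

# Good enough for a quick inventory; not a full HTML parser.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html):
    """Return the <title> text with whitespace collapsed, or '' if none."""
    match = TITLE_RE.search(html)
    return " ".join(match.group(1).split()) if match else ""


def write_inventory(mirror_dir, csv_path):
    """Write one CSV row per HTML file: relative path, title, blank description."""
    mirror = Path(mirror_dir)
    pages = sorted(
        p for p in mirror.rglob("*")
        if p.is_file() and p.suffix.lower() in (".html", ".htm")
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "title", "description"])
        for page in pages:
            # Old pages come in odd encodings; replace bad bytes, don't crash.
            html = page.read_text(encoding="utf-8", errors="replace")
            writer.writerow([page.relative_to(mirror).as_posix(), page_title(html), ""])
    return len(pages)
```

Open the resulting file in a spreadsheet and the description column becomes your working list for card-sorts.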
Next, you want to strip as much design as possible out of these pages, leaving only pure, sweet information with a sprinkling of powdered HTML markup. Don't do this in Dreamweaver or FrontPage. Start out with some permutation of HTML Tidy (try out a web-service version to see what the fuss is about) to eliminate the worst problems, and then work in a text editor.
Be ruthless. Layout tables, gone. Font tags, gone. Pretty imagemaps, gone. Colors, gone. "Navigation," gone. Javascript, gone. CSS (if any!), gone. Even "structural" divs and spans (if you've had a really enlightened web designer all these years) can likely go. All of this is going to change in the redesign, so start fresh! You want to end up with headings, paragraphs, lists, blockquotes, informational images, maybe a data-table or two—and that's all.
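If you have more pages than patience, part of this can be scripted. Here is a minimal sketch using Python's built-in HTML parser. The whitelist is just my reading of the list above, and names like strip_design are made up for the example. It keeps headings, paragraphs, lists, blockquotes, links, and images, throws away every other tag and attribute, and drops scripts and stylesheets along with their contents. It also discards table markup entirely (the cell text survives), so any genuine data-table would need restoring by hand.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: one possible whitelist; adjust it for your own site.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li",
        "dl", "dt", "dd", "blockquote", "a", "img"}
KEEP_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
DROP_WITH_CONTENT = {"script", "style"}
VOID = {"img"}  # kept tags that never take a closing tag


class Stripper(HTMLParser):
    """Re-emit only whitelisted tags, their allowed attributes, and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            allowed = KEEP_ATTRS.get(tag, set())
            kept = "".join(
                f' {name}="{escape(value or "")}"'
                for name, value in attrs if name in allowed
            )
            self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and tag not in VOID and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_design(html):
    """Return html reduced to whitelisted structural markup and text."""
    parser = Stripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Treat the output as a first pass, not a finished page: you will still want to eyeball each result in a text editor, since a script can't tell an informational image from a decorative one.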
Yes, it all looks horrible when you're done—but it'll slide smoothly into your new site later, and that's what counts.
Future posts on this theme will talk about basic information-architecture tools and techniques, and perhaps one of my TechEssence colleagues will tackle content-management systems since I am in no way competent to. Please ask any questions you have in the comments!

Great article on a great topic, thanks. I look forward to the future installments.
This is a great article. I’m really looking forward to the promised future posts about information-architecture tools and techniques. I find web site management to be one of the most challenging aspects of my job and really feel as if these types of discussions help me become better at it. Thanks for taking the time to write about this, Dorothea!!
If so little of the HTML is being kept, it might be quicker to use a method that extracts just the text of the HTML documents in the first place. I'm a heavy command-line user, so I'm not sure what tools exist in the GUI world for doing what I have done in these situations.
1) mirror webpages (via wget, HTTrack, SiteOrbiter or some similar tool)
2) run a quick shell script that calls lynx with its -dump option to write each page out as plain text
(I'd put something up like the real code I'd probably use, but people might start screaming. If enough people are interested I'll post it somewhere else and provide a link)
The drawback to these methods is that if someone consistently used some visual method to "identify" semantic objects, you'll lose it. Printing out the pages from Firefox might help there. Of course, I've seen the worst-case scenario, where every line in a great number of the pages was in a different font for no apparent reason. That's a great case for the "text-dump" method. From the text you can figure out whether it should be normal text, warnings, notifications, events, etc.
Rambling again, back to work.
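For anyone wondering what the text-dump step in the comment above might look like, here is a rough sketch. It is not the commenter's actual script, which was never posted here; it is written in Python rather than shell, the function names are made up, and it assumes lynx is installed and that the mirror is already on disk. It runs lynx -dump over each HTML file and writes the result to a parallel tree of .txt files. The -nolist flag leaves off the numbered list of links that lynx otherwise appends to a dump.

```python
import shutil
import subprocess
from pathlib import Path


def text_path(page, mirror, out):
    """Map mirror/about/hours.html to out/about/hours.txt."""
    return (out / page.relative_to(mirror)).with_suffix(".txt")


def dump_text(mirror_dir, out_dir):
    """Run `lynx -dump` on every .html/.htm file under mirror_dir."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed or not on PATH")
    mirror, out = Path(mirror_dir).resolve(), Path(out_dir).resolve()
    written = []
    for page in sorted(mirror.rglob("*")):
        if not page.is_file() or page.suffix.lower() not in (".html", ".htm"):
            continue
        target = text_path(page, mirror, out)
        target.parent.mkdir(parents=True, exist_ok=True)
        # lynx renders the page as plain text on standard output.
        result = subprocess.run(
            ["lynx", "-dump", "-nolist", str(page)],
            capture_output=True, text=True, check=True,
        )
        target.write_text(result.stdout, encoding="utf-8")
        written.append(target)
    return written
```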
Another way to accomplish roughly the same end is to do a regular-expression search that kills all the markup, something like (extreme example) searching on
<[^>]+>
and replacing with nothing. That'll kill your links too, though. It's possible to route around that, but maybe not much fun.
If the amount of HTML in a page is completely disproportionate to the amount of actual text, though, a text-dump would certainly be a better option!
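Here is that search-and-replace as a small Python sketch, plus one possible way to route around the link problem: a negative lookahead that leaves a and /a tags alone. The function names are made up for the example, and like any regular-expression approach to HTML it will stumble on comments, on script contents, and on attribute values that contain a > character, so check the results by eye.

```python
import re

# The extreme version: every tag goes, links included.
ALL_TAGS = re.compile(r"<[^>]+>")

# Same pattern, but skip anything that starts with <a or </a as a whole
# tag name (so <abbr> is still removed).
ALL_BUT_LINKS = re.compile(r"<(?!/?a\b)[^>]+>", re.IGNORECASE)


def strip_all(html):
    """Remove every tag, leaving only the text between them."""
    return ALL_TAGS.sub("", html)


def strip_keep_links(html):
    """Remove every tag except opening and closing <a> tags."""
    return ALL_BUT_LINKS.sub("", html)
```

Neither version touches character entities such as &amp;amp;, so those stay in the text exactly as they were.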
Thank you!
You're most welcome. Now I just have to find my copy of Rosenfeld and Morville's polar-bear book, which has unaccountably disappeared from my desk...