The Cole Papers

Turning those stacks of papers into electronic news archives

Technology has never been the pal to librarians that it has been to the rest of the newsroom, because it has never saved them as much work.

We've gotten our libraries more or less off paper, onto microfilm and microfiche, which saved space. Did it save anything else? Ask your librarian (but bring a cup of coffee because you'll be there a while).

Then along came the electronic library, and things started to look up -- librarians suddenly found themselves out of the clipping business and into the electronic storage business.

The markers wound up being "enhancers," and once storage was electronic, reporters would do their own searching, freeing up the librarians (who now were busy enhancing).

This actually has happened at a lot of newspapers, more or less. It is progress, and about time.

But what about the old, pre-electronic clips, the flat files, the microfilm? In the trade, these are known as "legacy" archives.

Most papers have a) stored them off-site or b) let them pile up as the world's most historic fire hazard.

A few, like Florida's Fort Lauderdale Sun-Sentinel, have migrated the clips to CD-ROM, with searchable keywords. The Sun-Sentinel is finally digging its way through its 2 million-clip archive; after five years and God knows how many staff and temps, it's within a mere 500,000 clips of the goal.

Suppliers had yet to come up with a way to index these archives that would allow full-text searching. However, we've been watching some dazzling new technologies and we're here to tell you ... they still haven't.

But they're closing in on it.

So don't let the fire inspector cart those bundles off just yet; there's hope on the horizon.

Newsware
The most potent system seems to be Newsware, from Iota Industries Ltd. It starts with a TIFF image of a page, which you get either from your pagination system or by scanning old newspapers, whether on paper or microfilm.

Iota's Smart Image Technology then indexes every word for full-text searching.

The nifty part? When a search stops at a citation, you find it displayed as it was on the newspaper page, pictures, layout, color -- the works.

As of this year, according to Eli Israeli, Iota president, you can even see it in color.

"We just showed the color version at Seybold in San Francisco this year," said Israeli. "It allows you to search stories that have color tint blocks behind them, and it shows them to you in color, so we maintain the original look of the document."

(Careful, though -- if you scanned in only black-and-white, that tint block would come up as black, he said, rendering all the text inside irretrievable.)
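To see why a tint block goes solid black, here is a minimal sketch of a bilevel (black-and-white) scan applied to one row of pixels. The gray values and the threshold of 128 are made-up illustrations, not figures from Iota or any scanner maker.

```python
def to_bilevel(row, threshold=128):
    """Turn grayscale pixels (0 = black, 255 = white) into pure black or white."""
    return [0 if px < threshold else 255 for px in row]

# Hypothetical values: text ink = 20, tint block = 110, bare paper = 250.
tinted_row = [110, 110, 20, 20, 110, 20, 110]  # text printed over a tint
plain_row = [250, 250, 20, 20, 250, 20, 250]   # same text on plain paper

print(to_bilevel(tinted_row))  # tint and ink both fall below the threshold
print(to_bilevel(plain_row))   # ink stays distinct from the paper
```

In the tinted row, ink and background land on the same side of the threshold, so nothing is left for an OCR engine to tell apart.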

Niftier yet, since Iota works from TIFF files, text on a map or chart can be searched as easily as that in the news columns, Israeli claimed.

Searching maps and charts? This is too good to be true, particularly if you've ever had to translate electronic tabbed data from typesetting code into something readable by library systems.

Other nifties:

  • Fuzzy logic sharpens searching by "learning" what the searcher is after, based on prior searches.

  • Three OCR engines scan the same TIFF file. When they disagree about a letter, they "vote," resulting in higher accuracy.

  • HyperText Markup Language (HTML) is supported, meaning that Iota files will be able to go out on the World Wide Web. (Israeli promised this within "four or five months.")
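The OCR "vote" in the list above can be sketched in a few lines. This is only an illustration of majority voting, not Iota's actual method: it assumes the three readings are already lined up character for character (a real system would have to align them first), and the sample readings are invented.

```python
from collections import Counter

def vote(readings):
    """Majority vote, character by character, over equal-length OCR readings."""
    if len({len(r) for r in readings}) != 1:
        raise ValueError("this sketch assumes the readings are already aligned")
    result = []
    for chars in zip(*readings):
        winner, count = Counter(chars).most_common(1)[0]
        # Without a clear majority, fall back to the first engine's reading.
        result.append(winner if count > len(chars) / 2 else chars[0])
    return "".join(result)

# Hypothetical output from three engines reading the same word.
readings = ["archive", "arch1ve", "archivc"]
print(vote(readings))
```

Each engine here makes a different mistake, so the other two outvote it at every position.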

    Well, then, if Israeli's so smart, why isn't he rich? Why doesn't every newspaper have six of these?

    Here's part of the reason: Iota's technology was encased in a marketing agreement with Hyphen Inc., and was given a stealth introduction at NEXPO '94 in Las Vegas. Only a few people saw it before Israeli and cohorts hopped a plane out of the country.

    This low-key coming-out party now seems oddly prophetic, since Hyphen itself has since faded from the newspaper supplier scene, leaving Israeli without a marketing organization just when he was turning a demo into a product.

    Penalized half the distance to the goal, Israeli has had to develop smaller marketing channels, mostly in Europe, where his product's ability to run under all European languages (including Hebrew, of course) at least gets the salesman's foot in the door.

    Another part of the reason: Granted, it's a great idea, but how well does Newsware work? Nobody knows for sure, at least on this side of the Atlantic.

    Israeli points to the Palestine Post, which has scanned in 10 years of legacy documents, with the intention of going back 40 years (everything from the '70s onward is electronic).

    He also points to TV Guide of Radnor, Pa., which he claimed had been using Newsware for two months. A call to the magazine revealed that yes, the Newsware system has been in place for a couple of months, but for evaluation only. TV Guide has made no move to buy.

    And, Israeli says, he's been talking with that ubiquitous purveyor of microfilm, microfiche and CD-ROM, University Microfilms Inc. (UMI), about an agreement under which Iota's software would be teamed with hardware from a microfilm scanner manufacturer whose name he would not reveal.

    So, in effect, we're back to 1994, when we said Iota Industries was worth watching. Now it's even more inviting because, as Israeli said, "That was mostly an idea. Now we have a product."

    If he can sell it to a newspaper and it works as advertised, he'll likely have a profitable company as well.

    Adobe Acrobat
    By now the whole world knows of the mysteries of Adobe Acrobat's PDF (Portable Document Format); it's become its own standard because Adobe had the smarts to develop it for every computer platform known to humankind.

    Moreover, when the engineers were done programming, it didn't hurt to have Adobe's marketing clout behind it.

    The result is that if your application can output in PostScript, Acrobat can make a PDF file of it.

    Times Mirror's Newsday of Melville, N.Y., is starting PDF archiving of current issues, and even Boss Cole has cranked up the Acrobat Distiller and produced PDFs of back issues of this very newsletter for distribution on the World Wide Web.

    Acrobat has developed into something of a graphical Swiss Army pocket knife. Its files can be turned into TIFF or object-oriented files, put on CD-ROM products, or shipped through the ether by e-mail. Netscape is integrating it into the next version of its World Wide Web browser.

    And every PDF file's text elements can be full-text searched. Which brings us to Acrobat and legacy material.

    The indexing of every text word in a PDF file takes place during the distillation process, meaning that you have to start with a PostScript file. That's fine for today's stuff, but for legacy work, you need a scanner to produce a file that can be distilled.

    This is where Catalog and Capture, Adobe "helper" applications, come in. Capture scans the page and turns it into a PDF file, while Catalog indexes every word it reads in the PDF file.
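For readers wondering what "indexing every word" amounts to, here is a minimal sketch of a full-text index. It is a generic inverted index, not a description of how Catalog works internally, and the page names and text are invented for illustration.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")

def build_index(pages):
    """Map each lowercased word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages containing every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

# Hypothetical recognized text from two scanned pages.
pages = {
    "1995-11-01/A1": "City council approves library budget",
    "1995-11-01/B3": "Library adds microfilm scanner",
}
index = build_index(pages)
print(sorted(search(index, "library scanner")))
```

The index is built once, when the pages come in; each search afterward is a lookup rather than a re-read of every page.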

    "You can use the same archive and search techniques to go retrospectively, as well as forward," said Gary Cosimini, Adobe's business development manager for publishing, who added that ending up with the industry-standard PDF format is a "huge benefit."

    On the down side, Capture runs more as an adjunct to Acrobat, and it doesn't seem to have many of the industrial-strength advantages -- for example, color recognition -- Iota cites for its Newsware system. There are improvements planned for Capture, such as OLE (object linking and embedding) and plug-in support, Cosimini said, but it's hardly Adobe's crown jewel.

    Nonetheless, newspapers are gearing up to try it out. Emerge, a consulting outfit and service bureau in Madison, Wis., is working with textbook publishers and Madison's morning daily, the Wisconsin State Journal, to see how feasible legacy archiving can be.

    Andrew Young, president, and Kurt Foss, director of Internet and Acrobat training, have enough experience with the process to be realistic about it.

    "The hurdles are set fairly high at this time," said Foss. "There's no such thing as completely accurate OCR recognition. It's getting a lot easier, but it does require human intervention."

    For example, text stored on microfilm often has to be restored to 100 percent of its original size to be scanned fairly accurately.

    "The quality of the image makes a big difference -- at 100 percent and 400 dpi, it's pretty accurate," said Young. "By the time you get to the third or fourth generation of microfilm, it may not be too pretty."

    Again, time -- and experimentation -- will tell.

    "We think it offers a solution," Foss said. "Now we need to talk to librarians to really find out."

    -- John Bryan

    Adobe Systems Inc.,
    (415) 961-4400;
    Emerge Inc.,
    (608) 829-3454;
    Iota Industries Ltd.,
    (011) {972} 3-562 1363.

    Also see Real-world legacy issues

    From THE COLE PAPERS, November 1995, Copyright © 1995, All Rights Reserved.
