Archive for nwrunner

Possibly final notes on magazine scanning

I spent some time this past weekend experimenting with scanning settings and eventually got one full year of Northwest Runner magazine scanned. I chose 2003 to scan because I think this is a year for which digital copies exist – this meant if something went horribly wrong and I physically ruined an issue or two, I probably wouldn’t get my kneecaps bashed in at an upcoming Winter Grand Prix race series. I had already done extensive experimenting with my own copies of the magazine anyway, so that wasn’t too likely, but I wanted to play it safe.

Here are the key notes:

  1. Scanning in greyscale images+text at 300DPI is dramatically faster than scanning in color.  However, for recent issues this isn’t a great option: some pages are B&W but many are full color.
  2. Output image size for greyscale vs. color at 300DPI is pretty similar.  The only reason this might matter is because of the behavior I previously mentioned where the only way I can do this with our scanners at work is to have it email the output from the scanner to me – I can’t scan directly to a network share – and my mailserver rejects messages above a certain size.
  3. My mailserver seems to reject messages when they cross a threshold somewhere between ~12-15MB in size.  In practice, this means I can scan about 5 ledger-sized, double-sided sheets (10 scanned images, or about 20 magazine pages) at a time.
  4. It is important to separate the pages and invert the fold along the spine before sending through the auto-feeder.  I didn’t do this with one of the first magazines and I wound up with some paper jams, some slightly mangled pages (not really destroyed or anything, but like what you get with a printer auto-feed after something’s gotten jammed).  I mentioned I scanned the entire year of 2003 – the jams only happened in the first issue or two.  After I started this separating and fold inverting process, the pages did not get “stuck” along the spine and they all fed cleanly.
  5. Sometimes 5 sheets barely hits the “too big” threshold for scanning. If this happens, I need to do something like “scan 3, then scan 2.”  This is rare, but it happens.
  6. Some of the magazines are missing pages or have single pages torn out.  This screws up pagination and might make later post-processing / assembly into PDFs a pain (or I might just ignore it).
  7. The printers at work require me to log in and after some time they will log me out.  If I’m logged out, I need to re-enter the scan settings (2 sided, color images + text, scan as JPG not PDF, 300DPI).  This is tedious.  If I stay attentive during the scan process, I won’t get logged out: feed 5 sheets, wait for them to scan, put the next 5 sheets in the feed tray, wait for confirmation that the email was sent, then press “Scan” again.  This also ensures that the scanner (which is the critical path in this assembly line) is always “busy.”
  8. As this process is happening, I’m getting email after email with 10 attached images (scan01.jpg, scan02.jpg, etc. for both sides) that I need to pull out of my inbox and archive in folders.  Because the image names conflict (scan01.jpg will be the cover and also page 11 and page 21, etc.) I need to batch these up, too.  My post-processing jpg rotater, cutter, etc. script will handle these.
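The filename collisions in that last note could be handled with something like the following. This is only a minimal sketch, assuming the scanner emails have been saved out of the inbox as .eml files and are passed in scan order; the function name `save_batches` and the `batchNN_` prefix are made up for illustration and are not part of my actual script.

```python
from email import message_from_bytes, policy
from pathlib import Path


def save_batches(eml_paths, out_dir):
    """Save every attachment from each email under a batch-numbered name.

    eml_paths: saved .eml files, in the order the batches were scanned.
    Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for batch, eml_path in enumerate(eml_paths, start=1):
        msg = message_from_bytes(Path(eml_path).read_bytes(), policy=policy.default)
        for part in msg.iter_attachments():
            name = part.get_filename()
            if not name:
                continue  # skip unnamed parts
            # Prefix the batch number so scan01.jpg from email 1 and
            # scan01.jpg from email 2 end up as different files.
            target = out_dir / f"batch{batch:02d}_{Path(name).name}"
            target.write_bytes(part.get_payload(decode=True))
            saved.append(target)
    return saved
```

Called on the emails for one issue, this would leave files like batch01_scan01.jpg and batch02_scan01.jpg side by side, which keeps the batch order available for later renaming.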

That’s about it.  To scan the 2003 year of magazines took almost exactly 2 hours. During this time I am constantly busy with: de-stapling issues, preparing 5-sheet batches for the scanner (de-“sticking” the spine), running the scanner, adding/removing sheets from the feed tray, processing my inbox (which will fill up if I don’t pull the files out), reassembling scanned magazines, and trying to re-staple.  I think I can make this a little more efficient and bet I’ll trim a decent amount of time off that 2 hour baseline, but this process seems pretty close to optimized to do this job well and keep the original issues intact.

Now I just need to sync up with Martin (or really probably Bill Roe, who I think actually owns these issues) and confirm that they’re OK with me plowing ahead with all the back issues.


Next notes on scanning

After the initial research with scanning last week, I’ve concluded that trying to scan the back issues of Northwest Runner with my home scanners is probably a job I would never finish. At 1 minute per page and some non-trivial amount of post-processing (orienting all pages properly, assembling the PDFs), the initial time to scan is just more of my life than I’m willing to dedicate to this project. I found that my home printer/scanners do offer a document feed feature, though.  This works pretty well.  I can put a stack of documents in the feed tray, start the scanner function, indicating that the input documents are duplex with the moire suppression option on, click “go” and come back a half hour later, flip the scanned stack to get the other side and it’s pretty much done (and the postprocessing is slightly lower, too).

The problem with that is (for my printer) it requires an ~8 1/2 x 11″ input. I tried this with one of my own back issues of the magazine after taking the staples out and cutting it down the spine and the results were great!  Except for the original which I had cut in half.  This is no big deal to me, but apparently the guys who actually own these magazines I offered to scan and who have been involved in this sport for about as long as I’ve been alive are not exactly thrilled at the idea that I’ll destroy all their original issues.  Time for plan C…

This involves my scanners at work.  The printers at my work are Ricoh Aficio MP 5000’s and with *these* I can do scans of ledger (11×17) sized inputs with the document feed feature.  They automatically do duplex scanning (no flipping required) and they are very fast.  These take about 6 minutes to scan an entire issue, front and back.  This leaves me with my last problem – how to get the scanned files from the printer.

It seems the Ricoh offers two functions – both of which present some problems.

  • Scan and send to email – this is kind of OK.  It will be inconvenient to need to pull the attachments out of hundreds of emails, but I could deal with it.  The larger problem is that the generated bulk scan from an entire issue is apparently larger than my mailserver will allow.  So I scan the entire issue over 6 minutes only to then have the printer tell me “sorry – couldn’t deliver your document” at which point those scans seem to be lost and I just wasted that time.
  • Scan and store on network share – this would be great except that the interface to get these things to talk with a Windows network share is maddeningly hard to use, might just not work, and might need some administrative rights with the printer that I don’t have.  After much trial and error with this, I think that this option is closed to me.

So my likely path forward will be to scan half an issue at a time (or so – if that’s possible) and go do post processing on those.  To do this, I will need to remove the staples from the back issues and feed in half an issue at a time, but I think it will work and go pretty quickly.  One thing I didn’t mention is that even with this approach, it *seems* that there are characteristics of the scan job that need to be re-entered every single time I start a scan job (select input as color, set DPI, other settings, original orientation settings).  Each of these is slow and tedious to input on the Ricoh touch screen and I’m hoping I can simplify it, but it might be tolerable and this will still be dramatically faster than working with my home scanner.
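To get a feel for how big a batch can be, here is a rough back-of-the-envelope sketch. It assumes the mailserver counts the encoded message size and that the attachments are base64-encoded (which inflates them by about a third); both are assumptions I have not verified for my mailserver, and the byte counts in the usage example are hypothetical. MIME headers and line breaks are ignored, so the real ceiling would be slightly lower.

```python
def max_sheets_per_email(limit_bytes, image_bytes):
    """Estimate how many double-sided sheets fit under a mail size limit.

    limit_bytes: the size at which the mailserver rejects a message.
    image_bytes: typical size of one scanned JPG (one side of one sheet).
    """
    # base64 turns every 3 bytes into 4 characters (padded up).
    encoded = 4 * ((image_bytes + 2) // 3)
    images = limit_bytes // encoded
    # Each double-sided sheet produces two images.
    return images // 2


# Hypothetical numbers: a 15MB limit and 1.2MB per scanned side.
print(max_sheets_per_email(15_000_000, 1_200_000))
```

The main point is that the attachment sizes on disk understate what the mailserver sees, so any limit found by trial and error should be read with that overhead in mind.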

So – here are my next steps:

  1. Go back with a couple of my own copies of the magazine, do some trial and error to try to understand the maximum number of pages that can be scanned and emailed in one batch without my mailserver rejecting it and get more confident that the document feed will work smoothly / flawlessly before I send any of the originals through the feeder. This will include doing that for color inputs as well as greyscale (the oldest issues are greyscale, then a single color is added on some covers, then there are full color, glossy covers over greyscale pages, and current issues are glossy and full color from front to back).
  2. Start scanning the actual issues, probably starting from most current to oldest.  This way, if there are any problems with the process and I hurt some issues before I’m certain this is going perfectly, I have some time to correct the process before I get to the oldest issues.
  3. Start post-processing.  This may take a while.
    1. Probably use imagemagick, since I know some of its functionality
    2. Cut the scanned images in half – I’ll have ledger sized scans.
    3. Do some math to figure out page numbering.  If the cover is page 1 and an issue is 60 numbered pages long (back cover is page 60), I should have 15 input sheets and 30 scanned images (front + back).  I think my picture batches will be: 1+60, 2+59, 3+58, etc.  Also, if I have to do this in two batches, the scanned image numbering will restart with each batch, so a cutting and renaming script will need to take that into account, e.g. postprocess [yyyy-mm] [first_page] [last_page].
    4. Probably also do some image rotation magic
    5. The final output of this will be perfectly named, oriented, and numbered scans (e.g. 1998-12-p01, 1998-12-p02, 1998-12-p03, etc.)
    6. I could deliver those back to Northwest Runner (this is what I had volunteered to do) or I might do some additional post-processing to attempt to assemble them into searchable PDFs.
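The page-numbering math and the cut-in-half step could look roughly like this. It is a minimal sketch under several assumptions: sheets are fed outermost first, the scanner emits the front and then the back of each sheet, and each image is an upright two-page spread. The function names are hypothetical, and the commands are only built as argument lists for ImageMagick's `magick` command (older installs use `convert` instead), not run.

```python
def spread_pages(image_index, total_pages):
    """Return (left_page, right_page) for the image_index-th scan (1-based)."""
    if total_pages % 4 != 0:
        raise ValueError("a stapled issue should have a multiple of 4 pages")
    if not 1 <= image_index <= total_pages // 2:
        raise ValueError("image_index out of range")
    # The two pages on any one side always add up to total_pages + 1.
    low, high = image_index, total_pages + 1 - image_index
    # Odd images are sheet fronts (high page on the left, e.g. 60|1);
    # even images are sheet backs (low page on the left, e.g. 2|59).
    return (high, low) if image_index % 2 == 1 else (low, high)


def crop_commands(src, issue, image_index, total_pages):
    """Build two ImageMagick commands that cut one spread into named pages."""
    left, right = spread_pages(image_index, total_pages)
    commands = []
    for gravity, page in (("West", left), ("East", right)):
        out = f"{issue}-p{page:02d}.jpg"
        # Crop the left or right half of the image, then drop the
        # leftover canvas offset with +repage.
        commands.append(["magick", src, "-gravity", gravity,
                         "-crop", "50%x100%+0+0", "+repage", out])
    return commands


for cmd in crop_commands("scan01.jpg", "1998-12", 1, 60):
    print(" ".join(cmd))
```

For a second emailed batch, the image index just keeps counting from where the first batch stopped (e.g. the first image of a batch starting at sheet 6 is image 11). If the scans come out sideways, a `-rotate` step could be added before the crop.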

And that should just about ruin my summer!


Initial notes on scanning

So I’m starting to experiment with my scanning capabilities. I have two all-in-one printers. An old Canon MP530 and a newer Kodak ESP9250. I thought I would just use them both and cut my scanning time in half by swapping back and forth between them, but instead I’ve spent much of a lovely Memorial Day understanding their capabilities, what works, what doesn’t work and figuring out how I’ll actually scan 30 years of Northwest Runner magazine. Here’s what I’ve found.

First, the colors from the two scanners are very different. Using the default scanning characteristics – here are some samples from the cover of the December 1998 issue.

[Image: CanonVsKodak]

Obviously the picture is terrible – I’ll get to that in a minute – but the one on the left is the Canon and the one on the right the Kodak.  I ran a few more tests and the Canon gave me reliably more faithful looking scans of the original image than the Kodak, so I think I’m simply going to not use the Kodak.

Next – yeah, that image is terrible. How do I fix that?  That’s a moire pattern and it commonly happens with scanned images. The secret to fixing it is to set an option in the scanning software from the manufacturer to “descreen” the image and this basically eliminates the interference:

[Image: NoVsDescreen]

Great!  Now I have pretty acceptable looking scans.  At least I have the basics of what I expect.  I’d taken some other stats before on scan time and file size if images are saved as JPG.  Here they are:

Test                 Kodak          Canon
600dpi scan speed    35s / page     1:07 / page
300dpi scan speed    15.5s / page   18s / page
200dpi scan speed    6s / page      18s / page (again)
150dpi scan speed    5s / page      10.5s / page
600dpi file size     not measured   5MB
300dpi file size     not measured   1.2MB
200dpi file size     not measured   600KB
150dpi file size     not measured   300KB

Well that’s discouraging, but maybe not surprising. The Kodak is *dramatically* faster.  Making matters worse – the above measures are for the Canon scanner when the moire interference pattern is *not* suppressed.  With the interference pattern suppressed (which is really the only acceptable way to do this), the scan speed is >1 minute per page every time.
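To put that 1 minute per page in perspective, here is a quick estimate of the total scan time. This is only a rough sketch: the 30 years and the 1 minute per page come from my notes, but the 12 issues per year and 60 pages per issue are assumptions for illustration, and post-processing time is not included.

```python
def scan_hours(years, issues_per_year, pages_per_issue, seconds_per_page):
    """Total hours spent just waiting on the scanner."""
    total_pages = years * issues_per_year * pages_per_issue
    return total_pages * seconds_per_page / 3600


# 30 years of back issues at 1 minute per page.
# 12 issues/year and 60 pages/issue are assumed, not measured.
print(scan_hours(30, 12, 60, 60))  # 360.0
```

Since the real scan time with moire suppression is more than a minute per page, this is a lower bound under those assumptions.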

Finally, I wanted to decide on a scan DPI.  With the moire suppression enabled it doesn’t seem like I’m going to sacrifice any time on the project if I choose to go with a lower DPI, so all I need to do is figure out what would be acceptable. For archival purposes it seems like the only reasonable thing would be to go as high as possible but something tells me 600DPI (or higher, I think that I could do 1200) is just not really going to benefit anyone ever and it would almost definitely make this take up even more of my time (in terms of initial processing and any post-processing) so I am planning on 300DPI or lower.  To make the call on this, I noticed that the Canon software is capable of taking some input files and generating a searchable PDF. I can’t stand PDF as a format but there’s no denying that this would be cool and handy, so I don’t want to choose a scan option with low fidelity if it seems that I might one day sacrifice that ability.  A couple tests on this and it turns out that the generated PDFs I make of 200DPI input files sometimes cannot find input search strings that I enter for people’s names in race results that are very clearly words on the printed page but at 300DPI in a handful of tests I didn’t find any misses.  Therefore: 300DPI it is.
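The search test I did by hand could be automated along these lines. This is a minimal sketch, assuming the text layer of a generated PDF has already been extracted to a plain string by some other tool; the function name and the sample names are made up for illustration.

```python
import re


def missing_names(ocr_text, names):
    """Return the names that cannot be found in the extracted text."""
    # Collapse runs of whitespace and ignore case so line breaks in the
    # race results don't cause false misses.
    normalized = re.sub(r"\s+", " ", ocr_text).casefold()
    return [n for n in names
            if re.sub(r"\s+", " ", n).casefold() not in normalized]


# Hypothetical race result text where OCR misread "Smith" as "Srnith".
text = "Jane  Doe 35:12\nJohn Srnith 36:01"
print(missing_names(text, ["Jane Doe", "John Smith"]))  # ['John Smith']
```

Running the same list of names against the 200DPI and 300DPI versions of a page would give a miss count for each setting instead of a handful of manual searches.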

To summarize:

  1. Canon wins vs. my Kodak. Other scanners will probably yield different results.
  2. It is absolutely necessary to turn on the descreen operation to reduce moire interference (and this is only available in the printer’s driver / software, not as a generic TWAIN device, it seems)
  3. Super-high DPI isn’t worth my time. In fact, did I mention how stupid it is that I’m doing this?
  4. But 300DPI seems to be the minimum to be able to make text-searchable PDF files and have that work.

I have a few more things to research before I get going, but I’m well on my way with these findings!


Inventory of Northwest Runner

[Image: IMAG0190]

UPDATE: 6/4 notes inline with back issue notes thanks to Glenn Tachiyama. If I can pool Glenn’s issues with the issues I already have, this would be a complete collection from 1983-present.

To start out the project, I took an inventory of the issues I gathered from Martin. I’m still collecting details on what formats the various back issues are in, but at a high level:

  • the oldest issues are only available in the print copies (which I have)
  • newer issues (from something like 2000ish on) are available in some digital format

So the right thing to do is to scan the oldest issues and try to work with the digital format of the new issues and make directly consumable digital copies of those.

I don’t have access to the earliest volumes.  These date to the early 1970’s and if anyone has access to these, I would be very happy to digitize them, but I’m going to assume they are lost forever.  Here’s a stock of what I HAVE or what is MISSING:

  • Volumes 1-3: all missing
  • Volume 4: HAVE issues 5, 6, 7, and 10
  • Volume 5: HAVE issues 2, 5
  • Volume 6: MISSING 3, 5, 7, 11
  • Volume 7: MISSING 2
  • …note – all future volumes / years indicate issues that are MISSING…
  • Volume 8: 7, 10+
  • Volume 9 / 1981 (volume numbering changed this year): 1, 2, 4, 8, 9, December
  • 1982: May –> Glenn is also missing May, issue never printed?
  • 1983-1984: complete 🙂
  • 1985: July –> Glenn has this 🙂
  • 1986: complete 🙂
  • 1987: January –> Glenn has this 🙂
  • 1988: February –> Glenn has this 🙂
  • 1989-1993: complete 🙂
  • 1994: June, November –> Glenn has this 🙂
  • 1995-1996: complete 🙂
  • 1997: February, June –> Glenn has this 🙂
  • 1998-1999: complete 🙂
  • 2000: December –> Glenn has this 🙂
  • 2001: complete 🙂
  • 2002: September, October –> Glenn has this 🙂
  • 2003-2005: complete 🙂
  • 2006: February
  • 2007 on: assume digital copies exist


Archiving Northwest Runner

A couple weeks ago I got this idea that seemed great at the time. “Northwest Runner is a really valuable resource for runners in Seattle and I am positive that there is a ton of great history in there that should be preserved and made more publicly available. I should scan all the back issues.”

It’s that last part where this may have taken a turn for the worse.  Anyway, I got in touch with long-time editor and publisher, Martin Rudow, and today I picked up a trunk full of back issues. I’m going to take some notes on this process and archive them in my blog for posterity and as I come up with questions that might be interesting for runners or hobby archivists.  This will probably start with a background of what data is available, go into technical questions / notes / challenges / discoveries, and hopefully just be kind of interesting.

I’ll try to remember to tag all the posts as “nwrunner” so that interested readers don’t have to wade through my extensive and deeply crazed rants on the current state of technology.  Wait…that’s the Unabomber…not me…
