Possibly final notes on magazine scanning
I spent some time this past weekend experimenting with scanning settings and eventually got one full year of Northwest Runner magazine scanned. I chose 2003 to scan because I think this is a year for which digital copies exist – this meant if something went horribly wrong and I physically ruined an issue or two, I probably wouldn’t get my kneecaps bashed in at an upcoming Winter Grand Prix race series. I had already done extensive experimenting with my own copies of the magazine anyway, so that wasn’t too likely, but I wanted to play it safe.
Here are the key notes:
- Scanning in greyscale images+text at 300DPI is dramatically faster than scanning in color. However, for recent issues this isn’t a great option: some pages are B&W but many are full color.
- Output image size for greyscale vs. color at 300DPI is pretty similar. This only matters because of the limitation I mentioned previously: the only way I can scan with our scanners at work is to have the scanner email the output to me – I can’t scan directly to a network share – and my mailserver rejects messages above a certain size.
- My mailserver seems to reject messages once they cross a threshold somewhere between 12 and 15MB in size. In practice, this means I can scan about 5 ledger-sized, double-sided sheets, or about 10 pages of the magazine at a time.
- It is important to separate the pages and invert the fold along the spine before sending them through the auto-feeder. I didn’t do this with one of the first magazines and I wound up with some paper jams and some slightly mangled pages (not really destroyed or anything, but like what you get with a printer auto-feed after something’s gotten jammed). I mentioned I scanned the entire year of 2003 – the jams only happened in the first issue or two. After I started separating the pages and inverting the fold, the pages did not get “stuck” along the spine and they all fed cleanly.
- Sometimes 5 sheets barely hits the “too big” threshold for scanning. If this happens, I need to do something like “scan 3, then scan 2.” This is rare, but it happens.
- Some of the magazines are missing pages or have single pages torn out. This screws up pagination and might make later post-processing / assembly into PDFs a pain (or I might just ignore it).
- The printers at work require me to log in, and after some time they will log me out. If I’m logged out, I need to re-enter the scan settings (2 sided, color images + text, scan as JPG not PDF, 300DPI). This is tedious. If I stay attentive during the scan process, I won’t get logged out. The loop is: feed 5 sheets, wait for them to scan, put the next 5 sheets in the feeder, wait for confirmation that it sent the email, then press “Scan” again. This also ensures that the scanner (which is the critical path in this assembly line) is always “busy.”
- As this process is happening, I’m getting email after email with 10 attached images (scan01.jpg, scan02.jpg, etc. for both sides) that I need to pull out of my inbox and archive in folders. Because the image names conflict (scan01.jpg will be the cover and also page 11 and page 21, etc.) I need to batch these up, too. My post-processing jpg rotater, cutter, etc. script will handle these.
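The name-conflict problem in that last note is easy to sketch in code. What follows is a rough illustration, not my actual post-processing script. It assumes I save each email's attachments into its own zero-padded folder (batch_001, batch_002, and so on, in the order scanned) and that the scanner numbers files scan01.jpg, scan02.jpg, etc. within each email. The folder layout, the img0001.jpg output naming, and the function names plan_renames and collect are all made up for this example.

```python
import re
import shutil
from pathlib import Path

# Matches the scanner's attachment names, e.g. scan01.jpg (assumed pattern).
SCAN_NAME = re.compile(r"scan(\d+)\.jpg$", re.IGNORECASE)


def plan_renames(batch_dirs):
    """Return (source_path, new_name) pairs in overall scan order.

    Batches are ordered by folder path, so the folder names need to sort
    in the order the batches were scanned (hence the zero padding).
    """
    plan = []
    counter = 1
    for batch in sorted(Path(b) for b in batch_dirs):
        scans = []
        for f in batch.iterdir():
            m = SCAN_NAME.match(f.name)
            if m:
                scans.append((int(m.group(1)), f))
        # Within a batch, order by the number the scanner assigned.
        for _, f in sorted(scans, key=lambda s: s[0]):
            plan.append((f, f"img{counter:04d}.jpg"))
            counter += 1
    return plan


def collect(batch_dirs, out_dir):
    """Copy every scan into out_dir under its new, unique name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src, name in plan_renames(batch_dirs):
        shutil.copy2(src, out / name)
```

Copying instead of moving keeps the original attachments around in case something goes wrong, and one running counter across batches means scan01.jpg from the second email can never overwrite scan01.jpg from the first.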
That’s about it. Scanning the 2003 year of magazines took almost exactly 2 hours. During this time I am constantly busy with: de-stapling issues, preparing 5-sheet batches for the scanner (de-“sticking” the spine), running the scanner, adding/removing sheets from the feed tray, processing my inbox (which will fill up if I don’t pull the files out), reassembling scanned magazines, and trying to re-staple them. I think I can make this a little more efficient and bet I’ll trim a decent amount of time off that 2 hour baseline, but this process seems pretty close to optimized to do this job well and keep the original issues intact.
Now I just need to sync up with Martin (or really probably Bill Roe, who I think actually owns these issues) and confirm that they’re OK with me plowing ahead with all the back issues.