Friday, November 07, 2008

JSTOR to Amazon Kindle

Wonderful news: I've discovered an easy way to convert scanned PDF files (e.g. from JSTOR) so that they're readable on my Kindle. I simply email the PDF to my Gmail account, and (thanks to Google's new OCR capabilities) from there I can "open as HTML" and save the result as a plain text file. As an added bonus, this process preserves page number information (normally lost in Amazon's conversions).

Before transferring to the Kindle, it might be worth tidying up the text file to make it more readable. This is especially easy using the command line in Linux -- I use the 'fmt' command to remove excess whitespace and line-breaks, and 'tr -d' to delete any annoying characters (in my case, asterisks) that one's browser saw fit to scatter throughout the saved text file. This is all taken care of by the single line:
tr -d '*' < oldfile.txt | fmt -u -w 999 > newfile.txt

Perfect!
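
If you have several articles to tidy up, the same pipeline can be wrapped in a small shell function. This is only a sketch: the name clean_for_kindle is my own invention, and it assumes GNU fmt (the -u flag for uniform spacing may not exist in other versions of fmt).

```shell
# Hypothetical helper (the name is illustrative, not a standard tool).
# Assumes GNU coreutils: 'fmt -u' gives uniform spacing, '-w 999' makes
# each paragraph one long line so the Kindle can reflow it.
clean_for_kindle() {
    # $1 = saved text file, $2 = cleaned output file
    tr -d '*' < "$1" | fmt -u -w 999 > "$2"
}

# Usage (file names are placeholders):
#   clean_for_kindle oldfile.txt newfile.txt
```

Blank lines between paragraphs are kept, so paragraph breaks survive the clean-up; only the line breaks and extra spaces within a paragraph are collapsed.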

Update: a few further notes...
(1) Acrobat Reader is sometimes able to "save as text" even scanned-image PDFs. (I guess these must be "image+text" marked-up PDFs, rather than raw scanned images. But they look the same to the naked eye.)

(2) The method listed for JSTOR articles won't work for two-column scans (e.g. of books). However, the Linux script 'unpnup' can convert such files to single-column PDFs. PaperCrop is a more powerful solution that works easily in Windows.

(3) Sometimes a book scan is of such bad quality that OCR just can't interpret it. In this case, one can use PDFread to cut up the images into Kindle-sized bites, and assemble the images directly into a .prc or .mobi (Kindle-readable) file. This way one can read the scanned images themselves on one's Kindle, without them being shrunk to an illegible size. I've found that this works extremely well.

14 comments:

  1. And just why didn't you buy an iPhone to do all this? Plus, the iPhone allows you to open pdfs from emails - no need to bother with Google's crappy HTML versions...

  2. Can you save the .html version as a .doc file? And then send that to your kindle?

  3. Anlamk - you miss the point. I'm talking about serious reading here, not just skimming a document on the go. The iPhone is far from an ideal reading device. I could read PDFs on my laptop if that was the only issue. But I do a lot of reading (of digitized content -- mostly PDFs of philosophy papers) and the Kindle's e-ink is much more comfortable to read than a backlit computer screen (which in turn is better than a tiny phone screen).

  4. Hi Gil - in my first attempt, I sent the .html file to Amazon to be directly converted into Kindle format (they offer this as a free online service; .doc files require conversion too).

    Unfortunately, the formatting was not very good -- half of my Kindle screen ended up being wasted by the wide left margin, so that only a few words would fit on each line. So I think it is better to save the .html file in unformatted (plain text) form instead, so as to get rid of the wide margins.

  5. If you use an iPhone you can simply copy over the PDF for reading at your leisure. I'd disagree that the iPhone is bad for reading. The advantage is that it's always in my pocket so I can read when I want. The Kindle would be nice except it's so large that it's a pain to pack around. Basically I'd rather just cart around a printout or my MacBook than a Kindle. (My personal preferences - I'm not making a general argument against Kindle lovers)

  6. Well, maybe this is just a difference in preference. I'd still prefer the iPhone to the Kindle. I think the screen is as comfortable as any - plus you get to carry one less device. With the Kindle, you have to have the phone plus the Kindle. (And the Kindle is pretty big, too.)

    The only advantage seems to be the e-ink, which avoids the eye-strain you get from looking at a computer screen. As an engineer who lives and breathes computers, that's not an issue for me.

    And iPhone's other features (mp3, phone, other apps, interface) more than compensate for the lack of the e-ink.

  7. Well, again, I'm sure the iPhone is great for many purposes. But I don't see how any of this is relevant. You presumably wouldn't use it as your primary means of reading, which is what I'm concerned with here (not just 'on the go', as I said, but studying at home, etc.).

    Anyway, it's not for everyone. But for those who like the Kindle, it's great to have a solution to what was previously (again, at least for my purposes) its greatest shortcoming.

  8. I don't know about the iPhone, but my Nokia E90 has been my primary means of reading for the last few years. I personally don't have any difficulty with this, and I love that I always have books and papers to read.

    There are plenty of easy ways to convert PDFs into text or HTML; there are open source tools for doing this on your desktop.

    Cheers
    David

  9. Hi David, it's easy to convert text-based PDFs. Scanned-image PDFs (as from JSTOR) are an altogether different kettle of fish. For that you require Optical Character Recognition software -- and while some open source OCR is available, it isn't particularly easy to use, and my attempts yielded far worse results than Google's. So the new-found ability to use Gmail for OCR is actually pretty significant.

  10. Richard,

    I know you've already got the Kindle, but I think the e-reader from Plastic Logic will be a better choice for reading PDFs and the like, rather than ebooks. It's going to be larger, and seems to be geared more toward those who have their own documents than the Kindle, which seems to be geared more toward those who want to buy books from Amazon.

    Corey

  11. Wow, yeah that looks great. (May still be a ways off in the future, though...)

  12. For home wouldn't you be better off reading on a laptop? There are laptops with great screens. My MacBookPro is the best computer screen I've ever used.

  13. I think the backlighting means that reading from a computer is never ideal (compared to paper/digital ink). It's also more comfortable to read from something hand-held. The whole point of the Kindle is that it imitates the comfort advantages of paper books, but for digital content.

  14. Thanks for the tip! I am looking forward to trying this method.

