Here’s a very simple idea: Use Project Gutenberg for content and Readability for style.
Project Gutenberg has a large collection of public domain books in digital form. The books are available in several formats, none of which are ideal for reading. Project Gutenberg provides text without much styling in order to make it easier for people to use the content as they please.
You can go to the HTML version of a book on Gutenberg and use Readability (or Instapaper) to format it for easier reading. Importing the HTML page to a Kindle similarly improves the formatting.
* * *
Has anyone made a style sheet to approximate the look of Readability or Instapaper? I’d like to use something like that to improve the appearance of the static HTML pages on my site.
There is Calibre also.
The problem with the Project Gutenberg versions of all texts (with the exception of the HTML versions) is that they have hard-coded line breaks roughly every ten words. This is a non-issue for HTML, since HTML ignores line breaks, but in the other versions it makes the text a pain to read!
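For the plain-text versions, one way around the hard line breaks is to rejoin each paragraph before reading. Here is a minimal Python sketch, assuming paragraphs are separated by blank lines (the function name `unwrap_paragraphs` is just my own choice for illustration):

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so each paragraph becomes a single line.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Normalize Windows line endings so the split below sees plain "\n".
    text = text.replace("\r\n", "\n")
    # Split on blank lines (which may contain stray whitespace).
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Within each paragraph, replace the hard line breaks with spaces.
    unwrapped = (
        " ".join(line.strip() for line in paragraph.splitlines())
        for paragraph in paragraphs
    )
    return "\n\n".join(unwrapped)


if __name__ == "__main__":
    sample = "It was the best\nof times,\n\nit was the worst\nof times."
    print(unwrap_paragraphs(sample))
```

Note that this would also flatten poetry, tables, and anything else where the line breaks are meaningful, so it is only a starting point for ordinary prose.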
Of course, you have to find a way to remove the annoying boilerplate text. Maybe some crazy researchers worked on this problem:
http://arxiv.org/abs/0707.1913
(Shameless plug.)
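For a quick and dirty version, many Project Gutenberg text files mark the body with lines beginning `*** START OF THE PROJECT GUTENBERG EBOOK` and `*** END OF THE PROJECT GUTENBERG EBOOK`. The sketch below simply cuts at those markers. This is a naive approach and not the method from the paper linked above; the exact marker wording varies between files (some older files lack the markers entirely), so treat the patterns as assumptions to adjust.

```python
import re

# Assumed marker wording; real files vary ("THE" vs "THIS", spacing, case).
START_MARKER = re.compile(
    r"^\*\*\*\s*START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_MARKER = re.compile(
    r"^\*\*\*\s*END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_boilerplate(text: str) -> str:
    """Return the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_MARKER.search(text)
    begin = start.end() if start else 0
    end = END_MARKER.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


if __name__ == "__main__":
    sample = (
        "Header and license notes\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK FOO ***\n"
        "Body text.\n"
        "*** END OF THE PROJECT GUTENBERG EBOOK FOO ***\n"
        "More license text\n"
    )
    print(strip_boilerplate(sample))
```

Anything the markers miss (transcriber notes, credits inside the body) would still need a smarter approach.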
Or if you want it really nice to read, you could put it into a TeX document and then run a sed script on it to remove all the non-LaTeX characters, turn all the " into `` and '' and whatnot. Then you get proper typesetting. Shouldn’t be that hard to automate.
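As a rough illustration of the cleanup described above, here is a small sketch in Python rather than sed. It assumes plain text with straight double quotes, escapes a handful of LaTeX special characters, and guesses whether each quote is opening or closing from the character before it. The function name `to_latex` is made up for this example, and backslashes, tildes, and carets in the input are not handled.

```python
import re


def to_latex(text: str) -> str:
    """Escape common LaTeX special characters and convert straight quotes.

    Heuristic (an assumption, not a rule): a double quote at the start of
    the text, or right after whitespace or an opening bracket, is an
    opening quote; every other double quote is a closing quote.
    """
    # Prefix each special character with a backslash.
    text = re.sub(r"([&%$#_{}])", r"\\\1", text)
    # Opening quotes become two backticks.
    text = re.sub(r'(^|[\s(\[])"', r"\1``", text)
    # Whatever straight quotes remain are treated as closing quotes.
    return text.replace('"', "''")


if __name__ == "__main__":
    print(to_latex('"Hello," she said. "It costs $5 & more."'))
```

Quotes that follow a dash or other punctuation will be misclassified as closing quotes, so the output still deserves a read-through.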
Canageek: That was actually my first thought. I had a particular book I wanted to reformat and I thought I’d use LaTeX. Then I thought that the HTML/Readability route would be more convenient.
LaTeX would definitely be better if a book contains symbols or diagrams. See this spherical trig book on Project Gutenberg that someone produced in LaTeX. It’s beautifully done, probably much easier to read than the original.
@John: I was starting to think about writing a script to do this, and found GutenMark, which gives either nicer HTML or LaTeX output. I have not tried it yet, but the examples look nice. Some of the eBooks also now have very nice HTML pages, for example The Shunned House by H. P. Lovecraft, which frankly looks nicer than I can do with LaTeX in a script. I could probably copy it by hand, but my regex isn’t crazy enough to do drop caps and whatnot in an automated fashion.
Is this another example of how separating content from presentation diminishes both?
@BrendanDowling: Not really. It is easier to parse a presentationless file for sure, since there is less markup to strip out or rework. I have very mixed feelings on that idea. On one hand, two files is a pain, as is the extra work. On the other, using stuff like \title{Foo} and then setting up a title definition above means that you can change how all the titles and such look at once, instead of having to find each one and fix it.
Thumbs up for Readability.