Scanning and creating small PDFs using sane and ghostscript

I tend to avoid paper printouts. I have enough backups that scanned archives suffice. I ran a few tests to find the best way to produce small PDFs on the command line, and found the following bash functions to be the most effective:

function scan2pdf {
  cd ~/tmprm/scan || return
  FILE=$1
  # prompt for a file name if none was given as argument
  [ -z "$FILE" ] && read FILE
  # skip if the target PDF already exists
  [ -e "$FILE".pdf ] && return
  # scan a full A4 page in grayscale
  scanimage -l 0 -t 0 -x 215 -y 297 --mode Gray --resolution=300 > "$FILE".pnm
  # convert to PS because gs needs this input format
  pnmtops -dpi 300 "$FILE".pnm > "$FILE".ps
  # convert to PDF with the decent /ebook quality preset
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -sOutputFile="$FILE".pdf "$FILE".ps
  rm -f "$FILE".pnm "$FILE".ps
}

function scan2pdfs {
  cd ~/tmprm/scan || return
  ENDFILE=$1
  # prompt for a base name if none was given as argument
  [ -z "$ENDFILE" ] && read ENDFILE
  # scan numbered pages one by one until the user answers "d"
  for i in `seq --equal-width 999`; do
    echo "(d)one?"
    read NEXT
    [ "$NEXT" == "d" ] && break
    scan2pdf "$ENDFILE"$i
  done
  # merge the numbered single-page PDFs into one document
  gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -sOutputFile="$ENDFILE".pdf -f "$ENDFILE"[0-9][0-9][0-9].pdf
  echo "OK to remove the single pages? (CTRL-C to abort)"
  read OK
  rm -f "$ENDFILE"[0-9][0-9][0-9].pdf
}

It can be used as follows:

scan2pdf thisfile

scan2pdf thisotherfile

scan2pdfs multiplefiles

It does all the work in ~/tmprm/scan but that’s a personal convenience. With this, I get PDFs smaller than 1 MB, while the other methods I tried before produced 5-6 MB files for the same content.
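For the record, most of the size reduction comes from the -dPDFSETTINGS preset: /screen downsamples images to roughly 72 dpi, /ebook to 150 dpi and /printer to 300 dpi. Here is a throwaway sketch to compare them on one of the intermediate .ps files (file names are just examples):

# compare Ghostscript quality presets on the same input
for PRESET in screen ebook printer; do
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/$PRESET -dNOPAUSE -dBATCH -sOutputFile=test-$PRESET.pdf test.ps
done
# check the resulting file sizes
ls -lh test-*.pdf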

Update: this is now provided as a general bashrc.d script, included in the -utils package. The main command for multi-page A4 PDFs is no longer scan2pdfs but scan2pdf. Its behavior can be tuned through the variables SCAN2PDF_DIRECTORY (default: ~/tmprm/scan) and SCAN2PDF_DPI (default: 300).
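I won’t paste the whole packaged script here, but a minimal sketch of how these variables could plug into the function above looks like this (an illustration, not the actual bashrc.d code):

function scan2pdf {
  # sketch only: the packaged script may differ
  cd "${SCAN2PDF_DIRECTORY:-$HOME/tmprm/scan}" || return
  DPI=${SCAN2PDF_DPI:-300}
  FILE=$1
  [ -z "$FILE" ] && read FILE
  [ -e "$FILE".pdf ] && return
  scanimage -l 0 -t 0 -x 215 -y 297 --mode Gray --resolution="$DPI" > "$FILE".pnm
  pnmtops -dpi "$DPI" "$FILE".pnm > "$FILE".ps
  gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dBATCH -sOutputFile="$FILE".pdf "$FILE".ps
  rm -f "$FILE".pnm "$FILE".ps
}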

Converting PDFs to multiple HTML pages with pdftk and pdftohtml

As already stated on this blog, Bada OS is total crap. Scripting is a mess, T9 is missing from the original versions, and updating is not an available option depending on your phone (even if the phone is less than a year old). It is absolutely worthless when it comes to reading PDFs. No matter what, even if you feed it a specifically cropped PDF with no margins, you always end up with something not really readable: too big, too small, whatever. A pain in the ass.

I soon realized that with such an appalling combination of software and hardware, it’s best to convert ebooks/PDFs to HTML. And since the provided HTML reader can’t remember which page you last read (not surprising) and, ahem, is unable to load a 3 MB page (low memory, it says: even though a 30 MB PDF loads fine in the PDF reader on the exact same phone, go figure!), it needs split HTML.

PDF is usually an output format, not a source format. While there’s plenty of software to convert to PDF, there is no complete suite to convert from it. pdftk is powerful but not easy to handle IMHO, and pdftohtml’s latest release is almost 10 years old. So I ended up writing a small wrapper (pdf2htmls.pl) around both these tools to convert one PDF into multiple HTML files with basic indexes. It takes --input=file.pdf and (optional) --output=directory arguments. Aside from Perl, it requires the Debian packages pdftk and poppler-utils.
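The core of the idea, minus the index generation, boils down to the following (a bash sketch of what the Perl wrapper does, with a made-up 10-page chunk size; pdftohtml appends the .html extension itself):

# count pages with pdftk, then convert 10-page chunks with pdftohtml
PAGES=`pdftk file.pdf dump_data | awk '/NumberOfPages/ {print $2}'`
mkdir -p output
for FIRST in `seq 1 10 $PAGES`; do
  LAST=$((FIRST + 9))
  [ $LAST -gt $PAGES ] && LAST=$PAGES
  pdftohtml -noframes -f $FIRST -l $LAST file.pdf output/part-$FIRST
done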

The indexes are über-crude. They could be improved with chapters/titles; maybe I’ll add that later.