[olug] Trying to extract bar codes from a pdf file

Tue Feb 7 22:54:42 CST 2017

I'm running into some challenges with this and I'm wondering if any of you
smarter people have some insight.

I have a pdf file of bar codes <http://adamhaeder.com/out.pdf>. I want to
extract the barcodes (both the actual bar code part itself, and the text
directly beneath it which tells me what's encoded in the barcode) into
individual image files named after the id. For example, the first barcode
in the file encodes A33432008009636B, so I want to extract it into an image
file named A33432008009636B.png.

At first glance, this file looks pretty well laid out, enough so that I
thought it would be easy to just give static coordinates for each barcode
and use the tool pdftoppm to pull out a certain chunk of the file. For
example, this command:

$ pdftoppm -f 1 -l 1 -r 150 -x 0 -y 160 -W 330 -H 65 -png out.pdf > foo.png

will extract the section starting at x coordinate 0, y coordinate 160, with
a width of 330 pixels and a height of 65 pixels (at the 150 dpi set by
-r 150) and save it to the file foo.png. I can then pull the id out of
foo.png with an OCR command like tesseract and rename the png file
accordingly. This works; I tried it. The problem is that the file isn't
regular enough: too many barcodes are not quite lined up right (because I
can't guarantee how many lines of text will be above each one). The sample
pdf file is only 1 page of 89, so there are a lot of barcodes in this. I
wrote a script to try the static coordinate method and then walk through
each image I created, searching for the A[0-9]+B string to see how many I
missed, and I got about a 20% error rate. Way too high.
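
For reference, the "how many did I miss" check can be sketched in a few
lines of Python. This is only an illustration, not the script mentioned
above: the function names and the sample OCR text are made up, and it
assumes the OCR output for each crop has already been collected as a
string (for example from tesseract).

```python
import re

# Pattern from the post: the letter A, a run of digits, then the letter B.
ID_PATTERN = re.compile(r"A[0-9]+B")


def find_id(ocr_text):
    """Return the first barcode id found in the OCR text, or None."""
    match = ID_PATTERN.search(ocr_text)
    return match.group(0) if match else None


def miss_rate(ocr_results):
    """Fraction of crops whose OCR text contains no id.

    ocr_results maps a crop's file name to the text OCR produced for it.
    """
    if not ocr_results:
        return 0.0
    misses = [name for name, text in ocr_results.items()
              if find_id(text) is None]
    return len(misses) / len(ocr_results)


# Hypothetical OCR output, for illustration only.
sample = {
    "crop_000.png": "A33432008009636B\n",
    "crop_001.png": "some other text that got cropped instead\n",
}
print(miss_rate(sample))  # 0.5
```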

So my next step is to get something that will give me approximate
coordinates for each barcode. I found the program pstotext, which will
do just that if I run it with the -bboxes option, like so:

# pstotext -bboxes out.pdf | egrep "A[0-9]{6,}"
GPL Ghostscript 9.18: Some glyphs of the font GGICBJ+ArialMT requires a
patented True Type interpreter.
    19     689     147     714  A33432008009636B
   217     689     345     714  A33432008009637B
   415     689     543     714  A33432008009638B
    19     617     147     642  A33432008515260B
   217     612     345     636  A33432009199973B
   415     612     543     636  A33432009200037B
    19     540     147     564  A33432009200094B
   217     540     345     564  A33432008515245B
   415     540     543     564  A33432008513679B
    19     473     147     498  A33432008009681B
   217     468     345     492  A33432008009682B
   415     468     543     492  A33432008009698B
    19     396     147     420  A33432008898682B
   217     401     345     426  A33432008009700B
   415     396     543     420  A33432008512549B
    19     329     147     354  A33432009196094B
   217     329     345     354  A33432009196102B
   415     329     543     354  A33432008009720B
    19     257     147     282  A33432009196342B
   217     257     345     282  A33432009196359B
   415     257     543     282  A33432009196367B
    19     185     147     210  A33432009196375B
   217     185     345     210  A33432009196383B
   415     185     543     210  A33432009196391B
    19     113     147     138  A33432009196409B
   217     113     345     138  A33432009196417B
   415     113     543     138  A33432009196425B
    19      41     147      66  A33432009196433B
   217      41     345      66  A33432009196441B
   415      41     543      66  A33432009196458B
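
Output in this shape is easy to turn into structured data. The sketch
below is a guess at how that might look in Python; it assumes the four
numbers are lower-left x, lower-left y, upper-right x, upper-right y (which
is what the values above suggest), and the function name is hypothetical.
It also requires the trailing B, which is slightly stricter than the egrep
pattern used above.

```python
import re

# Four integers followed by an id-like word. Other lines, such as the
# Ghostscript font warning, do not match and are skipped.
BBOX_LINE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(A[0-9]{6,}B)\s*$"
)


def parse_bboxes(output):
    """Return a list of (llx, lly, urx, ury, barcode_id) tuples.

    Assumes the column order is llx, lly, urx, ury, in points.
    """
    boxes = []
    for line in output.splitlines():
        m = BBOX_LINE.match(line)
        if m:
            llx, lly, urx, ury = (int(g) for g in m.groups()[:4])
            boxes.append((llx, lly, urx, ury, m.group(5)))
    return boxes


# A few lines copied from the output above.
sample_output = """\
GPL Ghostscript 9.18: Some glyphs of the font GGICBJ+ArialMT requires a
patented True Type interpreter.
    19     689     147     714  A33432008009636B
   217     689     345     714  A33432008009637B
"""
for box in parse_bboxes(sample_output):
    print(box)
```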

Success, we think! I'll just take those coordinates, feed them to pdftoppm,
and Bob's your uncle. However, pdftoppm uses a different coordinate
system. Firstly, it starts counting from the upper left, unlike pstotext,
which counts from the lower left. Also, while pdftoppm works in pixels
(which is why I had to pass a dpi value), pstotext works in 'points', which
I honestly haven't been able to figure out yet.
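
For what it's worth, a PostScript/PDF point is normally 1/72 of an inch, so
pixels = points * dpi / 72, and the y axis can be flipped by subtracting
from the page height. The Python sketch below shows that arithmetic. It
assumes the page is US Letter (792 points tall), which is a guess; the real
height should be checked, for example with pdfinfo. The function name and
the pad_pt parameter are hypothetical.

```python
import math

POINTS_PER_INCH = 72  # size of a PostScript/PDF point


def bbox_to_pixel_crop(llx, lly, urx, ury, dpi=150,
                       page_height_pt=792, pad_pt=0):
    """Convert a lower-left-origin box in points to an upper-left-origin
    pixel crop (x, y, width, height).

    page_height_pt=792 assumes a US Letter page; pad_pt optionally grows
    the box on every side before converting.
    """
    left = max(llx - pad_pt, 0)
    right = urx + pad_pt
    top = min(ury + pad_pt, page_height_pt)
    bottom = max(lly - pad_pt, 0)
    # Round the origin down and the size up so the crop never shrinks.
    x = math.floor(left * dpi / POINTS_PER_INCH)
    y = math.floor((page_height_pt - top) * dpi / POINTS_PER_INCH)
    width = math.ceil((right - left) * dpi / POINTS_PER_INCH)
    height = math.ceil((top - bottom) * dpi / POINTS_PER_INCH)
    return x, y, width, height


# First box from the pstotext output: 19 689 147 714
print(bbox_to_pixel_crop(19, 689, 147, 714))  # (39, 162, 267, 53)
```

Under the 792-point assumption, the converted y of 162 lands close to the
hand-picked -y 160 from the earlier pdftoppm command, which suggests the
assumption is at least plausible for this file.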

So it seems my two options are:
- Somehow convert the coordinates in the pstotext output above to a
format that pdftoppm will be happy to read, or
- Do something completely different to programmatically get these barcodes
out of this pdf
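
A rough sketch of what the first option might look like once a pixel crop
is known: build one pdftoppm command per barcode, using the id as the
output name. The function name is hypothetical, and the use of -singlefile
to get a file named exactly <id>.png is an assumption about pdftoppm's
behaviour that should be checked against the installed version. The
example crop values are illustrative only.

```python
def pdftoppm_command(pdf_path, page, crop, barcode_id, dpi=150):
    """Return an argument list that crops one barcode out of one page.

    crop is (x, y, width, height) in pixels at the given dpi, measured
    from the upper-left corner of the page.
    """
    x, y, width, height = crop
    return [
        "pdftoppm",
        "-f", str(page), "-l", str(page),   # restrict to a single page
        "-r", str(dpi),
        "-x", str(x), "-y", str(y),
        "-W", str(width), "-H", str(height),
        "-png", "-singlefile",
        pdf_path,
        barcode_id,  # output root; expected to produce <barcode_id>.png
    ]


cmd = pdftoppm_command("out.pdf", 1, (39, 162, 267, 53), "A33432008009636B")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Naming the file from the pstotext id rather than from OCR would also avoid
the tesseract step, provided the boxes line up with the barcodes.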

Thanks for any advice!

-- 
Adam Haeder
adam at adamhaeder.com