[olug] Trying to extract bar codes from a pdf file
Adam Haeder
adam at adamhaeder.com
Tue Feb 7 22:54:42 CST 2017
I'm running into some challenges with this and I'm wondering if any of you
smarter people have some insight.
I have a pdf file of bar codes <http://adamhaeder.com/out.pdf>. I want to
extract the barcodes (both the actual bar code part itself, and the text
directly beneath it which tells me what's encoded in the barcode) into
individual image files named after the id. So for example, in the file, the
first barcode is the data A33432008009636B. I want to extract this barcode
into an image file named A33432008009636B.png
At first glance, this file looks pretty well laid out, enough so that I
thought it would be easy to just give static coordinates for each barcode,
and use the tool pdftoppm to pull out a certain chunk of the file. For
example, this command:
$ pdftoppm -f 1 -l 1 -r 150 -x 0 -y 160 -W 330 -H 65 -png out.pdf > foo.png
will extract the section starting at x coordinate 0, y coordinate 160, with
a width of 330 pixels and a height of 65 pixels (assuming the dpi is 150)
and save it to the file foo.png. I can then pull the id out of foo.png with
an OCR command like tesseract and rename the png file accordingly. This
works, I tried it. The problem is that the file isn't regular enough: too
many bar codes are not quite lined up right (because I can't guarantee how
many lines of text will be above each one). The sample pdf file is only 1
page of 89, so there are a lot of barcodes in this. I wrote a script to try
the static coordinate method, and then go and walk through each image I
created, searching for the A[0-9]+B string to see how many I missed, and I
got about a 20% error rate. Way too high.
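A minimal sketch of that kind of check, assuming the OCR text for each cropped image has already been collected into a dictionary. The names extract_id and miss_rate and the sample file names are hypothetical; only the A[0-9]+B pattern comes from the description above.

```python
import re

# Pattern from the description above: an "A", one or more digits, then a "B".
ID_PATTERN = re.compile(r"A[0-9]+B")


def extract_id(ocr_text):
    """Return the first barcode id found in the OCR text, or None."""
    match = ID_PATTERN.search(ocr_text)
    return match.group(0) if match else None


def miss_rate(ocr_by_file):
    """Fraction of cropped images whose OCR text contains no id."""
    if not ocr_by_file:
        return 0.0
    missed = [name for name, text in ocr_by_file.items()
              if extract_id(text) is None]
    return len(missed) / len(ocr_by_file)


# Hypothetical OCR results keyed by crop file name (illustrative only).
sample = {
    "crop_001.png": "A33432008009636B\n",
    "crop_002.png": "||| |||| || |||\n",
}
print(miss_rate(sample))
```

A crop that lands partly off its barcode usually yields OCR text with no match, so it is counted as a miss.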
So my next step is to get something that will give me an approximate
coordinate system for each barcode. I found the program pstotext which will
do just that, if I run it with the -bboxes option, like so:
# pstotext -bboxes out.pdf | egrep "A[0-9]{6,}"
GPL Ghostscript 9.18: Some glyphs of the font GGICBJ+ArialMT requires a
patented True Type interpreter.
19 689 147 714 A33432008009636B
217 689 345 714 A33432008009637B
415 689 543 714 A33432008009638B
19 617 147 642 A33432008515260B
217 612 345 636 A33432009199973B
415 612 543 636 A33432009200037B
19 540 147 564 A33432009200094B
217 540 345 564 A33432008515245B
415 540 543 564 A33432008513679B
19 473 147 498 A33432008009681B
217 468 345 492 A33432008009682B
415 468 543 492 A33432008009698B
19 396 147 420 A33432008898682B
217 401 345 426 A33432008009700B
415 396 543 420 A33432008512549B
19 329 147 354 A33432009196094B
217 329 345 354 A33432009196102B
415 329 543 354 A33432008009720B
19 257 147 282 A33432009196342B
217 257 345 282 A33432009196359B
415 257 543 282 A33432009196367B
19 185 147 210 A33432009196375B
217 185 345 210 A33432009196383B
415 185 543 210 A33432009196391B
19 113 147 138 A33432009196409B
217 113 345 138 A33432009196417B
415 113 543 138 A33432009196425B
19 41 147 66 A33432009196433B
217 41 345 66 A33432009196441B
415 41 543 66 A33432009196458B
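The listing above could be read into structured records with something like the following. This is a sketch that assumes each line is four integers followed by the id, and that the four numbers are lower-left x, lower-left y, upper-right x, upper-right y (the values above are consistent with that order, but it is an assumption). The Box type and parse_bboxes function are hypothetical names.

```python
import re
from typing import NamedTuple


class Box(NamedTuple):
    llx: int  # assumed: lower-left x, in points
    lly: int  # assumed: lower-left y, in points
    urx: int  # assumed: upper-right x, in points
    ury: int  # assumed: upper-right y, in points
    barcode_id: str


# Four integers, then an id shaped like the ones in the listing.
LINE = re.compile(r"^(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(A[0-9]{6,}B)$")


def parse_bboxes(output):
    """Parse bounding-box lines, skipping anything else (e.g. warnings)."""
    boxes = []
    for line in output.splitlines():
        m = LINE.match(line.strip())
        if m is None:
            continue
        llx, lly, urx, ury = (int(g) for g in m.groups()[:4])
        boxes.append(Box(llx, lly, urx, ury, m.group(5)))
    return boxes


sample_output = """\
GPL Ghostscript 9.18: Some glyphs of the font GGICBJ+ArialMT requires a
patented True Type interpreter.
19 689 147 714 A33432008009636B
217 612 345 636 A33432009199973B
"""
for box in parse_bboxes(sample_output):
    print(box)
```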
Success, we think! I'll just take those coordinates, feed them to pdftoppm,
and Bob's your uncle. However, pdftoppm uses a different coordinate
system. Firstly, it starts counting from upper left, unlike pstotext which
counts from lower left. Also, while pdftoppm works in pixels (which is why
I had to pass a dpi value), pstotext works in 'points', which I honestly
haven't been able to figure out yet.
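For reference, a PostScript point is 1/72 of an inch, so a length in points becomes pixels by multiplying by dpi/72, and the y axis can be flipped by subtracting from the page height. The sketch below shows that arithmetic. It assumes a US Letter page (792 points tall), which is a guess that should be checked against the real file, and it assumes the box order is lower-left x, lower-left y, upper-right x, upper-right y. The function name bbox_to_crop and the pad_pt parameter are hypothetical.

```python
import math


def bbox_to_crop(llx, lly, urx, ury, dpi=150, page_height_pt=792.0, pad_pt=0.0):
    """Convert a box in points (origin lower left) to a pixel crop
    rectangle (origin upper left), returned as (x, y, width, height).

    page_height_pt=792.0 assumes US Letter; pad_pt optionally grows the
    box on every side. Both are assumptions, not taken from the PDF.
    """
    def to_px(points):
        return points * dpi / 72.0  # 1 point = 1/72 inch

    # Round outward so the crop fully covers the box.
    left = max(0, math.floor(to_px(llx - pad_pt)))
    right = math.ceil(to_px(urx + pad_pt))
    # The top edge in pixels comes from the *upper* y in points.
    top = max(0, math.floor(to_px(page_height_pt - ury - pad_pt)))
    bottom = math.ceil(to_px(page_height_pt - lly + pad_pt))
    return left, top, right - left, bottom - top


# First box from the listing: 19 689 147 714
print(bbox_to_crop(19, 689, 147, 714))  # (39, 162, 268, 53)
```

Under those assumptions the first box comes out at y = 162 pixels at 150 dpi, close to the hand-picked -y 160 in the earlier pdftoppm command, which suggests the 792-point page height is at least plausible for this file.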
So it seems my 2 options are:
- Somehow convert the above coordinate system output of pstotext to a
format that pdftoppm will be happy to read, or
- Do something completely different to programmatically get these barcodes
out of this pdf
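For the first option, once a pixel rectangle is known for a barcode, the pdftoppm call can be assembled per id. The sketch below uses only the pdftoppm flags that already appear in the command earlier in this message, and relies on the same behaviour that command does (PNG data written to standard output when no output name is given). The function names pdftoppm_args and extract_crop are hypothetical.

```python
import subprocess


def pdftoppm_args(pdf_path, page, dpi, x, y, width, height):
    """Build the argument list for one cropped PNG of a single page."""
    return [
        "pdftoppm",
        "-f", str(page), "-l", str(page),  # first and last page to render
        "-r", str(dpi),                    # resolution in dpi
        "-x", str(x), "-y", str(y),        # crop origin in pixels, upper left
        "-W", str(width), "-H", str(height),
        "-png", pdf_path,
    ]


def extract_crop(pdf_path, page, dpi, crop, barcode_id):
    """Run pdftoppm and save its output as <barcode_id>.png.

    Assumes pdftoppm writes the image to stdout, as the shell redirect
    in the earlier command implies.
    """
    x, y, width, height = crop
    with open(f"{barcode_id}.png", "wb") as out:
        subprocess.run(pdftoppm_args(pdf_path, page, dpi, x, y, width, height),
                       stdout=out, check=True)


# Rebuilds the hand-written command from earlier in the message.
print(" ".join(pdftoppm_args("out.pdf", 1, 150, 0, 160, 330, 65)))
```

Naming the file from the id that came with the bounding box would avoid the OCR step for renaming, although OCR could still serve as a cross-check on the crop.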
Thanks for any advice!
--
Adam Haeder
adam at adamhaeder.com
Check out my latest book: LPI Linux Certification in a Nutshell from
O'Reilly: http://bit.ly/bvQQ0I