TextRipper (aka T-Rip)

Nautilus Scripts


Description:

An OCR (Optical Character Recognition) GUI application or CLI script
# Supports the Tesseract engine by default!
# Optionally supports the Ocrad engine for multi-column text.
# These recognition engines have a very high character recognition success rate compared to other OCR engines, including proprietary software.
# New: multi-page and multiple file selection support!
# Enhanced XSANE output and TIFF compatibility.
# New: now handles nearly any format out there!
# This script will convert any image of text into editable and indexable text. (for a full list of compatible file formats see the first filter below)
#
# REM: The cleaner, higher-contrast, and higher-resolution your image or scan is, the better the results.
#
# Dependencies: libtiff-dev (or -devel) (installed FIRST), tesseract-2.04 (latest stable version), your chosen language data for Tesseract (2.00 and up) *1,
# ImageMagick, ghostscript, Zenity, and OpenOffice or other text editor *2
# This version of tesseract can be downloaded from here: http://code.google.com/p/tesseract-ocr/downloads/list
# Warning: This script will not work with the latest beta version (tesseract 3.00 pre-release) due to database structure modifications.
#
# Optional dependencies: ocrad ->an alternate recognition engine
# If initial results are unsatisfactory, this engine may do better. Most importantly, it supports basic page format recognition. *3
# The latest version of ocrad can be downloaded off the GNU mirror list here: http://www.gnu.org/software/ocrad/
#
# Also: Make sure to select Unicode UTF-8 in OpenOffice's pop-up window (or text editor of your choice).
#
#
#
# *1 Install Tesseract after libtiff-dev. Then extract all the language databases you need into the "wherever_you_installed/tesseract-2.04/tessdata" directory.
# This is done automatically if you extract the language databases from WITHIN the "tesseract-2.04" directory (and allow overwriting).
# This script allows the use of multiple language databases. Default is English and French. For adding others see comments below.
# You NEED at least one language database or tesseract will not work.
# *2 Simply change the occurrence of "soffice -writer" below to a text editor of your choice, e.g. gedit, KWrite, etc.
# Some systems call OpenOffice Writer differently. If unsure, check the properties tab of your Writer launcher.
# E.g. on customized versions of OOo (such as the ones provided by Linux Mandrake or Gentoo), you start Writer with: oowriter
# *3 If you also install ocrad, TextRipper will detect it and prompt you to choose between the two engines, trading better recognition against page format support.
#
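Before running the script, the dependency list above can be checked from a terminal. The following is only a minimal sketch, not part of TextRipper itself: the helper name `missing_deps` is hypothetical, and it assumes the tools are installed under their usual command names (`tesseract`, `convert` for ImageMagick, `gs` for ghostscript, `zenity`, and the optional `ocrad`).

```shell
#!/bin/sh
# Hypothetical helper, not part of TextRipper: print each named command
# that cannot be found on PATH, one per line.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Required tools from the dependency list (assumed command names).
missing=$(missing_deps tesseract convert gs zenity)
if [ -n "$missing" ]; then
    echo "Missing required tools:" $missing
fi

# ocrad is optional; report which engines would be available.
if [ -z "$(missing_deps ocrad)" ]; then
    echo "ocrad found: both engines available"
else
    echo "ocrad not found: Tesseract only"
fi
```

This checks only that the commands exist. It does not verify the Tesseract version or that any language data is installed.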
# Troubleshooting:
# If this script ends saying your text editor can't open "OCR output-editable text.txt",
# or if run off the cli: Unable to load unicharset file /usr/local/share/tessdata/eng.unicharset
# do (as superuser):
# echo /usr/local/share /usr/share | xargs -n 1 cp -R wherever_you_installed/tesseract-2.04/tessdata
# Explanation: Tesseract may call on the tessdata directory from the /share directory of your filesystem,
# so you need to make your language databases available from there.
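To see whether the troubleshooting step above applies, you can check whether the English data file named in the error message is visible in either share directory. This is a rough sketch under assumptions: `find_tessdata` is a hypothetical helper, not something TextRipper or Tesseract provides, and the two directories are simply the ones mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory that contains
# <lang>.unicharset, or return 1 if none of them does.
find_tessdata() {
    lang=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$lang.unicharset" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Look in the two share directories mentioned in the troubleshooting note.
if found=$(find_tessdata eng /usr/local/share/tessdata /usr/share/tessdata); then
    echo "English data found in $found"
else
    echo "eng.unicharset not found; copy your tessdata directory as described above"
fi
```

For another language, replace `eng` with that language's database prefix.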

Ratings & Comments

19 Comments

polardude1983

•14 years ago

I am getting this error /home/christoph/Downloads/dog_petition10001.jpg (editable and indexable 001.txt does not exist And I believe I installed everything correctly. I have Zenity, Tesseract-ocr, Tesseract-ocr-eng, imagemagick, libtiff4-dev, ghostscript. Any help would be appreciated. I have tried it on different images in different formats, jpg, png, pdf. Same error for all
