Trying Tesseract on Ubuntu (Docker)

2021-04-05T20:50:12+09:00

Build a JupyterLab environment with Docker, then use Tesseract inside it to extract text from images.

Building a JupyterLab environment with Docker

Creating the Dockerfile

Create a file named Dockerfile in any folder.

$ mkdir ~/Desktop/docker_build
$ cd ~/Desktop/docker_build/
$ touch Dockerfile

Contents of the Dockerfile

FROM ubuntu:latest

# update
RUN apt-get -y update && apt-get install -y \
sudo \
wget \
vim

#install anaconda3
WORKDIR /opt
# download anaconda package and install anaconda
# archive -> https://repo.continuum.io/archive/
RUN wget https://repo.continuum.io/archive/Anaconda3-2019.10-Linux-x86_64.sh && \
sh /opt/Anaconda3-2019.10-Linux-x86_64.sh -b -p /opt/anaconda3 && \
rm -f Anaconda3-2019.10-Linux-x86_64.sh
# set path
ENV PATH=/opt/anaconda3/bin:$PATH

# update pip
RUN pip install --upgrade pip

WORKDIR /
RUN mkdir /work

# execute jupyterlab as a default command
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--LabApp.token=''"]

Building the Docker image

$ docker build .
Successfully built d723190a8650

Starting the container

$ docker run -p 8888:8888 -v ~/Desktop/ds_python:/work --name my-lab d723190a8650

Open localhost:8888 in a browser to access JupyterLab.
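
If the page does not load, it can help to check whether anything is listening on the published port at all. The following is a minimal sketch using only the Python standard library. The helper name `is_port_open` is hypothetical (it is not part of Docker or Jupyter), and the host and port are simply the ones used in the `docker run` command above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumes the container was started with -p 8888:8888 as above
    print("JupyterLab port reachable:", is_port_open("localhost", 8888))
```

A successful connection only shows that the port mapping works; it does not confirm that JupyterLab itself has finished starting.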

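Tesseract on Docker

Installing Tesseract

Opening a terminal

Click the work directory.
File -> New -> Terminal

Installing Tesseract itself

Partway through the installation you are asked for a geographic area and a time zone; select Asia and Tokyo.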

$ sudo apt-get update
$ sudo apt install tesseract-ocr
$ sudo apt install libtesseract-dev

Checking the version

$ tesseract -v
tesseract 4.1.0-rc1-184-g497d
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE
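
To run the same check from a notebook cell, something like the following sketch could be used. The function names `find_tesseract` and `parse_tesseract_version` are hypothetical helpers, and the parser assumes the first line of the output has the form `tesseract <version>`, as in the output above.

```python
import re
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


def parse_tesseract_version(output: str) -> Optional[str]:
    """Extract the version token from the first line of `tesseract -v` output."""
    lines = output.strip().splitlines()
    if not lines:
        return None
    match = re.match(r"tesseract\s+(\S+)", lines[0])
    return match.group(1) if match else None


# Sample text copied from the version output shown above
sample = "tesseract 4.1.0-rc1-184-g497d\n leptonica-1.75.3\n"
print(find_tesseract())
print(parse_tesseract_version(sample))
```

In practice the text passed to `parse_tesseract_version` would come from running `tesseract -v` with `subprocess.run`; the fixed sample is used here only so the sketch runs without Tesseract installed.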

Installing the trained language data

$ sudo apt install tesseract-ocr-jpn tesseract-ocr-jpn-vert
$ sudo apt install tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert

Confirm that the language data was installed

$ tesseract --list-langs
List of available languages (6):
Japanese
Japanese_vert
eng
jpn
jpn_vert
osd
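
When this check needs to be automated, the listing can be parsed and compared with the languages a script requires. This is a minimal sketch: `parse_list_langs` and `missing_langs` are hypothetical helpers, and the parser assumes the output starts with a header line beginning "List of available languages", as above.

```python
from typing import Iterable, List


def parse_list_langs(output: str) -> List[str]:
    """Turn the text printed by `tesseract --list-langs` into a list of names."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line, e.g. "List of available languages (6):"
    if lines and lines[0].lower().startswith("list of available languages"):
        lines = lines[1:]
    return lines


def missing_langs(output: str, required: Iterable[str]) -> List[str]:
    """Return the required language names that are absent from the listing."""
    available = set(parse_list_langs(output))
    return [lang for lang in required if lang not in available]


# Sample text copied from the listing shown above
sample = """List of available languages (6):
Japanese
Japanese_vert
eng
jpn
jpn_vert
osd
"""
print(parse_list_langs(sample))
print(missing_langs(sample, ["jpn", "jpn_vert"]))
```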

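Running Tesseract

The first argument is the image file name and the second is the base name of the file that receives the OCR result; the .txt extension is appended automatically.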

$ tesseract image.png ocr_out -l jpn
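
The same command can be driven from Python with the standard `subprocess` module, which also makes the output file naming explicit. This is only a sketch: `build_tesseract_command` and `run_ocr` are hypothetical helper names, and `run_ocr` assumes the `tesseract` binary is on PATH and that the image file exists.

```python
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image: str, output_base: str, lang: str = "jpn") -> List[str]:
    """Build the argument list for: tesseract <image> <output_base> -l <lang>."""
    return ["tesseract", image, output_base, "-l", lang]


def run_ocr(image: str, output_base: str, lang: str = "jpn") -> str:
    """Run Tesseract and return the recognized text from <output_base>.txt."""
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Tesseract appends .txt to the output base name
    return Path(output_base + ".txt").read_text(encoding="utf-8")


# Show the command that would be executed, without running it
print(" ".join(build_tesseract_command("image.png", "ocr_out")))
```

Calling `run_ocr("image.png", "ocr_out")` would then write ocr_out.txt next to the notebook and return its contents.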

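Running Tesseract from Python

Installing pytesseract

$ pip install pytesseract

Running pytesseract

from PIL import Image  # Pillow provides the Image module
import pytesseract

FILE_NAME = './image.jpg'

# Run OCR with the Japanese language data and print the recognized text
print(pytesseract.image_to_string(Image.open(FILE_NAME), lang='jpn'))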
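
The `lang` argument also accepts several languages joined with `+`, and Tesseract command-line options such as `--psm` (page segmentation mode) can be passed through the `config` argument. The sketch below wraps both. The helper names `join_langs`, `build_config`, and `ocr_file` are hypothetical, and the default of `--psm 6` (treat the image as a single uniform block of text) is an assumption for illustration, not a recommendation from this article.

```python
from typing import Iterable


def join_langs(langs: Iterable[str]) -> str:
    """Join language codes with '+', the form Tesseract accepts for multiple languages."""
    return "+".join(langs)


def build_config(psm: int = 6) -> str:
    """Build a config string selecting a page segmentation mode (0 to 13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def ocr_file(path: str, langs: Iterable[str] = ("jpn",), psm: int = 6) -> str:
    """Run OCR on an image file with the given languages and segmentation mode."""
    # Imported lazily so the helpers above work without the OCR stack installed
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(
            img, lang=join_langs(langs), config=build_config(psm)
        )


print(join_langs(["jpn", "eng"]), build_config(6))
```

For example, `ocr_file('./image.jpg', langs=['jpn', 'eng'])` would recognize mixed Japanese and English text, provided both language packs are installed.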

Bonus

Grayscale conversion with OpenCV

Installing OpenCV

$ pip install opencv-python
$ sudo apt-get install -y libgl1-mesa-dev

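Grayscale conversion

import cv2
import pytesseract
from PIL import Image

# Load the image (OpenCV reads it in BGR channel order)
im = cv2.imread('./image.jpeg')
# Convert to a single-channel grayscale image and save it
im_gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
cv2.imwrite('./image_gray.jpeg', im_gray)
# Run OCR on the grayscale image
print(pytesseract.image_to_string(Image.open('./image_gray.jpeg'), lang='jpn'))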
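
For reference, the BGR-to-gray conversion is a weighted sum of the three channels, Y = 0.299 R + 0.587 G + 0.114 B. The sketch below applies that formula to a single pixel in plain Python. The function name `bgr_to_gray` is hypothetical, and because OpenCV uses its own internal arithmetic, its result for a given pixel may differ from this by one gray level.

```python
def bgr_to_gray(b: int, g: int, r: int) -> int:
    """Convert one 8-bit BGR pixel to a gray level with the standard luma weights."""
    # Arguments are in B, G, R order to match how OpenCV stores pixels
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


# Green contributes most to perceived brightness, blue the least
print(bgr_to_gray(0, 255, 0))
print(bgr_to_gray(255, 0, 0))
```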
