How to use OCR in JavaScript with Tesseract.js?
December 23, 2019
While making node-html-to-image I came across a small problem: how could I test that it actually works? node-html-to-image is a Node.js module that generates images (PNG, JPEG) from HTML. If you want to learn more about it, I wrote a small article about this module. The simplest test I could imagine to ensure it works was creating an image from an HTML string containing "Hello world!". Then I could check that the image really contains this string using OCR.
What is OCR?
OCR stands for optical character recognition. This technology lets you extract text from an image, whether the text is handwritten or printed. OCR involves a lot of complex steps to actually get text from an image, but that isn't the purpose of this article. You can learn more by reading its Wikipedia article.
We will focus on how to use it with the most popular open source OCR engine, Tesseract. As a library it is available for C/C++ developers. Fortunately, a JavaScript port exists: Tesseract.js.
Installation
Unlike node-tesseract-ocr, Tesseract.js doesn't need you to install anything on your computer. node-tesseract-ocr is only a wrapper around tesseract, so you need to install tesseract and tesseract-lang on your computer, while Tesseract.js downloads languages and core scripts on the go. This also means that it doesn't work offline.
The only thing you need to do is install the tesseract.js npm package using your favorite package manager:
# With yarn
yarn add tesseract.js

# With npm
npm install tesseract.js
How to use
Here is the image we will try to extract text from.
Let's go through it step by step.
First of all, we need to import the createWorker function.
const { createWorker } = require('tesseract.js')
We call it to create a new Tesseract worker, which is a Child Process in Node.js and a Web Worker in the browser (yes, Tesseract.js also works in the browser).
const worker = createWorker()
A worker instance has several methods. The first one we need to call is the load function. It loads the core scripts and prepares the Tesseract worker for what's coming next.
// ...
async function getTextFromImage() {
  await worker.load()
}
Then, we need to load the language of the text in our image. We can achieve this with the loadLanguage method. It will download a file with trained data for the language. In our example, it will be a file called eng.traineddata. We can load more than one language using the + character (e.g. eng+fra).
// ...
async function getTextFromImage() {
  await worker.load()
  await worker.loadLanguage('eng')
}
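When the list of languages comes from configuration, the +-separated string can be built from an array. The sketch below is a small illustration; toLangString is a hypothetical helper and not part of Tesseract.js.

```javascript
// Hypothetical helper (not part of Tesseract.js): joins Tesseract
// language codes into the "+"-separated form, e.g. ['eng', 'fra'] -> 'eng+fra'.
function toLangString(langs) {
  if (!Array.isArray(langs) || langs.length === 0) {
    throw new Error('Expected a non-empty array of language codes')
  }
  return langs.join('+')
}
```

The result could then be passed to the worker, for example worker.loadLanguage(toLangString(['eng', 'fra'])).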
Time to make our worker ready to do OCR tasks. We do it with the initialize method. It takes the languages we want to use as a parameter, which can be a subset of the languages we loaded earlier.
// ...
async function getTextFromImage() {
  await worker.load()
  await worker.loadLanguage('eng')
  await worker.initialize('eng')
}
Let's do OCR! Our worker has a recognize method that takes an image as a parameter. It can be a URL, a path on the file system or a buffer. It returns an object with a data property, which is itself an object whose text property contains the final result.
// ...
async function getTextFromImage() {
  await worker.load()
  await worker.loadLanguage('eng')
  await worker.initialize('eng')
  const { data: { text } } = await worker.recognize('./hello-world.png')
}
As a last step, we need to clean up our worker using the terminate method.
// ...
async function getTextFromImage() {
  await worker.load()
  await worker.loadLanguage('eng')
  await worker.initialize('eng')
  const { data: { text } } = await worker.recognize('./hello-world.png')
  await worker.terminate()
  return text
}
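If recognition throws (for example because the image path is wrong), terminate is never reached in the function above and the worker may stay alive. One possible variation, sketched below, wraps the steps in try/finally. The name recognizeAndCleanUp and the idea of passing the worker in as an argument are choices made for this illustration; it only assumes the worker exposes the same methods used in this article.

```javascript
// Sketch only: `worker` is any object exposing the methods used in this
// article (load, loadLanguage, initialize, recognize, terminate).
async function recognizeAndCleanUp(worker, image, lang = 'eng') {
  try {
    await worker.load()
    await worker.loadLanguage(lang)
    await worker.initialize(lang)
    const { data: { text } } = await worker.recognize(image)
    return text
  } finally {
    // Runs whether recognition succeeded or threw
    await worker.terminate()
  }
}
```

Passing the worker as a parameter also makes it possible to exercise the function with a fake worker, without downloading any language data.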
Let's test it! We call getTextFromImage and print the result to the output.
getTextFromImage().then(console.log)
When you run your script, you should get the following result.
~ ❯ node tesseract.js
HELLO WORLD!
Nice, but it didn't find all the text from our image...
By default, the worker runs in SINGLE_BLOCK mode. A worker instance has a setParameters method that lets you change Tesseract's default behavior. In our case we want to change the value of the tessedit_pageseg_mode parameter. Before doing so, we need to import the PSM enumeration (PSM is an acronym for page segmentation mode).
const { createWorker } = require('tesseract.js')
const PSM = require('tesseract.js/src/constants/PSM.js')
// ...
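To give an idea of what the enumeration holds, here is a local sketch of a few entries. The values are assumed to follow Tesseract's numeric page segmentation modes, written as strings. PSM_SKETCH and describeMode are names made up for this illustration; in real code, import the enumeration from tesseract.js as shown above.

```javascript
// Local sketch, NOT the real enumeration: a few page segmentation modes,
// with values assumed to match Tesseract's numeric mode codes.
const PSM_SKETCH = {
  AUTO: '3',          // fully automatic page segmentation, no orientation detection
  SINGLE_COLUMN: '4', // a single column of text of variable sizes
  SINGLE_BLOCK: '6',  // a single uniform block of text
  SINGLE_LINE: '7',   // a single text line
  SINGLE_WORD: '8',   // a single word
  SPARSE_TEXT: '11',  // as much text as possible, in no particular order
}

// Hypothetical helper: returns the mode name for a numeric or string value.
function describeMode(value) {
  const entry = Object.entries(PSM_SKETCH).find(([, v]) => v === String(value))
  return entry ? entry[0] : 'UNKNOWN'
}
```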
Finally, we call the setParameters method with the wanted mode. For this example, we will use AUTO mode and let the engine find all lines.
// ...
async function getTextFromImage() {
  await worker.load()
  await worker.loadLanguage('eng')
  await worker.initialize('eng')
  await worker.setParameters({
    tessedit_pageseg_mode: PSM.AUTO,
  })
  const { data: { text } } = await worker.recognize('./hello-world.png')
  await worker.terminate()
  return text
}
// ...
Running the script again should give a different result.
~ ❯ node tesseract.js
HELLO WORLD!
made with € by node-html-to-image
As you can see, it finds the whole text. It seems to have difficulty identifying the emoji character, but it is a pretty impressive result.
Here is the final code:
const { createWorker } = require('tesseract.js')
const PSM = require('tesseract.js/src/constants/PSM.js')

// Create the worker used by getTextFromImage
const worker = createWorker()

async function getTextFromImage() {
  await worker.load()
  await worker.loadLanguage('eng')
  await worker.initialize('eng')
  await worker.setParameters({
    tessedit_pageseg_mode: PSM.AUTO,
  })
  const { data: { text } } = await worker.recognize('./hello-world.png')
  await worker.terminate()
  return text
}

getTextFromImage().then(console.log)
There are a lot more examples in the Tesseract.js documentation, with extra features like:
- progress
- multi languages
- character whitelist
- And more...
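As a rough sketch of two of these features: progress messages can be received by passing a logger function when creating the worker, and a character whitelist can be set through the tessedit_char_whitelist parameter. The helpers formatProgress and recognizeWithWhitelist below are hypothetical names for this illustration, and the shape of the logger message (a status string and a progress value between 0 and 1) is an assumption to check against the documentation.

```javascript
// Turns a logger message into a short progress line.
// Assumes messages carry `status` (string) and `progress` (0 to 1).
function formatProgress(message) {
  const percent = Math.round(message.progress * 100)
  return `${message.status}: ${percent}%`
}

// Restricts recognition to the given characters via the
// tessedit_char_whitelist parameter. `worker` must already be initialized.
async function recognizeWithWhitelist(worker, image, allowedChars) {
  await worker.setParameters({ tessedit_char_whitelist: allowedChars })
  const { data: { text } } = await worker.recognize(image)
  return text
}
```

The formatter could be plugged in with createWorker({ logger: m => console.log(formatProgress(m)) }), and the second helper called with '0123456789' to read digits only.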
If you are curious to see how I tested node-html-to-image, you can find the source here.
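A check of this kind has to cope with line breaks and letter case in the OCR output. The sketch below shows one way to compare recognized text against an expected string. It is not the actual test from the node-html-to-image repository; normalize and ocrContains are hypothetical helpers.

```javascript
// Hypothetical helper: lowercases and collapses whitespace so that line
// breaks and letter case in the OCR output do not break the comparison.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, ' ').trim()
}

// Returns true when the expected string appears in the recognized text.
function ocrContains(ocrText, expected) {
  return normalize(ocrText).includes(normalize(expected))
}
```

In a test, the text returned by getTextFromImage could be passed to ocrContains together with 'Hello world!'.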
You're set 🙌 Hope it will help you!
Feedback or ideas are appreciated 🙏 Please tweet me if you have questions @YvonnickFrin!