CAPTCHA Recognition

This post uses Google's Tesseract OCR engine to recognize CAPTCHAs.

Installing Tesseract

To install Tesseract, follow the guide "Install Tesseract via pre-built binary package".

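Recognizing a CAPTCHA

Recognition proceeds in the following steps:

Preprocess the CAPTCHA image:
1. Remove the background. Some CAPTCHAs have a solid-color background, which can be stripped first.
2. Remove noise and interference lines. Some interference lines are only 1 pixel wide, so they can be removed the same way as noise specks.
3. Binarize the image, converting it to pure black and white.
4. Segment the image into small single-character images. This step is optional.
Then recognize the preprocessed CAPTCHA.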
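
The preprocessing steps above can be sketched in plain Python on a grayscale image stored as a list of rows of 0-255 values. This is only an illustration under stated assumptions, not the exact method used in this post: the function names, the fixed threshold of 128, and the neighbor-count rule are all hypothetical choices, and a real pipeline would load the pixels with an imaging library.

```python
# Hypothetical helpers; an image is a list of rows of grayscale values (0-255).

def binarize(gray, threshold=128):
    """Map each pixel to 1 (ink) if darker than threshold, else 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def denoise(binary, min_neighbors=1):
    """Clear ink pixels with fewer than min_neighbors ink pixels among their 8 neighbors."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    result = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += binary[ny][nx]
            # Counts come from the input image, so removals do not cascade.
            if neighbors < min_neighbors:
                result[y][x] = 0
    return result


def split_columns(binary):
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    width = len(binary[0]) if binary else 0
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    ranges, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            ranges.append((start, x))
            start = None
    if start is not None:
        ranges.append((start, width))
    return ranges
```

With a light solid background, the threshold in binarize already removes the background. The default min_neighbors=1 only clears isolated specks. In this sketch, raising it to 3 also clears straight 1-pixel-wide lines, at the risk of eroding thin character strokes. split_columns only separates characters that have an empty column between them.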

Training a Font Library

To improve the recognition rate, we first need to train Tesseract and generate our own font library.

We use the jTessBoxEditor tool for this.

First, set some environment variables:

0. Prepare the CAPTCHA images

Prepare a set of preprocessed CAPTCHA images, preferably more than 100.

1. Merge the images

Open jTessBoxEditor and use Tools -> Merge TIFF to merge the CAPTCHA images. Name the merged image eng.captcha.exp0.tif.

2. Generate the box file

tesseract eng.captcha.exp0.tif eng.captcha.exp0 -l captcha --psm 7 batch.nochop makebox

3. Edit the box file

Open the box file in jTessBoxEditor and correct its contents.
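
For orientation while editing: each line of a Tesseract box file has the form `<character> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. The sketch below parses one such line. The BoxEntry type and parse_box_line helper are hypothetical names introduced for illustration, and the sample line is made up.

```python
from typing import NamedTuple


class BoxEntry(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> BoxEntry:
    """Parse one box-file line: '<char> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the five numeric fields are peeled off first.
    char, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return BoxEntry(char, int(left), int(bottom), int(right), int(top), int(page))


# Made-up sample line: the character 'A' on page 0.
entry = parse_box_line("A 12 4 30 28 0")
print(entry.char, entry.right - entry.left, entry.top - entry.bottom)
```

A parser like this can help spot suspicious boxes, such as ones with zero width or height, before training.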

4. Generate font_properties

echo captcha 0 0 0 0 0 > font_properties
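
Each line of font_properties has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, where every flag is 0 or 1. The font name should match the middle part of the training file name (captcha in eng.captcha.exp0.tif). The sketch below builds such a line; font_properties_line is a hypothetical helper name.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one line: '<font> <italic> <bold> <fixed> <serif> <fraktur>'."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


# Same content as the echo command above, with all five flags off.
print(font_properties_line("captcha"))
```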

5. Generate the training file

tesseract eng.captcha.exp0.tif eng.captcha.exp0 -l captcha --psm 7 nobatch box.train

6. Generate the character set file

unicharset_extractor eng.captcha.exp0.box

7. Generate the shape file

shapeclustering -F font_properties -U unicharset -O eng.unicharset eng.captcha.exp0.tr

8. Generate the clustered character feature files

mftraining -F font_properties -U unicharset -O eng.unicharset eng.captcha.exp0.tr

9. Generate the character normalization feature file

cntraining eng.captcha.exp0.tr

10. Rename the files

rename normproto captcha.normproto
rename inttemp captcha.inttemp
rename pffmtable captcha.pffmtable
rename unicharset captcha.unicharset
rename shapetable captcha.shapetable
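
The rename commands above are for the Windows command prompt. A portable sketch of the same step in Python follows. The add_prefix helper and the TRAINING_OUTPUTS constant are hypothetical names, and the sketch assumes all five files are in the given directory.

```python
from pathlib import Path

# The five files renamed in this step.
TRAINING_OUTPUTS = ("normproto", "inttemp", "pffmtable", "unicharset", "shapetable")


def add_prefix(directory, prefix="captcha", names=TRAINING_OUTPUTS):
    """Rename each file '<name>' to '<prefix>.<name>' inside directory."""
    renamed = []
    for name in names:
        source = Path(directory) / name
        target = source.with_name(f"{prefix}.{name}")
        source.rename(target)  # raises FileNotFoundError if the file is missing
        renamed.append(target)
    return renamed
```

Calling add_prefix(".") from the training directory performs the five renames in one go.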

11. Merge the training files to generate captcha.traineddata

combine_tessdata captcha.
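
Once captcha.traineddata is available to Tesseract (for example, copied into its tessdata directory), a preprocessed image can be recognized with the same -l captcha --psm 7 options used during training. The following is a minimal sketch that calls the tesseract command-line tool. The helper names are hypothetical, and it assumes tesseract is on the PATH.

```python
import subprocess


def build_command(image_path, lang="captcha", psm=7, tessdata_dir=None):
    """Build the tesseract argument list; 'stdout' sends the text to standard output."""
    command = ["tesseract", str(image_path), "stdout"]
    if tessdata_dir is not None:
        # Optional: point Tesseract at the directory holding captcha.traineddata.
        command += ["--tessdata-dir", str(tessdata_dir)]
    command += ["-l", lang, "--psm", str(psm)]
    return command


def recognize(image_path, **options):
    """Run tesseract on one image and return the text with all whitespace removed."""
    completed = subprocess.run(
        build_command(image_path, **options),
        capture_output=True, text=True, check=True,
    )
    # Whitespace is dropped on the assumption that a CAPTCHA contains none.
    return "".join(completed.stdout.split())
```

For example, recognize("captcha.png") would return the recognized characters for a placeholder file named captcha.png. Page segmentation mode 7 treats the image as a single line of text, which suits an unsegmented CAPTCHA.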