From Scratch: A Step-by-Step Guide to Building an Efficient OCR Document Recognition System

Author: @小白创作中心

Source: CSDN

https://m.blog.csdn.net/weixin_43413871/article/details/146135078

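This article walks you through building an efficient OCR (optical character recognition) document recognition system from scratch. You will learn the basics of OCR, the implementation steps, and performance tuning techniques. Code examples are included throughout to help you get started quickly.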

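Introduction to OCR

What is OCR?
Definition: Optical character recognition (OCR) is a technology that converts text in images into editable text.
Use cases: document digitization, license plate recognition, receipt and invoice processing, and more.
How OCR works: four main stages, namely image preprocessing, text detection, text recognition, and post-processing.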

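Common Tools and Libraries for OCR

Tesseract OCR:
Features: an open-source, cross-platform OCR engine that supports many languages.
Installation and configuration: see the environment setup section.
Pytesseract:
Purpose: a Python wrapper for Tesseract that makes it easy to integrate into Python projects.
OpenCV:
Purpose: image preprocessing (grayscale conversion, binarization, denoising, and so on).
Other tools: commercial alternatives include Google Cloud Vision API and AWS Textract.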

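Environment Setup

Install the dependencies:
Install the Tesseract OCR engine.
Install the Python libraries:
pytesseract
opencv-python
Verify the installation:
Run a simple OCR test to confirm the environment is configured correctly.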
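
One way to check the setup before running any OCR is sketched below using only the standard library. The helper name `check_ocr_environment` is hypothetical (it is not from the original article). It assumes the Python packages were installed with something like `pip install pytesseract opencv-python`, and that the Tesseract engine itself was installed separately through your operating system's package manager or installer.

```python
import importlib.util
import shutil


def check_ocr_environment():
    """Report which OCR dependencies are discoverable (hypothetical helper)."""
    return {
        # The Tesseract engine is a separate executable, not a Python package.
        "tesseract_binary": shutil.which("tesseract") is not None,
        # find_spec returns None when a top-level module is not installed.
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "opencv": importlib.util.find_spec("cv2") is not None,
    }


if __name__ == "__main__":
    for name, found in check_ocr_environment().items():
        print(f"{name}: {'OK' if found else 'MISSING'}")
```

If everything is reported as found, calling `pytesseract.get_tesseract_version()` is a further check that the Python wrapper can actually launch the engine.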

Building the OCR Document Recognizer Step by Step

Step 1: Load the image

Read the image file with OpenCV.
Example code:

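import cv2
# Load the image; cv2.imread returns None if the file cannot be read
image = cv2.imread('example.jpg')
if image is None:
    raise FileNotFoundError("Could not read example.jpg")
cv2.imshow('Original Image', image)
cv2.waitKey(0)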

Step 2: Preprocess the image

Apply grayscale conversion, binarization, denoising, and similar operations.
Example code:

# Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Binarize with a fixed threshold of 128
_, binary = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)
# Show the processed image
cv2.imshow('Binary Image', binary)
cv2.waitKey(0)
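
A fixed threshold of 128 works only when lighting and contrast happen to suit it. Otsu's method instead derives the threshold from the image histogram by maximizing the variance between the dark and bright classes. The pure-Python sketch below illustrates the idea on a flat list of gray levels; the name `otsu_threshold` is illustrative, and in practice OpenCV computes the same kind of threshold when `cv2.THRESH_OTSU` is added to the flags of `cv2.threshold`.

```python
def otsu_threshold(pixels):
    """Return the threshold in [0, 255] that maximizes between-class variance.

    `pixels` is a flat iterable of integer gray levels (0-255).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg = 0  # number of pixels at or below the candidate threshold
    sum_bg = 0     # sum of the gray levels of those pixels
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue  # no dark pixels yet, nothing to separate
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


# Usage: binarize the same way cv2.THRESH_BINARY does (strictly greater -> white)
sample = [50] * 100 + [200] * 100
t = otsu_threshold(sample)
binary_sample = [255 if p > t else 0 for p in sample]
```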

Step 3: Run Tesseract for text recognition

Extract the text with pytesseract.
Example code:

import pytesseract
# Set the Tesseract path if needed (Windows example; omit this line when tesseract is on PATH)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Extract the text
text = pytesseract.image_to_string(binary)
print("Recognized text:", text)
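
By default `image_to_string` recognizes English only. Its `lang` and `config` arguments select the language data and the page segmentation mode. The sketch below wraps them in two hypothetical helpers, `build_tesseract_args` and `recognize`, which are not part of pytesseract. It assumes the matching language data (for example `chi_sim`) is installed alongside Tesseract.

```python
def build_tesseract_args(languages=("eng",), psm=3):
    """Build the `lang` and `config` strings for pytesseract (hypothetical helper).

    languages: Tesseract language codes, e.g. ("chi_sim", "eng").
    psm: page segmentation mode, an integer from 0 to 13.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    # Tesseract joins multiple languages with '+'.
    return "+".join(languages), f"--psm {psm}"


def recognize(image, languages=("eng",), psm=3):
    """Run OCR on an image (NumPy array or PIL image) with the chosen settings."""
    import pytesseract  # imported lazily so the helper above works without it

    lang, config = build_tesseract_args(languages, psm)
    return pytesseract.image_to_string(image, lang=lang, config=config)
```

For a scanned page mixing Chinese and English in a single block of text, a call might look like `recognize(binary, languages=("chi_sim", "eng"), psm=6)`.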

Step 4: Post-processing

Remove extra whitespace, fix punctuation errors, and so on.
Example code:

# Collapse all runs of whitespace (including line breaks) into single spaces
cleaned_text = ' '.join(text.split())
print("Cleaned text:", cleaned_text)
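
The one-liner above also flattens line breaks, which loses the layout of receipts and forms. A slightly more careful cleanup is sketched here with the standard `re` module; `clean_ocr_text` is a hypothetical helper, and the punctuation rule is a simple assumption that may need adjusting for your documents.

```python
import re


def clean_ocr_text(text):
    """Normalize whitespace in OCR output while keeping line breaks."""
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse runs of spaces and tabs into a single space.
        line = re.sub(r"[ \t]+", " ", line).strip()
        # Drop a stray space before common punctuation marks.
        line = re.sub(r" ([,.;:!?])", r"\1", line)
        if line:  # skip lines that are empty after cleaning
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


print(clean_ocr_text("Total :  12.50 \n\n  Thank   you !\n"))
```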

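Performance Optimization and Common Issues

Performance optimization:
Use GPU acceleration where available; note that Tesseract runs mainly on the CPU and its OpenCL support has been experimental, so gains are not guaranteed.
Tune the image resolution and preprocessing parameters.
Common issues and fixes:
Poor image quality leads to low recognition accuracy.
Handling text that mixes multiple languages.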
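
On the resolution point: Tesseract's documentation suggests that input of roughly 300 DPI or more tends to work best, so low-resolution scans are often upscaled before recognition. The hypothetical helper `target_size_for_dpi` below computes the new dimensions, assuming you know (or can estimate) the DPI of the source image.

```python
def target_size_for_dpi(width, height, current_dpi, target_dpi=300):
    """Return (new_width, new_height) that brings an image up to target_dpi.

    Images already at or above target_dpi are left at their original size.
    """
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    scale = max(1.0, target_dpi / current_dpi)
    return round(width * scale), round(height * scale)


# A 600x800 scan at 150 DPI needs to be doubled in each dimension.
print(target_size_for_dpi(600, 800, current_dpi=150))
```

The result can be passed straight to OpenCV, which expects the size as (width, height): `cv2.resize(image, target_size_for_dpi(w, h, 150), interpolation=cv2.INTER_CUBIC)`.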

Complete Project Code

import numpy as np
import cv2

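def order_points(pts):
    # Order the corners as indices 0-3: top-left, top-right, bottom-right, bottom-left
    rect = np.zeros((4, 2), dtype="float32")
    s = pts.sum(axis=1)
    rect[0] = pts[np.argmin(s)]  # top-left has the smallest x + y
    rect[2] = pts[np.argmax(s)]  # bottom-right has the largest x + y
    diff = np.diff(pts, axis=1)  # y - x for each point
    rect[1] = pts[np.argmin(diff)]  # top-right has the smallest y - x
    rect[3] = pts[np.argmax(diff)]  # bottom-left has the largest y - x
    return rect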

def four_point_transform(image, pts):
    # Order the input corner points before applying the perspective transform
    rect = order_points(pts)
    (tl, tr, br, bl) = rect
    # Compute the width and height of the output image
    width_a = np.sqrt(((br[0] - bl[0]) ** 2) + ((br[1] - bl[1]) ** 2))
    width_b = np.sqrt(((tr[0] - tl[0]) ** 2) + ((tr[1] - tl[1]) ** 2))
    max_width = max(int(width_a), int(width_b))
    height_a = np.sqrt(((tr[0] - br[0]) ** 2) + ((tr[1] - br[1]) ** 2))
    height_b = np.sqrt(((tl[0] - bl[0]) ** 2) + ((tl[1] - bl[1]) ** 2))
    max_height = max(int(height_a), int(height_b))
    # Define the four corners of the destination image
    dst = np.array([
        [0, 0],
        [max_width - 1, 0],
        [max_width - 1, max_height - 1],
        [0, max_height - 1]], dtype="float32")
    # Compute the transform matrix and apply the perspective warp
    m = cv2.getPerspectiveTransform(rect, dst)
    warped = cv2.warpPerspective(image, m, (max_width, max_height))
    return warped

def resize_image(image, width=None, height=None, interpolation=cv2.INTER_AREA):
    # Resize the image to the given width or height, preserving the aspect ratio
    dim = None
    h, w = image.shape[:2]
    if width is None and height is None:
        return image
    if width is None:
        r = height / float(h)
        dim = (int(w * r), height)
    else:
        r = width / float(w)
        dim = (width, int(h * r))
    resized = cv2.resize(image, dim, interpolation=interpolation)
    return resized

def preprocess_image(image):
    # Preprocess the image: grayscale conversion, Gaussian blur, and edge detection
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)
    edged = cv2.Canny(gray, 75, 200)
    return gray, edged

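def find_contours(edged):
    # Find contours in the edge map and keep the five largest by area
    # [-2] selects the contour list in both the OpenCV 3.x and 4.x return formats
    contours = cv2.findContours(edged.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)[-2]
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:5]
    return contours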

def get_screen_contour(contours):
    # Return the first (largest) contour that approximates to a four-point polygon
    for contour in contours:
        perimeter = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True)
        if len(approx) == 4:
            return approx
    return None

def main(image_path):
    # Load the image and downscale a working copy to speed up edge detection
    original_image = cv2.imread(image_path)
    if original_image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    ratio = original_image.shape[0] / 500.0
    resized_image = resize_image(original_image, height=500)
    # Preprocess the image
    gray_image, edged_image = preprocess_image(resized_image)
    print("STEP 1: Edge Detection")
    cv2.imshow("Image", resized_image)
    cv2.imshow("Edged", edged_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
    # Find the document outline; stop early if no four-point contour exists
    contours = find_contours(edged_image)
    screen_contour = get_screen_contour(contours)
    if screen_contour is None:
        print("Could not find document edges.")
        return
    print("STEP 2: Find Contours")
    cv2.drawContours(resized_image, [screen_contour], -1, (0, 255, 0), 2)
    cv2.imshow("Outline", resized_image)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
    # Apply the perspective transform to the full-size image and save the result
    transformed_image = four_point_transform(original_image, screen_contour.reshape(4, 2) * ratio)
    binary_image = cv2.cvtColor(transformed_image, cv2.COLOR_BGR2GRAY)
    binary_ref = cv2.threshold(binary_image, 100, 255, cv2.THRESH_BINARY)[1]
    cv2.imwrite('data/scan.jpg', binary_ref)
    print("STEP 3: Perspective Transform")
    cv2.imshow("Original", resize_image(original_image, height=650))
    cv2.imshow("Scanned", resize_image(binary_ref, height=650))
    cv2.waitKey(0)
    cv2.destroyAllWindows()

if __name__ == "__main__":
    main("data/receipt.jpg")
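
The corner ordering in `order_points` above relies on a small trick: in image coordinates, the top-left corner has the smallest x + y and the bottom-right the largest, while the top-right has the smallest y - x and the bottom-left the largest. The pure-Python sketch below reproduces that logic on plain tuples so it can be checked by hand; `order_corners` is an illustrative name, not part of the project code.

```python
def order_corners(points):
    """Order four (x, y) corners as top-left, top-right, bottom-right, bottom-left."""
    top_left = min(points, key=lambda p: p[0] + p[1])      # smallest x + y
    bottom_right = max(points, key=lambda p: p[0] + p[1])  # largest x + y
    top_right = min(points, key=lambda p: p[1] - p[0])     # smallest y - x
    bottom_left = max(points, key=lambda p: p[1] - p[0])   # largest y - x
    return [top_left, top_right, bottom_right, bottom_left]


# Corners of a slightly skewed page, given in arbitrary order
corners = [(210, 190), (20, 15), (15, 200), (205, 10)]
print(order_corners(corners))
```

This ordering assumes the document is not rotated close to 45 degrees; at that angle two corners can tie on the sum or the difference.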

The preprocessing above cleans up an irregular, noisy photo, which can effectively improve the accuracy of the OCR step that follows.

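from PIL import Image
import pytesseract
import cv2
import os

# Preprocessing mode: 'blur' (median blur) or 'thresh' (Otsu thresholding)
preprocess = 'blur'
image = cv2.imread('data/scan.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
if preprocess == "thresh":
    # Otsu's method chooses the threshold automatically, so 0 is a placeholder
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
elif preprocess == "blur":
    # A 3x3 median blur removes salt-and-pepper noise
    gray = cv2.medianBlur(gray, 3)
# Write the preprocessed image to a temporary file, run OCR on it, then clean up
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)
text = pytesseract.image_to_string(Image.open(filename))
print(text)
os.remove(filename)
# Show the input and the preprocessed image
cv2.imshow("Image", image)
cv2.imshow("Output", gray)
cv2.waitKey(0)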