PDF Viewer

Author: @小白创作中心

Source: https://docs.pingcode.com/baike/2324301

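Parsing PDF files is a common requirement in web development. This article introduces three commonly used JavaScript libraries, PDF.js, PDF-LIB, and pdf2json, to help developers pick the right tool for each scenario.

The PDF.js library

PDF.js is a popular JavaScript library for parsing and rendering PDF files in the browser. It is developed by Mozilla and is widely used in web applications.

Installation and setup

To use PDF.js, first add it to your project. It can be installed with: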

npm install pdfjs-dist

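Then include it in an HTML page (the path below matches the pdfjs-dist 2.x and 3.x builds; 4.x ships ES modules such as pdf.mjs instead):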

<script src="path/to/pdfjs-dist/build/pdf.js"></script>

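Basic usage

To work with a PDF, first load the document, then fetch a page and render it to a canvas:

// The script build of pdfjs-dist 2.x/3.x exposes the library under this global name.
const pdfjsLib = window['pdfjs-dist/build/pdf'];
// Parsing runs in a web worker; tell PDF.js where the worker script lives.
pdfjsLib.GlobalWorkerOptions.workerSrc = 'path/to/pdfjs-dist/build/pdf.worker.js';

const loadingTask = pdfjsLib.getDocument('path/to/pdf');
loadingTask.promise.then(function (pdf) {
  console.log('PDF loaded');
  // Page numbers start at 1.
  return pdf.getPage(1);
}).then(function (page) {
  console.log('Page loaded');
  const scale = 1.5;
  const viewport = page.getViewport({ scale: scale });
  // Size the canvas to match the scaled page.
  const canvas = document.getElementById('the-canvas');
  const context = canvas.getContext('2d');
  canvas.height = viewport.height;
  canvas.width = viewport.width;
  const renderContext = {
    canvasContext: context,
    viewport: viewport
  };
  return page.render(renderContext).promise;
}).then(function () {
  console.log('Page rendered');
}).catch(function (err) {
  console.error('Failed to load or render the PDF:', err);
});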

Extracting text content

To extract the text on a PDF page, call the getTextContent method on the page object:

page.getTextContent().then(function(textContent) {
  textContent.items.forEach(function(item) {  
    console.log(item.str);  
  });  
});
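The items returned by getTextContent are individual text runs, not whole lines. A minimal sketch of joining them into plain text follows. It assumes a pdfjs-dist version whose text items carry the hasEOL flag and include whitespace runs; with older versions, words may run together. textContentToString is a hypothetical helper name, not part of PDF.js, and the sample data is made up.

```javascript
// Hypothetical helper (not part of PDF.js): join the text runs of one page.
// Assumes each item has `str` and, in recent pdfjs-dist versions, `hasEOL`.
function textContentToString(textContent) {
  let out = '';
  for (const item of textContent.items) {
    // Marked-content entries have no `str`; skip them.
    if (typeof item.str !== 'string') continue;
    out += item.str;
    // `hasEOL` marks a run that is followed by a line break.
    if (item.hasEOL) out += '\n';
  }
  return out;
}

// Made-up sample shaped like a getTextContent() result.
const sampleTextContent = {
  items: [
    { str: 'Hello', hasEOL: false },
    { str: ' ', hasEOL: false },
    { str: 'world', hasEOL: true },
    { type: 'beginMarkedContent' },
    { str: 'Second line', hasEOL: false },
  ],
};
console.log(textContentToString(sampleTextContent));
```

With a real page, it could be called as page.getTextContent().then((tc) => console.log(textContentToString(tc))).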

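The PDF-LIB library

PDF-LIB is another capable JavaScript library, used for creating and modifying PDF files. Unlike PDF.js, it focuses on generating and editing PDFs rather than rendering them.

Installation and setup

PDF-LIB can be installed from npm: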

npm install pdf-lib

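Basic usage

To create a new PDF document, use code like the following (top-level await requires an ES module; otherwise wrap it in an async function):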

import { PDFDocument, rgb } from 'pdf-lib';

const pdfDoc = await PDFDocument.create();  
const page = pdfDoc.addPage([600, 400]);  
page.drawText('Creating PDFs with JavaScript is awesome!', {  
  x: 50,  
  y: 350,  
  size: 30,  
  color: rgb(0, 0.53, 0.71),  
});  
const pdfBytes = await pdfDoc.save();
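The rgb helper takes each channel as a number between 0 and 1, as in rgb(0, 0.53, 0.71) above, rather than the 0 to 255 values common in CSS. A small sketch of converting a CSS-style hex color into that range follows; hexToUnitRgb is a hypothetical helper, not part of pdf-lib.

```javascript
// Hypothetical helper (not part of pdf-lib): convert "#RRGGBB" to channels in [0, 1].
function hexToUnitRgb(hex) {
  const match = /^#?([0-9a-f]{6})$/i.exec(hex);
  if (!match) throw new Error('Expected a 6-digit hex color, got: ' + hex);
  const value = parseInt(match[1], 16);
  return {
    r: ((value >> 16) & 0xff) / 255,
    g: ((value >> 8) & 0xff) / 255,
    b: (value & 0xff) / 255,
  };
}

const color = hexToUnitRgb('#0087B5');
console.log(color.r, color.g.toFixed(2), color.b.toFixed(2)); // prints: 0 0.53 0.71
```

The result can then be passed on as rgb(color.r, color.g, color.b).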

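Modifying an existing PDF

To modify an existing PDF file, load its bytes and draw on one of its pages:

import { PDFDocument, rgb } from 'pdf-lib';

const url = 'path/to/pdf';

// Fetch the existing file as raw bytes.
const existingPdfBytes = await fetch(url).then(res => res.arrayBuffer());
const pdfDoc = await PDFDocument.load(existingPdfBytes);
const pages = pdfDoc.getPages();
const firstPage = pages[0];
// Coordinates are in points, measured from the bottom-left corner of the page.
firstPage.drawText('This is a modification!', {
  x: 50,
  y: 500,
  size: 30,
  color: rgb(1, 0, 0),
});
// save() returns the modified document as a Uint8Array.
const pdfBytes = await pdfDoc.save();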
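Because pdf-lib measures coordinates from the bottom-left corner of the page, y: 500 means 500 points above the bottom edge, and the text falls off the page if the page is shorter than that. The sketch below converts a distance from the top edge into a y value. yFromTop is a hypothetical helper, and the page height is assumed to come from firstPage.getHeight(); the placement is approximate because a font's visible height is not exactly its size.

```javascript
// Hypothetical helper (not part of pdf-lib).
// drawText places the text baseline at `y`, measured up from the bottom edge.
// Subtracting the font size puts the top of the text roughly `top` points
// below the top edge of the page.
function yFromTop(pageHeight, top, fontSize) {
  return pageHeight - top - fontSize;
}

// On a page 842 points tall, 30-point text placed 50 points from the top:
console.log(yFromTop(842, 50, 30)); // prints: 762
```

For the 600 x 400 page created earlier, yFromTop(400, 20, 30) gives 350, the y value used in that example.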

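The pdf2json library

pdf2json is a Node.js library that converts PDF files to JSON. It is useful when you need to analyze a PDF's content in depth or extract data from it.

Installation and setup

pdf2json can be installed from npm: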

npm install pdf2json

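Basic usage

To convert a PDF file to JSON, use code like the following:

const fs = require('fs');
const PDFParser = require("pdf2json");

const pdfParser = new PDFParser();
// Emitted when the file cannot be read or parsed.
pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
// Emitted once the whole document has been parsed.
pdfParser.on("pdfParser_dataReady", pdfData => {
    // The ./pdf2json directory must already exist.
    fs.writeFile("./pdf2json/test.json", JSON.stringify(pdfData), (err) => {
        if (err) {
            console.error(err);
            return;
        }
        console.log("PDF data has been extracted!");
    });
});
pdfParser.loadPDF("path/to/pdf");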

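Parsing the JSON content

The generated JSON file describes the structure and content of the PDF. Walk it to pull out the information you need:

const pdfData = require('./pdf2json/test.json');

// Depending on the pdf2json version, pages sit under formImage (older releases)
// or directly on the top-level object (newer releases).
const pages = (pdfData.formImage || pdfData).Pages;
pages.forEach((page) => {
    page.Texts.forEach((text) => {
        // Text is URI-encoded; R[0] is the first run of the text block.
        console.log(decodeURIComponent(text.R[0].T));
    });
});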

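Choosing the right tool

Rendering PDFs in the browser

If your main need is to display PDF files in the browser, PDF.js is the natural choice: it is built for parsing and rendering, and it is widely used and supported.

Generating and modifying PDFs

For creating and editing PDF files, PDF-LIB is a capable tool. Its API makes it straightforward to add text, images, and other content to a document.

Data extraction and analysis

If your main need is to pull data out of PDF files for analysis, pdf2json is a good option. It converts a PDF into structured JSON that is easy to process further.

Practical examples

A simple in-browser PDF viewer with PDF.js

Start with a simple HTML page: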

<!DOCTYPE html>
<html lang="en">  
<head>  
    <meta charset="UTF-8">  
    <meta name="viewport" content="width=device-width, initial-scale=1.0">  
    <title>PDF Viewer</title>  
    <script src="path/to/pdfjs-dist/build/pdf.js"></script>  
    <style>
        /* Shrink the canvas to fit the window; leaving height unset keeps the page's aspect ratio. */
        #pdfViewer {
            max-width: 100%;
        }
    </style>
</head>  
<body>  
    <canvas id="pdfViewer"></canvas>  
    <script>  
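        const pdfjsLib = window['pdfjs-dist/build/pdf'];
        // Parsing runs in a web worker; tell PDF.js where the worker script lives.
        pdfjsLib.GlobalWorkerOptions.workerSrc = 'path/to/pdfjs-dist/build/pdf.worker.js';
        const url = 'path/to/pdf';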
        pdfjsLib.getDocument(url).promise.then((pdfDoc) => {  
            pdfDoc.getPage(1).then((page) => {  
                const scale = 1.5;  
                const viewport = page.getViewport({ scale: scale });  
                const canvas = document.getElementById('pdfViewer');  
                const context = canvas.getContext('2d');  
                canvas.height = viewport.height;  
                canvas.width = viewport.width;  
                const renderContext = {  
                    canvasContext: context,  
                    viewport: viewport  
                };  
                page.render(renderContext).promise.then(() => {  
                    console.log('Page rendered');  
                });  
            });  
        });  
    </script>  
</body>  
</html>
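The viewer above renders at a fixed scale of 1.5. A sketch of choosing the scale from the available width instead, and rendering extra pixels for high-density screens, follows. fitToWidth is a hypothetical helper, not part of PDF.js; the page size is assumed to come from page.getViewport({ scale: 1 }) and the pixel ratio from window.devicePixelRatio.

```javascript
// Hypothetical helper (not part of PDF.js): pick a render scale so that a page
// fills `cssWidth` CSS pixels on a screen with the given device pixel ratio.
function fitToWidth(pageWidth, pageHeight, cssWidth, pixelRatio = 1) {
  const cssScale = cssWidth / pageWidth;
  // Render more pixels than CSS pixels so the page stays sharp on HiDPI screens.
  const renderScale = cssScale * pixelRatio;
  return {
    scale: renderScale,                                 // for page.getViewport({ scale })
    canvasWidth: Math.floor(pageWidth * renderScale),   // for canvas.width
    canvasHeight: Math.floor(pageHeight * renderScale), // for canvas.height
    cssWidth: Math.floor(cssWidth),                     // for canvas.style.width, in px
    cssHeight: Math.floor(pageHeight * cssScale),       // for canvas.style.height, in px
  };
}

// A 612 x 792 point page shown 918 CSS pixels wide on a 2x display.
console.log(fitToWidth(612, 792, 918, 2));
```

Keeping the canvas bitmap size separate from its CSS size is what lets the page render at full device resolution without appearing larger on screen.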

Generating a PDF with an image and text using PDF-LIB

import { PDFDocument, rgb, StandardFonts } from 'pdf-lib';

async function createPdf() {  
    const pdfDoc = await PDFDocument.create();  
    const page = pdfDoc.addPage([600, 400]);  
    const helveticaFont = await pdfDoc.embedFont(StandardFonts.Helvetica);  
    page.drawText('Creating PDFs with JavaScript is awesome!', {  
        x: 50,  
        y: 350,  
        size: 30,  
        font: helveticaFont,  
        color: rgb(0, 0.53, 0.71),  
    });  
    const jpgUrl = 'https://pdf-lib.js.org/assets/cat_riding_unicorn.jpg';  
    const jpgImageBytes = await fetch(jpgUrl).then(res => res.arrayBuffer());  
    const jpgImage = await pdfDoc.embedJpg(jpgImageBytes);  
    const jpgDims = jpgImage.scale(0.5);  
    page.drawImage(jpgImage, {  
        x: page.getWidth() / 2 - jpgDims.width / 2,  
        y: page.getHeight() / 2 - jpgDims.height / 2,  
        width: jpgDims.width,  
        height: jpgDims.height,  
    });  
    const pdfBytes = await pdfDoc.save();  
    download(pdfBytes, "example.pdf", "application/pdf");  
}  

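// Trigger a browser download of the generated bytes.
function download(data, filename, type) {
    const blob = new Blob([data], { type: type });
    const url = window.URL.createObjectURL(blob);
    const a = document.createElement("a");
    a.style.display = "none";
    a.href = url;
    a.download = filename;
    document.body.appendChild(a);
    a.click();
    // Clean up the temporary link and the object URL.
    document.body.removeChild(a);
    window.URL.revokeObjectURL(url);
}

createPdf().catch((err) => console.error(err));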
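The example scales the image by a fixed factor of 0.5, so a large image can still overflow the 600 x 400 page. A sketch of the arithmetic for shrinking an image to fit inside the page margins while keeping its aspect ratio, then centering it, follows. fitAndCenter is a hypothetical helper, not part of pdf-lib; pdf-lib image objects also offer scaleToFit(width, height) for the sizing step.

```javascript
// Hypothetical helper (not part of pdf-lib): size and position an image so it
// fits inside the page minus a margin, keeps its aspect ratio, and is centered.
function fitAndCenter(imgWidth, imgHeight, pageWidth, pageHeight, margin = 0) {
  const boxWidth = pageWidth - 2 * margin;
  const boxHeight = pageHeight - 2 * margin;
  // Use the tighter of the two ratios; cap at 1 so small images are not enlarged.
  const factor = Math.min(boxWidth / imgWidth, boxHeight / imgHeight, 1);
  const width = imgWidth * factor;
  const height = imgHeight * factor;
  return {
    x: (pageWidth - width) / 2,
    y: (pageHeight - height) / 2,
    width,
    height,
  };
}

// A 1000 x 500 image on a 600 x 400 page with a 50-point margin.
console.log(fitAndCenter(1000, 500, 600, 400, 50));
// prints: { x: 50, y: 75, width: 500, height: 250 }
```

The returned object has the same shape as the options passed to page.drawImage in the example above.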

Parsing a PDF and extracting table data with pdf2json

const fs = require('fs');
const PDFParser = require("pdf2json");  

let pdfParser = new PDFParser();  
pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));  
pdfParser.on("pdfParser_dataReady", pdfData => {  
    const tableData = extractTableData(pdfData);  
    console.log(tableData);  
});  
pdfParser.loadPDF("path/to/pdf");  

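function extractTableData(pdfData) {
    const tableData = [];
    // Depending on the pdf2json version, pages sit under formImage (older releases)
    // or directly on the top-level object (newer releases).
    const pages = (pdfData.formImage || pdfData).Pages;
    pages.forEach((page) => {
        page.Texts.forEach((text) => {
            // Text is URI-encoded; R[0] is the first run of the text block.
            const str = decodeURIComponent(text.R[0].T);
            if (isTableData(str)) {
                tableData.push(str);
            }
        });
    });
    return tableData;
}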

function isTableData(text) {  
    // Implement logic to determine if the text is part of a table  
    return true;  
}
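The isTableData function above is a placeholder that accepts every string. One possible approach, sketched here under stated assumptions, uses position rather than content: text blocks that share roughly the same y coordinate are grouped into a row, and only rows with at least two cells are kept. textsToRows and tableRows are hypothetical helpers, the tolerance of 0.3 page units is an arbitrary starting value, and the sample data is made up; real documents will likely need tuning.

```javascript
// Hypothetical helper: group pdf2json text blocks into rows by their y position.
// Blocks whose y differs from the row's first block by at most `tolerance` share a row.
function textsToRows(texts, tolerance = 0.3) {
  // Sort top to bottom, then left to right.
  const sorted = [...texts].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  for (const text of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(text.y - last.y) <= tolerance) {
      last.cells.push(text);
    } else {
      rows.push({ y: text.y, cells: [text] });
    }
  }
  // Order each row's cells by x and decode their URI-encoded runs.
  return rows.map((row) =>
    row.cells
      .sort((a, b) => a.x - b.x)
      .map((cell) => cell.R.map((run) => decodeURIComponent(run.T)).join(''))
  );
}

// Treat a row as table data only if it has at least `minCells` cells.
function tableRows(texts, minCells = 2) {
  return textsToRows(texts).filter((row) => row.length >= minCells);
}

// Made-up sample shaped like one page's Texts array.
const sampleTexts = [
  { x: 2, y: 1, R: [{ T: 'Quarterly%20report' }] },
  { x: 10, y: 3.05, R: [{ T: 'Price' }] },
  { x: 2, y: 3, R: [{ T: 'Item' }] },
  { x: 2, y: 4, R: [{ T: 'Apple' }] },
  { x: 10, y: 4, R: [{ T: '1.20' }] },
];
console.log(tableRows(sampleTexts)); // [ [ 'Item', 'Price' ], [ 'Apple', '1.20' ] ]
```

In the example above, this could be applied per page by calling tableRows(page.Texts) instead of testing each string on its own.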