PDF, PS and DjVu

From the Arch Linux Chinese wiki

This article covers software for viewing, editing and converting PDF, PostScript (PS), DjVu ("déjà vu") and XPS files.

Engines

  • DjVuLibre — Suite to create, manipulate and view DjVu documents.
https://djvu.sourceforge.net/ || djvulibre
  • Ghostscript — Interpreter for PostScript and PDF. Provides the gs(1) command-line interface (see also /usr/share/doc/ghostscript/*/Use.htm, also readable online) and many wrapper scripts such as ps2pdf and pdf2ps.
https://ghostscript.com/ || ghostscript
  • libgxps — GObject-based library for handling and rendering XPS documents.
https://wiki.gnome.org/Projects/libgxps || libgxps
  • libspectre — Small library for rendering PostScript documents.
https://www.freedesktop.org/wiki/Software/libspectre || libspectre
  • MuPDF — Lightweight PDF, XPS and EPUB viewer consisting of a software library, command-line tools and viewers.
https://mupdf.com/ || libmupdf
  • Poppler — PDF rendering library based on Xpdf. For CJK (Chinese, Japanese, Korean) support in Poppler, install poppler-data.
https://poppler.freedesktop.org/ || poppler

Viewers

Framebuffer

  • fbgs — Bare-bones PostScript/PDF viewer for the Linux framebuffer console.
https://www.kraxel.org/blog/linux/fbida/ || fbida
  • fbpdf — Small framebuffer PDF and DjVu viewer based on MuPDF, with Vim keybindings, written in C.
https://repo.or.cz/w/fbpdf.git || fbpdf-gitAUR
  • jfbview — Framebuffer PDF and image viewer. Features include Vim-like controls, zoom-to-fit, TOC (outline) view and fast multi-threaded rendering.
https://github.com/jichu4n/jfbview || jfbviewAUR

Graphical

Note: Some web browsers can display PDF files, for example using PDF.js.
  • apvlv — Lightweight document viewer using the GTK library, with Vim keybindings. Supports PDF, DjVu, EPUB, HTML and TXT.
https://naihe2010.github.io/apvlv/ || apvlvAUR
  • Atril — Simple multi-page document viewer for MATE. Supports DjVu, DVI, EPS, EPUB, PDF, PostScript, TIFF, XPS and Comicbook.
https://github.com/mate-desktop/atril || atril
  • CorePDF — Simple lightweight PDF viewer based on Qt and Poppler. Part of C-Suite.
https://cubocore.gitlab.io/ || corepdfAUR
  • Deepin Document Viewer — Simple PDF and DjVu reader supporting bookmarks, highlights and annotations.
https://github.com/linuxdeepin/deepin-reader || deepin-reader
  • DjView — Viewer for DjVu documents.
https://djvu.sourceforge.net/djview4.html || djview
  • Emacs
https://www.gnu.org/software/emacs/ || emacs
  • ePDFView — Lightweight PDF document viewer using the Poppler and GTK libraries. Development has stopped.
http://freecode.com/projects/epdfview || epdfview-gitAUR
  • Foxit Reader
https://www.foxitsoftware.com/pdf-reader/ || foxitreaderAUR
  • GNOME Document Viewer — Document viewer for GNOME using GTK. Supports DjVu, DVI, EPS, PDF, PostScript, TIFF, XPS and Comicbook. Part of the gnome group.
https://apps.gnome.org/Evince/ || evince
  • gv — Graphical user interface for the Ghostscript interpreter that allows viewing and navigating PostScript and PDF documents.
https://www.gnu.org/software/gv/ || gvAUR
  • llpp — Fast PDF viewer based on MuPDF that supports continuous page scrolling, bookmarks and full-text search.
https://repo.or.cz/w/llpp.git || llppAUR
  • MuPDF — Fast EPUB, FictionBook, PDF, XPS and Comicbook viewer written in portable C. Supports CJK fonts and has Vim-like keybindings.
https://mupdf.com/ || mupdf
  • Okular — Universal document viewer for KDE. Supports CHM, Comicbook, DjVu, DVI, EPUB, FictionBook, Mobipocket, ODT, PDF, Plucker, PostScript, TIFF and XPS. Part of the kde-graphics group.
https://okular.kde.org/ || okular
  • Papers — Document viewer for GNOME using GTK. Supports DjVu, PDF, TIFF and Comicbook.
https://apps.gnome.org/Papers/ || papers
  • pdfpc — Presenter console with multi-monitor support for PDF files.
https://pdfpc.github.io/ || pdfpc
  • qpdfview — Tabbed document viewer. It uses Poppler for PDF support, libspectre for PS support, DjVuLibre for DjVu support, CUPS for printing and the Qt toolkit for its interface.
https://launchpad.net/qpdfview || qpdfviewAUR
  • Sioyek — Lightweight PDF viewer based on MuPDF with features designed for reading research papers and technical books, such as marks, bookmarks, highlights, a searchable command palette and jumping to references.
https://sioyek.info/ || sioyekAUR
  • Xpdf — Viewer that can decode LZW and read encrypted PDFs.
https://www.xpdfreader.com/ || xpdf
  • Xreader — Document viewer of the X-Apps project. Supports DjVu, DVI, EPUB, PDF, PostScript, TIFF, XPS and Comicbook.
https://github.com/linuxmint/xreader/ || xreader
  • Zathura — Highly customizable and functional document viewer (plugin based). Supports PDF, DjVu, PostScript and Comicbook.
https://pwmt.org/projects/zathura/ || zathura

Comparison

The factual accuracy of this article or section is disputed.

Reason: Filling in PDF forms in MuPDF and llpp appears not to work. (Discuss in Talk:PDF, PS and DjVu)


Name PDF PostScript DjVu XPS PDF forms PDF annotations Non-rectangular selection License
Adobe Reader custom proprietary
apvlv Poppler DjVuLibre no (at least not by default) GPLv2
Atril Poppler libspectre DjVuLibre libgxps GPLv2
DjView DjVuLibre GPLv2
Emacs Ghostscript¹ DjVuLibre¹ GPLv3
Emacs pdf-tools Poppler GPLv3
ePDFView Poppler GPLv2
Foxit Reader custom proprietary
GNOME Document Viewer Poppler libspectre DjVuLibre libgxps GPLv2
gv Ghostscript GPLv3
llpp libmupdf libmupdf GPLv3
MuPDF custom custom yes (mupdf-gl) yes (mupdf-gl) yes (mupdf-gl) AGPLv3
Okular Poppler libspectre DjVuLibre custom GPL, LGPL
PDF4QT custom LGPLv3
pdfpc Poppler GPLv2
qpdfview Poppler libspectre¹ DjVuLibre¹ GPLv2
Xpdf custom GPLv3
Xreader Poppler libspectre¹ DjVuLibre¹ libgxps¹ GPLv2
Zathura libmupdf¹ / Poppler¹ libspectre¹ DjVuLibre¹ libmupdf¹ zlib
  1. Requires installing optional dependencies.

PDF forms

The PDF forms column in the above table refers to AcroForms support. If you do not need your input to be directly extractable from the PDF, you can also use the applications in #Graphical PDF editing to put text on top of a PDF. PDF forms can be created with LibreOffice Writer (View > Toolbars > Form Controls) and the advanced PDF editors.

The proprietary and deprecated XFA format for forms is not fully supported by Poppler[1][2] and only supported by Adobe Reader and Master PDF Editor.

Alternatively, web browsers such as Firefox or Chromium feature a built-in PDF viewer capable of filling out forms.

Graphical PDF editing

Editors that can import PDF files

  • Scribus can import and export PDF; text is imported as polygons.[3]
  • LibreOffice Draw can import and export PDF; text is imported as text; embedded fonts are substituted.[4][5]
  • Inkscape can import and export PDF; text is imported as cloned glyphs or text; with the latter embedded fonts are substituted.
  • Graphics editors like GIMP and Krita can also import and export PDFs at the cost of rasterization.

Basic editors

  • flpsed — A PostScript and PDF annotator, only supports text boxes.
https://flpsed.org/flpsed.html || flpsedAUR
  • HandyOutliner for DjVu / PDF — Makes creating bookmarks for DjVu and PDF documents easier and faster.
https://handyoutlinerfo.sourceforge.net || handyoutliner-binAUR
  • jPDF Tweak — Java Swing application that can combine, split, rotate, reorder, watermark, encrypt, sign, and otherwise tweak PDF files.
https://jpdftweak.sourceforge.net/ || jpdftweakAUR
  • Paper Clip — PDF document metadata editor to edit the title, author, keywords and more details.
https://apps.gnome.org/PdfMetadataEditor/ || paper-clip
  • PDF Arranger — Helps merge or split PDF documents and rotate, crop and rearrange pages. It is a maintained fork of PDF-Shuffler.
https://github.com/jeromerobert/pdfarranger || pdfarranger
  • PDF Chain — GTK front-end for PDFtk, written in C++, supporting concatenation, burst, watermarks, attaching files and more.
https://pdfchain.sourceforge.net/ || pdfchainAUR
  • PdfJumbler — Simple tool to rearrange, merge, delete and rotate pages in PDF files.
https://github.com/mgropp/pdfjumbler || pdfjumblerAUR
  • PDF Mix Tool — Qt front-end for PoDoFo, written in C++, supports splitting, merging, rotating and mixing PDF files.
https://scarpetta.eu/pdfmixtool/ || pdfmixtool
  • PDFsam — Open source application, written in Java, supports merging, splitting and rotating.
https://pdfsam.org/ || pdfsamAUR
  • PDF Slicer — Simple application to extract, merge, rotate and reorder pages of PDF documents.
https://junrrein.github.io/pdfslicer/ || pdfslicer
  • PDF Tricks — Simple, efficient application for small manipulations in PDF files using Ghostscript.
https://github.com/muriloventuroso/pdftricks || pdftricks

Cropping tools

  • briss — Java GUI to crop pages of PDF documents to one or more regions selected.
https://sourceforge.net/projects/briss/ || brissAUR
  • krop — Simple graphical tool to crop the pages of PDF files.
https://arminstraub.com/software/krop || kropAUR
  • pdfCropMargins — Automatically crops the margins of PDF files.
https://github.com/abarker/pdfCropMargins || pdfcropmarginsAUR
  • PdfHandoutCrop — Tool to crop PDF handouts with multiple pages per sheet.
https://cges30901.github.io/pdfhandoutcrop/ || pdfhandoutcropAUR

Advanced editors

  • Master PDF Editor — Functional proprietary PDF editor. Latest version free for non-commercial use. The -free package is outdated but lacks a watermark.
https://code-industry.net/free-pdf-editor/ || masterpdfeditorAUR, masterpdfeditor-freeAUR
  • PDF Studio — All-in-one proprietary PDF editor similar to Adobe Acrobat.
https://www.qoppa.com/pdfstudio/ || pdfstudio-binAUR
  • PDF4QT — Open source PDF editor.
https://jakubmelka.github.io/ || pdf4qtAUR

Comparison of advanced editors

Name Cost (USD, lifetime) Page Labels Form Designer Content Editing (Text and Images) Optimize PDFs Digitally Sign PDFs License
Master PDF Editor 85.34 proprietary
Qoppa PDF Studio Standard 99 proprietary
Qoppa PDF Studio Pro 139 proprietary

PDF tools

See also Ghostscript.

  • Camelot — PDF table extraction for humans.
https://github.com/atlanhq/camelot || python-camelotAUR, python-camelot-gitAUR
  • Coherent PDF — Proprietary (non-free) command-line tool for manipulating PDF files, including merging, encrypting, decrypting, scaling, cropping, rotating, bookmarks, stamping, logos and page numbers.
https://community.coherentpdf.com/ || cpdfAUR
  • DiffPDF — Compares the text or the visual appearance of each page in two PDF files.
https://gitlab.com/eang/diffpdf || diffpdf
  • mupdf-tools — Tools developed as part of MuPDF, containing mutool(1) and muraster.
https://mupdf.com || mupdf-tools
  • pdfcpu — Command-line tool for creating and modifying PDFs.
https://github.com/pdfcpu/pdfcpu || pdfcpu-binAUR
  • pdf_extbook — Extracts bookmarked pages from a PDF.
https://github.com/raffaem/pdf_extbook || pdf_extbook-gitAUR
  • pdfgrep — Command-line utility to search text in PDF files.
https://pdfgrep.org/ || pdfgrep
  • pdfjam — Can scale, join, rotate and flip PDF files and arrange their pages in a layout suitable for bookbinding.
https://github.com/DavidFirth/pdfjam || texlive-binextra
  • pdfminer.six — Community-maintained fork of pdfminer, a tool for extracting text from PDF documents.
https://github.com/pdfminer/pdfminer.six || python-pdfminer
  • pdf2svg — Converts PDF files to SVG files.
http://www.cityinthesky.co.uk/opensource/pdf2svg/ || pdf2svg
  • PDFtk — Simple tool for doing everyday things with PDF documents.
https://gitlab.com/pdftk-java/pdftk || pdftk
  • QPDF — Content-preserving PDF transformation system.
https://github.com/qpdf/qpdf || qpdf
  • Stapler — Lightweight alternative to PDFtk using the PyPDF2 library.
https://github.com/hellerbarde/stapler || staplerAUR, stapler-gitAUR
  • Tabula — Tool for liberating data tables trapped inside PDF files.
https://tabula.technology || tabulaAUR, tabula-javaAUR
  • Vector Slicer — Exports multi-page PDFs from an SVG.
https://gitlab.gnome.org/World/design/vector-slicer || vector-slicer
  • verapdf — Purpose-built, open source file-format validator covering all PDF/A and PDF/UA parts and conformance levels.
https://verapdf.org || verapdfAUR

Command snippets

Create a PDF from images

With GraphicsMagick:

$ gm convert 1.jpg 2.jpg 3.jpg out.pdf

With ImageMagick:

$ magick 1.jpg 2.jpg 3.jpg out.pdf

Note that ImageMagick's output is lossy. For lossless PDF creation from JPEG files, use img2pdf.

Concatenate PDFs

With Ghostscript:

$ gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=out.pdf -dBATCH 1.pdf 2.pdf 3.pdf

With PDFtk:

$ pdftk 1.pdf 2.pdf 3.pdf cat output out.pdf

With Poppler:

$ pdfunite 1.pdf 2.pdf 3.pdf out.pdf

With QPDF:

$ qpdf --empty --pages 1.pdf 2.pdf 3.pdf -- out.pdf

Extract text from PDF

With Poppler and maintaining the layout:

$ pdftotext -layout in.pdf out.txt

See also pdftotext(1).

With calibre:

$ ebook-convert in.pdf out.txt

Results vary between applications, depending on the PDF file.

Decrypt a PDF

This section lists commands to decrypt a PDF to an unencrypted file. Note that most PDF viewers also support encrypted PDFs.

With PDFtk:

$ pdftk in.pdf input_pw password output out.pdf

With Poppler to PostScript:

$ pdftops -upw password in.pdf out.ps

With QPDF:

$ qpdf --decrypt --password=password in.pdf out.pdf
Tip: Forgotten passwords might be recovered with pdfcrack, see pdfcrack(1).

Encrypt a PDF

The user password is used for encryption, while the owner password restricts which operations are allowed once the document is decrypted. For more information, see Wikipedia:PDF#Encryption and signatures.

With PDFtk:

$ pdftk in.pdf output out.pdf user_pw password

With PoDoFo:

$ podofoencrypt -u user_password -o owner_password in.pdf out.pdf

With QPDF:

$ qpdf --encrypt user_password owner_password key_length -- in.pdf out.pdf

where key_length can be 40, 128 or 256.

Extract images from a PDF

With Poppler, saving images as JPEG:

$ pdfimages -j infile.pdf outfileroot

Extract page range from PDF, split multipage PDF document

With Ghostscript as a single file:[6]

$ gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dFirstPage=first -dLastPage=last -sOutputFile=outfile.pdf infile.pdf

With PDFtk as a single file:

$ pdftk infile.pdf cat first-last output outfile.pdf

With Poppler as separate files:

$ pdfseparate -f first -l last infile.pdf outfileroot-%d.pdf

With QPDF as a single file:

$ qpdf --empty --pages infile.pdf first-last -- outfile.pdf

With mutool as a single file:

$ mutool clean -g infile.pdf outfile.pdf first-last
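
To split a long document into parts of a fixed size, the page ranges can be generated in a loop and passed to any of the single-file commands above. The following is a minimal sketch; the helper name page_ranges and the output file names are assumptions made for illustration and are not part of any of these tools.

```shell
# page_ranges TOTAL CHUNK
# Print one "first-last" range per line, covering pages 1..TOTAL in steps of CHUNK.
# CHUNK must be a positive integer.
# (Hypothetical helper, not part of qpdf, pdftk or Poppler.)
page_ranges() {
    total=$1
    chunk=$2
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + chunk - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "$first-$last"
        first=$((last + 1))
    done
}

# Example use with QPDF (commented out because it needs qpdf and infile.pdf):
# pages=$(pdfinfo infile.pdf | awk '/^Pages:/ { print $2 }')
# page_ranges "$pages" 10 | while read -r range; do
#     qpdf --empty --pages infile.pdf "$range" -- "part-$range.pdf"
# done
```

For a 25-page file and a chunk size of 10, this yields the ranges 1-10, 11-20 and 21-25.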

Impose a PDF (nup)

PDF imposition is the process of combining multiple input pages into one output page, laid out in a grid of rows and columns.

It can be done with pdfjam (note that wrapper scripts such as pdfnup and pdfbook are deprecated):

$ pdfjam --nup columnsxrows input.pdf --outfile output.pdf

or with pdfsak:

$ pdfsak --input-file input.pdf --output output.pdf --nup rows columns

Inspect metadata

With ExifTool:

$ exiftool -All file.pdf

With Poppler:

$ pdfinfo file.pdf

Remove metadata

Using ExifTool

With ExifTool:

$ exiftool -All= -overwrite_original input.pdf
$ mv input.pdf /tmp/temp.pdf
$ qpdf --linearize /tmp/temp.pdf input.pdf

The linearize step is needed to prevent recovery of deleted metadata. See this SuperUser question and the related ExifTool forum thread.

Using pdftk

Many PDFs store document metadata using both an Info dictionary (old school) and an XMP stream (new school). The pdftk command below removes the XMP stream from the PDF altogether. It does not remove the Info dictionary.

Note that objects inside the PDF might have their own, separate XMP metadata streams, and that this command does not remove those. It only removes the PDF’s document‐level XMP stream.

$ pdftk input.pdf drop_xmp output output.pdf

Reduce size of a PDF

PDF size can be reduced by setting an appropriate optimization or compression level.

With Ghostscript, use one of:

$ ps2pdf -dPDFSETTINGS=/screen in.pdf out.pdf

or

$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/printer -sOutputFile=out.pdf in.pdf

For different settings see the documentation.

There is also shrinkpdfAUR, a script wrapping gs.

Rasterize a PDF

These commands will convert your PDF into images.

With GraphicsMagick to convert a specific page into an image file:

$ gm convert -density dpi infile.pdf[page] outfile.jpg

With ImageMagick to convert a specific page into an image file:

$ magick -density dpi infile.pdf[page] outfile.jpg

With ImageMagick to convert all pages into another PDF file composed of one image per page:

$ magick -density dpi infile.pdf outfile.pdf
Warning: This will increase the file size of your PDF substantially. Use it, for example, if your printer is not able to print your PDF correctly.

With Poppler to convert all pages into one image file per page:

$ pdftoppm -jpeg -r dpi infile.pdf outfileroot

With Poppler to convert a specific page into an image file:

$ pdftoppm -jpeg -r dpi -f page -singlefile infile.pdf outfileroot

Split PDF pages

With mupdf-tools to split every page vertically into two pages:

$ mutool poster -y 2 in.pdf out.pdf

Can be used to undo simple imposition.

Add an image

Adding an image at an arbitrary location in a PDF is possible with several tools. Details on these and other solutions can be found on StackExchange.

Add digital signature to PDF

jsignpdfAUR can digitally sign PDF files with X.509 certificates in GUI and CLI.

Readers such as Okular and MuPDF can sign PDFs with digital signatures. This requires a PFX certificate, which can be created with an OpenSSL command:

$ openssl req -x509 -days 365 -newkey rsa:2048 -keyout cert.pem -out cert.pem
$ openssl pkcs12 -export -in cert.pem -out cert.pfx

MuPDF users can then sign PDFs with the cert.pfx using the graphical interface, or its mutool-sign tool.

Okular users must import cert.pfx into a certificate store such as the one in the default Firefox profile.[7][dead link 2024-01-13] With Firefox this is done through Settings > Privacy & Security > View Certificates > Your Certificates > Import and selecting cert.pfx. Afterwards Okular will offer this certificate to be used when signing PDFs.

LibreOffice can also sign PDFs.[8]

Removing annotations from a PDF

With pdftk [9]:

$ pdftk in.pdf output - uncompress | sed '/^\/Annots/d' | pdftk - output out.pdf compress

With perl-cam-pdfAUR:

$ rewritepdf.pl -C in.pdf out.pdf

See https://superuser.com/a/1051543 for more information.
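
The sed filter in the pdftk pipeline above works because uncompressed PDF objects are plain text, so deleting every line that starts with /Annots drops a page's references to its annotations. A small sketch on made-up input follows; the object text is a hypothetical fragment written for this example, not real pdftk output, and strip_annots is an assumed helper name.

```shell
# strip_annots: the same sed expression as in the pipeline above, wrapped in a function.
strip_annots() {
    sed '/^\/Annots/d'
}

# Hypothetical fragment of an uncompressed page object.
sample='/Type /Page
/Annots [12 0 R 13 0 R]
/Contents 5 0 R'

# Prints the fragment without its /Annots line.
printf '%s\n' "$sample" | strip_annots
```

Only lines that begin with /Annots are removed. An /Annots entry that shares a line with other keys is left untouched, so it is worth checking the result in a viewer.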

Add page numbers

With pdfsak:

$ pdfsak --input-file input.pdf --output output.pdf --text "\large \$page/\$pages" br 0.99 0.99 --latex-engine xelatex --font "Noto Regular"

Add page labels

Page labels are logical page numbers shown in the navigation bar of your PDF reader. They are useful, for example, when the first pages of the PDF are indices numbered with Roman numerals (I, II, etc.), so that the page numbered "1" is not the first page of the file, and you want the navigation bar to show the same number as the one printed on the page.

This should not be confused with adding page numbers into a physical page. See section 12.4.2 of PDF reference to better understand page labels.

  1. Using pagelabels-py, let's say we have a PDF named my_document.pdf, that has 12 pages.
    • Pages 1 to 4 should be labelled Intro I to Intro IV.
    • Pages 5 to 9 should be labelled 2 to 6.
    • Pages 10 to 12 should be labelled Appendix A to Appendix C
    • We can issue the following list of commands:
      $ python3 -m pagelabels --delete "my_document.pdf"
      $ python3 -m pagelabels --startpage 1 --prefix "Intro " --type "roman uppercase" "my_document.pdf"
      $ python3 -m pagelabels --startpage 5 --firstpagenum 2 "my_document.pdf"
      $ python3 -m pagelabels --startpage 10 --prefix "Appendix " --type "letters uppercase" "my_document.pdf" 
    • Note: pagelabels-py will convert your file to the PDF 1.3 specification.
  2. Using pdftk, create a metadata.txt file with labels:
    PageLabelBegin
    PageLabelNewIndex: 1
    PageLabelStart: 1
    PageLabelPrefix: Cover
    PageLabelNumStyle: NoNumber
    PageLabelBegin
    PageLabelNewIndex: 2
    PageLabelStart: 1
    PageLabelPrefix: Back Cover
    PageLabelNumStyle: NoNumber
    PageLabelBegin
    PageLabelNewIndex: 3
    PageLabelStart: 1
    PageLabelNumStyle: LowercaseRomanNumerals
    PageLabelBegin
    PageLabelNewIndex: 27
    PageLabelStart: 1
    PageLabelNumStyle: DecimalArabicNumerals 
    • Where:
      PageLabelBegin
      signals that a new page label definition follows
      PageLabelNewIndex
      is the PDF page index from which the numbering style applies, counting from one. The numbering style will continue until the next page label or, if there are no more page labels, until the end of the document.
      PageLabelStart
      is the starting number. For example, if you specify 5 here, the pages will be numbered 5, 6, 7, ...
      PageLabelPrefix
      a text to put before the number in page labels.
      PageLabelNumStyle
      can be DecimalArabicNumerals, UppercaseRomanNumerals, LowercaseRomanNumerals, UppercaseLetters, LowercaseLetters or NoNumber.
    • Then use:
      $ pdftk book.pdf update_info_utf8 metadata.txt output book-with-metadata.pdf

See this SuperUser question for more details.

Extract bookmarks

With pdftk:

$ pdftk file.pdf dump_data_utf8 | grep '^Bookmark'

With qpdf:

$ qpdf --json --json-key=outlines file.pdf

See https://unix.stackexchange.com/questions/143886/how-to-extract-bookmarks-from-a-pdf-file for more information.
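
The raw pdftk dump is verbose. The following is a minimal sketch that condenses it into an indented outline, assuming each record consists of a BookmarkBegin line followed by BookmarkTitle, BookmarkLevel and BookmarkPageNumber fields, as in the definition format described in the next section. The name bookmarks_to_outline is a hypothetical helper, not a pdftk feature.

```shell
# bookmarks_to_outline: read "pdftk ... dump_data_utf8" output on stdin and
# print one line per bookmark, indented by two spaces per hierarchy level.
# (Hypothetical helper; assumes the record layout described in the lead-in.)
bookmarks_to_outline() {
    awk '
        # Print the bookmark collected so far, if any.
        function emit(   i, indent) {
            if (title == "") return
            indent = ""
            for (i = 1; i < level; i++) indent = indent "  "
            print indent title " (p. " page ")"
            title = ""
        }
        /^BookmarkBegin$/       { emit() }
        /^BookmarkTitle: /      { title = substr($0, 16) }  # text after "BookmarkTitle: "
        /^BookmarkLevel: /      { level = $2 }
        /^BookmarkPageNumber: / { page = $2 }
        END                     { emit() }
    '
}

# Usage (needs pdftk and a PDF with bookmarks):
# pdftk file.pdf dump_data_utf8 | bookmarks_to_outline
```

Lines that do not belong to a bookmark record, such as the Info and page media entries in the dump, are ignored.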

Add bookmarks

With pdftk

Create a text file bookmark_definitions.txt with bookmark definitions in the following format:

BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Chapter 1.1
BookmarkLevel: 2
BookmarkPageNumber: 2
BookmarkBegin
BookmarkTitle: Chapter 1.2
BookmarkLevel: 2
BookmarkPageNumber: 3
BookmarkBegin
BookmarkTitle: Chapter 1.3
BookmarkLevel: 2
BookmarkPageNumber: 4
BookmarkBegin
BookmarkTitle: Chapter 1.3.1
BookmarkLevel: 3
BookmarkPageNumber: 5
BookmarkBegin
BookmarkTitle: Chapter 2
BookmarkLevel: 1
BookmarkPageNumber: 6

Where

BookmarkBegin
signals a new bookmark definition
BookmarkTitle
the title of the bookmark
BookmarkLevel
the level of the bookmark in the hierarchy
BookmarkPageNumber
the page number the bookmark redirects to

In this example, the above file will create the following bookmark structure:

  • Chapter 1
    • Chapter 1.1
    • Chapter 1.2
    • Chapter 1.3
      • Chapter 1.3.1
  • Chapter 2

Apply the bookmarks with the following command:

$ pdftk input.pdf update_info_utf8 bookmark_definitions.txt output output.pdf
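
Writing the definition file by hand gets tedious for long documents. As a sketch, the stanzas can be generated from a compact outline. The tab-separated input format (level, page, title) and the function name outline_to_bookmarks are assumptions made for this example, not something pdftk defines.

```shell
# outline_to_bookmarks: read lines of the form LEVEL<TAB>PAGE<TAB>TITLE on stdin
# and print the corresponding pdftk bookmark definitions.
# (Hypothetical helper and input format.)
outline_to_bookmarks() {
    awk -F '\t' '
        NF >= 3 {
            print "BookmarkBegin"
            print "BookmarkTitle: " $3
            print "BookmarkLevel: " $1
            print "BookmarkPageNumber: " $2
        }
    '
}

# Usage:
# outline_to_bookmarks < outline.tsv > bookmark_definitions.txt
# pdftk input.pdf update_info_utf8 bookmark_definitions.txt output output.pdf
```

Lines with fewer than three fields are skipped, so blank lines in the outline file are harmless.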

Extract pages contained within a bookmark

To extract the pages contained within a bookmark, you can use pdf_extbook-gitAUR.

Running pdf_extbook file prompts you for the bookmark whose pages you want to extract and for where to save them. To extract all bookmarks of a given hierarchical level:

$ pdf_extbook file -a level output_file_stem

Remove blank pages

One can use the following script to remove blank pages from a PDF file (credit: SuperUser post):

#!/bin/sh
# Remove blank pages from a PDF; writes <name>_noblanks.pdf to the current directory.

IN="$1"
filename=$(basename "$IN")
filename="${filename%.*}"
PAGES=$(pdfinfo "$IN" | grep '^Pages:' | tr -dc '0-9')

# Print the numbers of all pages whose summed ink coverage exceeds the threshold.
non_blank() {
	for i in $(seq 1 "$PAGES"); do
		# ink_cov reports the C, M, Y and K coverage of the page; add them up.
		PERCENT=$(gs -o - -dFirstPage="$i" -dLastPage="$i" -sDEVICE=ink_cov "$IN" | grep CMYK | awk '{ sum += $1 + $2 + $3 + $4 } END { printf "%.5f\n", sum }')
		if [ "$(echo "$PERCENT > 0.001" | bc)" -eq 1 ]; then
			echo "$i"
		fi
		printf . 1>&2	# progress indicator
	done
	echo 1>&2
}

# $(non_blank) is deliberately unquoted so that each page number becomes its own argument.
pdftk "$IN" cat $(non_blank) output "${filename}_noblanks.pdf"

Use it like pdf_remove_blank_pages input.pdf.

The script needs pdftk, ghostscript, poppler (for pdfinfo), bc and awk.
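
The blank-page test in the script hinges on summing the four coverage values that Ghostscript's ink_cov device prints for each page. The sketch below isolates that step. The sample lines are assumed to match the ink_cov output format (four numbers followed by CMYK OK), and ink_sum is a hypothetical helper name.

```shell
# ink_sum: read ink_cov output on stdin and print the summed C+M+Y+K coverage.
# (Hypothetical helper mirroring the grep/awk stage of the script above.)
ink_sum() {
    grep CMYK | awk '{ sum += $1 + $2 + $3 + $4 } END { printf "%.5f\n", sum }'
}

# Assumed ink_cov line for a page with some ink on it:
echo ' 0.10000  0.20000  0.30000  0.40000 CMYK OK' | ink_sum

# Assumed ink_cov line for an empty page:
echo ' 0.00000  0.00000  0.00000  0.00000 CMYK OK' | ink_sum
```

The first call prints 1.00000, which is above the script's 0.001 threshold, so the page would be kept. The second prints 0.00000, so the page would be dropped. Scanned pages with noise or faint marks may need a different threshold.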

Find fonts used in a PDF

The pdffonts(1) command (from Poppler) can be used to find which fonts a PDF uses and whether they are embedded in it:

$ pdffonts file.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Times-Roman                          Type 1            Custom           no  no  no       8  0
Times-Italic                         Type 1            Standard         no  no  no       9  0
Times-Bold                           Type 1            Standard         no  no  no       7  0
Helvetica                            Type 1            Standard         no  no  no      34  0
Helvetica-Bold                       Type 1            Standard         no  no  no      35  0

This helps when the text of a PDF is not displayed properly: it shows whether missing fonts or their metric-compatible equivalents need to be installed.
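
For documents with many fonts, the listing can be filtered down to the fonts that are not embedded. This is a minimal sketch that assumes the column layout shown above and font names without spaces; unembedded_fonts is a hypothetical helper name.

```shell
# unembedded_fonts: read pdffonts output on stdin and print the names of fonts
# whose "emb" column is "no". The "type" column may contain spaces, so the
# columns are counted from the right: object ID (two fields), uni, sub, emb.
# That makes emb the field NF-4. The first two lines are the table header.
unembedded_fonts() {
    awk 'NR > 2 && $(NF - 4) == "no" { print $1 }'
}

# Usage (needs Poppler):
# pdffonts file.pdf | unembedded_fonts
```

For the sample output above, this would list all five fonts, since none of them is embedded.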

Repair broken PDF file

With Ghostscript:

$ gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress corrupted.pdf

With Poppler:

$ pdftocairo -pdf corrupted.pdf repaired.pdf

With mupdf-tools:

$ mutool clean corrupted.pdf repaired.pdf

Reference: https://superuser.com/q/278562

Convert PDF to PDF/A standard

With Ghostscript:

$ gs -dPDFA -dBATCH -dNOPAUSE -sColorConversionStrategy=UseDeviceIndependentColor -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=2 -sOutputFile=document_pdfa.pdf document.pdf

Reference: https://stackoverflow.com/a/56459053

Validate PDF/A compliance

Using verapdfAUR you can validate the compliance of your PDF to different flavours of the PDF/A standard:

$ verapdf --flavour 1a --format text document.pdf

DjVu tools

  • DjVuLibre provides many command-line tools, like ddjvu(1) for example.
  • img2djvu — Single-pass DjVu encoder based on DjVu Libre and ImageMagick.
https://github.com/ashipunov/img2djvu || img2djvu-gitAUR
  • pdf2djvu — Creates DjVu files from PDF files.
https://jwilk.net/software/pdf2djvu || pdf2djvuAUR

Convert DjVu to images

Break DjVu into separate pages:

$ djvmcvt -i input.djvu /path/to/out/dir output-index.djvu

Convert DjVu pages into images:

$ ddjvu --format=tiff page.djvu page.tiff

Convert DjVu pages into PDF:

$ ddjvu --format=pdf inputfile.djvu outputfile.pdf

You can also use --page to export specific pages:

$ ddjvu --format=tiff --page=1-10 input.djvu output.tiff

This will convert pages 1 to 10 into a single TIFF file.

Processing images

You can use scantailor-advanced to:

  • fix orientation
  • split pages
  • deskew
  • crop
  • adjust margins

Make DjVu from images

There is a useful script, img2djvu-gitAUR:

$ img2djvu -c1 -d600 -v1 ./out

It will create a 600 DPI out.djvu from all files in the ./out directory.

Alternatively, you can try didjvuAUR, which seems to create smaller files, especially for images with a well-defined background.

PostScript tools

  • pstotext — Converts PostScript files to text.
https://www.cs.wisc.edu/~ghost/doc/pstotext.htm || pstotextAUR

ps2pdf

ps2pdf is a wrapper around Ghostscript to convert PostScript to PDF:

$ ps2pdf -sPAPERSIZE=a4 -dOptimize=true -dEmbedAllFonts=true YourPSFile.ps

Explanation:

  • -sPAPERSIZE=something defines the paper size. For valid PAPERSIZE values, see [10][dead link 2022-09-22].
  • -dOptimize=true optimises the created PDF for loading.
  • -dEmbedAllFonts=true embeds all fonts, so the text renders the same on systems that lack them.
Note: You cannot choose the paper orientation in ps2pdf. If your input PS file is well-formed, it already contains the orientation information. If you are trying to use an Encapsulated PS file that does not fit the -sPAPERSIZE you specified, you will have problems, because EPS files usually do not contain paper orientation information. A workaround is to create a new paper size in the Ghostscript settings (call it e.g. "slide") and use it as -sPAPERSIZE=slide.

Libraries

C/C++

  • libharu — C library for generating PDF documents.
https://github.com/libharu/libharu || libharu, Lua binding: lua-hpdfAUR
  • PoDoFo — A C++ library to work with the PDF file format.
https://podofo.sourceforge.net || podofo

Python

  • borb — borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/, https://github.com/jorisschellekens/borb || not packaged? search in AUR
  • pdfrw — A pure Python library that reads and writes PDFs.
https://github.com/pmaupin/pdfrw || python-pdfrw
  • PyPDF — A pure-Python library built as a PDF toolkit.
https://github.com/py-pdf/pypdf || python-pypdf
  • PyX — Python library for the creation of PostScript and PDF files.
https://pyx.sourceforge.net || python-pyx
  • ReportLab — A proven industry-strength PDF generating solution
https://www.reportlab.com/ || python-reportlab

Java

  • iText Core — iText is a more versatile, programmable and enterprise-grade PDF solution that allows you to embed its functionalities within your own software for digital transformation.
https://itextpdf.com/products/itext-core || itext-rups-binAUR
  • OpenPDF — OpenPDF is a free Java library for creating and editing PDF files with a LGPL and MPL open source license. OpenPDF is based on a fork of iText.
https://github.com/LibrePDF/OpenPDF || not packaged? search in AUR

See also