Original document content overlaps with the translation; is there an option to hide the original content? #446

tomatolike-self · 2025-01-09T13:45:18Z

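Problem description

I ran into a problem while using pdf2zh:

Device: Mac M1
Python version: Python 3.12.7
pdf2zh version: pdf2zh v1.8.8

When translating a document, the original content and the translated text overlap, which makes the result hard to read.

Is there an option along the lines of pdf2zh example.pdf -f xxx that hides the original content on the translated pages and keeps only the translation?

Test document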

Chankin - jou.roalof ggcloar materials.pdf

tomatolike-self · 2025-01-09T13:51:33Z

#444 mentions that a new version already has this feature, so I'll wait for the release.

hellofinch · 2025-01-10T01:20:23Z

Your file is a scanned document, so it needs OCR; we can't handle it for now.
#19
: )

tomatolike-self · 2025-01-10T02:40:13Z

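Fine, but the translated pages do contain the translated text, and the text in the original document can be selected.

What I have in mind is an option that turns off the original document's image on the translated pages, so the two don't overlap.
I closed this issue earlier because #444 said the next release would add a new option for turning something off, but looking at it today, that doesn't seem to be the same request as mine?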

tomatolike-self · 2025-01-10T02:44:30Z

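Also, the text layer of the PDF on the translated pages is already the Chinese translation.

The original content seems to have been pasted in directly as a background image?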

hellofinch · 2025-01-10T06:23:49Z

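This PDF is a scan. You can copy text not because the PDF itself allows it, but because your reader does.
It's not that we don't want to remove the original content; it can't be removed.
So please wait for the OCR feature.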
: )

awwaawwa · 2025-01-10T08:00:47Z

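@hellofinch Actually, we should be able to remove this image content, hahaha.

I'll look into this request once I've finished what I have on my plate. Please be patient, thanks.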

hellofinch · 2025-01-10T10:25:03Z

I tried deleting it before, and it didn't work...

awwaawwa · 2025-01-11T18:41:09Z

@hellofinch The background image is in ops_base. In theory, if that is removed, only the translated text and formulas are left.
https://github.com/Byaidu/PDFMathTranslate/blob/f30133e040a85633c708ccbdc8ad9a6db75ed763/pdf2zh/pdfinterp.py#L267C1-L268C1

https://github.com/Byaidu/PDFMathTranslate/blob/f30133e040a85633c708ccbdc8ad9a6db75ed763/pdf2zh/pdfinterp.py#L273C18-L273C26
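The idea above can be illustrated with a small sketch. This is not the project's code: compose_page_ops and its keep_base parameter are hypothetical, and the layout of the output string is an assumption about how the original operators (ops_base) and the translated-text operators might be joined into one page content stream. It only shows why leaving out ops_base would leave just the translated output.

```python
def compose_page_ops(ops_base: str, ops_new: str, x0: float, y0: float,
                     keep_base: bool = True) -> str:
    """Join original page operators with translated-text operators.

    Hypothetical helper for illustration only; not part of pdf2zh.
    """
    # The translated text is drawn after shifting the origin to (x0, y0).
    translated = f"1 0 0 1 {x0} {y0} cm {ops_new}"
    if not keep_base:
        # Without ops_base, the scanned page image it paints is never drawn.
        return translated
    # q ... Q isolates graphics-state changes made by the original operators.
    return f"q {ops_base}Q {translated}"


if __name__ == "__main__":
    base = "/Im0 Do "                      # paints the scanned page image
    new = "BT /F1 12 Tf (hello) Tj ET"     # translated text
    print(compose_page_ops(base, new, 0, 0))
    print(compose_page_ops(base, new, 0, 0, keep_base=False))
```

Anything else that ops_base paints would disappear along with the image, so an option like this would probably need to be opt-in.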

awwaawwa · 2025-01-11T19:05:17Z

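This is the test result after removing ops_base on the new backend. P.S. The blank areas do contain text; they only render as blank because of a bug.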

liuyun16 · 2025-02-12T09:15:48Z

C. Campbell et al_Surface Science Reports_1997 _Ultrathin metal films and particles on oxide surfaces structural, electronic and chemisorptive prop-mono.pdf

@hellofinch The background image is in ops_base. In theory, if that is removed, only the translated text and formulas are left. https://github.com/Byaidu/PDFMathTranslate/blob/f30133e040a85633c708ccbdc8ad9a6db75ed763/pdf2zh/pdfinterp.py#L267C1-L268C1

https://github.com/Byaidu/PDFMathTranslate/blob/f30133e040a85633c708ccbdc8ad9a6db75ed763/pdf2zh/pdfinterp.py#L273C18-L273C26

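Impressive, I'm really looking forward to this update! I've found this to be a common problem with older PDFs. Very old PDF files may be built on an image layer rather than a text layer. I've also seen that Elsevier documents from the 1990s all keep the original page image in the translated output.

I've also attached the result of translating one page, page 5 of the PDF above. The ghosting is clearly visible.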

awwaawwa · 2025-02-12T09:19:04Z

@liuyun16 It will be a while; this one is fairly low priority. Other things have to come first.

aik2mlj · 2025-02-15T22:54:15Z

A stopgap: after translation produces a PDF with overlapping text and images, use another tool to remove the image layer.
https://chatgpt.com/share/67b11a1c-9ecc-800e-9810-fd61a1d84044

I tested Ghostscript on Linux and got good results:

gs -o output.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
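For batch use, the same command can be wrapped in a few lines of Python. This is a minimal sketch: the function names build_gs_command and strip_images are made up for illustration, and it assumes the Ghostscript executable is available on PATH as gs. It passes exactly the flags shown above and nothing else.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_command(src: str, dst: str, gs: str = "gs") -> list[str]:
    """Return the Ghostscript argument list used in the command above."""
    return [gs, "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERIMAGE", src]


def strip_images(src: str, dst: str) -> None:
    """Write a copy of src to dst with raster images filtered out."""
    gs = shutil.which("gs")  # assumption: the executable is named "gs"
    if gs is None:
        raise RuntimeError("Ghostscript executable 'gs' not found on PATH")
    if not Path(src).is_file():
        raise FileNotFoundError(src)
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(src, dst, gs), check=True)


if __name__ == "__main__":
    # Only prints the command; call strip_images() to actually run it.
    print(" ".join(build_gs_command("input.pdf", "output.pdf")))
```

Note that -dFILTERIMAGE drops every raster image in the file, so figures embedded as images are removed together with the scanned page background.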

tomatolike-self closed this as completed Jan 9, 2025

tomatolike-self reopened this Jan 10, 2025

Byaidu added the enhancement New feature or request label Jan 19, 2025

awwaawwa self-assigned this Feb 12, 2025

awwaawwa added the Low priority label Feb 12, 2025
