Is there a way to digitally mark up a PDF so it’s not OCR-readable?

cheese_greater@lemmy.world · 3 months ago

Is there a way to digitally mark up a PDF so it’s not OCR-readable?

OsrsNeedsF2P@lemmy.ml · 3 months ago

PDF scanning is done by both OCR and PDF analysis (reading the text stored in the file directly), so no. If you, a human, can read it, a bot can read it too.

Your best bet is the classic trick: inserting BS text in a font color that’s one hex value off white.
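To illustrate why the off-white trick pollutes a bot's output rather than hiding anything, here is a minimal sketch of the "PDF analysis" route. It assumes a hypothetical, uncompressed page content stream (the `CONTENT_STREAM` bytes and both sample sentences are made up for illustration); real PDFs usually compress their streams and use font-specific encodings, so a proper PDF parser would be needed in practice. The point is only that a parser reads the text operands and never looks at the fill color.

```python
import re

# Hypothetical, uncompressed content stream for one page.
# "rg" sets the fill colour; "Tj" shows a text string.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
0 0 0 rg
72 720 Td
(Quarterly results attached.) Tj
0.996 0.996 0.996 rg
0 -14 Td
(lorem ipsum decoy text) Tj
ET
"""

# A literal string operand (with backslash escapes) followed by the Tj operator.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_text_operands(stream: bytes) -> list[str]:
    """Return the string operand of every Tj operator, ignoring colour."""
    return [m.group(1).decode("latin-1") for m in TJ_PATTERN.finditer(stream)]


print(extract_text_operands(CONTENT_STREAM))
```

Both the black sentence and the near-white decoy come back, so the decoy ends up mixed into whatever the bot extracts, while a human reader sees only the black text.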

BCsven@lemmy.ca · 3 months ago

Zip them into a password-protected archive, or encrypt them with PGP.

Natanael@slrpnk.net · 3 months ago

There are DRM solutions, but they’re by definition not perfect: if it can be read, then a photo can be taken.

MystikIncarnate@lemmy.ca · 3 months ago

I would OCR it myself, but edit the metadata in the file so that the text in the OCR metadata (the hidden text layer) is lorem ipsum.

So any bot that assumes the OCR text is what’s on the image in the PDF (and why wouldn’t it?) will only read useless junk. Only someone reading the text from the image would “see” it, and only a bot programmed to OCR a file that already has OCR metadata would realize that there’s any inconsistency.
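The inconsistency check described above could be sketched roughly like this. It assumes the two strings have already been obtained elsewhere (one from the file's embedded text layer, one from running OCR on the page image); the function name `text_layer_matches`, the 0.8 threshold, and the sample strings are all hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def text_layer_matches(embedded_text: str, ocr_text: str, threshold: float = 0.8) -> bool:
    """Return True when the embedded text layer roughly agrees with fresh OCR output."""
    # Normalise case and whitespace so OCR spacing noise does not dominate.
    a = " ".join(embedded_text.lower().split())
    b = " ".join(ocr_text.lower().split())
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Made-up sample strings: a junk text layer versus what OCR sees on the image.
embedded = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
fresh_ocr = "Invoice 2024-117: total due 4,250.00 USD by 30 June."

print(text_layer_matches(embedded, fresh_ocr))
# An honest text layer with one typical OCR slip (0 read as O) still matches.
print(text_layer_matches(fresh_ocr, "Invoice 2024-117: total due 4,25O.00 USD by 30 June."))
```

A scraper that bothers to run this kind of comparison would notice the swap, which is the weak point of the approach: it only defeats bots that trust the text layer.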

I’m not entirely sure how to accomplish that, but I’d figure it out if I was worried about the data being compromised.

Personally, I would simply keep the file in an encrypted container. Then I wouldn’t worry about what can scan the file, since it would be entirely unreadable ciphertext without the correct security key or passphrase.

Nomecks@lemmy.ca · 3 months ago

Adobe makes a whole DRM platform to do exactly this: Digital Editions.

cannedtuna@lemmy.world · 3 months ago

OCR cannot scan documents that have been certified or digitally signed.

Note that once you certify a document it can no longer be edited, combined with another PDF, or have pages inserted or extracted.

Once a PDF has been digitally signed it is locked and you can no longer add pages, delete pages, or read it via OCR.

MystikIncarnate@lemmy.ca · 3 months ago

This works, right up until you introduce PDF-compatible software that doesn’t give a shit about your rules, of which there’s plenty.

You can also print/scan, or even print to PDF to get around such limitations. The original document cannot be altered since that would invalidate the digital signature on the file, but you can create a perfect digital copy, omitting the signature, and modify it however you want.
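As a rough sketch of that point: a signature tells you whether the bytes changed, but it does nothing to stop anyone from reading or copying them. The example below uses an HMAC from the Python standard library purely as a stand-in; real PDF signatures use public-key signatures embedded in the file, not a shared key, and `SIGNING_KEY` and the sample document bytes are hypothetical.

```python
import hashlib
import hmac

SIGNING_KEY = b"hypothetical-signing-key"  # stand-in for a signer's private key


def sign(document: bytes) -> bytes:
    """Produce a tag over the document bytes (stand-in for a real signature)."""
    return hmac.new(SIGNING_KEY, document, hashlib.sha256).digest()


def verify(document: bytes, signature: bytes) -> bool:
    """Check that the document bytes still match the tag."""
    return hmac.compare_digest(sign(document), signature)


original = b"page 1: confidential figures"
signature = sign(original)

tampered = original + b" (edited)"
# A print-to-PDF style copy: same content, signature simply left behind.
copy_without_signature = bytes(original)

print(verify(original, signature))         # untouched original still verifies
print(verify(tampered, signature))         # any edit invalidates the signature
print(copy_without_signature == original)  # the content itself copied fine
```

The copy carries no signature to invalidate, so it can be edited or OCR'd freely; the signature only ever protected the integrity of the original.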

Online systems that are skimming documents for their contents don’t give a shit about what the signature is. They simply take a copy and OCR it to train an AI, or amalgamate the information for data harvesting or other purposes.

I get what you’re saying, and in concept it should be fine. The problem is that it’s a software lock/restriction on a file type that isn’t closed source or unknown, and the PDF format wasn’t built to be secure from the ground up. So we’re applying security to a system that wasn’t built for it.