Could you compress text files by mapping a word to how commonly it is used and translating it with an application?

Corroded@leminal.space · 1 year ago

Could you compress text files by mapping a word to how commonly it is used and translating it with an application?

Snarwin@kbin.social · 1 year ago

This technique is called Huffman coding, and it’s used in a number of compression algorithms.

Corroded@leminal.space · 1 year ago

Huh that was an interesting read. I’m not sure I understood the entirety of it but it sounds like it would be a lot more efficient than what I was thinking

morhp@lemmy.wtf · 1 year ago

Yes, it’s very efficient and it’s at the core of what compression formats like .zip do.

The main difference from your idea is that computers count in binary (0, 1, 10, 11, 100, 101 and so on), and you don’t want to assign these very short codes to words directly. Say you assigned 1 to the most common word: that word would be encoded very compactly, but you’d sort of take the 1 away from all the other codes, as you wouldn’t know whether 11 means the most common word twice or the word with code 11 (3 in decimal, so the 3rd most common word) once.
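To make the ambiguity concrete, here is a minimal Python sketch. The word lists and code tables are made up for illustration; `decodings` is a hypothetical helper that lists every way a bit string can be split into code words.

```python
def decodings(bits, codes):
    """Return every way `bits` can be split into code words from `codes`."""
    if not bits:
        return [[]]  # one way to decode nothing: the empty message
    results = []
    for word, code in codes.items():
        if bits.startswith(code):
            # Use this code for the first word, then decode the remainder.
            for rest in decodings(bits[len(code):], codes):
                results.append([word] + rest)
    return results

# Naive table: just count up in binary (illustrative words).
naive = {"the": "1", "of": "10", "and": "11"}
# Prefix-free table: no code is the beginning of another code.
prefix_free = {"the": "0", "of": "10", "and": "11"}

ambiguous = decodings("11", naive)          # "the the" or "and"
unambiguous = decodings("11", prefix_free)  # only "and"
print(ambiguous)
print(unambiguous)
```

With the naive table, 11 has two readings, while the prefix-free table leaves only one. That prefix-free property is what a Huffman code guarantees.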

Huffman coding essentially computes the optimal code assignment mathematically, based on how often each word occurs.
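As a rough sketch of how that works on whole words (assuming Python; the sample sentence and the function names `huffman_codes` and `decode` are just for illustration, and real compressors usually work on bytes rather than words): repeatedly merge the two least frequent groups, prefixing one side with 0 and the other with 1.

```python
import heapq
from collections import Counter

def huffman_codes(words):
    """Build a {word: bit string} table with the Huffman merging procedure."""
    counts = Counter(words)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (total count, tie breaker, {word: code built so far}).
    heap = [(n, i, {w: ""}) for i, (w, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)   # least frequent group
        n2, _, right = heapq.heappop(heap)  # second least frequent group
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]

def decode(bits, codes):
    """Read bits left to right; a prefix-free table never needs backtracking."""
    by_code = {c: w for w, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:
            out.append(by_code[current])
            current = ""
    return out

words = "the cat and the dog and the bird saw the cat".split()
codes = huffman_codes(words)
encoded = "".join(codes[w] for w in words)
print(codes)
print(len(encoded), "bits")
```

Common words like "the" end up with short codes and rare ones with longer codes, and because no code is the start of another, `decode(encoded, codes)` gets the original words back.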

The other main difference between your suggestion and most compression algorithms is that you wouldn’t use a huge dictionary in practice, as loading it and looking things up in it would be very slow. Most compression algorithms have a rather small dictionary built in and/or build one on the fly by looking at the data they want to compress.
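Both approaches can be tried with Python's standard `zlib` module, which implements DEFLATE (the method .zip files commonly use, combining back-references to earlier data with Huffman coding). This is only a sketch: the sample text and the small preset dictionary `common` are made up for illustration.

```python
import zlib

# 1) Dictionary built on the fly: repeated phrases are replaced by
#    references to earlier occurrences in the same data.
text = b"the quick brown fox jumps over the lazy dog. " * 50
packed = zlib.compress(text)
print(len(text), "->", len(packed), "bytes")

# 2) Small preset dictionary: both sides agree in advance on some bytes
#    that are likely to appear (an illustrative choice here).
common = b"the quick brown fox lazy dog jumps over"
message = b"the lazy fox jumps over the quick dog"

compressor = zlib.compressobj(zdict=common)
packed_msg = compressor.compress(message) + compressor.flush()

# The decompressor needs exactly the same preset dictionary.
decompressor = zlib.decompressobj(zdict=common)
restored = decompressor.decompress(packed_msg) + decompressor.flush()
print(restored == message)
```

The preset dictionary mostly helps with short messages, where there isn't enough data yet to learn repetitions from.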