• 1 Post
  • 1.13K Comments
Joined 5 months ago
Cake day: February 10, 2025

  • The problem with trying to increase the signal-to-noise ratio is that you don’t know all of the datapoints that are being collected, and some of those datapoints could be used to filter the real from the fake.

    Like, in your example, if you made all of these accounts from the same browser then they could be linked together. If they were made on the same IP, they could be linked together. If you were using the same phone, they could be linked together. Those are just the datapoints that we know to try to protect; it’s the datapoints that you don’t know about that get you.

    Like, maybe your phone or desktop is screenshotting itself every 5 seconds (“for AI purposes”) or maybe the app that you’re trying to fool also secretly sends your GPS location during account creation or maybe the adversary has malware running on your PC which is keylogging you.

    IF you knew all of the ways that they were collecting data on you, then you could take countermeasures. Since you don’t, you have to assume that any of your identities can be linked to your person unless you take unusual measures such as, at a minimum, not using Microsoft/Google/Meta/Amazon/etc. products. Depending on your security needs this could also mean things like using burner hardware, non-commercial VPNs, physically disabling sensors/radios/ports, traffic/network monitoring, etc.
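
    To make the linking concrete, here is a minimal sketch of how accounts that share any one datapoint can be chained into a single cluster. Everything in it is an assumption for illustration: the account names, the datapoint kinds, and the `link_accounts` helper are made up, and a real tracker would use far more signals than this.

```python
from collections import defaultdict


def link_accounts(observations):
    """Group accounts that share at least one observed datapoint.

    observations: dict mapping account name -> set of (kind, value) pairs.
    Returns a list of sets of account names, in a stable order.
    """
    parent = {name: name for name in observations}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # datapoint -> first account observed with it
    for name, points in observations.items():
        for point in points:
            if point in first_seen:
                union(first_seen[point], name)
            else:
                first_seen[point] = name

    groups = defaultdict(set)
    for name in observations:
        groups[find(name)].add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Hypothetical observations; none of this is real data.
observations = {
    "alice_real": {("ip", "203.0.113.7"), ("browser", "fp-a1")},
    "fake_one": {("ip", "198.51.100.2"), ("browser", "fp-a1")},
    "fake_two": {("ip", "198.51.100.2"), ("gps", "cell-42")},
    "stranger": {("ip", "192.0.2.55"), ("browser", "fp-zz")},
}
clusters = link_accounts(observations)
```

    Note that `fake_two` never shares a browser with `alice_real`, but it still lands in the same cluster because `fake_one` bridges them. One leaked datapoint you didn't know about is enough.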




  • Unless the person is using math terms elsewhere, I always assume people mean ‘unexpected’ when they say random.

    It’s not random in the sense of a uniform distribution which is what is implied by “generate a random [phone] number”.

    Yeah, true.

    There, I was speaking more to the top level comment’s statement that an LLM cannot generate random numbers. Random numbers are pretty core to how chatbots work… which is what I assumed they meant instead of the literal language model.

    You could say that they’re technically correct in that the actual model only produces a deterministic output vector for any given input. Randomness is added in the implementation of the chatbot software through the design choice of having the software treat the language model’s softmax’d output as a distribution from which it randomly chooses the next token.

    But, I’m assuming that the person isn’t actually making that kind of distinction because of the second sentence that they wrote.
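
    A rough sketch of that split between the deterministic model and the random sampling step. The `toy_model` function, its scores, and the tiny vocabulary are stand-ins I made up; a real language model produces one score per token in a vocabulary of many thousands of tokens.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


VOCAB = ["7", "3", "42", "1"]  # hypothetical four-token vocabulary


def toy_model(prompt):
    """Stand-in for the language model: same input, same output vector."""
    return [2.0, 1.0, 0.5, 0.1]  # made-up scores, one per VOCAB entry


def sample_next_token(prompt, rng):
    """The chatbot layer: treat the softmax output as a distribution."""
    probs = softmax(toy_model(prompt))
    return rng.choices(VOCAB, weights=probs, k=1)[0]


rng = random.Random(0)  # seeded pseudorandom generator
samples = [sample_next_token("pick a number", rng) for _ in range(1000)]
```

    The model part returns the same vector every time; all of the variation in `samples` comes from `rng`.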




  • Because they were told that Trump would protect them against all of the bad things that Trump’s backers were blasting into their brains 24/7 via social media and partisan ‘news’ organizations.

    Ignorance and disinformation did way more work than Trump’s charisma which is, as you’ve said, lacking.


  • How are they gonna trace that to you?

    The modern Internet is essentially about spying on you as much as possible and then selling the data to whoever wants to buy it. Linking identities with devices/browsers is worth a lot of money and so most every website/app has a way of linking you to the devices and software that you use.

    Unless the user took some pretty extreme measures to create the account, they’ve likely logged in from a phone/IP/browser that has been linked to their real identity at some point in its lifetime. That link will be sold to data brokers and used to tie the random handle to you, the person. Then the State Department just buys that information.

    Alternatively, you should be assuming that sovereign entities with the means are reading all public network data. There’s a lot of information that you can learn from that as well. Like, over time, the posts from the ‘random’ account could be strongly correlated to the times that you were accessing the site even if all of the data was encrypted with HTTPS.

    Alternatively, alternatively. There is a threat known as Store Now Decrypt Later (SNDL). The idea is basically: quantum computers are coming and they can break some cryptographic primitives. If someone saves all of the encrypted traffic that they would want to read, in a few years they will have the means to read that data. We won’t know when this moment occurs, because it’ll likely be a secret, but we do know that it will happen, and so you should additionally assume that anything which transited a public network without post-quantum encryption will be read and used to link you to your identities.

    This is, essentially, the core thing that the Privacy community is attempting to mitigate.
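
    For the timing point, here is a toy sketch of the kind of correlation an observer could run using only connection times, without decrypting anything. The timestamps, the 60-second slack, and the function name are all invented for illustration.

```python
def fraction_of_posts_during_sessions(post_times, sessions, slack=60):
    """Fraction of posts that fall within `slack` seconds of a session.

    post_times: public post timestamps (seconds) for the 'random' account.
    sessions: (start, end) times when one household was seen talking to
              the site; the traffic content itself stays encrypted.
    """
    if not post_times:
        return 0.0

    def covered(t):
        return any(start - slack <= t <= end + slack for start, end in sessions)

    return sum(covered(t) for t in post_times) / len(post_times)


# Made-up data: four posts and the observed sessions of two households.
post_times = [1000, 5000, 9000, 20000]
suspect_sessions = [(900, 1200), (4800, 5100), (8900, 9100), (19950, 20100)]
other_sessions = [(100, 300), (4900, 5050), (15000, 15500)]

suspect_score = fraction_of_posts_during_sessions(post_times, suspect_sessions)
other_score = fraction_of_posts_during_sessions(post_times, other_sessions)
```

    With only four posts this proves nothing, but over months of posts a score that stays near 1.0 for one connection and near chance for everyone else narrows things down quickly.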


  • No way, tokens are almost always sub-word length.

    Using larger tokens means that you need way more distinct tokens (a much bigger vocabulary) to represent the same data, and so encoders always learn to use short tokens unless they’re specifically forced not to.

    Just to put it in perspective, imagine that you were trying to come up with a system for writing down every phone number. The easiest system would be to have a vocabulary of 10 items (the digits); with such a vocabulary you can write down all phone numbers. Storing each entire phone number as a single ‘word’, on the other hand, would require a vocabulary of 10 billion items in order to write down all phone numbers.

    That’s why encoders learn to use the smallest token sizes possible.
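
    The arithmetic behind the phone number example, as a quick sketch. It assumes 10-digit phone numbers (which is where the 10 billion figure comes from) and only looks at chunk lengths that divide 10 evenly.

```python
DIGITS = 10        # symbols available per position: 0-9
PHONE_LENGTH = 10  # assumption: every phone number has 10 digits


def vocab_size(chunk_len):
    """Distinct 'words' needed if every token covers chunk_len digits."""
    return DIGITS ** chunk_len


def tokens_per_number(chunk_len):
    """Tokens needed to write one phone number (ceiling division)."""
    return -(-PHONE_LENGTH // chunk_len)


for chunk_len in (1, 2, 5, 10):
    print(
        f"{chunk_len:>2} digits/token: "
        f"vocabulary {vocab_size(chunk_len):>14,} items, "
        f"{tokens_per_number(chunk_len):>2} tokens per number"
    )
```

    Going from 1 digit per token to 10 digits per token shortens each number from 10 tokens to 1, but the vocabulary grows from 10 entries to 10,000,000,000.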

    LLMs can’t generate random numbers, but the process of selecting the next token involves picking a random next token (using a pseudorandom number generator) from the distribution of possible next tokens. The ‘Temperature’ setting alters how closely the random selection follows the distribution in the vector describing the next token.

    To take the extremes: on one end of the Temperature scale it always chooses the highest probability next token (essentially what the person you’re responding to is thinking happens) and on the other end of the scale it completely ignores the distribution and chooses a completely random token. The middle range is basically ‘how much do I want the distribution to affect my choice?’

    In the end, the choice of the next token is really random. What’s happening is that the LLM is predicting the distribution among all possible tokens so that the sentence fits into its model of how language works.
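
    Here is a minimal sketch of how a Temperature setting is commonly applied: divide the scores by the temperature before the softmax, then sample. The scores are made up, and treating a temperature of exactly 0 as "always pick the top token" is a convention I'm assuming here, not something every implementation does the same way.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Scale the scores by 1/temperature, then normalise to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for sampling")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng):
    """Pick the index of the next token at the given temperature."""
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four tokens
cold = softmax_with_temperature(logits, 0.05)    # almost all mass on token 0
mild = softmax_with_temperature(logits, 1.0)     # the model's own distribution
hot = softmax_with_temperature(logits, 1000.0)   # close to uniform

rng = random.Random(0)
greedy_choice = sample_index(logits, 0, rng)
```

    `cold` behaves like "always take the most likely token", `hot` is close to ignoring the model entirely, and `mild` is the middle ground where the distribution shapes the choice without fully deciding it.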