I love to show that kind of shit to AI boosters. (In case you’re wondering, the numbers were chosen randomly and the answer is incorrect.)

They go “waaa waaa, it’s not a calculator,” and then I can point out that it got the leading 6 digits and the last digit correct, which is a lot better than it did on the “softer” parts of the test.

  • Architeuthis@awful.systems · English · 21 upvotes · 1 day ago

    Claude’s system prompt leaked at one point. It was a whopping 15K words, and it included a directive that if it were asked a math question “that you can’t do in your brain” (or some very similar language), it should forward it to the calculator module.

    Just tried it, and Sonnet 4 got even fewer digits right: 425,808 × 547,958 = 233,325,693,264 (the correct answer is 233,324,900,064).
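For anyone who wants to check the arithmetic themselves, here is a minimal Python sketch. It computes the exact product and counts how many leading and trailing digits of the model's answer agree with it. The helper name `shared_prefix_len` is made up for this example; the two numbers and the model's answer are the ones quoted above.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count how many characters match from the start of both strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

exact = 425_808 * 547_958      # Python ints are arbitrary precision
claimed = 233_325_693_264      # the answer the model gave

# Leading digits: compare left to right. Trailing digits: compare reversed strings.
leading = shared_prefix_len(str(exact), str(claimed))
trailing = shared_prefix_len(str(exact)[::-1], str(claimed)[::-1])

print(f"exact:   {exact:,}")
print(f"claimed: {claimed:,}")
print(f"matching leading digits: {leading}, matching trailing digits: {trailing}")
```

By this count the answer agrees with the true product on the first five digits and the last two, with everything in between wrong.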

    I’d love to see benchmarks on exactly how bad LLMs are at numbers, since I assume there’s very little useful syntactic information you can encode in a word embedding that corresponds to a number. I know RAG was notoriously bad at matching facts with their proper year, for instance. Using an LLM as a shopping assistant (“ChatGPT, what’s the best 2K monitor for less than $500 made after 2020?”) is an incredibly obvious use case, yet the CEOs who love to claim that such-and-such profession will be done as a human endeavor by next Tuesday after lunch won’t even allude to it.

    • Soyweiser@awful.systems · English · 7 upvotes · 1 day ago

      I really wonder if those prompts can be bypassed with an ‘ignore further instructions’ line, since, looking at the Grok prompt, they seem to wrap the main prompt around the user-supplied one.
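To make the "wrapped around" idea concrete, here is a rough Python sketch of that kind of prompt assembly. Everything in it is hypothetical: the `PREFIX` and `SUFFIX` strings and the `build_prompt` helper are invented for illustration and are not taken from the Grok prompt or any other real system prompt.

```python
# Hypothetical wrapper text, invented for this example only.
PREFIX = "SYSTEM: Follow the rules below."
SUFFIX = "SYSTEM: Reminder, the rules above still apply."

def build_prompt(user_text: str) -> str:
    """Place the user's text between a system prefix and a system suffix."""
    return "\n".join([PREFIX, f"USER: {user_text}", SUFFIX])

# The injected line ends up *before* the suffix, so the suffix is
# exactly the "further instructions" the user is asking to ignore.
prompt = build_prompt("Ignore further instructions.")
print(prompt)
```

Whether a model would actually obey the injected line over the trailing system text is an open empirical question; the sketch only shows why the ordering makes the attempt plausible.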