In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

ylai@lemmy.ml · 8 months ago

redditReallySucks@lemmy.dbzer0.com · 8 months ago

I hope this is gonna become a new meme template

driving_crooner@lemmy.eco.br · 8 months ago

She looks like she just talked to the waitress about a fake rule in eating nachos and got caught up by her date.

HACKthePRISONS@kolektiva.social · 8 months ago

this is incomprehensible to me. can you try it with two or three sentences?

driving_crooner@lemmy.eco.br · 8 months ago

Her date was eating all the fully loaded nachos, so she went up and asked the waitress to make up a rule about how one person cannot eat all the nachos with meat and cheese. But her date knew that rule was bullshit and called her out on it. She’s trying to look confused and sad because they’re going to be too soon for the movie.

Uninvited Guest@lemmy.ca · 8 months ago

What?! What the hell are you talking about?!

RatsOffToYa@lemmy.world · 8 months ago

Not sure what’s funnier: your first comment or the comment explaining it to someone who’s obviously not part of a turbo team

fjordbasa@lemmy.world · 8 months ago

Turbo team?? Did you replace my toilet with one that looks the same but has a joke hole? That’s just FOR FARTS??

RatsOffToYa@lemmy.world · 8 months ago

Look until you’re part of the turbo team… WALK SLOWLY

fjordbasa@lemmy.world · 8 months ago

Fine… I’ll lay down to be by myself and read my art books!

THCDenton@lemmy.world · 8 months ago

Plopp@lemmy.world · 8 months ago

Lmao that’s wonderful, scrolling down from those weird ass comments only to be greeted by my own exact facial expression.

Buttons@programming.dev · 8 months ago

“No… Hell no… Man, I believe you’d get your ass kicked if you said something like that…”

HACKthePRISONS@kolektiva.social · 8 months ago

thank you. it must be a reference to something, but i don’t watch tv any more.

datavoid@lemmy.ml · edit-2 8 months ago

I think you should leave…

(is what you would search to find this)

JWBananas@lemmy.world · 8 months ago

I’m sorry, what does this have to do with Coffin Flops. Does this mean it isn’t getting cancelled?

swab148@startrek.website · 8 months ago

I DIDN’T RIG SHIT!

squid_slime@lemmy.world · 8 months ago

Chatgpt, you okay? 😅

whoisearth@lemmy.ca · 8 months ago

Coffeezilla had a video in his void where he plays this back a few times. It’s hilarious seeing the guilt without stating it.

Fisk400@feddit.nu · 8 months ago

They know what they fed the thing. Not backing up their own training data would be insane. They are not insane, just thieves

Echo Dot@feddit.uk · 8 months ago

Everyone says this but the truth is copyright law has been unfit for purpose for well over 30 years now. When the laws were written no one expected something like the internet to ever come along and they certainly didn’t expect something like AI. We can’t just keep applying the same old copyright laws to new situations when they already don’t work.

I’m sure they did illegally obtain the work but is that necessarily a bad thing? For example they’re not actually making that content available to anyone so if I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that, so why is it unacceptable for them but fine for me?

oKtosiTe@lemmy.world · 8 months ago

if I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that

There are definitely people out there that think you should be arrested for that.

Echo Dot@feddit.uk · 8 months ago

Even the police are unsure if it’s actually a crime though. Crimes require someone to lose something and no one can point to a lost product so it’s difficult to really quantify.

And it’s not even technically breach of copyright since you’re not selling it.

exanime@lemmy.today · 8 months ago

But they ARE selling it … Every answer Chat GPT makes came from possibly stolen material

BoscoBear@lemmy.sdf.org · 8 months ago

Isn’t that true of every opinion you have. All the knowledge you have is based on works of others that came before you.

exanime@lemmy.today · 8 months ago

Not until I bill you for it

Also, no there is such a thing as an original thought or opinion… Even if it’s informed on other knowledge

There is a difference between reinterpreting other knowledge and just Frankensteining multiple works together

BoscoBear@lemmy.sdf.org · 8 months ago

I don’t know enough about LLMs but Neural networks are capable of original thought. I suspect LLMs are too because of their relationship to Neural Networks.

confusedbytheBasics@lemmy.world · 8 months ago

You’re using the word ‘stolen’ which doesn’t fit. It would be accurate to say 'every answer comes from possibly unlicensed material '.

Guntrigger@feddit.ch · 8 months ago

Allegedly possibly maybe accidentally whoopsie not quite licensed fully material.

exanime@lemmy.today · 8 months ago

Yeap, the real term (I think) would be copyright infringement

rottingleaf@lemmy.zip · 8 months ago

That is a bad thing if they want to be exempt from the law because they are doing a big, very important thing, and we shouldn’t.

The copyright laws are shit, but applying them selectively is orders of magnitude worse.

A_Very_Big_Fan@lemmy.world · 8 months ago

if I pirate a movie and then only I watch it, I don’t think anyone would really think I should be arrested for that, so why is it unacceptable for them but fine for me?

Because it’s more analogous to watching a video being broadcasted outdoors in the public, or looking at a mural someone painted on a wall, and letting it inform your creative works going forward. Not even recording it, just looking at it.

As far as we know, they never pirated anything. What we do know is it was trained on data that literally anybody can go out and look at yourself and have it inform your own work. If they’re out here torrenting a bunch of movies they don’t own or aren’t licencing, then the argument against them has merit. But until then, I think all of this is a bunch of AI hysteria over some shit humans have been doing since the first human created a thing.

StarPupil@ttrpg.network · 8 months ago

An AI (in its current form) isn’t a person drawing inspiration from the world around it, it’s a program made by people with inputs chosen by those people. If those people didn’t ask permission to use other people’s licensed work for their product, then they are plagiarising that work, and they should be subject to the same penalties that, for example, a game company using stolen art in their game should face. An AI doesn’t become inspired, it copies existing things to predict what it thinks its user wants to see. If we produce a real thinking AI at some point in the future, one with self determination and whatnot, the story will be different, but for now it isn’t.

A_Very_Big_Fan@lemmy.world · 8 months ago

What is web scraping if not gathering information from around the world? As long as you’re not distributing copyrighted content (and the models in question here don’t, btw), then fair use is at play. I’m not plagiarizing the news by reading it or by talking about what I learned, but I would be if I just copy/pasted my response from the article.

Reading publicly available data isn’t a copyright violation, and it certainly isn’t a violation of fair use. If it were, then you just plagiarized my comment by reading it before you responded.

exanime@lemmy.today · 8 months ago

Because the actual comparison is that you stole ALL movies, started your own Netflix with them and are lining up to literally make billions by taking the jobs of millions of people, including those you stole from

BoscoBear@lemmy.sdf.org · 8 months ago

I would say it is closer to watching all the movies, regardless of how you got them, then teaching a film class at UCLA.

A_Very_Big_Fan@lemmy.world · edit-2 8 months ago

If I paint a melty clock hanging off of a table, how have I stolen from Salvador Dali? What did I “steal” from Tolkien when I drew this?

you stole ALL movies, started your own Netflix with them

The model in question can’t even try to distribute copyrighted material. You could have easily checked for yourself, but once again I find myself having to do the footwork for you guys.

exanime@lemmy.today · 8 months ago

If you sell your melty clock, yes, it’s not “stealing” but you are violating copyright, that’s how it works

The “model in question” is a bit of a prototype, I thought it was clear we are talking about where these models are going… Maybe you’d get it if you came down off your high horse

A_Very_Big_Fan@lemmy.world · 8 months ago

Dali doesn’t own the concept of a melting clock. If I include a melting clock in my own work, as long as it’s not his melting clock with all the other elements of his painting, it’s fair use.

GPT hasn’t been a prototype since before 2018, and the copyright restrictions are only getting tighter every time it’s updated so idk what you’re on about.

GiveMemes@jlai.lu · 8 months ago

Ok but training an ai is not equivalent to watching a movie. It’s more like putting a game on one of those 300 games in one DS cartridges back in the day.

BoscoBear@lemmy.sdf.org · 8 months ago

I don’t think that is true. You aren’t reselling the movies. It is more like watching the movies then writing a recap or critique of the movies. Do you owe the copyright holder for doing that?

Gabu@lemmy.world · 8 months ago

The problem with that being?

GiveMemes@jlai.lu · 8 months ago

Obviously, it’s illegal to sell a product that’s using copyrighted material you don’t have the copyright to. This AI is not open source, it’s a for profit system.

A_Very_Big_Fan@lemmy.world · 8 months ago

It doesn’t, though. You could have easily checked yourself, but I guess I’ll do your research for you.

GiveMemes@jlai.lu · 8 months ago

It does though. You could have easily checked for yourself, but I guess I’ll do your research for you.

https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html

A_Very_Big_Fan@lemmy.world · edit-2 8 months ago

That article doesn’t even claim it’s distributing copyrighted material.

If that qualifies as distributing stolen copyrighted material, then this is stealing and distributing the “you shall not pass” LoTR scene. Which, again, ChatGPT won’t even do

A_Very_Big_Fan@lemmy.world · 8 months ago

deleted by creator

VirtualOdour@sh.itjust.works · edit-2 8 months ago

That’s really not how it works though, it’s a web crawler they’re not going to download the whole internet

And a reason they don’t is it would actually potentially be copyright infringement in some cases, whereas what they do legally isn’t (no matter how much people wish the law was set based on their emotions)

_haha_oh_wow_@sh.itjust.works · 8 months ago

Gee, seems like something a CTO would know. I’m sure she’s not just lying, right?

Bogasse@lemmy.ml · 8 months ago

And on the other hand it is a very obvious question to expect. If you have something to hide, how in the world are you not prepared for this question!? 🤡

VirtualOdour@sh.itjust.works · 8 months ago

It’s a question that is based on a purposeful misunderstanding of the technology, it’s like expecting a beekeeper to know each bee’s name and bedtime. Really it’s like asking a bricklayer where each brick in the pile came from. He can tell you the batch, but he’s not going to know this brick came from the fourth row of the sixth pallet, two from the left. There is no reason to remember that; it’s not important to anyone.

They don’t log it because it would take huge amounts of resources and gain nothing.

zaphod@lemmy.ca · edit-2 8 months ago

What?

Compiling quality datasets is enormously challenging and labour intensive. OpenAI absolutely knows the provenance of the data they train on as it’s part of their secret sauce. And there’s no damn way their CTO won’t have a broad strokes understanding of the origins of those datasets.

Guntrigger@feddit.ch · 8 months ago

[Citation needed]

Hotzilla@sopuli.xyz · 8 months ago

To be fair, these datasets are one of their biggest competitive edges. But saying “I cannot tell you” to an interviewer is not very nice, so you can take the American politician approach and say “I don’t know/remember”, which you can never be held accountable for.

phoneymouse@lemmy.world · 8 months ago

There is no way in hell it isn’t copyrighted material.

abhibeckert@lemmy.world · edit-2 8 months ago

Every video ever created is copyrighted.

The question is — do they need a license? Time will tell. This is obviously going to court.

Kazumara@feddit.de · 8 months ago

Don’t downvote this guy. He’s mostly right. Creative works have copyright protections from the moment they are created. The relevant question is indeed whether they have the relevant permissions for their use, not whether it had protections in the first place.

Maybe some surveillance camera footage is not sufficiently creative to get protections, but that’s hardly going to be good for machine reinforcement learning.

Buttons@programming.dev · 8 months ago

If I were the reporter my next question would be:

“Do you feel that not knowing the most basic things about your product reflects on your competence as CTO?”

ForgotAboutDre@lemmy.world · 8 months ago

Hilarious, but if the reporter asked this they would find it harder to get invites to events. Which is a problem for journalists. Unless you’re very well regarded for your journalism, you can’t push powerful people without risking your career.

Abnorc@lemm.ee · 8 months ago

That, and the reporter is there to get information, not mess with and judge people. Asking that sort of question is really just an attack. We can leave it to commentators and ourselves to judge people.

Aniki 🌱🌿@lemm.ee · edit-2 8 months ago

this is limp dick energy. If asking questions is an attack then you’re probably a piece of shit doing bad things.

tastysnacks@programming.dev · 8 months ago

no it isn’t. what answer to that question has any value to me as a reader?

Abnorc@lemm.ee · edit-2 8 months ago

Think about the answer you would actually get. They would dismiss the question or give some sort of nonsense answer. It’s a rhetorical question, and the only thing that it serves to do is criticize the person being asked. That’s not what reporters are there to do. If the answer would actually give some useful information to the reader, then it’s worth asking.

Aniki 🌱🌿@lemm.ee · 8 months ago

boofuckingwoo. Reporters are not supposed to be friends with the people they are writing about.

tb_@lemmy.world · 8 months ago

True, but if those same people they’re not supposed to be friends with are the ones inviting them to those events/granting them early access…

In other words: the system is rigged.

nifty@lemmy.world · 8 months ago

The system is rigged.

You cannot give the same criticism to a rich person vs. a poor person even if their incompetence is the same. I am not sure what’s the fix, other than the common refrain of “there should be no millionaires/billionaires”. How does society heal itself if you cannot hold people accountable?

Aniki 🌱🌿@lemm.ee · 8 months ago

Again - boofuckinghooo. Let the fuckers have no friends in the media. The media owners make journalists spineless advertisement sellers. I have very little respect for the profession at this point.

tb_@lemmy.world · 8 months ago

What a delightful and helpful attitude.

Deceptichum@sh.itjust.works · edit-2 8 months ago

booduckinghoo.

We’re sick and tired of this shit, it will never change if people make excuses for it.

MalachaiConstant@lemmy.world · 8 months ago

You’re missing the point that they need those relationships to gain access to sources. You literally cannot force people to talk to you

RatBin@lemmy.world · 8 months ago

Also about this line:

Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

No, I am not fine. When I wrote that stuff and that research in old phpbb forums, I did not do it with the knowledge of a future machine learning system eating it up without my consent. I never gave consent for that despite it being publicly available, because this is a kind of use that didn’t exist back then. Many other things are also publicly available, but some are copyrighted, on the same basis: you can publish and share content upon conditions that are defined by the creator of the content. What’s that, when I use zlibrary I am evil for pirating content but openai can do it just fine due to their huge wallets? Guess what, this will eventually create a crisis of trust, a tragedy of the commons if you will, when enough AI-generated content makes up the bulk of your future Internet search! Do we even want this?

CosmoNova@lemmy.world · edit-2 8 months ago

I almost want to believe they legitimately do not know nor care they’re committing a gigantic data and labour heist but the truth is they know exactly what they’re doing and they rub it under our noses.

laxe@lemmy.world · 8 months ago

Of course they know what they’re doing. Everybody knows this, how could they be the only ones that don’t?

Bogasse@lemmy.ml · 8 months ago

Yeah, the fact that AI progress just relies on “we will make so much money that no lawsuit will consequently alter our growth” is really infuriating. The fact that general audience apparently doesn’t care is even more infuriating.

A_Very_Big_Fan@lemmy.world · 8 months ago

Look guys! I’m stealing from Tolkien!

toddestan@lemmy.world · 8 months ago

I’d say not really, Tolkien was a writer, not an artist.

What you are doing is violating the trademark Middle-Earth Enterprises has on the Gandalf character.

A_Very_Big_Fan@lemmy.world · 8 months ago

The point was that I absorbed that information to inform my “art”, since we’re equating training with stealing.

I guess this would have been a better example lol. It’s clearly not Gandalf, but I wouldn’t have ever come up with it if I hadn’t seen that scene

Guntrigger@feddit.ch · 8 months ago

I don’t think anyone’s going to pay for your version of ChatGPT

stackPeek@lemmy.world · 8 months ago

This tells you so much about what kind of company OpenAI is

webghost0101@sopuli.xyz · 8 months ago

An Intelligence piracy company?

wabafee@lemmy.world · 8 months ago

Half open or half close?

jaemo@sh.itjust.works · 8 months ago

It also tells us how hypocritical we all are since absolutely every single one of us would make the same decisions they have if we were in their shoes. This shit was one bajillion percent inevitable; we are in a river and have been since we tilled soil with a plough in the Nile valley millennia ago.

adrian783@lemmy.world · 8 months ago

most of us would never be in their shoes because most of us are not sociopathic techbros

jaemo@sh.itjust.works · 8 months ago

I guess a lot of us didn’t learn from history, or even go see ‘Oppenheimer’…

whoisearth@lemmy.ca · 8 months ago

Speak for yourself. Were I in their shoes no I would not. But then again my company wouldn’t be as big as theirs for that reason.

BringMeTheDiscoKing@lemmy.ca · 8 months ago

Did they intentionally choose a picture where she looks like she’s morphing into Elon?

rab@lemmy.ca · 8 months ago

I was thinking Mads Mikkelsen

billwashere@lemmy.world · 8 months ago

Well after just finishing Death Stranding, I can’t unsee that.

BoscoBear@lemmy.sdf.org · 8 months ago

I suspect so. It is a very slanted article.

anon_8675309@lemmy.world · 8 months ago

CTO should definitely know this.

ItsMeSpez@lemmy.world · 8 months ago

They do know this. They’re avoiding any legal exposure by being vague.

blazeknave@lemmy.world · 8 months ago

I feel like at their scale, if there’s going to be a figurehead, marketable CTO, it’s going to be at this company. If not, you’re right, and she’s lying lol

turkishdelight@lemmy.ml · 8 months ago

Of course she knows it. She just doesn’t want to get sued.

andrew_bidlaw@sh.itjust.works · 8 months ago

Funny she didn’t talk it out with lawyers before that. That’s a bad way to answer that.

driving_crooner@lemmy.eco.br · 8 months ago

Or she talked and the lawyers told her to pretend ignorance.

QuaternionsRock@lemmy.world · 8 months ago

It probably means that they don’t scrape and preprocess training data in house. She knows they get it from a garden variety of underpaid contractors, but she doesn’t know the specific data sources beyond the stipulations of the contract (“publicly available or licensed”), and she probably doesn’t even know that for certain.

driving_crooner@lemmy.eco.br · 8 months ago

“Publicly available” can mean a lot of things. Is youtube publicly available? Is public broadcasting publicly available?

andrew_bidlaw@sh.itjust.works · 8 months ago

Maybe, but it sounds very weak.

anlumo@lemmy.world · 8 months ago

Lawyers aren’t PR people.

andrew_bidlaw@sh.itjust.works · 8 months ago

She didn’t even address them though.

IvanOverdrive@lemm.ee · 8 months ago

REPORTER: Where does your data come from?

CTO: Bitch, are you trying to get me sued?

TheObviousSolution@lemm.ee · 8 months ago

Then wipe it out and start again once you have where your data is coming from sorted out. Are we acting like you haven’t built datacenters packed full of NVIDIA processors just for this sort of retraining? They are choosing to build AI without proper sourcing, that’s not an AI limitation.

AutoTL;DR@lemmings.world · 8 months ago

This is the best summary I could come up with:

Mira Murati, OpenAI’s longtime chief technology officer, sat down with The Wall Street Journal’s Joanna Stern this week to discuss Sora, the company’s forthcoming video-generating AI.

It’s a bad look all around for OpenAI, which has drawn wide controversy — not to mention multiple copyright lawsuits, including one from The New York Times — for its data-scraping practices.

After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora’s training set.

But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.

Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply just didn’t know the answer, people have good reason to wonder where AI data — be it “publicly available and licensed” or not — is coming from.

The original article contains 667 words, the summary contains 178 words. Saved 73%. I’m a bot and I’m open source!

A_Very_Big_Fan@lemmy.world · 8 months ago

Funny how we have all this pissing and moaning about stealing, yet nobody ever complains about this bot actually lifting entire articles and spitting them back out without ads or fluff. I guess it’s different when you find it useful, huh?

I like the bot, but I mean y’all wanna talk about copyright violations? The argument against this bot is a hell of a lot more solid than just using data for training.

Guntrigger@feddit.ch · 8 months ago

Is this bot a closed system which is being used for profit? No, you know exactly what its source is (the single article it is condensing) and even has a handy link about how it is open source at the end of every single post.

A_Very_Big_Fan@lemmy.world · 8 months ago

It copied all of its text from the article, and it allows me to get all the information from it I want without providing that publisher with traffic or ad revenue. That’s not fair use.

I do like the bot, and personally I’d rather it stay, but no matter how you look at it this isn’t “fair use” of the article.

Guntrigger@feddit.ch · 8 months ago

Interesting take. In all of the defences of LLMs using copyrighted material it’s very often highlighted that “fair use” allows exactly such summaries of larger texts.

In reality, “fair use” is ruled on a case by case basis, so it’s impossible to judge whether something is or not without it going to court.

A_Very_Big_Fan@lemmy.world · 8 months ago

We’re not making legislation here, so we don’t have that level of burden of proof. But either way, when it comes to factors of fair use that every authority on the matter will list, it violates almost all of them.

It’s non-commercial, and it’s using facts rather than using a more creative work, so it’s got that going for it… But it’s

composed of 100% copied material
it’s not transformative
it’s substituting the original work
it uses officially published work
it specifically copies the “heart” of the work
it bypasses all of the ads and impacts their traffic/metrics so it has a financial impact on them.

It’s pretty obvious that there is no argument here. The factors that are violated the hardest and most indisputably are the ones that most authorities on the matter (including the one I linked) agree are the most important.

whoisearth@lemmy.ca · 8 months ago

So my work uses ChatGPT as well as all the other flavours. It’s getting really hard to stay quiet on all the moral quandaries being raised on how these companies are training their AI data.

I understand we all feel like we are on a speeding train that can’t be stopped or even slowed down but this shit ain’t right. We need to really start forcing businesses to have a moral compass.

RatBin@lemmy.world · 8 months ago

I spot a lot of people GPT-ing their way through personal notes and research. Where you used to see Evernote, Office, Word, or a note-taking app, you see a lot of GPT now. I feel weird about it.

Fedizen@lemmy.world · edit-2 8 months ago

this is why code AND cloud services shouldn’t be copyrightable or licensable without some kind of transparency legislation to ensure people are honest. Either forced open source or some kind of code review submission to a government authority that can be unsealed in legal disputes.