Welcome to Incremental Social! Learn more about this project here!
Check out lemmyverse to find more communities to join from here!

RatBin ,

Obviously nobody fully knows where so much training data come from. They used Web scraping tool like there's no tomorrow before, with that amount if informations you can't tell where all the training material come from. Which doesn't mean that the tool is unreliable, but that we don't truly why it's that good, unless you can somehow access all the layers of the digital brains operating these machines; that isn't doable in closed source model so we can only speculate. This is what is called a black box and we use this because we trust the output enough to do it. Knowing in details the process behind each query would thus be taxing. Anyway...I'm starting to see more and more ai generated content, YouTube is slowly but surely losing significance and importance as I don't search informations there any longer, ai being one of the reasons for this.

Fedizen ,

this is why code AND cloud services shouldn't be copyrightable or licensable without some kind of transparency legislation to ensure people are honest. Either forced open source or some kind of code review submission to a government authority that can be unsealed in legal disputes.

turkishdelight ,

what's wrong with her face?

girl ,

she grimaced?

GiddyGap ,

It's an AI.

qaz ,

They use awkward stills to generate clicks

It's annoying and distracting, just like the headline.

whoisearth ,
@whoisearth@lemmy.ca avatar

So my work uses ChatGPT as well as all the other flavours. It's getting really hard to stay quiet on all the moral quandaries being raised on how these companies are training their AI data.

I understand we all feel like we are on a speeding train that can't be stopped or even slowed down but this shit ain't right. We need to really start forcing businesses to have moral compass.

RatBin ,

I spot aot of people GPT-eing their way through personale notes and researches. Whereas you used to see Evernote, office, word, note taking app you see a lot of gpt now. I feel weird about it.

IvanOverdrive ,

REPORTER: Where does your data come from?

CTO: Bitch, are you trying to get me sued?

Buttons ,
@Buttons@programming.dev avatar

If I were the reporter my next question would be:

"Do you feel that not knowing the most basic things about your product reflects on your competence as CTO?"

ForgotAboutDre ,

Hilarious, but if the reporter asked this they would find it harder to get invites to events. Which is a problem for journalists. Unless your very well regarded for your journalism, you can't push powerful people without risking your career.

aniki ,

boofuckingwoo. Reporters are not supposed to be friends with the people they are writing about.

tb_ ,
@tb_@lemmy.world avatar

True, but if those same people they're not supposed to be friends with are the ones inviting them to those events/granting them early access...

In other words: the system is rigged.

aniki ,

Again - boofuckinghooo. Let the fuckers have no friends in the media. The media owners make journalists spinless advertisement sellers. I have very little respect for the profession at this point.

tb_ ,
@tb_@lemmy.world avatar

What a delightful and helpful attitude.

Deceptichum ,
@Deceptichum@sh.itjust.works avatar

booduckinghoo.

We’re sick and tired of this shit, it will never change if people make excuses for it.

MalachaiConstant ,

You're missing the point that they need those relationships to gain access to sources. You literally cannot force people to talk to you

nifty ,
@nifty@lemmy.world avatar

The system is rigged.

You cannot give the same criticism to a rich person vs. a poor person even if their incompetence is the same. I am not sure what’s the fix, other than the common refrain of “there should be no millionaires/billionaires”. How does society heal itself if you cannot hold people accountable?

Abnorc ,

That, and the reporter is there to get information, not mess with and judge people. Asking that sort of question is really just an attack. We can leave it to commentators and ourselves for judge people.

aniki ,

this is limp dick energy. If asking questions is an attack then you're probably a piece of shit doing bad things.

tastysnacks ,

no it isn't. what answer to that question has any value to me as a reader?

Abnorc ,

Think about the answer you would actually get. They would dismiss the question or give some sort of nonsense answer. It's a rhetorical question, and the only thing that it serves to do is criticize the person being asked. That's not what reporters are there to do. If the answer would actually give some useful information to the reader, then it's worth asking.

RatBin ,

Also about this line:

Others, meanwhile, jumped to Murati's defense, arguing that if you've ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

No I am not fine. When I wrote that stuff and those researches in old phpbb forums I did not do it with the knowledge of a future machine learning system eating it up without my consent. I never gave consent for that despite it being publicly available, because this would be a designation of use that wouldn't exist back than. Many other things are also publicly available, but some a re copyrighted, on the same basis: you can publish and share content upon conditions that are defined by the creator of the content. What's that, when I use zlibrary I am evil for pirating content but openai can do it just fine due to their huge wallets? Guess what, this will eventually creating a crisis of trust, a tragedy of the commons if you will when enough ai generated content will build the bulk of your future Internet search! Do we even want this?

TheObviousSolution ,

Then wipe it out and start again once you have where your data is coming from sorted out. Are we acting like you having built datacenter pack full of NVIDIA processors just for this sort of retraining? They are choosing to build AI without proper sourcing, that's not an AI limitation.

PanArab ,

So plagiarism?

HaywardT ,

I don't think so. They aren't reproducing the content.

I think the equivalent is you reading this article, then answering questions about it.

myrrh , (edited )

...with the prevalence of clickbaity bottom-feeder news sites out there, i've learned to avoid TFAs and await user summaries instead...

(clicks through)

...yep, seven nine ads plus another pop-over, about 15% of window real estate dedicated to the actual story...

neptune ,

The issue is that the LLMs do often just verbatim spit out things they plagiarized form other sources. The deeper issue is that even if/when they stop that from happening, the technology is clearly going to make most people agree our current copyright laws are insufficient for the times.

A_Very_Big_Fan ,

The model in question, plus all of the others I've tried, will not give you copyrighted material

neptune ,

That's one example, plus I'm talking generally why this is an important question for a CEO to answer and why people think generally LLMs may infringe on copyright, be bad for creative people

A_Very_Big_Fan , (edited )

I'm talking generally why this is an important question for a CEO to answer ...

Right, which your only evidence for is "LLMs do often just verbatim spit out things they plagiarized form other sources" and that they aren't trying to prevent this from happening.

Which is demonstrably false, and I'll demonstrate it with as many screenshots/examples you want. You're just wrong about that (at least about GPT). You can also demonstrate it yourself, and if you can prove me wrong I'll eat my shoe.

neptune ,

https://archive.is/nrAjc

Yep here you go. It's currently a very famous lawsuit.

A_Very_Big_Fan , (edited )

I already talked about that lawsuit here (with receipts) but the long and short of it is, it's flimsy. There's blatant lies, exactly half of their examples omit the lengths they went to for the output they allegedly got or any screenshots as evidence it happened at all, and none of the output they allegedly got was behind a paywall.

Also, using their prompts word for word doesn't give the output they claim they got. Maybe it did in the past, idk, but I've never been able to do it for any copyrighted text personally, and they've shown that they're committed to not letting that stuff happen.

neptune ,

OK but this is why people give a shit when a CEO is cagey about how their magic box works

A_Very_Big_Fan ,

Idk why this is such an unpopular opinion. I don't need permission from an author to talk about their book, or permission from a singer to parody their song. I've never heard any good arguments for why it's a crime to automate these things.

I mean hell, we have an LLM bot in this comment section that took the article and spat 27% of it back out verbatim, yet nobody is pissing and moaning about it "stealing" the article.

MostlyGibberish ,

Because people are afraid of things they don't understand. AI is a very new and very powerful technology, so people are going to see what they want to see from it. Of course, it doesn't help that a lot of people see "a shit load of cash" from it, so companies want to shove it into anything and everything.

AI models are rapidly becoming more advanced, and some of the new models are showing sparks of metacognition. Calling that "plagiarism" is being willfully ignorant of its capabilities, and it's just not productive to the conversation.

A_Very_Big_Fan ,

True

Of course, it doesn't help that a lot of people see "a shit load of cash" from it, so companies want to shove it into anything and everything.

And on a similar note to this, I think a lot of what it is is that OpenAI is profiting off of it and went closed-source. Lemmy being a largely anti-capitalist and pro-open-source group of communities, it's natural to have a negative gut reaction to what's going on, but not a single person here, nor any of my friends that accuse them of "stealing" can tell me what is being stolen, or how it's different from me looking at art and then making my own.

Like, I get that the technology is gonna be annoying and even dangerous sometimes, but maybe let's criticize it for that instead of shit that it's not doing.

MostlyGibberish ,

I can definitely see why OpenAI is controversial. I don't think you can argue that they didn't do an immediate heel turn on their mission statement once they realized how much money they could make. But they're not the only player in town. There are many open source models out there that can be run by anyone on varying levels of hardware.

As far as "stealing," I feel like people imagine GPT sitting on top of this massive collection of data and acting like a glorified search engine, just sifting through that data and handing you stuff it found that sounds like what you want, which isn't the case. The real process is, intentionally, similar to how humans learn things. So, if you ask it for something that it's seen before, especially if it's seen it many times, it's going to know what you're talking about, even if it doesn't have access to the real thing. That, combined with the fact that the models are trained to be as helpful as they possibly can be, means that if you tell it to plagiarize something, intentionally or not, it probably will. But, if we condemned any tool that's capable of plagiarism without acknowledging that they're also helpful in the creation process, we'd still be living in caves drawing stick figures on the walls.

Mnemnosyne ,

One problem is people see those whose work may no longer be needed or as profitable, and...they rush to defend it, even if those same people claim to be opposed to capitalism.

They need to go 'yes, this will replace many artists and writers...and that's a good thing because it gives everyone access to being able to create bespoke art for themselves.' but at the same time realize that while this is a good thing, it also means the need for societal shift to support people outside of capitalism is needed.

MostlyGibberish ,

it also means the need for societal shift to support people outside of capitalism is needed.

Exactly. This is why I think arguing about whether AI is stealing content from human artists isn't productive. There's no logical argument you can really make that a theft is happening. It's a foregone conclusion.

Instead, we need to start thinking about what a world looks like where a large portion of commercially viable art doesn't require a human to make it. Or, for that matter, what does a world look like where most jobs don't require a human to do them? There are so many more pressing and more interesting conversations we could be having about AI, but instead we keep circling around this fundamental misunderstanding of what the technology is.

Hawk ,

What you're giving as examples are legitimate uses for the data.

If I write and sell a new book that's just Harry Potter with names and terms switched around, I'll definitely get in trouble.

The problem is that the data CAN be used for stuff that violates copyright. And because of the nature of AI, it's not even always clear to the user.

AI can basically throw out a Harry Potter clone without you knowing because it's trained on that data, and that's a huge problem.

A_Very_Big_Fan , (edited )

Out of curiosity I asked it to make a Harry Potter part 8 fan fiction, and surprisingly it did. But I really don't think that's problematic. There's already an insane amount of fan fiction out there without the names swapped that I can read, and that's all fair use.

I mean hell, there are people who actually get paid to draw fictional characters in sexual situations that I'm willing to bet very few creators would prefer to exist lol. But as long as they don't overstep the bounds of fair use, like trying to pass it off as an official work or submit it for publication, then there's no copyright violation.

The important part is that it won't just give me the actual book (but funnily enough, it tried lol). If I meet a guy with a photographic memory and he reads my book, that's not him stealing it or violating my copyright. But if he reproduces and distributes it, then we call it stealing or a copyright violation.

A_Very_Big_Fan ,

I just realized I misread what you said, so that wasn't entirely relevant to what you said but I think it still stands so ig I won't delete it.

But I asked both GPT3.5 and GPT4 to give me Harry Potter with the names and words changed, and they can't do that either. I can't speak for all models, but I can at least say the two owned by the people this thread was about won't do that.

Linkerbaan ,
@Linkerbaan@lemmy.world avatar

Actually neural networks verbatim reproduce this kind of content when you ask the right question such as "finish this book" and the creator doesn't censor it out well.

It uses an encoded version of the source material to create "new" material.

HaywardT ,

Sure, if that is what the network has been trained to do, just like a librarian will if that is how they have been trained.

Linkerbaan ,
@Linkerbaan@lemmy.world avatar

Actually it's the opposite, you need to train a network not to reveal its training data.

“Using only $200 USD worth of queries to ChatGPT (gpt-3.5- turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”

The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet.

HaywardT ,

Interesting article. It seems to be about a bug, not a designed behavior. It also says it exposes random excerpts from books and other training data.

Linkerbaan ,
@Linkerbaan@lemmy.world avatar

It's not designed to do that because they don't want to reveal the training data. But factually all neural networks are a combination of their training data encoded into neurons.

When given the right prompt (or image generation question) they will exactly replicate it. Because that's how they have been trained in the first place. Replicating their source images with as little neurons as possible, and tweaking them when it's not correct.

HaywardT ,

That is a little like saying every photograph is a copy of the thing. That is just factually incorrect. I have many three layer networks that are not the thing they were trained on. As a compression method they can be very lossy and in fact that is often the point.

dezmd ,
@dezmd@lemmy.world avatar

LLM is just another iteration of Search. Search engines do the same thing. Do we outlaw search engines?

AliasAKA ,

SoRA is a generative video model, not exactly a large language model.

But to answer your question: if all LLMs did was redirect you to where the content was hosted, then it would be a search engine. But instead they reproduce what someone else was hosting, which may include copyrighted material. So they’re fundamentally different from a simple search engine. They don’t direct you to the source, they reproduce a facsimile of the source material without acknowledging or directing you to it. SoRA is similar. It produces video content, but it doesn’t redirect you to finding similar video content that it is reproducing from. And we can argue about how close something needs to be to an existing artwork to count as a reproduction, but I think for AI models we should enforce citation models.

HaywardT ,

I think the question of how close does it have to be is the real question.

If I use similar lighting in my movie as was used in Citizen Kane do I owe a credit?

AliasAKA ,

I suppose that really depends. Are you making a reproduction of Citizen Kane, which includes cinematographic techniques? Then that’s probably a hard “gotta get a license if it’s under copyright”. Where it gets more tricky is something like reproducing media in a particular artistic style (say, a very distinctive drawing animation style). Like realistically you shouldn’t reproduce the marquee style of a currently producing artist just because you trained a model on it (most likely from YouTube clips of it, and without paying the original creator or even the reuploader [who hopefully is doing it in fair use]). But in any case, all of the above and questions of closeness and fair use are already part of the existing copyright legal landscape. That very question of how close does it have to be is at the core of all the major song infringement court battles, and those are between two humans. Call me a Luddite, but I think a generative model should be offered far less legal protection and absolutely not more legal protection for its output than humans are.

dezmd ,
@dezmd@lemmy.world avatar

How does a search engine know where to point you? It injests all that data and processes it 'locally' on the search engines systems using algorithms to organize the data for search. It's effectively the same dataset.

LLM is absolutely another iteration of Search, with natural language ouput for the same input data. Are you advocating against search engine data injest as not fair use and copyright violations as well?

You equate LLM to Intelligence which it is not. It is algorithmic search interation with natural language responses, but that doesn't sound as cool as AI. It's neat, it's useful, and yes, it should cite the sourcing details (upon request), but it's not (yet?) a real intelligence and is equal to search in terms of fair use and copyright arguments.

AliasAKA ,

I never equated LLMs to intelligence. And indexing the data is not the same as reproducing the webpage or the content on a webpage. For you to get beyond a small snippet that held your query when you search, you have to follow a link to the source material. Now of course Google doesn’t like this, so they did that stupid amp thing, which has its own issues and I disagree with amp as a general rule as well. So, LLMs can look at the data, I just don’t think they can reproduce that data without attribution (or payment to the original creator). Perplexity.ai is a little better in this regard because it does link back to sources and is attempting to be a search engine like entity. But OpenAI is not in almost all cases.

HaywardT ,

Why do you say it is not intelligence? It seems to meet all the requirements of any definition I can find.

dantheclamman ,
@dantheclamman@lemmy.world avatar

I feel conflicted about the whole thing. Technically it's a model. I don't feel that people should be able to sue me as a scientist for making a model based on publicly available data. I myself am merely trying to use the model itself to explain stuff about the world. But OpenAI are also selling access to the outputs of the model, that can very closely approximate the intellectual property of people. Also, most of the training data was accessed via scraping and other gray market methods that were often explicitly violating the TOU of the various places they scraped from. So it all is very difficult to sort through ethically.

Akisamb ,

Don't know why you are down voted it's a good question.

As a matter of fact it almost happened for search engines in France. Newspaper's argued that snippets were leading people to not go into their ad infested sites thus losing them revenue.

https://techcrunch.com/2020/04/09/frances-competition-watchdog-orders-google-to-pay-for-news-reuse/

Bleach7297 ,
@Bleach7297@lemmy.ca avatar

Did they intentionally chose a picture where she looks like she's morphing into Elon?

HaywardT ,

I suspect so. It is a very slanted article.

rab ,
@rab@lemmy.ca avatar

I was thinking mads mikkelssen

billwashere ,

Well after just finishing Death Stranding, I can’t unsee that.

ZILtoid1991 ,

I have a feeling that the training material involves cheese pizza...

anon_8675309 ,

CTO should definitely know this.

blazeknave ,

I feel like at their scale, if there's going to be a figure head marketable CTO, it's going to be this company. If not, you're right, and she's lying lol

ItsMeSpez ,

They do know this. They're avoiding any legal exposure by being vague.

turkishdelight ,

Of course she knows it. She just doesn't want to get sued.

Gakomi ,

Any company CEO does not know shit that goes on in the dev department so her answer does not surprise me, ask the Devs or the team leader in charge of the project. The CEO is only there to make sure the company makes money as he and the share holders only care about money!

TimeNaan ,

She's CTO not CEO. She absolutely should know the answer.

sunbeam60 ,

She knows the answer. She doesn’t the legal status of the answer, so she blanks. Been there before, I’ve got some sympathy for being in the limelight and being asked a tough question.

As my media trainer said, if you aren’t willing to discuss a subject, make it a condition of the interview. Once the camera rolls, declining to answer seems incredibly suspect.

Gakomi ,

She should but she does not as I mention in another post anyone at team leader or above in all the companies that I work so far bearly had any technical skill and didn't have any idea about this shit, only some bits and pieces that they got through some documentation that the dev team made. They had some vague idea of how our infrastructure works but that about it.

overload ,

Chief Technology Officer, not CEO

Gakomi ,

So you mean another person that has no idea because is higher up on the chain of command that all he/she cares about is how to make more money ? Seriously in any company I worked untill not everyone at the level of management or above had mostly no idea about this shit and most of them I have no idea how they got in those positions as they have close to 0 technical skill! And the speeches that those people do are made by people that again are not part of the infrastructure or development team. I do find this disturbing as hell but at this point it's also what I expect to happend as I only seen this shit.

phoneymouse ,

There is no way in hell it isn’t copyrighted material.

abhibeckert ,

Every video ever created is copyrighted.

The question is — do they need a license? Time will tell. This is obviously going to court.

iknowitwheniseeit ,

There are definitely non copyrighted videos! Both old videos (all still black and white I think) and also things released into the public domain by copyright holders.

But for sure that's a very small subset of videos.

Kazumara ,

Don't downvote this guy. He's mostly right. Creative works have copyright protections from the moment they are created. The relevant question is indeed if they have the relevant permissions for their use, not wether it had protections in the first place.

Maybe some surveillance camera footage is not sufficiently creative to get protections, but that's hardly going to be good for machine reinforcement learning.

stackPeek ,
@stackPeek@lemmy.world avatar

This tellls you so much what kind of company OpenAI is

wabafee ,
@wabafee@lemmy.world avatar

Half open or half close?

webghost0101 ,

An Intelligence piracy company?

jaemo ,

It also tells us how hypocritical we all are since absolutely every single one of us would make the same decisions they have if we were in their shoes. This shit was one bajillion percent inevitable; we are in a river and have been since we tilled soil with a plough in the Nile valley millennia ago.

adrian783 ,

most of us would never be in their shoes because most of us are not sociopathic techbros

jaemo ,

I guess a lot of us didn't learn from history, or even go see 'Oppenheimer'...

whoisearth ,
@whoisearth@lemmy.ca avatar

Speak for yourself. Were I in their shoes no I would not. But then again my company wouldn't be as big as theirs for that reason.

  • All
  • Subscribed
  • Moderated
  • Favorites
  • technology@lemmy.world
  • random
  • incremental_games
  • meta
  • All magazines