IvanOverdrive ,

REPORTER: Where does your data come from?

CTO: Bitch, are you trying to get me sued?

_haha_oh_wow_ ,
@_haha_oh_wow_@sh.itjust.works avatar

Gee, seems like something a CTO would know. I'm sure she's not just lying, right?

Bogasse ,
@Bogasse@lemmy.ml avatar

And on the other hand, it is a very obvious question to expect. If you have something to hide, how in the world are you not prepared for this question!? 🤡

Hotzilla ,

To be fair, these datasets are one of their biggest competitive edges. But telling the interviewer "I cannot tell you" is not very nice, so you can take the American politician approach and say "I don't know/remember", which you can never be held accountable for.

VirtualOdour ,

It's a question that is based on a purposeful misunderstanding of the technology; it's like expecting a beekeeper to know each bee's name and bedtime. Really it's like asking a bricklayer where each brick in the pile came from: he can tell you the batch, but he's not going to know this brick came from the fourth row of the sixth pallet, two from the left. There is no reason to remember that; it's not important to anyone.

They don't log it because it would take huge amounts of resources and gain nothing.

zaphod , (edited )
@zaphod@lemmy.ca avatar

What?

Compiling quality datasets is enormously challenging and labour-intensive. OpenAI absolutely knows the provenance of the data they train on, as it's part of their secret sauce. And there's no damn way their CTO won't have a broad-strokes understanding of the origins of those datasets.

Guntrigger ,

[Citation needed]

dezmd ,
@dezmd@lemmy.world avatar

LLM is just another iteration of Search. Search engines do the same thing. Do we outlaw search engines?

AliasAKA ,

Sora is a generative video model, not exactly a large language model.

But to answer your question: if all LLMs did was redirect you to where the content was hosted, then it would be a search engine. But instead they reproduce what someone else was hosting, which may include copyrighted material. So they’re fundamentally different from a simple search engine. They don’t direct you to the source, they reproduce a facsimile of the source material without acknowledging or directing you to it. Sora is similar. It produces video content, but it doesn’t redirect you to the similar video content that it is reproducing from. And we can argue about how close something needs to be to an existing artwork to count as a reproduction, but I think for AI models we should enforce citation models.

HaywardT ,

I think the question of how close does it have to be is the real question.

If I use similar lighting in my movie as was used in Citizen Kane do I owe a credit?

AliasAKA ,

I suppose that really depends. Are you making a reproduction of Citizen Kane, which includes cinematographic techniques? Then that’s probably a hard “gotta get a license if it’s under copyright”. Where it gets more tricky is something like reproducing media in a particular artistic style (say, a very distinctive drawing animation style). Like realistically you shouldn’t reproduce the marquee style of a currently producing artist just because you trained a model on it (most likely from YouTube clips of it, and without paying the original creator or even the reuploader [who hopefully is doing it in fair use]). But in any case, all of the above and questions of closeness and fair use are already part of the existing copyright legal landscape. That very question of how close does it have to be is at the core of all the major song infringement court battles, and those are between two humans. Call me a Luddite, but I think a generative model should be offered far less legal protection and absolutely not more legal protection for its output than humans are.

dezmd ,
@dezmd@lemmy.world avatar

How does a search engine know where to point you? It ingests all that data and processes it 'locally' on the search engine's systems using algorithms to organize the data for search. It's effectively the same dataset.

LLM is absolutely another iteration of Search, with natural language output for the same input data. Are you advocating against search engine data ingest as not fair use and a copyright violation as well?

You equate LLM to Intelligence, which it is not. It is an algorithmic search iteration with natural language responses, but that doesn't sound as cool as AI. It's neat, it's useful, and yes, it should cite the sourcing details (upon request), but it's not (yet?) a real intelligence and is equal to search in terms of fair use and copyright arguments.

AliasAKA ,

I never equated LLMs to intelligence. And indexing the data is not the same as reproducing the webpage or the content on a webpage. For you to get beyond a small snippet that held your query when you search, you have to follow a link to the source material. Now of course Google doesn’t like this, so they did that stupid AMP thing, which has its own issues, and I disagree with AMP as a general rule as well. So, LLMs can look at the data, I just don’t think they can reproduce that data without attribution (or payment to the original creator). Perplexity.ai is a little better in this regard because it does link back to sources and is attempting to be a search-engine-like entity. But OpenAI, in almost all cases, does not.

HaywardT ,

Why do you say it is not intelligence? It seems to meet all the requirements of any definition I can find.

dantheclamman ,
@dantheclamman@lemmy.world avatar

I feel conflicted about the whole thing. Technically it's a model. I don't feel that people should be able to sue me as a scientist for making a model based on publicly available data. I myself am merely trying to use the model itself to explain stuff about the world. But OpenAI are also selling access to the outputs of the model, that can very closely approximate the intellectual property of people. Also, most of the training data was accessed via scraping and other gray market methods that were often explicitly violating the TOU of the various places they scraped from. So it all is very difficult to sort through ethically.

Akisamb ,

Don't know why you are downvoted; it's a good question.

As a matter of fact, it almost happened for search engines in France. Newspapers argued that snippets were leading people to not visit their ad-infested sites, thus losing them revenue.

https://techcrunch.com/2020/04/09/frances-competition-watchdog-orders-google-to-pay-for-news-reuse/

TheObviousSolution ,

Then wipe it out and start again once you have sorted out where your data is coming from. Are we acting like you haven't built datacenters packed full of NVIDIA processors just for this sort of retraining? They are choosing to build AI without proper sourcing; that's not an AI limitation.

ZILtoid1991 ,

I have a feeling that the training material involves cheese pizza...

andrew_bidlaw ,
@andrew_bidlaw@sh.itjust.works avatar

Funny she didn't talk it out with lawyers before that. That's a bad way to answer that.

driving_crooner ,
@driving_crooner@lemmy.eco.br avatar

Or she did talk to them, and the lawyers told her to feign ignorance.

andrew_bidlaw ,
@andrew_bidlaw@sh.itjust.works avatar

Maybe, but it sounds very weak.

anlumo ,

Lawyers aren’t PR people.

andrew_bidlaw ,
@andrew_bidlaw@sh.itjust.works avatar

She didn't even address them though.

QuaternionsRock ,

It probably means that they don’t scrape and preprocess training data in house. She knows they get it from a garden variety of underpaid contractors, but she doesn’t know the specific data sources beyond the stipulations of the contract (“publicly available or licensed”), and she probably doesn’t even know that for certain.

driving_crooner ,
@driving_crooner@lemmy.eco.br avatar

"Publicly available" can mean a lot of things. Is YouTube publicly available? Is public broadcasting publicly available?

phoneymouse ,

There is no way in hell it isn’t copyrighted material.

abhibeckert ,

Every video ever created is copyrighted.

The question is — do they need a license? Time will tell. This is obviously going to court.

iknowitwheniseeit ,

There are definitely non-copyrighted videos! Both old videos (all still black and white, I think) and also things released into the public domain by copyright holders.

But for sure that's a very small subset of videos.

Kazumara ,

Don't downvote this guy. He's mostly right. Creative works have copyright protections from the moment they are created. The relevant question is indeed whether they have the relevant permissions for their use, not whether the works had protections in the first place.

Maybe some surveillance camera footage is not sufficiently creative to get protections, but that's hardly going to be good for machine reinforcement learning.
