We're this close to getting a personal assistant with all, or at least much, of humankind's knowledge. We're also this close to permanently knee-capping it and losing out on an incredible future because of 1) OpenAI's greed and 2) everybody else's greed.
I don't like OpenAI's direction as it is now, but I also don't like what will happen if NYT wins this.
NYT has to win this, and I don't get the arguments in OpenAI's favor. Most of the ones I've heard rely on anthropomorphizing LLMs. There is plenty of public-domain knowledge to train on, and for what isn't public domain, why shouldn't a payment be made? We would still get the assistants; the only difference is that the cost would reflect the underlying work of the original creators, and thus VCs wouldn't be able to destroy a ton of industries the way they've done for the past decade already.
This would also allow more competition, since companies wouldn't have to accept that OpenAI downloaded the internet before it was locked down and will therefore always have the highest-quality data. Instead, all of these companies might start opening up their own APIs for training, allowing anyone with compute to train on a dataset similar to OpenAI's.
The argument, or at least one of the arguments, in OpenAI's favor is that the training is fair use because it is transformative. Is the resulting AI a replacement for the original work? I would argue it is not.
As I mentioned in an earlier comment, it is very easy to trick ChatGPT into posting paragraphs from books. They added some kind of exceptions to avoid showing copyrighted content, but it is still not impossible to reproduce an exact text. From my point of view, this is a clear case of copyright infringement.
Posting individual paragraphs verbatim is still OK. A few paragraphs are not a replacement for the whole book. Websites post extracts of books all the time (like the Google example in the article), and that doesn't clear the bar for copyright infringement.
They clearly don't think so: if I ask it to give me the first few paragraphs of the first Harry Potter book, it starts to produce a dump and then fails, saying it can't give that information. Clearly it's trained on the book, and its creators believe outputting it is iffy.
Yes, the plaintiffs disagree, but just because they disagree does not make them right. I simply explained why, even if it outputs whole paragraphs of the copyrighted work, it can still be considered fair use. If I cite a paragraph of an article in my YouTube video, I can claim fair use.
I do not know if the judge will see it this way, of course, and I am not a lawyer.
You're right, but what if I type prompts designed to make ChatGPT show me the full book, or at least big chunks of it? You agree that this is possible right now, and that they haven't figured out how to stop it, right?
The moment OpenAI learns of the loophole, they will patch it. Just as YouTube takes down infringing videos, OpenAI has to keep the bot from regurgitating the whole work.
OpenAI's business is not based on regurgitating the work. It is based on providing output derived from those works. Nobody buys an OpenAI subscription to get the AI to reproduce an existing book. People buy it to have OpenAI generate the next original book.
I get your point. I am reflecting on it, because in a way it tends to be stronger than mine, especially when discussing written work (books or, in the example of the article, NYT articles).
But what about the images? The Italian plumber who looks very much like Super Mario, mentioned in the article. How would a judge not punish a company selling visual representations of a character whose copyright is owned by someone else?
> But what about the images? The Italian plumber who looks very much like Super Mario
That does indeed look like a case where any regurgitation becomes a problem, in my opinion. I think the closest analogue is fan art. If I draw a picture of Mario for fun and post it online, it remains fair use as long as it's not commercial or used to promote a product (from my limited understanding). In the OpenAI case, they sell subscriptions to their service. You can therefore make the case that they are selling fan art for profit.
This leads me to think that companies that own this kind of IP have a more solid case against OpenAI. It is entirely possible that the New York Times loses its lawsuit but Nintendo wins if it sues.
Even the anthropomorphism argument doesn't hold up under close scrutiny. When I was in high school I was asked to memorize several poems, including a few that are under copyright today. If I regurgitate one of these poems and present it as my own, this clearly infringes copyright, even if I no longer recall where the poem came from or who wrote it.
How is what OpenAI is doing with NYT stories any different, other than the architecture and substrate of the neural network?
I'm all for policing the outputs of generative models and enforcing copyright on their usage.
I am very much against ruling that their training is infringement.
A model which uses old NYT articles to learn the relationship between words and concepts which turns around and is used to identify potentially falsified research papers for review should not be prevented from existing.
If the model is used to reproduce copyrighted material - by all means the person running it should be liable.
This would create an ML industry around copyright identification as a pre-filter applied before outputting (which, ironically, would itself require training on copyrighted material to enforce).
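To make the pre-filter idea concrete, here is a minimal sketch of one possible approach: comparing word n-grams of a model's candidate output against an index built from protected texts, and flagging high overlap as likely regurgitation. Everything here (function names, the n-gram size, the threshold, the toy corpus) is illustrative, not a description of any real system; production filters would need normalization, fuzzy matching, and a vastly larger index.

```python
# Toy regurgitation filter: flag candidate output whose word n-grams
# overlap heavily with an index of protected texts.

def ngrams(text, n=8):
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(protected_texts, n=8):
    """Union of all n-grams across the protected corpus."""
    index = set()
    for text in protected_texts:
        index |= ngrams(text, n)
    return index

def overlap_ratio(candidate, index, n=8):
    """Fraction of the candidate's n-grams that appear in the index."""
    grams = ngrams(candidate, n)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)

corpus = [
    "it was the best of times it was the worst of times "
    "it was the age of wisdom it was the age of foolishness"
]
index = build_index(corpus)

# A verbatim chunk scores high overlap and would be blocked...
verbatim = "it was the best of times it was the worst of times"
# ...while genuinely original text scores zero.
original = "a completely original sentence about something else entirely and unrelated"
print(overlap_ratio(verbatim, index))   # 1.0
print(overlap_ratio(original, index))   # 0.0
```

A threshold (say, block anything above 0.5) turns the score into a yes/no gate, which illustrates the irony in the comment above: the filter only works because it holds a copy of the protected material.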
How far are we from that right now? If you have an internet connection, you basically have access to 95%+ of the information available in the world. Is your goal to completely delegate thinking through that information? To what end?
Human brains have a limit on the number of things they can deal with simultaneously, often summarized as "the rule of seven, plus or minus two."
I don't know if you've seen the demo of Gemini 1.5 parsing a video with a 1M-token context length, but it does things few humans could.
The ability to take all that information and put it into an engine which can identify relationships between data with greater breadth and depth than any individual human will be unfathomably valuable to progress and advancement.
As a trivial example: there have been a number of different diets that have shown success for autoimmune conditions across meta-analyses. But many of the details in the diets are contradictory, such as one being very protein-heavy and another being vegetarian. How convenient would it be to ask a model what the common factors are across the half dozen diets that all seem to work?
One day soon it will be feasible for medical trials to do full genome sequencing for participants. Would it be convenient to have a model identify common genes for those where treatment was ineffective vs effective?