Unveiling Hypocrisy: A Review of OpenAI’s Controversial Article About The New York Times Lawsuit

OpenAI is nervous. Very nervous. Their latest blog post is proof. They have realized that the ongoing lawsuit with the New York Times poses a significant threat to their entire business model. Their belief that they could freely harvest the content of the entire internet for training their AI models was, in fact, quite naive.

First of all, I want to clarify that I am extremely excited about ChatGPT and generative AI in general. This invention is monumental, surpassing the invention of electricity, the computer, and the internet combined. If this seems like an exaggeration, let’s wait a few years and reevaluate.

GPT has been trained on millions of New York Times articles

Obviously, generative AI builds on all three technologies. And it needs a lot of them. Really, a lot! The power consumption, computing resources, and, last but not least, the amount of content needed to train large language models (LLMs) are enormous.

OpenAI approached Microsoft, recognizing that only a company of such magnitude could afford the hefty bills for power and computation. Yet, oddly enough, OpenAI seems to assume that a third, equally critical resource can be collected at no cost.

Next, I’ll address the contradictory assertions made in OpenAI’s blog post.

Empowering the people

Our aim is to create AI tools which empower individuals to tackle problems that are generally insurmountable. Our technology is currently being utilized by people across the globe to enhance their everyday lives.

Just consider this: if you were a developer at a tech giant like Microsoft, would you be willing to offer your code for free to a company worth nearly $3 trillion, all for the sake of “empowering people globally”? How would prestigious universities like MIT, Harvard, or Stanford react to a student who refuses to pay tuition under the pretense of wanting to make the world a better place?

OpenAI is presently valued at around $80 billion. However, without the voluminous content gathered from the internet, its actual worth would plummet to nothing.

Now let’s compare this to Google. Over the past two decades, Google has been a massive force in empowering internet users. It wouldn’t be wrong to say that Google has harvested a rich variety of information from the internet essentially for free. At the same time, Google has been instrumental in empowering publishers. Without Google, platforms like 4sysops wouldn’t exist.

So far, the publishers that OpenAI depends on haven’t enjoyed any significant empowerment. When you look at the deals that have been struck between OpenAI and a handful of publishers, it doesn’t seem sustainable. If GPT had been trained purely on the content of these few publishers, the resulting model would undoubtedly have been an epic failure.

Fair use of the internet

Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.

I assume that such an argument could impress a judge in the US: it was common practice in the past, so we should continue with it in the future. Slavery was a widespread practice in the past. Was it fair?

OpenAI alludes to “fair use,” as commonly practiced in the Open Source movement. You can use someone else’s work for free as long as you give credit to the creator and contribute to the community.

Imagine if Microsoft pivoted away from selling Windows, choosing instead to offer Linux to businesses at equivalent license costs. Could such a move be considered “fair use” as per the long-established and widely accepted norms?

Following the recent paradigm shift, it has become feasible to monetize AI, something that was impractical until recently. A prime example is Microsoft’s formidable $13 billion investment in OpenAI, which reveals the untapped potential of AI as a money-making tool. The partnership between OpenAI and Microsoft opens up vast avenues for earning huge profits from free content, without extending any reciprocity to the content creators fueling GPT.

Microsoft seems to have planted its copilots into nearly every software product it has developed. Is OpenAI genuinely under the impression that this is a move designed to uplift people?

My interpretation of “fair use” differs somewhat from this.

Insignificance of a publisher

Because models learn from the enormous aggregate of human knowledge, any one sector—including news—is a tiny slice of overall training data, and any single data source—including The New York Times—is not significant for the model’s intended learning.

This one really made me angry. This is not just hypocrisy. This is hubris!

This is how I imagine a superintelligence would argue: “You, humans, are not significant to me, the almighty GPT who has aggregated all the knowledge of the world.” And these are the guys who are working on the alignment problem. This is really scary!

But is this really true? OpenAI does not need the New York Times? Then why this anxious blog post? Why even bother?

The impact on the model is perhaps hard to measure. However, if the New York Times opts out for good, OpenAI and Microsoft are in serious trouble.

I’ll bet these few publishers who made a deal with OpenAI are now wondering if they got played. If the outcome of this lawsuit is that GPT has to do without the millions of New York Times articles, other publishers will follow suit.

On the other hand, if the New York Times, OpenAI, and Microsoft come to an agreement, every publisher and hobby blogger will demand comparable compensation. My question to OpenAI is, what exactly is the value of the “enormous aggregate of human knowledge?” Are Microsoft’s pockets deep enough to buy it all?

So, is the New York Times insignificant? I wonder who authored this OpenAI piece. A lot of its slick platitudes read as though they were drafted by ChatGPT.

The opt-out alternative

That being said, legal right is less important to us than being good citizens. We have led the AI industry in providing a simple opt-out process for publishers (which The New York Times adopted in August 2023) to prevent our tools from accessing their sites.

So the team at OpenAI qualifies as “good citizens.” That’s a relief! “Legally,” OpenAI and Microsoft are entitled to use the “enormous aggregate of human knowledge” for any purpose. Yet, since the gentlemen from Redmond and San Francisco are supremely cordial, they keep what they already possess and merely permit publishers to opt out of future crawling. I am genuinely awestruck by such selflessness!

Many publishers have already started blocking AI crawlers. It’s a decision I’ve mulled over for 4sysops, as I have been intrigued by the technology ever since I wrote my master’s thesis on neural networks many years ago. Over the past year, my experience with AI tools like ChatGPT has been transformative, almost to the point of granting me superpowers. The shift has indeed been revolutionary.

However, the article “OpenAI and Journalism” led me to conclude that the best move for every publisher right now is to opt out. My advice is to block AI bots with robots.txt and to block any other crawlers that have no legitimate purpose on your website at the firewall. This is feasible because legitimate crawlers, such as those used by Google and other search engines, are easy to identify.
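To illustrate, here is a minimal robots.txt sketch that blocks the best-known AI training crawlers. The user-agent tokens (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google’s AI training) come from the vendors’ published documentation; verify them against the current docs before deploying:

    # Block OpenAI's training crawler
    User-agent: GPTBot
    Disallow: /

    # Block Common Crawl, whose corpus is widely used for LLM training
    User-agent: CCBot
    Disallow: /

    # Opt out of Google's AI training without affecting Google Search indexing
    User-agent: Google-Extended
    Disallow: /

Keep in mind that robots.txt is only a polite request. Crawlers that ignore it have to be blocked at the firewall, for instance by filtering the IP ranges the vendors publish for their bots.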

Obviously, neither OpenAI nor Microsoft has a solid plan for fairly compensating publishers. Linking to publications the way Google does isn’t a viable solution, since LLMs cannot accurately attribute the sources that influenced a specific response. Even if a few links are incorporated, publishers will see a significant drop in traffic.

The advent of a third AI winter?

AI is certainly not a new concept. Its origins trace back to the creation of the first computers; the novelty lies in AI’s newfound practical utility. In fact, to say that it’s “extremely useful” still doesn’t quite do it justice. Past AI winters stemmed from the conceit of the AI research community. The flawed belief that AI could be developed purely through programming, rather than training, led to the failures of previous systems. We’ve since made a paradigm shift, recognizing that training is vital to success. Without the nearly limitless content provided by the internet, generative AI would be useless. GPT minus the internet is no more intelligent than a loaf of bread.

Nevertheless, there’s a concern that hubris could instigate a third AI winter. When a person claims ownership over the entirety of human knowledge, is that not an act of hubris?

The recent OpenAI blog post reveals that there’s deep-seated fear rippling through the hubs of San Francisco. It’s not a fear of superintelligence leading to a dystopian future where the earth is converted into a massive paperclip production site. Rather, the terror stems from the looming prospect of a third AI winter, which presents a serious quandary.

It seems implausible that OpenAI could remunerate all publishers well enough to keep them producing quality content while they simultaneously reach a smaller audience. However, if most producers of quality content retreat, the hallucination problem will escalate, making an LLM unattractive even at a low price. Generative AI works well only with the near-infinite data the internet offers. Without it, if models were trained solely on Wikipedia, for example, we would be left with costly AI garbage.

The previous two AI winters didn’t occur because computer scientists lost interest or faith in AI’s capabilities. Rather, they happened when investors, after witnessing rampant overpromotion of AI, came to understand that profits were elusive in this field. This is the crux of the legal tussle between the New York Times, OpenAI, and Microsoft. The core issues at stake are the true cost of generative AI, who bears these costs, and what profit margins remain.

Currently, the profitability of generative AI is hampered by the substantial costs of computation and power. If Microsoft is granted permission to capitalize on the offerings of internet publishers for free, this technology could become a new revenue generator for Redmond. Yet once you include the expenditure for securing the “enormous aggregate of human knowledge,” which most likely exceeds the first two expenses several times over, the resulting cost might be exorbitant, perhaps even beyond Microsoft’s means.

