Companies are taking drastic measures to stop AI systems from scraping their text for training purposes. This has sparked a fierce battle between content-rich websites and AI developers, who need vast amounts of text to improve their models. Major sites like X (formerly Twitter) and Reddit have implemented strict rate limits and blocking rules to fend off these bots, while companies like Cloudflare offer tools to block all known AI scrapers. Meanwhile, some publishers are suing AI giants for unauthorized use of their content, making this an intense and ongoing struggle over who gets to use online data.
![Data Wars: Inside The Battle To Protect Online Content From AI 1](https://i0.wp.com/greatgameindia.com/wp-content/uploads/2024/07/image-15-32.jpg?resize=800%2C393&ssl=1)
Many companies have taken significant precautions to prevent scrapers from harvesting their content.
As reported by The Independent, it is the newest front in an ongoing and escalating struggle between websites that publish content for people to read and AI companies that want to exploit it to build new tools.
The rise of artificial intelligence has led many organizations to try to train newer and smarter AI systems. However, the large language models that power many of them, such as ChatGPT, require enormous quantities of text for training.
As a result, some businesses have begun scraping text from across the internet to feed into their training systems. That has frustrated the owners of text-based websites, who complain not only that these firms lack authorization to use their data, but also that the constant crawling slows their sites down.
Elon Musk, for example, has repeatedly claimed that X, formerly Twitter, receives a significant share of its traffic from such scraping systems. X is one of several sites that have implemented severe "rate limiting" rules to stop bots from reloading the site too frequently, though critics have alleged that the limits have also been used to conceal problems with X's supposedly ailing website.
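Rate limiting of this kind is often implemented with a token-bucket scheme: each client gets a small allowance of requests that refills over time, and anything beyond it is rejected. The sketch below is a minimal illustration of that general approach, not X's or Reddit's actual configuration; the class name and parameters are invented for the example.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: a client may burst up to
    `capacity` requests, with the allowance refilling at `rate`
    tokens per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # Over the limit: request rejected.

# A real site would keep one bucket per client IP or account.
bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(10)]
# The first 5 rapid requests pass; the rest are throttled.
```

A crawler that reloads pages hundreds of times a minute exhausts its bucket almost immediately, while an ordinary reader never notices the limit.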
Last week, Reddit made several changes intended to stop bots from scraping its website. It said it would also deploy rate limits, as well as block suspected bots and direct them to stay away from the site.
It acknowledged that these restrictions could also hinder other automated services that are important for openness, such as the Internet Archive, which preserves web pages for later access. It insisted, however, that critical research tools would remain accessible on Reddit.
“Anyone accessing Reddit content must abide by our policies, including those in place to protect Redditors. We are selective about who we work with and trust with large-scale access to Reddit content,” it said when it introduced those new rules.
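Directing crawlers to stay away is conventionally done through a site's robots.txt file, which lists rules per crawler user agent. The fragment below is purely illustrative, not Reddit's actual file; the user-agent strings shown (OpenAI's GPTBot, Google's Google-Extended, Common Crawl's CCBot) are publicly documented AI crawlers.

```txt
# Illustrative robots.txt entries disallowing known AI training crawlers.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everything else may still browse normally.
User-agent: *
Allow: /
```

Compliance with robots.txt is voluntary, which is why sites pair it with rate limits and active blocking.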
Some businesses have entered into agreements to grant AI firms access to their or their customers' data. Both OpenAI and Google, for example, have negotiated agreements with Reddit to use its users' content to train their artificial intelligence systems.
Others have turned to the courts. The New York Times has sued OpenAI and Microsoft over their artificial intelligence systems, claiming that using the paper's articles to train them violates its copyright.
Cloudflare, an internet infrastructure company, has now introduced similar capabilities, telling users they are a way of declaring their "AIndependence". All Cloudflare customers would have an "easy button" to "block all AI bots", the company stated.
Last year, Cloudflare introduced an option to block AI bots that "behave well". Even though that mechanism targeted bots that followed the rules, Cloudflare's customers "overwhelmingly" chose to block them, the company said.
Now the company has added a feature that forcibly blocks any recognized bot. It will scan for scrapers' fingerprints and prevent them from crawling websites, it added.
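The simplest such fingerprint is the request's self-reported User-Agent header. The sketch below shows only that basic idea; it is not Cloudflare's implementation, which also relies on signals like TLS handshake fingerprints and behavioral analysis. The bot names in the pattern are real, publicly documented crawler user agents, but the function and rule set are invented for illustration.

```python
import re

# Signatures of known AI scrapers, matched against the User-Agent
# header (illustrative list, not Cloudflare's actual rules).
KNOWN_AI_BOTS = re.compile(
    r"GPTBot|CCBot|Google-Extended|ClaudeBot|Bytespider",
    re.IGNORECASE,
)

def should_block(headers: dict) -> bool:
    """Return True when the request looks like a known AI scraper.

    Real defenses go further, fingerprinting TLS handshakes and
    request timing; this sketch checks only the self-reported
    User-Agent string, which a dishonest bot can spoof."""
    user_agent = headers.get("User-Agent", "")
    return bool(KNOWN_AI_BOTS.search(user_agent))

print(should_block({"User-Agent": "GPTBot/1.1 (+https://openai.com/gptbot)"}))  # True
print(should_block({"User-Agent": "Mozilla/5.0 (Windows NT 10.0)"}))            # False
```

Because the header can be spoofed, user-agent matching only catches bots that identify themselves honestly, which is why deeper fingerprinting matters.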
Last year, GreatGameIndia reported that the Italian National Authority for Personal Data Protection announced Italy’s ban on OpenAI’s ChatGPT over privacy concerns.