
In a new research paper, Apple doubles down on its claim that it does not train its Apple Intelligence models on anything scraped from the web without permission.
It’s a fair bet that artificial intelligence systems have been scraping every part of the web they can access, whether or not they should. In 2023, the New York Times sued both OpenAI and Microsoft for copyright infringement, and that was far from the only suit.
In contrast, also in 2023, Apple was reported to have attempted to buy the rights to train its large language models (LLMs) on work from publishers including Condé Nast and NBC News. Apple was said to have offered publishers millions of dollars, although it was not clear at the time which of them, if any, had agreed.
Now, in a newly published research paper, Apple says that if a publisher does not agree to its content being scraped for training, Apple won’t scrape it.
Apple details its ethics
“We believe in training our models using diverse and high-quality data,” says Apple. “This includes data that we’ve licensed from publishers, curated from publicly available or open-sourced datasets, and publicly available information crawled by our web-crawler, Applebot.”
“We do not use our users’ private personal data or user interactions when training our foundation models,” it continues. “Additionally, we take steps to apply filters to remove certain categories of personally identifiable information and to exclude profanity and unsafe material.”
Most of the research paper is concerned with how Apple goes about this scraping, and specifically how its internal Applebot system extracts useful information despite “the noisy nature of the web.” But it does return to the overall issues regarding copyright, and each time it insists that Apple is respecting rights holders.
“[We] continue to follow best practices for ethical web crawling, including following widely-adopted robots.txt protocols to allow web publishers to opt out of their content being used to train Apple’s generative foundation models,” says Apple. “Web publishers have fine-grained controls over which pages Applebot can see and how they are used while still appearing in search results within Siri and Spotlight.”
The “fine-grained controls” appear to be based on the long-standing robots.txt system. That is not a formal privacy standard, but it is widely adopted: publishers place a text file called robots.txt at the root of their sites.

ChatGPT logo – image credit: OpenAI
If an AI system sees that file, it is supposed to not scrape the site or specific pages that the file details. It’s as simple as that.
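As a sketch of how that opt-out works in practice, here is how a compliant crawler might consult a publisher’s file, using Python’s standard urllib.robotparser module. The robots.txt content and URLs are hypothetical examples, not taken from any real site:

    # A minimal sketch of how a compliant crawler consults robots.txt.
    # The file content and URLs below are hypothetical examples.
    from urllib import robotparser

    # What a publisher might serve at https://example.com/robots.txt to
    # keep Applebot out of its archive while allowing everything else:
    EXAMPLE_ROBOTS_TXT = """\
    User-agent: Applebot
    Disallow: /archive/

    User-agent: *
    Disallow:
    """

    parser = robotparser.RobotFileParser()
    parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

    # A well-behaved crawler asks before every single fetch.
    print(parser.can_fetch("Applebot", "https://example.com/archive/story.html"))  # False
    print(parser.can_fetch("Applebot", "https://example.com/news/story.html"))     # True
    print(parser.can_fetch("OtherBot", "https://example.com/archive/story.html"))  # True

Crucially, nothing enforces any of this. The file only works if the crawler chooses to check it.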
What companies say and what they do
It’s easy to say that a company’s AI systems will respect robots.txt, and OpenAI implies, but only implies, that its systems do the same.
“Decades ago, the robots.txt standard was introduced and voluntarily adopted by the Internet ecosystem for web publishers to indicate what portions of websites web crawlers could access,” said OpenAI in a May 2024 blog post called “Our approach to data and AI.”
“Last summer,” it continued, “OpenAI pioneered the use of web crawler permissions for AI, enabling web publishers to express their preferences about the use of their content in AI. We take these signals into account each time we train a new model.”
Even that last part about taking signals into account is not the same as saying OpenAI respects those signals. And while that key paragraph about signals directly follows the one about robots.txt, it never explicitly says that OpenAI pays the protocol any attention.
And seemingly a great many AI companies do not adhere to any robots.txt instructions. Market analysis firm TollBit said that in March 2025, there were over 26 million disallowed scrapes where AI firms ignored robots.txt entirely.
The same firm also reports that the proportion is rising. In Q4 2024, 3.3% of AI scrapes ignored robots.txt, and in Q1 2025 it was around 13%.
While TollBit does not speculate on the reasons, it’s likely that everything freely available on the internet has already been scraped, so the companies are pressing on into material publishers have blocked. And in June 2025, a US District Court said they could.
Robots.txt is more than a simple no
When any AI system attempts to scrape a website, it identifies itself by name. So when Google does it, the site sees that Googlebot is accessing it, and the site’s robots.txt file sets out that bot’s permissions.

That list comprises the sections of the site the bot is not allowed to access. When Apple’s system, Applebot, was revealed in 2015, Apple said that if a site’s robots.txt did not mention Applebot, it would follow any rules the file included for Googlebot.
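That fallback is Apple’s own rule, not part of robots.txt itself, so a crawler has to implement it on top of an ordinary parser. A rough sketch of that logic, again with hypothetical file content and URLs, might look like this:

    # A rough sketch of Applebot's stated fallback: when robots.txt
    # names no rules for Applebot, follow the rules given for Googlebot.
    # The file content and URLs below are hypothetical examples.
    from urllib import robotparser

    ROBOTS_TXT = """\
    User-agent: Googlebot
    Disallow: /private/
    """

    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())

    def applebot_can_fetch(url: str) -> bool:
        # urllib's parser knows nothing of Apple's fallback, so this
        # models it crudely: if Applebot is never named in the file,
        # ask for Googlebot's permissions instead.
        if "applebot" in ROBOTS_TXT.lower():
            return parser.can_fetch("Applebot", url)
        return parser.can_fetch("Googlebot", url)

    print(applebot_can_fetch("https://example.com/private/page.html"))  # False
    print(applebot_can_fetch("https://example.com/news/page.html"))     # True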
The BBC said in 2023 that “we have taken steps to prevent web crawlers like those from OpenAI and Common Crawl from accessing BBC websites.” Around the same time, a study of 1,156 news publishers found that 626 had blocked AI scraping, including crawlers from OpenAI and Google AI.

A court ruling in a case against Anthropic concluded that AI firms can train on legally acquired material
But a company can change the name of its scraping tool, or it can simply ignore the blocks outright, or at the very least be accused of doing so.
Perplexity.ai, which Apple is repeatedly rumored to be buying, marketed itself as an ethical AI firm too, with a detailed blog post about why ethics are so necessary.

But that post was published in November 2024, and in June of that year, Forbes had already threatened Perplexity over scraping its content anyway. Perplexity CEO Aravind Srinivas later admitted that its search and scraping had some “rough edges.”
Apple stands out in AI
Unless Apple’s claims about ethical AI training are challenged legally, as Forbes at least started to do with Perplexity.ai, we may never know whether they are true.

But OpenAI has been sued over this, Microsoft has too, and Perplexity has been publicly called out. So far, no one has claimed that Apple has done anything unethical.

That’s not the same thing as publishers being happy with any firm training its LLMs on their content, but so far, Apple may be the only one doing it all legally.