
Apple has published a technical paper detailing the models it developed to power Apple Intelligence, the suite of generative AI capabilities headed to iOS, macOS, and iPadOS over the coming months.
In the paper, Apple disputes accusations that it took an ethically questionable approach to training some of its models, reiterating that it did not use personal user data and that it combined publicly available and licensed data for Apple Intelligence.
“[The pretraining dataset] consists of data licensed from publishers, curated publicly available or open-source datasets, and publicly available information crawled by our web crawler, Applebot,” Apple wrote in the paper. “Because we focus on user privacy, no personal Apple user data is included in the data mix.”
In July, Proof News reported that Apple had trained a family of models designed for on-device processing using a dataset called The Pile, which contains captions from hundreds of thousands of YouTube videos. Many YouTube creators whose captions were swept into The Pile were unaware of this and did not consent to it. Apple later issued a statement saying that it had no intention of using the models to power AI features in its products.
The technical paper, which covers the Apple Foundation Models (AFM) that Apple first announced at WWDC 2024 in June, emphasizes that the training data for the AFM models was sourced “responsibly” (at least by Apple’s definition).
The training data for the AFM models includes publicly available web data along with licensed data from undisclosed publishers. According to The New York Times, Apple signed multiyear deals worth at least $50 million toward the end of 2023 with several publishers, including NBC, Condé Nast, and IAC, to train models on the publishers’ news archives. The AFM models were also trained on open source code hosted on GitHub, specifically Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go code.
Training models on code without permission, even open code, is controversial among developers. Some developers argue that certain open source codebases are unlicensed, or that their terms of use do not allow AI training. Apple, however, says it “license filtered” the code to include only repositories with minimal usage restrictions, such as those under MIT, ISC, or Apache licenses.
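Apple doesn’t describe its filtering pipeline in any detail, but in its simplest form a license filter like the one the paper mentions could check each repository’s detected license against an allowlist of permissive licenses. The sketch below is purely illustrative (the repo metadata format and the helper name are assumptions, not Apple’s actual tooling):

```python
# Hypothetical sketch of a permissive-license filter; not Apple's actual pipeline.
PERMISSIVE_LICENSES = {"mit", "isc", "apache-2.0"}  # allowlist named in the paper

def license_filter(repos):
    """Keep only repos whose detected license ID is on the permissive allowlist."""
    return [r for r in repos if r.get("license", "").lower() in PERMISSIVE_LICENSES]

repos = [
    {"name": "swift-utils", "license": "MIT"},
    {"name": "gpl-tool", "license": "GPL-3.0"},   # copyleft: excluded
    {"name": "fastjson", "license": "Apache-2.0"},
    {"name": "mystery-repo", "license": ""},      # unlicensed: excluded
]
print([r["name"] for r in license_filter(repos)])  # ['swift-utils', 'fastjson']
```

In a real pipeline the license ID would come from an automated detector rather than hand-labeled metadata, which is one reason unlicensed or ambiguously licensed repos are tricky to exclude reliably.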
According to the paper, to improve the mathematical skills of the AFM model, Apple specifically included mathematical questions and answers from web pages, math forums, blogs, tutorials, and seminars in the training set. The company also utilized a “high-quality, publicly available” dataset (not named in the paper) that was “licensed to be used to train the model,” which it filtered to remove sensitive information.
In total, the training dataset for the AFM model amounts to about 6.3 trillion tokens. (A token is typically a bite-sized piece of data that is easy for a generative AI model to ingest.) For comparison, that’s less than half the 15 trillion tokens that Meta used to train its flagship text generation model, Llama 3.1 405B.
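As a rough illustration of what counting tokens means: production models use learned subword tokenizers (such as byte-pair encoding) rather than whitespace splitting, so the toy example below is only an approximation of the idea.

```python
# Toy illustration of tokenization. Real models use learned subword tokenizers,
# which split text into finer-grained pieces than whole words.
def toy_tokenize(text):
    """Split text into crude word-level tokens on whitespace."""
    return text.split()

tokens = toy_tokenize("Apple trained AFM on about 6.3 trillion tokens")
print(len(tokens))   # 8
print(tokens[:3])    # ['Apple', 'trained', 'AFM']
```

A subword tokenizer would typically produce more tokens than this word-level split, since rare words get broken into several pieces.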
Apple collected additional data, including human feedback data and synthetic data, to fine-tune the AFM models and mitigate undesirable behaviors, such as spewing toxicity.
“Our models are designed to help users perform everyday activities using Apple products,” Apple said. “This is grounded in Apple’s core values and is based on responsible AI principles at every step.”
The paper offers no smoking gun or shocking insights, and that is by careful design. Papers like this are rarely very revealing, owing to competitive pressures but also because disclosing too much could land companies in legal trouble.
Some companies that scrape public web data to train their models claim the practice is protected by fair use doctrine. But this is a hotly contested question and increasingly the subject of litigation.
Apple notes in the paper that it allows webmasters to block crawlers from scraping their data. However, individual creators are left in a predicament. For example, what should an artist do if their portfolio is hosted on a site that doesn’t block Apple from scraping their data?
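Apple documents that Applebot honors the robots.txt standard, so a site owner wanting to opt out could add rules like the following (a minimal sketch based on Apple’s published crawler user-agent names; Applebot-Extended is the token Apple introduced for opting out of AI training specifically):

```
# Block Applebot from crawling the entire site
User-agent: Applebot
Disallow: /

# Or stay in Apple's search index but opt out of AI training use
User-agent: Applebot-Extended
Disallow: /
```

Of course, this control belongs to whoever runs the site, which is exactly the predicament for creators whose work is hosted on platforms they don’t control.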
Courtroom battles will ultimately decide the fate of generative AI models and how they are trained. For now, Apple is trying to position itself as an ethical player while fending off unwanted legal scrutiny.