Apple Launches Core AI for Apple Silicon-Optimized On-Device Generative AI

At WWDC 26, Apple introduced the Core AI framework, the official successor to Core ML. It’s designed to let developers run large language models and generative AI entirely on-device, supporting both custom-converted PyTorch models and pre-optimized open-source models.

Apple says the new Core AI framework provides a unified architecture for deploying models ranging from compact 3B-parameter vision models to large-scale LLMs, including reasoning models with up to 70B parameters, across the iPhone, iPad, Mac, and Apple Vision Pro.

Core AI is the technology underpinning Apple Intelligence, and with the next release of its OSes and toolchain, Apple is making it available to developers to build what it calls “custom intelligence”. Core AI, which can only run on Apple Silicon, ensures user data privacy, zero server dependencies, and zero per-token cloud costs.

Key Core AI capabilities include unified hardware access, allowing workloads to run seamlessly across the CPU, GPU, and Neural Engine under one API; a memory-safe Swift API enabling zero-copy data paths and fine-grained control over inference memory; and ahead-of-time (AOT) compilation, which shifts work off the user’s device, yielding near-instant load times.

As mentioned, you can convert a PyTorch model into a Core AI model using Core AI PyTorch. The simplest approach is to export the PyTorch model as a torch.export.ExportedProgram and convert it to a Core AI AIProgram using TorchConverter().add_exported_program(ep).to_coreai().

Alternatively, you can author a new Core AI model from a PyTorch one by using the built-in composite ops provided by the library, such as attention, RoPE embeddings, RMSNorm, and gather-matmul; by registering custom lowering functions to map new PyTorch ops to Core AI IR; or even by creating custom Metal kernels for lower-level optimization.

When converting a PyTorch model, an important step is compressing it for deployment on Apple hardware. This process applies optimization techniques such as quantization and palettization, which are designed to align with the execution patterns of the Core AI runtime by default, ensuring efficient on-device performance.
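To make the two techniques named above concrete, here is a minimal plain-Python sketch of what quantization and palettization do to a list of weights. It is not the Core AI compression API, which the announcement does not detail; the function names and defaults (`quantize_int8`, `dequantize_int8`, `palettize`, `n_colors`, `iterations`) are hypothetical and chosen only for illustration.

```python
# Illustrative only: plain-Python versions of two weight-compression ideas.
# None of these names come from Core AI; they are hypothetical.

def quantize_int8(weights):
    """Symmetric linear quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float weights from integer codes and the shared scale."""
    return [c * scale for c in codes]


def palettize(weights, n_colors=4, iterations=10):
    """Palettization: cluster weights with 1-D k-means and keep, for each weight,
    only the index of its nearest entry in a small lookup table (the palette)."""
    if n_colors < 2:
        raise ValueError("n_colors must be at least 2")
    lo, hi = min(weights), max(weights)
    # Start with evenly spaced palette entries across the weight range.
    palette = [lo + (hi - lo) * i / (n_colors - 1) for i in range(n_colors)]

    def nearest(w):
        return min(range(n_colors), key=lambda k: abs(w - palette[k]))

    for _ in range(iterations):
        indices = [nearest(w) for w in weights]
        for k in range(n_colors):
            members = [w for w, i in zip(weights, indices) if i == k]
            if members:  # leave unused palette entries where they are
                palette[k] = sum(members) / len(members)
    return [nearest(w) for w in weights], palette


if __name__ == "__main__":
    w = [0.1, 0.11, 0.9, 0.91]
    print(quantize_int8(w))
    print(palettize(w, n_colors=2))
```

In both cases the stored model keeps small integers (codes or palette indices) plus a little metadata (a scale or a lookup table) instead of full-precision floats, which is where the size reduction comes from.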

Model compression can help reduce the memory footprint of your model (both on disk and at runtime), reduce inference latency, reduce power consumption, or optimize all of these at once.
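As a rough back-of-the-envelope illustration of the footprint point, the sketch below estimates weight storage from parameter count and bits per weight. The 3B and 70B parameter counts come from the article; the bit widths are assumptions picked for illustration, and the helper `weight_footprint_gb` is hypothetical. The estimate ignores quantization metadata (scales, lookup tables), activations, and any runtime caches.

```python
# Back-of-the-envelope estimate of weight storage; not a Core AI API.

def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, n_params in [("3B", 3_000_000_000), ("70B", 70_000_000_000)]:
        for bits in (16, 8, 4):  # assumed bit widths, for illustration only
            size = weight_footprint_gb(n_params, bits)
            print(f"{name} model at {bits}-bit weights: ~{size:g} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts weight storage by a factor of four, which is the kind of reduction that determines whether a large model fits in device memory at all.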

One important aspect of running an AIModel is its automatic specialization to the current hardware and OS version, which is carried out when the model is first loaded into the model cache. As a result, the first attempt to use a model may take significantly longer than subsequent runs, when the model is already cached. Developers can control how and when this process happens by customizing SpecializationOptions, accessing the AICacheModel to check whether a model is already available or to delete cached ones, and even sharing the model cache across an app group.
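The caching behavior described above can be pictured with a small conceptual sketch: an expensive specialization step runs once per combination of model, hardware, and OS version, and later loads reuse the cached result. This is not the Core AI API; `SpecializationCache` and its methods are hypothetical stand-ins, and the actual cache key used by the framework is an assumption here.

```python
# Conceptual sketch of a specialize-on-first-load cache; not the Core AI API.
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str, str]  # (model_id, hardware, os_version): assumed key


@dataclass
class SpecializationCache:
    # The expensive step, e.g. compiling a model for one hardware/OS pair.
    specialize: Callable[[str, str, str], object]
    _entries: Dict[CacheKey, object] = field(default_factory=dict)

    def contains(self, model_id: str, hardware: str, os_version: str) -> bool:
        """Check whether a specialized model is already cached."""
        return (model_id, hardware, os_version) in self._entries

    def load(self, model_id: str, hardware: str, os_version: str) -> object:
        """Return the cached model, specializing it only on a cache miss."""
        key = (model_id, hardware, os_version)
        if key not in self._entries:
            self._entries[key] = self.specialize(*key)  # slow first load
        return self._entries[key]

    def evict(self, model_id: str, hardware: str, os_version: str) -> bool:
        """Delete a cached model; return True if something was removed."""
        key = (model_id, hardware, os_version)
        if key in self._entries:
            del self._entries[key]
            return True
        return False
```

Because the OS version is part of the key in this sketch, an OS update causes a cache miss and a fresh specialization, which mirrors why a first load after an update could be slow again.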

With the introduction of Core AI, Apple is providing support for three distinct approaches to running ML/AI on its operating systems: Core ML, Core AI, and MLX Swift. Based on developer discussions on Hacker News, Apple appears to recommend using Core ML for “basic, non-neural ML”, such as decision trees or tabular feature engineering, Core AI for neural networks and transformers, and MLX for working with custom model weights, though potentially with lower performance. Community feedback also notes that while Core AI “makes it simpler to include high-performance LLMs”, its long-term value will depend “on the future growth of the official Core AI/community”.