On-Device LLM Inference via KMP and llama.cpp

Source: DEV Community
---
title: "On-Device LLMs via KMP: A Production Architecture with llama.cpp"
published: true
description: "Build a KMP shared module wrapping llama.cpp with mmap loading, hardware acceleration, and thermal management for 3B-parameter models on mobile."
tags: kotlin, mobile, architecture, android
canonical_url: https://blog.mvp-factory.com/on-device-llms-via-kmp-production-architecture
---

## What We Will Build

In this workshop, I will walk you through a Kotlin Multiplatform shared module that wraps llama.cpp to run 3B-parameter LLMs directly on iOS and Android. By the end, you will understand mmap-based model loading, hardware accelerator delegation across the Apple Neural Engine and Android NNAPI, quantization format tradeoffs, and the thermal throttling patterns that separate a demo from a shippable feature. Let me show you a pattern I use in every project that touches on-device inference.

## Prerequisites

- Kotlin Multiplatform project configured for iOS and Android targets
- llama.cpp
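To ground the architecture before we dive in, here is a minimal sketch of the shared-module contract we will build toward. All names here (`LlamaEngine`, `InferenceConfig`, `generate`) are my placeholders, not llama.cpp's actual bindings; the point is the KMP `expect`/`actual` split, where common code declares the API and each platform supplies the JNI (Android) or cinterop (iOS) bridge into the native library.

```kotlin
package com.example.inference

// Hypothetical configuration surface; field names are assumptions for
// illustration, chosen to mirror llama.cpp's mmap and threading options.
data class InferenceConfig(
    val modelPath: String,      // path to a quantized GGUF model file
    val useMmap: Boolean = true, // map the model file instead of copying it into RAM
    val threads: Int = 4,
)

// Common code declares the contract; iosMain and androidMain each
// provide an `actual class LlamaEngine` backed by llama.cpp.
expect class LlamaEngine(config: InferenceConfig) {
    fun generate(prompt: String, maxTokens: Int = 128): String
    fun close() // release the context and unmap the model
}
```

This sketch is an interface declaration, not a runnable sample: it only compiles inside a KMP project once each target supplies its `actual` implementation. Keeping the contract this narrow is deliberate; everything platform-specific (accelerator delegation, thermal backoff) stays behind the `actual` classes.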