The On-Device AI Revolution
Until recently, AI meant cloud computing. Voice assistants, transcription services, and text generation all required sending data to powerful remote servers.
That's changing. Modern smartphones have specialized hardware that can run AI models locally. TokKong takes advantage of this to deliver transcription and text processing without ever connecting to the cloud.
Here's how it works.
Apple's AI Hardware
The Neural Engine
Since 2017's A11 Bionic chip, every iPhone has included a Neural Engine: dedicated hardware designed specifically for machine learning operations. The latest chips have dramatically increased its capabilities:
- A17 Pro (iPhone 15 Pro): 35 trillion operations per second
- M-series chips (iPad Pro, Mac): similar or greater throughput, with newer generations reaching 38 trillion operations per second
The Neural Engine excels at the mathematical operations neural networks require: matrix multiplications and tensor operations at scale.
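To make that concrete, here is the core operation in miniature, computed on the CPU with Apple's Accelerate framework purely for illustration (the Neural Engine performs the same math in dedicated silicon and is not directly programmable):

```swift
import Accelerate

// A 2x2 matrix multiply via vDSP. Neural networks repeat this operation
// at far larger scale, billions of times per inference.
let a: [Float] = [1, 2,
                  3, 4]
let b: [Float] = [5, 6,
                  7, 8]
var c = [Float](repeating: 0, count: 4)
vDSP_mmul(a, 1, b, 1, &c, 1, 2, 2, 2)  // C = A x B, all matrices 2x2
print(c)  // [19.0, 22.0, 43.0, 50.0]
```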
Core ML Framework
Apple's Core ML framework provides the bridge between AI models and this hardware. When TokKong uses Core ML:
- Models are converted to Apple's optimized format
- The framework automatically uses the best available hardware
- Operations route to Neural Engine, GPU, or CPU as appropriate
- Memory is managed efficiently for mobile constraints
This lets TokKong run sophisticated AI models with better battery life and performance than generic implementations.
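In code, the hardware routing comes down to a single configuration option. A minimal sketch using Core ML's standard loading API (the model file name here is hypothetical):

```swift
import CoreML

// Ask Core ML to route each operation to the best available hardware.
let config = MLModelConfiguration()
config.computeUnits = .all  // Neural Engine, GPU, or CPU as appropriate

// Load a compiled Core ML model; the file name is hypothetical.
let url = URL(fileURLWithPath: "WhisperEncoder.mlmodelc")
let model = try MLModel(contentsOf: url, configuration: config)
```

Setting `computeUnits` to `.all` is what lets the framework decide, per layer, whether the Neural Engine, GPU, or CPU is the right target.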
Two AI Systems in TokKong
TokKong uses two distinct AI systems:
1. Whisper for Transcription
OpenAI's Whisper is a speech recognition model trained on 680,000 hours of multilingual audio. It handles:
- Speech-to-text conversion
- Automatic language detection
- Multilingual transcription
- Robustness to varied accents and audio quality
TokKong runs Whisper models optimized for Apple hardware. The model processes audio in chunks, converting speech to text in near real-time on recent devices.
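The chunking is simple in outline. A minimal sketch, assuming a hypothetical `transcribeChunk` wrapper around the Core ML Whisper model (Whisper operates on windows of up to 30 seconds):

```swift
// Split 16 kHz audio into 30-second windows and transcribe each in turn.
let sampleRate = 16_000
let windowSize = 30 * sampleRate

// Hypothetical wrapper around the Core ML Whisper model.
func transcribeChunk(_ samples: ArraySlice<Float>) -> String {
    // ... mel spectrogram -> encoder -> decoder -> text ...
    return ""
}

func transcribe(_ samples: [Float]) -> String {
    var transcript = ""
    var start = 0
    while start < samples.count {
        let end = min(start + windowSize, samples.count)
        transcript += transcribeChunk(samples[start..<end])
        start = end
    }
    return transcript
}
```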
2. LLMs for Text Processing
Large Language Models (LLMs) power TokKong's text processing features:
- Summarization
- Translation
- Reformatting
- Question answering about transcripts
TokKong supports several models:
- Phi-3: Microsoft's efficient small model, good for faster processing
- Gemma: Google's lightweight model, balanced performance
- Llama: Meta's model, available in various sizes
These models are downloaded once and run entirely locally.
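Under the hood, each of these features reduces to a prompt sent to the local model. A rough sketch, with a hypothetical `LocalLLM` interface standing in for whichever model is loaded:

```swift
// Hypothetical interface to whichever local model is loaded.
protocol LocalLLM {
    func generate(_ prompt: String) -> String
}

// Summarization is just a prompt template wrapped around the transcript.
func summarize(_ transcript: String, using llm: LocalLLM) -> String {
    let prompt = """
    Summarize the following transcript in one short paragraph:

    \(transcript)
    """
    return llm.generate(prompt)
}
```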
Technical Deep Dive
Audio Processing Pipeline
When you record in TokKong:
1. Audio capture: the iOS audio framework records at a 16kHz sample rate
2. Preprocessing: audio is normalized and chunked into segments
3. Feature extraction: mel spectrograms are generated as Whisper's input
4. Inference: the Whisper model converts spectrograms into tokens
5. Decoding: tokens are converted to text with timestamp alignment
6. Post-processing: text is cleaned and formatted
This happens in a background thread while you continue recording.
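A sketch of the capture step, assuming AVFoundation's AVAudioEngine and a conversion from the microphone's native format to the 16 kHz mono format Whisper expects (buffer sizes, error handling, and audio session setup are illustrative or omitted):

```swift
import AVFoundation

// Capture microphone audio and convert it to 16 kHz mono Float32.
let engine = AVAudioEngine()
let input = engine.inputNode
let micFormat = input.outputFormat(forBus: 0)
let whisperFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                  sampleRate: 16_000,
                                  channels: 1,
                                  interleaved: false)!
let converter = AVAudioConverter(from: micFormat, to: whisperFormat)!

input.installTap(onBus: 0, bufferSize: 4096, format: micFormat) { buffer, _ in
    let ratio = whisperFormat.sampleRate / micFormat.sampleRate
    let frames = AVAudioFrameCount(Double(buffer.frameLength) * ratio)
    guard let out = AVAudioPCMBuffer(pcmFormat: whisperFormat,
                                     frameCapacity: frames) else { return }
    var served = false
    converter.convert(to: out, error: nil) { _, status in
        if served {
            status.pointee = .noDataNow
            return nil
        }
        served = true
        status.pointee = .haveData
        return buffer
    }
    // out.floatChannelData now holds 16 kHz samples ready for chunking.
}

try engine.start()
```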
Model Quantization
Running large AI models on mobile devices requires optimization. TokKong uses quantized models:
- Full precision: 32 bits per parameter (too large for phones)
- FP16: 16 bits per parameter (smaller, good quality)
- INT8: 8 bits per parameter (much smaller, slightly less accurate)
- INT4: 4 bits per parameter (smallest, optimized for mobile)
Quantization reduces memory requirements by 4-8x with minimal quality loss.
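The arithmetic is easy to check. Assuming roughly 3.8 billion parameters (about the size of Phi-3 Mini), raw weight storage at each precision works out as follows:

```swift
import Foundation

// Approximate weight storage for a 3.8-billion-parameter model.
let parameters = 3.8e9
for (precision, bits) in [("FP32", 32.0), ("FP16", 16.0),
                          ("INT8", 8.0), ("INT4", 4.0)] {
    let gigabytes = parameters * bits / 8 / 1e9  // bits -> bytes -> GB
    print("\(precision): \(String(format: "%.1f", gigabytes)) GB")
}
// FP32: 15.2 GB, FP16: 7.6 GB, INT8: 3.8 GB, INT4: 1.9 GB
```

The INT4 result lines up with the ~2GB figure for quantized Phi-3 Mini under Storage Requirements below.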
Memory Management
iPhones have limited RAM compared to servers:
- iPhone 15 Pro: 8GB
- Standard iPhones: 6GB
- Servers, by comparison: 64GB+
TokKong manages memory by:
- Loading model weights on-demand
- Streaming inference (processing chunks, not entire files at once)
- Releasing memory when models aren't active
- Using Apple's memory-mapped file features
This lets large models run on devices with limited memory.
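Memory mapping is what makes multi-gigabyte weight files workable. A minimal sketch using Foundation's mapped-read option (the file name is hypothetical):

```swift
import Foundation

// Map the weights file instead of reading it into RAM. Pages are
// faulted in only as tensors are actually touched, and can be
// evicted again under memory pressure.
let url = URL(fileURLWithPath: "model-int4.weights")  // hypothetical file
let weights = try Data(contentsOf: url, options: .mappedIfSafe)
print("Mapped \(weights.count) bytes without loading them all")
```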
Performance Expectations
Transcription Speed
On recent iPhones (A14 chip and newer):
- 1 minute of audio transcribes in ~10-30 seconds (2-6x faster than real time)
- Longer files are proportionally faster, since the one-time cost of loading the model is amortized
- Speed varies with the model size selected
Older devices work but take longer.
LLM Processing Speed
For text generation (summarization, reformatting):
- Phi-3 (smallest): ~10-20 tokens/second
- Larger models: 5-15 tokens/second
- Total time scales with the length of output requested
A paragraph summary generates in a few seconds: at 15 tokens/second, a 150-token summary takes about 10 seconds, and longer outputs take proportionally more time.
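The reason time scales with output length is that LLMs generate one token per forward pass. A sketch of the decoding loop, with a hypothetical `TokenModel` interface:

```swift
// Hypothetical interface to a local language model.
protocol TokenModel {
    var endToken: Int { get }
    func nextToken(after tokens: [Int]) -> Int  // one full forward pass
}

// Autoregressive decoding: every output token costs another forward
// pass, so a 300-token answer takes roughly twice as long as a
// 150-token one.
func generate(with model: TokenModel, prompt: [Int], maxTokens: Int) -> [Int] {
    var tokens = prompt
    for _ in 0..<maxTokens {
        let next = model.nextToken(after: tokens)
        if next == model.endToken { break }
        tokens.append(next)
    }
    return Array(tokens.dropFirst(prompt.count))
}
```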
Battery Impact
AI processing is computationally intensive. During active transcription or text generation:
- Battery drain increases noticeably
- Approximately 5-10% per hour of continuous use
- Device may warm slightly
Keeping models loaded while idle has minimal impact.
Storage Requirements
Model Sizes
Whisper models:
- Tiny: ~75MB
- Base: ~150MB
- Small: ~500MB
LLM models:
- Phi-3 Mini (quantized): ~2GB
- Gemma 2B (quantized): ~1.5GB
- Llama 7B (quantized): ~4GB
Managing Storage
TokKong downloads models on-demand. You can:
- Choose which models to keep
- Delete models to free space
- Re-download when needed
Total storage for full functionality: 3-6GB depending on model choices.
Privacy Architecture
No Network Calls
TokKong's AI features make zero network requests:
- Model files are self-contained
- No telemetry or analytics
- No "call home" for any feature
- Works in airplane mode
Data Flow
When you transcribe audio:
- Audio stays in app memory
- Processed by local Whisper model
- Text output stored locally
- Nothing transmitted anywhere
When you process text:
- Transcript feeds to local LLM
- Processing happens in device RAM
- Output displayed and stored locally
- Your prompts and results remain private
Comparison with Cloud AI
Cloud Advantages
- More powerful models available
- Faster processing on slow devices
- No storage requirements
- Continuous model improvements
Local Advantages
- Complete privacy
- Works offline
- No ongoing costs
- No service dependencies
- No processing limits
TokKong prioritizes privacy and independence. For users who value those qualities, local processing is worth the tradeoffs.
The Future of On-Device AI
Apple continues to improve AI hardware:
- Neural Engine gets faster each chip generation
- Memory increases enable larger models
- Metal optimizations improve GPU utilization
As hardware improves, on-device AI will handle increasingly complex tasks. What requires a data center today may run on your phone tomorrow.
TokKong's architecture is designed to take advantage of these improvements. As better models become practical for mobile, the app can adopt them while maintaining its privacy-first approach.
Summary
TokKong's on-device AI relies on:
- Apple's Neural Engine: Specialized hardware for ML operations
- Core ML: Optimized framework for Apple hardware
- Whisper: State-of-the-art speech recognition model
- Quantized LLMs: Compressed language models for mobile
- Careful engineering: Memory management and optimization
The result is professional-quality transcription and AI text processing that never needs the cloud. Your voice and your words stay on your device, processed by AI running in your pocket.
Technology that seemed impossible a few years ago is now practical, private, and in your hands.


