Research · April 13, 2026 · 8 min read

Gemma 4 Just Made On-Device AI Real — Here's What You Can Build

Apache 2.0 license, 2B to 31B parameters, runs under 1.5GB RAM. Gemma 4 brings function calling and multi-step reasoning to the edge.

Published by GitIntel Research

Why Gemma 4 Is Different

On-device AI has been "almost ready" for three years. Every release promised edge deployment; every release required compromises that made it impractical. Models were too large, too slow, or too dumb for real workloads.

Gemma 4, released April 2, 2026, changes the equation in three specific ways:

Size. The E2B variant has 2.3 billion effective parameters. With LiteRT-LM's 2-bit quantization, it fits in under 1.5GB of RAM. That is within reach of a high-end phone, a Raspberry Pi 5, or most laptops made in the last five years.
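
A quick back-of-the-envelope check of that memory figure, assuming the weights dominate the footprint and using decimal gigabytes. The function name is illustrative and nothing here is taken from the quantization toolchain itself:

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for model weights alone, in decimal gigabytes."""
    total_bits = num_params * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# E2B at the 2-bit quantization mentioned above
e2b_2bit = weight_memory_gb(2.3e9, 2)
# The same parameter count at 16-bit precision, for comparison
e2b_16bit = weight_memory_gb(2.3e9, 16)

print(f"2-bit weights:  {e2b_2bit:.3f} GB")
print(f"16-bit weights: {e2b_16bit:.3f} GB")
```

The 2-bit weights come to roughly 0.6 GB. This estimate ignores quantization scales, the KV cache, activations, and runtime overhead, which is where the rest of the 1.5GB budget would go.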

Capability. The 31B dense model ranks #3 on Arena AI with an Elo rating of 1452. It scores 89.2% on AIME 2026 (math) and 80.0% on LiveCodeBench (coding). These numbers compete with models at 200B+ parameters.

Function calling. Previous small models could generate text. Gemma 4 can call functions, produce structured JSON, follow system instructions, and chain multi-step reasoning. This is the gap that kept edge models out of agent workflows.

The Four Variants, Explained

| Variant | Total parameters | Active parameters | RAM (quantized) | Best for |
|---------|------------------|-------------------|-----------------|----------|
| E2B | ~2.3B | 2.3B | <1.5 GB | Mobile, IoT, embedded |
| E4B | ~4.5B | 4.5B | ~2.5 GB | Phones, tablets, RPi 5 |
| 26B MoE | 26B | ~4B | ~3 GB | Laptops, desktops |
| 31B Dense | 31B | 31B | ~18 GB | Workstations, servers |
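
One way to turn the table into a startup check is a small helper that picks the largest variant fitting the available memory. This is a sketch: the RAM figures are copied from the table, while the function name and the 1 GB headroom default are assumptions made for illustration.

```python
# Approximate quantized RAM needs in GB, copied from the table above.
VARIANT_RAM_GB = [
    ("E2B", 1.5),
    ("E4B", 2.5),
    ("26B MoE", 3.0),
    ("31B Dense", 18.0),
]


def pick_variant(free_ram_gb: float, headroom_gb: float = 1.0):
    """Return the largest variant whose table RAM figure fits, or None.

    headroom_gb is an assumed safety margin for the OS and the KV cache.
    """
    budget = free_ram_gb - headroom_gb
    best = None
    for name, needed in VARIANT_RAM_GB:  # ordered smallest to largest
        if needed <= budget:
            best = name
    return best


print(pick_variant(8))  # a machine with 8 GB free
print(pick_variant(2))  # a constrained device
```

With 8 GB free the helper selects the 26B MoE. With 2 GB free and the default headroom nothing fits, so it returns `None` and the application can fall back to a remote model or disable the feature.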

The MoE (Mixture of Experts) variant is the sweet spot for most developers. At 26 billion total parameters with only about 4 billion active at inference time, it delivers close to 31B-quality output at roughly E4B-level resource requirements.
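
To see why only a fraction of the parameters is active, here is a toy top-k mixture-of-experts layer in plain Python. It is a generic sketch of the technique: Gemma 4's actual router, expert count, and `top_k` value are not described here, so every name and number below is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, gate_scores, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs.

    x           -- input value (a scalar here, a vector in a real model)
    experts     -- list of callables, one per expert
    gate_scores -- router logits, one per expert
    Returns (output, indices of the experts that actually ran).
    """
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the gate weights over the chosen experts only
    weights = softmax([gate_scores[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


# Four toy "experts"; only two of them run for this input
experts = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
output, used = moe_forward(3.0, experts, gate_scores=[0.1, 2.0, -1.0, 2.0])
print(used, output)
```

Experts 1 and 3 tie for the highest gate score, so each gets weight 0.5 and the other two experts never execute. That skipped work is the source of the gap between total and active parameters.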

Getting Started: Three Deployment Paths

Path 1: Local Development with Ollama

# Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh

# Pull Gemma 4 (choose your variant)
ollama pull gemma4:2b      # E2B - fastest, smallest
ollama pull gemma4:4b      # E4B - balanced
ollama pull gemma4:26b     # MoE - best quality/resource ratio
ollama pull gemma4:31b     # Dense - maximum quality

# Run it
ollama run gemma4:26b
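
Once the model is running, an application can reach it through Ollama's local HTTP API. The sketch below uses only the standard library and the `/api/chat` endpoint on Ollama's default port. The `gemma4:26b` tag is the one used above, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_request(model: str, user_prompt: str, system_prompt: str = "") -> dict:
    """Build the JSON body for a non-streaming /api/chat call."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {"model": model, "messages": messages, "stream": False}


def chat(model: str, user_prompt: str, system_prompt: str = "") -> str:
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(model, user_prompt, system_prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["message"]["content"]


# Needs a running Ollama server with the model pulled:
# print(chat("gemma4:26b", "Explain mixture-of-experts in two sentences."))
```

Keeping the payload builder separate from the network call makes the request shape easy to unit test without a model installed.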

Path 2: Python with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-4-26b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to parse CSV files with error handling."},
]

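inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker so the model answers
    return_dict=True,            # return input_ids and attention_mask together
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))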

Path 3: Mobile with LiteRT (Android/iOS)

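// Android with LiteRT (illustrative sketch: class and method names are placeholders, check the LiteRT docs for the actual API)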
val model = LiteRTModel.load("gemma4-e2b-q2.tflite")
val response = model.generate(
    prompt = "Summarize this meeting transcript:",
    input = transcriptText,
    maxTokens = 256
)

The E2B variant at 2-bit quantization runs inference on a Pixel 8 in under 2 seconds for a 100-token response. That's fast enough for real-time mobile features.

Five Things You Can Build Now

1. Offline-First Coding Assistant

Gemma 4's 31B dense model scores 80.0% on LiveCodeBench, and the 26B MoE comes close to that quality at a fraction of the resource cost. That is enough to handle code completion, refactoring suggestions, and bug explanations without an internet connection. Build a VS Code extension that runs Gemma 4 locally:

{
  "editor.inlineSuggest.enabled": true,
  "gemma4.modelPath": "~/.models/gemma4-26b-q4.gguf",
  "gemma4.maxTokens": 256,
  "gemma4.temperature": 0.1
}

The privacy angle sells itself: your code never leaves your machine. For enterprises with air-gapped development environments, this makes an AI coding assistant viable where cloud APIs are not an option.

2. On-Device Voice Agent

Combine Gemma 4 E4B with Whisper tiny for a fully offline voice assistant:

import whisper
# Gemma4Local is an illustrative wrapper; substitute the client for your local runtime
from gemma4 import Gemma4Local

stt = whisper.load_model("tiny")
llm = Gemma4Local("gemma4-e4b-q4")

# Transcribe audio
audio_text = stt.transcribe("recording.ogg")["text"]

# Process with Gemma 4
response = llm.generate(
    system="You are a personal assistant. Be concise.",
    user=audio_text,
)

Total resource footprint: ~3GB RAM, no network, no API keys. The entire pipeline fits on a Raspberry Pi 5 with 4GB or more of RAM.

3. Smart Home Controller with Function Calling

Gemma 4's native function calling makes it a natural fit for home automation. Define tools, and the model decides when to call them:

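# JSON Schema-style tool definitions; the exact wrapper format depends on your runtime.
# llm is the local model wrapper from the previous example.
tools = [
    {
        "name": "set_lights",
        "description": "Set room lights to a brightness level (0-100)",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string"},
                "brightness": {"type": "integer", "minimum": 0, "maximum": 100},
            },
            "required": ["room", "brightness"],
        },
    },
    {
        "name": "set_thermostat",
        "description": "Set temperature in Fahrenheit",
        "parameters": {
            "type": "object",
            "properties": {"temperature": {"type": "integer"}},
            "required": ["temperature"],
        },
    },
]

response = llm.generate(
    user="It's getting dark, dim the living room lights and bump the heat up a bit",
    tools=tools,
)
# Example output: [set_lights(room="living room", brightness=30), set_thermostat(temperature=72)]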

No cloud dependency. The controller processes natural language commands locally with sub-second latency.
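
The model only proposes calls; the application still has to execute them safely. Below is a minimal dispatcher sketch. It assumes the runtime hands back parsed calls as a list of `{"name": ..., "arguments": {...}}` dictionaries, which is a hypothetical format, and the handler bodies are stand-ins for real device code.

```python
def set_lights(room: str, brightness: int) -> str:
    if not 0 <= brightness <= 100:
        raise ValueError("brightness must be between 0 and 100")
    return f"{room} lights set to {brightness}%"


def set_thermostat(temperature: int) -> str:
    return f"thermostat set to {temperature}F"


# Only functions listed here can be invoked by the model
REGISTRY = {"set_lights": set_lights, "set_thermostat": set_thermostat}


def dispatch(tool_calls, registry=REGISTRY):
    """Execute each parsed tool call and collect a result or an error string.

    tool_calls is assumed to look like [{"name": str, "arguments": dict}, ...].
    """
    results = []
    for call in tool_calls:
        handler = registry.get(call.get("name"))
        if handler is None:
            results.append(f"error: unknown tool {call.get('name')!r}")
            continue
        try:
            results.append(handler(**call.get("arguments", {})))
        except (TypeError, ValueError) as exc:
            # Wrong argument names or out-of-range values from the model
            results.append(f"error: {exc}")
    return results


calls = [
    {"name": "set_lights", "arguments": {"room": "living room", "brightness": 30}},
    {"name": "set_thermostat", "arguments": {"temperature": 72}},
    {"name": "unlock_door", "arguments": {}},
]
for line in dispatch(calls):
    print(line)
```

The allow-list matters more on-device than it might seem: a small model can hallucinate a tool name, and the registry lookup ensures that such a call becomes an error message instead of an action.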

4. Document Processing Pipeline

For businesses handling sensitive documents (medical records, legal contracts, financial statements), Gemma 4 can process everything locally.
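
A minimal sketch of such a pipeline: prompt the local model for structured JSON, then validate the reply before it reaches downstream systems. The field names are a hypothetical contract schema, and `generate` is any callable that wraps your local runtime.

```python
import json

REQUIRED_FIELDS = ("party_a", "party_b", "effective_date")  # hypothetical schema


def build_extraction_prompt(document_text: str) -> str:
    """Ask the model for a JSON object containing only the required fields."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        "Extract the following fields from the contract and reply with a single "
        f"JSON object and nothing else. Fields: {fields}.\n\n{document_text}"
    )


def parse_extraction(raw_reply: str) -> dict:
    """Parse the model reply and reject anything missing a required field."""
    data = json.loads(raw_reply)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {f: data[f] for f in REQUIRED_FIELDS}  # drop any extra keys


def extract(document_text: str, generate) -> dict:
    """generate is any callable that maps a prompt string to the model's reply."""
    return parse_extraction(generate(build_extraction_prompt(document_text)))


# Stand-in for a local model so the sketch runs without one installed
def fake_generate(prompt: str) -> str:
    return (
        '{"party_a": "Acme Corp", "party_b": "Globex", '
        '"effective_date": "2026-01-15", "notes": "n/a"}'
    )


record = extract("This agreement between Acme Corp and Globex ...", fake_generate)
print(record)
```

Validation is the important half. Small models occasionally return malformed or incomplete JSON, so treating every reply as untrusted input keeps bad records out of the pipeline.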

The Apache 2.0 license imposes no usage restrictions and no data-reporting obligations, and because inference runs locally, document content never leaves your infrastructure.

5. Edge AI for Retail/Manufacturing

The larger Gemma 4 variants are multimodal and can process images alongside text, which lets you deploy them on edge devices in physical locations such as stores and factory floors.
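
As a sketch of what an image request could look like, the snippet below builds a request body for Ollama's `/api/chat` endpoint, which accepts base64-encoded images in a message's `images` list. Whether a given `gemma4` tag accepts images is an assumption here, and the shelf-inspection prompt is only an example.

```python
import base64


def build_vision_request(model: str, question: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/chat body with one attached image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": question, "images": [encoded]}],
        "stream": False,
    }


# In practice image_bytes would come from open("shelf.jpg", "rb").read()
payload = build_vision_request(
    "gemma4:26b",  # assumed to be a vision-capable tag
    "List any empty shelf positions you can see.",
    image_bytes=b"\xff\xd8\xff",  # placeholder bytes, not a real photo
)
print(payload["messages"][0]["images"][0])
```

The payload can be sent with any HTTP client pointed at the local Ollama server, so camera frames never leave the device.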

The Limitations to Know About

Gemma 4 is real, but it's not magic:

Context window. The edge variants (E2B, E4B) support shorter context windows than cloud models. Don't expect to process a 100-page document in a single pass with the E2B variant.
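
A common workaround is to split the document, summarize each piece, and then summarize the summaries. The sketch below uses word counts as a rough proxy for tokens. The chunk sizes are placeholder values, not Gemma 4 limits, and `summarize` is any callable wrapping your local model.

```python
def chunk_words(text: str, max_words: int, overlap: int = 0):
    """Split text into word-based chunks with optional overlap between neighbors."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks


def summarize_long(text: str, summarize, max_words: int = 1500, overlap: int = 100) -> str:
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    partials = [summarize(chunk) for chunk in chunk_words(text, max_words, overlap)]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))


demo = chunk_words("a b c d e f g h i j", max_words=4, overlap=1)
print(demo)
```

The overlap repeats the last word or words of each chunk at the start of the next, which reduces the chance that a sentence split across a boundary loses its meaning.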

Reasoning depth. The 31B model competes with cloud models on benchmarks. The E2B model does not. For complex multi-step reasoning, you need the larger variants. The E2B shines at classification, extraction, and short-form generation.

Multimodal. Vision capabilities are available on the larger variants but not the smallest edge models. If you need image understanding on mobile, target E4B or higher.

Latency vs. cloud. Gemma 4 on a laptop is typically slower than a hosted model such as Claude via API. The tradeoff is privacy, cost, and offline capability. For latency-sensitive applications, quantize aggressively and keep prompts short.
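
Before committing to an on-device feature, it helps to measure throughput on the target hardware and check it against a latency budget. This is a minimal sketch: `generate` is any callable returning a list of tokens, and the budget numbers are examples.

```python
import time


def measure_throughput(generate, prompt: str, clock=time.perf_counter):
    """Time one generate call and return (seconds, tokens, tokens_per_second).

    generate is any callable returning a list of generated tokens.
    """
    start = clock()
    tokens = generate(prompt)
    elapsed = clock() - start
    rate = len(tokens) / elapsed if elapsed > 0 else float("inf")
    return elapsed, len(tokens), rate


def fits_budget(tokens_per_second: float, max_tokens: int, budget_seconds: float) -> bool:
    """Check whether a reply of max_tokens can finish inside the latency budget."""
    return max_tokens / tokens_per_second <= budget_seconds


# 100 tokens in 2 seconds corresponds to 50 tokens per second
print(fits_budget(50, max_tokens=100, budget_seconds=2.0))
print(fits_budget(50, max_tokens=256, budget_seconds=2.0))
```

At 50 tokens per second a 100-token reply meets a 2-second budget, but a 256-token reply does not, which is why capping output length matters as much as quantization. The estimate ignores prompt processing time, so long prompts will push real latency higher.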

What Changed

A year ago, running a capable AI model on a phone meant a 10-second wait for a mediocre response. Gemma 4 E2B delivers a usable response in under 2 seconds on current hardware, with function calling that actually works.

The Apache 2.0 license removes most business constraints. Fork it, fine-tune it, embed it in a commercial product, or sell it as a service, as long as you keep the license and attribution notices the license requires. Google is betting that widespread adoption of Gemma models grows the ecosystem that pays for Vertex AI and Google Cloud.

For developers, the practical impact is this: any feature that currently requires an API call to a cloud model can now be evaluated as an on-device feature. Not every feature should move to the edge — cloud models are still more capable for complex tasks. But the default assumption that "AI requires an API key" no longer holds.
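
One way to act on this is a local-first pattern with a cloud fallback. The sketch below assumes three application-supplied callables, all hypothetical: `local_generate` and `cloud_generate` map a prompt to a reply, and `is_good_enough` decides whether the local reply can be used.

```python
def answer(prompt: str, local_generate, cloud_generate, is_good_enough):
    """Try the on-device model first and fall back to the cloud only if needed.

    Returns (reply, source) where source is "local" or "cloud".
    """
    try:
        reply = local_generate(prompt)
        if is_good_enough(reply):
            return reply, "local"
    except RuntimeError:
        pass  # e.g. model not loaded or out of memory; fall through to the cloud
    return cloud_generate(prompt), "cloud"


# Stand-ins so the sketch runs without any model
local = lambda p: "" if "prove" in p else "short local answer"
cloud = lambda p: "detailed cloud answer"
non_empty = lambda reply: bool(reply.strip())

print(answer("summarize this note", local, cloud, non_empty))
print(answer("prove this theorem", local, cloud, non_empty))
```

The quality check is where the real design work lives. A non-empty test is the simplest possible gate, and a production version might validate JSON structure or check a confidence signal instead.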

Gemma 4 makes on-device AI a design choice, not a technical limitation. What you build with that choice is up to you.