EXO: Running Large-Scale AI Models on Everyday Devices
A Comprehensive Guide and Hands-On Tutorial
In the evolving world of AI, large language models (LLMs) have become increasingly prominent, showcasing abilities close to human-like intelligence in text generation, reasoning, and various other tasks. However, these models are resource-intensive, often requiring specialized hardware setups like racks of GPUs, which are prohibitively expensive for most individuals or small organizations. Enter EXO – an open-source library that transforms everyday devices like laptops, phones, and even Apple Watches into a distributed AI cluster capable of running these powerful models. Here’s a closer look at EXO, its underlying technology, and how it democratizes access to AI by using consumer hardware.
What is EXO?
EXO is an open-source library designed to run large AI models on clusters of edge devices. It enables anyone to use local hardware—whether it’s a MacBook, iPhone, iPad, or Android phone—to collectively process large AI models, which are typically limited to data centers with GPUs. Setting up EXO is straightforward: simply clone the EXO repository, run the installation script, and launch the library. EXO’s magic lies in its ability to link devices on the same network or through other forms of connection, aggregating their computing power to support demanding AI tasks.
This setup allows for running sophisticated language models like Meta’s LLaMA, with billions of parameters, on common consumer devices without requiring expensive, high-end GPUs. Devices pool their resources together to split model layers, processing them collaboratively, which lets users work with advanced AI models even on hardware with limited memory and processing power.
The Vision Behind EXO
The fundamental goal of EXO is to democratize AI access. There are two possible futures in AI: one in which a few companies control the best AI models and limit their accessibility, and another where powerful models are open-source and widely available. While the first option may offer advantages in terms of control and centralization, an open-source AI world aligns more closely with a democratized approach to technology.
However, unlike traditional open-source software, where anyone with a laptop and ideas can contribute, open-source AI development requires substantial up-front investment in hardware and infrastructure. Training and even using AI models can be incredibly expensive due to their high computational needs. By enabling distributed processing on everyday devices, EXO bridges this gap, making open-source AI usable without needing massive data centers or exclusive hardware.
Technical Overview: How EXO Works
Single Device Scenario
When running a model on a single device, the setup is straightforward. If the model fits into the device's VRAM, it can run locally without any complex setup. For example, a model with 4 billion parameters quantized to 4 bits needs only about 2 GB of memory for its weights (4 bits is half a byte per parameter, plus some overhead for activations). Such a model can be processed directly on a laptop or phone with adequate VRAM.
However, for models that exceed a single device’s memory, EXO can still run them by dynamically loading and unloading parts of the model as needed. While this approach works, it’s slow, as it requires the system to load and unload model layers with each word or token generated. This process is computationally expensive and results in lower processing speeds, particularly for models with over 70 billion parameters, which demand significant memory.
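To make the memory arithmetic concrete, here is a minimal sketch (the parameter counts and precisions are illustrative, and real usage adds overhead for activations and the KV cache):
def model_memory_gb(n_params: float, bits_per_param: int) -> float:
    # Weight memory only: parameters x bits per parameter, converted to GB.
    # Activations and the KV cache add further overhead on top of this.
    return n_params * bits_per_param / 8 / 1e9

print(model_memory_gb(4e9, 4))    # ~2 GB: fits a typical laptop or phone
print(model_memory_gb(70e9, 16))  # ~140 GB: far beyond any single consumer device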
Multi-Device Setup: The Power of Clusters
The real power of EXO shines in a multi-device setup. By linking multiple devices in a local network (using protocols like gRPC) or connecting them via cables like Thunderbolt, EXO distributes the model layers across these devices. Each device stores and processes specific layers, meaning that memory loading and unloading aren’t necessary. Instead, each device performs its computations and only transfers small embeddings (usually just a few kilobytes) to the next device in the chain.
In this distributed setup, EXO processes data efficiently and dramatically speeds up performance by minimizing load times. For example, a single-device setup might achieve 0.3 tokens per second, while a multi-device cluster could reach up to 12 tokens per second, depending on network speed and device performance.
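Conceptually, the partitioning looks something like the sketch below (a simplified illustration, not EXO's actual partitioning code; the device names and layer counts are invented). Each device owns a contiguous slice of the model's layers, and only the small activation tensor crosses the network between slices:
# Simplified sketch of layer-wise pipeline partitioning (not EXO's real code).
devices = {"macbook": 16, "iphone": 8, "mac-mini": 8}   # hypothetical capacities
n_layers = 32

def partition(devices: dict[str, int], n_layers: int) -> dict[str, range]:
    # Assign contiguous layer ranges proportionally to each device's capacity.
    total = sum(devices.values())
    start, plan = 0, {}
    for name, capacity in devices.items():
        count = round(n_layers * capacity / total)
        plan[name] = range(start, min(start + count, n_layers))
        start += count
    return plan

print(partition(devices, n_layers))
# {'macbook': range(0, 16), 'iphone': range(16, 24), 'mac-mini': range(24, 32)}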
Balancing Latency and Throughput
EXO’s distributed setup is especially useful for balancing latency and throughput:
Latency: Adding more devices to a cluster improves response times up to a certain point, beyond which additional devices may actually slow down processing due to added communication overhead.
Throughput: EXO’s multi-device architecture shines when processing multiple requests simultaneously. If you need to process numerous AI tasks in parallel, adding more devices can lead to nearly linear scaling, maximizing throughput (the toy model after this list illustrates both effects).
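The following toy cost model is a sketch for intuition only, not EXO's actual scheduler, and all of its timings and sizes are invented. It shows why latency improves sharply once the cluster's memory covers the model, then creeps back up as hops accumulate, while throughput keeps scaling:
# Toy cost model for pipeline parallelism (illustrative, not EXO's scheduler).
MODEL_GB = 40.0    # hypothetical model size
DEVICE_GB = 16.0   # hypothetical memory per device
COMPUTE_S = 1.0    # total per-token compute across all layers
HOP_S = 0.05       # per-hop transfer of the small activation tensor
SWAP_S = 10.0      # per-token disk reload penalty when weights don't fit

def token_latency(n_devices: int) -> float:
    # A token still passes through every layer; if the weights don't fit in
    # combined memory, layers must be reloaded from disk for every token.
    fits = n_devices * DEVICE_GB >= MODEL_GB
    swap = 0.0 if fits else SWAP_S
    return COMPUTE_S + swap + HOP_S * (n_devices - 1)

def throughput(n_devices: int) -> float:
    # With many parallel requests the pipeline stays full, so a token finishes
    # once per stage time: near-linear scaling until hops dominate.
    return 1.0 / (COMPUTE_S / n_devices + HOP_S)

for n in (1, 2, 3, 4, 8):
    print(f"{n} device(s): {token_latency(n):.2f}s/token, {throughput(n):.2f} tok/s")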
Use Cases and Applications of EXO
Running Large Models on Consumer Hardware: EXO’s primary use case is running high-parameter models, such as LLaMA-70B, on consumer devices by pooling resources from multiple phones, laptops, or desktops. In practice, users can create a local network of Macs, iPhones, and other devices, pooling them to handle models traditionally run on specialized GPU clusters. This enables applications that would normally be restricted to cloud-based solutions to run locally.
Privacy-Focused AI Applications: Running AI models on a local device cluster offers enhanced privacy since data doesn’t need to be sent to the cloud. For example, personal assistant models that handle sensitive information, like scheduling or messaging, can run entirely on a personal device network, keeping data secure and localized.
Smart Agent Deployment: With EXO, users can deploy AI agents that perform complex tasks, such as managing calendar events, messaging contacts, and even responding intelligently to specific user commands. The local agent uses visual and language models to interpret device screenshots, take actions, and automate tasks—all without needing cloud processing.
Specialized Edge Applications: EXO supports smaller devices like Apple Watches, which can run lightweight models such as voice-to-text transcriptions fully on-device. Though the computational capacity of these devices may not contribute significantly to large-scale model processing, they can still serve specific, low-power applications within the EXO cluster.
Challenges and Limitations of EXO
While EXO opens up exciting possibilities, there are limitations:
Device-Specific Constraints: EXO currently operates within the limitations of device-specific hardware. For example, Apple’s Neural Engine remains partially locked for developers, limiting EXO’s access to some of the iPhone’s potential processing power.
Performance Bottlenecks: In a distributed setup, performance depends heavily on network speeds and device connections. Optimally connecting devices (e.g., via Thunderbolt rather than Wi-Fi) improves speed, but users with slower network connections may face delays.
Limited Compatibility with Dedicated AI Accelerators: While EXO can leverage existing hardware like CPUs and GPUs, it cannot yet fully exploit the potential of specialized accelerators like Apple’s Neural Engine on iOS devices, as access to this hardware remains restricted.
Why EXO Matters: The Road Ahead for Democratized AI
EXO represents a significant step toward making powerful AI accessible and practical for the average user, without the need for expensive, high-end GPUs or cloud computing costs. With EXO, anyone with a few devices can run models that previously required thousands of dollars in hardware.
Furthermore, EXO emphasizes local, privacy-centric computing by running models on devices directly within a user’s control. This opens doors for a range of applications where data sensitivity and privacy are paramount, like in medical or personal assistant AI.
EXO’s continued development will likely push the boundaries of what is possible with consumer devices, potentially integrating with more types of hardware and supporting even larger models in distributed environments.
How to Get Started with EXO
The easiest way to start using EXO is to install it from source. Below is a step-by-step guide that will walk you through the process of getting EXO up and running, turning your everyday devices into an AI cluster.
Prerequisites
Python: Ensure you have Python 3.12.0 or newer installed. This version is required due to issues with asyncio in previous versions.
Linux with NVIDIA GPU: If you plan to use EXO on a Linux machine with an NVIDIA GPU, make sure you have the following installed:
NVIDIA Driver: You can verify the installation with nvidia-smi.
CUDA: Follow NVIDIA's CUDA Installation Guide and test it using nvcc --version.
cuDNN: Download and configure cuDNN.
Hardware Requirements
The primary requirement to run EXO is to have enough memory across all devices to fit the entire model. For instance, if you are running LLaMA 3.1 with 8 billion parameters (in FP16 precision), you need 16GB of memory in total across your devices. Different combinations work, such as:
2 x 8GB MacBook Airs
1 x 16GB NVIDIA RTX 4070 Ti Laptop
2 x Raspberry Pi 400 with 4GB RAM each (running on CPU) + 1 x 8GB Mac Mini
EXO is designed for devices with heterogeneous capabilities, meaning you can mix powerful GPUs, integrated GPUs, and even CPUs. Note that adding less powerful devices will increase overall latency, since each pipeline stage waits on its slowest link, but it also adds memory and increases throughput for parallel workloads.
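Before assembling a cluster, a quick back-of-the-envelope check tells you whether the combined memory is sufficient (a sketch using the FP16 example above; real deployments need headroom for the OS and activations):
device_memory_gb = [8, 8]   # e.g. two 8GB MacBook Airs
model_params = 8e9          # LLaMA 3.1 8B
bytes_per_param = 2         # FP16

needed_gb = model_params * bytes_per_param / 1e9
have_gb = sum(device_memory_gb)
print(f"need ~{needed_gb:.0f} GB, have {have_gb} GB:",
      "fits" if have_gb >= needed_gb else "does not fit")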
Installation From Source
To get started with EXO from source:
# Clone the EXO repository
git clone https://github.com/exo-explore/exo.git
# Change directory to the exo folder
cd exo
# Install EXO in editable mode
pip install -e .
# Alternatively, use a virtual environment
source install.sh
Troubleshooting
MacOS-Specific Issues: If you are running EXO on a Mac, you may encounter some installation issues related to MLX. The EXO repository contains a guide to help with troubleshooting those issues.
Performance Optimization: For users on Apple Silicon Macs, upgrading to MacOS 15 and running the provided ./configure_mlx.sh script can help optimize GPU memory allocation.
SSL Certificate Issues: Some versions of MacOS and Python may have improperly installed certificates, leading to SSL errors (e.g., when downloading from Hugging Face). To fix this, run the Install Certificates command, typically located at /Applications/Python 3.x/Install Certificates.command.
Example Usage
EXO makes it easy to start running AI models across your devices, and it supports several configurations.
Running EXO on Multiple Devices
Once EXO is installed, running it on multiple devices is as simple as executing the exo command on each device. The system will automatically discover other devices on the same network—no manual configuration required.
Example:
Device 1 (MacOS):
exo
Device 2 (Linux):
exo
After launching EXO on each device, it starts a ChatGPT-like WebUI (powered by Tinygrad's Tinychat) at http://localhost:8000. You can also access a ChatGPT-compatible API at http://localhost:8000/v1/chat/completions for integrating EXO into your applications.
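Depending on your EXO version, you can also send a one-off prompt from the terminal instead of using the WebUI (check the EXO README if the subcommand is missing from your build):
# Run a single prompt against the cluster from the command line
exo run llama-3.2-3b --prompt "What is the meaning of exo?"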
Using cURL for API Access
EXO provides a ChatGPT-compatible API, which allows easy integration. Here are a few examples:
LLaMA 3.2 (3B parameters):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b",
"messages": [{"role": "user", "content": "What is the meaning of EXO?"}],
"temperature": 0.7
}'
LLaVA 1.5 (7B Vision Language Model):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are these?"
},
{
"type": "image_url",
"image_url": {
"url": "http://images.cocodataset.org/val2017/000000039769.jpg"
}
}
]
}
],
"temperature": 0.0
}'
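Since the API follows the ChatGPT schema, you can call it from Python as well. Here is a minimal sketch using the third-party requests library (pip install requests), assuming EXO is already running locally:
import requests

# Call EXO's ChatGPT-compatible endpoint with the same payload as the
# curl example above.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "llama-3.2-3b",
        "messages": [{"role": "user", "content": "What is the meaning of EXO?"}],
        "temperature": 0.7,
    },
    timeout=120,  # first calls can be slow while model shards are loaded
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])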
Model Storage and Debugging
Model Storage: By default, models are stored in ~/.cache/huggingface/hub. You can set a different storage location by defining the HF_HOME environment variable.
Debugging: To enable debug logs, set the DEBUG environment variable (range: 0-9). For the Tinygrad inference engine, you can use the separate TINYGRAD_DEBUG flag. For example:
DEBUG=9 exo
TINYGRAD_DEBUG=2 exo
Known Issues and Upcoming Features
MacOS SSL Errors: Occasionally, SSL certificate errors occur on MacOS, especially when interacting with external services like Hugging Face. These can often be solved by re-running the Install Certificates script.
iOS Implementation Lag: Due to rapid development, the iOS version of EXO has lagged behind the Python implementation. The EXO team is working on a proper solution and will make an announcement once the iOS version is ready for public use. If you need early access, contact alex@exolabs.net.
Supported Inference Engines and Networking Modules
Inference Engines:
MLX
Tinygrad
(Upcoming) PyTorch, llama.cpp
Networking Modules:
gRPC (fully supported)
(Upcoming) Radio, Bluetooth
EXO makes it easy to run AI clusters with everyday devices. With just a little configuration, you can set up a distributed AI environment, leverage community contributions, and experiment with powerful models—all while keeping costs low and retaining control over your own data.