Quickly Run AI Large Language Models on Windows - Llama3

Last updated: June 29, 2026

Overview

Meta recently released the latest Llama3 model and open-sourced the code. Meta Llama 3 is now available in 8B and 70B pre-trained and instruction-tuned versions, supporting a wide range of applications.

Llama 3 excels at language nuances, contextual understanding, and complex tasks such as translation and dialogue generation.

We can quickly run the Llama3 8B model on Windows.

│ 📝Notes
│
│ Disclaimer: This article contains almost no original content. Most of the material comes from the internet; I have only collected the detailed steps that actually worked for me, for readers' reference.
│ My expertise is limited, and suggestions for improvement are welcome.
│
│ There are currently several ways to quickly run AI models on Windows, including but not limited to:
│
│ - llamafile
│ - Chat With RTX | Nvidia
│ - WSL2 + WasmEdge
│
│ Because my hands-on experience is limited to the last option, this article is a hands-on guide to WSL2 + WasmEdge.

Tech Stack

Llama3
LlamaEdge + WasmEdge
NVIDIA GPU Driver + CUDA
WSL2

Prerequisites

Windows 10/11
Hardware: NVIDIA GPU (exact model requirements are uncertain — any GPU that supports CUDA should work; my RTX 2060 worked without issues)

Key Concepts

Llama 3

Model Performance

Meta released the next-generation language model Llama 3, including 8B and 70B parameter pre-trained and instruction-tuned models.
Llama 3 demonstrates state-of-the-art performance across a wide range of industry benchmarks and offers new capabilities, including improved reasoning.
Llama 3 models outperform other models of comparable size on standard benchmarks and are optimized for real-world scenarios.

Model Architecture

Llama 3 uses an improved decoder-only transformer architecture with grouped query attention (GQA) to improve inference efficiency.
Llama 3 was pre-trained on over 15T tokens, a dataset seven times larger than the one used for Llama 2, and it includes four times more code.
Llama 3 uses various data filtering pipelines to ensure training data quality, including heuristic filters, NSFW filters, semantic deduplication methods, and text classifiers.
Llama 3 uses detailed scaling laws to scale pre-training and selects the optimal combination of training data.

Instruction Tuning

Llama 3's approach to instruction tuning combines supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).

Availability

Llama 3 will soon be available on all major platforms, including cloud providers, model API providers, and more. Llama 3 will be everywhere.

Future Plans

The Llama 3 8B and 70B models mark just the beginning of what Meta plans to release for Llama 3. Much more is coming. Meta’s largest model has over 400B parameters, and while these models are still in training, the team is excited about the trends they’re seeing.
In the coming months, Meta will release multiple models with new capabilities, including multimodality, the ability to converse in multiple languages, longer context windows, and stronger overall capabilities. Once Llama 3 training is complete, Meta will also publish a detailed research paper.

LlamaEdge

The LlamaEdge project makes it easy to run LLM inference applications locally and to create OpenAI-compatible API services for the Llama 2 and Llama 3 families of LLMs.
LlamaEdge uses the Rust + Wasm stack, offering a strong alternative to Python for AI inference.
LlamaEdge supports all large language models (LLMs) based on the Llama 2 and Llama 3 architectures; model files must be in GGUF format.
Compiled Wasm files are cross-platform — the same Wasm file can run on different operating systems, CPUs, and GPUs.
LlamaEdge provides a detailed troubleshooting guide to help users resolve common issues.

WasmEdge

WasmEdge Overview

The WasmEdge runtime provides a well-defined execution sandbox for the WebAssembly bytecode program it contains.
WasmEdge can run standard WebAssembly bytecode programs compiled from C/C++, Rust, Swift, AssemblyScript, or Kotlin source code.
WasmEdge supports all standard WebAssembly features, as well as many proposed extensions.
WasmEdge also supports extensions tailored for cloud-native and edge computing use cases (e.g., WasmEdge network sockets, Postgres and MySQL-based database drivers, and WasmEdge AI extensions).
WasmEdge can be launched from the CLI as a new process or from an existing process.
WasmEdge is currently not thread-safe.
WasmEdge can be integrated with Go, Rust, or C applications.
The WasmEdge project is open source, and contributions are welcome.
The WasmEdge community holds a monthly community meeting to showcase new features, demonstrate new use cases, and conduct Q&A sessions.

It’s also worth mentioning WasmEdge’s advantages. At KubeCon NA 2023, the WasmEdge team highlighted its key strengths:

Compared with Java, Wasm has unique advantages in the cloud-native space, especially for running AI applications on GPUs.
Building an AI application today typically requires an API server, a large language model, and an orchestration framework. The resulting stack is research-oriented and hard to integrate tightly or to secure well.
WasmEdge aims to build more compact application servers that integrate prompt engineering, RAG frameworks, and other capabilities into the application server itself, orchestrated via Kubernetes.
WasmEdge worked with the W3C to define a new abstraction layer, WASI Neural Network (WASI-NN), which exposes GPU access and AI inference primitives as bytecode-level APIs.
Developers only need to write applications against the WASI API and compile them to bytecode, which can then be deployed and run on any Wasm-capable device without recompilation.
WasmEdge offers live demos in which attendees install WasmEdge on their laptops, download a large language model, and chat with it without a network connection, which demonstrates cross-platform portability.

If you’re interested, check out this video: WasmEdge, portable and lightweight runtime for AI/LLM workloads | Project Lightning Talk

NVIDIA CUDA

The CUDA computing platform is more than just a programming model — it includes thousands of general-purpose computing processors in the GPU computing architecture, parallel computing extensions for many popular languages, powerful plug-and-play accelerated libraries, and turnkey applications and cloud-based computing devices.
CUDA is not limited to the popular CUDA Toolkit and CUDA C/C++ programming language.
Since its introduction in 2006, CUDA has been widely deployed in thousands of applications and published research papers, supported by CUDA-compatible GPUs installed in over 500 million laptops, workstations, computing clusters, and supercomputers.
Many researchers and developers use the CUDA platform to advance the state of the art in their work.
CUDA offers three main ways to accelerate an application:
- Drop in a GPU-accelerated library to replace or augment CPU-only libraries such as MKL BLAS, IPP, and FFTW.
- Use OpenACC directives to automatically parallelize loops in Fortran or C code.
- Develop custom parallel algorithms and libraries using familiar programming languages such as C, C++, C#, Fortran, Java, and Python.

WSL (Windows Subsystem for Linux)

Windows Subsystem for Linux (WSL) is a feature of Microsoft Windows that allows developers to run a Linux environment without the need for a separate virtual machine or dual-boot setup. WSL has two versions: WSL 1 and WSL 2.
By default, WSL is not available to all Windows 10 users. It can be obtained by joining the Windows Insider Program or by manually installing it via the Microsoft Store or Winget.
WSL 1 was first released on August 2, 2016, as a compatibility layer for running Linux binary executables (ELF format) by implementing Linux system calls on the Windows kernel. It is available on Windows 10, Windows 10 LTSB/LTSC, Windows 11, Windows Server 2016, Windows Server 2019, and Windows Server 2022.
In May 2019, WSL 2 was released, introducing significant changes such as a real Linux kernel via a Hyper-V feature subset. WSL 2 differs from WSL 1 in that WSL 2 runs inside a managed virtual machine that implements a full Linux kernel. As a result, WSL 2 is compatible with more Linux binaries than WSL 1, since not all system calls were implemented in WSL 1. Since June 2019, WSL 2 has been available to Windows 10 customers through the Windows Insider Program, including the Home edition.

Step-by-Step Guide

1. Install the Latest NVIDIA Windows GPU Driver

Steps omitted.

2. Install WSL

│ 📚️Reference
│
│ How to install Linux on Windows with WSL

Prerequisites

You must be running Windows 10 version 2004 or higher (Build 19041 or higher) or Windows 11 to use the commands below.

Open your preferred Windows Terminal / Command Prompt / PowerShell and install WSL:

wsl.exe --install

Make sure you have the latest WSL kernel:

wsl.exe --update
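
CUDA inside WSL needs the distribution to run under WSL 2, so it can be worth confirming that before moving on. The snippet below is a small sketch, not an official check: `wsl_flavor` is a helper name I made up, and the patterns assume the default Microsoft kernels (a custom kernel may report a different release string).

```shell
# Sketch: classify the environment from the kernel release string.
# wsl_flavor is an illustrative helper; the patterns are assumptions based on
# the default kernels and may not match a custom-built kernel.
wsl_flavor() {
  case "$1" in
    *WSL2*)         echo "wsl2" ;;     # e.g. 5.15.x-microsoft-standard-WSL2
    *[Mm]icrosoft*) echo "wsl1" ;;     # e.g. 4.4.0-19041-Microsoft
    *)              echo "not-wsl" ;;
  esac
}

# Run inside the Ubuntu shell:
wsl_flavor "$(uname -r)"

# From PowerShell, the VERSION column of this command shows the same thing:
#   wsl.exe --list --verbose
```

If the result is `wsl1`, the distribution can usually be converted from PowerShell with `wsl.exe --set-version <distro name> 2`, where the distro name is whatever `wsl.exe --list` reports on your machine.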

3. Install CUDA Toolkit in WSL Ubuntu

│ 📚️Reference
│
│ CUDA Toolkit 12.4 Update 1 Downloads

Enter WSL Ubuntu and install the CUDA Toolkit using the following commands:

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-wsl-ubuntu-12-4-local_12.4.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-4-local_12.4.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
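
Before continuing, it may help to confirm that the GPU and the toolkit are visible from the Ubuntu shell, since the installer log in the next step reports "CUDA detected via nvcc". This is a minimal sketch: `check_tool` is a helper name of my own, and `/usr/local/cuda/bin` is assumed to be where the toolkit package placed `nvcc`.

```shell
# Sketch: report whether the NVIDIA tools are reachable from this shell.
# check_tool is an illustrative helper, not part of CUDA or WSL.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1 -> $(command -v "$1")"
  else
    echo "missing: $1"
    return 1
  fi
}

# Assumption: the toolkit installed nvcc under /usr/local/cuda/bin.
PATH="/usr/local/cuda/bin:$PATH"
export PATH

# nvidia-smi comes from the Windows driver; nvcc comes from the CUDA Toolkit.
check_tool nvidia-smi || echo "  hint: check the Windows NVIDIA driver (step 1)"
check_tool nvcc       || echo "  hint: check the CUDA Toolkit install (step 3)"
```

When both tools are found, `nvidia-smi` should list the GPU and `nvcc --version` should print the toolkit release. To keep the `PATH` change across sessions, the export line can be added to your shell's startup file.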

4. Run a Local AI Large Language Model with LlamaEdge + WasmEdge - Llama3

│ 📚️References
│
│ - Some Say Open-Source Models Will Fall Behind — Try Llama 3 | LlamaEdge | Second State
│
│ The following is an excerpt from the original article:
│ Through this article, you’ll be able to develop and deploy Llama-3-8B applications on your own computer using LlamaEdge (Rust + Wasm stack). No need to install complex Python packages or C++ toolchains! See why this tech stack was chosen.
│
│ If you want to get started quickly, just run the following command in your terminal. This CLI tool will automatically download the required software: the LLM runtime, the Llama-3-8B model, and the LLM inference program.

bash <(curl -sSfL 'https://raw.githubusercontent.com/LlamaEdge/LlamaEdge/main/run-llm.sh') --model llama-3-8b-instruct

│ 🐾Warning
│
│ Friendly reminder: make sure you have a stable network connection before running this step.
│

The output looks like this:

[+] Downloading the selected model from https://huggingface.co/second-state/Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf
######################################################################################################### 100.0%
[+] Extracting prompt type: llama-3-chat
[+] No reverse prompt required
[+] Install WasmEdge with wasi-nn_ggml plugin ...

Using Python: /home/casey/.pyenv/shims/python3
INFO    - CUDA detected via nvcc
WARNING - Experimental Option Selected: plugins
WARNING - plugins option may change later
INFO    - Compatible with current configuration
INFO    - Running Uninstaller
WARNING - Uninstaller did not find previous installation
WARNING - SHELL variable not found. Using zsh as SHELL
INFO    - shell configuration updated
INFO    - Downloading WasmEdge
|============================================================|100.00 %INFO    - Downloaded
INFO    - Installing WasmEdge
INFO    - WasmEdge Successfully installed
INFO    - Downloading Plugin: wasi_nn-ggml-cuda
|============================================================|100.00 %INFO    - Downloaded
INFO    - Downloading Plugin: wasmedge_rustls
|============================================================|100.00 %INFO    - Downloaded
INFO    - Run:
source /home/casey/.zshrc

    The WasmEdge Runtime is installed in /home/casey/.wasmedge/bin/wasmedge.


[+] Downloading the latest llama-api-server.wasm ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 8070k  100 8070k    0     0  2575k      0  0:00:03  0:00:03 --:--:-- 7030k

[+] Downloading Chatbot web app ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100 1721k  100 1721k    0     0   703k      0  0:00:02  0:00:02 --:--:-- 10.2M


[+] Will run the following command to start the server:

    wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3-8B-Instruct.Q5_K_M.gguf llama-api-server.wasm --prompt-template llama-3-chat --model-name Meta-Llama-3-8B-Instruct.Q5_K_M.gguf --socket-addr 0.0.0.0:8080 --log-prompts --log-stat

    Chatbot web app can be accessed at http://0.0.0.0:8080 after the server is started


*********************************** LlamaEdge API Server ********************************

[2024-04-19 23:54:06.043] [error] instantiation failed: module name conflict, Code: 0x60
[2024-04-19 23:54:06.043] [error]     At AST node: module

[INFO] LlamaEdge version: 0.8.3
[INFO] Model name: Llama-3-8B
[INFO] Model alias: default
[INFO] Context size: 4096
[INFO] Prompt template: llama-3-chat
[INFO] Number of tokens to predict: 1024
[INFO] Number of layers to run on the GPU: 100
[INFO] Batch size for prompt processing: 512
[INFO] Temperature for sampling: 1
[INFO] Top-p sampling (1.0 = disabled): 1
[INFO] Penalize repeat sequence of tokens: 1.1
[INFO] Presence penalty (0.0 = disabled): 0
[INFO] Frequency penalty (0.0 = disabled): 0
[INFO] Enable prompt log: false
[INFO] Enable plugin log: false
[INFO] Socket address: 0.0.0.0:8080
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes
[INFO] Wasi-nn-ggml plugin: b2636 (commit 5dc9dd71)
[INFO] LlamaEdge API server listening on http://0.0.0.0:8080

If the one-click execution fails, you can also run the steps individually. See the original article for details.
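
If the downloads succeeded and only the last step needs to be repeated, the server can be restarted without the script. The sketch below simply rebuilds the start command that appears in the log above; `server_cmd`, `MODEL`, and `PORT` are illustrative names of my own, and it assumes the `.gguf` and `llama-api-server.wasm` files fetched by the script are in the current directory.

```shell
# Sketch: rebuild the server start command printed by run-llm.sh in the log above.
# MODEL, PORT, and server_cmd are illustrative names, not part of LlamaEdge.
MODEL="${MODEL:-Meta-Llama-3-8B-Instruct.Q5_K_M.gguf}"
PORT="${PORT:-8080}"

server_cmd() {
  # $1 = GGUF file name, $2 = TCP port; prints the command without running it.
  echo "wasmedge --dir .:. --nn-preload default:GGML:AUTO:$1" \
       "llama-api-server.wasm --prompt-template llama-3-chat" \
       "--model-name $1 --socket-addr 0.0.0.0:$2 --log-prompts --log-stat"
}

# Dry run: show the command so it can be compared with the log.
server_cmd "$MODEL" "$PORT"

# To start the server for real, run the printed command from the directory
# that holds the .gguf and .wasm files, for example:
#   eval "$(server_cmd "$MODEL" "$PORT")"
```

If `wasmedge` is not found, the installer output above suggests sourcing your shell configuration first (in my case `source ~/.zshrc`), since WasmEdge is installed under `~/.wasmedge/bin`.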

🎉🎉🎉 At this point, Llama3 is successfully running locally. 🎉🎉🎉

You can access it at: http://localhost:8080

Results

Accessing Llama3 via API Server

Llama3 is now running locally. The chatbot web app served by the API server feels much like using ChatGPT online.

While the model is generating a reply, Windows Task Manager shows GPU utilization reaching 100%:

"One prompt to max out the GPU at 100%" 😂😂😂

Furthermore, we can leverage the OpenAI-compatible API it provides to enable a variety of scenarios:

Call the API directly

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept:application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are a sentient, superintelligent artificial general intelligence, here to teach and assist me."}, {"role":"user", "content": "Write a short story about Goku discovering kirby has teamed up with Majin Buu to destroy the world."}], "model":"Llama-3-8B"}'
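
For repeated calls, the curl request above can be wrapped in a small helper. This is only a sketch under stated assumptions: `json_escape`, `chat_payload`, and `ask` are names I made up, the URL and model name are taken from the request above, and the escaping covers only backslashes and double quotes (prompts with newlines or control characters need a real JSON tool).

```shell
# Sketch: wrap the chat completion request in reusable helpers.
# json_escape, chat_payload, and ask are illustrative names, not a LlamaEdge API.
API_URL="${API_URL:-http://localhost:8080/v1/chat/completions}"
MODEL_NAME="${MODEL_NAME:-Llama-3-8B}"

json_escape() {
  # Escape backslashes first, then double quotes. Newlines are NOT handled.
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

chat_payload() {
  # Build a single-message request body for the given user prompt.
  printf '{"messages":[{"role":"user","content":"%s"}],"model":"%s"}' \
    "$(json_escape "$1")" "$MODEL_NAME"
}

ask() {
  # Send the prompt to the local server; requires the server to be running.
  curl -sS -X POST "$API_URL" \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload "$1")"
}

# Dry run: print the request body without contacting the server.
chat_payload 'Say "hello" in French.'
echo
```

With the server running, `ask 'Say "hello" in French.'` prints the raw JSON response. Assuming `jq` is installed and the response follows the OpenAI chat completion shape, piping it through `jq -r '.choices[0].message.content'` should leave just the reply text.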

Integrate with various clients that support custom AI API server addresses, such as:
- Obsidian Text Generator Plugin
- Immersive Translate
- Various ChatGPT-related browser extensions
- Various ChatGPT desktop applications

This enables a rich variety of AI application scenarios.

Conclusion

This article walked through the hands-on steps for quickly running a local AI large language model — Llama3 — on Windows, using WSL + WasmEdge + LlamaEdge.
It’s easy to get started with a low barrier to entry.
Give it a try if you’re interested.

Thanks to WSL, NVIDIA, CUDA, WasmEdge, and Llama3 for making this possible.

#AI #Windows #LLM #ChatGPT #Llama #WSL #WASM #WASMEdge

https://e-whisper.com/posts/30014/

Author: east4ming

Posted on: April 20, 2024