Setting up LM Studio
If you want to run LM Studio, you need either a Mac with Apple silicon (an M1 chip or later) or a Windows system with a CPU that supports AVX2 extensions. RAM and disk space requirements depend on the LLM you select: some models are relatively small, coming in at just a few GB, whilst larger models require considerably more storage and memory. GPU acceleration is currently considered experimental, but it is not required to run most LLMs. The latest Mac and Windows installers can be downloaded from https://lmstudio.ai/, and a Linux version is currently being trialled.
Procuring a model
LM Studio has a built-in search function that uses the Hugging Face API to find models. For demonstration purposes, I will use Meta's Llama 2 model. Note that model availability and performance change daily. Enter the keywords llama gguf in the search bar. Because Hugging Face also hosts models that are incompatible with LM Studio, I added gguf to the query. GPT-Generated Unified Format (GGUF) is a file format for storing LLMs, used by llama.cpp-based tools such as LM Studio.
Searching for LLMs in LM Studio's model search interface
The models you will see are filtered by potential compatibility with your system and ordered by popularity.
At the time of writing, the most popular option is TheBloke/Llama-2-7B-Chat-GGUF, a GGUF conversion of Meta's Llama 2 7B Chat model. Details about the model can be found in its Model Card, which is provided by the model's publisher. The model is available in multiple quantization variants; quantization is a technique that reduces the precision of a model's weights so that it consumes fewer resources.
In my case, I will opt for Q5_K_M, which requires approximately 8 GB of RAM and 5 GB of disk space. Select the most suitable option for your hardware and resources, then click Download.
Use the Model Card button to find out more about the model
Running an LLM on your desktop
Click the AI Chat icon in the navigation panel on the left. At the top of the window, open the model selector and choose the Llama 2 Chat model you downloaded. It takes a few seconds to load.
LM Studio may ask whether to override the default LM Studio prompt with the prompt suggested by the model's developer. A system prompt assigns a role, intent, and limitations to the model, e.g., “You are a helpful coding AI assistant,” which the model then uses as context when answering your prompts. Note that accepting a prompt (input) from a third party may have unintended side effects; you can review the prompt in the model settings.
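For reference, a system prompt of this kind is simply the first entry in an OpenAI-style message list; a minimal sketch (the wording and the user question here are illustrative) looks like this:

# A minimal sketch: the system prompt sets the model's role and context
# before the user's actual question. The content strings are illustrative.
messages = [
    {"role": "system", "content": "You are a helpful coding AI assistant."},
    {"role": "user", "content": "Write a Python function that reverses a string."},
]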
On the right side, there are various model-specific settings you can use to fine-tune both performance and resource consumption, including experimental GPU support. You can now converse with the downloaded model.
Running a local LLM server
LM Studio also allows you to run a downloaded model as a server. This is useful for developing and testing your code. The server is compatible with the OpenAI API, currently the most popular API for LLMs.
Click the Local Server option in the navigation bar. Note that starting the server disables the Chat option. Once again, you can adjust various settings or use a preset, but the defaults should work fine. Click Start Server.
The server listens on port 1234 by default, and you access the API over HTTP.
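As a quick sanity check, you can query the server directly over HTTP. The sketch below assumes the default port and the OpenAI-compatible /v1/models endpoint, which lists the models the server exposes:

import requests

# Ask the local LM Studio server which models it exposes
# (assumes the server is running on the default port 1234).
response = requests.get("http://localhost:1234/v1/models")
print(response.json())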
If you have installed the OpenAI SDK for Python, you can access the local server simply by replacing client = OpenAI() with client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed"). The best part is that you don't need an OpenAI API key.
Instead of conversing with one of the OpenAI models, you will access a model installed on your desktop. In our case, this is Meta’s Llama 2.
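Putting this together, a minimal sketch of a chat request against the local server might look like the following; the model name is only a placeholder, since LM Studio responds with whichever model you have loaded:

from openai import OpenAI

# Point the OpenAI client at the local LM Studio server instead of the OpenAI API.
# The api_key must be set but is not checked; the model name is a placeholder.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a helpful coding AI assistant."},
        {"role": "user", "content": "Explain what GGUF is in one sentence."},
    ],
)

print(completion.choices[0].message.content)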
You can view the model’s response in the LM Studio Server Logs.