Hands-on Small Language Models (Alexander Thomas)

Author: Alexander Thomas

Large language models have reshaped what's possible in AI—but their size, cost, and complexity can make them difficult to use in real-world production. Small language models (SLMs) offer a more practical alternative: they're efficient, focused, and built for applications that demand agility, privacy, and control. Hands-On Small Language Models provides a hands-on guide to understanding, building, and deploying these compact models to power specialized agentic applications. Author Alex Thomas, principal data scientist at John Snow Labs, draws on years of experience in natural language processing and applied AI to show how SLMs are democratizing the generative AI landscape. Through clear explanations and guided projects, including the development of a multi-functional movie chatbot, you'll learn how to combine, deploy, and monitor SLMs both locally and in the cloud.

Hands-On Small Language Models
Practical Patterns for Building Efficient Applications with SLMs
Alexander N. Thomas

Hands-on Small Language Models
by Alex Thomas

Copyright © 2026 Alexander Thomas. All rights reserved.

Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Michele Cronin
Production Editor: Jonathon Owen
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Interior Illustrator: Kate Dullea

January 2027: First Edition

Revision History for the Early Release
2026-01-21: First Release

See https://oreilly.com/catalog/errata.csp?isbn=9798341670723 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hands-on Small Language Models, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 979-8-341-67068-6 [FILL IN]

Brief Table of Contents (Not Yet Final)

Chapter 1: Language Models (unavailable)
Chapter 2: Getting Started (available)
Chapter 3: Selecting the Right Small Language Model (available)
Chapter 4: Agentic Applications with MCP (unavailable)
Chapter 5: Agentic Application with Multiple SLMs (unavailable)
Chapter 6: Testing and Compliance (unavailable)
Chapter 7: Deployment (unavailable)
Chapter 8: Monitoring (unavailable)

Chapter 1. Getting Started

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form (the author’s raw and unedited content as they write), so you can take advantage of these technologies long before the official release of these titles.

This will be the 2nd chapter of the final book.

If you’d like to be actively involved in reviewing and commenting on this draft, please reach out to the editor at mcronin@oreilly.com.

In the last chapter, we talked about what small language models (SLMs) are, what makes them different from large language models, and how to think about them when incorporating them into applications. In this chapter, we will prepare for the project that serves as the throughline of this book. We will install the software and libraries, acquire the data, and set up the environment for the project.

Project: Theoros

θεωρός
spectator, envoy at a festival [1]
envoy sent to consult an oracle [2]

Throughout this book, we’ll use an agentic system for searching for movies, learning about movies, and other movie-related tools. I decided to use movies for a few reasons. First, the data does not require special security: it is all publicly available and contains no personally identifying information. This means that we can focus on learning about using SLMs with agentic applications. That said, many real-life use cases will not have this luxury. Second, the topic does not require special domain knowledge, since most people are at least acquainted with the movies they like. Finally, it offers a good mix of different kinds of data, from structured relational data, to free text of various lengths, to potentially non-text data. We will focus on text data in this book, but you can grow this project to include processing and generating audio, images, or even video.

The Theoros application will be made available in a few different ways. The first will be through a simple MCP client, which we will use to perform functionality tests. From there, we will use LibreChat to test the chat experience. Later in the book, we will look at different ways of deploying models. For now, we will be using externally hosted models and models we launch with Ollama.

Installing the Software

In this book, I try to make the requirements as light as possible. Although SLMs can run on commodity machines, and even on mobile and edge devices, some people may have older machines or other limitations. To help ensure everyone can follow along, the code in this book should work locally on most machines, as well as on Google Colab.

We will be using some tools and frameworks that we need to deploy in our environment. For running locally, we will generally use Docker to simplify matters. This book will also include instructions for deploying the server in a Google Colab environment.

In the rest of this section, we’ll run through the various tools and frameworks we’ll be using, with a brief description of what each one is and how to install it.

Operating System

The code in this book is developed on Linux, either directly on a Linux machine or using the Windows Subsystem for Linux (WSL). The Linux flavor I am using is Ubuntu, which is also what Google Colab runs on.

Installing miniconda

These instructions can be skipped if you will only be using a Google Colab environment. Everyone else who is developing on their own machine should install miniconda, which provides conda, the package and environment manager we will use. The commands below are from the Anaconda website [3].

    mkdir -p ~/miniconda3
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
    rm ~/miniconda3/miniconda.sh

After installing, close and reopen your terminal application, then initialize conda. If your shell cannot find the conda command at this point, run source ~/miniconda3/bin/activate first.

    conda init --all

Now, you can create your environment.

    conda create --name slmbook python=3.12

Finally, activate your environment.

    conda activate slmbook

Now you are ready to begin installing libraries. In the repository for the code for this book, there is also a YAML file that can be used to set up the conda environment.
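
Before installing libraries, it can help to confirm that your shell or notebook is really running inside the new environment. The following is a minimal sketch and not part of the book's repository: the function name describe_environment is hypothetical, and it assumes that conda exports the CONDA_DEFAULT_ENV variable while an environment is active.

```python
import os
import sys


def describe_environment(expected_env="slmbook", expected_python=(3, 12)):
    """Report whether the running interpreter matches the environment created above."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated (assumption);
    # the value is None when no conda environment is active.
    active_env = os.environ.get("CONDA_DEFAULT_ENV")
    version = tuple(sys.version_info[:2])
    return {
        "conda_env": active_env,
        "python_version": "{}.{}".format(*version),
        "env_matches": active_env == expected_env,
        "python_matches": version == tuple(expected_python),
    }


# Print a short report, one setting per line.
for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If env_matches or python_matches comes back False, the most likely cause is that conda activate slmbook was not run in the current terminal.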

Python Data Science Stack

First, let’s install the Python Data Science Stack, also known as the Python Data Stack and the PyData Stack. It is a collection of popular open source Python libraries used in data science. The stack commonly includes the following:

NumPy: A numerical analysis library that powers many of the other libraries in this stack, as well as many libraries outside it. NumPy provides the main implementations of arrays and matrices used in Python.

SciPy: A scientific computing library with implementations of a large variety of algorithms for data science tasks. For example, it includes modules for optimization algorithms for linear programming, numerical differentiation, and statistics and probability.

pandas: Also a scientific computing library, but instead of focusing on algorithms, it provides important data structures. The most important of these are the data frame and the series. The data frame is a table-like structure, and the series is an array-like structure. Both are implemented using NumPy’s arrays for performance.

Matplotlib: A plotting library that is foundational to many other Python plotting libraries.

Scikit-learn: A machine learning library that contains a large variety of machine learning algorithms. The library also contains many other useful functions, feature extraction and metric calculation among others.

Jupyter: The classic notebook environment for Python. We will also be using Google Colab, which is a notebook environment originally based on Jupyter Notebooks.
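
To make the relationship between the data frame, the series, and NumPy arrays concrete, here is a small sketch. It assumes pandas and NumPy are already installed in your environment, and the three-row movie table is a hand-typed sample used only for illustration.

```python
import numpy as np
import pandas as pd

# A data frame is a table: each column has a name and holds one kind of value.
movies = pd.DataFrame({
    "title": ["Alien", "Arrival", "Moon"],
    "year": [1979, 2016, 2009],
    "runtime_min": [117, 116, 97],
})

# Selecting a single column gives a series, the array-like structure.
years = movies["year"]

# A series can hand back its values as a plain NumPy array.
values = years.to_numpy()

print(type(movies).__name__)   # DataFrame
print(type(years).__name__)    # Series
print(type(values).__name__)   # ndarray
print(movies["runtime_min"].mean())
```

Because the columns are backed by NumPy arrays, column-wide operations such as mean() run over the whole array at once rather than looping in Python.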

We will also be using some other libraries that are not usually included in the Python Data Science Stack but are common data science libraries nonetheless.

Natural Language Toolkit (NLTK): An NLP library that has a large variety of NLP algorithms. Although it is not performant compared to many other NLP libraries, it has a broad set of functions and few dependencies.

NetworkX: A library for processing and displaying graphs and networks. This library will help us analyze some of our data sets, as well as display graph-like or network-like data.

Let’s install these libraries. Here is the conda command to install the libraries discussed above.

    conda install numpy scipy pandas matplotlib scikit-learn jupyter nltk networkx

MCP

Model Context Protocol (MCP) from Anthropic “is an open-source standard for connecting AI applications to external systems” [4]. This refers to the protocol itself. “MCP” is also used to refer to the library used to build MCP servers and clients.

At the time of writing, there are many libraries and frameworks for implementing agentic applications. MCP is growing in popularity for a number of reasons. First, it offers a standardized way to organize prompts, tools, and data for use in agentic applications. Second, it handles much of the boilerplate necessary for making the agentic application available to other applications and services. Finally, since it comes from Anthropic, it is easily integrated with models and applications from Anthropic.

For our purposes, we will not rely on the integration with Anthropic, since we will be experimenting with and using models from different families. Using MCP to organize the structure of our agentic workflows will allow us to focus on optimizing the prompts and models. Additionally, since it is popular, deploying our application on MCP-compatible servers will be much easier.

Let’s install the MCP framework.

    conda install mcp

We will be exploring MCP more in chapter 4.

LiteLLM

LiteLLM is a framework that simplifies access to different models. LiteLLM will allow us to access models even if they do not have an OpenAI-compatible API. Another aspect of LiteLLM is the LiteLLM Proxy Server, which can help with many aspects of deployment. You can configure the proxy to handle custom behavior and secrets management.

Here we will install LiteLLM with pip, since it is not available in the mainline conda repository.

    pip install litellm

OpenRouter

OpenRouter is a platform that gives you access to a wide range of hosted models through a single API key. Although we will mostly be working with models hosted locally (or on Google Colab), we will be comparing them to some hosted models. For accessing these hosted models, we will use OpenRouter. Additionally, if you are in a situation where security or privacy is not a concern, but you cannot run models locally, then switching to OpenRouter can be a viable option. LiteLLM makes this easy, since changing from local models to hosted models is a minor configuration change.
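
To illustrate why the switch can be a minor configuration change, here is a minimal sketch that keeps the backend settings in one table and builds the arguments for LiteLLM's completion function from it. The helper name completion_kwargs, the BACKENDS table, and the OpenRouter model ID are assumptions made for illustration; the Ollama settings match the ones used later in this chapter. The sketch only builds the arguments, so it runs without LiteLLM or a model server.

```python
# Connection settings per backend. The "local" entry mirrors the Ollama setup
# used later in this chapter; the "hosted" model ID is an illustrative
# assumption, so substitute any model ID that OpenRouter currently lists.
BACKENDS = {
    "local": {
        "model": "ollama_chat/llama3",
        "api_base": "http://localhost:11434",
    },
    "hosted": {
        "model": "openrouter/meta-llama/llama-3-8b-instruct",
    },
}


def completion_kwargs(backend, prompt, temperature=0.1):
    """Build keyword arguments for litellm.completion for the chosen backend."""
    if backend not in BACKENDS:
        raise ValueError(f"Unknown backend: {backend!r}")
    # Copy the settings so the shared table is never modified.
    kwargs = dict(BACKENDS[backend])
    kwargs["messages"] = [{"role": "user", "content": prompt}]
    kwargs["temperature"] = temperature
    return kwargs


# Only the backend name changes between a local and a hosted call.
print(completion_kwargs("local", "Name one movie about robots.")["model"])
print(completion_kwargs("hosted", "Name one movie about robots.")["model"])
```

With LiteLLM installed, the result would be passed along as completion(**completion_kwargs("local", prompt)). For the hosted entry, LiteLLM reads the OpenRouter key from the OPENROUTER_API_KEY environment variable.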

Docker

Docker is a tool for creating and managing containers, which are isolated environments that behave much like lightweight virtual machines. For our purposes, Docker will allow us to run other servers, like Ollama or a vector database, without needing to install them directly on the system.

These instructions can be skipped if you will only be using a Google Colab environment.

For either Linux or WSL, you will need to check that you have the prerequisites set up on the machine. These can be complicated, so we won’t discuss Docker prerequisites here. The installation instructions are at the following links.

Linux (Ubuntu): https://docs.docker.com/desktop/setup/install/linux/ubuntu/ [5]
WSL: https://docs.docker.com/desktop/setup/install/windows-install/ [6]

LibreChat

LibreChat is an open-source chat application. To really test an application like the one we will be building, we need to test it in a realistic user scenario: a chat application. MCP integrates into the Claude Desktop application, since both originate from Anthropic. However, the Claude Desktop application is built with Anthropic models in mind, so we will use LibreChat because it is not tied to a particular model family and it has MCP integration.

Even if you are using the Google Colab environment, you will want to set up LibreChat. Here are the installation instructions using Docker [7].

First, clone the LibreChat project.

    git clone https://github.com/danny-avila/LibreChat.git
    cd LibreChat

Next, create a .env file from their example.

    cp .env.example .env

Finally, launch the containers with Docker Compose.

    docker compose up -d

Now the LibreChat application is available at localhost:3080. To log in, create a new account. There are default models available. If you wish to explore the application, you can edit the .env file to add API keys for the default models.

Ollama

Ollama is a framework for running language models locally and making them available through an OpenAI-compatible API. We will be using Ollama for running models locally or on Google Colab.

Locally, we will run Ollama via Docker. Before running the container, make sure you have all the prerequisites [8]. To start the Ollama container, run the following.

    docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Acquiring the Data

Now that we have installed the libraries we’ll need, we can begin acquiring the data. Some of the data sources are traditional datasets that we will stage in the environment. Others are APIs or pre-built queries.

Wikipedia

Wikipedia is a community-edited encyclopedia on a broad range of topics. Generally, it is accurate enough for our product. On controversial or highly technical topics, it may not be reliable enough. For our use case, Wikipedia articles can be a great way of supporting a Retrieval Augmented Generation (RAG) system. Additionally, we may use it to extract attributes or relationships concerning movies or related entities.

We will primarily be pulling individual articles for our use case. In fact, we will be using an MCP agent made for searching Wikipedia. If you want to download everything, you can download a dump of all the articles at https://dumps.wikimedia.org/.

Wikidata

Wikidata is a Resource Description Framework (RDF) knowledge graph that, like Wikipedia, is supported by the Wikimedia Foundation [9]. Knowledge graphs are a great source of data about entities, from movies to genetic sequence variants.

The data is in the RDF format. This format represents properties (e.g., name, release date, etc.) and relationships between entities as semantic triples. The parts of these triples are the subject, predicate, and object. Each entity is represented by a Uniform Resource Identifier (URI). Predicates are URIs that represent relationships. Apart from URIs, there are also literals, which are used to represent values and properties. A triple is thus a statement made up of three graph entities or values, and there are, generally speaking, two kinds of triples:

Entity-predicate-entity, also known as truthy triples [10]
Entity-predicate-value, also known as properties [10]

Using Wikidata allows us to identify relationships and properties of many different kinds of entities without needing to use any information extraction. As we use Wikidata, we will go over more about how to query it using SPARQL, a query language for RDF data. You can try the Wikidata query service here.

Hugging Face

Hugging Face is a platform built for data scientists and machine learning practitioners. You can version control models and datasets, create leaderboards on datasets, and even host models there. We will explore more functionalities of Hugging Face later when we discuss deployment.

In building this application, we will often need to find task-specific datasets to measure performance for specific features. Hugging Face is a great resource for this. We can search for and download datasets posted by people in the Hugging Face community.

Data can be obtained from Hugging Face using its client libraries. You will also need to set up an account [11].

    conda install -c huggingface -c conda-forge datasets huggingface_hub[cli]

Kaggle

Finally, Kaggle is a platform for machine learning competitions. It has grown to support machine learning development more broadly by allowing the publication of datasets and models, and by offering an online notebook-based development platform [12].

Similar to Hugging Face, Kaggle is an excellent source for open datasets. Since the licensing and access requirements can be quite different between datasets, we will be downloading manually instead of via their API. You will also need to have an account set up for Kaggle.

Setting Up the Environment

As a final step, let’s make sure everything is ready. We will write a function that takes a movie title and returns the director.

Jupyter Environment

The first step is to start your Jupyter server.

    jupyter lab --no-browser

Now, navigate to localhost:8888. In the File Browser on the left-hand side, navigate to the folder you are working in. Finally, create a new notebook for this test.

[screenshot of jupyterlab]

Docker Containers

Make sure that Docker is running. If you are on Windows, make sure that Docker Desktop is running, then navigate to Containers to see what containers are running. If you are on Linux, run the following to see what containers are running.

    docker ps

Output

    CONTAINER ID   IMAGE ...
    1a73854073fc   ollama/ollama   "/bin/ollama server" ...

Ollama

If you do not see the Ollama server running in the previous step, re-run the command from earlier.

    docker run -d --gpus=all -v ollama:/root/.ollama \
      -p 11434:11434 --name ollama ollama/ollama

Once this is running, we can check that the model is accessible from the notebook. This assumes the llama3 model has already been pulled into the Ollama container. In the notebook, let’s try calling the Ollama model.

    from litellm import completion

    # Send a single user message to the llama3 model served by the local Ollama container.
    response = completion(
        model="ollama_chat/llama3",
        messages=[{
            "content": "Briefly answer the following question, please. What would be the best movie to show an alien?",
            "role": "user"
        }],
        api_base="http://localhost:11434"
    )
    print(response.choices[0]["message"].content)

Output

    What a fascinating question!

    I think the best movie to show an alien would be "E.T. the Extra-Terrestrial" (1982) directed by Steven Spielberg. This iconic film tells the story of a young boy who befriends an alien, E.T., who is stranded on Earth. The movie explores themes of friendship, empathy, and understanding, which are universal languages that can transcend cultural and intergalactic boundaries.

    The film's message of kindness, compassion, and human connection would likely resonate with an extraterrestrial audience, making it a great choice for an alien movie night!

Querying Wikidata

Now, let us build our function to find a director.

    # completion was imported in the earlier cell; repeated here so this cell runs on its own.
    from litellm import completion

    def get_director(movie):
        # First, we create the prompt.
        prompt = """
        Please write a Wikidata query to find who is the director of "{movie}". Return only the query, not explanation.
        """.format(movie=movie).strip()
        print("Prompt:")
        print(prompt)
        print()
        # We use the same call as above, with the temperature set to 0.1.
        # If you want to experiment with getting better results, consider
        # adding a function argument for temperature.
        response = completion(
            model="ollama_chat/llama3",
            messages=[{
                "content": prompt,
                "role": "user"
            }],
            temperature=0.1,
            api_base="http://localhost:11434"
        )
        # The model often wraps the query in markdown backticks, so we strip them.
        print("Query:")
        return response.choices[0]["message"].content.strip("`").strip()

Now, let us try a movie: “John Carter”.

    print(get_director("John Carter"))

Output

    Prompt:
    Please write a Wikidata query to find who is the director of "John Carter". Return only the query, not explanation.

    Query:
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?director WHERE {
      wd:Q113745 (film) .
      film wdt:director ?director .
    }

This is not a good result! This is not even valid SPARQL. For example, (film) and the bare word film are not valid terms in a triple pattern, and Wikidata identifies properties by numeric IDs (the director property is P57) rather than by names such as wdt:director. Since most models are trained on some portion of the general internet, and SPARQL is not a popular topic, the model is unlikely to have seen much of that language. Combine that with Wikidata having some idiosyncratic patterns, and the model is not likely to be able to create valid Wikidata queries.

Conclusion

Now we have our environment set up and running, we are familiar with some of our data sources, and we have tried a first experiment. In the next chapter, we will learn how to measure and compare models from different families.

Citations

1. Robert Beekes, Etymological Dictionary of Greek (Leiden: Brill, 2010). s.v. “θέα”, via https://archive.org/details/etymological-dictionary-of-greek_202306/page/537/mode/2up
2. Henry George Liddell et al., A Greek-English Lexicon (Oxford: Clarendon Press, 1996). s.v. “θεωρός”, via https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.04.0057:entry=qewro/s
3. Anaconda. “Installing Miniconda - Anaconda,” n.d. https://www.anaconda.com/docs/getting-started/miniconda/install#linux-2.
4. Model Context Protocol. “What Is the Model Context Protocol (MCP)? - Model Context Protocol,” n.d. https://modelcontextprotocol.io/docs/getting-started/intro.
5. Docker Documentation. “Ubuntu,” April 3, 2025. https://docs.docker.com/desktop/setup/install/linux/ubuntu/.
6. Docker Documentation. “Windows,” October 24, 2025. https://docs.docker.com/desktop/setup/install/windows-install/.
7. “Docker,” n.d. https://www.librechat.ai/docs/local/docker.
8. Ollama. “Ollama’s Documentation - Ollama,” n.d. https://docs.ollama.com/.
9. “Wikidata,” n.d. https://www.wikidata.org/wiki/Wikidata:Main_Page.
10. “Wikidata:SPARQL Query Service/Queries - Wikidata,” n.d. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries.
11. “Hugging Face – the AI Community Building the Future.,” n.d. https://huggingface.co/.
12. “Kaggle: Your Machine Learning and Data Science Community,” n.d. https://www.kaggle.com/.
