With SQL Server 2025 embracing AI as a core focus, I was inspired by Bob Ward’s demo and Davide Mauri’s post to try out the new vector features myself. In this two-part series, I explore how SQL now supports semantic search and hybrid RAG setups — all running locally with Ollama.
Part 1: vector basics in SQL Server
Part 2 (coming soon): hybrid RAG with local LLMs
Let's begin 🙂

First, what's all the fuss about?
SQL Server 2025 ships with some new features that enable RAG applications to be built on top of the familiar SQL Server ecosystem. It's Microsoft trying to keep SQL Server relevant when everything under the sun seems to be eating into its market share.
RAG (Retrieval-Augmented Generation) is an architecture that combines retrieval of external data with generation by a large language model (LLM), like GPT or LLaMA.
What RAG Does:
Instead of asking an LLM to "guess" or hallucinate facts, it retrieves relevant context from an external data source (like a database, PDF, or website), and then asks the LLM to generate an answer based on that real-world context. So in simple terms, in SQL Server's case, you can search based on AI-enhanced semantic reasoning. It's actually pretty cool.
RAG Architecture
- User Question → “What are the symptoms of Lyme disease?”
- Embed the Question → Convert the question into a vector.
- Vector Search → Look up the most semantically similar documents/snippets from a vector store (SQL 2025, Azure SQL, Managed Instance)
- Context Retrieval → Get top N relevant documents/snippets.
- Prompt Construction → Inject retrieved context into an LLM prompt.
- LLM Generation → “Based on these docs, answer the user question.”
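To make the steps above concrete, here is a toy end-to-end sketch in Python. Everything here is a stand-in: the `embed`, `vector_search`, and `llm_generate` helpers are hypothetical placeholders for a real embedding model, vector store, and LLM.

```python
# Minimal sketch of the RAG loop. All helper names are hypothetical stand-ins.

def embed(text: str) -> list[float]:
    # Toy embedding: character-frequency vector (a real model returns 768+ dims)
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def vector_search(query_vec, store, top_n=2):
    # Rank stored documents by cosine similarity to the query vector
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    ranked = sorted(store, key=lambda doc: cos(query_vec, embed(doc)), reverse=True)
    return ranked[:top_n]

def llm_generate(prompt: str) -> str:
    # Placeholder for a real LLM call
    return f"(LLM answer grounded in: {prompt[:60]}...)"

def rag_answer(question: str, store: list[str]) -> str:
    context = "\n".join(vector_search(embed(question), store))   # retrieve
    prompt = f"Context:\n{context}\n\nQuestion: {question}"      # construct prompt
    return llm_generate(prompt)                                  # generate
```

The important structural point is that the LLM only ever sees the retrieved context, not the whole data store.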
Why is RAG useful ?
- Reduces hallucinations (answers grounded in facts)
- Keeps LLMs up-to-date (by connecting to live data)
- Enables domain-specific Q&A (using internal documents)
- Lightweight alternative to fine-tuning
SQL 2025’s AI Features to enable RAG
Vector Data Type
Store vector data optimized for operations such as similarity search and machine learning applications. Vectors are stored in an optimized binary format but are exposed as JSON arrays for convenience. Each element of the vector is stored as a single-precision (4-byte) floating-point value.
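To see what "single-precision, 4-byte" storage means in practice, here is a quick Python illustration. This only demonstrates the storage arithmetic; it is not SQL Server's actual on-disk format.

```python
import struct
import json

# A small vector as it is exposed to us: a JSON array of numbers
vec_json = "[0.5, -1.25, 3.0]"
vec = json.loads(vec_json)

# Pack it as single-precision (4-byte) little-endian floats
packed = struct.pack(f"<{len(vec)}f", *vec)

print(len(packed))  # 3 elements * 4 bytes = 12 bytes
print(struct.unpack(f"<{len(vec)}f", packed))  # these values round-trip exactly
```

At 4 bytes per element, a 1536-dimension embedding therefore costs about 6 KB per row before any index overhead.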
Vector Functions
New scalar functions perform operations on vectors in binary format, allowing applications to store and manipulate vectors in the SQL Database Engine.
Vector Index
Create and manage approximate vector indexes to quickly and efficiently find the vectors most similar to a given reference vector.
External AI Models
Manage external AI model objects for embedding tasks (creating vector arrays) by accessing REST AI inference endpoints.
So in SQL Server's context there's nothing fancy going on in the engine: we get a new data type and the ability to call external REST APIs from within SQL (which is pretty cool).
Vector Search Demo
The intention here is to get vector search running on my own PC, using the demo from Bob Ward and utilising Ollama to generate the embeddings. Once that was working, I wrapped it in some natural language so a user could 'chat' with the embedded data.
The RAG flow looks like this:
```text
+---------------------------+
|       User Question       |  <-- e.g. "What is the best bike for mountain trails?"
+---------------------------+
             |
             v
+--------------------------------------------+
| Vector Search via SQL Proc                 |
| EXEC find_relevant_products_vector_search  |
| Inputs: @prompt, @stock, @top              |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| Vector Search Results (Products)           |
| -> ProductID, ProductName,                 |
|    ProductDescription, StockLevel          |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| Format Results as Text                     |
| -> Top-k products in readable              |
|    format (Name, Description)              |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| TinyLlama (LLM) Call                       |
| Prompt: "Answer user's question            |
| using these product results..."            |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| Final Natural Language Answer              |
| from TinyLlama                             |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| Return Answer to UI                        |
+--------------------------------------------+
```
Setup
Below are some hugely simplified steps to get the basics of embedding generation going locally, leaning heavily on Bob's work.
- Install SQL 2025 public preview.
- Install Ollama.
- Pull nomic-embed-text.
- Create a proxy API using HTTPS (SQL Server can only call external APIs over HTTPS).
- Create the external AI model.
```sql
USE master;
GO
sp_configure 'external rest endpoint enabled', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

DROP EXTERNAL MODEL MyOllamaEmbeddingModel;
GO

-- Create the EXTERNAL MODEL
CREATE EXTERNAL MODEL MyOllamaEmbeddingModel
WITH (
    LOCATION = 'https://localhost:8000/api/embed',
    API_FORMAT = 'Ollama',
    MODEL_TYPE = EMBEDDINGS,
    MODEL = 'nomic-embed-text'
);
GO
```
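With the model object in place, a quick sanity check is to generate a single embedding directly, using the same `AI_GENERATE_EMBEDDINGS` syntax as the rest of the demo. This assumes the HTTPS proxy and Ollama from the steps above are actually running:

```sql
-- Should return a JSON array of floats if the proxy and Ollama are reachable
SELECT AI_GENERATE_EMBEDDINGS(N'mountain bike' USE MODEL MyOllamaEmbeddingModel) AS Embedding;
```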
- Create a table with the vector data type.
```sql
DROP TABLE IF EXISTS Production.ProductDescriptionEmbeddings;
GO
CREATE TABLE Production.ProductDescriptionEmbeddings
(
    -- Need a single-column clustered index to support vector index requirements
    ProductDescEmbeddingID INT IDENTITY NOT NULL PRIMARY KEY CLUSTERED,
    ProductID INT NOT NULL,
    ProductDescriptionID INT NOT NULL,
    ProductModelID INT NOT NULL,
    CultureID NCHAR(6) NOT NULL,
    Embedding VECTOR(1536)
);

-- Populate rows with embeddings.
-- Need to make sure we only get Products that have ProductModels.
INSERT INTO Production.ProductDescriptionEmbeddings
SELECT p.ProductID,
       pmpdc.ProductDescriptionID,
       pmpdc.ProductModelID,
       pmpdc.CultureID,
       AI_GENERATE_EMBEDDINGS(pd.Description USE MODEL MyOllamaEmbeddingModel)
FROM Production.ProductModelProductDescriptionCulture pmpdc
JOIN Production.Product p
    ON pmpdc.ProductModelID = p.ProductModelID
JOIN Production.ProductDescription pd
    ON pd.ProductDescriptionID = pmpdc.ProductDescriptionID
ORDER BY p.ProductID;
GO

-- Create an alternate key using a nonclustered index
CREATE UNIQUE NONCLUSTERED INDEX [IX_ProductDescriptionEmbeddings_AlternateKey]
ON [Production].[ProductDescriptionEmbeddings]
(
    [ProductID] ASC,
    [ProductModelID] ASC,
    [ProductDescriptionID] ASC,
    [CultureID] ASC
);
GO
```
AdventureWorks – the dummy data
Production.ProductDescription contains a bunch of text data relating to different products, and forms the source of the embeddings.

If we take a look at it, we can see it seems to be bike related, so let's query it using traditional methods.
Are there any actual bike products?

Yes, so a very simple wildcard search can work. But what if we want to expand on that?
What's the fastest bike?

Natural language prompts don’t work well at all. Existing data types and indexes struggle with semantic lookups.
Now let's use the new index:


You can see here that the proc brings back results ranked by closest approximate similarity to the natural language expression. We are able to convert any sentence into something SQL Server can actually filter on.
What's going on under the hood here?
Embeddings are vectors that represent important features of data. Embeddings are often learned by using a deep learning model, and machine learning and AI models utilize them as features. Embeddings can also capture semantic similarity between similar concepts. For example, in generating an embedding for the words person and human, we would expect their embeddings (vector representation) to be similar in value since the words are also semantically similar.
In our example I used nomic-embed-text to generate embeddings based on the product descriptions.
Vector search refers to the process of finding all vectors in a dataset that are similar to a specific query vector. Therefore, a query vector for the word human searches the entire dataset for similar vectors, and thus similar words: in this example it should find the word person as a close match. This closeness, or distance, is measured using a distance metric such as cosine distance. The closer vectors are, the more similar they are.
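As a concrete illustration, here is the cosine distance calculation in plain Python, using made-up 3-dimensional embeddings (real embeddings have hundreds of dimensions):

```python
import math

def cosine_distance(a, b):
    # distance = 1 - cosine similarity; 0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Made-up embeddings for illustration only
person  = [0.9, 0.1, 0.3]
human   = [0.85, 0.15, 0.25]
bicycle = [0.1, 0.9, 0.2]

print(cosine_distance(person, human))    # small: semantically close
print(cosine_distance(person, bicycle))  # larger: unrelated concepts
```

The absolute numbers are meaningless on toy data; what matters is the ordering, which is exactly what vector search ranks on.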
SQL Server provides built-in support for vectors via the vector data type. Vectors are stored in an optimized binary format but exposed as JSON arrays for convenience. Each element of the vector is stored as a single-precision (4-byte) floating-point value.
Exact Nearest Neighbor (ENN)
Exact Nearest Neighbour search compares a query vector to every vector in the database to find the most similar ones. It uses a brute-force method that guarantees perfect accuracy by computing exact distances (e.g., cosine similarity or Euclidean distance). This approach is reliable for small datasets or when absolute precision is required, but it becomes inefficient and slow as the dataset grows, making it unsuitable for real-time or large-scale use cases.
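Brute-force ENN is trivial to write, which is part of its appeal. Here is a minimal Python sketch over a toy in-memory dataset:

```python
def exact_nearest_neighbors(query, dataset, top_n=3):
    """Brute-force ENN: compute the distance to every vector, then sort."""
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: sum(x * x for x in v) ** 0.5
        return 1 - dot / (norm(a) * norm(b))
    # O(n * d): every stored vector is touched for every query
    scored = [(cosine_distance(query, vec), key) for key, vec in dataset.items()]
    return [key for _, key in sorted(scored)[:top_n]]

# Toy 3-dimensional "embeddings"
dataset = {
    "road bike":     [0.9, 0.2, 0.1],
    "mountain bike": [0.8, 0.3, 0.2],
    "coffee mug":    [0.1, 0.1, 0.9],
}
print(exact_nearest_neighbors([0.85, 0.25, 0.15], dataset, top_n=2))  # the two bikes rank ahead of the mug
```

The full scan is what guarantees exactness, and also what makes it scale linearly with table size.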
Approximate Nearest Neighbour (ANN)
Approximate Nearest Neighbor search is designed to return very similar — but not guaranteed exact — results by using efficient indexing algorithms like HNSW, IVF, or product quantization. It significantly speeds up similarity searches by reducing the number of comparisons, making it ideal for large datasets and real-time applications such as semantic search or recommendation engines. While slightly less accurate, ANN offers a powerful trade-off between performance and precision.
| Feature | ENN (Exact Nearest Neighbor) | ANN (Approximate Nearest Neighbor) |
|---|---|---|
| Accuracy | Perfect | Close, not guaranteed |
| Speed | Slower | Fast |
| Scalability | Limited (small datasets) | Excellent (large datasets) |
| Use case | High-precision queries | Real-time, large-scale search |
| Method | Full scan | Indexed/partitioned search |
| Tools/Examples | Plain SQL, pgvector (no index) | FAISS, Milvus, pgvector with ivfflat |
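To make the ANN trade-off tangible, here is a toy IVF-style sketch in Python: vectors are bucketed under their nearest "centroid", and a query scans only the closest bucket. That is what makes it fast, and also why a true nearest neighbour sitting in another bucket can be missed. (This illustrates the idea only; it is not how diskann works internally.)

```python
import math

def dist(a, b):
    # Euclidean distance, for simplicity
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf_index(vectors, centroids):
    """Assign each vector to its nearest centroid's bucket (the 'inverted lists')."""
    buckets = {i: [] for i in range(len(centroids))}
    for key, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        buckets[nearest].append((key, vec))
    return buckets

def ann_search(query, buckets, centroids, top_n=1):
    """Scan only the bucket whose centroid is closest to the query."""
    nearest = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = buckets[nearest]  # the only vectors we compare against
    return sorted(candidates, key=lambda kv: dist(query, kv[1]))[:top_n]

vectors = {"a": [0.1, 0.1], "b": [0.2, 0.1], "c": [0.9, 0.8], "d": [0.8, 0.9]}
centroids = [[0.15, 0.1], [0.85, 0.85]]  # pretend these came from k-means
buckets = build_ivf_index(vectors, centroids)
print(ann_search([0.25, 0.15], buckets, centroids))  # finds "b" by scanning one bucket
```

With two buckets, each query compares against roughly half the data; with hundreds of buckets the saving is proportionally larger, which is the whole point of ANN.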
Both the vector index we created and the proc use approximate nearest neighbour search; the index type is diskann.
```sql
CREATE VECTOR INDEX product_vector_index
ON Production.ProductDescriptionEmbeddings (Embedding)
WITH (METRIC = 'cosine', TYPE = 'diskann', MAXDOP = 8);
GO

CREATE OR ALTER PROCEDURE [find_relevant_products_vector_search]
    @prompt NVARCHAR(MAX),                -- Natural language prompt
    @stock SMALLINT = 500,                -- Only show products with stock level >= 500. User can override
    @top INT = 10,                        -- Only show top 10. User can override
    @min_similarity DECIMAL(19,16) = 0.3  -- Similarity threshold; recommended to leave at the default
AS
IF (@prompt IS NULL) RETURN;

DECLARE @vector VECTOR(1536);

SELECT @vector = AI_GENERATE_EMBEDDINGS(@prompt USE MODEL MyOllamaEmbeddingModel);
IF (@vector IS NULL) RETURN;

SELECT p.Name AS ProductName,
       pd.Description AS ProductDescription,
       p.SafetyStockLevel AS StockLevel
FROM VECTOR_SEARCH(
        TABLE = Production.ProductDescriptionEmbeddings AS t,
        COLUMN = Embedding,
        SIMILAR_TO = @vector,
        METRIC = 'cosine',
        TOP_N = @top
     ) AS s
JOIN Production.ProductDescriptionEmbeddings pe
    ON t.ProductDescEmbeddingID = pe.ProductDescEmbeddingID
JOIN Production.Product p
    ON pe.ProductID = p.ProductID
JOIN Production.ProductDescription pd
    ON pd.ProductDescriptionID = pe.ProductDescriptionID
WHERE (1 - s.distance) > @min_similarity
  AND p.SafetyStockLevel >= @stock
ORDER BY s.distance;
GO
```
Here's a concise version of the Python behind the scenes, which creates an API that a Blazor web app can use. ChatGPT is amazing 😉
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pyodbc
import requests
import logging
import time

app = FastAPI()

# Configure logging with timestamps and levels
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler()]
)

class QuestionRequest(BaseModel):
    question: str

def get_relevant_context(question: str) -> str:
    try:
        logging.info("Connecting to SQL Server...")
        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=GMXX-DBASQN15\\DBA_2025;"
            "DATABASE=AdventureWorks2022;"
            "Trusted_Connection=yes;"
        )
        cursor = conn.cursor()
        logging.debug(f"Running stored procedure with question: {question}")
        cursor.execute("EXEC find_relevant_products_vector_search ?, ?, ?", question, 100, 1)
        rows = cursor.fetchall()
        conn.close()
        context = "\n".join(str(row[0]) for row in rows)
        logging.debug(f"Retrieved context from SQL:\n{context[:500]}")  # Show up to 500 chars
        return context
    except Exception as e:
        logging.error(f"Error retrieving context: {e}")
        raise

def call_tinyllama(context: str, question: str) -> str:
    # Wrap context in a clear section
    prompt = (
        "You are an expert assistant. Use ONLY the following information to answer the question.\n\n"
        "==== BEGIN DATA ====\n"
        f"{context}\n"
        "==== END DATA ====\n\n"
        f"Question: {question}\n"
        "Answer carefully based ONLY on the data above:"
    )
    payload = {
        "model": "tinyllama",
        "prompt": prompt,
        "stream": False
    }
    try:
        logging.debug(f"Calling TinyLlama with prompt (truncated): {prompt[:500]}")
        start_time = time.time()
        response = requests.post("http://localhost:11434/api/generate", json=payload)
        response.raise_for_status()
        duration = time.time() - start_time
        answer = response.json()["response"]
        logging.debug(f"TinyLlama response (truncated): {answer[:500]}")
        logging.info(f"TinyLlama call took {duration:.2f} seconds")
        return answer.strip()
    except Exception as e:
        logging.error(f"Error calling TinyLlama: {e}")
        raise

@app.post("/ask")
def ask(question_req: QuestionRequest):
    logging.info(f"Received question: {question_req.question}")
    try:
        context = get_relevant_context(question_req.question)
        answer = call_tinyllama(context, question_req.question)
        return {"answer": answer}
    except Exception as e:
        logging.error(f"Failed to generate answer: {e}")
        raise HTTPException(status_code=500, detail="Internal server error")
```
And the embedding api used by SQL:
```python
from fastapi import FastAPI, Request, HTTPException
import logging
import requests
import time

app = FastAPI()
logging.basicConfig(level=logging.DEBUG)

@app.post("/api/embed")
async def embed(request: Request):
    body = await request.json()
    logging.debug(f"Received request body: {body}")

    model = body.get("model")
    input_text = body.get("input")

    if not model or not input_text:
        raise HTTPException(status_code=400, detail="Missing model or input")

    try:
        start_time = time.time()
        # Forward the request to the local Ollama embedding endpoint
        response = requests.post("http://localhost:11434/api/embed", json={
            "model": model,
            "input": input_text
        })
        duration = time.time() - start_time
        response.raise_for_status()
        data = response.json()

        embeddings_list = data.get("embedding") or data.get("embeddings")
        if not embeddings_list or not isinstance(embeddings_list, list):
            raise ValueError("Invalid embeddings format returned")

        # Ensure it's nested as [[...]]
        embedding = [embeddings_list] if isinstance(embeddings_list[0], float) else embeddings_list
        logging.debug(f"Received embedding: {embedding}")
    except Exception as e:
        logging.error(f"Error calling embedding API: {e}")
        raise HTTPException(status_code=500, detail="Embedding service error")

    return {
        "model": model,
        "embeddings": embedding,
        "total_duration": int(duration * 1_000_000),  # microseconds
        "load_duration": 1019500,  # Could be read dynamically if Ollama reports it
        "prompt_eval_count": len(input_text.split())  # Approximate token count
    }
```
With just a few lines of SQL and FastAPI glue, SQL Server 2025 can now power smart, semantically aware applications with no fine-tuning required. That's all well and good, but what I found most interesting was the ability to use open-source models through Ollama. If anything, this was an excuse to start playing around with local LLMs, which are where the real brains of all of this live and, in truth, the most interesting part. In Part 2, I'll take this a step further by combining traditional, precise SQL queries with semantic ones for hybrid precision. Stay tuned!