With SQL Server 2025 embracing AI as a core focus, I was inspired by Bob Ward’s demo and Davide Mauri’s post to try out the new vector features myself. In this two-part series, I explore how SQL now supports semantic search and hybrid RAG setups — all running locally with Ollama.
Part 1: vector basics in SQL Server
Part 2 (coming soon): hybrid RAG with local LLMs
Let's begin 🙂

First, what's all the fuss about?
SQL Server 2025 ships with some new features that enable RAG applications to be built on top of the familiar SQL Server ecosystem. It's Microsoft trying to keep SQL Server relevant when everything under the sun seems to be eating into its market share.
RAG (Retrieval-Augmented Generation) is an architecture that combines retrieval of external data with generation by a large language model (LLM), like GPT or LLaMA.
What RAG Does:
Instead of asking an LLM to "guess" or hallucinate facts, it retrieves relevant context from an external data source (like a database, PDF, or website), and then asks the LLM to generate an answer based on that real-world context. So in simple terms, in SQL Server's case, you can search based on AI-enhanced semantic reasoning. It's actually pretty cool.
RAG Architecture
- User Question → “What are the symptoms of Lyme disease?”
- Embed the Question → Convert the question into a vector.
- Vector Search → Look up the most semantically similar documents/snippets from a vector store (SQL 2025, Azure SQL, Managed Instance)
- Context Retrieval → Get top N relevant documents/snippets.
- Prompt Construction → Inject retrieved context into an LLM prompt.
- LLM Generation → “Based on these docs, answer the user question.”
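To make the steps above concrete, here is a toy end-to-end sketch in Python. Everything here is a stand-in: the `embed`, `vector_search`, and `llm_generate` helpers are hypothetical placeholders for a real embedding model, vector store, and LLM.

```python
# Minimal sketch of the RAG loop. All helper names are hypothetical stand-ins.

def embed(text: str) -> list[float]:
    # Toy embedding: character-frequency vector (a real model returns 768+ dims)
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def vector_search(query_vec, store, top_n=2):
    # Rank stored documents by cosine similarity to the query vector
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0
    ranked = sorted(store, key=lambda doc: cos(query_vec, embed(doc)), reverse=True)
    return ranked[:top_n]

def llm_generate(prompt: str) -> str:
    # Placeholder for a real LLM call
    return f"(LLM answer grounded in: {prompt[:60]}...)"

def rag_answer(question: str, store: list[str]) -> str:
    context = "\n".join(vector_search(embed(question), store))   # retrieve
    prompt = f"Context:\n{context}\n\nQuestion: {question}"      # construct prompt
    return llm_generate(prompt)                                  # generate
```

The important structural point is that the LLM only ever sees the retrieved context, not the whole data store.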
Why is RAG useful ?
- Reduces hallucinations (answers grounded in facts)
- Keeps LLMs up-to-date (by connecting to live data)
- Enables domain-specific Q&A (using internal documents)
- Lightweight alternative to fine-tuning
SQL 2025’s AI Features to enable RAG
Vector Data Type
Store vector data optimized for operations such as similarity search and machine learning applications. Vectors are stored in an optimized binary format but are exposed as JSON arrays for convenience. Each element of the vector is stored as a single-precision (4-byte) floating-point value.
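To see what "single-precision, 4-byte" storage means in practice, here is a quick Python illustration. This only demonstrates the storage arithmetic; it is not SQL Server's actual on-disk format.

```python
import struct
import json

# A small vector as it is exposed to us: a JSON array of numbers
vec_json = "[0.5, -1.25, 3.0]"
vec = json.loads(vec_json)

# Pack it as single-precision (4-byte) little-endian floats
packed = struct.pack(f"<{len(vec)}f", *vec)

print(len(packed))  # 3 elements * 4 bytes = 12 bytes
print(struct.unpack(f"<{len(vec)}f", packed))  # these values round-trip exactly
```

At 4 bytes per element, a 1536-dimension embedding therefore costs about 6 KB per row before any index overhead.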
Vector Functions
New scalar functions perform operations on vectors in binary format, allowing applications to store and manipulate vectors in the SQL Database Engine.
Vector Index
Create and manage approximate vector indexes to quickly and efficiently find the vectors most similar to a given reference vector.
External AI Models
Manage external AI model objects for embedding tasks (creating vector arrays) by accessing REST AI inference endpoints.
So in SQL Server's context there's nothing fancy going on in the engine: we get a new data type and the ability to call external REST APIs from within SQL (which is pretty cool).
Vector Search Demo
The intention here is to get vector search running on my own PC, using the demo from Bob Ward and utilising Ollama to generate the embeddings. Once that was working, I wrapped it in some natural language so a user could 'chat' with the embedded data.
The RAG flow looks like this:
```text
+---------------------------+
|       User Question       |  <-- e.g. "What is the best bike for mountain trails?"
+---------------------------+
             |
             v
+--------------------------------------------+
| Vector Search via SQL Proc                 |
| EXEC find_relevant_products_vector_search  |
| Inputs: @prompt, @stock, @top              |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| Vector Search Results (Products)           |
| -> ProductID, ProductName,                 |
|    ProductDescription, StockLevel          |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| Format Results as Text                     |
| -> Top-k products in readable              |
|    format (Name, Description)              |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| TinyLlama (LLM) Call                       |
| Prompt: "Answer user's question            |
| using these product results..."            |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| Final Natural Language Answer              |
| from TinyLlama                             |
+--------------------------------------------+
             |
             v
+--------------------------------------------+
| Return Answer to UI                        |
+--------------------------------------------+
```
Setup
Below are some hugely simplified steps to get the basics of embedding generation going locally, leaning heavily on Bob's work.
- Install SQL 2025 public preview.
- Install Ollama.
- Pull nomic-embed-text.
- Create a proxy API using HTTPS (SQL Server can only call external APIs over HTTPS).
- Create the external AI model.
```sql
USE master;
GO
sp_configure 'external rest endpoint enabled', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

DROP EXTERNAL MODEL MyOllamaEmbeddingModel;
GO

-- Create the EXTERNAL MODEL
CREATE EXTERNAL MODEL MyOllamaEmbeddingModel
WITH (
    LOCATION = 'https://localhost:8000/api/embed',
    API_FORMAT = 'Ollama',
    MODEL_TYPE = EMBEDDINGS,
    MODEL = 'nomic-embed-text'
);
GO
```
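With the model object in place, a quick sanity check is to generate a single embedding directly, using the same `AI_GENERATE_EMBEDDINGS` syntax as the rest of the demo. This assumes the HTTPS proxy and Ollama from the steps above are actually running:

```sql
-- Should return a JSON array of floats if the proxy and Ollama are reachable
SELECT AI_GENERATE_EMBEDDINGS(N'mountain bike' USE MODEL MyOllamaEmbeddingModel) AS Embedding;
```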
- Create a table with the vector data type.
```sql
DROP TABLE IF EXISTS Production.ProductDescriptionEmbeddings;
GO
CREATE TABLE Production.ProductDescriptionEmbeddings
(
    -- Need a single-column clustered index to support vector index requirements
    ProductDescEmbeddingID INT IDENTITY NOT NULL PRIMARY KEY CLUSTERED,
    ProductID INT NOT NULL,
    ProductDescriptionID INT NOT NULL,
    ProductModelID INT NOT NULL,
    CultureID NCHAR(6) NOT NULL,
    Embedding VECTOR(1536)
);

-- Populate rows with embeddings.
-- Need to make sure we only get Products that have ProductModels.
INSERT INTO Production.ProductDescriptionEmbeddings
SELECT p.ProductID,
       pmpdc.ProductDescriptionID,
       pmpdc.ProductModelID,
       pmpdc.CultureID,
       AI_GENERATE_EMBEDDINGS(pd.Description USE MODEL MyOllamaEmbeddingModel)
FROM Production.ProductModelProductDescriptionCulture pmpdc
JOIN Production.Product p
    ON pmpdc.ProductModelID = p.ProductModelID
JOIN Production.ProductDescription pd
    ON pd.ProductDescriptionID = pmpdc.ProductDescriptionID
ORDER BY p.ProductID;
GO

-- Create an alternate key using a nonclustered index
CREATE UNIQUE NONCLUSTERED INDEX [IX_ProductDescriptionEmbeddings_AlternateKey]
ON [Production].[ProductDescriptionEmbeddings]
(
    [ProductID] ASC,
    [ProductModelID] ASC,
    [ProductDescriptionID] ASC,
    [CultureID] ASC
);
GO
```
AdventureWorks – the dummy data
Production.ProductDescription contains a bunch of text data relating to different products, and forms the source of the embeddings.

If we take a look at it, we can see it seems to be bike related, so let's query it using traditional methods.
Are there any actual bike products?

Yes, so a very simple wildcard search can work. But what if we want to expand on that?
What's the fastest bike?

Natural language prompts don’t work well at all. Existing data types and indexes struggle with semantic lookups.
Now let's use the new index:


You can see here that the proc brings back results ranked by closest approximate similarity to the natural language expression. We are able to convert any sentence into something SQL Server can actually filter on.
What's going on under the hood here?
Embeddings are vectors that represent important features of data. Embeddings are often learned by using a deep learning model, and machine learning and AI models utilize them as features. Embeddings can also capture semantic similarity between similar concepts. For example, in generating an embedding for the words person and human, we would expect their embeddings (vector representation) to be similar in value since the words are also semantically similar.
In our example I used nomic-embed-text to generate embeddings based on the product descriptions.
Vector search refers to the process of finding all vectors in a dataset that are similar to a specific query vector. Therefore, a query vector for the word human searches the entire dataset for similar vectors, and thus similar words: in this example it should find the word person as a close match. This closeness, or distance, is measured using a distance metric such as cosine distance. The closer vectors are, the more similar they are.
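As a concrete illustration, here is the cosine distance calculation in plain Python, using made-up 3-dimensional embeddings (real embeddings have hundreds of dimensions):

```python
import math

def cosine_distance(a, b):
    # distance = 1 - cosine similarity; 0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Made-up embeddings for illustration only
person  = [0.9, 0.1, 0.3]
human   = [0.85, 0.15, 0.25]
bicycle = [0.1, 0.9, 0.2]

print(cosine_distance(person, human))    # small: semantically close
print(cosine_distance(person, bicycle))  # larger: unrelated concepts
```

The absolute numbers are meaningless on toy data; what matters is the ordering, which is exactly what vector search ranks on.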
SQL Server provides built-in support for vectors via the vector data type. Vectors are stored in an optimized binary format but exposed as JSON arrays for convenience. Each element of the vector is stored as a single-precision (4-byte) floating-point value.
Exact Nearest Neighbor (ENN)
Exact Nearest Neighbour search compares a query vector to every vector in the database to find the most similar ones. It uses a brute-force method that guarantees perfect accuracy by computing exact distances (e.g., cosine similarity or Euclidean distance). This approach is reliable for small datasets or when absolute precision is required, but it becomes inefficient and slow as the dataset grows, making it unsuitable for real-time or large-scale use cases.
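Brute-force ENN is trivial to write, which is part of its appeal. Here is a minimal Python sketch over a toy in-memory dataset:

```python
def exact_nearest_neighbors(query, dataset, top_n=3):
    """Brute-force ENN: compute the distance to every vector, then sort."""
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: sum(x * x for x in v) ** 0.5
        return 1 - dot / (norm(a) * norm(b))
    # O(n * d): every stored vector is touched for every query
    scored = [(cosine_distance(query, vec), key) for key, vec in dataset.items()]
    return [key for _, key in sorted(scored)[:top_n]]

# Toy 3-dimensional "embeddings"
dataset = {
    "road bike":     [0.9, 0.2, 0.1],
    "mountain bike": [0.8, 0.3, 0.2],
    "coffee mug":    [0.1, 0.1, 0.9],
}
print(exact_nearest_neighbors([0.85, 0.25, 0.15], dataset, top_n=2))  # the two bikes rank ahead of the mug
```

The full scan is what guarantees exactness, and also what makes it scale linearly with table size.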
Approximate Nearest Neighbour (ANN)
Approximate Nearest Neighbor search is designed to return very similar — but not guaranteed exact — results by using efficient indexing algorithms like HNSW, IVF, or product quantization. It significantly speeds up similarity searches by reducing the number of comparisons, making it ideal for large datasets and real-time applications such as semantic search or recommendation engines. While slightly less accurate, ANN offers a powerful trade-off between performance and precision.
| Feature | ENN (Exact Nearest Neighbor) | ANN (Approximate Nearest Neighbor) |
|---|---|---|
| Accuracy | Perfect | Close, not guaranteed |
| Speed | Slower | Fast |
| Scalability | Limited (small datasets) | Excellent (large datasets) |
| Use case | High-precision queries | Real-time, large-scale search |
| Method | Full scan | Indexed/partitioned search |
| Tools/Examples | Plain SQL, pgvector (no index) | FAISS, Milvus, pgvector with ivfflat |
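To make the ANN trade-off tangible, here is a toy IVF-style sketch in Python: vectors are bucketed under their nearest "centroid", and a query scans only the closest bucket. That is what makes it fast, and also why a true nearest neighbour sitting in another bucket can be missed. (This illustrates the idea only; it is not how diskann works internally.)

```python
import math

def dist(a, b):
    # Euclidean distance, for simplicity
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf_index(vectors, centroids):
    """Assign each vector to its nearest centroid's bucket (the 'inverted lists')."""
    buckets = {i: [] for i in range(len(centroids))}
    for key, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        buckets[nearest].append((key, vec))
    return buckets

def ann_search(query, buckets, centroids, top_n=1):
    """Scan only the bucket whose centroid is closest to the query."""
    nearest = min(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = buckets[nearest]  # the only vectors we compare against
    return sorted(candidates, key=lambda kv: dist(query, kv[1]))[:top_n]

vectors = {"a": [0.1, 0.1], "b": [0.2, 0.1], "c": [0.9, 0.8], "d": [0.8, 0.9]}
centroids = [[0.15, 0.1], [0.85, 0.85]]  # pretend these came from k-means
buckets = build_ivf_index(vectors, centroids)
print(ann_search([0.25, 0.15], buckets, centroids))  # finds "b" by scanning one bucket
```

With two buckets, each query compares against roughly half the data; with hundreds of buckets the saving is proportionally larger, which is the whole point of ANN.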
Both the vector index we created and the proc use approximate nearest neighbour search; the index type is diskann.
```sql
CREATE VECTOR INDEX product_vector_index
ON Production.ProductDescriptionEmbeddings (Embedding)
WITH (METRIC = 'cosine', TYPE = 'diskann', MAXDOP = 8);
GO

CREATE OR ALTER PROCEDURE [find_relevant_products_vector_search]
    @prompt NVARCHAR(MAX),                -- Natural language prompt
    @stock SMALLINT = 500,                -- Only show products with stock level >= 500. User can override
    @top INT = 10,                        -- Only show top 10. User can override
    @min_similarity DECIMAL(19,16) = 0.3  -- Similarity threshold; recommended to leave at the default
AS
IF (@prompt IS NULL) RETURN;

DECLARE @vector VECTOR(1536);

SELECT @vector = AI_GENERATE_EMBEDDINGS(@prompt USE MODEL MyOllamaEmbeddingModel);
IF (@vector IS NULL) RETURN;

SELECT p.Name AS ProductName,
       pd.Description AS ProductDescription,
       p.SafetyStockLevel AS StockLevel
FROM VECTOR_SEARCH(
        TABLE = Production.ProductDescriptionEmbeddings AS t,
        COLUMN = Embedding,
        SIMILAR_TO = @vector,
        METRIC = 'cosine',
        TOP_N = @top
     ) AS s
JOIN Production.ProductDescriptionEmbeddings pe
    ON t.ProductDescEmbeddingID = pe.ProductDescEmbeddingID
JOIN Production.Product p
    ON pe.ProductID = p.ProductID
JOIN Production.ProductDescription pd
    ON pd.ProductDescriptionID = pe.ProductDescriptionID
WHERE (1 - s.distance) > @min_similarity
  AND p.SafetyStockLevel >= @stock
ORDER BY s.distance;
GO
```
Here's a concise version of the Python behind the scenes, which creates an API that a Blazor web app can use. ChatGPT is amazing 😉
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pyodbc
import requests
import logging
import time

app = FastAPI()

# Configure logging with timestamps and levels
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler()]
)

class QuestionRequest(BaseModel):
    question: str

def get_relevant_context(question: str) -> str:
    try:
        logging.info("Connecting to SQL Server...")
        conn = pyodbc.connect(
            "DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=GMXX-DBASQN15\\DBA_2025;"
            "DATABASE=AdventureWorks2022;"
            "Trusted_Connection=yes;"
        )
        cursor = conn.cursor()
        logging.debug(f"Running stored procedure with question: {question}")
        cursor.execute("EXEC find_relevant_products_vector_search ?, ?, ?", question, 100, 1)
        rows = cursor.fetchall()
        conn.close()
        context = "\n".join(str(row[0]) for row in rows)
        logging.debug(f"Retrieved context from SQL:\n{context[:500]}")  # Show up to 500 chars
        return context
    except Exception as e:
        logging.error(f"Error retrieving context: {e}")
        raise

def call_tinyllama(context: str, question: str) -> str:
    # Wrap context in a clear section
    prompt = (
        "You are an expert assistant. Use ONLY the following information to answer the question.\n\n"
        "==== BEGIN DATA ====\n"
        f"{context}\n"
        "==== END DATA ====\n\n"
        f"Question: {question}\n"
        "Answer carefully based ONLY on the data above:"
    )
    payload = {
        "model": "tinyllama",
        "prompt": prompt,
        "stream": False
    }
    try:
        logging.debug(f"Calling TinyLlama with prompt (truncated): {prompt[:500]}")
        start_time = time.time()
        response = requests.post("http://localhost:11434/api/generate", json=payload)
        response.raise_for_status()
        duration = time.time() - start_time
        answer = response.json()["response"]
        logging.debug(f"TinyLlama response (truncated): {answer[:500]}")
        logging.info(f"TinyLlama call took {duration:.2f} seconds")
        return answer.strip()
    except Exception as e:
        logging.error(f"Error calling TinyLlama: {e}")
        raise

@app.post("/ask")
def ask(question_req: QuestionRequest):
    logging.info(f"Received question: {question_req.question}")
    try:
        context = get_relevant_context(question_req.question)
        answer = call_tinyllama(context, question_req.question)
        return {"answer": answer}
    except Exception as e:
        logging.error(f"Failed to generate answer: {e}")
        raise HTTPException(status_code=500, detail="Internal server error")
```
And the embedding api used by SQL:
```python
from fastapi import FastAPI, Request, HTTPException
import logging
import requests
import time

app = FastAPI()
logging.basicConfig(level=logging.DEBUG)

@app.post("/api/embed")
async def embed(request: Request):
    body = await request.json()
    logging.debug(f"Received request body: {body}")

    model = body.get("model")
    input_text = body.get("input")

    if not model or not input_text:
        raise HTTPException(status_code=400, detail="Missing model or input")

    try:
        start_time = time.time()
        # Forward the request to the local Ollama embedding endpoint
        response = requests.post("http://localhost:11434/api/embed", json={
            "model": model,
            "input": input_text
        })
        duration = time.time() - start_time
        response.raise_for_status()
        data = response.json()

        embeddings_list = data.get("embedding") or data.get("embeddings")
        if not embeddings_list or not isinstance(embeddings_list, list):
            raise ValueError("Invalid embeddings format returned")

        # Ensure it's nested as [[...]]
        embedding = [embeddings_list] if isinstance(embeddings_list[0], float) else embeddings_list
        logging.debug(f"Received embedding: {embedding}")
    except Exception as e:
        logging.error(f"Error calling embedding API: {e}")
        raise HTTPException(status_code=500, detail="Embedding service error")

    return {
        "model": model,
        "embeddings": embedding,
        "total_duration": int(duration * 1_000_000),  # microseconds
        "load_duration": 1019500,  # Could be read dynamically if Ollama reports it
        "prompt_eval_count": len(input_text.split())  # Approximate token count
    }
```
With just a few lines of SQL and FastAPI glue, SQL Server 2025 can now power smart, semantically aware applications with no fine-tuning required. That's all well and good, but what I found most interesting was the ability to use open-source models through Ollama. If anything, this was an excuse to start playing around with local LLMs, which are where the real brains of all of this live and, in truth, the most interesting part. In Part 2, I'll take this a step further by combining traditional, precise SQL queries with semantic ones for hybrid precision. Stay tuned!