Neo4j Vector Indexing: A Technical Deep Dive with Concrete Examples

Jan 26, 2025 · Jiyuan (Jay) Liu · 7 min read

Introduction

Neo4j’s vector indexing system bridges graph databases and semantic search through the HNSW (Hierarchical Navigable Small World) algorithm. This post walks through the technical execution of vector indexing operations using a concrete example: 10 movies embedded into 7 dimensions.

Instead of abstract theory, we’ll use real numbers and track exactly what happens in memory as your data flows through the system.


The Dataset: Our 7-Dimensional Vector Space

Imagine each movie in your database has a vector representing seven semantic traits:

Dimensions: [Action, Sci-Fi, Romance, Comedy, Horror, Drama, Mystery]

Sample Movies:

  • Node A (Star Wars): [0.9, 0.8, 0.0, 0.1, 0.0, 0.3, 0.2]
  • Node B (The Notebook): [0.1, 0.0, 0.9, 0.2, 0.0, 0.8, 0.1]
  • Nodes C–J: (8 additional movies with their own 7D vectors)

These coordinates represent how “Action-heavy,” “Sci-Fi focused,” or “Romantic” each movie is. Star Wars sits in the top-right “Sci-Fi/Action” corner of our 7D space, while The Notebook occupies the “Romance/Drama” region.
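That separation is measurable. A minimal sketch in plain Python (no Neo4j required) computing the cosine similarity between the two sample vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

star_wars    = [0.9, 0.8, 0.0, 0.1, 0.0, 0.3, 0.2]
the_notebook = [0.1, 0.0, 0.9, 0.2, 0.0, 0.8, 0.1]

# A low score confirms the two movies sit in distant regions of the space.
print(round(cosine_similarity(star_wars, the_notebook), 2))  # ≈ 0.24
```

Identical vectors score 1.0; these two land near 0.24, which is exactly the "distant clusters" intuition the rest of this post relies on.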


Step 1: CREATE VECTOR INDEX – The Initialization

Technical Action

Your database claims a metadata segment in off-heap memory (OS Memory) and pre-registers a specialized Apache Lucene HNSW structure.

What Happens Numerically

You tell Neo4j:

“I am going to provide 7-dimensional LIST objects. Build the HNSW hierarchy to support semantic search across this space.”

The HNSW Scaffold

Neo4j creates an empty, skip-list-like layered structure with the configured parameters:

  • Dimensions: 7
  • Similarity Metric: Cosine similarity
  • HNSW Parameters: m=16, ef_construction=200

Think of this as defining the grid of a map without yet drawing the roads or placing any cities.

Memory Footprint

The index claims only a small metadata allocation (~1–5 MB) in off-heap memory. This is not pre-allocating space for all 10 movies—that grows dynamically as you add data.

Result

A new entry appears in SHOW VECTOR INDEXES:

Index: movie_tagline_embeddings | State: POPULATING | PopulationPercent: 0%

(The state flips to ONLINE once population reaches 100%.)
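For reference, the statement that registers such an index looks roughly like this (a sketch in Neo4j 5 Cypher; the Movie label and taglineEmbedding property name are assumed from this post's example, and exact OPTIONS support varies by Neo4j version):

```cypher
CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
FOR (m:Movie) ON (m.taglineEmbedding)
OPTIONS { indexConfig: {
  `vector.dimensions`: 7,
  `vector.similarity_function`: 'cosine'
}}
```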

Step 2: genai.vector.encode – The Translation Layer

Technical Action

Your data flows through an external REST API that extracts semantic meaning and converts text into numerical coordinates.

The Transformation in Action

# Input (Human Language)
movie_tagline = "In a galaxy far, far away..."

# OpenAI Embedding API processes this
# Output (7-dimensional Vector)
vector = [0.92, 0.81, 0.02, 0.11, 0.01, 0.28, 0.19]

This vector is now a Python LIST object in RAM, ready to be inserted into the database.

Resource Cost

  • Network I/O: HTTP payload to OpenAI
  • API Tokens: Consumed per vector (tracked against your OpenAI quota)
  • Neo4j Resources: Minimal—the list is held temporarily in transaction state RAM

Complexity: O(1) per API call, with latency dependent on OpenAI’s response time.
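In Cypher, the call looks roughly like this (a sketch: genai.vector.encode takes the text, a provider name, and a configuration map; note that real OpenAI embedding models return 1536+ dimensions, not the 7 used for illustration in this post):

```cypher
MATCH (m:Movie {name: "Star Wars"})
RETURN genai.vector.encode(m.tagline, 'OpenAI', { token: $apiKey }) AS vector
```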


Step 3: db.create.setNodeVectorProperty – The Registration & Indexing

Technical Action

This is where the “heavy lifting” occurs. Your vector is:

  1. Written to disk (ACID-compliant node property)
  2. Handed to the HNSW Worker for intelligent neighborhood linking

Dual Storage Mechanism

Storage Layer 1: The Node Store

The 7D vector is saved as a persistent property on the movie node:

Node A Property Block:
{
  name: "Star Wars",
  tagline: "In a galaxy far, far away...",
  taglineEmbedding: [0.92, 0.81, 0.02, 0.11, 0.01, 0.28, 0.19]
}

Storage Layer 2: The Lucene Index (HNSW)

The HNSW Worker receives the vector and:

  1. Placement: Positions the new vector [0.92, 0.81…] in the 7D coordinate space
  2. Neighbor Search: Searches the existing movie vectors (up to 9 others) to find the mathematically closest neighbors
    • Calculates cosine similarity with each existing movie
    • Identifies its nearest neighbors (at most m=16 per layer; with only 9 other movies in the index, each node links to just its closest few)
  3. Edge Creation: Creates “friendship links” between this movie and its closest neighbors
  4. Hierarchy Updates: The HNSW algorithm performs a “coin flip” (probabilistic) to determine if this node should be promoted to the “Highway Layer” (Upper layers) for faster future searches
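The "coin flip" in step 4 can be sketched directly. Assuming the standard HNSW level-assignment rule, level = floor(-ln(u) · mL) with mL = 1/ln(m) and u drawn uniformly from (0, 1]:

```python
import math
import random

def assign_level(m: int, rng: random.Random) -> int:
    """Standard HNSW level assignment: level = floor(-ln(u) * mL), mL = 1/ln(m).
    Most nodes land on layer 0; each higher layer holds exponentially fewer nodes."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)

rng = random.Random(42)
levels = [assign_level(16, rng) for _ in range(10_000)]
base = sum(1 for lv in levels if lv == 0) / len(levels)
print(f"fraction on layer 0: {base:.2f}")  # roughly 1 - 1/16 ≈ 0.94
```

With m=16, about 15 out of 16 nodes stay on the ground layer; the rare survivors of repeated coin flips become the sparse "Highway Layer" entry points.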

The Neighbor Linkage Example

When Star Wars is added:

Star Wars [0.92, 0.81, 0.02, 0.11, 0.01, 0.28, 0.19]
    ↓ (Cosine Similarity ≈ 0.999)
Star Trek [0.88, 0.79, 0.01, 0.09, 0.02, 0.25, 0.21]
    ↓ (Cosine Similarity ≈ 0.997)
Interstellar [0.85, 0.75, 0.05, 0.08, 0.03, 0.30, 0.25]

These links allow future searches to “jump” directly to the Sci-Fi cluster instead of scanning the Drama cluster.
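The ranking behind those links can be reproduced with a brute-force pass over the sample vectors, which is exactly what HNSW approximates without having to compare against every node:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

candidates = {
    "Star Trek":    [0.88, 0.79, 0.01, 0.09, 0.02, 0.25, 0.21],
    "Interstellar": [0.85, 0.75, 0.05, 0.08, 0.03, 0.30, 0.25],
    "The Notebook": [0.10, 0.00, 0.90, 0.20, 0.00, 0.80, 0.10],
}
star_wars = [0.92, 0.81, 0.02, 0.11, 0.01, 0.28, 0.19]

# Rank every candidate by similarity to the newly inserted vector.
ranked = sorted(candidates, key=lambda t: cosine_similarity(star_wars, candidates[t]),
                reverse=True)
print(ranked)  # Sci-Fi titles lead; The Notebook trails far behind
```

Star Trek and Interstellar score near 1.0 against Star Wars while The Notebook scores far lower, so the new node's edges go to the Sci-Fi cluster.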

Memory Growth

With each new movie added, the index grows:

  • After 1 movie: ~100 KB (minimal structure + 1 vector + metadata)
  • After 5 movies: ~300 KB (vectors + edges between neighbors)
  • After 10 movies: ~500 KB (complete 7D neighborhood graph)

The index lives in OS RAM (off-heap), not in Neo4j’s Page Cache.


Step 4: Semantic Search – db.index.vector.queryNodes

You execute a query: “Fast-paced space adventure”

The Query Encoding

OpenAI encodes your search query into 7 dimensions:

query = "Fast-paced space adventure"
query_vector = [0.89, 0.82, 0.03, 0.10, 0.02, 0.26, 0.18]

The HNSW Navigation

Instead of checking all 10 movies, the index uses a greedy traversal algorithm:

  1. High-Level Layer: Start at the “Highway” level—a sparse layer containing only the most central nodes
  2. Skip the Drama Cluster: The query vector [0.89, 0.82…] is clearly Sci-Fi heavy, so the algorithm bypasses The Notebook and other Drama nodes
  3. Jump to Sci-Fi Cluster: Follow edges to Star Wars, Star Trek, and Interstellar (all high Sci-Fi, high Action scores)
  4. Local Search: Check these neighbors’ neighbors for the closest match
  5. Return Results: Sorted by cosine similarity scores

Cosine Similarity Calculation

For query vector and Star Wars:

Similarity = (0.89×0.92 + 0.82×0.81 + 0.03×0.02 + ... + 0.18×0.19)
           ÷ (√(0.89² + 0.82² + ...) × √(0.92² + 0.81² + ...))
           ≈ 0.9996

Result: Star Wars is returned with a similarity score of ≈ 1.0, a near-perfect match
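Carrying that computation out in code, using the full 7-dimensional vectors above:

```python
import math

query     = [0.89, 0.82, 0.03, 0.10, 0.02, 0.26, 0.18]
star_wars = [0.92, 0.81, 0.02, 0.11, 0.01, 0.28, 0.19]

# Cosine similarity: dot product divided by the product of the norms.
dot    = sum(q * s for q, s in zip(query, star_wars))
norm_q = math.sqrt(sum(q * q for q in query))
norm_s = math.sqrt(sum(s * s for s in star_wars))
similarity = dot / (norm_q * norm_s)
print(f"{similarity:.4f}")  # 0.9996 — effectively a perfect match
```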


Technical Summary Table

Script Step           | Numeric Context (10 Movies, 7D)                         | Complexity        | Memory Impact
CREATE INDEX          | Defines 7D coordinate system map                        | O(1)              | ~5 MB metadata
genai.vector.encode   | Returns 7-element float array per movie                 | O(1) network      | Temporary RAM
setNodeVectorProperty | Writes vector to disk; connects to ~4 closest neighbors | O(log N) indexing | ~50 KB per vector
queryNodes            | Greedy HNSW traversal to find closest matches           | O(log N) search   | No additional memory

Python Script: Monitoring Index Population Progress

Since HNSW indexing happens asynchronously in the background, use this script to verify all 10 movies are fully indexed before running semantic searches:

from neo4j import GraphDatabase

def check_index_status(uri, user, password, index_name):
    """
    Monitor the population progress of a vector index.
    
    Args:
        uri: Neo4j connection string (e.g., "bolt://localhost:7687")
        user: Database username
        password: Database password
        index_name: Name of the vector index to monitor
    """
    driver = GraphDatabase.driver(uri, auth=(user, password))
    
    try:
        with driver.session() as session:
            # Query index metadata
            result = session.run(
                "SHOW INDEXES YIELD name, state, populationPercent WHERE name = $index_name",
                index_name=index_name
            )
            
            for record in result:
                print(f"Index: {record['name']}")
                print(f"Status: {record['state']}")
                print(f"Population Progress: {record['populationPercent']}%")
                
                # Wait for 100% population before querying
                if record['populationPercent'] < 100:
                    print("⚠️  Index still building. Wait before querying.")
                else:
                    print("✅ Index fully populated. Ready for semantic search.")
    finally:
        driver.close()

# Example usage
check_index_status(
    "bolt://localhost:7687",
    "neo4j",
    "your_password",
    "movie_tagline_embeddings"
)

Expected Output

Index: movie_tagline_embeddings
Status: ONLINE
Population Progress: 100%
✅ Index fully populated. Ready for semantic search.

Key Insights: What You Must Understand

1. CREATE VECTOR INDEX Claims Memory, Not Data

It does not pre-allocate space for 1 million vectors. It merely registers the structure and reserves a small off-heap footprint.

2. genai.vector.encode Is the Semantic Brain

This step is the only part that understands meaning. It converts human language (“galaxy far away”) into mathematical coordinates that computers can compare.

3. setNodeVectorProperty Is the Recommended Registration Step

A vector saved as a regular property (SET n.embedding = vector) will still be picked up by a vector index on that property, but db.create.setNodeVectorProperty validates the vector and stores it in a more space-efficient format. Prefer it whenever you write embeddings.
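The procedure call itself is short (a sketch; the MATCH clause and property name follow this post's running example, and $vector is the embedding produced in Step 2):

```cypher
MATCH (m:Movie {name: "Star Wars"})
CALL db.create.setNodeVectorProperty(m, 'taglineEmbedding', $vector)
```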

4. HNSW Builds a Navigable Hierarchy

Instead of scanning all 10 movies on every search, HNSW creates “highways” (upper layers) that allow greedy algorithms to skip unrelated data clusters. This reduces search time from O(N) to O(log N).

5. Indexing Is Asynchronous

After you call setNodeVectorProperty, the HNSW worker updates the index in the background. Always check populationPercent before running production semantic searches.


Conclusion

Neo4j’s vector indexing system is a three-stage pipeline:

  1. Index Creation: Define the 7D space and prepare the HNSW infrastructure
  2. Vector Encoding: Convert semantic text into mathematical coordinates
  3. Intelligent Registration: Place vectors in the index and link them to neighbors for fast traversal

Using HNSW, your semantic search scales from O(N) brute force to O(log N) navigable hierarchies—enabling lightning-fast searches across millions of high-dimensional vectors.


Further Reading