Semantic Search & RAG WordPress Project – Current Version May 2026

LLM Semantic Search Project Overview of Function:

To AI for repair:

I have an LLM semantic search project on my Vultr.com VPS, built in WordPress using a plugin. All the scripts were written by you over time, and it worked for a while before a WP update broke it.

Many repair attempts by you have failed. I've gone back to earlier scripts that worked and uploaded them to you with their folder locations, along with the start state of the WP Search page, which has data injected into it by the search process. That process copies the full text of all the tech Posts I have written. The content is chunked and stored in the folders shown, along with the Post ID content text, which is later sent to Gemini online (don't lose the API key from the script it is in!!). I need you to think like a WordPress programmer and a full-stack programmer for Linux, MariaDB, Apache2 and WordPress, and summarise what you think the process does and how, so I can have faith that you can amend the relevant scripts to fix the current issues. First, the Flask server state is not shown correctly in the WP Search page plugin shortcode data space ("[Client]: Connection Error: xhr poll error. Is Flask server running?"). Second, when update_agent.py is run, the chunking size fails (this happened after the WP version update).

DO NOT UPGRADE ANYTHING!!! Just try to understand, from the scripts and their folders, how they all relate to each other, and try to fix the issue, please.

(llm_venv) stevee@vultr:/var/www/my-llm-project$ ls -l
total 2056
-rw-rw-r-- 1 stevee stevee      87 May 19 21:41 flask_service.log
drwxrwxr-x 2 stevee stevee    4096 May 19 18:07 full_posts_text
drwxrwxr-x 2 stevee stevee    4096 May 24 18:33 full_posts_text_chunks
-rw-rw-r-- 1 stevee stevee    6950 May 24 18:25 generate_embeddings.py
drwxrwxr-x 7 stevee stevee    4096 Dec  6 20:09 llm_venv
-rw------- 1 stevee stevee     344 Dec  6 22:21 nohup.out
-rw-rw-r-- 1 stevee stevee    2881 May  8 17:09 PipList.txt
-rw-rw-r-- 1 stevee stevee    1400 Dec  6 19:10 requirements.txt
-rw-rw-r-- 1 stevee stevee   10549 May 24 18:25 semantic_search_api.py
-rw-rw-r-- 1 stevee stevee    5008 May 24 18:25 update_agent.py
-rw-rw-r-- 1 stevee stevee 2047103 May 24 18:33 wordpress_embeddings.json

(llm_venv) stevee@vultr:/var/www/DebianAdmin/wp-content/plugins/my_voice_search$ ls -l
total 12
drwxrwxr-x 2 stevee www-data 4096 May 19 21:34 css
drwxrwxr-x 2 stevee www-data 4096 May 19 21:34 js
-rw-rw-r-- 1 stevee www-data 2692 May 24 18:28 my_voice_search.php

(llm_venv) stevee@vultr:/var/www/DebianAdmin/wp-content/plugins/my_voice_search/css$ ls
style.css

(llm_venv) stevee@vultr:/var/www/DebianAdmin/wp-content/plugins/my_voice_search/js$ ls
speech-recognition.js

When this is run:

(llm_venv) stevee@vultr:/var/www/my-llm-project$ python semantic_search_api.py
/var/www/my-llm-project/semantic_search_api.py:7: FutureWarning:
All support for the `google.generativeai` package has ended. It will no longer be receiving
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:
https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md
import google.generativeai as genai
Loading chunk data from wordpress_embeddings.json...
Loaded 145 total chunks into the vector index.
* Serving Flask app 'semantic_search_api'
* Debug mode: on
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://149.28.97.232:5000
Press CTRL+C to quit
* Restarting with stat
/var/www/my-llm-project/semantic_search_api.py:7: FutureWarning:
import google.generativeai as genai
Loading chunk data from wordpress_embeddings.json...
Loaded 145 total chunks into the vector index.
* Debugger is active!
* Debugger PIN: 415-273-546

The last model that worked used the "all-minilm" embedding model for low overhead on a 2-core Vultr VM. I lost track of what the scripts may be trying to use, as Ollama was the first, but it was too resource-heavy.

I completely understand the frustration when a working setup gets broken by automated platform updates. Let's look at this with a steady, systematic full-stack engineering eye. I see your directory paths, the strict constraints to keep everything as-is without upgrading packages, and the exact code you're running.

Let's break down exactly how your system is designed to flow, pinpoint why the Socket.IO connection is failing with that xhr poll error, clarify the embedding model mystery, and get everything talking properly again.

1. Process Architecture Summary

Based on your uploaded source code and directory paths, your Retrieval-Augmented Generation (RAG) semantic search operates via a 3-tier architecture:

[WordPress Frontend]          [Apache2 Reverse Proxy]       [Flask Backend Server]
Shortcode UI Page ----------> Proxies /api & WS ----------> semantic_search_api.py
speech-recognition.js         (Port 80/443)                 (Runs on Port 5000)
                                                            Reads local JSON &
                                                            Queries Gemini API

Phase A: Data Extraction & Indexing (generate_embeddings.py)

Database Pull: It logs into your local MariaDB database (DA) using PyMySQL, hitting the custom prefix table wp_xubg_posts.
Text Processing: It extracts published articles/pages, strips out WordPress shortcodes ([...]) and HTML tags using regex, and tokenizes the text into clean sentences using NLTK's punkt sentence tokenizer.
Chunking: It groups sentences into small text chunks (maximum 256 words) so the vector model isn't overwhelmed.
Local Vectorization: Each chunk is passed to a local instance of Ollama running the all-minilm model, which returns its embedding vector (a fixed-length list of floats).
Storage: The text chunks are written to individual files inside /var/www/my-llm-project/full_posts_text_chunks/, and the whole master list of vectors is exported to wordpress_embeddings.json.
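The text-processing step can be sketched as follows. This is an illustrative sketch, not a copy of generate_embeddings.py: the function name clean_post_text and the regular expressions are assumptions that mirror the description above (remove [shortcodes], remove HTML tags, then collapse whitespace). Decoding HTML entities such as &amp; is an extra step added here as an assumption.

```python
import html
import re

# Hypothetical patterns; the real generate_embeddings.py may use different ones.
SHORTCODE_RE = re.compile(r"\[/?[a-zA-Z0-9_-]+[^\]]*\]")  # [gallery ids="1,2"], [/caption]
HTML_TAG_RE = re.compile(r"<[^>]+>")                      # <p>, </div>, <!-- wp:paragraph -->
WHITESPACE_RE = re.compile(r"\s+")


def clean_post_text(post_content: str) -> str:
    """Turn a raw wp_posts.post_content value into plain text (sketch)."""
    text = SHORTCODE_RE.sub(" ", post_content)  # drop shortcode tags, keep enclosed text
    text = HTML_TAG_RE.sub(" ", text)           # drop HTML tags and HTML comments
    text = html.unescape(text)                  # &amp; -> &, &#8217; -> typographic apostrophe
    return WHITESPACE_RE.sub(" ", text).strip() # collapse runs of whitespace
```

One trade-off of this simple shortcode pattern: it also removes any bracketed word, such as [1] in prose or array[0] in a code sample, so posts with many code listings may lose a little text.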

Phase B: Frontend Interaction (speech-recognition.js)

When you visit your technical search shortcode page, the browser loads your custom WordPress plugin scripts.
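It initializes a Socket.IO connection (const socket = io();) to stream real-time logs directly into your UI terminal data box (#real-time-output). Because io() is called without a URL, the client connects to the same origin as the page (your WordPress site on port 80/443) at the default /socket.io/ path, so Apache has to forward that path to Flask.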
When you speak or type a query, an AJAX POST request hits /api/search.
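When the browser reports "xhr poll error", it is Socket.IO's HTTP long-polling transport (used before any upgrade to WebSocket) that has failed. A default io() call with no URL sends its first request to the page's own origin at /socket.io/, so one way to narrow the fault down is to make that request yourself. The sketch below is hypothetical diagnostic code, not part of your project: it assumes the default /socket.io/ path and Engine.IO protocol version 4 (the Socket.IO 3.x/4.x JavaScript client; a 2.x client sends EIO=3 and gets a differently framed reply), and all function names are made up for illustration.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlencode, urlunsplit


def engineio_polling_url(scheme: str, host: str, path: str = "/socket.io/", eio: int = 4) -> str:
    """Build the long-polling handshake URL that a default io() client requests first."""
    query = urlencode({"EIO": eio, "transport": "polling"})
    return urlunsplit((scheme, host, path, query, ""))


def parse_open_packet(body: str) -> dict:
    """Parse an Engine.IO v4 'open' packet, e.g. '0{"sid": "...", ...}'."""
    # Packet type 0 means "open"; the rest of the packet is the JSON handshake data.
    if not body.startswith("0{"):
        # e.g. a WordPress HTML page: most likely the request never reached Socket.IO.
        raise ValueError(f"not an Engine.IO open packet: {body[:40]!r}")
    return json.loads(body[1:])


def probe(scheme: str, host: str, timeout: float = 5.0) -> str:
    """Fetch the handshake URL and describe what came back (needs network access)."""
    url = engineio_polling_url(scheme, host)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:  # must come before URLError (it is a subclass)
        return f"{url} -> HTTP {exc.code} error"
    except urllib.error.URLError as exc:
        return f"{url} -> connection failed: {exc.reason}"
    try:
        info = parse_open_packet(body)
    except ValueError as exc:
        return f"{url} -> unexpected reply: {exc}"
    return f"{url} -> OK, sid={info.get('sid')}"
```

For example, running probe("http", "127.0.0.1:5000") on the VM talks to Flask directly, while probe("https", "your-domain.example") goes through Apache (the domain is a placeholder for your site). If the first reports OK and the second does not, the Apache proxy rules for /socket.io/ would be the place to look.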

Phase C: Semantic Matching & RAG (semantic_search_api.py)

Query Matching: The Flask app takes your query string, converts it to an embedding vector via Ollama (all-minilm), and runs a fast linear dot-product comparison against all 145 loaded chunks in your array.
Deduplication: It returns the top 5 most relevant unique post matches back to the browser UI.
Gemini RAG Summarization: If you click "Summarise Article", the JavaScript fires a GET request to /api/summarize/<post_id>/<chunk_index>. The backend reads the corresponding raw text file from the local chunks directory and sends it to the Google Gemini API (through the google.generativeai package, using the API key hardcoded in the script). Gemini returns a contextual summary, and because the generation work happens on Google's side, it does not bog down your Vultr server.
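The matching and deduplication steps can be sketched in plain Python. This is a hypothetical sketch, not the code in semantic_search_api.py: the chunk keys (post_id, chunk_index, embedding) are assumed and may not match wordpress_embeddings.json, and the query vector is assumed to have already come back from Ollama. Both vectors are normalised first so the dot product equals cosine similarity; if the stored embeddings are already unit-length, this ranks exactly like a raw dot product.

```python
import math
from typing import Dict, List, Sequence


def normalise(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_unique_posts(query_vec: Sequence[float], chunks: List[Dict], k: int = 5) -> List[Dict]:
    """Score every chunk against the query, then keep the best chunk per post.

    Each chunk is assumed to look like
    {"post_id": 12, "chunk_index": 0, "embedding": [...]}.
    """
    q = normalise(query_vec)
    scored = []
    for chunk in chunks:  # linear scan: cheap for ~145 chunks
        emb = normalise(chunk["embedding"])
        scored.append((sum(a * b for a, b in zip(q, emb)), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    results, seen_posts = [], set()
    for score, chunk in scored:
        if chunk["post_id"] in seen_posts:
            continue  # deduplicate: only the best-scoring chunk of each post
        seen_posts.add(chunk["post_id"])
        results.append({"post_id": chunk["post_id"],
                        "chunk_index": chunk["chunk_index"],
                        "score": score})
        if len(results) == k:
            break
    return results
```

Returning chunk_index alongside post_id is what would let the frontend build the /api/summarize/<post_id>/<chunk_index> request for the best-matching chunk.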

Why Did This Fail After the WordPress Update?

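The script is configured with MAX_CHUNK_WORDS = 256. Under normal circumstances, a 256-word text block fits within the context limit of the all-minilm model, although words and tokens are not one-to-one: technical text full of paths, commands and code usually splits into more tokens than words.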

However, a WordPress update can change how post content is formatted and stored (for example, when posts are edited and re-saved in an updated block editor), which can leave large blocks of metadata, raw Gutenberg block delimiters (HTML comments of the form <!-- wp:... -->), or base64 data in the text.

Your chunker relies entirely on NLTK's sent_tokenize to find sentence boundaries (periods and other terminal punctuation). If the WordPress update introduced formatting strings, long code blocks, or unstructured text containing no standard terminal punctuation, sent_tokenize returns the entire text block as one giant sentence.

If a single "sentence" happens to be a massive 2,000-word block of unpunctuated text or code fragments, the script has no choice but to append it as a single chunk. Ollama can then reject this oversized chunk because it exceeds the context limit of the all-minilm model.
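Before changing the chunker, you could confirm this diagnosis on the real data by measuring the "sentences" that sent_tokenize produces for each post. The helper below is a hypothetical sketch (the name oversized_sentences is made up): it takes the list that nltk.sent_tokenize returns and reports every entry longer than MAX_CHUNK_WORDS.

```python
from typing import List, Tuple

MAX_CHUNK_WORDS = 256  # same value as configured in generate_embeddings.py


def oversized_sentences(sentences: List[str], limit: int = MAX_CHUNK_WORDS) -> List[Tuple[int, int]]:
    """Return (position, word_count) for every 'sentence' longer than the chunk limit."""
    report = []
    for position, sentence in enumerate(sentences):
        word_count = len(sentence.split())  # whitespace-separated words, as the chunker counts them
        if word_count > limit:
            report.append((position, word_count))
    return report
```

For example, a non-empty result from oversized_sentences(nltk.sent_tokenize(text)) for a given post would show that the post contains a block sent_tokenize could not split.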

To keep chunking robust regardless of how WordPress formats text in the future, we need to modify the chunking function inside generate_embeddings.py to use a hard word-count fallback: if a single sentence exceeds MAX_CHUNK_WORDS, it must be split by word count.
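A minimal sketch of that fallback follows, assuming the chunker receives the list of sentences from nltk.sent_tokenize. The name chunk_sentences is hypothetical; the real function in generate_embeddings.py may have a different name and signature, and if update_agent.py carries its own copy of the chunking logic, the same change would presumably be needed there too.

```python
from typing import Iterable, List

MAX_CHUNK_WORDS = 256  # as configured in generate_embeddings.py


def chunk_sentences(sentences: Iterable[str], max_words: int = MAX_CHUNK_WORDS) -> List[str]:
    """Group sentences into chunks of at most max_words words.

    A single sentence longer than max_words (e.g. unpunctuated code or block
    markup that sent_tokenize could not split) is hard-split by word count.
    """
    chunks: List[str] = []
    current: List[str] = []  # words of the chunk currently being built

    def flush() -> None:
        if current:
            chunks.append(" ".join(current))
            current.clear()

    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        if len(words) > max_words:
            # Hard fallback: close the open chunk, then slice the giant "sentence".
            flush()
            for start in range(0, len(words), max_words):
                chunks.append(" ".join(words[start:start + max_words]))
            continue
        if len(current) + len(words) > max_words:
            flush()  # adding this sentence would overflow the current chunk
        current.extend(words)
    flush()
    return chunks
```

Usage would look like chunk_sentences(nltk.sent_tokenize(clean_text)). Normal prose still groups whole sentences together; only oversized blocks are cut mid-sentence. Because the limit counts words rather than model tokens, lowering max_words a little gives extra margin for token-dense technical posts. Note that joining with single spaces also normalises the whitespace inside each chunk.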