WP Posts Word Count Data Science Exercise – Linux Admin, WebDev, Comms & IT

WP Posts Word Count Data Science Exercise

This document summarizes the process we followed to extract data from your WordPress posts, analyze their word counts by year, and visualize the findings in a pie chart. The key focus was to perform this exercise without interfering with your existing, working voice-activated semantic search project files.

Overview of the Goal

The primary goal was to gain insights into your WordPress content by calculating the average word count of posts per year and visualizing this distribution. This provides a data-driven perspective on your content creation patterns over time.

Tools Used and Why

Linux Bash Commands (mkdir, cp, python3, source): Used for managing directories, copying files, executing Python scripts, and activating virtual environments. They provide the command-line interface for orchestrating the workflow.
Python: The primary scripting language used for data extraction, manipulation, and visualization.
wordpress-export.xml (from WP Admin Tools): This is the XML file containing your WordPress post data, manually exported from the WordPress dashboard. We used this specific export because it proved to be consistently parsable by our Python script, unlike the WP-CLI export for this particular task.
Python venv (Virtual Environment): (venv_ollama_rag) Used to create an isolated Python environment for installing project-specific libraries (pandas, matplotlib) without conflicting with system-wide Python packages or your semantic search project's dependencies.
pip: The Python package installer, used to install necessary libraries within the virtual environment.
extract_stats_data.py (Custom Python Script): Our first custom script designed to parse the WordPress XML export, clean the content, calculate word counts, extract publication years, and save this structured data into a CSV file. It's a dedicated data extraction tool.
pandas (Python Library): A powerful library for data manipulation and analysis. It's excellent for reading CSV files, grouping data, and performing aggregations (like calculating average word counts).
matplotlib (Python Library): A widely used plotting library in Python, ideal for creating static visualizations like pie charts.
analyze_and_plot.py (Custom Python Script): Our second custom script that reads the processed data from the CSV, performs the yearly average word count calculations, and generates the pie chart visualization.

Step-by-Step Process

Project Setup & Data Preparation
- Purpose: To create a dedicated, isolated directory for this data science exercise and place the input data there.
- Linux Commands:
- mkdir -p /var/www/data_science_project
- # Manually export "Posts" only from WordPress Admin Tools > Export
- # Then copy the downloaded XML file to the new directory:
- cp /path/to/your/downloaded/wordpress-export.xml /var/www/data_science_project/wordpress-export.xml

- Explanation: mkdir -p safely creates the data_science_project folder. The cp command ensures your raw WordPress data is available for the new scripts without touching your existing plugin files.
Data Extraction into CSV (extract_stats_data.py)
- Purpose: To parse the wordpress-export.xml file, extract post titles, publication years, and word counts, and save this clean, structured data into a CSV file.
- Script: The content of extract_stats_data.py (provided in a previous turn) was placed in /var/www/data_science_project/. This script handles XML parsing, HTML cleaning, date extraction, and word counting.
- Linux Commands:
- cd /var/www/data_science_project/
- python3 extract_stats_data.py

- Explanation: This command executes the Python script. The script reads the XML, processes each post, calculates word counts, extracts the year from the publication date, and writes these details into /var/www/data_science_project/extracted_stats.csv.
Environment Setup for Analysis & Plotting
- Purpose: To install the necessary Python libraries (pandas, matplotlib) into your existing virtual environment, ensuring they are available for the analysis script without affecting your main system or other projects.
- Linux Commands:
- # Ensure you are in the data_science_project directory:
- cd /var/www/data_science_project/
- # Activate the virtual environment (using its full path for robustness):
- source /var/www/DebianAdmin/wp-content/plugins/my-voice-search/content_data/venv_ollama_rag/bin/activate
- # Install the libraries:
- pip install pandas matplotlib

- Explanation: Activating the virtual environment isolates the pip install commands to that environment. pandas is for data handling, and matplotlib is for creating the plots.
Data Analysis & Visualization (analyze_and_plot.py)
- Purpose: To read the extracted data, perform calculations (average word count per year), and generate the pie chart visualization.
- Script: The content of analyze_and_plot.py (provided in a previous turn, with customizations for word count display and clockwise order) was placed in /var/www/data_science_project/.
- Linux Commands:
- # Ensure you are in the data_science_project directory and venv is active
- python3 analyze_and_plot.py

- Explanation: This command runs the analysis script. It reads extracted_stats.csv, groups the data by year, calculates the average word count for each year, and then generates and saves the pie chart as average_word_count_per_year_pie_chart_v2.png in the same directory.

Outcome

You successfully generated a pie chart showing the distribution of your average post word counts per year, with the actual word counts displayed and ordered clockwise from 2009. This provides a clear visual representation of your content length trends over time.

This entire exercise was performed in a completely isolated environment, ensuring no impact on your existing WordPress semantic search functionality.