WordPress XML cleanup



✅ Running the script below creates two files in the folder you run it from:

cleaned_blog_posts.json → structured archive

cleaned_blog_posts.csv → spreadsheet-friendly (Excel, Google Sheets, etc.)

import html
import re
import json
import csv
from lxml import etree  # third-party: pip install lxml

# Path to your WordPress export XML
xml_path = "thescottogrottoorg.WordPress.2025-09-04.xml"

def clean_content(html_text):
    """Remove WordPress block tags, HTML, and excess whitespace."""
    if not html_text:
        return ""
    text = re.sub(r"<!--.*?-->", "", html_text, flags=re.DOTALL)  # WP block comments
    text = re.sub(r"<[^>]+>", "", text)  # strip HTML tags
    # Decode entities in one pass; chained .replace() calls would turn
    # "&amp;lt;" into "<" instead of "&lt;".
    text = html.unescape(text)
    text = re.sub(r"\s+", " ", text).strip()  # also collapses non-breaking spaces
    return text

# Parse XML with recovery mode so minor markup errors do not abort the run
parser = etree.XMLParser(recover=True)
tree = etree.parse(xml_path, parser)
root = tree.getroot()

# Namespace prefixes used by the WordPress export
WP_NS = "{http://wordpress.org/export/1.2/}"
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

posts = []
for item in root.findall("./channel/item"):
    # Keep only published blog posts (skips pages, attachments, drafts)
    if item.findtext(WP_NS + "post_type") != "post":
        continue
    if item.findtext(WP_NS + "status") != "publish":
        continue

    # findtext returns "" for an empty element and None for a missing one,
    # so "or" covers both cases
    title = item.findtext("title") or "Untitled"
    date = item.findtext("pubDate") or "Unknown"
    content = item.findtext(CONTENT_NS + "encoded") or ""

    if content.strip():
        posts.append({
            "title": title.strip(),
            "date": date.strip(),
            "content": clean_content(content)
        })

# Save all cleaned posts to JSON
json_file = "cleaned_blog_posts.json"
with open(json_file, "w", encoding="utf-8") as f:
    json.dump(posts, f, ensure_ascii=False, indent=2)

print(f"Saved {len(posts)} posts to {json_file}")

# Save all cleaned posts to CSV
csv_file = "cleaned_blog_posts.csv"
with open(csv_file, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "date", "content"])
    writer.writeheader()
    writer.writerows(posts)

print(f"Saved {len(posts)} posts to {csv_file}")
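Before pointing the script at a full export, it can help to try the cleaning step on one small string. The sketch below is a standard-library-only restatement of that step. It assumes `html.unescape` for entity decoding, and the sample post body is made up for illustration, not taken from the real export.

```python
import html
import re


def clean_content(html_text):
    """Strip block comments and tags, decode entities, collapse whitespace."""
    if not html_text:
        return ""
    # Block-editor markers look like <!-- wp:paragraph --> ... <!-- /wp:paragraph -->
    text = re.sub(r"<!--.*?-->", "", html_text, flags=re.DOTALL)
    # Drop any remaining HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; and &nbsp; in a single pass
    text = html.unescape(text)
    # Collapse runs of whitespace (including non-breaking spaces) to one space
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical post body in the block-editor style (an assumption for this demo)
sample = (
    "<!-- wp:paragraph -->\n"
    "<p>Deer &amp; chipmunks&nbsp;at <em>dusk</em>.</p>\n"
    "<!-- /wp:paragraph -->"
)

cleaned = clean_content(sample)
print(cleaned)
```

If the printed line still shows tags or entity codes, the regular expressions are the place to look before running the whole archive through.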

## scottobear: backyard zoo & grotto musings

sometimes the smallest windows into the world are the ones that linger the longest. the **scottobear youtube channel** is one of those windows. gregory scott von berg — better known online as scottobear — calls himself an *author / blogger / coder / friendly dude. just this guy, you know?* it fits.

### backyard zoo
most of the channel is made up of short glimpses of the neighbors we don’t always notice. deer padding through the grass. skunks toddling in at dusk. chipmunks darting in and out of view. on a recent video, a monarch butterfly unfolded its wings for the first time, caught in quiet close-up.

a few recent favorites:

- [skunks visit 8-31-2025](https://www.youtube.com/watch?v=-64uG9cIvRI) (#backyardzoo #roanokeva #skunk)
- [deer and chipmunk 8-28-2025](https://www.youtube.com/watch?v=UASUmtxevGs)
- [monarch hatching august 28 2025](https://www.youtube.com/watch?v=NeaC7_pEPLU)

there’s nothing overproduced here, no polish. just the ordinary magic of nature slipping into view, the kind you’d catch if you left a trail cam running in your own yard.

### the grotto
alongside the videos, there’s a journal at [svonberg.org](https://svonberg.org), affectionately called **the scotto grotto**. it’s been running since 2000, which means there’s an entire archive of wandering thoughts, small joys, odd links, and philosophical asides.

the **about page** greets you with a smile: *gregory scott von berg – friendly bear. will not maul you. probably.* (ʕ •ᴥ•ʔ)

the **contract** encourages readers to take things lightly, to let ideas sit in the “maybe” pile before calling them true or false.

entries can be whimsical or weighty, sometimes both at once. in one, he imagines *if i were a tree, i’d be a redwood. if i were a book, i’d be dandelion wine.*

### why peek in
taken together, the youtube clips and grotto pages paint a picture of someone who notices things. the soft shuffle of skunk paws at night. the play of light in memory. the shape of a thought trying to form.

it’s not about being impressive. it’s about being present.

- **youtube**: [scottobear channel](https://www.youtube.com/@scottobear)
- **journal**: [the scotto grotto](https://svonberg.org)

drop by, if you’d like. watch the deer, read the words, take a moment. there’s a gentle kind of company here.