Post-processing Soupault with Python

2023-09-03

Needless to say, I’m at this point still very new to Soupault.app, so there may be a native way to do this, but after skimming the very long documentation, the GitHub blog blueprint, and a personal site that uses a pretty bog-standard Soupault content model, I didn’t have any leads.

So, what follows might be the worst way to do this, but I really wanted chronological nav links on every blog post.

The section index model Soupault uses always seemed kinda weird to me, especially for a blog, where new pages are being added all the time. Sure, there’s stuff like a “last 10 posts” view, or hashtag plugins to make navigating content easier, but that feels like beating around the bush. If you have hundreds of blog entries, that’s a very long index.html! Surely there are performance implications to loading that whole thing? (If there aren’t, don’t come for me, I’m a humble back-end dev). And performance aside, it’s annoying to navigate; it’s a lot of click to view blog post, read, click back, scroll, click to view next blog post. I’m slightly irritated just thinking about it. And if you’re on a mobile browser that often forgets your position in a long page? Rage inducing. Being able to move from post to post without having to go back to a central “hub” or index, and without excessive scrolling sounds a lot more pleasant. Of course, to someone with my background, limit and offset sound even better, but also impossible for a static site.

So, after a little thinking, an arrangement like << newest | < newer | older > | oldest >> seemed a good enough solution to this perhaps made-up problem, and also doable. Here’s how I do it. (My solution is implemented in Python, but you can use anything.)

The process

First, we have Soupault build the site, because we’ll be using the metadata it extracts to modify the pages it produces. At bare minimum, we need url, date, and nav_path. date is the only one of these you have to configure yourself. I also use title for the title attribute in the link. After Soupault finishes, we walk through index.json and build a dict (or what other languages might call a map) where the key is the section, and the value is a list of objects, with each object holding the metadata for one page. The section key is made by just joining everything in the nav_path array with a slash to form something like a/b/c. Since I have Soupault sorting by date already, what we have by the end of this step is essentially a collection of metadata sorted by section, then by date (descending). If your index.json is not already sorted by date, you have some extra sorting to do. It will be obvious why we did this in the next step.

Metadata dict code:

sections = {}
with open('index.json') as index_file:
    index = json.load(index_file)
for page in index:
    url = page['url']
    title = page['title']
    date = page['date']
    section = '/'.join(page['nav_path'])
    if section not in sections:
        sections[section] = []
    sections[section].append(PageMeta(url=url, title=title, date=date))

Now, we walk through the collection we just made, section by section, and page by page, while looking ahead (i + 1) and behind (i - 1) to get the chronological newer and older page metadata respectively. Since we’re in a date-sorted list, index 0 is the newest page in this section, and len - 1 is the oldest. That’s really the meat of this whole thing: using url to create our hyperlinks, and inserting them into each HTML document. url is also used to find the actual HTML artifact: just split it on the /, prepend the build directory, and Bob’s your uncle.

Look-ahead, look-behind link-building loop:

for _, metas in sections.items():
    if len(metas) > 1:
        for i in range(len(metas)):
            on_newest = i == 0
            on_oldest = i == len(metas) - 1
            newest = html.escape('<< newest')
            newer = html.escape('< newer')
            older = html.escape('older >')
            oldest = html.escape('oldest >>')
            if not on_newest:
                newest = make_anchor(metas[0], newest)
                newer = make_anchor(metas[i - 1], newer)
            if not on_oldest:
                older = make_anchor(metas[i + 1], older)
                oldest = make_anchor(metas[-1], oldest)

            nav_path = get_parent_path(metas[i].url) + ['index.html']
            nav_block = f'<p class="post-nav">\n' + \
                        f' |\n'.join([newest, newer, older, oldest]) + \
                        '\n</p>'
            # insert nav_block into HTML artifact
            link(metas[i].url, nav_block)

All that’s left is to rewrite the HTML. I just duplicate the original, open the dupe in read mode, and the original in write mode, and begin writing the content of the dupe line-by-line back into the original file. When it encounters the tag that comes immediately after the position where I want the links, it of course first writes out the link block before continuing copying. Very simple in retrospect, it just took me quite a while to figure out what I wanted, and how to work with Soupault in harmony, neither of us stepping on each other’s toes. The result is this post-processor.