Markup normalization

This evening I (almost) normalized every possible variant of shitty markup that copy-and-paste online editors have entered into 35,000 articles over the last eight years.

A small selection of the Vim statements required to normalize every possible variant of shitty markup that copy-and-paste online editors have entered into 35,000 articles over the last eight years:

:%s/,|||,/\=nr2char(11)/g
:%s/,|||/\=nr2char(11)/g
:%s/""/\=nr2char(21)/g
:%s/"//g
:exe '%s/' . nr2char(11) . '/","/g'
:exe '%s/' . nr2char(21) . '/"/g'
:exe '%s/"$//g' # add a
:%s/^/"/g
:%s/<br \/><p><br \/>/<\/p><p>/g
:%s/<p><\/p><p>/<p>/g
:%s/<BR><br \/>/<br \/>/g
:%s/<br \/><p><\/p><p>/<p>/g
:%s/<p><br \/>/<p>/g
:%s/<\/p><\/p><p>/<\/p><p>/g
:%s/<\/p><br \/><p>/<\/p><p>/g
:%s/<br \/><p>/<\/p><p>/g
:%s/<p><p>/<p>/g
:%s/<P>/<\/p><p>/g
:%s/<\/p><\/p>/<\/p>/g
:%s/<br \/><\/p><p>/<\/p><p>/g
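If you'd rather not type these one at a time, the tag cleanups can be bundled into a small Vim function. This is only a rough sketch, not what I actually ran: the function name NormalizeMarkup and the choice of rules are made up for illustration, and it covers just a handful of the substitutions above. It assumes 'ignorecase' is off, so <P> and <p> stay distinct.

```vim
" Hypothetical helper: apply a few of the tag cleanups as an ordered list
" of [pattern, replacement] pairs. Order matters, since later rules clean
" up what earlier ones leave behind.
function! NormalizeMarkup() abort
  let rules = []
  call add(rules, ['<br /><p><br />', '</p><p>'])
  call add(rules, ['<p></p><p>', '<p>'])
  call add(rules, ['<p><br />', '<p>'])
  call add(rules, ['</p></p><p>', '</p><p>'])
  call add(rules, ['<p><p>', '<p>'])
  call add(rules, ['</p></p>', '</p>'])

  for [pat, rep] in rules
    " '#' as the delimiter means the slashes in the tags need no escaping.
    " \V makes the pattern literal ("very nomagic").
    " The 'e' flag suppresses the error when a pattern has no match.
    execute '%s#\V' . pat . '#' . rep . '#ge'
  endfor
endfunction
```

Save it to a file, :source it, and run :call NormalizeMarkup() on the open buffer. The rules here are safe only because none of them contains '#', a backslash, '&' or '~'; anything with those characters would need escaping first.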

Two more things: 1) Anyone who’s ever tried to tell you to use find and replace in BBEdit for large files is dead wrong. 2) College Publisher, you suck ****. ‘,|||,’ is not a valid delimiter. Quit being malicious.

Lastly, if I’d thought ahead, I would’ve tracked each kind of invalid markup against its prevalence and date range. That would’ve made for a fascinating anthropological study.

Author: Daniel Bachhuber

Proud father x2. Principal, Hand Built. Maintainer, WP-CLI.
