opello.{com,net,org}

Python and HTML5 Entities

Saturday, November 16, 2013 categories: code, html5, python

I was presented with the task of extracting the plain text from some XML-formatted closed captions. I was in a "quick and dirty" problem-solving mood, so clearly regular expressions were going to be involved. As such, I started out with:

sed -r 's/<\/?[^>]+>/\r\n/g' data.xml | grep -v '^$' > data.txt
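For comparison, the same tag-stripping idea can be sketched in Python. This is only a rough equivalent of the pipeline above, not what I actually ran: the function name strip_tags and the sample caption markup are made up for illustration, and unlike grep -v '^$' it also drops lines that contain only whitespace.

```python
import re

def strip_tags(xml_text):
    """Replace every tag with a newline and return the non-blank lines."""
    # Same pattern as the sed expression: an opening or closing tag.
    without_tags = re.sub(r"</?[^>]+>", "\n", xml_text)
    # Drop blank lines (and, unlike grep -v '^$', whitespace-only ones).
    return [line.strip() for line in without_tags.splitlines() if line.strip()]

# Hypothetical caption markup, only to show the shape of the result.
sample = '<p begin="0:00:01">Hello</p>\n<p begin="0:00:02">&#x266A; music &#x266A;</p>'
lines = strip_tags(sample)
print(lines)
```

Note that the entities survive this step untouched, which is exactly the problem described next.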

Since this was XML, of course there were some entities. To make matters worse, there were not only the XML named entities (apos, gt, lt, etc.) but also hex-encoded entities for things like music notes, which are very commonly used in closed captions to tell the viewer that music is playing. This is one of the big differences between closed captions and subtitles.

My first thought was that Python should be able to help me solve this problem. It's a "web friendly" language. People must do this all the time! And apparently they do, because I found this snippet in this blog post:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

This didn't actually work because my text included &apos;, which is not in the htmlentitydefs.name2codepoint dictionary. That fact led me to Python Issue #11113 and an html5 dictionary that included all of the desired entities. The issue indicated that the changes landed in Python 3 (the html5 dictionary is part of html.entities as of Python 3.3), and the html5 dictionary I mentioned was available here.
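As a rough sketch of how the snippet might look on Python 3 with that dictionary, assuming html.entities.html5 is available (Python 3.3 or later): the structure follows the snippet above, but this port is my own adaptation, not the code from the blog post or from the issue. The keys in html5 carry the trailing semicolon and no leading ampersand, so the lookup slices off only the first character.

```python
import re
from html.entities import html5  # keys look like "apos;", values are strings

def unescape(text):
    """Replace numeric character references and HTML5 named entities."""
    def fixup(m):
        ref = m.group(0)
        if ref.startswith("&#"):
            # Numeric character reference, hex (&#x266A;) or decimal (&#9834;).
            try:
                if ref[2] in "xX":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                return ref  # malformed or out of range: leave as is
        # Named entity: look up "name;" and leave unknown names as is.
        return html5.get(ref[1:], ref)
    return re.sub(r"&#?\w+;", fixup, text)

print(unescape("It&apos;s &#x266A; music &#x266A;"))
```

Because the html5 dictionary includes apos;, the case that tripped up the original snippet now resolves to a plain apostrophe.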

At this point, my problem was solved. But I couldn't help but think that there was a better way. After a little searching I found that something as simple as the following solved my problem:

import HTMLParser
HTMLParser.HTMLParser().unescape(text)
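The two lines above are Python 2. On Python 3.4 and later the equivalent is html.unescape, and the HTMLParser.unescape method was removed in Python 3.9. A minimal sketch, with a made-up caption string as input:

```python
import html

# Hypothetical caption text mixing named and hex character references.
caption = "&#x266A; It&apos;s &lt;music&gt; &#x266A;"

# html.unescape handles named entities (including &apos;) and
# decimal or hex numeric character references in one call.
plain = html.unescape(caption)
print(plain)
```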

All of this seemed like a good exercise in playing with Python, and seemed worth recording.