Python and HTML5 Entities
Saturday, November 16, 2013
I was presented with the task of extracting the plain text from some XML-formatted closed captions. I was in a "quick and dirty" problem-solving mood, so clearly regular expressions were going to be involved. As such, I started out with:
sed -r 's/<\/?[^>]+>/\r\n/g' data.xml | grep -v '^$' > data.txt
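For comparison, here is a rough Python sketch of the same tag-stripping step. It reuses the pattern from the sed expression; the `strip_tags` name and the sample caption string are made up for illustration, and this is no more a real XML parser than the one-liner is.

```python
import re

# Same pattern as the sed expression: any opening or closing tag.
TAG = re.compile(r"</?[^>]+>")

def strip_tags(xml_text):
    """Replace each tag with a line break, then drop the empty lines."""
    lines = TAG.sub("\n", xml_text).splitlines()
    return [line for line in lines if line]

# Hypothetical caption fragment, not taken from the real data.xml.
sample = "<p begin='0s'>Hello</p><p begin='2s'>&#x266a; Music &#x266a;</p>"
print(strip_tags(sample))
```

Note that the entities pass through untouched, which is exactly the problem described next.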
Since this was XML, of course there were some entities. To make matters worse, there were not only the XML named entities (apos, gt, lt, etc.) but also hex-encoded entities for things like music notes, which are very commonly used in closed captions to tell the viewer that music is playing. This is one of the big differences between closed captions and subtitles.
My first thought was that Python should be able to help me solve this problem. It's a "web friendly" language. People must do this all the time! And apparently they must, because I found this snippet on this blog post:
import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.
def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)
This didn't actually work because my text included &apos; which is not in the htmlentitydefs.name2codepoint dictionary. That fact led me to Python Issue #11113 and an html5 dictionary that included all of the desired entities. The issue indicated that the changes were added in Python 3 somewhere along the line, and the html5 dictionary I mentioned was available here.
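As a quick check of that gap, the following sketch assumes Python 3.3 or later, where the dictionary from the issue is exposed as html.entities.html5 and the Python 2 htmlentitydefs module has become html.entities. The keys in html5 include the trailing semicolon.

```python
from html.entities import html5, name2codepoint

# The HTML 4 table used by the snippet above has no entry for apos.
print("apos" in name2codepoint)

# The HTML5 table does; its keys carry the trailing semicolon.
print(repr(html5["apos;"]))

# Entities that exist in both tables agree with each other.
print(chr(name2codepoint["lt"]) == html5["lt;"])
```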
At this point, my problem was solved. But I couldn't help but think that there was a better way. After a little searching I found that something as simple as the following solved my problem:
import HTMLParser
HTMLParser.HTMLParser().unescape(text)
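The two lines above are Python 2. On Python 3.4 or later the equivalent is, as far as I can tell, html.unescape, which handles named, decimal, and hex references in one call. A minimal sketch, with a made-up caption string:

```python
import html

# Hypothetical caption mixing named, hex, and decimal references.
caption = "&#x266a; It&apos;s &lt;music&gt; &#9834;"

plain = html.unescape(caption)
print(plain)
```

Both &#x266a; and &#9834; refer to the same code point, U+266A, the eighth note.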
All of this seemed like a good exercise in playing with Python, and seemed worth recording.