opello.{com,net,org}

Python and HTML5 Entities

Saturday, November 16, 2013 categories: code, html5, python

I was presented with the task of extracting the plain text from some XML-formatted closed captions. I was in a "quick and dirty" problem-solving mood, so clearly regular expressions were going to be involved. As such, I started out with:

sed -r 's/<\/?[^>]+>/\r\n/g' data.xml | grep -v '^$' > data.txt
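
For comparison, the same tag-stripping step can be sketched in Python. This is only a rough equivalent of the one-liner above: the function name strip_tags and the sample string are made up for illustration, and like the sed version it is a regex hack rather than a real XML parser.

```python
import re

# Same pattern as the sed one-liner: any opening or closing tag.
TAG_RE = re.compile(r"</?[^>]+>")

def strip_tags(xml_text):
    """Replace each tag with a newline, then drop the blank lines."""
    pieces = TAG_RE.sub("\n", xml_text).splitlines()
    return [line.strip() for line in pieces if line.strip()]

# Hypothetical caption fragment, not taken from the real data.xml.
sample = '<p begin="0:00:01"><span>Hello</span><br/>world</p>'
print(strip_tags(sample))
```

Entities are left untouched by this step, just as with sed.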

Since this was XML, of course there were some entities. To make matters worse, there were not only the XML named entities (apos, gt, lt, etc.) but also hex-encoded character references for things like music notes, which are very commonly used in closed captions to tell the viewer that music is playing. This is one of the big differences between closed captions and subtitles.

My first thought was that Python should be able to help me solve this problem. It's a "web friendly" language. People must do this all the time! And apparently they do, because I found this snippet in a blog post:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

This didn't actually work because my text included &apos; which is not in the htmlentitydefs.name2codepoint dictionary. That fact led me to Python Issue #11113 and an html5 dictionary that included all of the desired entities. The issue indicated that the changes were added in Python 3 somewhere along the line, and the html5 dictionary I mentioned was available here.
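
The gap between the two dictionaries is easy to see from a Python 3 prompt. A small sketch, assuming Python 3.3 or later, where the tables live in html.entities rather than htmlentitydefs:

```python
from html.entities import html5, name2codepoint

# The HTML 4 table has no entry for apos ...
print("apos" in name2codepoint)

# ... but the html5 table does. Note that its keys carry the trailing semicolon.
print(repr(html5["apos;"]))
```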

At this point, my problem was solved. But I couldn't help but think that there was a better way. After a little searching I found that something as simple as the following solved my problem:

import HTMLParser
HTMLParser.HTMLParser().unescape(text)
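
The snippet above is Python 2. On Python 3.4 and later the same job is done by html.unescape, which also handles apos and hex character references. A minimal sketch, with a made-up caption line standing in for the real data:

```python
import html

# Hypothetical caption line with named entities and hex character references.
caption = "&#x266a; It&apos;s &lt;music&gt; &#x266a;"
print(html.unescape(caption))
```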

All of this seemed like a good exercise in playing with Python, and seemed worth recording.

PHP and Zend

Friday, February 4, 2011 categories: linux, php, work, zend

I was asked to take a quick look at getting a PHP extension to work. Little did I know the can of worms that particular question would open.

Some of my coworkers were trying to evaluate a piece of PHP-based support call and ticket tracking software called Kayako. This particular web application uses Zend Guard to protect its code. Kayako hasn't released a version encoded with Zend Guard 5.5, which added support for PHP 5.3. This turned out to be quite important when I attempted to bring up a virtual machine to aid my coworkers in testing Kayako.

After getting the virtual machine set up with Ubuntu 10.04.1, using the OpenSSH Server and LAMP Server canned configuration options to get a good set of necessary base packages for this particular use case, I did a safe-upgrade, rebooted (new kernel), and added the "zendframework" package. The next step was to add support for running Zend Guard encoded applications, which is provided by the Zend Guard Loader (formerly called the Zend Optimizer). This is done by way of a binary that is set up in php.ini according to the included README:

[zend]
zend_extension=/opt/ZendGuardLoader-php-5.3-linux-glibc23-i386/php-5.3.x/ZendGuardLoader.so
zend_loader.enable=1
zend_loader.disable_licensing=0

A quick reload of Apache, and "with Zend Guard Loader v3.3" appears in the phpinfo() page. However, since I used the current version of PHP and the corresponding Zend Guard Loader, I was doomed to fail, as Kayako was encoded with Zend Guard for PHP 5.2. Apparently, Zend Guard Loader for PHP 5.3 does not support loading files encoded for earlier versions. This I discovered after searching for the error from my Apache error_log:

[Thu Feb 03 22:31:21 2011] [error] [client 192.168.10.26] PHP Fatal error:  Incompatible file format:  The encoded file has format major ID 3, whereas the Loader expects 4 in /var/www/kayako/setup/index.php on line 0
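
The two numbers in that message are the whole story: the file's format ID versus the one the Loader wants. A small sketch for pulling them out of a log line follows. The function name and the regex are illustrative only, written against the exact wording of the message above.

```python
import re

# Matches the wording of the Loader's "Incompatible file format" message.
MISMATCH_RE = re.compile(
    r"format major ID (?P<found>\d+), whereas the Loader expects (?P<expected>\d+)"
)

def format_mismatch(log_line):
    """Return (found, expected) major IDs, or None if the line does not match."""
    m = MISMATCH_RE.search(log_line)
    if m is None:
        return None
    return int(m.group("found")), int(m.group("expected"))

line = ("PHP Fatal error:  Incompatible file format:  The encoded file has "
        "format major ID 3, whereas the Loader expects 4 in "
        "/var/www/kayako/setup/index.php on line 0")
print(format_mismatch(line))
```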

I guess that this shouldn't be so frustrating. But I do find it rather annoying that there is no backward compatibility. This is somewhat exacerbated by the fact that PHP 5.3 was first released in June 2009, nearly 2 years ago. It is a problem not only from the perspective of maintaining legacy versions (granted, 5.2 still seems to be actively maintained), but also because of the release lock-in: since newer Loader releases cannot load older encoded files, customers of the Zend Guard encoder need to keep updating (or at least releasing up-to-date encoded versions of their supported releases). I think that the associated cost of protecting the PHP source with this mechanism is too high.

However, Zend to the rescue! I don't even need to pin the 5.2 series of PHP in my package managed Ubuntu environment. There is a Zend Server Community Edition which is a bundle that includes all of the necessary components to run a Zend Guard encoded application of either the previous (5.2) or current (5.3) version. This bundle is available for various platforms (including Linux and Windows) and certainly seems like the easiest way to get a Zend Guard application up and running without fighting version incompatibilities. But that isn't necessarily the easiest way to manage a particular web application, unless you subscribe to the single-service-per-machine philosophy.

So, while I find the whole situation very frustrating, specifically the whole no backward compatibility thing, I do think it's nice that the "community edition" bundle is available to get applications up and running.

Android Alarms Icon

Saturday, January 15, 2011 categories: a855, mobile, motorola droid

I've been having some trouble with my phone, since early December. There have been phantom inputs, random scrolling, and unresponsive areas of the touch screen on my Motorola Droid A855. I could temporarily resolve the issues by applying even pressure across the touch screen, but that's certainly not a long term solution.

And so I called up Verizon support (*611). The person had me explain what was going on, and I tried my best to suggest that a factory reset wasn't going to solve my problem. Nevertheless: low-tier support, following the script, factory reset here I come. I then spent the rest of the day trying to restore my configuration, which was an ultimately futile effort, as I ended up killing the Market application when it apparently hung trying to reinstall all of my applications. Killing it resolved the problem and allowed for manual installation of the various ones I still wanted. (At least it preserved the list of formerly-installed applications!)

Regarding other pieces of data, I was not so lucky. Apparently various applications store data in /data/data/<APP>/databases/<DB NAME>.db which is a sqlite database. This tidbit comes in handy later, but /data/data/* is on the NAND and is wiped clean during the factory reset. Goodbye SMS messages (but they are safely in GMail thanks to SMS Backup). Goodbye history of called numbers, and thus the handy favorites list. Goodbye my list of Shazam "tagged" songs that I might have wanted to sift through some day. And goodbye Angry Birds progress.

I think some of that stuff should be backed up onto the SD card in some signed format (if the concern is tampering) and sucked back in after the reset.

Little did I know that I would be even more annoyed by the lack of an "Alarms" shortcut in the applications drawer. I didn't realize how handy it was until it was gone, like most truly useful and under-appreciated things I suppose. So began the quest of complaining to my fellow Droid-wielding coworkers, one of whom also happened to have the coveted "Alarms" shortcut still on his Droid's desktop.

A Google search led to most of the answer. However, upon first try that did not work. I wasn't sure how to access the database, and even after reviewing the recommended applications in the market, they required root access to edit other applications' databases, which makes perfect sense. That led down a long road of understanding the current root escalation techniques, and ultimately to psneuter. The "trick" is pretty cool, if you take a look at the source.

With that in place, and after learning that /data/local/tmp is writable (the SD card doesn't work because setting permissions doesn't work there) I was off to the races.

adb pull /data/data/com.android.launcher/databases/launcher.db
sqlite3 launcher.db
sqlite> .tables
android_metadata  favorites
sqlite> .mode line
sqlite> .headers on
sqlite> select * from favorites where title like '%clock%';
_id = 18
title = Clock
intent = #Intent;action=android.intent.action.MAIN;category=android.intent.category.LAUNCHER;launchFlags=0x10200000;component=com.google.android.deskclock/com.android.deskclock.DeskClock;end
container = -100
screen = 3
cellX = 0
cellY = 1
spanX = 1
spanY = 1
itemType = 0
appWidgetId = -1
isShortcut =
iconType = 0
iconPackage =
iconResource =
icon =
uri =
displayMode =
sqlite> update favorites set title='Alarms',intent='#Intent;action=android.intent.action.MAIN;category=android.intent.category.LAUNCHER;launchFlags=0x10200000;component=com.google.android.deskclock/com.android.deskclock.AlarmClock;end' where _id=18;
sqlite> .quit
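
The same edit can be scripted with Python's sqlite3 module instead of typed into the shell. This is a sketch only: the helper name retitle_favorite is made up, the intent string follows the format of the Clock row shown above, and the demo runs against an in-memory stand-in table with just the three columns it touches.

```python
import sqlite3

# Intent in the same format as the Clock row, pointed at the AlarmClock activity.
ALARMS_INTENT = (
    "#Intent;action=android.intent.action.MAIN;"
    "category=android.intent.category.LAUNCHER;"
    "launchFlags=0x10200000;"
    "component=com.google.android.deskclock/com.android.deskclock.AlarmClock;end"
)

def retitle_favorite(conn, row_id, title, intent):
    """Point one favorites row at a new title and intent; return rows changed."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE favorites SET title = ?, intent = ? WHERE _id = ?",
            (title, intent, row_id),
        )
    return cur.rowcount

# Stand-in for launcher.db: only the columns used here, with made-up contents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE favorites (_id INTEGER PRIMARY KEY, title TEXT, intent TEXT)")
conn.execute("INSERT INTO favorites VALUES (18, 'Clock', 'placeholder')")
changed = retitle_favorite(conn, 18, "Alarms", ALARMS_INTENT)
print(changed, conn.execute("SELECT title FROM favorites WHERE _id = 18").fetchone())
```

Against the real file, the connection would be sqlite3.connect("launcher.db") on the pulled copy, which still has to be put back on the phone afterwards.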

It took a few tries to get the intent component value right (the post from earlier didn't have a value that worked for 2.2.1) but now it works! And I'm quite happy to have that special purpose icon back. As trivial as that may seem.

Update:
After moving to my replacement phone, I had to use the SD card as an intermediary in order to make editing the database work. That seems very strange to me, but was mentioned somewhere else as well. With no cp (?!?) I had to resort to cat /data/data/com.android.launcher/databases/launcher.db > /mnt/sdcard/launcher.db, mount the card on my computer, and then edit the file with sqlite3. Finally, I used cat to overwrite the old file, and my 'Alarms' shortcut survived a reboot.

This post is going to be a collection of links and information, to organize what I gathered for a project. The goal of the project is to create a temperature monitoring system that could be set up and plugged into a network connection, and ideally have that monitoring device be accessible over the Internet.

My initial information gathering suggested that something like an Arduino + Ethernet Shield would be a good base device, or maybe a plug-style computer (like the SheevaPlug or Pogoplug). I also looked at the Netduino Plus briefly, and may look at it again in the future.

Reasonably accurate temperature sensors are available as Dallas 1-wire bus devices, which seemed like a reasonable way to go since they can be read over RS-232 using a fairly simple circuit. There are also, apparently, 1-wire to USB bridge chips available (even supported in Linux). One such bridge is used in the LinkUSB, which is a convenient form factor that is supposed to work with long runs of twisted pair cable to connect the sensor, and abstracts the interface to an RS-232 based protocol. Dallas 1-wire seems to support the lengths of cable I would need.
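
As an illustration of how little code the Linux route might need, here is a sketch that parses a temperature reading. It assumes the sensor is exposed by the kernel's w1_therm driver as a two-line w1_slave text file. That driver, the function name, and the sample reading are all assumptions for the example, not part of this project's notes.

```python
def parse_w1_slave(text):
    """Return degrees Celsius from w1_slave-style text, or None on a bad read."""
    lines = text.strip().splitlines()
    # The first line ends in YES only when the CRC check passed.
    if len(lines) < 2 or not lines[0].rstrip().endswith("YES"):
        return None
    # The second line ends in t=<millidegrees Celsius>.
    _, sep, millidegrees = lines[1].rpartition("t=")
    if not sep:
        return None
    return int(millidegrees) / 1000.0

# Illustrative sample in the assumed format, not a reading from real hardware here.
sample = (
    "72 01 4b 46 7f ff 0e 10 57 : crc=57 YES\n"
    "72 01 4b 46 7f ff 0e 10 57 t=23125\n"
)
print(parse_w1_slave(sample))
```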

I also looked quite a bit at RS-485. It would be another option, but I think there would be more implementation cost. The sensors that I saw were rather expensive. But FTDI offers convenient adapters for about the same cost as the LinkUSB. Looks like Sparkfun has some even cheaper!

I'll add another post if I ever execute this project.

SLUG Mini-Presentation on SSH Tips/Tricks

Friday, September 17, 2010 categories: linux, slug, ssh

I was asked if I could take the time to do a short, mini-presentation on SSH for a Siouxland Linux Users Group (SLUG) meeting. Like most things, it got put off until about a week before, and then put off until the week of the meeting. At least I started getting things ready the night before... Anyway, I figured I could put up here what I covered. Good excuse to post I guess.

For basic information, I covered syntax:

ssh host -l user
ssh user@host

Using scp and sftp to transfer files:

scp a slug02:
scp slug02:b ./
sftp slug02

Port Forwarding:

-D 12345
-L 12345:host:port

Optionally using less encryption for file transfers:

-c blowfish
-c none

I also mentioned using compression for X forwarding, but didn't set up a VM with X running to demonstrate it:

-C

Then, for more advanced topics, I talked about ~/.ssh/config and what can be put in there.

For example, setting ControlMaster for connections to use a single SSH connection for multiple sessions:

Host *
   ControlMaster auto
   ControlPath /tmp/%r@%h:%p

I couldn't recall the syntax to disable ControlMaster for a specific connection from the command line, but it is:

ssh -o ControlMaster=no slug02

Creating a basic connection shortcut:

Host s1
   HostName slug01
   User user1

I also showed that the various settings that can be used here are documented in ssh_config(5), along with the config-file analogs of the earlier command line parameters:

   LocalCommand blah
   DynamicForward 8080
   LocalForward 12345 otherpc:3389
   LocalForward 12346 otherpc:80
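
The mapping between the config keywords and the command line flags is mechanical enough to sketch in a few lines of Python. The function name is made up for illustration, and only the two forwarding keywords are special-cased. Anything else is passed through with -o, as in the ControlMaster example earlier.

```python
def forwards_to_args(options):
    """Translate (keyword, value) pairs from ssh_config into ssh command line flags."""
    args = []
    for keyword, value in options:
        if keyword == "DynamicForward":
            args += ["-D", value]
        elif keyword == "LocalForward":
            # Config form is "port host:hostport"; -L wants "port:host:hostport".
            listen, target = value.split()
            args += ["-L", f"{listen}:{target}"]
        else:
            # Any other keyword can be given on the command line with -o.
            args += ["-o", f"{keyword}={value}"]
    return args

options = [
    ("DynamicForward", "8080"),
    ("LocalForward", "12345 otherpc:3389"),
    ("LocalForward", "12346 otherpc:80"),
]
print(["ssh", *forwards_to_args(options), "slug02"])
```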

I also mentioned fail2ban and denyhosts as ways to protect your SSH server from unwanted brute forcing. I personally use denyhosts, and showed how it populates /etc/hosts.deny with the hosts responsible for repeated failed logins, as well as how whitelisting in /etc/hosts.allow can be done.

I think it was a reasonably successful mini-presentation.