Search This Blog

2015-01-15

CPAN PRC 2015 - January (Plack::Session::State::URI) - Part 2

See the previous post in this series here.

Things are moving along slowly, but surely. I mentioned the GitHub issue that I created last time with a preview of a tidy branch. To my satisfaction the author replied and said that it looks good. I could submit that any time for a pull-request and technically have completed the challenge for January, but I'm not ready to yet. I would like to see if there are any other tidying tasks to do before I actually submit that branch for merging.

I also brought up the idea of using a proper HTML parser for the form field. Initially, the author sounded quite against the idea due to performance reasons. I tried to explain that an HTML parser is necessary if you want the module to always work properly no matter what HTML it is processing, and as a compromise I suggested toggling between the existing regular expression and a proper HTML parser with a constructor argument. That way users can choose what they want, and hopefully we can pick a good default for the users that just don't care. The module maintainer seemed to like that idea!

He also discussed the existing solution and came to realize that he already has an HTML parser! He is using it to add query string parameters to all anchor (<a>) tags in the document. It is called HTML::StickyQuery and it is a subclass of HTML::Parser. So bonus. We already have an HTML parser and the performance is fine (I assume). There is a problem though. HTML::StickyQuery doesn't appear to allow any hooks into it. I can't see an easy way to do what we need to using the existing HTML::StickyQuery instance.

There is another module, which HTML::StickyQuery actually references in its documentation, called HTML::FillInForm that seems to do exactly what we need here. To add a form field containing the session identifier to all forms. It turns out that HTML::FillInForm is also based on HTML::Parser! Alas, again it doesn't look like there's an easy way to mash these modules together to accomplish both goals with no additional overhead. I'm trying to figure out how best to solve this problem.

A horrible hack of an idea was to attempt to dispatch between both modules' methods depending on which type of tag we are currently processing. My most recent comment on the GitHub issue describes a very ugly way to possibly accomplish it without duplicating any code. The downsides are that it is abusing the modules' internals, attempting to mash them together into a new super object, and then dispatching appropriately. This will only work if the modules' internals obey the assumptions that I am making and continue to do so (upstream changes to the internals could break this solution). It also assumes that each object has distinct hash keys so they won't collide and overwrite each other. Obviously I can't actually go ahead with this idea, but it's interesting anyway to consider it.

A more stable solution might be to manually merge the code from both modules into a new module that accomplishes both. That way upstream changes can't break us. The downside is that we would be duplicating code and we wouldn't automatically get bug fixes or feature enhancements from upstream.

I'm not overly familiar with HTML::Parser, but the way that these subclasses work doesn't impress me much. They are designed to own the entire document. They track their changes in an output string that they just concatenate to as they go. Firstly this makes it impossible to combine these subclasses without parsing multiple times (or resorting to hacks that aren't guaranteed to be stable). Secondly, I am assuming that there is quite a performance penalty for all of the string concatenation. I'm not sure how perl would cope with that, but I suspect that it would be better to store a list or array of strings and join() them in the end. It would be nice if we could manipulate the document structure in object form so that many changes by many different, unrelated modules could occur. Certainly there is a performance hit for keeping this in memory. Perhaps the problem is that HTML::Parser is too low-level for our needs.

I'll have to do some more research to identify if a better option exists. I'm not sure just how reliable HTML::FillInForm and HTML::StickyQuery are anyway. They don't strike me as being very robust. I'm sure they'll work for limited purposes, as I suppose is true for my assigned module too, but I don't think it would be hard to find cases in the wild where they don't work. For example, I saw one was testing the URI scheme for /https?/ and skipping non-matches assuming they were alternative protocols like ftp:// or mailto://. Instead they could have just been local relative URLs with no explicit scheme (unless the implicit scheme is filled in by a third party module or something, but I didn't notice any such thing). Those kinds of links would be unaffected by the use of these modules. In general I can't recommend these modules for robust Web applications, but they might get you by if you have limited time and you're lucky that your Web site is compatible (or you develop it as such).

I have to try to put together a few test cases that break the existing regular expression, and then I can go ahead and try to figure out how to integrate HTML parsing into the <input> injection. I currently plan to submit 3 pull requests. My tidy branch, a new branch with a test case to break the regular expression, and another branch based off of the test case that attempts to integrate a proper HTML parser (this way if the maintainer doesn't like my solution with the HTML parser he can still merge the test case and fix it himself). I currently have another branch, http-status, that just replaces magic numbers in the project with Http::Status constants. I'm not sure yet if the author will like that added dependency (albeit, I don't imagine it's a big or unstable one) so I might keep it separate. Otherwise, I might end up rebasing it into the tidy branch. We'll see...

2015-01-11

CPAN Pull Request Challenge (PRC) 2015 - Introduction / January (Plack::Session::State::URI)

I have signed myself up for the CPAN Perl Request Challenge (PRC) 2015 as organized by Neil Bowers. The idea is that every month you are assigned a CPAN module at random that is hosted on GitHub and you have to submit at least one pull request for it. I heard about it a few days ago when Olaf Alders announced it to the Toronto Perl Mongers list (again?). It sounded like a neat idea so I decided to sign up.

My January assignment is Plack::Session::State::URI. I began by creating an issue on the GitHub repository requesting feedback and guidance from the author. Unfortunately, I haven't received a response yet. The module is fortunately very tiny. Unfortunately, it is based upon Plack, which I have limited experience with (at least I have minimal experience at all...). There are a few things that could obviously use some work.

  • There is very little documentation. In fact, there's so little documentation that I just have to infer what it does myself. Fortunately, I have a background in Web programming so it was no real mystery. Unfortunately, that still makes it difficult for me to test it. And worse, it makes it difficult for me to imagine how changes I make will affect users of the module.
  • There are very few tests. Granted, the module only has about 60 SLOC so how much testing is really needed? That said, I believe that several more tests could be possible. Many of them will probably require a richer understanding of the Plack::Middleware framework than I possess. It would be lovely to acquire that extra needed experience, and perhaps this is a good opportunity to do it, but I'll need to try to balance it with the other aspects of life.
  • Parsing HTML with a regular expression. Of course, regular expressions cannot properly parse HTML, SGML, or XML. You need a proper parser to do that correctly. Several such modules exist on CPAN. Unfortunately I only have experience using XML parsers in Perl, and HTML is far too forgiving for an XML parser to get along with it. I'll need to learn to use an HTML parser with the power to edit the document contents, preferably without altering the formatting of the existing document. This is probably the most useful thing that I could do for this project at the present time, but I'll need to identify the module to use and learn to use it properly first. If I do pursue this problem then I can also add test cases that break the regular expression parsing and use that to verify that my fix works.

Since I haven't received a response yet from the module maintainer I have begun doing my best to come up with work to do. We all have personal preferences when it comes to code formatting, structure, and style. I've tried to avoid rewriting too much code, but at the same time I've tried to address things that I believe could be improved. I expect the author will not like some of the changes that I've made so I'm trying to make it clear that I can rebase as necessary to keep the changes that he likes while getting rid of the changes that he doesn't. I'm also trying to keep my commits nice and small so they're easy to understand and hopefully are small enough that controversial changes are isolated and easy to leave out.

I currently have a work-in-progress wip/tidy branch. This will probably eventually become my first pull request for this module. That branch is not considered safe yet. I am currently force pushing to it as I edit history and try to get things just right. I'm pushing mostly as a backup mechanism. In order to publish something that the module maintainer can review without worrying about the branch getting destroyed out from under him I have published a static snapshot to preview/tidy/1 and submitted an issue to document it. The idea is that until the wip/tidy branch is read to be merged I will continue to rebase it and publish snapshots to incrementing integer revisions as in preview/tidy/2, preview/tidy/3, ... preview/tidy/n. When it's finally done I'll probably delete wip/tidy and publish a tidy branch instead for the pull request. I have lots of experience with Git, but I have no real experience with pull requests. I mostly use Git by myself so I haven't had a need for pull requests and generally don't have to worry about people tripping over history edits either. One of the things I can hope to learn from this experience is how to do nice pull requests and also how to manage history in the public eye without sacrificing history cleanliness nor causing undue pain for others.

Hopefully I hear back from the maintainer soon so I have an idea which direction to go in. I quite enjoy writing (just look around you!) so I'd be happy to add rich documentation to this module too. I just don't know exactly how robust it is and don't want to go making documentation up for it. It would be nice to get feedback directly from the author or others who have used it so that I know better what the strengths and weaknesses are and could document them appropriately. I wouldn't want to lure people into wasting time with this module if it can't do what they need, and similarly I wouldn't want to chase them away if it will do exactly what they need. Perhaps I should work on setting up a basic Web site with it and seeing just how well it works. That is quite a lot of work though. There are advantages to it, but these days I am typically not a ball of energy at the end of a work day. We'll just have to see what I can accomplish in the coming weeks.