Search This Blog

2015-01-15

CPAN PRC 2015 - January (Plack::Session::State::URI) - Part 2

See the previous post in this series here.

Things are moving along slowly, but surely. I mentioned the GitHub issue that I created last time with a preview of a tidy branch. To my satisfaction the author replied and said that it looks good. I could submit that any time for a pull-request and technically have completed the challenge for January, but I'm not ready to yet. I would like to see if there are any other tidying tasks to do before I actually submit that branch for merging.

I also brought up the idea of using a proper HTML parser for the form field. Initially, the author sounded quite against the idea due to performance reasons. I tried to explain that an HTML parser is necessary if you want the module to always work properly no matter what HTML it is processing, and as a compromise I suggested toggling between the existing regular expression and a proper HTML parser with a constructor argument. That way users can choose what they want, and hopefully we can pick a good default for the users that just don't care. The module maintainer seemed to like that idea!

He also discussed the existing solution and came to realize that he already has an HTML parser! He is using it to add query string parameters to all anchor (<a>) tags in the document. It is called HTML::StickQuery and it is a subclass of HTML::Parser. So bonus. We already have an HTML parser and the performance is fine (I assume). There is a problem though. HTML::StickyQuery doesn't appear to allow any hooks into it. I can't see an easy way to do what we need to using the existing HTML::StickyQuery instance.

There is another module, which HTML::StickyQuery actually references in its documentation, called HTML::FillInForm that seems to do exactly what we need here. To add a form field containing the session identifier to all forms. It turns out that HTML::FillInForm is also based on HTML::Parser! Alas, again it doesn't look like there's an easy way to mash these modules together to accomplish both goals with no additional overhead. I'm trying to figure out how best to solve this problem.

A horrible hack of an idea was to attempt to dispatch between both modules' methods depending on which type of tag we are currently processing. My most recent comment on the GitHub issue describes a very ugly way to possibly accomplish it without duplicating any code. The downsides are that it is abusing the modules' internals, attempting to mash them together into a new super object, and then dispatching appropriately. This will only work if the modules' internals obey the assumptions that I am making and continue to do so (upstream changes to the internals could break this solution). It also assumes that each object has distinct hash keys so they won't collide and overwrite each other. Obviously I can't actually go ahead with this idea, but it's interesting anyway to consider it.

A more stable solution might be to manually merge the code from both modules into a new module that accomplishes both. That way upstream changes can't break us. The downside is that we would be duplicating code and we wouldn't automatically get bug fixes or feature enhancements from upstream.

I'm not overly familiar with HTML::Parser, but the way that these subclasses work doesn't impress me much. They are designed to own the entire document. They track their changes in an output string that they just concatenate to as they go. Firstly this makes it impossible to combine these subclasses without parsing multiple times (or resorting to hacks that aren't guaranteed to be stable). Secondly, I am assuming that there is quite a performance penalty for all of the string concatenation. I'm not sure how perl would cope with that, but I suspect that it would be better to store a list or array of strings and join() them in the end. It would be nice if we could manipulate the document structure in object form so that many changes by many different, unrelated modules could occur. Certainly there is a performance hit for keeping this in memory. Perhaps the problem is that HTML::Parser is too low-level for our needs.

I'll have to do some more research to identify if a better option exists. I'm not sure just how reliable HTML::FillInForm and HTML::StickyQuery are anyway. They don't strike me as being very robust. I'm sure they'll work for limited purposes, as I suppose is true for my assigned module too, but I don't think it would be hard to find cases in the wild where they don't work. For example, I saw one was testing the URI scheme for /https?/ and skipping non-matches assuming they were alternative protocols like ftp:// or mailto://. Instead they could have just been local relative URLs with no explicit scheme (unless the implicit scheme is filled in by a third party module or something, but I didn't notice any such thing). Those kinds of links would be unaffected by the use of these modules. In general I can't recommend these modules for robust Web applications, but they might get you by if you have limited time and you're lucky that your Web site is compatible (or you develop it as such).

I have to try to put together a few test cases that break the existing regular expression, and then I can go ahead and try to figure out how to integrate HTML parsing into the <input> injection. I currently plan to submit 3 pull requests. My tidy branch, a new branch with a test case to break the regular expression, and another branch based off of the test case that attempts to integrate a proper HTML parser (this way if the maintainer doesn't like my solution with the HTML parser he can still merge the test case and fix it himself). I currently have another branch, http-status, that just replaces magic numbers in the project with Http::Status constants. I'm not sure yet if the author will like that added dependency (albeit, I don't imagine it's a big or unstable one) so I might keep it separate. Otherwise, I might end up rebasing it into the tidy branch. We'll see...

No comments:

Post a Comment