2010-05-18

Source Code Management -- A Minor Success Story

In my quest to become an open source [UNIX-y] programmer (or hacker :-o), I try to learn about development tools and practices that make development easier. One such category is source code management (SCM) tools (also called version control software). Essentially, they keep track of the history of changes that you make to your code[1], as well as who made each change, when they made it, and even comments from the author explaining it. This greatly helps developers manage a project because it lets us keep track of what we've done and even what we're doing right now. It also lets you easily undo changes that you've made and share changes between developers. There are plenty of other benefits, so don't consider this a complete list.

For the record I've found Git to be the best SCM. At j0rb, however, we use Subversion. I myself learned about Subversion a year or so after college and taught myself to use it. Then I introduced j0rb to it and eventually managed to get it adopted. More recently (~past six months) I started using Git after watching Google Tech Talks on YouTube of Linus Torvalds and Randal Schwartz explaining why Git is the superior SCM and why everything else sucks. I didn't want to be "stupid and ugly" so I naturally adopted Git. Now I'd like to switch j0rb over to Git, but I mostly work with Windows-y, GUI-y programmers that are afraid of something like Git. Needless to say, they are refusing to change for now. We all know what that makes them.

Anyway, I've been working on something at j0rb for the past ~week. With Git, I could easily branch and/or commit locally as I go to keep changes separate, but with Subversion, branching and merging are expensive and painful, and there's no local repository to commit to. Everything is centralized, so committing would put my changes in the central repository that everyone uses. That would mean that the application that my colleagues and I are working on would be broken until I'm done, preventing others from doing any work of their own.

Branching (Side Tracked)

One of the nicest things about Git's branching mechanism is that I don't need to go anywhere in the file system. When I change branches in Git, Git automatically updates the working directory that I'm already in to match the branch that I'm switching to (checking out, technically). With Subversion, a branch is really just a copy of some tree; it's a duplicate of a subtree. In order to work on the new branch, I need to check it out somewhere else on my file system (or I could remove my working directory and overwrite it with the new branch). To demonstrate:
# With Git, it's simple. Create and checkout a new branch named 'newbranch'
# based on the master branch at the current HEAD of the branch (last commit).
bamccaig@castopulence:~/src/example$ git status
# On branch master
nothing to commit (working directory clean)
bamccaig@castopulence:~/src/example$ git checkout -b newbranch master
Switched to a new branch "newbranch"
bamccaig@castopulence:~/src/example$ git status
# On branch newbranch
nothing to commit (working directory clean)
bamccaig@castopulence:~/src/example$
As can be seen above, with one simple command Git has created a branch and I'm already in it! I didn't have to do anything else. I can just start working. And Git is fast when it comes to branching so I didn't have to wait for anything. The new branch is just a pointer to the same commit that the master branch points to, so there was no need for expensive duplication of data. Subversion tells a different story, however:
# With Subversion, it's quite painful and it's also pretty slow. Create
# and checkout a new branch named 'newbranch' based on the repository
# trunk in the HEAD revision. Note that in my experience it's best to do
# branching in Subversion server-side. At least if you ever intend to
# merge back into the original branch...
bamccaig@castopulence:~/src/example/trunk$ svn cp -m 'Example...' \
        file:///home/bamccaig/src/example.repo/trunk \
        file:///home/bamccaig/src/example.repo/branches/newbranch

Committed revision 2.
bamccaig@castopulence:~/src/example/trunk$ svn up .. && \
        cd ../branches/newbranch
A    ../branches/newbranch
A    ../branches/newbranch/foo
A    ../branches/newbranch/bar
A    ../branches/newbranch/baz
Updated to revision 2.
bamccaig@castopulence:~/src/example/branches/newbranch$ 
Notice that Subversion basically requires me not only to type out a semi-lengthy URL (twice!) and download another complete copy of the original branch (in this case branches/newbranch, which is a copy of trunk), but also to move around in the file system. In this simple example, I had the entire tree in my working copy (from the root of the repository, including branches, tags, and trunk). Often, though, you aren't interested in all of the many branches and tags that exist, so you'll only check out the branch(es) that you're interested in. In that case, you have to type out a checkout command with what's probably a semi-lengthy URL and then type out a command to change to the new branch's working directory. In short, branching in Subversion is just no fun. And don't get me started on merging... :'(
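
The claim that a new Git branch costs nothing to create can be checked in a throwaway repository. The following is only a minimal sketch, assuming git is installed; the temporary directory, the empty commit, and the placeholder identity are all made up for the illustration and are not part of the example project above.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Force the initial branch name so the sketch does not depend on local config.
git symbolic-ref HEAD refs/heads/master
# Placeholder identity (an assumption) so the commit works on any machine.
git -c user.name=example -c user.email=example@example.invalid \
    -c commit.gpgsign=false commit -q --allow-empty -m 'Initial commit'

# Same command as in the transcript above: create the branch and switch to it.
git checkout -q -b newbranch master

master_commit=$(git rev-parse master)
branch_commit=$(git rev-parse newbranch)
current_branch=$(git symbolic-ref --short HEAD)

echo "master:    $master_commit"
echo "newbranch: $branch_commit"
echo "on branch: $current_branch"
```

Both names resolve to the same commit ID, which suggests that creating the branch only added a new name for an existing commit rather than copying any files.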

Back To The Story

OK, so here I am with a lot of changes to my working copy (give or take, 15 added or modified files). I come back in to work on Monday after the weekend and start working on a separate, though related, project. When I finally get that done in mid-afternoon I get back to my original project, rebuild it, and run it (something I generally do to get an idea of what state things are in and remind myself what I was working on last; again, Subversion doesn't help much when there are 15 added or modified files). To my dismay, this ASP.NET project throws a StackOverflowException immediately upon launching in Visual Studio's development Web server, which subsequently "crashes" the server since there's really no recovering from that. Unfortunately, with a StackOverflowException, there is apparently no stack trace (something I discovered right then) because the stack[2] itself is in an unholy state. On to tracking down what was causing the problem. But how? A stack overflow usually means you're either nesting function calls too deeply (often a result of runaway recursion) or you've allocated too much memory on the stack.

Here's where the SCM comes in handy (albeit, this particular SCM still comes up short). I tried looking at all of the changes I had made since my last commit to see if I could spot anything suspicious.
[bamccaig@j0rb:foo]$ svn diff | less -S
Unfortunately, nothing stood out. I had added some new LINQ to SQL entities to the project and added some code to work with them. Much of the new code was generated for me by Visual Studio. The code I had worked on didn't stand out as a culprit.

This is where having Cygwin installed, a UNIX-like environment for Windows, comes in handy[3]. I decided the most efficient way for me to find the problem was to undo the changes I had made, confirm that it worked, and then redo the changes bit by bit until I encountered the StackOverflowException. This way I would know where to look for problems: the last applied changes. UNIX and UNIX-like operating systems (and Cygwin, as mentioned above) have tools that make this easy. First, I generate patches[4] with the SCM, Subversion, and a little shell scripting.
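[bamccaig@j0rb:foo]$ svn st | grep '^M' | sed -r 's/^M +//' | \
        while IFS= read -r f;
do
    # Save each modified file's changes as a patch, then undo them.
    # Reading line by line keeps paths that contain spaces intact.
    svn diff "$f" > "$f.patch" && svn revert "$f";
done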
For every file, foo/bar, that was modified since the last commit, I get a file foo/bar.patch that stores the changes made. Then I undo those changes (svn revert). For added files, the changes are irrelevant because they're basically the entire file anyway so instead I just temporarily remove them from the Visual Studio project. I can easily get a list of which files to remove though using Subversion and the shell again.
[bamccaig@j0rb:foo]$ svn st | grep '^A'
With all the changes undone (there were no deleted files in my working copy) I was able to retest the code. Lo and behold, it runs fine now. This confirmed that it was indeed me that broke it (damn). That came as no surprise though because I had been working on it for close to a week without problems and without pulling changes from the central repository. I was the only one making changes.
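
The status filtering above can be tried without a repository by feeding the same grep and sed pipeline some sample text. This is only a sketch: the status lines below are hypothetical stand-ins for real `svn st` output (one status letter, padding, then the path), and the file names are invented.

```shell
# Hypothetical `svn st` output; a real working copy would produce its own list.
status_output='M       src/Foo.cs
A       src/NewEntity.cs
M       src/Data/Bar.cs
?       notes.txt'

# Keep only modified (M) or added (A) entries and strip the status column.
# Note: this simple sed would also eat leading spaces that belong to a path.
modified=$(printf '%s\n' "$status_output" | grep '^M' | sed 's/^M *//')
added=$(printf '%s\n' "$status_output" | grep '^A' | sed 's/^A *//')

echo "Modified (diff to a .patch file, then revert):"
echo "$modified"
echo "Added (remove from the project for now):"
echo "$added"
```

Unversioned entries (the `?` line) are ignored by both filters, which is why leftover files such as the generated patches don't get in the way.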

Now comes the fun part. Applying each patch one at a time and testing. It might sound tedious, but imagine how much more tedious it would be without the SCM or UNIX tools. To apply the patches from before, we use (surprise) the patch program.
[bamccaig@j0rb:foo]$ patch -p0 -ui path/to/the.patch
The -p0 option is required to leave paths in the patches alone. The default behavior for patch is to strip off the directory part, leaving only the filename. That only works if the file you're patching is in the current working directory. Mine are all over the working tree. The paths just happen to be correct from where I'm working though so the 0 says to strip nothing from them. The -u option tells patch that the patch file is in unified format, which is what Subversion's diff sub-command outputs by default. patch will likely figure this out on its own, but why waste the resources? :P The -i option specifies the patch file to use (which is followed by the path to the file).
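
The effect of -p0 is easy to reproduce in a scratch directory. The sketch below assumes GNU diff and patch; the directory layout and file contents are invented, `diff --label` is used only to imitate the way a Subversion diff names the same path in both header lines, and the `mv` stands in for `svn revert`.

```shell
set -e
work=$(mktemp -d)
cd "$work"
mkdir -p src/data
printf 'one\ntwo\nthree\n' > src/data/example.txt

# Keep a pristine copy, edit the file, and record the change as a unified diff.
cp src/data/example.txt src/data/example.txt.orig
printf 'one\n2\nthree\n' > src/data/example.txt
# diff exits with status 1 when the files differ, which is expected here.
diff -u --label src/data/example.txt --label src/data/example.txt \
    src/data/example.txt.orig src/data/example.txt > change.patch || [ $? -eq 1 ]

# Undo the edit (stand-in for `svn revert`).
mv src/data/example.txt.orig src/data/example.txt

# Apply from the top of the tree; -p0 keeps "src/data/example.txt" intact,
# so patch finds the file two directories down without any prompting.
patch -p0 -u -i change.patch
result=$(cat src/data/example.txt)
echo "$result"
```

Without a -p option, patch would look for a plain example.txt in the current directory instead, which is the situation described above.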

However, since I'm using Cygwin, with its UNIX newlines[5], on a Windows-based project, patch is going to somewhat mangle my source code by filling my files with the wrong newline type. It's not a problem. The code will still work, but Subversion will see every line as a change, which will make reviewing the changes later on rather difficult. To fix this, we can use the unix2dos tool to convert them back. To save myself a lot of tedious typing, I created a bash function for all of this.
p() {
    # Quote everything so paths with spaces survive; keep variables local.
    local patch="$1"
    local file="$(dirname "$patch")/$(basename "$patch" .patch)"
    patch -p0 -ui "$patch" && unix2dos "$file" && rm -i "$patch"
}
This way instead of typing out that long patch ... unix2dos command line, I can just say `p path/to/file.patch'. After each patch is applied, I'd confirm that it applied properly and remove the patch file to mark which ones I had done. Then I'd refresh the Visual Studio solution, rebuild it, and run it. If there was no StackOverflowException then I'd move on to the next patch. Once again, the SCM and UNIX tools allow me to easily track my progress. The following function listed which patches I had yet to apply:
[bamccaig@j0rb:foo]$ c() {
    # Match only paths ending in .patch, not anything containing "patch".
    svn st | grep '\.patch$';
}
I used that list to try to apply patches in order of dependencies to avoid unrelated problems.
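
The apply-one-patch-then-test routine can be sketched as a loop. This is a self-contained illustration rather than what was actually run: the two text files, their patches, and the `check` function (a stand-in for "rebuild and run the application") are all hypothetical, and GNU diff and patch are assumed.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# make_patch FILE NEW_CONTENT: write FILE.patch turning FILE into NEW_CONTENT.
make_patch() {
    printf '%s\n' "$2" > "$1.new"
    diff -u --label "$1" --label "$1" "$1" "$1.new" > "$1.patch" || [ $? -eq 1 ]
    rm "$1.new"
}

printf 'alpha\n' > one.txt
printf 'beta\n' > two.txt
make_patch one.txt 'alpha v2'
make_patch two.txt 'beta BUG'   # this patch introduces the "bug"

# Hypothetical health check: succeeds while no file contains the marker.
check() {
    ! grep -q BUG one.txt two.txt
}

culprit=
for p in one.txt.patch two.txt.patch; do
    patch -s -p0 -u -i "$p"
    if check; then
        rm "$p"        # applied and still healthy: remove it to mark it done
    else
        culprit=$p     # first patch after which the check fails
        break
    fi
done
echo "first failing patch: ${culprit:-none}"
```

The patch files that remain on disk double as the to-do list, which is the same trick the `c` function relies on.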

The Suspense Is Killing Me!!!11

So what was the bug in the code causing the StackOverflowException? I have no clue... :( After going through the above, I have seemingly applied all patches and added all new files back to the project, and it works fine. The only bug I encountered in the code had to do with a data-layer interface that I recently tried (again, I now realize) to get fancy with.

Essentially, LINQ to SQL is handled through a DataContext class that is generated for you by Visual Studio. When you query for a set of entities, they are linked to the DataContext that retrieves them. When you make changes to them, the DataContext knows and uses those changes to generate SQL that ultimately updates the database. However, oftentimes the changes are coming from the client, or take place over a few layers of the application. It's hard in some of these instances to maintain the original DataContext and the entities that are attached to it. It's particularly difficult when using serialization to communicate with a user agent. Fortunately, there is an interface to attach detached entities. Unfortunately, it requires both the modified object and the original unmodified object to know what record it's dealing with. This means that something as simple as saving an entity can require an entity-specific query, and it just generally results in code bloat. To get around this, I created an interface that returns a LINQ Expression that identifies the entity and implemented it for each entity. This way, the framework that I've developed can automatically fetch the original object, reducing the bloat to a simple call:

new LinqManager().Save(entity); // INSERT or UPDATE.
new LinqManager().Delete(entity); // DELETE.

Anyway, getting even lazier, I also added an interface that returns the record identifier (all tables of this database have INT record identifiers). This allowed me to generate a typed LINQ Expression using generics and that interface.

public Expression<Func<IEntity, bool>> DefaultIdPredicate(
        IEntity e)
{
    return o => o.GetRecordId() == e.GetRecordId();
}

It sounds good and compiles happily, but it fails hard at run-time because LINQ to SQL can't translate the GetRecordId() call into SQL. During the process of applying the above patches I eventually ran into an exception whose message said as much. It was then that I remembered trying it previously (which is why the above method existed already), but it failed and I reverted, leaving the method intact for a future revelation. Instead I'm stuck resorting to ugly type-casting and explicit property access, which I manually re-coded throughout the project. That is the only fix that I made to the code as I applied the changes.

I can only hope that was the problem, though I'm not sure how that could cause a StackOverflowException, and be thankful that I had an SCM, even a poor excuse for one, and a UNIX-like environment to help me through this mess.

References
1. Source code management (version control software) tools aren't limited to tracking source code. They can actually track changes to any set of files (though it may depend on the particular tool), but as a general rule they don't work as well with binary files as they do with text files.

2. If you're unfamiliar with the call stack or stack vs. heap then ask Wikipedia. I'm normally happy to explain it, but I feel exceptionally lazy right now. I'll give you some hints though.

3. Though not as handy as running a UNIX-like operating system, such as Linux, would be. Unfortunately, I'm stuck with Windows at j0rb, but I digress.

4. Patches are essentially instructions for how to change a file automatically. They show the difference between two files, which can be used to automatically modify the original and produce the new one. I feel like I'm doing a horrible job explaining this today so I'm trying to refer to material that will do a better job explaining than I can right now. See here.

5. http://en.wikipedia.org/wiki/Newline.