Saturday, August 16, 2014

Hidden in Plain Sight

With the DVD/GOG version of Pegasus Prime, there was a slight problem before release. We had an invalid function call entering the three new chase sequences when compiled in gcc with optimizations. I was unable to figure out the exact cause at the time and I ended up writing a hack around it in final release.

Since a bad function was getting called, I had feared gcc was overwriting a return address somewhere and sending the program counter where it shouldn't be. valgrind wasn't helping and only showed the after-effects of the bad function call. It was pretty hard to pinpoint in gdb too, due to the calling function being called numerous times during execution without breaking. I had shelved the issue for some time so I could return later, perhaps with another idea of tackling it. I found my hope in the AddressSanitizer.

Armed with my shiny new PC and gcc 4.8.1, I recompiled with the address sanitizer to see what I would get. The game would now crash as soon as the sequence would start, due to the sanitizer kicking in. The information the sanitizer gave helped in really one way: I had a way to make it stop as soon as it broke from the stack buffer overflow error. Perhaps not quite the way the tool was intended to be used, but it was enough of a hint for me.

With some logging to a file, I saw that it crashed here the first time _inputHandler changed. Going with LordHoto's suggestion to check the vtable of the pointer, I noticed something funny: It was the vtable for the wrong class!

Once I saw where the _inputHandler field was populated, I quickly saw what my mistake was. Instead of relying on the compiler to upcast from the subclass to the InputHandler class, I had a manual C-style cast in there. The Neighborhood pointer (only known through forward declaration) was being cast to the InputHandler pointer. Normally this would be OK, as long as the compiler knew about the class hierarchy (in this case, with multiple inheritance and virtual functions), and then generate a static_cast. But if it didn't know that, it would have to go with a reinterpret_cast. The code was doing a reinterpret_cast and throwing away the hierarchy, and therefore causing undefined behavior. It just so happened that it called into the wrong vtable in this case.

But why did it happen only during optimization? Probably because the function was getting inlined. If the include order had Neighborhood defined in the translation unit before getting to the constructor of GameInteraction, it would have output the correct static_cast. It's likely one other place had this situation and that version ended up being the actual used function.

Definitely one of the hardest bugs I've had to track down.

2 comments:

uruk said...

Haha, nice one! Thanks for sharing it, it was very teaching! :)

So it was another example of not putting any code into header files? Because that's why the include order wasn't known in the constructor of GameInteraction previously, right? Sorry if I am asking trivial things, I'd like to understand the problem fully.

clone2727 said...

@uruk: No, the lesson here was "don't needlessly cast". It shouldn't have needed a cast and it was a mistake to add the C-style cast -- at the very least, it should've been a static_cast.

Putting small functions into the header is usually a very good thing. The compiler can inline them to reduce the cost of calling into a function.