Ok, much improved. The full test run (all 966 tests) now takes only 20 minutes instead of 60, and the throughput is now:
256-bit: 778 MB takes 44-45 seconds (roughly 17.5 MB/s)
512-bit: 778 MB takes 44-45 seconds (roughly 17.5 MB/s)
1024-bit: 778 MB takes 51-52 seconds (roughly 15 MB/s)
This is a DRASTIC improvement, but I still feel like I can do better. Basically, I went back to good ol' C++ routines for the cores of the ThreeFish functions and unrolled EVERYTHING! Right now only MIX and INVMIX are actual functions; everywhere there would normally be a loop, I wrote out the individual steps and removed as many array/pointer look-ups as possible.
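To make the unrolling idea concrete, here is a minimal sketch (not my actual library code) of two rounds on four 64-bit words, written once as a loop over an array and once fully unrolled on local variables. The names Mix, InvMix, RoundsLooped and RoundsUnrolled are made up for this illustration, key injection is left out, and the rotation constants (14, 16, 52, 57) should be checked against the Skein specification before you rely on them.

```cpp
#include <cstdint>
#include <utility>

// Rotates written as two shifts; only valid for 0 < n < 64.
inline std::uint64_t RotL(std::uint64_t v, unsigned n) { return (v << n) | (v >> (64 - n)); }
inline std::uint64_t RotR(std::uint64_t v, unsigned n) { return (v >> n) | (v << (64 - n)); }

// MIX: add, then rotate the second word and XOR it with the sum.
inline void Mix(std::uint64_t* x, std::uint64_t* y, unsigned r) {
    *x += *y;
    *y = RotL(*y, r) ^ *x;
}

// INVMIX: undo the XOR and rotate, then undo the addition.
inline void InvMix(std::uint64_t* x, std::uint64_t* y, unsigned r) {
    *y = RotR(*y ^ *x, r);
    *x -= *y;
}

// Loop form: array indexing and a rotation table lookup on every round.
void RoundsLooped(std::uint64_t w[4], const unsigned rot[2][2]) {
    for (int d = 0; d < 2; ++d) {
        Mix(&w[0], &w[1], rot[d][0]);
        Mix(&w[2], &w[3], rot[d][1]);
        std::swap(w[1], w[3]);  // word permutation between rounds
    }
}

// Unrolled form: words live in locals, constants are literals, and the
// permutation is folded in by swapping which locals each Mix call takes.
void RoundsUnrolled(std::uint64_t w[4]) {
    std::uint64_t a = w[0], b = w[1], c = w[2], d = w[3];
    Mix(&a, &b, 14); Mix(&c, &d, 16);  // round 0
    Mix(&a, &d, 52); Mix(&c, &b, 57);  // round 1 (b and d trade places)
    w[0] = a; w[1] = b; w[2] = c; w[3] = d;
}
```

Both versions produce the same four words for the same input; the unrolled one just has no loop counter, no table lookup and no explicit swap.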
With a few compiler tweaks, I might be able to squeeze more performance out. Also, as I've discovered in testing, my memory usage needs to improve, and I need to make sure that my C++ routines zero out memory properly when they are done with their local buffers. I've already done some of that in this version (which I will push up to CodePlex later today).
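For the zeroing part, one common approach looks roughly like the sketch below. SecureZero is a hypothetical helper name, not something from my library. The idea is to write through a volatile pointer so the optimizer is less likely to throw the stores away as dead code, which can happen with a plain memset on a buffer that is about to go out of scope. Treat this as an idiom, not a formal guarantee; on Windows the SecureZeroMemory API exists for the same purpose.

```cpp
#include <cstddef>

// Overwrite n bytes at p with zeros. Each store goes through a volatile
// pointer, so the compiler has to treat it as an observable write.
inline void SecureZero(void* p, std::size_t n) {
    volatile unsigned char* b = static_cast<volatile unsigned char*>(p);
    while (n--) {
        *b++ = 0;
    }
}
```

A core routine would call SecureZero(buffer, sizeof buffer) on each local buffer (key schedule, working block) right before returning.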
NOTE: (landmine) For whatever reason, my C++ routines will NOT use the intrinsic _rotl64 and _rotr64 functions. They don't even seem to work under Visual Studio 2010 SP1 with .NET 4. I had to write out each rotate as two shifts:
*Y = (UInt64)(*Y << N) + (UInt64)(*Y >> (64 - N)); //ROTATE LEFT
*Y = (UInt64)(*Y >> N) + (UInt64)(*Y << (64 - N)); //ROTATE RIGHT
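For anyone who wants these as standalone helpers, here is a hedged sketch in plain standard C++. RotateLeft64 and RotateRight64 are illustrative names, not the ones in my code. It uses OR instead of addition (the two shifted halves never overlap, so the result is the same) and masks the count so that a rotate by 0 or 64 does not turn into an undefined 64-bit shift.

```cpp
#include <cstdint>

// Rotate v left by n bits; n is reduced modulo 64 first.
inline std::uint64_t RotateLeft64(std::uint64_t v, unsigned n) {
    n &= 63;
    return (v << n) | (v >> ((64 - n) & 63));
}

// Rotate v right by n bits; n is reduced modulo 64 first.
inline std::uint64_t RotateRight64(std::uint64_t v, unsigned n) {
    n &= 63;
    return (v >> n) | (v << ((64 - n) & 63));
}
```

Many optimizing compilers recognize this shift-and-OR pattern and emit a single rotate instruction for it, though I have not verified that for this particular toolchain.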
Whether or not the intrinsic functions would give me any kind of speed boost I have no idea. If I "_forceinline" these functions I might even get a tad more speed, but at 20 minutes for a full battery of tests, I'm going to just push up this version for now and call it a small victory.
[UPDATE: Ok, _forceinline is a BAD idea. The build takes 10 times as long, and performance is actually worse than without it, so don't use _forceinline on the mix functions.]
I am consistently getting the same hash results for all my tests, and all my code still passes the NIST Known Answer Tests (much faster, might I add), so the outputs that I posted previously should still be valid for any implementation, if you are inclined to use them.