I dug out the old win32 profiler code. It needs a little cleanup but it’s all functional. Samples are 32 bytes each and each function call takes one sample. There is a buffer of samples in memory, and when this buffer is full the profiler stops recording. There is a dump function which writes the samples to disk so they can be processed by an external tool. It’s nowhere near as simple as you might think to dump from the hook function when you detect that the buffer is full: you cannot call any C function from the hook because there is no stack frame. You have two options. The first is to create a fake stack frame, which you would only do once you detect that the samples need to be written, and then call into C. The other is to issue the syscall instructions that correspond to CreateFile/WriteFile/CloseHandle. This can be done straight from the asm, and a syscall doesn’t need a stack frame.
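To make the buffer behaviour concrete, here is a minimal C sketch of a 32-byte sample and a fixed-size buffer that stops recording once it is full. This is not the profiler's actual code: the field layout, the names `Sample`, `SampleBuffer`, `record_sample` and the capacity are all assumptions for illustration, and the real hook is written in asm rather than C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: four 64-bit fields, 32 bytes in total. */
typedef struct {
    uint64_t function_addr;  /* entry address of the function being called */
    uint64_t return_addr;    /* stack return address, i.e. the caller */
    uint64_t start_cycles;   /* cycle counter at function entry */
    uint64_t end_cycles;     /* cycle counter at function exit */
} Sample;

/* Compile-time size check: the array size is negative if Sample is not 32 bytes. */
typedef char sample_size_check[(sizeof(Sample) == 32) ? 1 : -1];

#define SAMPLE_CAPACITY 1024  /* assumed capacity, chosen only for the example */

typedef struct {
    Sample samples[SAMPLE_CAPACITY];
    size_t count;            /* number of samples recorded so far */
} SampleBuffer;

/* Stores one sample. Returns 1 on success, or 0 if the buffer is full,
   in which case the sample is dropped and recording has effectively stopped. */
static int record_sample(SampleBuffer *buf, Sample s)
{
    assert(buf != NULL);
    if (buf->count >= SAMPLE_CAPACITY) {
        return 0;
    }
    buf->samples[buf->count++] = s;
    return 1;
}
```

The point at which `record_sample` returns 0 is the point where the hook would need to dump the buffer, which is where the stack-frame problem described above comes in.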
Also attached is the tool to decode the results. It’s fairly simple, but it can sort functions by min time, max time, avg time, total time, number of times called, etc. It can also display the entire call graph of the application. The tool can easily compute the total time of a function (cycles from start to finish) and the local time of a function (time excluding all the functions it calls). It uses dbghelp to cross-reference the function entry instruction pointer (the function being called) and the stack return address (the function that called you) against the debug symbols. There is a lot this profiler could do if some effort was put into it.
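The total versus local time calculation can be sketched roughly as follows. This is an illustration, not the attached tool's code. It assumes each sample carries both a start and an end cycle count, that the samples come from a single thread, and that they are sorted by start time so that a callee's interval is nested inside its caller's. The names `Sample`, `compute_local_times` and `MAX_DEPTH` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-byte sample layout (an assumption, see above). */
typedef struct {
    uint64_t function_addr;
    uint64_t return_addr;
    uint64_t start_cycles;
    uint64_t end_cycles;
} Sample;

#define MAX_DEPTH 64  /* assumed maximum call nesting for the example */

/* For each call i, writes its local time (total time minus the total time
   of its direct callees) into local_out[i].
   Requires: samples sorted by start_cycles, single thread, nested intervals. */
static void compute_local_times(const Sample *s, size_t n, uint64_t *local_out)
{
    size_t stack[MAX_DEPTH];  /* indices of calls that are still open */
    size_t depth = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        uint64_t total = s[i].end_cycles - s[i].start_cycles;
        local_out[i] = total;

        /* Pop every call that had already returned before this one started. */
        while (depth > 0 && s[stack[depth - 1]].end_cycles <= s[i].start_cycles) {
            --depth;
        }
        /* Whatever is left on top is the direct caller: charge it for us. */
        if (depth > 0) {
            local_out[stack[depth - 1]] -= total;
        }
        assert(depth < MAX_DEPTH);
        stack[depth++] = i;
    }
}
```

For example, a call spanning cycles 0 to 100 that makes two calls lasting 30 and 20 cycles has a total time of 100 and a local time of 50.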
You can also use this tool to see if a function is being inlined. An inlined function will never show up in the profiler because it’s never called! You can compare the number of calls in debug and release builds to see what percentage of the calls have been inlined.
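The debug versus release comparison is simple arithmetic. Here is a small sketch, assuming you have already pulled the call counts for one function out of both profiles. The function name `inlined_percent` is made up for the example, and the result is only a rough estimate since it treats every missing call as an inlined one.

```c
#include <assert.h>
#include <stdint.h>

/* Estimates the percentage of calls to a function that were inlined,
   given its call count in a debug build and in a release build.
   Returns 0.0 when there is nothing to compare against. */
static double inlined_percent(uint64_t debug_calls, uint64_t release_calls)
{
    if (debug_calls == 0 || release_calls >= debug_calls) {
        return 0.0;
    }
    return 100.0 * (double)(debug_calls - release_calls) / (double)debug_calls;
}
```

A function called 1000 times in debug and 250 times in release would come out as 75 percent inlined.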