I offered MMX-, SSE-, and SSE2-optimized opengl32 builds for machines like this. SoftGPU only offers an MMX build (Win95); the other OpenGL option it supplies requires features not available on these older CPUs. But the performance wasn't wonderful.
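For anyone picking between those builds by hand, the rule is just "use the highest instruction set the CPU supports". A minimal sketch of that rule follows; the build names are hypothetical placeholders, not the actual file names in any package, and the feature list is assumed to come from whatever CPU info tool you already use.

```python
# Hypothetical build names, ordered from most to least demanding.
# The real file names in a given package may differ.
BUILDS = [
    ("sse2", "opengl32-sse2"),
    ("sse", "opengl32-sse"),
    ("mmx", "opengl32-mmx"),
]


def pick_build(cpu_features):
    """Return the most capable build whose required feature the CPU reports."""
    features = {f.lower() for f in cpu_features}
    for required, build in BUILDS:
        if required in features:
            return build
    return None  # no SIMD-optimized build applies


# A Pentium-M reports MMX, SSE, and SSE2, so it gets the SSE2 build.
print(pick_build(["MMX", "SSE", "SSE2"]))
```

A CPU that only reports MMX would fall through to the MMX build, which is the same case SoftGPU's Win95 option covers.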
There is a better option, and I'm sure some at MSFN have heard of it before. TitaniumGL delivers double the performance of older Mesa OpenGL releases [6.5.4-7] (asm-optimized for MMX, SSE, and SSE2). This is also true, if not more so, compared with the OpenGL provided by SoftGPU.
TitaniumGL is meant to translate OpenGL to Direct3D on machines that have hardware acceleration. But when no hardware acceleration exists, it falls back on its own software rendering. I've tested its software rendering, and it does double the performance. It can also be paired with Wine3d, like the Mesa OpenGL included with SoftGPU.
The TitaniumGL download includes files for modern Windows, Win9x, and ReactOS. The modern Windows version does not work on Win9x (even with KernelEx). The Win9x version works great. The ReactOS version works on Win9x with KernelEx (maybe without it as well), with a potentially minuscule increase in performance (not verified, and hardly detectable if real).
Be sure to get the current release of TitaniumGL, as the older ones floating around give half the performance.
Also, be warned: don't get too excited. While the performance is doubled, that might not mean much on a less powerful machine. My Pentium-M 1.2 GHz machine still did not achieve really playable results with "Unreal Tournament GOTY Edition". Windowed at 400x300, it came close (Titanium -> Wine3d [I haven't tested the OpenGL patches for UT]). But PCSXR went from unusable to usable windowed @640x480 (better @600x440) with the graphics performance settings enabled on the OpenGL plugin.
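To put those window sizes in perspective, here is a quick bit of arithmetic on pixel counts. It assumes software rendering cost scales roughly with the number of pixels drawn, which is my simplification and not something I measured; the function names are just for illustration.

```python
def pixels(width, height):
    """Total pixels in a window of the given size."""
    return width * height


def relative_cost(width, height, base=(640, 480)):
    """Pixel count relative to a base resolution (assumed proxy for render cost)."""
    return pixels(width, height) / pixels(*base)


# The three window sizes mentioned above.
for w, h in [(400, 300), (600, 440), (640, 480)]:
    print(f"{w}x{h}: {pixels(w, h)} pixels, {relative_cost(w, h):.0%} of 640x480")
```

By that measure 600x440 is about 14% fewer pixels than 640x480, and 400x300 is well under half, which fits with the smaller windows feeling better.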
For those using versions of Windows that support multiple cores, the software (and hardware) renderer can use the other cores. I assume that, on hardware-accelerated systems, the extra cores are used to speed up the OpenGL > Direct3D translation. It has been noted that, on machines with weaker GPUs, the software renderer can outperform hardware OpenGL acceleration (maybe not with Wine3d), provided enough cores are available [support for 32 cores].
Again, it would be nice if a hack could give access to the other cores on bare-metal Win9x installs.
SIMD95 might improve performance on CPUs that provide AVX (SSE/AVX for Win95); however, it is intended for virtual machines.
Not an excellent update, but I thought it worth mentioning. A reasonably capable Core Solo (bare-metal Win9x install) would probably provide a modest, but usable, software rendering experience. This might pair well with the emerging potential for HDA audio support (un-accelerated [emulated], like AC97).