
Easily outperformed by the more traditional TensorRT, which TF also supports: https://devblogs.nvidia.com/tensorrt-integration-speeds-tens....

In fact, also seems to be outperformed by plain PyTorch using a single V100: https://github.com/NVIDIA/DeepLearningExamples/tree/master/P...



Interesting. I wonder why there's such a difference between the Nvidia PyTorch benchmark and the Exxact results: Nvidia is more than twice as fast on a single GPU, while a V100 should only be ~10% faster than a Quadro 8000. Either Exxact is incompetent, or Nvidia has some special sauce.


FWIW, NVIDIA TensorRT pre-profiles the models before it runs them. I don't know how it does that exactly (that part is closed source) but I'd guess they just try different algorithms on each op individually (i.e. plain conv vs Winograd) and pick a good balance of speed and memory usage according to heuristics. On some nets this can make all the difference in the world, and ResNet50 is basically the most studied architecture in existence, so you can bet it's in every single benchmark for this kind of thing, and as such it receives disproportionate attention.
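For a sense of where that profiling happens: with the TRT 7-era Python API, the per-layer tactic timing runs inside the engine build call, and the workspace size you hand the builder bounds which algorithms it's even allowed to try. Rough sketch (paths and sizes are just placeholders, not anything official):

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.INFO)

    def build_engine(onnx_path, fp16=True, workspace_gb=4):
        builder = trt.Builder(TRT_LOGGER)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
        parser = trt.OnnxParser(network, TRT_LOGGER)
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                raise RuntimeError(parser.get_error(0))
        config = builder.create_builder_config()
        # Workspace bounds the scratch memory a tactic may use, so it
        # indirectly controls which algorithms get considered.
        config.max_workspace_size = workspace_gb << 30
        if fp16 and builder.platform_has_fast_fp16:
            config.set_flag(trt.BuilderFlag.FP16)
        # This is the slow call: TRT times candidate kernels per layer here.
        return builder.build_engine(network, config)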


I thought all frameworks could do this type of profiling (e.g. torch.backends.cudnn.benchmark = True).
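For reference, the PyTorch knob really is just one line; something like this (ResNet-50 here only as a stand-in model):

    import torch
    import torchvision

    # Ask cuDNN to time its available conv algorithms on the first call
    # and reuse the fastest one for later calls with the same shapes.
    torch.backends.cudnn.benchmark = True

    model = torchvision.models.resnet50().cuda().eval()
    x = torch.randn(32, 3, 224, 224, device="cuda")
    with torch.no_grad():
        y = model(x)  # first call pays the autotuning cost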

Nvidia might have eliminated any potential data pipeline bottlenecks (with careful DALI tuning), but I'd still expect a much smaller speedup from that alone. Maybe they compiled PyTorch with certain tricks, and used newer CUDA/cuDNN code, idk.
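If it's the input pipeline, a typical DALI setup (older graph-style API; the data path and augmentations here are just an example, not Nvidia's actual benchmark pipeline) looks roughly like:

    from nvidia.dali.pipeline import Pipeline
    import nvidia.dali.ops as ops
    import nvidia.dali.types as types

    class TrainPipe(Pipeline):
        def __init__(self, batch_size, num_threads, device_id, data_dir):
            super().__init__(batch_size, num_threads, device_id)
            self.reader = ops.FileReader(file_root=data_dir, random_shuffle=True)
            # "mixed" decodes JPEGs partly on the GPU (nvJPEG)
            self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
            self.rrc = ops.RandomResizedCrop(device="gpu", size=224)
            self.norm = ops.CropMirrorNormalize(
                device="gpu",
                mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

        def define_graph(self):
            jpegs, labels = self.reader()
            images = self.norm(self.rrc(self.decode(jpegs)))
            return images, labels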


TRT profiling is more extensive. On the model I'm currently working with (an object detector that runs on a Jetson Xavier), the initial TRT profiling takes something like 4 minutes. You can save the result, but it's hardware-dependent: the resulting engine is only optimal for the particular hardware it was profiled on. I cache it on disk, since I can't wait 4 minutes every time I run a test.
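The caching itself is just serializing the built engine and deserializing it on later runs; roughly (file name hypothetical):

    import os
    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.INFO)
    ENGINE_PATH = "detector.engine"  # only valid for this exact GPU + TRT version

    def load_or_build(build_fn):
        if os.path.exists(ENGINE_PATH):
            runtime = trt.Runtime(TRT_LOGGER)
            with open(ENGINE_PATH, "rb") as f:
                return runtime.deserialize_cuda_engine(f.read())
        engine = build_fn()  # the slow, multi-minute profiling step
        with open(ENGINE_PATH, "wb") as f:
            f.write(engine.serialize())
        return engine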

PyTorch, as far as I can tell, does much lighter cuDNN profiling. It's closer to Pareto-optimal, I suppose, but the benefit is nowhere near as significant.

Another framework that does amazing optimization (but on the CPU) is OpenVINO. Normally I don't expect much on the software side from Intel, but this thing really blows the doors off everything else if you don't have a GPU at your disposal, provided that you have an Intel processor. The way they do it is they generate kernels that fit your data, but not the way XLA does it. They hand-code them in a DSL that produces assembly, using Xbyak, and incorporate their deep knowledge of Intel hardware into that. When it's time to run the model, that DSL spits out optimal kernels just for that particular model. It's pretty neat work, IMO.
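From the user's side it's just the usual load-and-infer flow; all the kernel generation happens behind the load step. Something like this with the older IECore API (model paths and input shape are placeholders):

    import numpy as np
    from openvino.inference_engine import IECore

    ie = IECore()
    net = ie.read_network(model="model.xml", weights="model.bin")
    input_name = next(iter(net.input_info))
    exec_net = ie.load_network(network=net, device_name="CPU")  # JIT-ed kernels built here

    x = np.random.randn(1, 3, 224, 224).astype(np.float32)
    result = exec_net.infer({input_name: x})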


I see, thanks. In my day job I develop hardware-accurate simulations of a deep learning accelerator. This involves looking at a SPICE model, simplifying it into a set of abstractions using NumPy, then accelerating that NumPy code using GPUs. Currently I'm porting a ResNet-50 model from NumPy to PyTorch, and the next step is to speed up the PyTorch code (because right now I get ~1 image per second, which is about 10 times better than NumPy). Perhaps I should look into porting the model from PyTorch to TensorRT.
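The mechanical part of that port has mostly been swapping np arrays for torch tensors on the GPU and checking the results still match. Toy illustration (the ops here are made up, not the real simulation):

    import numpy as np
    import torch

    def layer_np(x, w):
        return np.maximum(x @ w, 0.0)      # toy stand-in for one simulated layer

    def layer_torch(x, w):
        return torch.relu(x @ w)           # same math, on the GPU

    x_np = np.random.randn(1024, 512).astype(np.float32)
    w_np = np.random.randn(512, 256).astype(np.float32)
    x_t, w_t = torch.from_numpy(x_np).cuda(), torch.from_numpy(w_np).cuda()

    out = layer_torch(x_t, w_t)
    assert np.allclose(out.cpu().numpy(), layer_np(x_np, w_np), atol=1e-3)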


If you're working with PyTorch, porting basically means exporting to ONNX. Sometimes you'll run into an op that doesn't work with ONNX, but there are a lot fewer of those in TRT7. Unfortunately I have to work with TRT6, so I have to use PyTorch 1.2 and be "creative" to work around TRT6 bugs. That said, it could very well be painless for you. No reason not to try. Just export the model and benchmark it with `trtexec`, in both fp32 and fp16. An hour of work at most.
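Concretely, the whole flow is on the order of (ResNet-50 and opset 11 just as an example; older TRT versions may need a different opset or workarounds):

    import torch
    import torchvision

    model = torchvision.models.resnet50().eval()
    dummy = torch.randn(1, 3, 224, 224)

    torch.onnx.export(model, dummy, "resnet50.onnx", opset_version=11,
                      input_names=["input"], output_names=["output"])

    # Then benchmark the exported graph with trtexec:
    #   trtexec --onnx=resnet50.onnx          (fp32)
    #   trtexec --onnx=resnet50.onnx --fp16   (fp16)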



