Does it matter? It's not a research project trying to rigorously evaluate novel architectural modifications or anything like that, just a project trying to be useful within the limited resources of a hobbyist. If someone labeled a bunch of the remaining errors, that data would be better used as additional training data than as a benchmark.
In practice, the accuracy, whatever it is, appears to be very high and more than adequate to justify its use.