What if you were in an environment where you had to play Minecraft for, say, an hour? Do you think your child brain would've eventually tried enough things (or had your finger slip and linger on the mouse button a little longer), noticed that hitting a block caused an animation (maybe even connected that with the fact that your cursor highlights individual blocks with a black box), decided to explore that further, and eventually mined a block? Your example doesn't speak to this situation at all.
I think learning to hold a button down in itself isn't too hard for a human or robot that's been interacting with the physical world for a while and has learned all kinds of skills in that environment.
But for an algorithm learning from scratch in Minecraft, it's more like having to guess the helicopter cheat code in GTA: it's not something you'd stumble upon unless you had prior knowledge or experience.
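To make the "stumble upon" point concrete, here's a rough back-of-the-envelope sketch in Python. The per-tick attack probability and the number of ticks needed to break a block are assumptions I'm picking for illustration, not measured values from Minecraft or any benchmark:

```python
# Rough, illustrative back-of-the-envelope: how likely is a purely random
# policy to hold "attack" long enough to break one block?
# Both numbers below are assumptions for the sake of the sketch.

p_attack_per_tick = 0.5   # assume the random policy presses attack half the time
ticks_to_break = 60       # assume ~3 seconds of continuous mining at 20 ticks/s

p_success = p_attack_per_tick ** ticks_to_break
print(f"P(hold attack for {ticks_to_break} straight ticks) ~ {p_success:.1e}")
# ~ 8.7e-19, and that's before requiring the cursor to stay on the same block.
```

Even with generous assumptions, the chance of randomly producing the required action sequence is astronomically small, which is the whole point about needing prior knowledge.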
Obviously, pretraining world models for common-sense knowledge is another important research frontier, but that's for another paper.