Hacker News

Can you elaborate on whether, and when, $\theta$ is synchronized across nodes?

Algorithm 1 suggests that each node starts gradient aggregation from its local micro-gradient $g$. Since the order of aggregation matters, $\theta$ would likely diverge after applying the step with $g_{\mathrm{GAF}}$, even if the models on different nodes are initialized with the same weights. Hence, I would expect a weight-synchronization step after each macro-gradient step. Do you have such a step? If so, how do you implement consensus? Simply via averaging?
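For concreteness, here is a minimal sketch of the averaging-based consensus step I have in mind (the function name and the per-node weight vectors are hypothetical, not from the paper; a real implementation would do an all-reduce with a mean op over each parameter tensor):

```python
import numpy as np

def average_consensus(thetas):
    """Consensus by elementwise averaging of per-node weight vectors.

    `thetas` is a list of parameter arrays, one per node. Every node
    replaces its local weights with the shared mean, which is the
    analogue of an all-reduce(mean) in a real distributed setup.
    """
    mean = np.stack(thetas).mean(axis=0)
    return [mean.copy() for _ in thetas]  # all nodes end up identical

# Hypothetical example: three nodes whose weights drifted apart after
# applying order-dependent aggregated gradient steps.
thetas = [np.array([1.0, 2.0]), np.array([1.2, 1.8]), np.array([0.8, 2.2])]
synced = average_consensus(thetas)
```

After this step all nodes hold the same $\theta$ again, so any divergence introduced by the order of aggregation is reset once per macro-gradient step.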


