Hacker News

The corpus actually worth using is the book corpus. While Google can't provide public access to all of the books it has scanned, there is no restriction on it using the data in those books to feed this project. Given how much material it has scanned from libraries and elsewhere, that is a much better source.


Is anyone doing the same for the books scanned by Archive.org?



