(I am not a lawyer.) As far as I understand the legal precedents involved, rando...

ChuckMcM · on July 31, 2017

I worked at Google for four years, at an independent search engine for five more after that, and at IBM for 18 months after it acquired said search engine. Every one of those organizations spent many thousands of dollars on legal fees over just this question and reviewed tons of case law.

Every single one of them concluded that based on how the law was written and how the web worked, there is no legal way to scrape a web site without its explicit permission to do so.

That won't stop people from trying, of course, and it was a source of constant entertainment for the ops team at Blekko to see how people tried to sneak around while scraping (it can get very creative). But it isn't legal, you can and will get banned from all access for it, and if you use the results in another product or offering you will be found liable for damages.

sokoloff · on July 31, 2017

> there is no legal way to scrape a web site without its explicit permission to do so.

Google scrapes several of my sites and I've never given Google explicit permission to do so.

ChuckMcM · on July 31, 2017

If your robots.txt file has an Allow rule for it then you did. If you have no robots.txt file then it's an open question. If you put a Disallow rule into your robots.txt file Google will stop scraping your site.

The implicit contract is that you let them scrape because you want to show up in their search results, which will send you traffic. If you don't care about Google traffic then set Disallow in your robots.txt and get back the bandwidth you were giving them.
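For anyone who wants to see what the allow and deny cases look like in practice: real robots.txt files express them with `Allow` and `Disallow` lines under a `User-agent` line. Here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleBot`, the `example.com` URL, and the helper `can_crawl` are all made up for illustration, and this is only one way a crawler might do the check.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies: one permits everything, one forbids everything.
ALLOW_ALL = "User-agent: *\nAllow: /\n"
DENY_ALL = "User-agent: *\nDisallow: /\n"

if __name__ == "__main__":
    page = "https://example.com/some/page"  # assumed URL, illustration only
    print(can_crawl(ALLOW_ALL, "ExampleBot", page))
    print(can_crawl(DENY_ALL, "ExampleBot", page))
```

Note that this only answers what the file says. It does not settle any of the legal questions in this thread.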

__jal · on Aug 1, 2017

> If you have no robots.txt file then it's an open question.

Only for definitions of explicit I must be unfamiliar with.

If the presence of a robots.txt makes one's intent for a given resource explicit one way or the other, the lack of one (and the lack of some communication in some other channel) must mean there is no explicit permission.

ChuckMcM · on Aug 1, 2017

That is correct. For what it's worth, IBM's legal team came down on the side of 'assume deny' and Google was (at the time I was there) 'assume allow.'

sokoloff · on Aug 1, 2017

I think "assume allow" is perfectly reasonable. It's just implicit, not explicit.

saurik · on July 31, 2017

To the extent that is the case, though, it isn't due to the terms of service; it is also a matter of how you use the data later, which is a separate question from the scraping and collection process: it is very clear to me that a search engine is operating on the legal equivalent of thin ice, particularly with details like snippets and synthesis ;P. Whether the CFAA applies (as indicated in this article) is an open question, but that just isn't quite so obvious as "you also can't connect up to the public sewer".

ChuckMcM · on July 31, 2017

   > it is very clear to me that a search engine is 
   > operating on the legal equivalent of thin ice,

We may be saying similar things, but as a metaphor I think of search engines as operating on 'thick' ice. It has been litigated so much that there is a bevy of case law to refer to at all levels. Eric Goldman's blog used to have a pretty good list of the suits of various kinds, and the searchengine blog covered many of them as well.

For a search engine it is super clear: robots.txt is all. If you say yes explicitly, great. If you say no explicitly, that has to be honored. If you say nothing, then it's up to the search engine to decide which way to interpret it, but if the site owner complains because you picked wrong you have to honor their wishes (which may include destroying any cached data as well).
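The decision rule in the paragraph above can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular search engine implements it: the function `may_crawl`, the `assume_allow` and `owner_objected` parameters, and the `ExampleBot` name are hypothetical, and "saying nothing" is modeled only as a missing robots.txt file (passed as `None`).

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def may_crawl(
    robots_txt: Optional[str],
    url: str,
    user_agent: str = "ExampleBot",  # hypothetical crawler name
    assume_allow: bool = True,       # crawler's own policy when the site says nothing
    owner_objected: bool = False,    # site owner has complained out of band
) -> bool:
    """Decide whether to crawl url under the yes / no / silent rule."""
    # An owner complaint overrides everything else.
    if owner_objected:
        return False
    # No robots.txt at all: the site has said nothing, so fall back
    # to whichever default the crawler's operator picked.
    if robots_txt is None:
        return assume_allow
    # Otherwise honor whatever the file explicitly says.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under this sketch, an 'assume deny' operator would call it with `assume_allow=False`, and an 'assume allow' operator would leave the default. Handling of cached data after a complaint is outside what this shows.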

PadMapper, Perfect10, and the newspapers generated a ton of cases based on 'scraping a web site and using the data.' There are also about a dozen comparative shopping sites that have been dinged for the exact same issues. (Look for cases vs Amazon or vs Walmart.)

Whether CFAA, DMCA, Torte law (contracts), or something else applies is constantly being discussed :-). I'm just the messenger here. I haven't found a single case that has held that the point of view of the scraper of someone else's web site should prevail. The argument that it should be allowed 'to help new businesses get off the ground' is like saying Apple should pay out some of its cash hoard as grants to startups trying to break into some business. I have yet to read anything that was sympathetic to that point of view.

pkilgore · on Aug 1, 2017

yuummm Torte law. (It's tort, and tort law is generally considered to be distinct from contract law, because in tort rights and duties come from common law whereas in contract law they come from acts of agreement between two parties).

ChuckMcM · on Aug 1, 2017

Chocolate Torte is my favorite :-) Thanks for the clarification. In the various articles I've read over the years on this topic they refer to tort law (no doubt because much of the argument references common law and the way in which the relations are argued), and I made the leap to 'contracts', which was incorrect.