This is a great resource. At Splitgraph, we index ~40k open datasets, and we make sure to include structured metadata for each one so that we show up in these results (example [0]).
One cool aspect of this metadata is that it allows a dataset to have multiple sources. So if two sites index the same dataset, there is no duplicate content penalty like there might be with textual content. If you search for a dataset, it will include links to all its sources (whether canonical or otherwise).
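To make that concrete, here's a rough sketch of the kind of Schema.org `Dataset` JSON-LD a republishing page might embed. The `dataset_jsonld` helper, the field values, and the example URLs are all made up for illustration (this is not copied from one of our pages), and I'm assuming `sameAs` is the property used to point back at the canonical source, so double-check against Google's structured data docs.

```python
import json


def dataset_jsonld(name, description, page_url, canonical_url):
    """Build a Schema.org Dataset description as a JSON-LD string.

    Hypothetical helper, for illustration only.
    """
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # The page that is republishing / indexing the dataset.
        "url": page_url,
        # Assumption: sameAs identifies the canonical page for the same dataset,
        # which lets a search engine group several sources under one result.
        "sameAs": canonical_url,
    }
    return json.dumps(doc, indent=2)


# Placeholder values; none of these URLs are real.
markup = dataset_jsonld(
    name="Example open dataset",
    description="A placeholder description of an open government dataset.",
    page_url="https://indexer.example.com/some-portal/example-dataset",
    canonical_url="https://portal.example.gov/d/abcd-1234",
)
print(markup)
```

The output would go inside a `<script type="application/ld+json">` tag on the dataset page.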
For most of the data we index at Splitgraph, the canonical source is an open government data portal powered by Socrata (e.g. data.cdc.gov). We noticed that Socrata powered a lot of portals, so we wrote a Socrata plugin for Splitgraph, along with a scraper to index the metadata. The plugin basically implements a Postgres FDW so that Splitgraph can translate from SQL to the upstream query language. In this case, the plugin translates to Socrata's bespoke API language. But for private deployments we also have plugins for Snowflake, Postgres, some SaaS services, etc.
If you find some data on Google Dataset Search with Splitgraph listed as a source, please take a look! Our "Data Delivery Network" (DDN) is implemented on top of the Postgres wire protocol, so you can connect with any Postgres client (or use our web editor). All the Postgres query syntax is available to you; you can even JOIN across any of the other 40k+ datasets indexed at Splitgraph. That includes "live data" like Socrata portals, but also versioned snapshots of data called "data images." Here's an example of a point-in-time query across two snapshots (basically a diff) [1], and another query that joins across tables at data.cityofchicago.org and data.cambridgema.gov [2].
[0] https://www.splitgraph.com/cdc-gov/distribution-of-covid19-d... – "View Source" to see the Schema.org metadata
[1] https://bit.ly/3epvxcj
[2] https://bit.ly/3f1ll8K
(Sorry for the bit.ly links. The URL for our query editor includes the full SQL string, and I don't want to mess up HN formatting.)