I came across Data diffs: Algorithms for explaining what changed in a dataset.
The two papers discussed are:
- Scorpion by Eugene Wu and Sam Madden. Finds common properties of outlier points.
- DIFF. SQL implementation of Scorpion and similar algorithms. The computation can be distributed.
- An author of the DIFF paper, Peter Bailis, founded sisudata.com.
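As a rough illustration of what "finding common properties of outlier points" can mean, here is a toy sketch. It is not the algorithm from either paper: the function name `explain`, the `min_support` threshold, and the add-one smoothing are all assumptions made for this example. It ranks single `(column, value)` pairs by how much more often they appear among the outlier rows than among the remaining rows.

```python
from collections import Counter


def explain(outliers, inliers, min_support=0.5):
    """Rank (column, value) pairs by over-representation among outliers.

    Rows are plain dicts. Returns (pair, support, ratio) tuples, where
    support is the fraction of outlier rows containing the pair and ratio
    compares that fraction with the smoothed rate among the inliers.
    """
    out_counts = Counter((k, v) for row in outliers for k, v in row.items())
    in_counts = Counter((k, v) for row in inliers for k, v in row.items())

    results = []
    for pair, n_out in out_counts.items():
        support = n_out / len(outliers)
        if support < min_support:
            continue  # too rare among the outliers to count as an explanation
        # Add-one smoothing (an assumption of this sketch) avoids dividing
        # by zero when the pair never occurs among the inliers.
        in_rate = (in_counts[pair] + 1) / (len(inliers) + 1)
        results.append((pair, support, support / in_rate))

    return sorted(results, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    slow = [{"device": "ios", "region": "eu"}, {"device": "ios", "region": "us"}]
    fast = [
        {"device": "android", "region": "eu"},
        {"device": "android", "region": "us"},
        {"device": "web", "region": "eu"},
    ]
    for pair, support, ratio in explain(slow, fast):
        print(pair, round(support, 2), round(ratio, 2))
```

This sketch only looks at one column at a time; combinations of columns would need a search over attribute sets, which it does not attempt.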
The comments on HN list some related work:
- Dolt is a company whose product is a version-controlled SQL database.
- A Spark implementation of DIFF by G-Research.
- TerminusDB. Another version-controlled db company.
Searching for “data diffs” on HN finds related work:
- 1: How to check two SQL tables are the same. At a glance, nothing interesting in the comment thread.
- 2: data-diff. Compare datasets across different databases (like Postgres & Snowflake).
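For small tables in a single database, checking whether two SQL tables are the same can be sketched with `EXCEPT` in both directions. This is a minimal example using Python's built-in `sqlite3` module. The helper name `table_diff` and the tables `t1` and `t2` are made up for illustration, and this is not how any of the tools above are implemented.

```python
import sqlite3


def table_diff(conn, a, b):
    """Return (rows only in a, rows only in b) for two same-shaped tables.

    Table names are interpolated into the SQL, so pass trusted identifiers
    only. EXCEPT works on sets of rows, so a difference in how many times a
    duplicate row appears is not detected.
    """
    only_a = conn.execute(f"SELECT * FROM {a} EXCEPT SELECT * FROM {b}").fetchall()
    only_b = conn.execute(f"SELECT * FROM {b} EXCEPT SELECT * FROM {a}").fetchall()
    return only_a, only_b


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE t2 (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, "a"), (2, "b")])
    conn.executemany("INSERT INTO t2 VALUES (?, ?)", [(1, "a"), (2, "c")])
    only_t1, only_t2 = table_diff(conn, "t1", "t2")
    print("only in t1:", only_t1)
    print("only in t2:", only_t2)
```

The two tables are the same (as sets of rows) exactly when both returned lists are empty. Comparing across different databases needs a different approach, since a single query cannot see both sides.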