Robin Linacre’s Post

Lead author of Splink, Data Scientist at Ministry of Justice

Splink 3.9.11 (just released) is >10x faster on high core count machines with DuckDB. DuckDB is now recommended even for very large data linkages/dedupes of 10s of millions of records or more.

Benchmarking results demonstrate that deduping a 7m record dataset with 1bn comparisons is possible in just 2 minutes, making Splink (AFAIK!) by far the fastest free data deduplication library. See detailed results here: https://lnkd.in/eHCYFEqr

I'm super pleased with these results because they dramatically lower the barriers to entry in using Splink. No more need to set up a Spark cluster, fiddle with complex Spark config, or register UDFs. Just pip install splink and use the DuckDB backend.

Discuss on HN here: https://lnkd.in/e92ZARkz
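
For readers who want to see what "use the DuckDB backend" might look like in practice, here is a minimal sketch, not taken from the post. It assumes the Splink 3.9.x API (installed with `pip install "splink<4"`; later major versions changed the interface) and a pandas DataFrame with hypothetical columns `unique_id`, `first_name` and `surname`. The blocking rules, comparisons and threshold are illustrative choices only.

```python
# Sketch only: assumes Splink 3.9.x (pip install "splink<4") and a pandas
# DataFrame with hypothetical columns unique_id, first_name, surname.

# Hypothetical blocking rules: only pairs agreeing on one of these are compared.
BLOCKING_RULES = [
    "l.first_name = r.first_name",
    "l.surname = r.surname",
]


def dedupe(df, threshold=0.9):
    """Return pairwise match predictions for df as a pandas DataFrame."""
    # Imported lazily so this module loads even where Splink is not installed.
    import splink.duckdb.comparison_library as cl
    from splink.duckdb.linker import DuckDBLinker

    settings = {
        "link_type": "dedupe_only",
        "blocking_rules_to_generate_predictions": BLOCKING_RULES,
        "comparisons": [
            cl.exact_match("first_name"),
            cl.levenshtein_at_thresholds("surname", 2),
        ],
    }
    linker = DuckDBLinker(df, settings)

    # Estimate u probabilities from random pairs, then m probabilities via EM,
    # running one EM session per blocking rule.
    linker.estimate_u_using_random_sampling(max_pairs=1e6)
    for rule in BLOCKING_RULES:
        linker.estimate_parameters_using_expectation_maximisation(rule)

    predictions = linker.predict(threshold_match_probability=threshold)
    return predictions.as_pandas_dataframe()
```

Calling `dedupe(df)` on a suitable DataFrame should return one row per candidate pair scoring above the threshold. No Spark cluster or UDF registration is involved; DuckDB runs in-process.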

# Super-fast deduplication of large datasets using Splink and DuckDB

robinlinacre.com

Andreas Varotsis

Head of AI Capability at 10 Downing Street

7mo

DuckDB should get so much more love.

Jacob Browning

ML Ops / Site Reliability Engineer / Data Engineer

7mo

Love the innovation!
