Deduplicating with Python

At home I have a small always-on NAS, a.k.a. file server, a.k.a. headless Linux machine. It has served me well for almost half a decade, and I have gradually migrated all my data to it. It also serves as an ad-hoc VPN, web server, torrent box and more.

By both need and attitude, I have grown to be a strong proponent of searching instead of organizing. By consolidating my various backups, music collections, photographs and so on onto that server over the years, it has become a mess where the same file can appear many times in many different folders. Unfortunately, disk space does not grow on trees, and cost-per-gigabyte decreases have stagnated.

So, Python to the rescue. I wrote a small Python deduplicator script, and the way it works is quite efficient. It first groups all the files it receives by size. When several files share the same size, it computes a SHA-1 checksum to identify files with identical contents. As a final step, it keeps one of the duplicate files and replaces the rest with hardlinks to that "master" file.
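For illustration, here is a minimal sketch of that approach in Python. The function names and top-level structure are my own assumptions; the actual deduplicator.py may well differ:

import hashlib
import os
from collections import defaultdict

def sha1_of(path, chunk_size=1 << 20):
    # Hash the file in fixed-size chunks so large files do not exhaust memory.
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate(paths, hardlink=False):
    # Step 1: bucket files by size; a file with a unique size has no duplicate.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    # Step 2: within each same-size bucket, bucket again by SHA-1 of the contents.
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[sha1_of(path)].append(path)
        # Step 3: keep the first file as the master and hardlink the rest to it.
        for master, *duplicates in by_hash.values():
            for dupe in duplicates:
                if hardlink:
                    os.remove(dupe)        # os.link fails if the target already exists
                    os.link(master, dupe)

The size pass is what makes this cheap: only files that share a size with at least one other file ever get read and hashed.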

It is smart enough to handle existing hard links efficiently. Be warned that it pays no attention to permissions whatsoever, and that once two files are hardlinked together, any modification affects all the hardlinked files at the same time. Neither issue concerns me much, since the files I run it against are mostly cold backups or music and photo files that are never modified.
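Handling existing hard links boils down to an inode check: on the same filesystem, two paths refer to the same underlying file exactly when their (st_dev, st_ino) pairs match, so such pairs can be skipped without hashing anything. A small sketch (the helper name is hypothetical):

import os

def already_linked(path_a, path_b):
    # Two paths are hardlinks of each other when they live on the same
    # device and share an inode number.
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_dev, stat_a.st_ino) == (stat_b.st_dev, stat_b.st_ino)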

The script can be run like this:

find ~/Music ~/oldMusic ~/Backups/ -type f | python deduplicator.py --hardlink
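That pipeline works because the script reads newline-separated paths from standard input, one per line as produced by find. The entry point could look roughly like this, reusing the deduplicate() sketch from above (an assumption about the script's structure, not its actual code):

import argparse
import sys

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Replace duplicate files with hardlinks to a single master copy.')
    parser.add_argument('--hardlink', action='store_true',
                        help='actually hardlink duplicates instead of only reporting them')
    args = parser.parse_args()
    # One path per line on stdin, exactly what find(1) emits.
    paths = [line.rstrip('\n') for line in sys.stdin if line.strip()]
    deduplicate(paths, hardlink=args.hardlink)

Without --hardlink, a script structured this way would default to a dry run, which is a sensible safety net before letting it rewrite a filesystem.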

Enjoy...
