Finding duplicate files or directories with Linux

I think everybody knows this situation: the collection of $FOO has grown over the years and you may have some duplicates in it. Even though disk space isn't very expensive anymore, the fact of having duplicates annoys me. So today I had some spare time and started cleaning up my drives. In directories with 20 or 30 items this is quite easy, but what about when you have a directory with 1500 items? Ask your favorite shell, but before you can do that, you need to come up with a regex pattern that matches the usual naming scheme. I usually have $FOO-$BAR-some-shit or $FOO_-_$BAR-some-shit, and $FOO and $BAR are the only interesting information. Because of this, my pattern looks like ([a-zA-Z0-9]*)-([a-zA-Z0-9]*).* : two groups of alphanumeric characters, divided by a hyphen, ignoring the rest.
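As a quick sanity check, the extraction can be tried on a couple of made-up names (the names below are hypothetical examples, not from my collection); both naming variants collapse to the same $FOO-$BAR key, apart from case:

```shell
# Feed two sample names through the sed extraction:
# first strip everything except letters, digits and hyphens,
# then keep only the first two hyphen-separated groups.
printf '%s\n' 'Alice-Bob-live-2006' 'alice_-_bob-remix' \
  | sed -e 's#[^a-zA-Z0-9-]##g;s#\([a-zA-Z0-9]*\)-\([a-zA-Z0-9]*\).*#\1-\2#'
# prints:
# Alice-Bob
# alice-bob
```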

But I still do not know where my duplicates are, so I ask my favorite tools: my shell (zsh, but this works in every POSIX shell), sed, and uniq:

ls /foo | sed -e 's#[^a-zA-Z0-9-]##g;s#\([a-zA-Z0-9]*\)-\([a-zA-Z0-9]*\).*#\1-\2#' | sort -f | uniq -id

This lists the contents of the /foo directory, strips all non-alphanumeric characters (except hyphens), rewrites each name to just $FOO-$BAR, and then prints the duplicate entries, comparing case-insensitively. Note that uniq only detects adjacent duplicates, so the input must be sorted case-insensitively first for -i to catch case-variant names.
In my directory with 1500 items I got about 70 duplicates; most of them were true positives, with only a few false positives caused by stripping the "some-shit" at the end.
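If you also want to know how many items collapse into each duplicate key, a `sort -f | uniq -cdi` variant prints a count in front of each duplicate. Sketched here on made-up sample names, since /foo is my actual directory:

```shell
# Same extraction as above, but with a count per duplicate key.
# sort -f groups case-variants together so uniq sees them adjacently;
# uniq -cdi counts (-c) duplicates (-d), ignoring case (-i).
printf '%s\n' 'Alice-Bob-live' 'alice_-_bob-remix' 'Carol-Dave-x' \
  | sed -e 's#[^a-zA-Z0-9-]##g;s#\([a-zA-Z0-9]*\)-\([a-zA-Z0-9]*\).*#\1-\2#' \
  | sort -f | uniq -cdi
# prints something like:
#       2 Alice-Bob
```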

I hope this helps someone, but I would also love to see your comments on how the search for duplicates could be improved.


Mario wrote on 2007-05-02 13:24:

I am currently looking for a more general approach as my hard disk is a real mess (no naming conventions).

I’d like to share some of my findings, hoping these help.

If you need the CLI:

This manual page documents hardlink, a program which consolidates duplicate files in one or more directories using hardlinks.

hardlink traverses one or more directories searching for duplicate files. When it finds duplicate files, it uses one of them as the master. It then removes all other duplicates and places a hardlink for each one pointing to the master file. This allows for conservation of disk space where multiple directories on a single filesystem contain many duplicate files.
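The same content-based idea can be approximated in the shell without installing anything, by hashing file contents and reporting hashes that occur more than once. A sketch assuming GNU coreutils (`uniq -w32 -D` compares only the first 32 characters, i.e. the MD5 hash, and prints every repeated line):

```shell
# Content-based duplicate detection, independent of file names:
# hash every file, sort by hash, and print all entries whose hash
# (the first 32 characters of each line) repeats.
find /foo -type f -exec md5sum {} + | sort | uniq -w32 -D
```

Unlike the name-based pipeline above, this only reports byte-identical files, but it produces no false positives.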

If you don’t need to stick to the shell:

Please let me know via email if you find something better!



Send your comments to and I will publish them here (if you want).