18/03/2007: Finding duplicate files or directories with Linux
I think everybody knows this situation: the collection of $FOO has grown over the years and you may have some duplicates in it. Even though disk space isn't very expensive anymore, having duplicates annoys me. So today I had some spare time and started cleaning up my drives. In directories with 20 or 30 items this is quite easy, but what do you do when a directory has 1500 items? Ask your favorite shell. Before you can do that, though, you need to come up with a regex pattern that matches the usual naming scheme. My items are usually named $FOO-$BAR-some-shit or $FOO_-_$BAR-some-shit, and $FOO and $BAR are the only interesting information. Because of this, my pattern looks like ([a-zA-Z0-9]*)-([a-zA-Z0-9]*).* which means: two groups of alphanumeric characters, separated by a hyphen, and I do not care about the rest.
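
To see what such a pattern keeps, here is a minimal sketch that runs it through sed on two made-up names. The helper name name_key and the sample names are hypothetical and only there for illustration. The first sed expression drops every character that is not alphanumeric or a hyphen, which is why the $FOO_-_$BAR spelling ends up with the same key as $FOO-$BAR.

```shell
# Hypothetical helper: reduce one file name to its "$FOO-$BAR" key.
name_key() {
    printf '%s\n' "$1" |
        sed -e 's#[^a-zA-Z0-9-]##g' \
            -e 's#\([a-zA-Z0-9]*\)-\([a-zA-Z0-9]*\).*#\1-\2#'
}

# Made-up sample names in the two spellings described above.
name_key 'Artist-Album-2004-live'      # prints Artist-Album
name_key 'Artist_-_Album-remastered'   # prints Artist-Album
```

Both names collapse to the same key, which is exactly what makes them show up as a duplicate later on.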

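But I still do not know where my duplicates are, so I ask my favorite tools: my shell (zsh, but this works with any POSIX shell), sed, sort and uniq:

ls /foo | sed -e 's#[^a-zA-Z0-9-]##g;s#\([a-zA-Z0-9]*\)-\([a-zA-Z0-9]*\).*#\1-\2#' | sort -f | uniq -id

This lists the contents of the /foo directory, strips every character that is not alphanumeric or a hyphen, reduces each name to $FOO-$BAR, and sorts the result ignoring case. The sort is needed because uniq only compares adjacent lines, and stripping characters can change the order that ls produced. Finally, uniq prints one line for each key that occurs more than once, compared case-insensitively.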
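
The one-liner only prints the shortened key, so you still have to look up which items share it. A possible variation, sketched under a few assumptions, prints the key together with every original name. The function name dupe_groups is hypothetical, names containing tabs or newlines are not handled, and awk is pulled in on top of the tools mentioned above.

```shell
# dupe_groups DIR: print "key<TAB>original name" for every entry in DIR
# whose lowercased "$FOO-$BAR" key occurs more than once.
dupe_groups() {
    for path in "$1"/*; do
        [ -e "$path" ] || continue          # skip the unexpanded glob of an empty dir
        name=${path##*/}                    # strip the directory part
        key=$(printf '%s\n' "$name" |
            sed -e 's#[^a-zA-Z0-9-]##g' \
                -e 's#\([a-zA-Z0-9]*\)-\([a-zA-Z0-9]*\).*#\1-\2#' |
            tr '[:upper:]' '[:lower:]')     # fold case so Foo-Bar equals foo-bar
        printf '%s\t%s\n' "$key" "$name"
    done |
        LC_ALL=C sort |                     # byte-wise sort keeps equal keys adjacent
        awk -F '\t' '
            NR > 1 && $1 == prev { if (!shown) print held; print; shown = 1; next }
            { prev = $1; held = $0; shown = 0 }'
}

# Example call: dupe_groups /foo
```

Seeing the full names next to each key also makes it easier to spot the false positives.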
In my directory with 1500 items I got about 70 duplicates. Most of them were true positives; only a few were false positives caused by the stripped "some-shit" at the end.

Hope this helps someone. I would also love to read your comments on how the search for duplicates could be improved.


Comments:
Mario: I am currently looking for a more general approach, as my hard disk is a real mess (no naming conventions).

I'd like to share some of my findings, hoping they help.


If you need the CLI:

http://www.die.net/doc/linux/man/man1/hardlink.1.html
http://dancameron.org/asides/1572
This manual page documents hardlink, a program which consolidates duplicate files in one or more directories using hardlinks.
hardlink traverses one or more directories searching for duplicate files. When it finds duplicate files, it uses one of them as the master. It then removes all other duplicates and places a hardlink for each one pointing to the master file. This allows for conservation of disk space where multiple directories on a single filesystem contain many duplicate files.
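
For a rough idea of the "searching for duplicate files" half of that description, here is a minimal sketch that groups files by checksum. It is not how hardlink itself is implemented, and it only lists candidates without removing or linking anything. It assumes sha256sum from GNU coreutils is installed; the function name same_content is hypothetical, and file names containing newlines or backslashes are not handled.

```shell
# same_content DIR: print every file under DIR whose SHA-256 sum is shared
# with at least one other file, grouped by checksum.
same_content() {
    find "$1" -type f -exec sha256sum {} + |
        LC_ALL=C sort |                     # identical checksums end up adjacent
        awk '
            { sum = substr($0, 1, 64) }     # a SHA-256 sum is 64 hex characters
            NR > 1 && sum == prev { if (!shown) print held; print; shown = 1; next }
            { prev = sum; held = $0; shown = 0 }'
}

# Example call: same_content /foo
```

Before deleting anything from such a list, comparing the candidates with cmp is a cheap extra check.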

http://www.die.net/doc/linux/man/man1/fdupes.1.html


If you don't need to stick to the shell:
http://www.pixelbeat.org/fslint/
http://www.linux.org/apps/AppId_8359.html

http://linux.softpedia.com/get/System/Diagnostics/Duper-1417.shtml

Please let me know via email if you find something better!

Regards,
Mario
