header

Torsten Curdt’s weblog

Uniq with Hashing

Usually a simple command line can be used to extract the unique lines from a file.

cat data.txt | sort | uniq

Unfortunately the sort is required (as uniq requires a sorted input for its task) and the bigger the data set gets the more painful that sort operation becomes. I am surprised there is no option to ‘uniq’ to use hashing. But a short perl script also does the trick.

#!/usr/bin/perl

%seen = ();
while (<>) {
    print $_ unless $seen{$_}++;
}

Any better suggestions?

9 Responses to “Uniq with Hashing”

  1. francisoud said, on 21. February 2008 at 15:44

    In ruby:
    puts IO.readlines(”data.txt”).uniq

  2. tcurdt said, on 21. February 2008 at 15:49

    Try that on a 4GB file. I don’t think so ;)

  3. chromatic said, on 21. February 2008 at 21:57

    The line:

    %seen = ();

    … doesn’t do anything (unless you’re using the strict pragma). This is better, because it declares a lexical:

    my %seen;

  4. Sander van Zoest said, on 21. February 2008 at 23:49

    % sort -u data.txt
    or
    % sort -uc data.txt

  5. TheGuru said, on 22. February 2008 at 7:45

    This is brilliantly simple Torsten. I’ll add it to my back of script tricks as the old sort command, great as it is, just creates unecessary overhead. You can also derive an awk version from this code too.

  6. tcurdt said, on 22. February 2008 at 9:32

    @sander: The idea is not to sort but to get away with a O(1) operation.

    @chromatic: Thanks for the pointer …I am so not a perl guy :)

  7. Sander van Zoest said, on 23. February 2008 at 13:55

    ahh.. right. must have been too late to really grok as to what you were looking for.

  8. Shevek said, on 18. March 2008 at 14:17

    There is a simple proof that this cannot be done in O(1) time and space, it’s analogous to the lower-bound-on-sorting proof. You have hidden a higher order algorithm in the Perl hash, which may be implemented using either time or space, at Perl’s whim. The point of sort/uniq/sort -n is that it takes O(1) space (variable using the -S flag to sort), since while we can usually wait longer, we cannot create more RAM.

  9. tcurdt said, on 18. March 2008 at 14:56

    Hm …my CS skills may be rusty - but I thought hashing is a O(1) operation. Also I am wondering what is hidden besides the hashing itself? As I see it is that this space restriction is not always the way to go as it also comes at the price of sorting. I am wondering how a disk based hashtable would change the picture.

Leave a Reply

Please copy the string KivIKi to the field below: