header

Torsten Curdt’s weblog

Uniq with Hashing

Usually a simple command line can be used to extract the unique lines from a file.

cat data.txt | sort | uniq

Unfortunately the sort is required (as uniq requires a sorted input for its task) and the bigger the data set gets the more painful that sort operation becomes. I am surprised there is no option to ‘uniq’ to use hashing. But a short perl script also does the trick.

#!/usr/bin/perl

%seen = ();
while (<>) {
    print $_ unless $seen{$_}++;
}

Any better suggestions?

  • Hm ...my CS skills may be rusty - but I thought hashing is a O(1) operation. Also I am wondering what is hidden besides the hashing itself? As I see it is that this space restriction is not always the way to go as it also comes at the price of sorting. I am wondering how a disk based hashtable would change the picture.
  • There is a simple proof that this cannot be done in O(1) time and space, it's analogous to the lower-bound-on-sorting proof. You have hidden a higher order algorithm in the Perl hash, which may be implemented using either time or space, at Perl's whim. The point of sort/uniq/sort -n is that it takes O(1) space (variable using the -S flag to sort), since while we can usually wait longer, we cannot create more RAM.
  • ahh.. right. must have been too late to really grok as to what you were looking for.
  • @sander: The idea is not to sort but to get away with a O(1) operation.

    @chromatic: Thanks for the pointer ...I am so not a perl guy :)
  • TheGuru
    This is brilliantly simple Torsten. I'll add it to my back of script tricks as the old sort command, great as it is, just creates unecessary overhead. You can also derive an awk version from this code too.
  • % sort -u data.txt
    or
    % sort -uc data.txt
  • The line:

    %seen = ();

    ... doesn't do anything (unless you're using the strict pragma). This is better, because it declares a lexical:

    my %seen;
  • Try that on a 4GB file. I don't think so ;)
  • In ruby:
    puts IO.readlines("data.txt").uniq
blog comments powered by Disqus