Uniq with Hashing

Usually a simple command line can be used to extract the unique lines from a file.

cat data.txt | sort | uniq

Unfortunately the sort is required, because uniq only collapses adjacent duplicate lines and therefore needs sorted input, and the bigger the data set gets, the more painful that sort operation becomes. I am surprised uniq has no option to use hashing instead. But a short Perl script also does the trick.

#!/usr/bin/perl

%seen = ();
while (<>) {
    print $_ unless $seen{$_}++;
}

Any better suggestions?

21. February 2008 | bash

tcurdt

Hm ... my CS skills may be rusty, but I thought hashing is an O(1) operation. Also, I am wondering what is hidden besides the hashing itself? As I see it, this space restriction is not always the way to go, as it also comes at the price of sorting. I am wondering how a disk-based hashtable would change the picture.
Shevek

There is a simple proof that this cannot be done in O(1) time and space; it's analogous to the lower-bound-on-sorting proof. You have hidden a higher-order algorithm in the Perl hash, which may be implemented using either time or space, at Perl's whim. The point of sort/uniq/sort -n is that it takes O(1) space (adjustable using sort's -S flag), since while we can usually wait longer, we cannot create more RAM.
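
A minimal sketch of the bounded-memory variant described above, assuming a sort that supports the -S buffer-size flag (GNU coreutils does; other implementations may differ). The function name uniq_sorted and the 64M cap are made up for illustration. When the input does not fit in the buffer, GNU sort spills to temporary files on disk, so RAM use stays capped while the run takes longer.

```shell
# Sort-based de-duplication with an explicit memory cap.
# uniq_sorted is a hypothetical helper name; 64M is an arbitrary example value.
uniq_sorted() {
    # -S caps sort's in-memory buffer (assumes GNU-style sort),
    # -u drops duplicate lines while sorting.
    sort -S 64M -u "$@"
}

# Example run on a tiny inline input
printf 'b\na\nb\nc\na\n' | uniq_sorted
```

Note that the output comes back sorted, not in the order the lines first appeared.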
Sander van Zoest

Ahh, right. It must have been too late for me to really grok what you were looking for.
tcurdt

@sander: The idea is not to sort but to get away with an O(1) operation.

@chromatic: Thanks for the pointer ... I am so not a Perl guy :)
TheGuru

This is brilliantly simple, Torsten. I'll add it to my bag of script tricks, as the old sort command, great as it is, just creates unnecessary overhead. You can also derive an awk version from this code.
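
One way the awk version might look, as a sketch that mirrors the Perl loop from the post. The wrapper name uniq_hashed is made up for illustration; the awk one-liner inside it is the actual technique. Like the Perl script, it keeps every distinct line in memory, so the same space trade-off applies.

```shell
# Hash-based de-duplication in awk, mirroring the Perl %seen loop.
# uniq_hashed is a hypothetical helper name, not a standard command.
uniq_hashed() {
    # seen[$0]++ evaluates to 0 the first time a line appears, so
    # !seen[$0]++ is true only then, and awk's default action prints the line.
    awk '!seen[$0]++' "$@"
}

# Example run on a tiny inline input
printf 'b\na\nb\nc\na\n' | uniq_hashed
```

Unlike sort | uniq, this keeps the first occurrence of each line in its original position.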
Sander van Zoest

% sort -u data.txt
or
% sort -uc data.txt
chromatic

The line:

%seen = ();

... doesn't do anything (unless you're using the strict pragma). This is better, because it declares a lexical:

my %seen;
tcurdt

Try that on a 4GB file. I don't think so ;)
francisoud

In ruby:
puts IO.readlines("data.txt").uniq

Torsten Curdt’s weblog
