Identifying Duplicate Lines in a Text File

» 24 November 2010 » In Open-Source, Perl, Programming »

Checking a text file for duplicate entries has never been easy. There are simple methods, like firing up Notepad and searching for the whole line, but what if you need to know the line numbers of the duplicates?

Why and How?

Recently, I coded a duplicate line identifier in Perl. I was actually planning to do it in Python, but for the sake of answering this question, I wrote it in Perl. It took me several minutes to work out how to answer the question completely, and I guess I succeeded.

About the code: I used that new style of mine that I mentioned 2 blog posts ago (maybe), and it worked well. I’m a bit worried about my variables though; they make me feel like I coded a mess. But maybe that’s just me.

The code is pretty simple to understand. Because of the nested loops, I don’t recommend tracing it by hand, unless the file is only 2 or 3 lines long. What makes this different from other solutions is that it identifies the line numbers of the duplicates, rather than removing them or just printing them out. That’s handy for, let’s say, debugging a text file. I don’t know if that term exists, but it’s probably the correct one. Anyway, here’s the code.

The Code

#!/usr/bin/perl

#	This program is free software: you can redistribute it and/or modify
#	it under the terms of the GNU General Public License as published by
#	the Free Software Foundation, either version 3 of the License, or
#	(at your option) any later version.
#
#	This program is distributed in the hope that it will be useful,
#	but WITHOUT ANY WARRANTY; without even the implied warranty of
#	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#	GNU General Public License for more details.
#
#	You should have received a copy of the GNU General Public License
#	along with this program.  If not, see <http://www.gnu.org/licenses/>.
#
#	Copyright (c) 2010 Ruel Pagayon <ruel@ruel.me> - http://ruel.me

use strict;
use warnings;

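# Read the whole file into a list of lines.
sub loadf {
    my ($path) = @_;
    # Lexical filehandle with 3-arg open; $! reports why the open failed.
    open(my $fh, '<', $path) or die("Couldn't open $path: $!\n");
    my @file = <$fh>;
    close($fh);
    return @file;
}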

{
	my @file = loadf("path-to-file.txt");
	my @inner = @file;
	my @dup = ( );
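	# $l1/$l2: outer/inner line numbers; $l0: set once the first line of a
	# group has been printed; $dc: duplicates found for the current line.
	my $l0 = 0; my $l1 = 0; my $l2 = 0; my $dc = 0;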
	foreach my $line (@file) {
		$l1++;
		$line =~ s/^\s+//;
		$line =~ s/\s+$//;
		foreach my $iline (@inner) {
			$l2++;
			$iline =~ s/^\s+//;
			$iline =~ s/\s+$//;
			next if ($l1 == $l2 || grep { $_ eq $l1} @dup );
			if ($iline eq $line) {
				$dc++;
				if ($dc > 0) {
					if ($l0 == 0) {
						print "Line " . $l1 . ": " . $line . "\n";
						$l0++;
					}
					print "Line " . $l2 . ": " . $iline . "\n";
					push (@dup, $l2);
				}
			}
		}
		print "\n" unless($dc == 0);
		$dc = 0; $l0 = 0; $l2 = 0;
	}
}

__END__

In case you have suggestions about this code, or want to download it without copy-pasting (silly), I posted it as a gist. But please do leave a comment if you have something in mind for this code.


  • x-way

    sort path-to-file.txt|uniq -d|xargs -L 1 -I '{}' grep -n -F '{}' path-to-file.txt
    gives you the same (for the machines with no perl installed…)

  • Anonymous

    uniq -c | sort -n | awk '{ if ($1 > 1) { print $0 }}'

    But really you can reduce run time and memory use by not using an O(N^2) algorithm

    my %seen;
    my $linenumber = 1;
    while (my $line = <>) {
        my $hash = hashof($line); # get a hashed value of $line
        unless (exists $seen{$hash}) {
            $seen{$hash} = [ $linenumber ];
        } else {
            print "Line $linenumber has been seen before on lines: " . join(", ", @{$seen{$hash}}) . $/ .
                  "Line $linenumber: $line";
            push @{$seen{$hash}}, $linenumber; # remember this occurrence too
        }
        $linenumber++;
    }

    You just have to write the hash function, hashof (I recommend using 64-bit numbers at most; there are lots of nice 32-bit hashes).
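
The single-pass idea in the comment above can be sketched in Python. This is only an illustration under stated assumptions: the names `line_key` and `find_repeats` are hypothetical, and `hashlib.blake2b` with an 8-byte digest is one possible stand-in for the 64-bit hash the commenter leaves to the reader.

```python
import hashlib
from collections import defaultdict


def line_key(line: str) -> bytes:
    """Return a 64-bit digest of a line (blake2b is an assumed choice)."""
    return hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()


def find_repeats(lines):
    """Yield (line_number, earlier_line_numbers) for each repeated line.

    Only digests and line numbers are kept in memory, not the text itself.
    Two different lines with the same digest would be misreported, which is
    the trade-off of keying on a hash instead of the full line.
    """
    seen = defaultdict(list)  # digest -> line numbers seen so far
    for number, line in enumerate(lines, 1):
        key = line_key(line.strip())
        if seen[key]:
            yield number, list(seen[key])
        seen[key].append(number)


sample = ["a\n", "b\n", "a\n", "c\n", "a\n"]
for number, earlier in find_repeats(sample):
    print("Line %d has been seen before on lines: %s"
          % (number, ", ".join(map(str, earlier))))
```

Each line is visited once, so the work grows linearly with the file size instead of quadratically. Passing an open file object instead of `sample` should work the same way, since the function only needs an iterable of lines.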

  • Anonymous

    x-way:

    uniq requires a sort fyi.

    and xargs isn’t per line. It’s per token.

    Here’s one that doesn’t spawn a billion grep processes:

    #!/bin/sh
    sort "$1" | uniq -d | fgrep -n -x -f- "$1"

    It tells fgrep to match full lines only, taking the patterns to match from stdin.
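
For comparison, the same two-pass idea (first find which lines are duplicated, then report every exact full-line match with its number) might look like this in Python. It is a rough sketch, not a drop-in replacement for the shell script; the function name `duplicated_with_numbers` is hypothetical.

```python
from collections import Counter


def duplicated_with_numbers(lines):
    """Return (line_number, text) for every line that occurs more than once.

    Pass 1 counts whole lines (the role of sort | uniq -d); pass 2 reports
    each exact full-line match with its number (the role of fgrep -n -x).
    """
    stripped = [line.rstrip("\n") for line in lines]
    counts = Counter(stripped)
    return [(number, text)
            for number, text in enumerate(stripped, 1)
            if counts[text] > 1]


sample = ["a\n", "b\n", "c\n", "a\n", "d\n"]
for number, text in duplicated_with_numbers(sample):
    print("%d:%s" % (number, text))
```

Unlike the Perl script in the post, this sketch compares lines exactly and does not trim leading or trailing whitespace, which mirrors the full-line matching of `-x`.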

  • Anonymous

    FYI: if you’re working with text data a lot, or code even, definitely take the time to play with the unix-y command line tools. The program you wrote seems to do roughly the equivalent of this:

    $ cat /tmp/f | nl | sort -k2 | uniq -f1 -D
    1 a
    4 a
    10 g
    8 g

    nl prepends each line with its line number, sort -k2 sorts the file while skipping the first field (the newly prepended line number), and uniq -f1 -D “uniq’s” the file (again ignoring the line numbers), printing out all lines that were duplicated. uniq requires the file to be sorted, but this also means that you do an O(n lg n) sort and an O(n) search instead of an O(n^2) nested loop. Not a big deal if you’re looking at duplicate code, but if you’re looking at a big CSV file or similar, it can make a big difference.

    $ cat /tmp/f
    a
    b
    c
    a
    d
    e
    f
    g
    h
    g
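
The number, sort, then group steps described in the comment above can also be sketched in Python, using the same ten-line sample file. This is an illustration only; `duplicate_groups` is a hypothetical name, and the output order within each group follows line number, which may differ from what the shell pipeline prints.

```python
from itertools import groupby


def duplicate_groups(lines):
    """Return a list of (text, [line numbers]) for lines seen more than once."""
    # Step 1 (the nl step): pair each line with its 1-based number.
    numbered = [(line.rstrip("\n"), number)
                for number, line in enumerate(lines, 1)]
    # Step 2 (the sort step): sort by text so equal lines become adjacent.
    numbered.sort()
    # Step 3 (the uniq step): keep only runs of equal text longer than one.
    groups = []
    for text, run in groupby(numbered, key=lambda pair: pair[0]):
        numbers = [number for _, number in run]
        if len(numbers) > 1:
            groups.append((text, numbers))
    return groups


sample = ["a\n", "b\n", "c\n", "a\n", "d\n", "e\n", "f\n", "g\n", "h\n", "g\n"]
for text, numbers in duplicate_groups(sample):
    for number in numbers:
        print(number, text)
```

The sort dominates the cost at O(n lg n), and the grouping is a single linear scan, which is the same complexity argument the comment makes for the pipeline.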

  • oslo

    lol, how easy these guys make it seem… nice ;)

  • http://perlbuzz.com Andy Lester

    Don’t add a “\n” to the filename when you open it. Better still, use the 3-arg open.

    • http://ruel.me Ruel

      Andy, it didn’t seem to work without the newline, but that was long ago, when Perl was still at 5.6 or so (that’s when I wrote that little subroutine). It works well now. Thanks for that. :)

  • py

    Consider Python again :)

    src = r'path-to-file.txt'

    lines = {}

    with open(src) as f:
        # Count from 1 so the numbers match what an editor shows.
        for idx, line in enumerate(f, 1):
            lines.setdefault(line.strip(), []).append(idx)

    for line in lines:
        # Skip blank lines and lines that occur only once.
        if line and len(lines[line]) > 1:
            for idx in lines[line]:
                print("Line %d: %s" % (idx, line))
            print()
    
  • http://ruel.me Ruel

    Thank you all for the input, I appreciate those corrections/suggestions. And I’ll be stepping into Python too. For this script, I just followed the SO OP’s requirements. But then again, thanks! :)

  • Venkat Chemist4

    Thanks for the code, it is useful for me to find duplicates in huge data.

    • http://ruel.me Ruel

      No problem. :)

  • Yoda2701977

    How can I make a condition so that, if some duplicate lines exist (uniq -d), they are printed to a file, and if not, the script just continues? If possible with sh.
    Thank you.