Identifying Duplicate Lines in a Text File
Checking a text file for duplicate entries has never been easy. There are simple methods, like firing up Notepad and searching for the whole line, but what if you need to identify the line numbers?
Why and How?
Recently, I coded a duplicate line identifier in Perl. I was actually planning to do it in Python, but for the sake of answering this question, I wrote it in Perl. It took me several minutes to get the general idea of how to answer that question completely, and I guess I just succeeded.
As for the code, I used that new style of mine that I mentioned about two blog posts ago, and it worked well. I'm a bit worried about my variable names, though; they make me feel like I coded a mess. But maybe that's just me.
The code is pretty simple to understand, but because of the nested loops I don't recommend tracing it by hand, except on a 2 or 3 line file. What makes this different from other scripts is that it identifies the line numbers of the duplicates, rather than removing the duplicates or just printing them out. That's handy for, let's say, debugging a text file. I don't know if that's a real term, but it's probably the right one. Anyway, here's the code.
The Code
#!/usr/bin/perl
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>.
#
# Copyright (c) 2010 Ruel Pagayon <ruel@ruel.me> - http://ruel.me
use strict;
use warnings;

# Read the whole file into a list of lines.
sub loadf {
    my ($path) = @_;
    open(my $fh, '<', $path) or die("Couldn't open $path: $!\n");
    my @file = <$fh>;
    close($fh);
    return @file;
}

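{
    my @file  = loadf("path-to-file.txt");
    my @inner = @file;
    my @dup   = ( );    # line numbers already reported as duplicates
    my $l0 = 0;         # set once the first line of a group has been printed
    my $l1 = 0;         # line number in the outer loop
    my $l2 = 0;         # line number in the inner loop
    my $dc = 0;         # duplicates found for the current outer line

    foreach my $line (@file) {
        $l1++;
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        foreach my $iline (@inner) {
            $l2++;
            $iline =~ s/^\s+//;
            $iline =~ s/\s+$//;
            # Skip the line itself and any outer line already reported.
            next if ($l1 == $l2 || grep { $_ == $l1 } @dup);
            if ($iline eq $line) {
                $dc++;
                # Print the outer line once, before its first duplicate.
                if ($l0 == 0) {
                    print "Line " . $l1 . ": " . $line . "\n";
                    $l0++;
                }
                print "Line " . $l2 . ": " . $iline . "\n";
                push(@dup, $l2);
            }
        }
        # Separate groups with a blank line, then reset the counters.
        print "\n" unless ($dc == 0);
        $dc = 0; $l0 = 0; $l2 = 0;
    }
}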
__END__
If you have suggestions about this code, or if you want to download it instead of copying and pasting, I posted it as a gist. Please do leave a comment if you have something in mind for this code.
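For bigger files, one possible alternative to the nested loops is a single pass that remembers each line's text in a hash. This is only a sketch under my own assumptions, not the script above: `find_duplicates` and the sample lines are names and data made up for illustration. Like the original, it trims leading and trailing whitespace before comparing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): takes a list of
# lines and returns one array ref per duplicated text, holding the trimmed
# text followed by every line number where it appears.
sub find_duplicates {
    my @lines = @_;
    my %seen;     # trimmed text => array ref of line numbers
    my @order;    # trimmed texts in order of first appearance
    my $n = 0;
    foreach my $line (@lines) {
        $n++;
        (my $text = $line) =~ s/^\s+|\s+$//g;
        push @order, $text unless exists $seen{$text};
        push @{ $seen{$text} }, $n;
    }
    # Keep only the texts that occur on more than one line.
    return map  { [ $_, @{ $seen{$_} } ] }
           grep { @{ $seen{$_} } > 1 } @order;
}

# Sample data standing in for the contents of a file.
my @sample = ("apple\n", "banana\n", "  apple \n", "cherry\n", "banana\n");

foreach my $group (find_duplicates(@sample)) {
    my ($text, @numbers) = @$group;
    print "Line $_: $text\n" foreach @numbers;
    print "\n";
}
```

With the sample data, this reports lines 1 and 3 as one group and lines 2 and 5 as another. To run it on a real file, the list returned by `loadf` could be passed in place of `@sample`.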