Core iTOps Tube

Thursday, 26 April 2012

Help in modifying existing Perl Script to produce report of dupes

Hello,

I have a large amount of data with the following structure:

Word=Transliterated word

I have written a Perl script (reproduced below) which goes through the full file and identifies all dupes on the right-hand side. It successfully creates a new file with two headers: Singletons and Dupes.

I have tried to modify the script so that it additionally produces a report listing the frequency count of all dupes. Thus, in the sample provided, I would like to know how many different ways the dupe Albert has been transliterated. I am providing pseudo-data since the original data is in a foreign script.


Quote:

Albert=albt
Albert=albut
Albert=albat
Mary=mari
Mary=meri
Mary=merry
Mary=marey

The script should give me a report in a separate output with the following structure:


Quote:

Albert,3,albt,albut,albat
Mary,4,mari,meri,merry,marey

The final output would thus consist of two files:

The output file listing Singletons and Dupes

The report which would have the dupes listed along with their frequency.

I am not very good at generating reports in Perl, hence this request.

The Perl script follows.

Many thanks for excellent help and advice given.



code

#!/usr/bin/perl

$dupes = $singletons = ""; # This goes at the head of the file

do {
    $dupefound = 0; # These go at the head of the loop
    $text = $line = $prevline = $name = $prevname = "";
    do {
        $line = <>;
        # The name is the part of the line before the "=" sign
        $line =~ /^(.+)\=.+$/ and $name = $1;
        $prevline =~ /^(.+)\=.+$/ and $prevname = $1;
        if ($name eq $prevname) { $dupefound += 1 }
        $text .= $line;
        $prevline = $line;
    } until ($dupefound > 0 and $text !~ /^(.+?)\=.*?\n(?:\1=.*?\n)+\z/m) or eof;
    # Move a run of consecutive lines sharing the same name into $dupes
    if ($text =~ s/(^(.+?)\=.*?\n(?:\2=.*?\n)+)//m) { $dupes .= $1 }
    # Whatever is left over is treated as singletons
    $singletons .= $text;
} until eof;

# "\D" is not a valid escape in a double-quoted string, so the stray backslash is removed
print "SINGLETONS\n$singletons\nDUPES\n$dupes";

/code
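
For reference, the report described above could be sketched roughly as follows. This is only a minimal sketch, not a modification of the script above: the helper name build_dupe_report is hypothetical, the extra singleton line John=jon is invented for the demonstration, and it assumes that every line has the form Word=transliteration and that every occurrence is counted (identical transliterations of the same word would be counted twice).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): takes lines of the
# form "Word=transliteration" and returns one report line per word that
# occurs more than once, formatted as "Word,count,variant1,variant2,...".
sub build_dupe_report {
    my @lines = @_;
    my (%variants, @order);

    for my $line (@lines) {
        $line =~ s/\r?\n\z//;    # strip the line ending
        # Split on the first "="; lines that do not match are skipped.
        my ($word, $translit) = $line =~ /^([^=]+)=(.+)$/ or next;
        # Remember the order in which words first appear.
        push @order, $word unless exists $variants{$word};
        push @{ $variants{$word} }, $translit;
    }

    my @report;
    for my $word (@order) {
        my @found = @{ $variants{$word} };
        next if @found < 2;      # singletons are left out of the report
        push @report, join(',', $word, scalar @found, @found);
    }
    return @report;
}

# Demonstration with the pseudo-data from the post plus one assumed singleton.
my @sample = (
    "Albert=albt\n", "Albert=albut\n", "Albert=albat\n",
    "Mary=mari\n",   "Mary=meri\n",    "Mary=merry\n", "Mary=marey\n",
    "John=jon\n",
);
print "$_\n" for build_dupe_report(@sample);
```

With the sample data this prints the two lines Albert,3,albt,albut,albat and Mary,4,mari,meri,merry,marey. Because the variants are collected in a hash keyed by the word, the input would not need to be sorted. To run it on a real file, the @sample list could be replaced by lines read with <> and the output redirected to a separate report file.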



