[olug] Regex question
Matthew Anderson
manders2k.lists at gmail.com
Mon Oct 9 23:15:04 UTC 2006
Adam --
I don't know the answer to your regex issue, but for the problem you
have, I'm not sure that you need regular expressions.
I think it would be faster and easier just to walk through the
characters in the file, and examine them one by one. You swap a
character out for something else if the character matches an item in
your translation table. Something like this?
======================================================================
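#!/usr/bin/env perl
use strict;
use warnings;

# Translation table: extended character => ASCII replacement.
my %trans = (
    chr(224) => "Z",    # alpha -> Z
    chr(225) => "YyY",  # beta  -> YyY
    chr(226) => "X",    # gamma -> X
);

while (my $line = <STDIN>) {
    for (my $i = 0; $i < length($line); $i++) {
        my $key = substr($line, $i, 1);
        next unless exists $trans{$key};
        substr($line, $i, 1) = $trans{$key};
        # Skip past the replacement so it is never re-examined.
        $i += length($trans{$key}) - 1;
    }
    print $line;
}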
======================================================================
> perl -e 'print chr(224) . "a" . chr(225) . "b" . chr(226) . "c\n"' \
    | perl trans.pl
ZaYyYbXc
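If you would rather let Perl do the walking, the same table can drive a
single substitution per line. This is only a sketch under a few
assumptions: the input is treated as raw bytes, the table entries are the
same made-up examples as in the script, and translate() is a name I
invented for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same illustrative table as in the script: byte value => replacement.
my %trans = (
    chr(224) => "Z",
    chr(225) => "YyY",
    chr(226) => "X",
);

# Build a character class from the table keys, e.g. [\xe0\xe1\xe2].
my $class = join '', map { sprintf('\\x%02x', ord($_)) } sort keys %trans;
my $re    = qr/[$class]/;

# translate() is a hypothetical helper: it replaces every character
# found in %trans with its replacement and returns the new string.
sub translate {
    my ($line) = @_;
    $line =~ s/($re)/$trans{$1}/g;
    return $line;
}

# Act as a filter only when run directly as a script.
unless (caller) {
    binmode STDIN;     # read raw bytes, no encoding layer
    binmode STDOUT;
    print translate($_) while <STDIN>;
}
```

Because s///g keeps scanning after each replacement rather than inside it,
a multi-character replacement such as "YyY" is never looked at again.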
On Oct 9, 2006, at 2:16 PM, Adam Haeder wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I'm working on a thorny regex issue. I have some text files that contain
> some lines that have extended ascii characters. I would like to replace
> those characters with a regular ascii character that is as much of a
> logical replacement as I can come up with.
>
> You can see an example of the lines at this link:
> http://www.adamhaeder.com/regex_more.jpg
>
> The image is what the lines look like when I run 'more' on the text
> file to view the output. I wrote (ok, ok, found online somewhere) a
> perl script to tell me exactly what this character is. Here's the script:
>
> #!/usr/bin/perl
> $FILE=$ARGV[0];
> open(FILE_HANDLE, '<', $FILE) || die "Can't open $FILE: $!\n";
> while (<FILE_HANDLE>)
> {
>     $line = $_;
>     @chars = split(//,$line);
>     foreach my $ch (@chars)
>     {
>         # print each character next to its numeric code
>         $new=ord($ch);
>         print "$ch -> $new\n";
>     }
> }
> close FILE_HANDLE;
>
>
> Here's the output relevant to the text in the image:
>
> -> 10
> - -> 226
> 0 -> 48
> 2 -> 50
> 2 -> 50
> -> 9
> S -> 83
> o -> 111
> u -> 117
> g -> 103
> h -> 104
> t -> 116
> -> 32
> a -> 97
> p -> 112
> p -> 112
> l -> 108
> i -> 105
> c -> 99
> a -> 97
> n -> 110
> t -> 116
> s -> 115
> -> 32
> f -> 102
> o -> 111
> r -> 114
> -> 32
> m -> 109
> o -> 111
> r -> 114
> t -> 116
> g -> 103
> a -> 97
> g -> 103
> e -> 101
>
> -> 10
> - -> 226
> 0 -> 48
> 2 -> 50
> 2 -> 50
> -> 9
> F -> 70
> i -> 105
> l -> 108
> l -> 108
> e -> 101
> d -> 100
> -> 32
> o -> 111
> u -> 117
> t -> 116
> -> 32
> m -> 109
> o -> 111
> r -> 114
> t -> 116
> g -> 103
> a -> 97
> g -> 103
> e -> 101
> -> 32
> a -> 97
> p -> 112
> p -> 112
> l -> 108
> i -> 105
> c -> 99
> a -> 97
> t -> 116
> i -> 105
> o -> 111
> n -> 110
> s -> 115
>
> So this tells me my extended ascii character is #226, which according to
> http://www.lookuptables.com/ is a weird upside down and backwards capital
> L (that's what it looks like to me anyway).
>
> So I'm trying to come up with a sed to replace this with something else,
> and I can't seem to get sed to match it.
>
> I want sed to replace ASCII 226 followed by two numbers with a dash.
> This sed line replaces everything _but_ our extended ASCII char:
>
> sed -r -e "s/[[:print:][:space:]]/-/g" $filename
>
> But the inverse doesn't work:
>
> sed -r -e "s/[^[:print:][:space:]]/-/g" $filename
>
> This regex works when passed to grep:
> grep -e "[^[:print:][:graph:]][0-9]{2}" $filename
>
> But the same regex _does not_ work when passed to sed.
>
> What am I doing wrong?
>
> - --
> Adam Haeder
> Vice President of Information Technology
> AIM Institute
> adamh at aiminstitute.org
> (402) 345-5025 x115
> PGP Public key: http://www.haederfamily.org/pgp.html
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2 (GNU/Linux)
>
> iD8DBQFFKqACbHC3IXlHqBQRAgPLAJ9R/vltSDck3rv008j/mgS0Bh3QDwCdHyDf
> +alQVcIfrImKTmEaMWJ9dBw=
> =X/Al
> -----END PGP SIGNATURE-----
--
Matt Anderson