[olug] Regex question

Mon Oct 9 23:15:04 UTC 2006

Adam --

I don't know the answer to your regex issue, but for the problem you  
have, I'm not sure that you need regular expressions.

I think it would be faster and easier to walk through the file one
character at a time. Whenever a character matches a key in your
translation table, you swap it for that key's replacement. Something
like this:

======================================================================
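#!/usr/bin/env perl
use strict;
use warnings;

# Translation table: one input character => replacement string.
my %trans = (
     chr(224) => "Z",   # alpha -> Z
     chr(225) => "YyY", # beta -> YyY
     chr(226) => "X",   # gamma -> X
);

# Read line by line instead of slurping all of STDIN into memory.
while (my $line = <STDIN>) {
     my $out = '';
     # Build the output separately so a multi-character replacement
     # is never re-scanned, and use exists() so a replacement of "0"
     # or "" is still honored.
     for my $i (0 .. length($line) - 1) {
         my $key = substr($line, $i, 1);
         $out .= exists $trans{$key} ? $trans{$key} : $key;
     }
     print $out;
}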

======================================================================

 > perl -e 'print chr(224) . "a" . chr(225) . "b" . chr(226) . "c\n"' | perl trans.pl
ZaYyYbXc
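
If every replacement is a single character, as with your 226 -> dash
case, tr might be enough on its own. This is only a sketch: it assumes
the file is in a single-byte encoding rather than UTF-8, and
replace_226 is a name I made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: turn every byte 226 (octal 342) on stdin into a dash.
# LC_ALL=C makes tr treat the input as plain bytes, not multibyte characters.
replace_226() {
    LC_ALL=C tr '\342' '-'
}

# Feed it a line shaped like the ones in your output: byte 226, "022", tab, text.
printf '\342022\tSought applicants for mortgage\n' | replace_226
```

That should print "-022", a tab, and the rest of the line unchanged. tr
only maps one byte to one byte, so a multi-character replacement like
beta -> YyY still needs the Perl loop.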

On Oct 9, 2006, at 2:16 PM, Adam Haeder wrote:

> I'm working on a thorny regex issue. I have some text files that contain
> some lines that have extended ascii characters. I would like to replace
> those characters with a regular ascii character that is as much of a
> logical replacement as I can come up with.
>
> You can see an example of the lines at this link:
> http://www.adamhaeder.com/regex_more.jpg
>
> The image is what the lines look like when I run 'more' on the text
> file to view the output. I wrote (ok, ok, found online somewhere) a
> perl script to tell me exactly what this character is. Here's the script:
>
> #!/usr/bin/perl
> $FILE=$ARGV[0];
> open(FILE_HANDLE, $FILE) || die "Can't open $FILE\n";
> while (<FILE_HANDLE>)
> {
>  $line = $_;
>  @chars = split(//,$line);
>  foreach my $ch (@chars)
>  {
>   $new=ord($ch);
>   print "$ch -> $new\n";
>  }
> }
> close FILE_HANDLE;
>
>
> Here's the output relevant to the text in the image:
>
>  -> 10
> - -> 226
> 0 -> 48
> 2 -> 50
> 2 -> 50
>          -> 9
> S -> 83
> o -> 111
> u -> 117
> g -> 103
> h -> 104
> t -> 116
>   -> 32
> a -> 97
> p -> 112
> p -> 112
> l -> 108
> i -> 105
> c -> 99
> a -> 97
> n -> 110
> t -> 116
> s -> 115
>   -> 32
> f -> 102
> o -> 111
> r -> 114
>   -> 32
> m -> 109
> o -> 111
> r -> 114
> t -> 116
> g -> 103
> a -> 97
> g -> 103
> e -> 101
>
>  -> 10
> - -> 226
> 0 -> 48
> 2 -> 50
> 2 -> 50
>          -> 9
> F -> 70
> i -> 105
> l -> 108
> l -> 108
> e -> 101
> d -> 100
>   -> 32
> o -> 111
> u -> 117
> t -> 116
>   -> 32
> m -> 109
> o -> 111
> r -> 114
> t -> 116
> g -> 103
> a -> 97
> g -> 103
> e -> 101
>   -> 32
> a -> 97
> p -> 112
> p -> 112
> l -> 108
> i -> 105
> c -> 99
> a -> 97
> t -> 116
> i -> 105
> o -> 111
> n -> 110
> s -> 115
>
> So this tells me my extended ascii character is #226, which according to
> http://www.lookuptables.com/ is a weird upside down and backwards capital
> L (that's what it looks like to me anyway).
>
> So I'm trying to come up with a sed to replace this with something else,
> and I can't seem to get sed to match it.
>
> I want sed to replace ASCII 226 followed by two numbers with a dash.
> This sed line replaces everything _but_ our extended ASCII char:
>
> sed -r -e "s/[[:print:][:space:]]/-/g" $filename
>
> But the inverse doesn't work:
>
> sed -r -e "s/[^[:print:][:space:]]/-/g" $filename
>
> This regex works when passed to grep:
> grep -e "[^[:print:][:graph:]][0-9]{2}" $filename
>
> But the same regex _does not_ work when passed to sed.
>
> What am I doing wrong?
>
> - --
> Adam Haeder
> Vice President of Information Technology
> AIM Institute
> adamh at aiminstitute.org
> (402) 345-5025 x115
> PGP Public key: http://www.haederfamily.org/pgp.html

--
  Matt Anderson