[olug] Regex question

Mon Oct 9 19:16:18 UTC 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm working on a thorny regex issue. I have some text files in which
some lines contain extended ASCII characters. I would like to replace
each of those characters with whatever regular ASCII character is the
most logical substitute I can come up with.

You can see an example of the lines at this link:
http://www.adamhaeder.com/regex_more.jpg

The image shows what the lines look like when I run 'more' on the text
file. I wrote (ok, ok, found online somewhere) a Perl script to tell me
exactly what this character is. Here's the script:

#!/usr/bin/perl
use strict;
use warnings;

# Print every character of the file next to its numeric value.
my $file = $ARGV[0];
open(my $fh, '<', $file) || die "Can't open $file: $!\n";
while (my $line = <$fh>)
{
 foreach my $ch (split(//, $line))
 {
  my $new = ord($ch);
  print "$ch -> $new\n";
 }
}
close $fh;
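
For a quick cross-check without Perl, od can print the same numeric byte
values as the ord() loop. This is only a sketch: the sample line is made
up (built with printf to mimic the kind of line in question), and the
variable name dump is just for illustration.

```shell
#!/bin/sh
# Hypothetical sample line: byte 226 (octal 342), the text "022",
# a tab, and a word. printf reads at most three octal digits after
# the backslash, so '\342' is one byte and '022' is plain text.
# od -An -tu1 prints each byte as an unsigned decimal number;
# tr squeezes the padding down to single spaces.
dump=$(printf '\342022\tSought\n' | od -An -tu1 | tr -s ' \n' ' ')
echo "$dump"
```

The list should start with 226, followed by 48 50 50 for the digits and
9 for the tab. To inspect a real file, pass its name to od instead of
piping in the printf output.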

Here's the output relevant to the text in the image:

 -> 10
- -> 226
0 -> 48
2 -> 50
2 -> 50
         -> 9
S -> 83
o -> 111
u -> 117
g -> 103
h -> 104
t -> 116
  -> 32
a -> 97
p -> 112
p -> 112
l -> 108
i -> 105
c -> 99
a -> 97
n -> 110
t -> 116
s -> 115
  -> 32
f -> 102
o -> 111
r -> 114
  -> 32
m -> 109
o -> 111
r -> 114
t -> 116
g -> 103
a -> 97
g -> 103
e -> 101

 -> 10
- -> 226
0 -> 48
2 -> 50
2 -> 50
         -> 9
F -> 70
i -> 105
l -> 108
l -> 108
e -> 101
d -> 100
  -> 32
o -> 111
u -> 117
t -> 116
  -> 32
m -> 109
o -> 111
r -> 114
t -> 116
g -> 103
a -> 97
g -> 103
e -> 101
  -> 32
a -> 97
p -> 112
p -> 112
l -> 108
i -> 105
c -> 99
a -> 97
t -> 116
i -> 105
o -> 111
n -> 110
s -> 115

So this tells me my extended ASCII character is #226, which according to
http://www.lookuptables.com/ is a weird upside-down and backwards capital
L (that's what it looks like to me anyway).
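
Since most tools want this value as a hex or octal escape rather than as
decimal, here is a small conversion sketch. The variable name escapes is
only for illustration. For what it's worth, in the IBM PC code page 437
table, byte 226 is the Greek capital gamma, which would fit the
"upside-down backwards L" description; that the lookup table uses that
code page is an assumption on my part.

```shell
#!/bin/sh
# Convert decimal 226 to hexadecimal and octal notation.
escapes=$(printf 'hex=%x octal=%o' 226 226)
echo "$escapes"   # prints: hex=e2 octal=342
```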

So I'm trying to come up with a sed command to replace this with
something else, and I can't seem to get sed to match it.

I want sed to replace ASCII 226 followed by two digits with a dash.
This sed line replaces everything _but_ our extended ASCII char:

sed -r -e "s/[[:print:][:space:]]/-/g" $filename

But the inverse doesn't work:

sed -r -e "s/[^[:print:][:space:]]/-/g" $filename

This regex works when passed to grep:

grep -e "[^[:print:][:graph:]][0-9]{2}" $filename

But the same regex _does not_ work when passed to sed.
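
One thing that may be worth ruling out is the locale. This is an
assumption, not something established above: if the shell runs in a
UTF-8 locale, a lone byte 226 is not a valid character there, and sed
may never get the chance to match it with a bracket expression. A
minimal sketch, assuming GNU sed (for -r) and a made-up sample line; the
variable names byte and fixed are hypothetical:

```shell
#!/bin/sh
# Put the raw byte 226 (octal 342) into a shell variable.
byte=$(printf '\342')

# Hypothetical sample line: byte 226, two digits, a tab, some text.
# LC_ALL=C asks sed to treat its input as single bytes instead of
# multibyte characters, so the literal byte in the pattern can match.
fixed=$(printf '\34222\tSought applicants\n' |
        LC_ALL=C sed -r -e "s/${byte}[0-9]{2}/-/g")
echo "$fixed"
```

If the substitution works with LC_ALL=C but not without it, the locale
is the likely culprit rather than the regex itself.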

What am I doing wrong?

- --
Adam Haeder
Vice President of Information Technology
AIM Institute
adamh at aiminstitute.org
(402) 345-5025 x115
PGP Public key: http://www.haederfamily.org/pgp.html
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iD8DBQFFKqACbHC3IXlHqBQRAgPLAJ9R/vltSDck3rv008j/mgS0Bh3QDwCdHyDf
+alQVcIfrImKTmEaMWJ9dBw=
=X/Al
-----END PGP SIGNATURE-----