Friday, November 11, 2011

PERL REGEX REGULAR EXPRESSIONS




REVISED: Sunday, March 3, 2013




You will learn how to use Perl REGEX, regular expressions.

I.  REGEX

In its simplest form, a regular expression is just a word or phrase to search for.

REGEX is short for REGular EXpressions.  Perl's REGEX syntax makes it easy to do the following:

Complex string comparisons.

Complex string selections.

Complex string replacements.

Parsing based on the above abilities.

A.  COMPLEX STRING COMPARISONS

1.  The following is a very basic string logical comparison:

$string =~ m/sought_text/;

The above returns true if the string $string contains substring "sought_text", and false otherwise.

2.  If you only want the strings where the sought_text appears at the "very beginning" of $string, write:

$string =~ m/^sought_text/;

3.  The $ anchor indicates "end of string" (it also matches just before a trailing newline). If you want to find out whether the sought_text is the very last text in $string, write:

$string =~ m/sought_text$/;

4.  If you want the comparison to be true only if $string contains the sought_text and nothing but the sought_text, write:

$string =~ m/^sought_text$/;

5.  If you want the comparison to be case insensitive, add the letter i after the ending delimiter:

$string =~ m/^sought_text$/i;
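
The five comparisons above can be tried side by side. The following is only a minimal sketch; the sample strings and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the original post.
my $string = "Sought_Text";

# Unanchored: true if the pattern occurs anywhere in the string.
my $contains = ("some sought_text here" =~ m/sought_text/) ? 1 : 0;

# ^ anchors the match to the very beginning of the string.
my $starts = ("sought_text here" =~ m/^sought_text/) ? 1 : 0;

# $ anchors the match to the very end of the string.
my $ends = ("some sought_text" =~ m/sought_text$/) ? 1 : 0;

# Both anchors: the string must hold the pattern and nothing else.
my $exact = ("some sought_text" =~ m/^sought_text$/) ? 1 : 0;

# The i modifier makes the same exact match ignore case.
my $exact_i = ($string =~ m/^sought_text$/i) ? 1 : 0;

print "contains=$contains starts=$starts ends=$ends exact=$exact exact_i=$exact_i\n";
# prints: contains=1 starts=1 ends=1 exact=0 exact_i=1
```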

6.  Wild Cards

.   Match any character except newline
\w  Match "word" character alphanumeric, plus "_"
\W  Match non-word character
\s  Match whitespace character
\S  Match non-whitespace character
\d  Match digit character
\D  Match non-digit character
\t  Match tab
\n  Match newline
\r  Match return
\f  Match formfeed
\a  Match alarm bell, beep, etc.
\e  Match escape
\021  Match octal char 21
\xf0  Match hex char f0
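
As a rough illustration of a few of these wild cards, here is a small sketch; the sample line and variable names are assumptions chosen only for the example:

```perl
use strict;
use warnings;

# Hypothetical sample: a word, a space, two digits, a tab, another word.
my $line = "Order 66\tshipped";

my $has_digit = ($line =~ /\d/) ? 1 : 0;   # contains at least one digit
my $has_tab   = ($line =~ /\t/) ? 1 : 0;   # contains a tab character

# Five word characters, whitespace, two digits, whitespace (the tab), a word character.
my $shape = ($line =~ /^\w\w\w\w\w\s\d\d\s\w/) ? 1 : 0;

# \D finds a non-digit, so it fails on a string made only of digits.
my $non_digit = ("2011" =~ /\D/) ? 1 : 0;

print "has_digit=$has_digit has_tab=$has_tab shape=$shape non_digit=$non_digit\n";
# prints: has_digit=1 has_tab=1 shape=1 non_digit=0
```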

7.  Repetition

*      Match 0 or more times
+      Match 1 or more times
?      Match 1 or 0 times
{n}    Match exactly n times
{n,}   Match at least n times
{n,m}  Match at least n but not more than m times
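
The repetition counts can be checked with a short sketch like the one below; the word lists are invented for the example:

```perl
use strict;
use warnings;

# {n,m}: an "a", then two to four "b"s, then a "c", and nothing else.
my @b_runs = grep { /^ab{2,4}c$/ } qw(abc abbc abbbbc abbbbbc);

# ?: the "u" is optional, so both spellings match.
my @colors = grep { /^colou?r$/ } qw(color colour colouur);

# {n}: exactly four digits.
my @years = grep { /^\d{4}$/ } qw(2011 13 20130);

# + needs at least one digit; * also accepts none.
my $plus = ("id-" =~ /^id-\d+$/) ? 1 : 0;
my $star = ("id-" =~ /^id-\d*$/) ? 1 : 0;

print "@b_runs | @colors | @years | plus=$plus star=$star\n";
# prints: abbc abbbbc | color colour | 2011 | plus=0 star=1
```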

8.  Using Groups ( ) in Matching

Groups are regular expression characters surrounded by parentheses ( ).

a.  Groups are used to allow alternative phrases; e.g.: 

/(Jack|Jill|Hill)/i

For single character alternatives, use character classes [ ].  Everything inside the square brackets represents one character, listing all its alternative possibilities.  There are two commonly used special characters inside the square brackets:

A hyphen (-) indicates all characters in the collating sequence between the character on the hyphen's left and the character on the hyphen's right.

A caret (^) immediately following the opening square bracket means "anything but these characters", and effectively negates the character class.

Character classes have three main advantages:

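i.     Shorthand notation, as [AEIOUY] instead of (A|E|I|O|U|Y).
ii.   Character ranges, such as [A-Z].
iii.  One to one mapping from one set of characters to another with the tr operator; e.g.: tr/a-z/A-Z/.  Note that tr takes plain character lists and ranges, so the square brackets are not written there.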
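
A small sketch of alternation groups and character classes in use follows; the sample text and word lists are assumptions for illustration only:

```perl
use strict;
use warnings;

# Alternation: any one of the three names, ignoring case.
my $text = "Jill went up the hill";
my $name_found = ($text =~ /(Jack|Jill|Hill)/i) ? 1 : 0;

# Character class: words whose first letter is a vowel (or y).
my @vowel_start = grep { /^[AEIOUY]/i } qw(apple Banana orange yak Kiwi);

# Negated class with a range: strings containing no digit at all.
my @no_digits = grep { /^[^0-9]+$/ } qw(abc a1c 42 xyz);

print "name_found=$name_found | @vowel_start | @no_digits\n";
# prints: name_found=1 | apple orange yak | abc xyz
```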

b.  Groups are also used as a means of retrieving selected text in selection, translation and substitution.  The text matched by each group is stored in the scalars $1, $2, etc., numbered by the order of the opening parentheses.
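
For instance, groups can pull the parts of a date out of a string. This is only a sketch; the date string and the variable names are chosen for illustration:

```perl
use strict;
use warnings;

my $date = "Friday, November 11, 2011";
my ($month, $day, $year);

# Three groups: a word, a run of digits, and a four-digit year.
if ($date =~ /(\w+) (\d+), (\d{4})/) {
    # Copy the captures right away; the next successful match overwrites them.
    ($month, $day, $year) = ($1, $2, $3);
}

print "month=$month day=$day year=$year\n";
# prints: month=November day=11 year=2011
```

It is safest to read $1, $2, etc. only after checking that the match succeeded, as the if statement does here.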

B.  COMPLEX STRING SELECTIONS

Replace every "Jack" with "Jill".  The g modifier makes the substitution global; without it, only the first "Jack" is replaced.

$string =~ s/Jack/Jill/g;
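
The substitution operator replaces only the first match unless the g modifier is given.  A minimal sketch of the difference, using a made-up sentence:

```perl
use strict;
use warnings;

my $original = "Jack and Jack went up the hill";

# Work on copies so the original string stays untouched.
(my $first_only = $original) =~ s/Jack/Jill/;    # first match only
(my $every      = $original) =~ s/Jack/Jill/g;   # all matches

# In scalar context s///g returns the number of replacements made.
my $copy  = $original;
my $count = ($copy =~ s/Jack/Jill/g);

print "$first_only\n$every\ncount=$count\n";
# prints:
# Jill and Jack went up the hill
# Jill and Jill went up the hill
# count=2
```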

C.  COMPLEX STRING REPLACEMENTS

Translations are like substitutions, except they happen on a letter by letter basis instead of substituting a single phrase for another single phrase.  The tr operator takes plain lists and ranges of characters, not regular expressions, so square brackets and commas are not written; inside tr they would be treated as ordinary characters.

To make all vowels upper case:

$string =~ tr/aeiouy/AEIOUY/;

Change everything to upper case:

$string =~ tr/a-z/A-Z/;

Change everything to lower case:

$string =~ tr/A-Z/a-z/;
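
A quick way to see translation at work is to apply it to copies of one string.  The sketch below assumes a sample string chosen for illustration:

```perl
use strict;
use warnings;

my $s = "hello, world";

# Each character in the first list maps to the character
# at the same position in the second list.
(my $vowels = $s) =~ tr/aeiouy/AEIOUY/;

# Ranges map position by position as well.
(my $upper = $s) =~ tr/a-z/A-Z/;

# With an empty replacement list tr changes nothing
# and simply returns how many characters matched.
my $lower_count = ($s =~ tr/a-z//);

print "$vowels\n$upper\nlower_count=$lower_count\n";
# prints:
# hEllO, wOrld
# HELLO, WORLD
# lower_count=10
```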

II.  SYMBOLS

=~

This binding operator appears between the string variable and the regular expression.  With a match, it tests the variable against the pattern.  With a substitution or translation, the operation modifies the variable rather than just comparing; for example:

$string =~ m/Jack/;
#return true if var $string contains the name Jack

$string =~ s/Jack/Jill/;
#replace Jack with Jill

!~

Just like =~, except negated.   Returns true if it does not match. 
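
As a small sketch of the negated form, the following keeps only the lines that do not mention Jack; the sample lines are made up:

```perl
use strict;
use warnings;

my @lines = ("Jack fell down", "Jill came tumbling after");

# !~ is true when the pattern does NOT match.
my @without_jack = grep { $_ !~ /Jack/ } @lines;

# The same test written with =~ and an explicit negation.
my @same_result = grep { !($_ =~ /Jack/) } @lines;

print "$_\n" for @without_jack;
# prints: Jill came tumbling after
```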

/

This is the usual "delimiter" for the text part of a regular expression; for example:

$string =~ m/Jack/;
#return true if var $string contains the name Jack

$string =~ s/Jack/Jill/;
#replace Jack with Jill

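If the sought-after text contains slashes, it is easier to use another delimiter, such as the pipe symbol (|), so that the slashes do not have to be escaped with backslashes.  With any delimiter other than the slash, the leading m of a match is required.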
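
The sketch below compares escaped slashes with pipe delimiters on a file path; the path itself is only an example:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

# Slash delimiters: every slash in the pattern must be escaped.
my $with_slashes = ($path =~ /\/usr\/local\//) ? 1 : 0;

# Pipe delimiters: the same pattern, with no escaping needed.
my $with_pipes = ($path =~ m|/usr/local/|) ? 1 : 0;

# Substitution accepts alternative delimiters too.
(my $moved = $path) =~ s|/usr/local|/opt|;

print "with_slashes=$with_slashes with_pipes=$with_pipes moved=$moved\n";
# prints: with_slashes=1 with_pipes=1 moved=/opt/bin/perl
```

One trade-off: when the pipe is the delimiter, it can no longer be used unescaped for alternatives inside the pattern.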

m

This is the "match" operator.  The match operator comes before the opening delimiter.   The match operator means read the string expression on the left of the =~, and see if any part of it matches the expression within the delimiters following the m.  If the delimiters are slashes (/), the m is optional and often not included.  Whether the match operator is there or not, it is still a match operation; for example:

$string =~ m/Jack/;
#return true if var $string contains the name Jack

$string =~ /Jack/;
#same result as previous statement

^

This is the "beginning of line" symbol.  When used immediately after the starting delimiter, it signifies "at the beginning of the line"; for example:

$string =~ m/^Jack/;
#true only when "Jack" is the first text in the string

$

This is the "end of line" symbol.  When used immediately before the ending delimiter, the "end of line" symbol signifies "at the end of the line"; for example:

$string =~ m/Jack$/;
#true only when "Jack" is the last text in the string

i

This is the "case insensitivity" modifier when used immediately after the closing delimiter; for example:

$string =~ m/Jack/i;
#true when $string contains "Jack" or "jAcK"

III.  TABLES


Some characters have a special meaning to the searcher. These characters are called metacharacters. 

METACHARACTERS

CHAR  MEANING
^     beginning of string
$     end of string
.     any character except newline
*     match 0 or more times
+     match 1 or more times
?     match 0 or 1 times; or: shortest match
|     alternative
( )   grouping; “storing”
[ ]   set of characters
{ }   repetition modifier
\     quote or special


REPETITION

a*           zero or more a’s
a+           one or more a’s
a?           zero or one a’s (i.e., optional a)
a{m}         exactly m a’s
a{m,}        at least m a’s
a{m,n}       at least m but at most n a’s
repetition?  same as repetition, but the shortest match is taken
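
The effect of the trailing ? on a repetition can be seen in a short sketch; the markup string here is just sample data:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# .+ is greedy: it runs to the LAST ">" in the string.
my ($greedy) = $html =~ /<(.+)>/;

# .+? takes the shortest match: it stops at the FIRST ">".
my ($shortest) = $html =~ /<(.+?)>/;

print "greedy=$greedy\nshortest=$shortest\n";
# prints:
# greedy=b>bold</b> and <i>italic</i
# shortest=b
```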



SPECIAL NOTATIONS WITH \

Single characters
\t    tab
\n    newline
\r    return (CR)
\xhh  character with hex. code hh

“Zero-width assertions”
\b    “word” boundary
\B    not a “word” boundary

Matching
\w    matches any single character classified as a “word” character (alphanumeric or “_”)
\W    matches any non-“word” character
\s    matches any whitespace character (space, tab, newline)
\S    matches any non-whitespace character
\d    matches any digit character, equiv. to [0-9]
\D    matches any non-digit character


CHARACTER CLASS [...]

Inside a "character class" denoted by [...] the following rules apply:

[characters]  matches any of the characters in the sequence
[x-y]         matches any of the characters from x to y (inclusively) in the ASCII code
[\-]          matches the hyphen character “-”
[\n]          matches the newline; other single character denotations with \ apply normally, too
[^something]  matches any character except those that [something] denotes; that is, immediately after the leading “[”, the circumflex “^” means “not” applied to all of the rest

EXPRESSION  MATCHES
abc         abc (that exact character sequence, but anywhere in the string)
^abc        abc at the beginning of the string
abc$        abc at the end of the string
a|b         either of a and b
^abc|abc$   the string abc at the beginning or at the end of the string
ab{2,4}c    an a followed by two, three or four b’s followed by a c
ab{2,}c     an a followed by at least two b’s followed by a c
ab*c        an a followed by any number (zero or more) of b’s followed by a c
ab+c        an a followed by one or more b’s followed by a c
ab?c        an a followed by an optional b followed by a c; that is, either abc or ac
a.c         an a followed by any single character (not newline) followed by a c
a\.c        a.c exactly
[abc]       any one of a, b and c
[Aa]bc      either of Abc and abc
[abc]+      any (nonempty) string of a’s, b’s and c’s (such as aabbaacbabcacaa)
[^abc]+     any (nonempty) string which does not contain any of a, b and c (such as defg)
\d\d        any two decimal digits, such as 25; same as \d{2}
\w+         a “word”: a nonempty sequence of alphanumeric characters and low lines (underscores), such as cp3o and r2d2 and cool_1
100\s*mk    the strings 100 and mk optionally separated by any amount of white space (spaces, tabs, newlines)
abc\b       abc when followed by a word boundary (e.g. in abc! but not in abcd)
perl\B      perl when not followed by a word boundary (e.g. in perlert but not in perl stuff)
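
A few rows of the table can be checked directly.  The following is a minimal sketch that reuses the table's own sample strings; the variable names are made up:

```perl
use strict;
use warnings;

# abc\b: abc must be followed by a word boundary.
my @boundary = grep { /abc\b/ } ("abc!", "abcd");

# perl\B: perl must NOT be followed by a word boundary.
my @no_boundary = grep { /perl\B/ } ("perlert", "perl stuff");

# 100\s*mk: any amount of white space, including none, between the parts.
my @spaced = grep { /100\s*mk/ } ("100mk", "100   mk", "100-mk");

# a\.c: the backslash makes the dot literal, so only "a.c" matches.
my @literal_dot = grep { /^a\.c$/ } ("a.c", "abc");

print join(" | ", "@boundary", "@no_boundary", scalar(@spaced), "@literal_dot"), "\n";
# prints: abc! | perlert | 2 | a.c
```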


You have learned how to use Perl REGEX, regular expressions.

Elcric Otto Circle

















