PERL REGEX REGULAR EXPRESSIONS

REVISED: Sunday, March 3, 2013

You will learn how to use Perl REGEX, regular expressions.

I. REGEX

In its simplest form, a regular expression is just a word or phrase to search for.

REGEX are Perl REGular EXpressions. Perl REGEX syntax makes it easy to do the following:

Complex string comparisons.

Complex string selections.

Complex string replacements.

Parsing based on the above abilities.

A. COMPLEX STRING COMPARISONS

1. The following is a very basic string logical comparison:

$string =~ m/sought_text/;

The above returns true if the string $string contains substring "sought_text", and false otherwise.

2.  The ^ anchor indicates "beginning of string". If you only want the strings where the sought_text appears at the very beginning of $string, write:

$string =~ m/^sought_text/;

3.  The $ anchor indicates "end of string" (it also matches just before a newline that ends the string). If you want to find out whether the sought_text is the very last text in $string, write:

$string =~ m/sought_text$/;

4.  If you want the comparison to be true only if $string contains the sought_text and nothing but the sought_text, write:

$string =~ m/^sought_text$/;

5. If you want the comparison to be case insensitive add the letter i after the ending delimiter:

$string =~ m/^sought_text$/i;
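The comparisons above can be tried side by side in a short script. This is only a minimal sketch: the sample value of $string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample data, invented for this illustration.
my $string = "Sought_Text at the start";

my $contains   = ($string =~ m/sought_text/)    ? 1 : 0;  # anywhere, case sensitive
my $contains_i = ($string =~ m/sought_text/i)   ? 1 : 0;  # anywhere, case insensitive
my $at_start_i = ($string =~ m/^sought_text/i)  ? 1 : 0;  # only at the beginning
my $at_end_i   = ($string =~ m/sought_text$/i)  ? 1 : 0;  # only at the end
my $whole_i    = ($string =~ m/^sought_text$/i) ? 1 : 0;  # the whole string

print "contains=$contains contains_i=$contains_i ",
      "start=$at_start_i end=$at_end_i whole=$whole_i\n";
# prints: contains=0 contains_i=1 start=1 end=0 whole=0
```

The first test fails only because of letter case, which is what the i modifier addresses.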

6.  Wild Cards

.  Match any character except newline
\w  Match "word" character (alphanumeric plus "_")
\W  Match non-word character
\s  Match whitespace character
\S  Match non-whitespace character
\d  Match digit character
\D  Match non-digit character
\t  Match tab
\n  Match newline
\r  Match return
\f  Match formfeed
\a  Match alarm bell, beep, etc.
\e  Match escape
\021  Match octal char 21
\xf0  Match hex char f0
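A few of the wild cards above, applied to one line of text. This is a sketch under assumptions: the log line and the variable names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical log line, made up for this example; \t here is a real tab.
my $line = "user_42 logged in\tat 09:15";

my $has_tab  = $line =~ /\t/         ? 1 : 0;  # a tab character
my $has_time = $line =~ /\d\d:\d\d/  ? 1 : 0;  # two digits, a colon, two digits

# In list context a match returns what its parentheses captured.
my ($first_word) = $line =~ /(\w+)/;   # first run of word characters
my ($first_gap)  = $line =~ /(\s)/;    # first whitespace character

print "tab=$has_tab time=$has_time word=$first_word\n";
# prints: tab=1 time=1 word=user_42
```

Note that \w+ takes in the underscore and the digits, so "user_42" counts as one word.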

7.  Repetition

* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
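To see how the repetition symbols differ, the sketch below runs several made-up strings against one pattern per symbol. The strings, the labels and the %matched_by hash are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Sample strings, invented for this illustration.
my @candidates = ("ac", "abc", "abbc", "abbbbbc");

# Each entry pairs a label with a compiled pattern (qr// builds a regex object).
my @patterns = (
    [ 'b*',     qr/^ab*c$/     ],
    [ 'b+',     qr/^ab+c$/     ],
    [ 'b?',     qr/^ab?c$/     ],
    [ 'b{2}',   qr/^ab{2}c$/   ],
    [ 'b{2,}',  qr/^ab{2,}c$/  ],
    [ 'b{1,2}', qr/^ab{1,2}c$/ ],
);

# For each string, record the labels of the patterns that match it.
my %matched_by;
for my $s (@candidates) {
    my @labels = map { $_->[0] } grep { $s =~ $_->[1] } @patterns;
    $matched_by{$s} = join ' ', @labels;
    printf "%-8s %s\n", $s, $matched_by{$s};
}
```

For example, "ac" is matched only by b* and b?, because those are the two symbols that allow zero occurrences.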

8.  Using Groups ( ) in Matching

Groups are regular expression characters surrounded by parentheses ( ).

a. Groups are used to allow alternative phrases; e.g.:

/(Jack|Jill|Hill)/i

For single-character alternatives, use character classes [ ]. A character class is a set of characters within square brackets; it matches exactly one character, which may be any member of the set. Two characters have a special meaning inside the square brackets:

A hyphen (-) indicates all characters in the collating sequence between the character on the hyphen's left and the character on the hyphen's right, inclusive.

A caret (^) immediately following the opening square bracket [ means "anything but these characters", and so negates the character class.

Character classes have three main advantages:

i. Shorthand notation, as [AEIOUY] instead of (A|E|I|O|U|Y).
ii. Character ranges, such as [A-Z].
iii. The same range notation gives a one-to-one mapping from one set of characters to another in the tr operator; e.g.: tr/a-z/A-Z/. The tr operator takes plain character lists rather than regular expressions, so its lists are written without square brackets.

b. Groups are also used as a means of retrieving selected text in selection and substitution, through the scalars $1, $2, etc.
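Both uses of groups, together with character classes, can be sketched in a few lines. The sample sentence and the variable names are assumptions made up for this example.

```perl
use strict;
use warnings;

# Sample sentence, invented for this illustration.
my $text = "Jill went up the hill on 2013-03-03";

# Alternation inside a group; in list context the match returns the captures.
my ($name) = $text =~ /(Jack|Jill|Hill)/i;

# Three groups fill $1, $2 and $3 from left to right.
my ($year, $month, $day);
if ($text =~ /(\d{4})-(\d{2})-(\d{2})/) {
    ($year, $month, $day) = ($1, $2, $3);
}

# Character classes: the first vowel, and the first character that is not a letter.
my ($vowel)      = $text =~ /([aeiouy])/i;
my ($non_letter) = $text =~ /([^A-Za-z])/;

print join(", ", $name, "$year-$month-$day", $vowel), "\n";
# prints: Jill, 2013-03-03, i
```

Copying $1, $2 and $3 into named variables right after the match is a common precaution, since the next successful match overwrites them.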

B. COMPLEX STRING SELECTIONS

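Replace every "Jack" with "Jill". The g modifier after the closing delimiter makes the substitution global; without it, only the first "Jack" is replaced:

$string =~ s/Jack/Jill/g;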
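A short sketch of how substitution behaves with and without the g modifier, and of groups used in the replacement text. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample data, invented for this illustration.
my $original = "Jack and Jack went up the hill";

# Copy first, then substitute on the copy, so $original stays unchanged.
(my $first_only = $original) =~ s/Jack/Jill/;    # first occurrence only
(my $all        = $original) =~ s/Jack/Jill/g;   # every occurrence

# s///g returns the number of substitutions it made.
my $work  = $original;
my $count = ($work =~ s/Jack/Jill/g);

# Groups captured on the left can be reused on the right as $1 and $2.
my $swapped = "Jack Jill";
$swapped =~ s/(\w+) (\w+)/$2 $1/;

print "$first_only\n$all\n$count\n$swapped\n";
```

Here $first_only is "Jill and Jack went up the hill", $all is "Jill and Jill went up the hill", $count is 2 and $swapped is "Jill Jack".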

C. COMPLEX STRING REPLACEMENTS

Translations are like substitutions, except they happen on a character-by-character basis instead of substituting one phrase for another. The tr operator takes plain lists of characters, not regular expressions, so the lists are written without square brackets or commas.

To make all vowels upper case:

$string =~ tr/aeiouy/AEIOUY/;

Change every lower case letter to upper case:

$string =~ tr/a-z/A-Z/;

Change every upper case letter to lower case:

$string =~ tr/A-Z/a-z/;
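The translations above can be checked with a small script. This is a sketch with an invented sample string; it also shows that tr returns the number of characters it matched, and that Perl's built-in uc function gives the same result as the a-z to A-Z translation for plain ASCII text.

```perl
use strict;
use warnings;

# Sample data, invented for this illustration.
my $original = "jack and jill";

# Copy first, then translate the copy, so $original stays unchanged.
(my $vowels_up = $original) =~ tr/aeiouy/AEIOUY/;
(my $all_up    = $original) =~ tr/a-z/A-Z/;
(my $all_down  = $all_up)   =~ tr/A-Z/a-z/;

# With an empty replacement list, tr changes nothing and just counts matches.
my $vowel_count = ($original =~ tr/aeiouy//);

print "$vowels_up\n$all_up\n$all_down\n$vowel_count\n";
```

The script prints "jAck And jIll", "JACK AND JILL", "jack and jill" and 3. For whole-string case changes, uc and lc are the more usual choice.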

II. SYMBOLS

=~

This operator appears between the string variable you are comparing and the regular expression you are looking for. In a selection or substitution, the regular expression operates on the string variable rather than just comparing it; for example:

$string =~ m/Jack/;
#return true if var $string contains the name Jack

$string =~ s/Jack/Jill/;
#replace the first Jack with Jill

!~

Just like =~, except negated. Returns true if it does not match.
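A minimal sketch of !~ in use, filtering a list. The names in the list and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample data, invented for this illustration.
my @names = ("Jack", "Jill", "Hill");

# Keep only the names that do NOT match /Jack/.
my @without_jack = grep { $_ !~ /Jack/ } @names;

# $x !~ /re/ gives the same answer as !($x =~ /re/).
my $agree = 1;
for my $n (@names) {
    my $negated_match = !($n =~ /Jack/) ? 1 : 0;
    my $not_match     = ($n !~ /Jack/)  ? 1 : 0;
    $agree = 0 if $negated_match != $not_match;
}

print "@without_jack\n";   # prints: Jill Hill
```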

/

This is the usual "delimiter" for the text part of a regular expression; for example:

$string =~ m/Jack/;
#return true if var $string contains the name Jack

$string =~ s/Jack/Jill/;
#replace the first Jack with Jill

If the sought-after text contains slashes, it is easier to use another delimiter, such as pipe symbols (|) or braces ({ }), so that the slashes need no escaping. With any delimiter other than the slash, the leading m is required.

m

This is the "match" operator. The match operator comes before the opening delimiter. It means: read the string expression on the left of the =~, and see whether any part of it matches the expression within the delimiters following the m. If the delimiters are slashes (/), the m is optional and often not included. Whether the m is there or not, it is still a match operation; for example:

$string =~ m/Jack/;
#return true if var $string contains the name Jack

$string =~ /Jack/;
#same result as previous statement
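The delimiter choices can be compared on a string that is full of slashes. This is a sketch; the path and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical path, made up for this example.
my $path = "/usr/local/bin/perl";

# With slash delimiters, every slash in the pattern must be escaped.
my $with_slashes = $path =~ m/\/usr\/local\// ? 1 : 0;

# With other delimiters no escaping is needed, but the m is mandatory.
my $with_pipes  = $path =~ m|/usr/local/| ? 1 : 0;
my $with_braces = $path =~ m{/usr/local/} ? 1 : 0;

# Substitution accepts alternative delimiters as well.
(my $moved = $path) =~ s{/usr/local}{/opt};

print "$with_slashes $with_pipes $with_braces $moved\n";
# prints: 1 1 1 /opt/bin/perl
```

One trade-off with pipe delimiters is that the pipe can then no longer be used for alternatives inside the pattern without escaping, which is one reason braces are a popular choice.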

^

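This is the "beginning of line" symbol. When used immediately after the starting delimiter, it anchors the match to the beginning of the string (or, when the /m modifier follows the closing delimiter, to the beginning of any line within the string); for example: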

$string =~ m/^Jack/;
#true only when "Jack" is the first text in the string

$

This is the "end of line" symbol. When used immediately before the ending delimiter, it anchors the match to the end of the string, or to just before a newline that ends the string; for example:

$string =~ m/Jack$/;
#true only when "Jack" is the last text in the string
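Despite their "line" names, ^ and $ by default anchor to the string as a whole. The sketch below shows the difference on a string that holds two lines; the sample text and variable names are invented, and the /m modifier and the \z assertion are standard Perl features not otherwise covered in this post.

```perl
use strict;
use warnings;

# Two lines in one string, invented for this illustration.
my $text = "Jill\nJack\n";

# By default ^ matches only at the very start of the string.
my $start_default = $text =~ /^Jack/  ? 1 : 0;   # 0
# With /m, ^ also matches right after each newline.
my $start_multi   = $text =~ /^Jack/m ? 1 : 0;   # 1

# $ matches at the end of the string or just before a final newline.
my $end_default   = $text =~ /Jack$/  ? 1 : 0;   # 1
# \z matches only at the true end of the string.
my $end_strict    = $text =~ /Jack\z/ ? 1 : 0;   # 0

print "$start_default $start_multi $end_default $end_strict\n";
# prints: 0 1 1 0
```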

i

This is the "case insensitivity" operator when used immediately after the closing delimiter; for example:

$string =~ m/Jack/i;
#true when $string contains "Jack" or "jAcK"

III. TABLES

Some characters have a special meaning to the regular expression engine. These characters are called metacharacters.

METACHARACTERS

CHAR	MEANING
`^`	beginning of string
`$`	end of string
`.`	any character except newline
`*`	match 0 or more times
`+`	match 1 or more times
`?`	match 0 or 1 times; or: shortest match
`|`	alternative
`( )`	grouping; “storing”
`[ ]`	set of characters
`{ }`	repetition modifier
`\`	quote or special

REPETITION

`a*`	zero or more `a`’s
`a+`	one or more `a`’s
`a?`	zero or one `a`’s (i.e., optional `a`)
`a{m}`	exactly m `a`’s
`a{m,}`	at least m `a`’s
`a{m,n}`	at least m but at most n `a`’s
`repetition?`	same as `repetition`, but the shortest match is taken
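The last row, the "shortest match" form, is easiest to see next to the ordinary longest-match form. This is a minimal sketch; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample data, invented for this illustration.
my $html = "<b>bold</b> and <i>italic</i>";

# .+ takes as much as it can: from the first "<" to the last ">".
my ($longest)  = $html =~ /(<.+>)/;

# .+? takes as little as it can: from the first "<" to the nearest ">".
my ($shortest) = $html =~ /(<.+?>)/;

print "$longest\n$shortest\n";
```

Here $longest is the entire string, while $shortest is just "<b>".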

SPECIAL NOTATIONS WITH \

Single characters
`\t`	tab
`\n`	newline
`\r`	return (CR)
`\xhh`	character with hex. code `hh`

“Zero-width assertions”
`\b`	“word” boundary
`\B`	not a “word” boundary

Matching
`\w`	matches any single character classified as a “word” character (alphanumeric or “`_`”)
`\W`	matches any non-“word” character
`\s`	matches any whitespace character (space, tab, newline)
`\S`	matches any non-whitespace character
`\d`	matches any digit character, equiv. to `[0-9]`
`\D`	matches any non-digit character

CHARACTER CLASS [...]

Inside a "character class" denoted by [...] the following rules apply:

`[characters]`	matches any of the characters in the sequence
`[x-y]`	matches any of the characters from `x` to `y` (inclusively) in the ASCII code
`[\-]`	matches the hyphen character “`-`”
`[\n]`	matches the newline; other single character denotations with \ apply normally, too
`[^something]`	matches any character except those that `[something]` denotes; that is, immediately after the leading “`[`”, the circumflex “`^`” means “not” applied to all of the rest
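The character class rules above, tried on a few made-up strings. This is only a sketch; the sample strings and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# A range plus an escaped hyphen: digits and "-" only.
my ($phone) = "tel: 555-0199" =~ /([0-9\-]+)/;

# A negated class: anything except digits and spaces.
my ($word) = "2013 Sunday" =~ /([^0-9 ]+)/;

# Backslash notations such as \n keep their meaning inside a class.
my $has_newline = "one\ntwo" =~ /[\n]/ ? 1 : 0;

# A caret that is not the first character in the class is an ordinary character.
my $caret_literal = "x^y" =~ /x[y^]y/ ? 1 : 0;

print "$phone $word $has_newline $caret_literal\n";
# prints: 555-0199 Sunday 1 1
```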

EXPRESSION	MATCHES
`abc`	`abc` (that exact character sequence, but anywhere in the string)
`^abc`	`abc` at the beginning of the string
`abc$`	`abc` at the end of the string
`a|b`	either of `a` and `b`
`^abc|abc$`	the string `abc` at the beginning or at the end of the string
`ab{2,4}c`	an `a` followed by two, three or four `b`’s followed by a `c`
`ab{2,}c`	an `a` followed by at least two `b`’s followed by a `c`
`ab*c`	an `a` followed by any number (zero or more) of `b`’s followed by a `c`
`ab+c`	an `a` followed by one or more `b`’s followed by a `c`
`ab?c`	an `a` followed by an optional `b` followed by a `c`; that is, either `abc` or `ac`
`a.c`	an `a` followed by any single character (not newline) followed by a `c`
`a\.c`	`a.c` exactly
`[abc]`	any one of `a`, `b` and `c`
`[Aa]bc`	either of `Abc` and `abc`
`[abc]+`	any (nonempty) string of `a`’s, `b`’s and `c`’s (such as `a`, `abba`, `acbabcacaa`)
`[^abc]+`	any (nonempty) string which does not contain any of `a`, `b` and `c` (such as `defg`)
`\d\d`	any two decimal digits, such as `25`; same as \d{2}
`\w+`	a “word”: a nonempty sequence of alphanumeric characters and low lines (underscores), such as `cp3o`, `r2d2` and `cool_1`
`100\s*mk`	the strings `100` and `mk` optionally separated by any amount of white space (spaces, tabs, newlines)
`abc\b`	`abc` when followed by a word boundary (e.g. in `abc!` but not in `abcd`)
`perl\B`	`perl` when not followed by a word boundary (e.g. in `perlert` but not in `perl stuff`)
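Several rows of the table can be checked mechanically. The sketch below pairs a pattern with a test string and the expected result (1 for a match, 0 for no match); the test strings, the @rows layout and the $failures counter are assumptions made up for this example.

```perl
use strict;
use warnings;

# Each row: pattern, test string, expected result (1 = match, 0 = no match).
my @rows = (
    [ qr/ab{2,4}c/,  "abbbc",      1 ],
    [ qr/ab{2,4}c/,  "abc",        0 ],
    [ qr/a.c/,       "a\nc",       0 ],   # . does not match a newline
    [ qr/a\.c/,      "abc",        0 ],   # \. is a literal dot
    [ qr/^abc|abc$/, "xxabc",      1 ],
    [ qr/100\s*mk/,  "100  mk",    1 ],
    [ qr/abc\b/,     "abc!",       1 ],
    [ qr/abc\b/,     "abcd",       0 ],
    [ qr/perl\B/,    "perlert",    1 ],
    [ qr/perl\B/,    "perl stuff", 0 ],
);

# Count the rows whose actual result differs from the expected one.
my $failures = 0;
for my $row (@rows) {
    my ($re, $input, $expected) = @$row;
    my $got = $input =~ $re ? 1 : 0;
    $failures++ if $got != $expected;
}

print "failures: $failures\n";   # prints: failures: 0
```

Adding rows of your own is a quick way to test a pattern before using it in a larger program.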

You have learned how to use Perl REGEX, regular expressions.

Elcric Otto Circle


PERL 5

Friday, November 11, 2011
