Thursday, December 13, 2007

How to use Regular Expressions


Ever needed to parse a web page and get all the links in it (the hrefs)? The easy way is to use this regular expression to get the href:

Regex r = new Regex("href.*");

For those of you who don't know, this means: get me something that starts with -href- and then whatever follows, up to the end of the line. That's what the -.*- is for. The problem is that we now have to work on the results in order to get the actual link.
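To see why the result needs more work, here is a minimal sketch of what the naive pattern returns. The class name, the helper FirstNaiveMatch and the sample HTML are made up for illustration and are not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveHrefDemo
{
    // Returns the raw text matched by the naive pattern, or "" if nothing matches.
    public static string FirstNaiveMatch(string html)
    {
        Match m = Regex.Match(html, "href.*");
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        // Hypothetical sample input, not taken from a real page.
        string html = "<a href=\"http://example.com/\">Example</a> and more text";

        // Prints: href="http://example.com/">Example</a> and more text
        Console.WriteLine(FirstNaiveMatch(html));
    }
}
```

The match starts at href but drags along the rest of the line, so the URL still has to be cut out by hand.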

Extra work? I don't think so...

We want to use groups, so the regular expression will look like this:

href.*?"(?<href>.*?)"

Or in code (we need to add a \ to escape the quote characters inside the C# string):

Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"",RegexOptions.Multiline);

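RegexOptions.Multiline changes the meaning of ^ and $ so that they match at the start and end of every line, instead of only at the start and end of the whole input. Our pattern uses neither, so the option makes no difference here: a multiline string can be passed as input with or without it.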
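A small sketch of what RegexOptions.Multiline does and does not change. The class name, the helper Count and the sample string are assumptions made for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Counts how many times the pattern matches the input with the given options.
    public static int Count(string input, string pattern, RegexOptions options)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    public static void Main()
    {
        // Hypothetical two-line input.
        string twoLines = "href one\nhref two";

        // Without Multiline, ^ matches only at the very start of the input.
        Console.WriteLine(Count(twoLines, "^href", RegexOptions.None));      // 1

        // With Multiline, ^ also matches right after each newline.
        Console.WriteLine(Count(twoLines, "^href", RegexOptions.Multiline)); // 2

        // A pattern without ^ or $ finds both occurrences either way.
        Console.WriteLine(Count(twoLines, "href", RegexOptions.None));       // 2
    }
}
```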

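Let's break it down:

The beginning is the same as before, -href.*-: match everything that starts with href. Now comes the twist.

The -?- that follows makes the -.*- lazy, so together with the -"- after it, it means: stop on the first " you find. If we drop the -?-, it will run on to the last -"- that still lets the rest of the pattern match (greedy!!!). Now comes the definition of the group: -(?<href>.*?)-. The group name is case-sensitive, so it has to be spelled the same way in the code that reads it. The syntax for defining a named group is:

(?<GroupName>expression)
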
What comes after the group name is the regular expression for the group. In our case the end looks like this:

.*?)"

which means: get everything until the first " you see.

That way we get the "clean" URL inside the href group!
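The difference between the lazy and the greedy version is easiest to see side by side. This is only a sketch: the class name, the helper CaptureHref and the sample tag are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Returns the value of the "href" group for the first match, or "" if there is none.
    public static string CaptureHref(string html, string pattern)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups["href"].Value : "";
    }

    public static void Main()
    {
        // Hypothetical tag with two quoted attributes.
        string html = "<a href=\"http://example.com/\" title=\"Example\">link</a>";

        // Lazy -.*?- stops at the first quote, so the group holds the URL.
        // Prints: http://example.com/
        Console.WriteLine(CaptureHref(html, "href.*?\"(?<href>.*?)\""));

        // Greedy -.*- runs on to the last quoted value, so the group holds the title.
        // Prints: Example
        Console.WriteLine(CaptureHref(html, "href.*\"(?<href>.*?)\""));
    }
}
```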

To read the group, use this code:

// Requires: using System; using System.Text.RegularExpressions;
public static void GetMatches(string s)
{
    Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);
    MatchCollection mc1 = MyRegex.Matches(s);
    // Print the captured "href" group of every match.
    foreach (Match m1 in mc1)
        Console.WriteLine("URL: {0}", m1.Groups["href"].Value);
}
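For a complete program to try this on, here is one possible sketch that wraps the same pattern in a method returning the URLs as a list. The class name HrefExtractor, the method ExtractHrefs and the sample HTML are assumptions for this example, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same pattern as in the post; the "href" group captures the quoted value.
    private static readonly Regex HrefRegex =
        new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);

    // Collects the "href" group of every match into a list.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefRegex.Matches(html))
            urls.Add(m.Groups["href"].Value);
        return urls;
    }

    public static void Main()
    {
        // Hypothetical two-line page fragment.
        string html =
            "<a href=\"http://example.com/\">One</a>\n" +
            "<a href=\"/about.html\">Two</a>";

        // Prints:
        // URL: http://example.com/
        // URL: /about.html
        foreach (string url in ExtractHrefs(html))
            Console.WriteLine("URL: {0}", url);
    }
}
```

Keep in mind that this pattern only handles attribute values in double quotes. An href written with single quotes or without quotes will not be picked up correctly.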

Credit to Shahar A.

Have fun!!


