C# Code Snippets

Thursday, December 13, 2007

How to use Regular Expressions

Hi

Ever needed to parse a web page and get all the links in it (the href attributes)? The easy way is to use this regular expression to find each href:

Regex r = new Regex("href.*");

For those of you who don't know, this means: match something that starts with href, followed by whatever comes after it up to the end of the line. That's what the .* is for. The problem is that we then have to work on the results in order to get the actual link.
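To see why the raw result needs more work, here is a minimal sketch that runs the naive pattern on one line of HTML. The class name NaiveHrefDemo, the helper FirstRawMatch and the sample input are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveHrefDemo
{
    // Returns the raw text matched by the naive pattern,
    // or an empty string when nothing matches.
    public static string FirstRawMatch(string html)
    {
        Regex r = new Regex("href.*");
        Match m = r.Match(html);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical sample input, invented for this example.
        string html = "<a href=\"http://example.com/\">Example</a> trailing text";

        // The match starts at "href" and runs to the end of the line,
        // so the URL is buried inside a lot of extra text.
        Console.WriteLine(FirstRawMatch(html));
    }
}
```

The match contains the attribute name, the quotes, the link text and everything else on the line, which is why the named group is worth the trouble.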

Extra work? I don't think so...

We want to use groups, so the regular expression will look like this:

href.*?"(?<href>.*?)"

Or in code (the double quotes have to be escaped with \ inside a C# string literal):

Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"",RegexOptions.Multiline);


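RegexOptions.Multiline changes ^ and $ so that they match at the start and end of every line, instead of only at the start and end of the whole input. This pattern uses neither anchor, so the option makes no difference here: a multiline string works as input with or without it. Keep in mind that . does not match a newline unless RegexOptions.Singleline is set, so each match stays within one line.


Let's break it down: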


href.*?"(?<href>.*?)"



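The beginning is the same: href.* matches everything that starts with href. Now comes the twist.



The ? after .* makes it lazy, so href.*?" stops at the first " it finds. If we drop the ?, it runs as far along the line as it can before settling on a " (greedy!!!). Now comes the definition of the group, (?<href>.*?). The group name is case-sensitive, so it has to match the name used later in Groups["href"]. The syntax for defining a named group is: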



(?<GroupName>Rule)



What comes after the group name is the regular expression for the group. In our case the end looks like this:



.*?)"



which means: take everything up to the first " you see.



That way we get the "clean" URL inside the href group!
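The lazy versus greedy behaviour described above can be checked with a short sketch. The class GreedyVsLazyDemo, the helper Capture and the sample tag with an extra title attribute are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Runs the pattern once and returns the text captured by the
    // "href" group, or an empty string when there is no match.
    public static string Capture(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups["href"].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical input with a second quoted attribute after href.
        string html = "<a href=\"a.html\" title=\"First\">A</a>";

        // Lazy: .*? stops at the first quote after href.
        string lazy = "href.*?\"(?<href>.*?)\"";

        // Greedy: the first .* skips past the href value to a later quote.
        string greedy = "href.*\"(?<href>.*?)\"";

        Console.WriteLine(Capture(lazy, html));   // prints a.html
        Console.WriteLine(Capture(greedy, html)); // prints First
    }
}
```

With the greedy form the group ends up holding the value of the title attribute instead of the link, which is exactly the mistake the ? protects against.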



To read the groups, use this code:



// Requires: using System; using System.Text.RegularExpressions;
public static void GetMatches(string s)
{
    Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);
    MatchCollection mc1 = MyRegex.Matches(s);

    // Print the pattern itself, then the captured URL of every match.
    Console.WriteLine(MyRegex.ToString());
    foreach (Match m1 in mc1)
    {
        Console.WriteLine("URL: {0}", m1.Groups["href"].Value);
    }
}
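For anyone who wants to try this end to end, here is a self-contained sketch built around the same pattern. The class HrefExtractor, the method ExtractHrefs and the sample HTML are invented for this example; returning a list instead of printing is just one possible design.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same pattern and option as in the post, compiled once and reused.
    private static readonly Regex HrefRegex =
        new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);

    // Collects the value of the "href" group from every match.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefRegex.Matches(html))
        {
            urls.Add(m.Groups["href"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical two-line input, invented for this example.
        string html =
            "<a href=\"http://example.com/one\">One</a>\n" +
            "<a href=\"/two.html\">Two</a>";

        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine("URL: {0}", url);
        }
    }
}
```

This pattern only handles double-quoted attribute values, so links written with single quotes or no quotes would need a different rule.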

Credit to Shahar A.

Have fun!!

Amit
