Hi
Ever needed to parse a web page and get all the links in it (the hrefs)? The easy way is to use this regular expression to get the href:
Regex r = new Regex("href.*");
For those of you who don't know, this means: get me something that starts with -href- and then whatever follows, up to the end of the line. That's what the -.*- is for. The problem is that now we have to work on the results in order to get the actual link. Extra work? I don't think so...
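To see the problem concretely, here is a minimal sketch of what the naive pattern returns. The class name NaiveHrefDemo, the helper FirstRawMatch, and the sample HTML are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original post.
public static class NaiveHrefDemo
{
    // Returns the raw text of the first match of the naive pattern,
    // or an empty string when nothing matches.
    public static string FirstRawMatch(string html)
    {
        Match m = Regex.Match(html, "href.*");
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string html = "<a href=\"http://example.com/a\">A</a> <b>text</b>";
        // Prints: href="http://example.com/a">A</a> <b>text</b>
        Console.WriteLine(FirstRawMatch(html));
    }
}
```

The match runs from href to the end of the line, so the link still has to be cut out of it by hand.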
We want to use groups, so the regular expression will look like this:
href.*?"(?<href>.*?)"
Or in code (we need to add \ to escape the " characters):
Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"",RegexOptions.Multiline);
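RegexOptions.Multiline makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole input. Our pattern uses neither anchor, so the option is harmless here but not required: a multiline string works as input either way.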
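A quick way to check what the option does is to run the same input with and without it. This is only a sketch: MultilineDemo, CountLinks, and the two-line sample are assumptions for illustration. It suggests that the href pattern finds links on several lines either way, and that Multiline only changes how the ^ anchor behaves.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original post.
public static class MultilineDemo
{
    // Counts matches of the post's pattern under the given options.
    public static int CountLinks(string html, RegexOptions options)
    {
        return new Regex("href.*?\"(?<href>.*?)\"", options).Matches(html).Count;
    }

    public static void Main()
    {
        string html = "<a href=\"a.html\">A</a>\n<a href=\"b.html\">B</a>";

        // The href pattern has no anchors, so both calls print 2.
        Console.WriteLine(CountLinks(html, RegexOptions.None));
        Console.WriteLine(CountLinks(html, RegexOptions.Multiline));

        // With an anchor the option matters: prints 1, then 2.
        Console.WriteLine(Regex.Matches(html, "^<a").Count);
        Console.WriteLine(Regex.Matches(html, "^<a", RegexOptions.Multiline).Count);
    }
}
```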
Let's break it down:
href.*?"(?<href>.*?)"
The beginning is the same: -href.*- gets everything that starts with href. Now comes the twist.
The -?- after -.*- makes it lazy, so -.*?"- stops on the first " it finds. If we drop the -?-, it will run on to the last -"- it can (greedy!!!). Now comes the definition of the group: -(?<href>.*?)-. The syntax for defining a group is:
(?<GroupName>Rule)
What comes after the group name is the regular expression for the group. In our case the end looks like this:
.*?)"
which means get everything until the first " you see.
That way we will get the "clean" URL inside the href group! Group names are case sensitive, so the name in the pattern has to match the one used in the code.
To use the groups, use this code:
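The difference between lazy and greedy is easiest to see side by side. The sketch below runs three variants of the pattern over one made-up tag with two quoted attributes. GreedyVsLazyDemo and FirstHref are hypothetical names used only here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original post.
public static class GreedyVsLazyDemo
{
    // Returns the value of the "href" group for the first match,
    // or an empty string when the pattern does not match.
    public static string FirstHref(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups["href"].Value : string.Empty;
    }

    public static void Main()
    {
        string html = "<a href=\"a.html\" class=\"x\">A</a>";

        // Both quantifiers lazy. Prints: a.html
        Console.WriteLine(FirstHref("href.*?\"(?<href>.*?)\"", html));

        // First quantifier greedy: it skips ahead to the last quoted value.
        // Prints: x
        Console.WriteLine(FirstHref("href.*\"(?<href>.*?)\"", html));

        // Group quantifier greedy: it runs on to the last quote on the line.
        // Prints: a.html" class="x
        Console.WriteLine(FirstHref("href.*?\"(?<href>.*)\"", html));
    }
}
```

Only the fully lazy version gives the clean URL once the tag has more than one quoted attribute.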
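using System;
using System.Text.RegularExpressions;

// LinkFinder is just a placeholder class name so the method compiles.
public static class LinkFinder
{
    // Prints the pattern, then every link found in s.
    public static void GetMatches(string s)
    {
        Regex MyRegex = new Regex("href.*?\"(?<href>.*?)\"", RegexOptions.Multiline);
        MatchCollection mc1 = MyRegex.Matches(s);
        Console.WriteLine(MyRegex.ToString());
        foreach (Match m1 in mc1)
        {
            // The named group holds only the text between the quotes.
            Console.WriteLine("URL: {0}", m1.Groups["href"].Value);
        }
    }
}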
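If you would rather collect the links than print them, a variant along these lines might do. It is a sketch under assumptions: LinkExtractor and ExtractLinks are hypothetical names, the sample HTML is invented, and the pattern still only handles attribute values in double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original post.
public static class LinkExtractor
{
    // Same pattern as in the post, built once and reused.
    private static readonly Regex HrefRegex =
        new Regex("href.*?\"(?<href>.*?)\"");

    // Returns the value of the "href" group for every match, in order.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefRegex.Matches(html))
        {
            links.Add(m.Groups["href"].Value);
        }
        return links;
    }

    public static void Main()
    {
        string html = "<a href=\"http://example.com/one\">One</a>\n" +
                      "<a href=\"/two.html\">Two</a>";

        // Prints:
        // URL: http://example.com/one
        // URL: /two.html
        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine("URL: {0}", link);
        }
    }
}
```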
Credit to Shahar A.
Have fun!!
Amit