Web scraping with Selenium and C# can be done quite easily. By locating HTML elements with the FindElement or FindElements methods, you can filter down to the relevant part of the webpage inside your browser.
Combine this with a little bit of XPath and you can extract exactly the information you need from a webpage.
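As a quick illustration of combining FindElements with XPath, here is a minimal sketch. The `id="contacts"` container is a hypothetical example of page markup, not something from an actual site; substitute whatever structure your target page uses.

```csharp
using System.Collections.Generic;
using OpenQA.Selenium;

public static class XPathExample
{
    // Hypothetical: collect every <a> element that sits inside
    // a <div id="contacts"> container, using an XPath expression.
    public static IReadOnlyCollection<IWebElement> GetContactLinks(IWebDriver driver)
    {
        // "//div[@id='contacts']//a" = any <a> descendant of that div
        return driver.FindElements(By.XPath("//div[@id='contacts']//a"));
    }
}
```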
I’ve been in a situation before where a client had lost the credentials for an important database, but the website connected to the database was still up, with all the data present across a few hundred webpages. In that case, web scraping was an excellent way to recover the data they needed.
Although you could just fire up Visual Studio, I used the mashitup IDE combined with Chrome (and ChromeDriver) to figure out the format of the webpages, so that I knew which elements to target when reading their contents with regular expressions.
The first thing you’ll need to do, after setting up your Visual Studio project and declaring your Selenium WebDriver object, is navigate to the webpage you want to scrape. Below is the C# code you’ll need.
driver.Navigate().GoToUrl("https://www.Some-WebPage-We-Want-To-Scrape.com/");
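The post assumes the WebDriver object is already declared; for completeness, a minimal setup sketch using ChromeDriver might look like this (it assumes the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver NuGet packages are installed, and reuses the placeholder URL from above):

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Program
{
    static void Main()
    {
        // ChromeDriver needs the chromedriver executable on the PATH
        // or in the project's output folder.
        IWebDriver driver = new ChromeDriver();
        try
        {
            driver.Navigate().GoToUrl("https://www.Some-WebPage-We-Want-To-Scrape.com/");
            // ... scraping happens here ...
        }
        finally
        {
            driver.Quit(); // close the browser and end the chromedriver process
        }
    }
}
```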
Next, we want to find the elements that contain the text we need. Let’s say, for testing purposes, that we have hyperlinks on a page and we want to scrape the text inside each hyperlink, but we only want text that starts with “contact:”.
// Requires: using OpenQA.Selenium; using System.Text.RegularExpressions;
// FindElements returns a ReadOnlyCollection<IWebElement>, not a List.
var hyperlinks = driver.FindElements(By.TagName("a"));
var ListOfScrapedTexts = new List<string>();
foreach (IWebElement link in hyperlinks)
{
    string linktext = link.Text; // the text inside the hyperlink
    if (Regex.IsMatch(linktext, "^contact:"))
    {
        ListOfScrapedTexts.Add(linktext);
    }
}
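The regex filter itself doesn’t need a browser, so you can try it out on plain strings before pointing it at a live page. Here is a small standalone sketch; the helper name `FilterContactTexts` is mine for illustration, not part of Selenium or the original post:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContactFilter
{
    // Keep only strings that start with "contact:", using the same
    // "^contact:" pattern as the scraping loop above.
    public static List<string> FilterContactTexts(IEnumerable<string> linkTexts)
    {
        var results = new List<string>();
        foreach (string text in linkTexts)
        {
            if (Regex.IsMatch(text, "^contact:"))
            {
                results.Add(text);
            }
        }
        return results;
    }
}
```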