MagicLegend - 1 year ago
C# Question

Get specific content from website via C#

For a non-commercial, private school project I'm creating a piece of software that searches for lyrics based on which song is currently playing on Spotify. I have to do this in C# (requirement), but I can use other languages alongside it if I want to.

I've found a few sites that I can fetch the lyrics from. I have already succeeded in fetching the entire HTML of a page, but after that I'm not sure what to do. I asked my teacher, and she told me to use XML (which I also found complicated :p), so I've read quite a bit about it and searched for examples, but I haven't found anything that seems applicable to my case.

Time for some code.

Let's say I wanted to fetch the lyrics from

(Human-readable altered) HTML:

<span data-reactid="199">
  <p class="mxm-lyrics__content" data-reactid="200">First line of the lyrics!
These words will never be ignored
I don't want a battle</p>
  <!-- react-empty: 201 -->
  <div data-reactid="202">
    <div class="inline_video_ad_container_container" data-reactid="203">
      <div id="inline_video_ad_container" data-reactid="204">
        <div class="" style="line-height:0;" data-reactid="205">
          <div id="div_gpt_ad_outofpage_musixmatch_desktop_lyrics" data-reactid="206">
            <script type="text/javascript">
              //Really nice google ad JS which I have removed;
            </script>
          </div>
        </div>
      </div>
    </div>
  </div>
  <p class="mxm-lyrics__content" data-reactid="207">But I got a war
More fancy lyrics
And lines
That I want to fetch
And display</p>
</span>

Note that the first three lines of the lyrics are at the top, with the rest at the bottom. Also note that the two <p> tags have the same class.
The full HTML source can be found here:
The snippet starts at around line 97.

So in this specific example, the lyrics are mixed in with quite a bit of markup that I don't need. So far I've tried fetching the HTML with the following C#:

string source = "’s-a-War";

// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();

// Creates an HtmlDocument object from a URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load(source);

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("mxm - lyrics__content");

if (someNode != null)
{
    // node found
}
else
{
    // node not found
}

foreach (var node in document.DocumentNode.SelectNodes("//span/div[@id='site']/p[@class='mxm-lyrics__content']"))
{
    // here is your text: node.InnerText "//div[@class='sideInfoPlayer']/span[@class='wrap']"
}


Fetching the entire HTML works, but extracting the lyrics from it doesn't, and that is where I'm stuck. Since the lyrics on this page aren't in an element with an ID, I can't just use GetElementbyId. Can somebody point me in the right direction? I want to support multiple sites, so I will have to do this a few times for different sites.

Answer

One possible solution:

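// Requires: using System; using System.Linq; using HtmlAgilityPack;
var htmlWeb = new HtmlWeb();
var documentNode = htmlWeb.Load(source).DocumentNode;

// Option 1: LINQ over all <p> elements whose class attribute contains the lyrics class
var findclasses = documentNode.Descendants("p")
    .Where(d => d.Attributes["class"]?.Value.Contains("mxm-lyrics__content") == true);

// Option 2 (use instead of option 1): the equivalent XPath query; returns null when nothing matches
// var findclasses = documentNode.SelectNodes("//p[contains(@class,'mxm-lyrics__content')]");

// Join the text of every matching node, one node per line
var text = string.Join(Environment.NewLine, findclasses.Select(x => x.InnerText));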
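
Since the question mentions supporting several sites, here is a minimal sketch of how the XPath variant could be wrapped in a reusable helper. It is only one way to do it, not the definitive one. The class name LyricsExtractor, its method names, and the host key "www.musixmatch.com" are assumptions made for illustration; only the XPath expression comes from the answer above. The sketch also guards against SelectNodes returning null and decodes HTML entities with HtmlEntity.DeEntitize.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper; the names are illustrative and not part of HtmlAgilityPack.
public static class LyricsExtractor
{
    // Assumed per-site table: host name -> XPath that selects the lyric nodes.
    // Add one entry per supported site.
    private static readonly Dictionary<string, string> XPathByHost =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["www.musixmatch.com"] = "//p[contains(@class,'mxm-lyrics__content')]",
        };

    // Core step: run the XPath against a parsed document and join the matches.
    public static string Extract(HtmlNode root, string xpath)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = root.SelectNodes(xpath);
        if (nodes == null)
        {
            return null;
        }

        // Decode entities such as &amp; and trim each fragment before joining.
        return string.Join(
            Environment.NewLine,
            nodes.Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()));
    }

    // Parses an HTML string that was already downloaded, then extracts.
    public static string ExtractFromHtml(string html, string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return Extract(document.DocumentNode, xpath);
    }

    // Downloads the page and picks the XPath from the table by host name.
    // Returns null when the host is not in the table or nothing matches.
    public static string ExtractFromUrl(string url)
    {
        string xpath;
        if (!XPathByHost.TryGetValue(new Uri(url).Host, out xpath))
        {
            return null;
        }

        var document = new HtmlWeb().Load(url);
        return Extract(document.DocumentNode, xpath);
    }
}
```

With this in place, string lyrics = LyricsExtractor.ExtractFromUrl(source); would return either the joined lyrics or null, so the caller can fall back to another site. Keeping the XPath expressions in a table means that adding a site is a one-line change, provided a single expression is enough to locate the lyrics on that site.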