MagicLegend - 7 days ago
C# Question

Get specific content from website via C#

For a non-commercial private school project I'm creating a piece of software that searches for lyrics based on which song is currently playing on Spotify. I have to do this in C# (requirement), but I can use other languages as well if I want to.

I've found a few sites that I can fetch the lyrics from. I have already succeeded in fetching the entire HTML, but after that I'm not sure what to do. I asked my teacher, and she told me to use XML (which I also found complicated :p), so I've read quite a bit about it and searched for examples, but I haven't found anything that seems applicable to my case.

Time for some code.



Let's say I wanted to fetch the lyrics from musixmatch.com:

HTML (altered to be human-readable):

<span data-reactid="199">
<p class="mxm-lyrics__content" data-reactid="200">First line of the lyrics!
These words will never be ignored
I don't want a battle
</p>
<!-- react-empty: 201 -->
<div data-reactid="202">
<div class="inline_video_ad_container_container" data-reactid="203">
<div id="inline_video_ad_container" data-reactid="204">
<div class="" style="line-height:0;" data-reactid="205">
<div id="div_gpt_ad_outofpage_musixmatch_desktop_lyrics" data-reactid="206">
<script type="text/javascript">
//Really nice google ad JS which I have removed;
</script>
</div>
</div>
</div>
</div>
<p class="mxm-lyrics__content" data-reactid="207">But I got a war
More fancy lyrics
And lines
That I want to fetch
And display
Tralala
lala
Trouble!
</p>
</div>
</span>


Note that the first three lines of the lyrics are in the top <p>, with the rest in the bottom <p>. Also note that the two <p> tags have the same class.
The full HTML source can be found here:
view-source:https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here%E2%80%99s-a-War
The snippet starts at around line 97.


So in this specific example there are the lyrics, plus quite a bit of markup that I don't need. So far I've tried fetching the HTML with the following C#:

string source = "https://www.musixmatch.com/lyrics/Bullet-for-My-Valentine/You-Want-a-Battle-Here’s-a-War";

// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();

// Creates an HtmlDocument object from an URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load(source);

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("mxm - lyrics__content");

if (someNode != null)
{
    Console.WriteLine(someNode);
} else
{
    Console.WriteLine("Nope");
}

foreach (var node in document.DocumentNode.SelectNodes("//span/div[@id='site']/p[@class='mxm-lyrics__content']"))
{
    // here is your text: node.InnerText "//div[@class='sideInfoPlayer']/span[@class='wrap']"
    Console.WriteLine(node.InnerText);
}

Console.ReadKey();


Fetching the entire HTML works, but extracting the lyrics from it doesn't. Since the lyrics on this page aren't in an element with an id, I can't just use GetElementbyId. Can somebody point me in the right direction? I want to support multiple sites, so I will have to do this a few times for different sites.

Answer

One possible solution is to select the <p> elements by their class instead of by id:

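// Requires: using System; using System.Linq; using HtmlAgilityPack;
var htmlWeb = new HtmlWeb();
var documentNode = htmlWeb.Load(source).DocumentNode;

// Option 1: LINQ over all <p> descendants whose class attribute contains the lyrics class
var findclasses = documentNode.Descendants("p")
    .Where(d => d.Attributes["class"]?.Value.Contains("mxm-lyrics__content") == true);

// Option 2 (use instead of option 1): the same filter written as an XPath query.
// Note that SelectNodes returns null when nothing matches.
// var findclasses = documentNode.SelectNodes("//p[contains(@class,'mxm-lyrics__content')]");

// Join the text of all matching paragraphs into one string
var text = string.Join(Environment.NewLine, findclasses.Select(x => x.InnerText));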
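
Since the question also asks about supporting several sites, here is a rough sketch of how the XPath variant could be wrapped so that each site only contributes its own query. It assumes the HtmlAgilityPack NuGet package is installed. The class name LyricsExtractor, the method ExtractLyrics and the XPathByHost table are made up for illustration, and only the musixmatch entry is based on the markup shown in the question. The sketch parses an inline HTML string so it can be tried without a network request.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LyricsExtractor
{
    // Hypothetical per-site table: host name -> XPath that selects the lyric paragraphs.
    // Only the musixmatch entry comes from the markup in the question;
    // other sites would need their own XPath after inspecting their HTML.
    private static readonly Dictionary<string, string> XPathByHost =
        new Dictionary<string, string>
        {
            ["www.musixmatch.com"] = "//p[contains(@class,'mxm-lyrics__content')]"
        };

    // Returns the joined lyric text, or null if the host is unknown or nothing matched.
    public static string ExtractLyrics(string html, string host)
    {
        if (!XPathByHost.TryGetValue(host, out var xpath))
            return null; // no XPath registered for this site

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = document.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return null;

        // DeEntitize turns entities such as &amp; back into plain characters;
        // Trim drops the trailing newline before each closing </p>.
        return string.Join(Environment.NewLine,
            nodes.Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()));
    }

    public static void Main()
    {
        // Shortened version of the snippet from the question.
        const string html =
            "<span><p class=\"mxm-lyrics__content\">First line of the lyrics!\n" +
            "I don't want a battle\n</p>" +
            "<div><script>/* ad code */</script></div>" +
            "<p class=\"mxm-lyrics__content\">But I got a war\nTrouble!\n</p></span>";

        Console.WriteLine(ExtractLyrics(html, "www.musixmatch.com") ?? "No lyrics found");
    }
}
```

For a live page, the string returned by htmlWeb.Load(source).DocumentNode.OuterHtml (or the document itself) could be fed into the same logic. The null check matters because a site can change its class names at any time, in which case the XPath silently stops matching.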