Jon H. - 3 months ago
HTML Question

Parsing HTML document to retrieve a list of items

Greetings, Stack Overflow! I am looking for a little help on how to parse an HTML document. My challenge is that I cannot use a third-party DLL such as HTML Agility Pack. Unfortunately, this all has to be done via code or references native to VS. I was looking into JSON, but I thought maybe someone had an easier way. I am trying to retrieve certain data from webpages like: There are multiple sections I am looking to retrieve data from. Each section starts with:

new Listview({template:

and within that section there is an "id". What I am looking for are lists of the IDs, grouped by the type of item they come from (spell, npc, object, etc.).

Unfortunately, my skill set is not up to par with this or with regex. I was hoping someone could help me out. Thank you in advance for your time.

Edit: Unfortunately the solution provided did not work for me.


Well, hundreds of SO users will tell you not to regex HTML, but you're technically scraping the content within <script>...</script> tags, so you may be able to get away with this one.

Let's take a crack at it.

After inspecting the page source, it appears that the JS within the <script>...</script> tags is formatted consistently. This makes our job easy.


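We know that the id attribute will follow the template attribute. We also know that the developer of this webpage consistently used single quotes to surround the id and template values. Therefore we'll capture the contents within the single quotes that follow the template and id attribute names using '([^']++)'. Note that ++ is a possessive quantifier. Engines that lack it (.NET's Regex class among them) need the plain form '([^']+)', which matches the same text here because [^'] can never consume the closing quote.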
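As a rough illustration of the capture described above, here is a minimal Python sketch. The sample script text, the names `SAMPLE_SCRIPT`, `LISTVIEW_PATTERN`, and `ids_by_template`, and the assumption that `id` comes immediately after `template` are all hypothetical, since the real page source is not shown in the post. It uses the plain `'([^']+)'` form of the capture group rather than the possessive `++`, so the same pattern should carry over to engines without possessive quantifiers.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the page's inline <script> content;
# the real markup may differ.
SAMPLE_SCRIPT = """
new Listview({template: 'spell', id: 'abilities', name: 'Abilities'});
new Listview({template: 'npc', id: 'dropped-by', name: 'Dropped by'});
new Listview({template: 'spell', id: 'teaches', name: 'Teaches'});
"""

# Group 1 captures the template value, group 2 the id value.
# Assumes both are single-quoted and that id directly follows template.
LISTVIEW_PATTERN = re.compile(
    r"new Listview\(\{\s*template:\s*'([^']+)'\s*,\s*id:\s*'([^']+)'"
)


def ids_by_template(script_text):
    """Return a dict mapping each template type to the ids found for it."""
    grouped = defaultdict(list)
    for template, item_id in LISTVIEW_PATTERN.findall(script_text):
        grouped[template].append(item_id)
    return dict(grouped)


if __name__ == "__main__":
    for template, ids in ids_by_template(SAMPLE_SCRIPT).items():
        print(template, ids)
```

Grouping by the first capture gives one list of ids per item type, which is the shape the question asks for. If the real script separates `template` and `id` with other attributes, the pattern would need to be loosened accordingly.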