Forward Ed Forward Ed - 7 months ago 35
HTML Question

Extract data from a web page that may not be formatted as a table

For starters I am by no means an expert in VBA. Just know enough to be dangerous 8).

I started out by doing a search on how to extract a table from a web page and saw many many people have asked the same question. Unfortunately most of what I was reading was over my head. One article I read pointed me to this detailed article by Siddharth Rout, but alas I could not follow what was going on other than there are two methods internet explorer or some other methods. Since I only have IE11 installed and MS Office I would prefer to go the IE route.

I have encountered this problem several times in the past and have always dropped the project or done things manually. Today I thought I would try and learn how to do this and make my future life hopefully a little easier. As such I am going to use data from a gaming website since it mimics other things I have encountered in the past.

So today's (this week's..no this month's..I am an optimist!) project is to build a list of every team involved in a tournament and copy their results into excel. This would be akin to pulling cricket, hockey, baseball, soccer, or football stats. I tried using Excel's built in Get Data From Web process, but it did not identify the table on the web page.

The address for the web page is: http://worldoftanks.com/en/tournaments/1000000017/

and is in the image below

Web Page Example

So the basics and my starting point is to simply pull the list of teams from 1 group and paste it in an excel page with no formatting. Basically the area in yellow in the image above. The image could not fit the whole page but there are actually 10 teams in this group. However I would like to make it variable as sometimes you may have more or less than 10 teams in a group. I am going to assume the number of rows is a minor issue at this point.

Once I get that part figured out I am hoping it will be relatively easy to switch to the next group, grab that list of teams and results and add them to the end of the list I am building in excel. On the web page this would be done by selecting the blue areas.

Now once I have those two things figured out I would need to build the list again from scratch based on the stage of the tournament areas in green and put that list on a new page. I have some ideas how to achieve this but it will really depend on what the previous two steps look like.

I have a bonus task for myself too which is to pull the schedule for each team in a group to see how they did against various other teams. Who beat who type deal. I am hoping I can figure that part out based on the information learned from the task above.

So I am pretty sure there are other languages/prgs that are better suited for the task at hand, but I would like to stick with what I have...and the little I know so far. So I tried a wee bit of VBA code and commented on what I need to achieve. So far I think I have opened the webpage! and built a bit a thought process in comments on how to do some of the things.

Sub GetTeamData()
Dim IE As Object
Dim roundcounter As Integer
Dim groupcounter As Integer
Dim TeamList As Variant
Dim WebAddress As String
Dim Number_of_rounds as Integer
Dim Number_of_Groups as Integer

'set webaddress of site to link to
WebAddress = "http://worldoftanks.com/en/tournaments/1000000017/"

Set IE = CreateObject("InternetExplorer.Application")

With IE
.Visible = True
.navigate (WebAddress)
End With

'What does this chunk of code do? Wait for webpage to finish loading?
While IE.readyState <> 4
DoEvents
Wend

'set initial parameters for loops. I am ok with hardcoding this for now.
Number_of_groups = 125
Number_of_rounds = 5

'start pulling teamdata

'For roundcounter = 1 To number_of_rounds
'select roundcounter on webpage
'for groupcounter = 1 to number_of_groups
'select groupcounter on webpage
'grab table of 6-10 teams (position, team name, battles, wins, losses, ties, and points)
'add table to TeamList
'next groupcounter
'paste TeamList to sheet roundcounter cell A1
'clear TeamList
'next roundcounter


'Next task
'based on results on how to pull group table date, pull individual team schedule results to build matrix result

Set IE = Nothing

End Sub


One thing I was thinking about was that instead of using for next loops with a counter is if it would be easier to set it up to do a loop until an error had occurred like exceeding the number of groups or rounds. Now I am rambling.

Anyhow if someone would be so kind to get me started on how to pull the yellow area from the image above that would be much appreciated! Please be gentle! I do realize that this question has been asked many a time... I just did not understand what I was reading. Also if this is not possible or extremely difficult to do please let me know. Thank you in advance for your assistance in educating me.

UPDATE 16/03/19 0900



So I tried the Get Data From Web process again this morning with a bit more luck...but not much.

Get From Web

after 1 error window which I click yes to I get the web page to load

error window on load

I got the little yellow arrow to show up once on the page in the very top left corner. So I tried it and it did pull in information.

Get From Web tick box

but I did notice there were no yellow boxes next to the table I want which makes me wonder if it is not a table.

enter image description here

When I did pull in information, it was not the information I was looking for. When I scanned through the results, I could see where the data I am looking for should be, but all the results are missing, just the table column headers show up in about Row 263 or so.

web querry results

So then I tried doing a copy and paste method from the web page using select all for the copy on the web page. For the paste I tried different methods. keeping source formatting resulted in nothing. keep destination formatting brought in information. I tried paste special (html, Unicode and text) HTML made things look pretty and the other two put everything into a single column. More importantly the results were in the table.

copy paste results

Now if I only needed round 1 group 1 team list and results I could work with this. Simply delete all the rows above and below the table and voila! however since the web address is the same for every group and every round I have no idea how to "click" on the blue or green areas to update the info. If I knew this I could automate the process by copying and pasting each page, then editing the results to just the table, and moving the table to another sheet just below the last results.

To me there seems like there should be a better method.

16/03/19 1600





<!-- ko if: visibleBracketType() === ROUND_ROBIN -->
<table class="tournament-table tournament-table__indent" cellpadding="0" cellspacing="0">
<tr class="tournament-table_tr">
<th class="tournament-table_th tournament-table_th__numb">#</th>
<th class="tournament-table_th">
<div class="tournament-table_ico-holder">
<span class="ico-team">Team</span>
</div>
<div class="tournament-table_heading-text">
Team
</div>
</th>
<th class="tournament-table_th">
<div class="tournament-table_ico-holder">
<span class="ico-battles">Battles</span>
</div>
<div class="tournament-table_heading-text">
Battles
</div>
</th>
<th class="tournament-table_th">
<div class="tournament-table_ico-holder">
<span class="ico-victory">Victories</span>
</div>
<div class="tournament-table_heading-text">
Victories
</div>
</th>
<th class="tournament-table_th tournament-table_th__mobile-hide">
<div class="tournament-table_ico-holder">
<span class="ico-flag">Defeats</span>
</div>
<div class="tournament-table_heading-text">
Defeats
</div>
</th>
<th class="tournament-table_th tournament-table_th__mobile-hide">
<div class="tournament-table_ico-holder">
<span class="ico-division">Draws</span>
</div>
<div class="tournament-table_heading-text">
Draws
</div>
</th>
<th class="tournament-table_th">
<div class="tournament-table_ico-holder">
<span class="ico-points">Points</span>
</div>
<div class="tournament-table_heading-text">
Points
</div>
</th>
</tr>
<!-- ko foreach: {data: rrBrackets().teams, as: 'team' } -->
<tr class="tournament-table_tr" data-bind="css: {'tournament-table_tr__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
<td class="tournament-table_td" data-bind="text: team.position"></td>
<td class="tournament-table_td" data-bind="css: {'tournament-table_td__my-team': team.team_id === $root.currentUserTeamIdInCurrentGroup()}">
<a class="tournament-table_team tournament-table_team__big" target="_blank" data-bind="text: team.team_title, attr: {href: $root.getTournamentTeamUrl(team.team_id)}"></a>
</td>
<td class="tournament-table_td" data-bind="text: team.battle_played"></td>
<td class="tournament-table_td" data-bind="text: team.wins"></td>
<td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.losses"></td>
<td class="tournament-table_td tournament-table_td__mobile-hide" data-bind="text: team.draws"></td>
<td class="tournament-table_td" data-bind="text: team.extra_statistics.points"></td>
</tr>
<!-- /ko -->
</table>‚Äč





ok, from what I am gathering from the various posts I have been reading and videos I have been watching, I need to find some critical "Tag" in the coding of the web page and from that I can eventually start pulling data. I hit F12 on IE to view the code, and then in the code area I did a search on some of the display text in the area I was looking and found the above chunk of "code". With a lot of GUESSING I am hoping I grabbed the right chunk. Now to figure out what that critical tag is and how to use it. By the way, what code is that web page in?

Answer

So if written a small Sub which i think should solve your Problem if i understood you correctly. Of course you will invest some work, since it only reads one stage right now. But it reads the data from every Group:

Option Explicit

Private Sub CommandButton1_Click()

'make sure you add references to Microsoft Internet Controls (shdocvw.dll) and
 'Microsoft HTML object Library.
 'Code will NOT run otherwise.

Dim objIE As SHDocVw.InternetExplorer 'microsoft internet controls (shdocvw.dll)
Dim htmlDoc As MSHTML.HTMLDocument 'Microsoft HTML Object Library
Dim htmlInput As MSHTML.HTMLInputElement
Dim htmlColl As MSHTML.IHTMLElementCollection

Set objIE = New SHDocVw.InternetExplorer

Dim htmlCurrentDoc As MSHTML.HTMLDocument 'Microsoft HTML Object Library

Dim RowNumber As Integer
            RowNumber = 1

With objIE
    .Navigate "http://worldoftanks.com/en/tournaments/1000000017/" ' Main page
    .Visible = 0
    Do While .READYSTATE <> 4: DoEvents: Loop
        Application.Wait (Now + TimeValue("0:00:01"))


        Set htmlDoc = .document

        Dim ButtonRoundData As Variant
        Set ButtonRoundData = htmlDoc.getElementsByClassName("group-stage_link")

        Dim ButtonData As Variant
        Set ButtonData = htmlDoc.getElementsByClassName("groups_link")



        Dim button As HTMLLinkElement
        For Each button In ButtonData

           Debug.Print button.nodeName

            button.Click

               Application.Wait (Now + TimeValue("0:00:02")) ' This is to prevent double entryies but it is not clean. you should definitly check if the table is still the same and wait then

            Set htmlCurrentDoc = .document
            Dim RawData As HTMLTable
            Set RawData = htmlCurrentDoc.getElementsByClassName("tournament-table tournament-table__indent")(0)



            Dim ColumnNumber As Integer
            ColumnNumber = 1

            Dim hRow As HTMLTableRow
            Dim hCell As HTMLTableCell
            For Each hRow In RawData.Rows

                For Each hCell In hRow.Cells
                    Cells(RowNumber, ColumnNumber).Value = hCell.innerText
                    ColumnNumber = ColumnNumber + 1
                Next hCell
                ColumnNumber = 1
                RowNumber = RowNumber + 1
            Next hRow

            RowNumber = RowNumber + 3
        Next button
    End With

End Sub

What it does is starting an invisible IE, reads the data, clicks the button, reads the next and so on ...

for Debugging i suggest to set .Visible to 1, so you will se what happens.

EDIT 1: if you get a debbuging error, try to Abort and run it again, it definitly Needs some error handling, if the Website isn't loaded right.

EDIT 2: Made it a bit stabler, you should really pay Attention, since the Webpage takes some time to load, you MUST check if the data has changed before writting it. if it hasn't changed wait a second or so and then try again.

Here some sample data i got in Excel:

Sample Data