Anastasia Pupynina - 4 months ago
R Question

Scraping a password-protected forum in R

I am having trouble logging in from my script. I have read the good answers to similar questions on Stack Overflow, but none of the solutions worked for me.

I am scraping a web forum for my PhD research. Its URL is http://forum.axishistory.com.

The page I want to scrape is the memberlist, a page that lists links to all member profiles. The memberlist is only accessible when logged in; if you try to access it without logging in, you get the login form instead.

The URL of the memberlist is this: http://forum.axishistory.com/memberlist.php.

I tried the httr package:

library(httr)
library(rvest)  # provides read_html(), which replaces the deprecated html()
members <- GET("http://forum.axishistory.com/memberlist.php", authenticate("username", "password"))
members_html <- read_html(members)


The output is the login form.

Then I tried RCurl:

library(RCurl)
library(XML)  # provides htmlParse()
members_html <- htmlParse(getURL("http://forum.axishistory.com/memberlist.php", userpwd = "username:password"), asText = TRUE)
members_html


The output is the login form again.

Then I tried the list() approach from this topic, Scrape password-protected website in R:

handle <- handle("http://forum.axishistory.com/")
path   <- "ucp.php?mode=login"

login <- list(
  amember_login        = "username",
  amember_pass         = "password",
  amember_redirect_url = "http://forum.axishistory.com/memberlist.php"
)

response <- POST(handle = handle, path = path, body = login)


And again, the output is the login form.

Next I am going to try RSelenium, but after all these attempts I wonder whether I am missing something (probably something completely obvious).

I have looked at other relevant posts here, but couldn't figure out how to apply their code to my case:

How to use R to download a zipped file from a SSL page that requires cookies

Scrape password-protected website in R

Scrape password protected https website in R

Web scraping password protected website using R

Answer

Thanks to Simon I found the answer here: Using RVest or httr to log in to non-standard forms on a webpage

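library(rvest)

# Function names below are those of rvest >= 1.0; older versions used
# html_session(), set_values(), submit_form(), jump_to() and html().
url       <- "http://forum.axishistory.com/memberlist.php"
# Requesting the memberlist while logged out returns the login form
pgsession <- session(url)

# The login form is the second form on that page
pgform <- html_form(pgsession)[[2]]

# Replace the placeholder strings with your real credentials
filled_form <- html_form_set(pgform,
                             username = "username",
                             password = "password")

# Submit the form and keep the returned, logged-in session
pgsession  <- session_submit(pgsession, filled_form)
memberlist <- session_jump_to(pgsession, url)

page <- read_html(memberlist)

usernames <- html_elements(page, css = "#memberlist .username")

data_usernames <- html_text(usernames, trim = TRUE)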
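
The index [[2]] and the field names "username" and "password" come from the page itself, so it can help to inspect the forms before filling one in. Below is a minimal offline sketch of that check. It assumes rvest 1.0 or later, and the HTML fragment is a made-up stand-in for a logged-out page (a search form followed by a login form), not the forum's actual markup. The names sample_page, field_names and login_index are only for illustration.

```r
library(rvest)

# Hypothetical markup: a search form followed by a login form.
# The real page may contain different forms and fields.
sample_page <- read_html('
  <html><body>
    <form id="search" method="get" action="./search.php">
      <input type="text" name="keywords">
    </form>
    <form id="login" method="post" action="./ucp.php?mode=login">
      <input type="text" name="username">
      <input type="password" name="password">
      <input type="submit" name="login" value="Login">
    </form>
  </body></html>
')

forms <- html_elements(sample_page, "form")

# For each form, collect the name attribute of every <input>
field_names <- lapply(forms, function(f) {
  html_attr(html_elements(f, "input"), "name")
})
names(field_names) <- html_attr(forms, "id")
print(field_names)

# Position of the form that contains a password field
has_password <- vapply(forms, function(f) {
  length(html_elements(f, "input[type='password']")) > 0
}, logical(1))
login_index <- which(has_password)
print(login_index)

# The same index can then be used with html_form()
login_form <- html_form(sample_page)[[login_index]]
```

On a real site, replace sample_page with the session or page you fetched. If the field names printed here differ from the ones you send (for example amember_login instead of username), the server will most likely ignore the credentials and return the login form again.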