aoceano aoceano - 3 months ago 21
R Question

R sets of coordinates extract from string

I'am trying to extract sets of coordinates from strings and change the format.

I have tried some of the stringr package and getting nowhere with the pattern extraction.
It's my first time dealing with regex and still is a little confusing to create a pattern.

There is a data frame with one column with one or more sets of coordinates.
The only pattern (the majority) separating Lat from Long is (-), and to separate one set of coordinates to another there is a (/)

Here is an example of some of the data:

ID Coordinates
1 3438-5150
2 3346-5108/3352-5120 East island, South port
3 West coast (284312 472254)
4 28.39.97-47.05.62/29.09.13-47.44.03
5 2843-4722/3359-5122(1H-2H-3H-4F)


Most of the data is in decimal degree, e.g. (id 1 is Lat 34.38 Lon 51.50), some others is in 00º00'00'', e.g. (id 4 is Lat 28º 39' 97'' Lon 47º 05' 62'')

I will need to make in a few steps

1 - Extract all coordinates sets creating a new row for each set of each record;

2 - Extract the text label of record to a new column, concatenating them;

3- Convert the coordinates from 00º00'00''(28.39.97) to 00.0000º (28.6769 - decimal dregree) so all coordinates are in the same format. I can easily convert if they are as numeric.

4 - Add dot (.) to separate the decimal degree values (from 3438 to 34.38) and add (-) to identify as (-34.38) south west hemisphere. All value must have (-) sign.

I'am trying to get something like this:

Step 1 and 2 - Extract coordinates sets and names

ID x y label
1 3438 5150
2 3346 5108 East island, South port
2 3352 5120 East island, South port
3 284312 472254 West coast
4 28.39.97 47.05.62
4 29.09.13 47.44.03
5 2843 4722 1H-2H-3H-4F
5 3359 5122 1H-2H-3H-4F


Step 3 - convert coordinates format to decimal degree (ID 4)

ID x y label
1 3438 5150
2 3346 5108 East island, South port
2 3352 5120 East island, South port
3 284312 472254 West coast
4 286769 471005
4 291536 470675
5 2843 4722 1H-2H-3H-4F
5 3359 5122 1H-2H-3H-4F


Step 4 - change display format

ID x y label
1 -34.38 -51.50
2 -33.46 -51.08 East island, South port
2 -33.52 -51.20 East island, South port
3 -28.43 -47.22 West coast
4 -28.6769 -47.1005
4 -29.1536 -47.0675
5 -28.43 -47.22 1H-2H-3H-4F
5 -33.59 -51.22 1H-2H-3H-4F


I have edit the question to better clarify my problems and change some of my needs. I realized that it was messy to understand.

So, has anyone worked with something similar?
Any other suggestion would be of great help.

Thank you again for the time to help.

Answer

Note: the first answers address the original asking of the question and the last answer addresses its current state. The data in data1 should be set appropriately for each solution.

The following should address your first question given the data you provided and the expected output (using dplyr and tidyr).

library(dplyr)
library(tidyr)

### Load Data
data1 <- structure(list(ID = 1:4, Coordinates = c("3438-5150", "3346-5108/3352-5120", 
"2843-4722/3359-5122(1H-2H-3H-4F)", "28.39.97-47.05.62/29.09.13-47.44.03"
)), .Names = c("ID", "Coordinates"), class = "data.frame", row.names = c(NA, 
-4L))

### This is a helper function to transform data that is like '1234'
### but should be '12.34', and leaves alone '12.34'.
### You may have to change this based on your use case.
div100 <- function(x) { return(ifelse(x > 100, x / 100, x)) }

### Remove items like "(...)" and change "12.34.56" to "12.34"
### Split into 4 columns and xform numeric value.
data1 %>%
    mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
           Coordinates = gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', Coordinates)) %>%
    separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
    mutate_at(vars(matches('^[xy][.]')), div100) # xform columns x.N and y.N
##   ID   x.1   y.1   x.2   y.2
## 1  1 34.38 51.50    NA    NA
## 2  2 33.46 51.08 33.52 51.20
## 3  3 28.43 47.22 33.59 51.22
## 4  4 28.39 47.05 29.09 47.44

The call to mutate modifies Coordinates twice to make substitutions easier.

Edit

A variation that uses another regex substitution instead of mutate_at.

data1 %>%
mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
       Coordinates = gsub('(\\d{2}[.]\\d{2})[.]\\d{2}', '\\1', Coordinates),
       Coordinates = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coordinates)) %>%
separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE)

Edit 2: The following solution addresses the updated version of the question

The following solution does a number of transformations to transform the data. These are separate to make it a bit easier to think about (much easier relatively speaking).

library(dplyr)
library(tidyr)

data1 <- structure(list(ID = 1:5, Coordinates = c("3438-5150", "3346-5108/3352-5120 East island, South port", 
"East coast (284312 472254)", "28.39.97-47.05.62/29.09.13-47.44.03", 
"2843-4722/3359-5122(1H-2H-3H-4F)")), .Names = c("ID", "Coordinates"
), class = "data.frame", row.names = c(NA, -5L))

### Function for converting to numeric values and
### handles case of "12.34.56" (hours/min/sec)
hms_convert <- function(llval) {
  nres <- rep(0, length(llval))
  coord3_match_idx <- grepl('^\\d{2}[.]\\d{2}[.]\\d{2}$', llval)
  nres[coord3_match_idx] <- sapply(str_split(llval[coord3_match_idx], '[.]', 3), function(x) { sum(as.numeric(x) / c(1,60,3600))})
  nres[!coord3_match_idx] <- as.numeric(llval[!coord3_match_idx])
  nres
}

data1 %>%
  mutate(Coordinates_new = Coordinates) %>%
  mutate(Coordinates_new = gsub('\\([^) ]+\\)', '', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(.*?)\\(((\\d{6})[ ](\\d{6}))\\).*', '\\3-\\4 \\1', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(\\d{2})(\\d{2})(\\d{2})', '\\1.\\2.\\3', Coordinates_new)) %>%
  mutate(Coordinates_new = gsub('(\\S+)[\\s]+(.+)', '\\1|\\2', Coordinates_new, perl = TRUE)) %>%
  separate(Coordinates_new, c('Coords', 'label'), fill = 'right', sep = '[|]', convert = TRUE) %>%
  mutate(Coords = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coords)) %>%
  separate(Coords, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
  mutate_at(vars(matches('^[xy][.]')), hms_convert) %>%
  mutate_at(vars(matches('^[xy][.]')), function(x) ifelse(!is.na(x), -x, x))

##   ID                                 Coordinates       x.1       y.1       x.2       y.2                   label
## 1  1                                   3438-5150 -34.38000 -51.50000        NA        NA                    <NA>
## 2  2 3346-5108/3352-5120 East island, South port -33.46000 -51.08000 -33.52000 -51.20000 East island, South port
## 3  3                  East coast (284312 472254) -28.72000 -47.38167        NA        NA             East coast 
## 4  4         28.39.97-47.05.62/29.09.13-47.44.03 -28.67694 -47.10056 -29.15361 -47.73417                    <NA>
## 5  5            2843-4722/3359-5122(1H-2H-3H-4F) -28.43000 -47.22000 -33.59000 -51.22000                    <NA>
Comments