Canned Man Canned Man - 1 month ago 7
SQL Question

How can I split a column by multiple conditions?

Background



I have a table of some 900 000 individuals (census of 1875). I am creating lists of criminals, and am trying to find children following their parents in prison. To do this, I am using two tables: table 1 contains all criminals; table 2 contains every individual. I want to search for people in table 2 that are listed below those in table 1, and are aged 0–15 years of age.

Preparation



In cleaning my tables, I have made a selection of roughly 6 000 that could not be easily selected and/or cleaned. The main parts of the data structure is as follows:

These collectively identify each individual Not part of table
┌───────┴───────┐ ┌───────────┴───────────┐
ID1 ID2 ID3 ID4 FNAME LNAME BYEAR *comment*
0515 004 0072 006 Ole Jacobsen 1825@1826 usable
0515 004 0072 007 Gunhild Thorsdatter 1831@1835 possibly usable
0515 004 0075 010 Hans Johnsden 1875@1865 unusable
0515 004 0099 001 Marit Jakobsdatter 1832??@1837?? usable
0515 004 0102 193 Arne Olsen 1836 %1805% usable
0515 004 0102 194 Rasmus Pedersen 1876!! usable after corrections


Explanation:




  • An @ sign indicates ''or''

  • A ?? indicates uncertainty

  • A !! indicates the source actually contained this wrongful information

  • %[text]% indicates text was stricken in the source



Method and Main Question



What I plan to do, is list every individual in which
BYEAR
is listed as
YYYY@YYYY
(year1[or]year2); I could do this with
like "*@*"
. I would, however, also like to have the information in BYEAR split into three columns:
BYEAR1
BYEARJUNK
(see note 3)
BYEAR2
. From these columns, I would be able to calculate the age difference, with clean numbers, thereby finding the individuals on which I could actually perform further studies.

So I repeat and clarify my question from the heading: How would I split the BYEAR column into three (or more) columns, with @ or space as delimiters?

Notes



1



Others have asked similar questions, but this one is about Unix and this other post answers the question by not splitting the data, which I need to do to perform calculations (and so others can easily reproduce my method).

2



I have added the Oracle tag. I am, however, presently working locally, but would in the future need to to the same operations towards the server.

3



The above does not clearly state how the junk column should be filled. It’s function is to list everything from the initial to the final separator. Here are the basic rules:

Approach 1




  • If there is a @,
    BYEAR
    can be considered to have only one separator.
    BYEAR1
    would be populated by everything before the @ and
    BYEAR2
    by everything following it;
    BYEARJUNK
    would contain only the @.

  • If there is a space following the first year – e.g. ‘1802?? eller 1808’ or ‘1802?? eller ant. 1808’,
    BYEARJUNK
    should be populated by respectively ‘ eller ’ and ‘ eller ant. ’ (including the spaces).

  • But what if there are spaces and an @ sign, e.g. ‘1802?? eller 1803@1808’? In such cases, I would count the first space as the initial separator and the (final) @ as the final separator. In other words:
    BYEAR1
    : ‘1802??’;
    BYEARJUNK
    : ‘ eller 1803@’;
    BYEAR2
    : ‘1808’.



Approach 2



Simply like this: @ and space is always considered a separator; the number of
BYEARJUNK
columns to create, would therefore be equal to the field with the most number of words (minus two). This method is probably the easiest and best approach, as one could easily isolate the data of interest afterwards.
BYEAR1
and
BYEAR2
should always be the first and last part of the field’s data.

Answer

In the case of a dataset that has been cleaned, i.e. there are always four digits followed by a separator (here @) followed by four digits, the following code solves the problem. Note: I have changed the names of the fields to L[eft]BYEAR and R[ight]BYEAR and have not created a column for junk, as the data have been cleaned.

SELECT
LEFT ([BYEAR], 4) AS LBYEAR,
RIGHT([BYEAR], 4) AS RBYEAR,
(LBYEAR-RBYEAR)   AS ΣBYEARDIFF,
*

FROM [mytable]

WHERE BYEAR like "*@*";

This creates three columns:

  • LBYEAR finds the first four characters (hopefully digits) in the column BYEAR.
  • RBYEAR finds the last four characters (hopefully digits) in the column BYEAR.
  • ΣBYEARDIFF calculates the difference.

The final column is very useful, as it returns error for fields where it could extract the correct data, i.e. fields in which the data had not been cleaned. Output example:

LBYEAR  RBYEAR  ΣBYEARDIFF  PERSNR  FORNVN      ETTNVN          FAAR
1857    1847    10          005     Dorthea     Evensen         1857@1847
1845    1847    -2          002     Katharine   Olsdatter Ovren 1838@1837
22.0    1825    #Error      002     Boel Karin  Amundsen        22.06@07.1825
1857    1867    -10         008     Thrine  Andersen            1857@1867
1858    1857    1           027     Karl Georg  Syvertsen       1858@1857