user3142695 - 5 months ago
Javascript Question

Regex for splitting reference string into its components

Updated the question with more representative example strings.

There are strings like this:

Name I, Some-Thing A, More BC (2016) Example: A string title. Publication. 12:123-54
Name I, Some-Thing A, More BC, et al. (2016) Example: A string title? Publication. 12:123-54
Name I, Some-Thing A, More BC: Example: A string title. Publication 2016; 12: 123-54
Name I, Some-Thing A, More BC: Example: A string title. Publication 2016; 12: 123
Name I, Some-Thing A, More BC (2016): Example: A string title. Publication 12, 123-54
Name I, Some-Thing A, More BC (2016): Example: A string title. Publication 12 (6), 123-54
Name I, Some-Thing A, More BC: Example: A string title. Publication. 2016 June;12(6):123-54. Ignore this


Now I'm trying to extract the parts of them to get the result:

1: Name I, Some-Thing A, More BC || Name I, Some-Thing A, More BC, et al.
2: 2016
3: Example: A string title? || Example: A string title
4: Publication
5: 12
6: 123-54 || 123


This is what I have so far:

/([\w-]+ [A-Z]{1,3}(?:, [\w-]+ [A-Z]{1,3})*(?:, et al\.)*)|\((\d{4})\)?|([\w:]+[\w ]+(?=\.|\?|$))|(\d+(?=:))|([\d-]+)/g


https://regex101.com/r/wB3wU4/2

Thanks to anubhava and Jan so far.

But with this I don't get the Publication in every case. In the last string I would like to ignore everything after the page number, and I need to ignore the number in parentheses in front of the page number (if there is one).

The second problem for me is how to process this data properly, as the position of the matches can differ. Example: normally match[2] should be the year, but for the 3rd string that wouldn't be the case. So the results get mixed up :-(

Answer

Time for me to throw my hat in the ring. This is what I came up with:

^(.*?)\s*(?:\(((?:19|20)\d\d)\)|:)[\s:]*(.*?[?.!])\s*([\w\s]+?)\.?\s*(?:((?:19|20)\d\d)(?:\s+\w+)?)?[.;\s]*(\d+)\s*(?:\(\d+\))?[,:\s]+(\d+(?:-\d+)?)[^\d]*$

See it here at regex101.

Because of the complexity I won't try to explain every bit of the regex here, but check the link to regex101 and you'll see an explanation in the right pane.

I'll try to explain the gist of it, however. It relies on a few assumptions that I'm not sure always hold, but...

From the beginning of the string

  • the first part (Title 1 below) must end with a year in parentheses or a colon (:). The year match is limited to the twentieth or twenty-first century, i.e. 1900-2099.

Then from the back:

  • there can be no digits in the ignore-part
  • the last capture is always in the form number dash number, where the two latter parts (the dash and the second number) are optional.
  • the 12 part is a number, optionally preceded by a ., ; or a space, and followed by an optional number inside parentheses. The parentheses can optionally be preceded by spaces. This whole part is then followed by at least one ,, : or space.
  • the 12 part is optionally preceded by a year, which in turn optionally is followed by a month (thrown away).

Between the start and the end of the string, right after the first part, a "sentence" is captured, ending with a punctuation mark (., ? or !). After that comes the second "sentence": the Publication part.
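To make the pieces described above easier to follow, here is a sketch that assembles the same pattern from commented fragments. The fragment boundaries, the comments and the names `parts` and `assembled` are illustrative choices, not the regex101 explanation; the joined source is meant to be character-for-character identical to the regex above.

```javascript
// One possible reading of the pattern, split into commented fragments.
// The split points and variable names are assumptions made for illustration.
var parts = [
  /^(.*?)\s*/,                        // group 1: everything up to the first terminator (lazy)
  /(?:\(((?:19|20)\d\d)\)|:)/,        // terminator: "(year)" captured as group 2, or a colon
  /[\s:]*/,                           // stray whitespace or colons after the terminator
  /(.*?[?.!])/,                       // group 3: first "sentence", ending in ?, . or !
  /\s*([\w\s]+?)\.?\s*/,              // group 4: the Publication part, optional trailing dot
  /(?:((?:19|20)\d\d)(?:\s+\w+)?)?/,  // group 5: optional year, plus an optional month (not captured)
  /[.;\s]*(\d+)\s*/,                  // group 6: the "12" part
  /(?:\(\d+\))?/,                     // optional number in parentheses such as "(6)" (not captured)
  /[,:\s]+/,                          // separator before the last field
  /(\d+(?:-\d+)?)/,                   // group 7: "123" or "123-54"
  /[^\d]*$/                           // trailing text without digits, ignored
];

// Join the fragment sources back into a single regular expression.
var assembled = new RegExp(parts.map(function (p) { return p.source; }).join(''));

var m = 'Name I, Some-Thing A, More BC (2016): Example: A string title. Publication 12 (6), 123-54'.match(assembled);
console.log(m.slice(1)); // the seven capture groups, in order
```

Splitting the pattern this way only helps readability; the matching behaviour should be the same as with the one-line literal.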

This gives us the following capture groups:

  1. Title 1
  2. (Optionally) Year
  3. Title 2
  4. Sentence Publication
  5. (Optionally) Year
  6. 12 part
  7. 123-54 part

I.e. the year is in either group 2 or group 5.
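Since the year can land in either group, it may be convenient to normalise the match into one object before using it. The following is a minimal sketch; the helper name `parseReference` and the property names (`title1`, `volume`, `pages`, ...) are made up for illustration and are not part of the question. Note that group 3 keeps its trailing punctuation.

```javascript
// Regex copied verbatim from the answer above.
var re = /^(.*?)\s*(?:\(((?:19|20)\d\d)\)|:)[\s:]*(.*?[?.!])\s*([\w\s]+?)\.?\s*(?:((?:19|20)\d\d)(?:\s+\w+)?)?[.;\s]*(\d+)\s*(?:\(\d+\))?[,:\s]+(\d+(?:-\d+)?)[^\d]*$/;

// Hypothetical helper: name and field names are illustrative only.
function parseReference(str) {
  var m = str.match(re);
  if (!m) {
    return null; // no match at all: let the caller decide what to do
  }
  return {
    title1: m[1],
    year: m[2] || m[5] || null, // the year is in group 2 or in group 5
    title2: m[3],
    publication: m[4],
    volume: m[6],               // the "12 part"
    pages: m[7]                 // the "123-54 part"
  };
}

var a = parseReference('Name I, Some-Thing A, More BC (2016) Example: A string title. Publication. 12:123-54');
var b = parseReference('Name I, Some-Thing A, More BC: Example: A string title. Publication 2016; 12: 123');
console.log(a.year, b.year); // 2016 2016
```

With this in place, the calling code no longer needs to know which of the two groups carried the year.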

It feels quite fragile, but it may get the job done for you. ;)

Edit

I've made a JS snippet to illustrate:

var theStrings = [
		'Name I, Some-Thing A, More BC (2016) Example: A string title. Publication. 12:123-54',
		'Name I, Some-Thing A, More BC, et al. (2016) Example: A string title? Publication. 12:123-54',
		'Name I, Some-Thing A, More BC: Example: A string title. Publication 2016; 12: 123-54',
		'Name I, Some-Thing A, More BC: Example: A string title. Publication 2016; 12: 123',
		'Name I, Some-Thing A, More BC (2016): Example: A string title. Publication 12, 123-54',
		'Name I, Some-Thing A, More BC (2016): Example: A string title. Publication 12 (6), 123-54',
		'Name I, Some-Thing A, More BC: Example: A string title. Publication. 2016 June;12(6):123-54. Ignore this',
		'Name I, Some-Thing A, More BC (2050) Example: A string title. Placeholder. 55:123-54',
		'Name I, Some-Thing A, More BC, et al. (2016) Example: A string title? Word. 22:123-54',
		'Name: Example: A string title. Variable 2014; 31: 123-54',
		'This can basically be anything!: Example: A string title. Publication 100 2058; 789: 123',
		'Name I, Some-Thing A, More BC (1998): Example: A string title. What Ever 4, 123-54',
		'Name I, Some-Thing A, More BC (2016): Example: A string title. Journey of 2000 miles 54 (6), 123-54',
		'Name I, Some-Thing A, More BC: Example: A string title. Some Words. 1999 June;1(6):123-54. Ignore this'
	],
	re = /^(.*?)\s*(?:\(((?:19|20)\d\d)\)|:)[\s:]*(.*?[?.!])\s*([\w\s]+?)\.?\s*(?:((?:19|20)\d\d)(?:\s+\w+)?)?[.;\s]*(\d+)\s*(?:\(\d+\))?[,:\s]+(\d+(?:-\d+)?)[^\d]*$/,
	res,
	i,
	output = '<style>caption {background-color: blue; color: white;} th {background-color: lightblue;}</style>';

for (i = 0; i < theStrings.length; i++) {

	res = theStrings[i].match(re);

	// The caption comes first in the table, followed by the header row.
	output += '<table border="1" style="width:100%">';
	output += '<caption>The string "' + theStrings[i] + '" ends up as:</caption>';
	output += '<tr><th style="width:30%">Title 1</th><th style="width:10%">Year</th><th style="width:30%">Title 2</th><th style="width:10%">Value 4</th><th style="width:10%">Value 5</th><th style="width:10%">Value 6</th></tr>';
	// match() returns null when the regex does not match at all.
	if (!res) {
		output += '<tr><td colspan="6">No match</td></tr></table><br/>';
		continue;
	}
	output += '<tr><td>' + res[1] + '</td>';
	// The year is captured in group 2 or in group 5.
	output += '<td>' + (res[2] ? res[2] : res[5]) + '</td>';
	output += '<td>' + res[3] + '</td>';
	output += '<td>' + res[4] + '</td>';
	output += '<td>' + res[6] + '</td>';
	output += '<td>' + res[7] + '</td></tr></table><br/>';
}
document.write(output);

Edit

Comment: The title ends with a year in parentheses, a colon, OR a dot.

I haven't completely grasped what the different parts are, but I assume in this case it's the first field we're talking about. (The third field in the examples ends "A string title"...) The regex in its current form handles year and colon. So to add dot to the field terminators you could change the : in question to [:.], allowing either:

                          Here: ▼▼▼▼
^(.*?)\s*(?:\(((?:19|20)\d\d)\)|[:.])[\s:]*(.*?[?.!])\s*([\w\s]+?)\.?\s*(?:((?:19|20)\d\d)(?:\s+\w+)?)?[.;\s]*(\d+)\s*(?:\(\d+\))?[,:\s]+(\d+(?:-\d+)?)[^\d]*$
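One thing that seems worth checking with this variant is how it treats a dot that appears inside the first field, such as the one in "et al.". The sketch below runs the [:.] version against a made-up string whose first field ends with a dot (an assumed input, not one of the question's examples) and against the question's "et al." string. The variable names are illustrative only.

```javascript
// Variant from the edit above: the lone ":" terminator replaced by "[:.]".
var reDot = /^(.*?)\s*(?:\(((?:19|20)\d\d)\)|[:.])[\s:]*(.*?[?.!])\s*([\w\s]+?)\.?\s*(?:((?:19|20)\d\d)(?:\s+\w+)?)?[.;\s]*(\d+)\s*(?:\(\d+\))?[,:\s]+(\d+(?:-\d+)?)[^\d]*$/;

// Made-up input whose first field ends with a dot (assumption for illustration).
var dotEnded = 'Name I, More BC. Example: A string title. Publication 2016; 12: 123-54';
// Second example string from the question, containing "et al."
var etAl = 'Name I, Some-Thing A, More BC, et al. (2016) Example: A string title? Publication. 12:123-54';

var m1 = dotEnded.match(reDot);
console.log(m1[1]); // Name I, More BC
console.log(m1[5]); // 2016

var m2 = etAl.match(reDot);
console.log(m2[1]);        // Name I, Some-Thing A, More BC, et al
console.log(m2[3]);        // (2016) Example: A string title?
console.log(m2[2], m2[5]); // undefined undefined
```

So the dot terminator handles the dot-ended field, but because the first group is lazy it also stops at the dot in "et al.", and the year then ends up inside group 3 instead of group 2 or 5. If both kinds of input occur, the terminator probably needs a further restriction.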