Andrew Andrew - 3 months ago 6
R Question

Why and where are \n newline characters getting introduced to c()?

Hoping someone can help me understand why errant \n characters are showing up in a vector of strings that I'm creating in R.

Trying to import and clean up a very wide data file that's in fixed width format
(http://www.state.nj.us/education/schools/achievement/2012/njask6/, 'Text file for data runs'). Followed the UCLA tutorial on using read.fwf and this excellent SO question to give the columns names after import.

Because the file is really wide, the column headers are LONG - all together, just under 29,800 characters. I'm passing them in as a simple vector of strings:

column_names <- c(...)


I'll spare you the ugly dump here but I dropped the whole thing on pastebin.

Was cleaning up and transforming some of the variables for analysis when I noticed that some of my subsets were returning 0 rows. After puzzling over it (did I misspell something?) I realized that somehow a bunch of '\n' newline characters had been introduced into my column headers.

If I loop over the column_names vector that I created

for (i in seq_along(column_names)) {
  print(column_names[i])
}


I see the first newline character in the middle of the 81st line -


SPECIAL\nEDUCATION SCIENCE Number Enrolled Science
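
Rather than scanning the printed output by eye, the affected entries can be located directly. The following is a minimal sketch; the three-element column_names vector is a hypothetical stand-in for the real one, with a newline placed in the second entry to mimic the problem.

```r
# Hypothetical stand-in for the real vector; the second entry has a stray newline.
column_names <- c("County Name",
                  "SPECIAL\nEDUCATION SCIENCE Number Enrolled Science",
                  "School Name")

# Indices of the entries that contain a newline character.
bad <- grep("\n", column_names, fixed = TRUE)
bad

# print() shows the newline as the escape sequence \n,
# whereas cat() would render it as an actual line break.
print(column_names[bad])
```

Using fixed = TRUE makes grep() treat the pattern as a literal string rather than a regular expression.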


Avenues that I tried to resolve this:

1) Is it something about my environment? I'm using the regular script editor in R, and my lines do wrap - but the breaks on my screen don't match the placement of the \n characters, which to me suggests that it's not the R script editor.

2) Is there a GUI setting? Did some searching, but couldn't find anything.

3) Is there a pattern? Seems like the newline characters get inserted about every 4000 characters. Did some reading on R/S primitives to try to figure out if this had something to do with basic R data structures, but was pretty quickly in over my head.

I tried breaking up the long string into shorter chunks, and then subsequently combining them, and that seemed to solve the problem.

column_names.1 <- c(...)
column_names.2 <- c(...)
column_names_combined <- c(column_names.1, column_names.2)


So I have an immediate workaround, but would love to know what's actually going on here.

Some of the posts that had to do with problems with character vectors suggested that I run a memory profile:

memory.profile()
       NULL      symbol    pairlist     closure environment     promise
          1        9572      220717        4734        1379        5764
   language     special     builtin        char     logical     integer
      63932         165        1550       18935       10302       30428
     double     complex   character         ...         any        list
       2039           1       60058           0           0       20059
 expression    bytecode externalptr     weakref         raw          S4
          1       16553         725         150         151        1162


I'm running R 2.15.1 (64-bit) on Windows 7 (Enterprise, SP 1, 8 gigs RAM).
Thanks!

Answer

I doubt this is a bug. Instead, it looks like you're running into a known limitation of the console. As Section 1.8 ("R commands, case sensitivity, etc.") of An Introduction to R says:

Command lines entered at the console are limited[3] to about 4095 bytes (not characters).

[3] some of the consoles will not allow you to enter more, and amongst those which do some will silently discard the excess and some will use it as the start of the next line.
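
Because the limit is counted in bytes rather than characters, one rough way to check whether a command is likely to hit it is to measure the command text with nchar(type = "bytes"). This is only a sketch: cmd is a made-up command string built for illustration, and 4095 is the approximate figure from the quote above, not an exact threshold for every console.

```r
# Hypothetical stand-in for one long command line typed at the console.
cmd <- paste0("x <- c(",
              paste(sprintf('"name %d"', 1:500), collapse = ", "),
              ")")

nchar(cmd, type = "chars")  # length in characters
nchar(cmd, type = "bytes")  # length in bytes, which is what the limit counts

# A non-ASCII character occupies more than one byte in UTF-8,
# so the two counts can differ.
accented <- enc2utf8("\u00e9")
nchar(accented, type = "chars")
nchar(accented, type = "bytes")

# TRUE if the command is longer than the approximate console limit.
exceeds_limit <- nchar(cmd, type = "bytes") > 4095
exceeds_limit
```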

Either put the command in a file and source it, or break the code into multiple lines by inserting your own newlines at appropriate points (between commas). For example:

column_names <-
  c("County Code/DFG/Aggregation Code", "District Code", "School Code",
    "County Name", "District Name", "School Name", "DFG", "Special Needs",
    "TOTAL POPULATION TOTAL POPULATION Number Enrolled LAL", ...)
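
The other option, putting the command in a file and sourcing it, might look like the sketch below. The temporary file path and the shortened eight-name vector are placeholders for illustration; in practice the script would be written in an editor and would hold the full list of names.

```r
# Write the assignment to a script file. The path and the shortened
# vector are placeholders; normally the file is created in an editor.
script <- tempfile(fileext = ".R")
writeLines(c(
  'column_names <-',
  '  c("County Code/DFG/Aggregation Code", "District Code", "School Code",',
  '    "County Name", "District Name", "School Name", "DFG", "Special Needs")'
), script)

# source() reads and parses the file itself, so the command never
# passes through the console's input line.
source(script)

# Check that no entry picked up a newline.
any(grepl("\n", column_names, fixed = TRUE))
```

Keeping each line of the script short, as in the example above, also makes the file easier to read and to diff.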