I have four 100GB CSV files in which two fields need to be concatenated. Luckily, the two fields are adjacent.
My thought is to remove the 41st occurrence of
from each line, and then my two fields will be properly united and ready to upload to the analytical tool I use.
The development machine is a Windows 10 box with 4 x 3.6GHz cores and 64GB of RAM. I push the file to a CentOS 7 server with 40 x 2.4GHz cores and 512GB of RAM, where I have sudo access, so I can change the file there if someone has a solution that depends on Linux tools. The goal is to accomplish the task in the fastest/easiest way possible. I have to repeat this task monthly and would be ecstatic to automate it.
My original approach was to load the CSV into MySQL, concatenate the two fields, and drop the old columns, then export the table as a CSV again and push it to the server. This takes two days and is laborious.
Right now I'm torn between learning to use sed
and using something I'm more familiar with, like node.js, to stream the files line by line into a new file and then push that to the server.
If you recommend using sed, I've read here
but I don't know how to remove the nth occurrence from each line.
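For what it's worth, sed's `s` command accepts a numeric flag that targets only the nth match on a line, so the removal can be a one-liner. A minimal sketch, assuming the string to delete is `","` (a placeholder, since the question's actual string isn't shown), demonstrated with the 2nd occurrence so the sample line stays short:

```shell
# Build a tiny sample; the line contains three occurrences of «","».
printf '%s\n' 'a","b","c","d' > input.csv

# sed 's/PATTERN//N' deletes only the Nth occurrence of PATTERN on each
# line. For the real files the flag would be 41: sed 's/","//41'
sed 's/","//2' input.csv > output.csv

cat output.csv   # the 2nd «","» is gone: a","bc","d
```

Because sed streams the input, memory use stays flat regardless of file size, and GNU sed's `-i` flag would edit the file in place instead of writing a second copy.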
Cyrus asked for a sample input/output.
Input file formatted thusly:
Output file formatted like so: