Aunt Jamaima - 3 years ago
Node.js Question

How to remove the nth occurrence of a substring from each line on four 100GB files

I have four 100GB CSV files where two fields need to be concatenated. Luckily the two fields are next to each other.

My thought is to remove the 41st occurrence of "," from each line, and then my two fields will be properly united and ready to be uploaded to an analytical tool that I use.

The development machine is a Windows 10 machine with 4 x 3.6GHz and 64G RAM, and I push the files to a CentOS 7 server with 40 x 2.4GHz and 512G RAM. I have sudo access on the server and I can technically change the files there if someone has a solution that depends on Linux tools. The idea is to accomplish the task in the fastest/easiest way possible. I have to repeat this task monthly and would be ecstatic to automate it.

My original way of accomplishing this was to load the CSV into MySQL, concat the fields, remove the old fields, export the table as a CSV again, and push it to the server. This takes two days and is laborious.

Right now I'm torn between learning to use sed and using something I'm more familiar with, like node.js, to stream the files line by line into new files and then push those to the server.

If you recommend using sed, I've read here and here but don't know how to remove the nth occurrence from each line.

Edit: Cyrus asked for a sample input/output.
Input file formatted thusly:


Output file formatted like so:


Answer Source

If you want to remove the 41st occurrence of "," from each line, then you can try:

sed -i 's/","//41' file
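To see what the numeric flag does before running it on 100GB files, here is a small sketch. It assumes every field is quoted, that no field contains the sequence "," itself, and that writing new files is acceptable instead of editing in place with -i. The function names and the .merged.csv suffix are made up for illustration. Lines with fewer matches than the requested number are left unchanged.

```shell
# Remove the Nth occurrence of "," from every line read on stdin.
# $1 = N. A number after the final slash makes sed replace only that match.
# LC_ALL=C asks sed to treat input as plain bytes, which is often faster.
remove_nth_separator() {
  LC_ALL=C sed "s/\",\"//$1"
}

# Run the substitution on several files at once, one background job per file.
# Each input NAME.csv produces NAME.merged.csv (hypothetical naming scheme).
merge_fields_in_files() {
  n=$1
  shift
  for f in "$@"; do
    remove_nth_separator "$n" < "$f" > "${f%.csv}.merged.csv" &
  done
  wait
}

# Tiny demonstration with N=2 instead of 41: fields b and c are joined.
printf '%s\n' '"a","b","c","d"' | remove_nth_separator 2
# prints: "a","bc","d"
```

For the real job, something like merge_fields_in_files 41 one.csv two.csv three.csv four.csv (placeholder file names) would process the four files in parallel on the server; whether that is faster than running them one after another depends on disk throughput.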