Matt - 8 days ago
Bash Question

Textfile processing: Undo text wrapping (column and line)

I have some very large text files which are the output from an old mainframe application. I no longer have access to the source application but need to perform some data analysis on the output.

The data is basically tab-separated values, but the source system wraps the output: columns are split into groups based on line width, and rows are broken into pages based on the number of lines.

Contents of text files look something like this (this is mockup data):

Page 1:

Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
------------------------------------------------------------
1111 1111 1111 1111 1111 1111 1111 1111
2222 2222 2222 2222 2222 2222 2222 2222
3333 3333 3333 3333 3333 3333 3333 3333
4444 4444 4444 4444 4444 4444 4444 4444
5555 5555 5555 5555 5555 5555 5555 5555
6666 6666 6666 6666 6666 6666 6666 6666
7777 7777 7777 7777 7777 7777 7777 7777
-----------------------------------------------------------

Col9 Col10 Col11
--------------------
1111 1111 1111
2222 2222 2222
3333 3333 3333
4444 4444 4444
5555 5555 5555
6666 6666 6666
7777 7777 7777
--------------------

Page 2:


Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
------------------------------------------------------------
8888 8888 8888 8888 8888 8888 8888 8888
9999 9999 9999 9999 9999 9999 9999 9999
-----------------------------------------------------------

Col9 Col10 Col11
--------------------
8888 8888 8888
9999 9999 9999
--------------------


Pages will continue on for some time.

I would like to convert the files programmatically so that the columns are continuous, i.e. the final data set would look like a more typical CSV-style delimited file:

Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 Col11
------------------------------------------------------------------------------------
1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111
2222 2222 2222 2222 2222 2222 2222 2222 2222 2222 2222
3333 3333 3333 3333 3333 3333 3333 3333 3333 3333 3333
4444 4444 4444 4444 4444 4444 4444 4444 4444 4444 4444
5555 5555 5555 5555 5555 5555 5555 5555 5555 5555 5555
6666 6666 6666 6666 6666 6666 6666 6666 6666 6666 6666
7777 7777 7777 7777 7777 7777 7777 7777 7777 7777 7777
8888 8888 8888 8888 8888 8888 8888 8888 8888 8888 8888
9999 9999 9999 9999 9999 9999 9999 9999 9999 9999 9999
-------------------------------------------------------------------------------------


I'm unsure exactly where to start here. Can I use something like AWK for this, or some sort of regular expression? Any help with a starting point would be appreciated.

Answer

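I suggest doing this with the csplit and paste commands: csplit can split each file into its separate blocks, and paste can then join the matching blocks side by side.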