Protaeus - 5 months ago
Python Question

Concatenating files with matching string in middle of filename

My goal is to concatenate files in a folder based on a string in the middle of the filename, ideally using Python or Bash. To simplify the question, here is an example:

  • P16C-X128-22MB-LL_merged_trimmed.fastq

  • P16C-X128-27MB-LR_merged_trimmed.fastq

  • P16C-X1324-14DL-UL_merged_trimmed.fastq

  • P16C-X1324-21DL-LL_merged_trimmed.fastq

I would like to concatenate based on the value after the first dash but before the second (e.g. X128 or X1324), so that in this example I am left with two additional files containing the concatenated contents of the individual files:

  • P16C-X128-Concat.fastq (concat of 2 files with X128)

  • P16C-X1324-Concat.fastq (concat of 2 files with X1324)

Any help would be appreciated.


You can use open to read and write (create) files, os.listdir to list all files (and directories) in a given directory, and re to match the file names as needed.

Use a dictionary to store contents by filename prefix (the file's name up to, but not including, the second hyphen, e.g. P16C-X128) and concatenate the contents of files that share a prefix.
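
To check the grouping key before touching any files, here is a minimal sketch that applies the same kind of prefix pattern to the four example names from the question. The helper name prefix_of is made up for illustration, and the pattern assumes neither of the first two fields contains a hyphen.

```python
import re
from collections import defaultdict

# Same idea as the answer's pattern: capture everything before the second hyphen.
PREFIX_RE = re.compile(r"^([^-]+-[^-]+)")


def prefix_of(file_name):
    """Hypothetical helper: return the grouping prefix, or None if there is none."""
    match = PREFIX_RE.match(file_name)
    return match.group(1) if match else None


names = [
    "P16C-X128-22MB-LL_merged_trimmed.fastq",
    "P16C-X128-27MB-LR_merged_trimmed.fastq",
    "P16C-X1324-14DL-UL_merged_trimmed.fastq",
    "P16C-X1324-21DL-LL_merged_trimmed.fastq",
]

# Map each prefix to the list of file names that share it.
groups = defaultdict(list)
for name in names:
    groups[prefix_of(name)].append(name)

for prefix, members in sorted(groups.items()):
    print(prefix, len(members))
# P16C-X128 2
# P16C-X1324 2
```

A name without any hyphen gives None here, so such files can be skipped rather than crashing the loop.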

import os
import re

contents = {}
file_extension = "fastq"

# Get all files and directories that are in current working directory
for file_name in os.listdir('./'):

    # Use '.' so it doesn't match directories
    if file_name.endswith('.' + file_extension):

        # Match the first 2 hyphen-separated values from file name
        prefix_match = re.match(r"^([^-]+-[^-]+)", file_name)
        if prefix_match is None:
            continue  # no hyphen-separated prefix, so nothing to group on
        file_prefix = prefix_match.group(1)

        # Read the file and concatenate contents with previous contents
        contents[file_prefix] = contents.get(file_prefix, '')
        with open(file_name, 'r') as the_file:
            contents[file_prefix] += the_file.read() + '\n'

# Create new file for each file id and write contents to it
for file_prefix in contents:
    file_contents = contents[file_prefix]
    with open(file_prefix + '-Concat.' + file_extension, 'w') as the_file:
        the_file.write(file_contents)
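
FASTQ files can be large, so holding every file's contents in a dictionary may not be practical. As a possible variation (not the code above, just a sketch under a few assumptions), the version below groups the file names first and then streams each file into its output with shutil.copyfileobj. The function name concat_by_prefix and its parameters are hypothetical. It assumes every input already ends with a newline, so unlike the code above it adds no extra '\n' between files, and it skips existing "-Concat" outputs so that a second run does not fold them back in.

```python
import os
import re
import shutil
from collections import defaultdict

PREFIX_RE = re.compile(r"^([^-]+-[^-]+)")


def concat_by_prefix(directory=".", extension="fastq", marker="-Concat"):
    """Hypothetical helper: stream files sharing a prefix into one output each."""
    out_suffix = marker + "." + extension
    groups = defaultdict(list)

    # Sorted so the order inside each output is predictable.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # ignore directories
        if not name.endswith("." + extension) or name.endswith(out_suffix):
            continue  # ignore other file types and outputs of an earlier run
        match = PREFIX_RE.match(name)
        if match:
            groups[match.group(1)].append(path)

    written = []
    for prefix, paths in groups.items():
        out_path = os.path.join(directory, prefix + out_suffix)
        with open(out_path, "wb") as out_file:
            for path in paths:
                with open(path, "rb") as in_file:
                    # Copies in chunks instead of reading the whole file at once.
                    shutil.copyfileobj(in_file, out_file)
        written.append(out_path)
    return written
```

Calling concat_by_prefix(".") in the folder from the question should produce P16C-X128-Concat.fastq and P16C-X1324-Concat.fastq and return their paths.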