TobyMcCann TobyMcCann - 7 months ago 24
Python Question

Removing text from field in Mongodb with Python

I have a CSV file that has a field/column with a comma (","). I load this CSV into mongodb for data manipulation. I would like to strip all text from the comma to the right, leaving only the text to the left of the comma.

What is the most efficient method of accomplishing this task? In my mongodb csv import script (I utilize pandas)? Afterward when the data is already in MongoDB? Honestly, I'm new to programming and would like to know how to do it in either scenario, but I would like to see a solution for which is most efficient.

Here's my csv to python import script:

#!/usr/bin/env python
import sys
import os
import pandas as pd
import pymongo
import json

def import_content(filepath):
mng_client = pymongo.MongoClient('localhost', 27017)
mng_db = mng_client['swx_inv']
collection_name = 'device.switch'
db_cm = mng_db[collection_name]
cdir = os.path.dirname(__file__)
file_res = os.path.join(cdir, filepath)

data = pd.read_csv(file_res, skiprows=2, skip_footer=1)
data_json = json.loads(data.to_json(orient='records'))
db_cm.remove()
db_cm.insert(data_json)

if __name__ == "__main__":
filepath = '/vagrant/data/DeviceInventory-Category.Switch.csv'
import_content(filepath)


Here are the top three rows of the CSV for reference. I'm trying to alter the last field, "OS Image":

Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,"IOS 12.2(33)SXI9, s72033_rp-ADVENTERPRISEK9_WAN-M"
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,"IOS 12.2(53)SG8, cat4500-IPBASEK9-M"
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,"IOS 12.2(55)SE5, C3560-IPBASEK9-M"


EDIT: I found a method to do what I needed via pandas prior to uploading to the mongoDB collection. I have to do this twice, as the save column data uses two different delimiters and a regex would not work properly:

# Use pandas to read CSV, skipping top 2 lines & footer line from
# CSV export. Set column data to string type.
data = pd.read_csv(
file_res, index_col=False, skiprows=2,
skip_footer=1, dtype={'Device': str, 'Serial Number': str,
'Realm': str, 'Vendor': str, 'Model': str,
'OS Image': str}
)
# Drop rows where Serial Number is empty
data = data.dropna(subset=['Serial Number'])

# Split the OS Image column by "," and ";" to remove extraneous data
data['OS Image'].update(data['OS Image'].apply(
lambda x: x.split(",")[0] if len(x.split()) > 1 else None)
)
data['OS Image'].update(data['OS Image'].apply(
lambda x: x.split(";")[0] if len(x.split()) > 1 else None)
)

Answer
import csv

s='''Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,"IOS 12.2(33)SXI9, s72033_rp-ADVENTERPRISEK9_WAN-M"
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,"IOS 12.2(53)SG8, cat4500-IPBASEK9-M"
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,"IOS 12.2(55)SE5, C3560-IPBASEK9-M"'''

print("\n".join([','.join(row[:5])+","+str(row[5].split(",")[0]) for row in csv.reader(s.split("\n"))]))

Converting list comprehension into loops for more readability:

newtext=""
for row in csv.reader(s.split("\n")):
    newtext+=','.join(row[:5])+","+str(row[5].split(",")[0])+"\n"
print(newtext)

Output:

Device,Serial Number,Realm,Vendor,Model,OS Image
ABBNWX0100,SMG3453ESDN,BlAH BLAH,Cisco,WS-C6509-E,IOS 12.2(33)SXI9
ABBNWX0101,SDG127343S0,BLAH BLAH,Cisco,WS-C4506-E,IOS 12.2(53)SG8
ABBNWX0102,TREFDSFY1KK,BLAH BLAH,Cisco,WS-C3560V2-48PS-S,IOS 12.2(55)SE5

https://ideone.com/FMJCrO

For a file you will have to use

with open(fname) as f:
    content = f.readlines()

content will contain a list of lines in the file and then use csv.reader(content)