user2421173 - 4 months ago
Python Question

Pulling hostnames from single line of text with regex

I'm attempting to write a Python script to pull all the Google Cloud Compute subnets from their DNS records. More info about this here:

https://cloud.google.com/compute/docs/faq#where_can_i_find_short_product_name_ip_ranges

So far, I'm able to pull the TXT record listing the individual hostnames as a basestring with no problem.

import dns.resolver

# Set the resolver
my_resolver = dns.resolver.Resolver()
my_resolver.nameservers = ['8.8.8.8']

answer = my_resolver.query('_cloud-netblocks.googleusercontent.com', 'TXT')

for rdata in answer:
    for txt_string in rdata.strings:
        txt_record = txt_string


This leaves me with a string of

v=spf1 include:_cloud-netblocks1.googleusercontent.com include:_cloud-netblocks2.googleusercontent.com include:_cloud-netblocks3.googleusercontent.com include:_cloud-netblocks4.googleusercontent.com include:_cloud-netblocks5.googleusercontent.com ?all


What I would like to do is use re.match to extract the 5 hostnames from this initial response so I can do follow-up lookups, strip out the subnets, and put them into an array. All my efforts with regex thus far haven't been so... great... I was wondering if anyone could provide some guidance? Thanks!

Edit:

Here is the full script for anyone else with a need to collect all of Google's Cloud IPs.

import dns.resolver, re

# Set the resolver
my_resolver = dns.resolver.Resolver()
my_resolver.nameservers = ['8.8.8.8']

answer = my_resolver.query('_cloud-netblocks.googleusercontent.com', 'TXT')

for rdata in answer:
    for txt_string in rdata.strings:
        txt_record = txt_string

# Extract hostnames into array
hostnames = [x.split(":")[1] for x in txt_record.split() if ":" in x]
total_subnets = []

for host in hostnames:
    answer = my_resolver.query(host, 'TXT')

    for rdata in answer:
        for txt_string in rdata.strings:
            txt_record = txt_string

    # Collect the ip4: and ip6: entries from this host's TXT record
    ip4_subnets = re.findall(r'ip4:(\S+)', txt_record)
    ip6_subnets = re.findall(r'ip6:(\S+)', txt_record)

    for subnet in ip4_subnets:
        total_subnets.append(subnet)

    for subnet in ip6_subnets:
        total_subnets.append(subnet)

print(total_subnets)
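
A note for anyone running this under Python 3: there, rdata.strings holds bytes rather than str, and a single TXT record may arrive as several chunks, of which the loops above keep only the last. Below is a minimal sketch of a helper that decodes and joins the chunks, assuming the content is ASCII. The name join_txt_chunks is made up for illustration and is not part of dnspython.

```python
def join_txt_chunks(chunks):
    """Join the chunks of one TXT record into a single text string.

    `chunks` is assumed to be an iterable of bytes (or str) objects,
    such as rdata.strings. Hypothetical helper, not part of dnspython.
    """
    parts = []
    for chunk in chunks:
        if isinstance(chunk, bytes):
            # Assumption: SPF records are plain ASCII.
            chunk = chunk.decode("ascii")
        parts.append(chunk)
    # The chunks of one record are concatenated with no separator.
    return "".join(parts)


# Offline example with hand-written chunks (no DNS lookup involved).
record = join_txt_chunks([b"v=spf1 include:_cloud-netblocks1.",
                          b"googleusercontent.com ?all"])
print(record)
```

With this in place, txt_record = join_txt_chunks(rdata.strings) could stand in for the inner loop, and the later split and re.findall calls would receive a text string.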

Answer

You do not need a regex for this; use split twice with a list comprehension:

s = "v=spf1 include:_cloud-netblocks1.googleusercontent.com include:_cloud-netblocks2.googleusercontent.com include:_cloud-netblocks3.googleusercontent.com include:_cloud-netblocks4.googleusercontent.com include:_cloud-netblocks5.googleusercontent.com ?all"
print([x.split(":")[1] for x in s.split() if ":" in x])
# => ['_cloud-netblocks1.googleusercontent.com', 
#     '_cloud-netblocks2.googleusercontent.com',
#     '_cloud-netblocks3.googleusercontent.com',
#     '_cloud-netblocks4.googleusercontent.com',
#     '_cloud-netblocks5.googleusercontent.com']

Details:

  • s.split() - splits the string on whitespace
  • if ":" in x - keeps only those entries that contain a :
  • x.split(":")[1] - splits each remaining entry on : and takes the second chunk
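
One caveat worth sketching: x.split(":")[1] keeps only the text between the first and second colon. That is fine for the include: entries, but an ip6: entry contains further colons and would be cut short. Passing a maxsplit of 1 keeps everything after the first colon. The sample record below is invented for illustration (it uses documentation address ranges) and is not real Google data.

```python
# Made-up SPF-style record, for illustration only.
sample = ("v=spf1 include:_cloud-netblocks1.googleusercontent.com "
          "ip4:192.0.2.0/24 ip6:2001:db8::/32 ?all")

# Plain split: the IPv6 entry is truncated at its second colon.
truncated = [x.split(":")[1] for x in sample.split() if ":" in x]

# maxsplit=1: split on the first colon only and keep the rest intact.
values = [x.split(":", 1)[1] for x in sample.split() if ":" in x]

print(truncated)
print(values)
```

For the hostname extraction in the question both forms give the same result; the difference only matters once the same trick is reused on records that carry ip6: entries.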

Certainly, if you wish, you can use a regex:

include:(\S+)

This matches include: and captures one or more non-whitespace characters into Group 1. re.findall returns the list of captured values (re.findall(r'include:(\S+)', s)).
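
A quick sketch of the regex route, with the string s repeated so the snippet runs on its own. It also shows why re.match, mentioned in the question, is not the right tool here: it only tries to match at the very start of the string, which begins with v=spf1.

```python
import re

s = ("v=spf1 include:_cloud-netblocks1.googleusercontent.com "
     "include:_cloud-netblocks2.googleusercontent.com "
     "include:_cloud-netblocks3.googleusercontent.com "
     "include:_cloud-netblocks4.googleusercontent.com "
     "include:_cloud-netblocks5.googleusercontent.com ?all")

# re.findall scans the whole string and returns every Group 1 capture.
hostnames = re.findall(r'include:(\S+)', s)
print(hostnames)

# re.match anchors at position 0, so it finds nothing in this string.
first = re.match(r'include:(\S+)', s)
print(first)
```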