yannenkar - 1 month ago
Python Question

Google Autocomplete Script Truncates Results Based on Input String in Python 2.7.12

Before I get into details, note that I'm not a programmer. I'm just learning Python and starting to get a handle on things. Now to the issue at hand:

I adapted an autocomplete string generator to add a list of terms to a question. It seems to be almost working, but I'm getting hung up on some formatting issues in the output file.

Here's the adapted code (it's overkill, but I'm looking to get something working before improving it).

import urllib, urllib2
import json
import time
import codecs

inFile = 'keywordFile.txt'
sep = ','
keywordField = 'keyword'
outFile = 'autoCompFile.txt'

google_endpoint = 'http://google.com/complete/search?output=firefox&q='

def find_index(fieldname, inFile):
    with open(inFile, 'r') as f:
        header = f.readline().rstrip().split(sep)
        i = 0
        for i in range(0, len(header)):
            if header[i] == fieldname:
                return i
                break
        else:
            return -1

def build_phrase(keyword):
    phrase = u'why did'
    return u'%s %s' % (phrase, keyword)

def query_google(phrase):
    url = '%s%s' % (google_endpoint, urllib.quote_plus(phrase))
    data = urllib2.urlopen(url)
    data = json.load(data)
    results = [result.replace(phrase.lower(), '') for result in data[1]]
    return results

kwIndex = find_index(keywordField, inFile)

with codecs.open(inFile, 'r', 'utf-8') as f:

    with codecs.open(outFile, 'w', 'utf-8') as f_out:
        f_out.write('keyword, autocomplete phrase\n')

        data = f.readlines()
        for record in data[1:]:
            time.sleep(0.3)

            record = record.rstrip()
            items = record.split(sep)
            kw = items[kwIndex]

            phrase = build_phrase(kw)
            results = query_google(phrase)
            if len(results) > 0:
                for result in results:
                    f_out.write('%s, %s, %s\n' % (kw, phrase, result))
            else:
                f_out.write('%s, %s\n' % (kw, phrase))


Something in the code is causing the "result" to truncate if it includes the exact wording of the built phrase. Example output:


  • bureaucracy, why did bureaucracy, develop in early governments

  • bureaucracy, why did bureaucracy, become a branch of government

  • bureaucracy, why did bureaucracy, grow in the 20th century

  • bureaucracy, why did bureaucracy, why did a bureaucracy develop in egypt

  • bureaucracy, why did bureaucracy, why did weber consider bureaucracy ideal

  • bureaucracy, why did bureaucracy, why did the federal bureaucracy grown

  • bureaucracy, why did bureaucracy, why did weber study bureaucracy

  • bureaucracy, why did bureaucracy, why did max weber fear bureaucracy



Ideally, I would like to get the keyword and a full results string without any kind of truncation. So I want both cases to look like this:


  • bureaucracy, why did bureaucracy, why did bureaucracy become a branch of government?

  • bureaucracy, why did bureaucracy, why did a bureaucracy develop in egypt



Thanks in advance!

Answer

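Looks like you're removing the phrase from your results, so your code is doing exactly what you wrote it to do. In query_google, result.replace(phrase.lower(), '') deletes every occurrence of the built phrase from each suggestion, which is why only the suggestions containing the exact phrase come out truncated.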

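Replace this line:

results = [result.replace(phrase.lower(), '') for result in data[1]]

with one that keeps the suggestions untouched (simply deleting the line would leave results undefined at the return statement):

results = data[1]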
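To see why only some of the lines were affected, here is a small offline sketch. It makes no request to Google. The two suggestion strings are assumptions reconstructed from the output shown in the question, and the names suggestions, stripped and untouched are made up for the illustration.

```python
# Offline illustration only: no network request is made.
phrase = u'why did bureaucracy'

# Hypothetical suggestions, reconstructed from the output in the question.
suggestions = [
    u'why did bureaucracy develop in early governments',  # contains the built phrase
    u'why did a bureaucracy develop in egypt',            # does not contain it
]

# Behaviour of the original list comprehension: the built phrase is
# deleted from every suggestion that contains it.
stripped = [s.replace(phrase.lower(), u'') for s in suggestions]

# Behaviour after the fix: suggestions are kept exactly as returned.
untouched = list(suggestions)

print(stripped)
print(untouched)
```

The first suggestion loses its leading "why did bureaucracy" under the original comprehension, while the second passes through unchanged because the exact phrase never appears in it. That matches the mixed output in the question.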