Samyak Jain - 10 months ago
Python Question

Flickr API returns duplicate photos while extracting all geotagged photos

I'm trying to extract all the geotagged photos from Flickr using the Flickr API. Here is the code:

import flickr_api
import urllib2
from flickr_api.api import flickr

flickr_api.set_keys(api_key = 'my_api_key', api_secret = 'my_api_secret')

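# Fetch pages 1..1699 and save each raw XML response to its own file.
for i in range(1, 1700):
    # NOTE: the method name was lost from this line of the post;
    # flickr.photos.search is assumed from the parameters used.
    photo_list = flickr.photos.search(api_key='my_api_key', has_geo=1, extras='description,license,geo,tags,machine_tags', per_page=250, page=i, min_upload_date='972518400', accuracy=12)
    with open('xmldata1/photodata' + str(i) + '.xml', 'w') as f:
        f.write(photo_list)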

This script gives me one XML file per page of results. Each XML file holds the data for 250 photos, and there are 1699 such files, so I get data for approximately 420,000 photos, with a lot of duplicates. After removing the duplicates, I was left with only 9022 unique images.
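The duplicate count can be checked with a short script along these lines. It is only a sketch: it assumes each saved file follows the usual Flickr REST layout, with one `photo` element per result carrying an `id` attribute, and the helper name `unique_photo_ids` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def unique_photo_ids(xml_documents):
    """Return the set of distinct photo ids found in an iterable of XML strings.

    Assumes every result is a <photo id="..."> element somewhere in the
    response tree (an assumption about the saved files, not verified here).
    """
    ids = set()
    for doc in xml_documents:
        root = ET.fromstring(doc)
        for photo in root.iter('photo'):
            photo_id = photo.get('id')
            if photo_id is not None:
                ids.add(photo_id)
    return ids
```

Reading each `xmldata1/photodata<i>.xml` file into a string and passing the strings to `unique_photo_ids` gives the number of unique photos as `len(...)` of the result.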

I have read here that it is safe to request up to 16 pages (4,000 images) per query to avoid duplicates.

I want to avoid duplicate images as much as possible, and I need 100,000+ unique geotagged images for GPS clustering.

What time lag should I insert between two instances of the query?
If I must consider another approach, please elaborate on it.

Let me know if you have any queries. Any help would be appreciated!

Answer Source

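Try using max_upload_date together with min_upload_date. Pick a time frame of a couple of days and search only for photos uploaded within that frame. Then keep shifting the frame forward, so that each query's min_upload_date and max_upload_date cover the next couple of days, until you have walked through the whole date range you care about.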
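A minimal sketch of that sliding time frame is below. It makes several assumptions: `search_page` is a hypothetical callable that you would write yourself to wrap the actual API call and return a list of dicts with an `'id'` key, the two-day window is just a starting point, and the 16-page cap is taken from the figure quoted in the question. Results are keyed by photo id, since consecutive windows share a boundary timestamp and could return the same photo twice.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def upload_windows(start_ts, end_ts, days=2):
    """Yield (min_upload_date, max_upload_date) pairs covering start_ts..end_ts."""
    step = days * SECONDS_PER_DAY
    lo = start_ts
    while lo < end_ts:
        hi = min(lo + step, end_ts)
        yield lo, hi
        lo = hi


def collect_unique(search_page, start_ts, end_ts, days=2, max_pages=16):
    """Query window by window and keep one record per photo id.

    search_page(min_ts, max_ts, page) is a hypothetical wrapper around the
    real search call; it must return a list of dicts that have an 'id' key.
    """
    seen = {}
    for lo, hi in upload_windows(start_ts, end_ts, days):
        for page in range(1, max_pages + 1):
            photos = search_page(lo, hi, page)
            if not photos:
                break  # no more results in this window
            for photo in photos:
                seen.setdefault(photo['id'], photo)
    return seen
```

If a window still fills all 16 pages, it probably holds more photos than one query can return reliably, so a shorter window (a smaller `days` value) would be the thing to try.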