I'm building an e-commerce price comparison tool in Python, somewhat similar to camelcamelcamel.com, both for fun and profit. I'm having difficulty matching identical items in the lists I gather from various websites for a given search term. I'm currently using cosine similarity, and I'm thinking of also using the Levenshtein algorithm, to compare the item titles against each other and find the identical items.
For example, I have the following items and their prices:
title: "Apple MacBook Air MMGF2HN/A 13.3-inch Laptop (Core i5/8GB/128GB/Mac OS X/Integrated Graphics)",
title: "Apple MacBook Air MMGF2HN/A 13.3-inch Laptop (Core i5/8GB/128GB/Mac OS X/Integrated Graphics) cover",
title: "Apple Macbook Air MMGF2HNA Notebook (Intel Core i5- 8GB RAM- 128GB SSD- 33.78 cm(13.3)- OS X El Capitan) (Silver)"
// product title and price
cosine(product_0 * product_1) = 0.973328526785
cosine(product_0 * product_2) = 0.50251890763
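For reference, here is a minimal sketch of how scores like these can be computed with a plain bag-of-words cosine similarity. The tokenization (a case-sensitive split on `\w+`) and the helper names `tokenize` and `cosine_similarity` are assumptions made for illustration, not necessarily what the tool actually does.

```python
import math
import re
from collections import Counter

def tokenize(title):
    # Assumption: split on runs of word characters, keeping the original case.
    return re.findall(r"\w+", title)

def cosine_similarity(title_a, title_b):
    # Represent each title as a bag-of-words count vector.
    counts_a = Counter(tokenize(title_a))
    counts_b = Counter(tokenize(title_b))
    # Dot product over the tokens the two titles share.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

product_0 = "Apple MacBook Air MMGF2HN/A 13.3-inch Laptop (Core i5/8GB/128GB/Mac OS X/Integrated Graphics)"
product_1 = product_0 + " cover"
product_2 = ("Apple Macbook Air MMGF2HNA Notebook (Intel Core i5- 8GB RAM- 128GB SSD- "
             "33.78 cm(13.3)- OS X El Capitan) (Silver)")

sim_01 = cosine_similarity(product_0, product_1)
sim_02 = cosine_similarity(product_0, product_2)
print(sim_01, sim_02)
```

Under these assumptions the scores come out at roughly 0.9733 and 0.5025, which suggests the figures above came from a similar case-sensitive word split. It also shows the problem: the laptop cover scores far higher than the same laptop listed under a differently worded title.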
You could train word2vec on the product titles. The resulting code would look something like this with the Python word2vec wrapper; with Gensim's models.word2vec it would be slightly different but similar:
indexes, metrics = model.cosine(normalized_phrase)
model.generate_response(indexes, metrics)
The generated response will be the title vectors sorted by descending cosine similarity.
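To make the ranking step concrete, here is a rough, library-free sketch of sorting candidate titles by descending cosine similarity. It assumes each title is represented by the average of its word vectors, which is one common choice and not necessarily what either library does internally. The `TOY_VECTORS` table is made up by hand and stands in for a trained word2vec model, and the helper names are hypothetical.

```python
import math

# Made-up 2-d vectors standing in for a trained word2vec model (illustration only).
TOY_VECTORS = {
    "macbook": [0.9, 0.1],
    "laptop": [0.8, 0.2],
    "notebook": [0.7, 0.3],
    "cover": [0.1, 0.9],
    "sleeve": [0.2, 0.8],
}

def title_vector(title, vectors):
    # Average the vectors of the words we know; return None if no word is known.
    known = [vectors[w] for w in title.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_titles(query, candidates, vectors):
    # Score every candidate against the query, then sort by descending similarity.
    query_vec = title_vector(query, vectors)
    scored = []
    for title in candidates:
        vec = title_vector(title, vectors)
        if query_vec is not None and vec is not None:
            scored.append((title, cosine(query_vec, vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_titles(
    "macbook laptop",
    ["laptop sleeve", "macbook notebook", "cover sleeve"],
    TOY_VECTORS,
)
for title, score in ranked:
    print(title, round(score, 3))
```

With real embeddings, the vectors would come from the trained model rather than a hand-written table; the sorting logic stays the same.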