jgetner jgetner - 2 months ago 5
MySQL Question

c# loop duplicate records inserted

So im working on a proof of concept adventure and what im trying to do is loop through html and find all the links and save the link in mysql using C#. The finding the links part is working perfectly but when i go to save the links in a table they are be duplicated incrementally. Example below

Records in table:
1:1
1:2
1:2
1:3
1:3
1:3
1:4
1:4
1:4
1:4
etc...

pages table:

| page_id | page_url | date_added |
------------------------------------------
| 1 | "http://bla.com | |


links table:

| link_id | link_page_id | date_added |
---------------------------------------


code to get the links:

private static void getLinks(string contents, string url)
{
HtmlDocument html = new HtmlDocument();
html.LoadHtml(contents);

HtmlNodeCollection links = html.DocumentNode.SelectNodes("//a[@href]");

foreach(HtmlNode link in links)
{
foreach(HtmlAttribute attr in link.Attributes)
{
if(attr.Name == "href")
{
saveLinks(url, attr.Value);
}
}
}

}


Code to save the links:

private static void saveLinks(string url, string link)
{
try
{
Console.WriteLine(url + " - " + link);
MySqlCommand cmd = new MySqlCommand();
cmd.Connection = conn;
cmd.CommandText = @"INSERT INTO
sites.links(
links_page_id,
link
)
SELECT
p.page_id,
@link
FROM
sites.links l
RIGHT JOIN
sites.pages p
ON p.page_url = @url";
cmd.Prepare();

cmd.Parameters.AddWithValue("@url", url);
cmd.Parameters.AddWithValue("@link", link);
cmd.ExecuteNonQuery();
}

catch (MySqlException e)
{
Console.WriteLine("Failure: " + e.Message);
}
}


On my console output im getting the expected number of links with no duplicates and i have tested the query in mysql and all seemed good. Its probably something stupid that im overlooking but i cannot seem to find what im doing wrong. So any help would be awesome.

Answer

If you need to get page_id by its url, you needn't join with sites.links

          cmd.CommandText = @"INSERT INTO 
                                sites.links(
                                    links_page_id, 
                                    link
                                )
                            SELECT
                                p.page_id,
                                @link
                            FROM 
                                sites.pages p
                            WHERE p.page_url = @url";