Intern87 Intern87 - 11 months ago 58 Question

convert to ASCII characters ssis

I have a file with a list of words, not too dissimilar to this, which i want to import into my MSSQL database;


I want to do a rudimentary conversion of the Polish characters into ASCII, so that they can be parsed through my DB.
I am using SSIS to import the data and therefore have I am using a script to alter the text on/before import.

I have tried;

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
Row.unsanitised1.Replace("Ą", "A")
Row.unsanitised1.Replace("ą", "a")

Row.unsanitised1.Replace("Ć", "C")
Row.unsanitised1.Replace("ć", "c")

Row.unsanitised1.Replace("Ę", "E")
Row.unsanitised1.Replace("ę", "e")

Row.unsanitised1.Replace("Ł", "L")
Row.unsanitised1.Replace("ł", "l")

Row.unsanitised1.Replace("Ń", "N")
Row.unsanitised1.Replace("ń", "n")

Row.unsanitised1.Replace("Ó", "O")
Row.unsanitised1.Replace("ó", "o")

Row.unsanitised1.Replace("Ś", "S")
Row.unsanitised1.Replace("ś", "s")

Row.unsanitised1.Replace("Ź", "Z")
Row.unsanitised1.Replace("ź", "z")

Row.unsanitised1.Replace("Ż", "Z")
Row.unsanitised1.Replace("ż", "z")

Row.sanitised = Row.unsanitised
Row.sanitised1 = Row.unsanitised1

End Sub

However, when the file is read in, the characters are not read properly, they come in more like "Ĺ‚" what am I doing wrong? Do I need to read them in as unicode?

The file and vbscript are saved in codepage 1200 to preserve the Polish characters.

Answer Source


So after a little break from this and looking at it again I've found the problem. I'm posting it here in case people come across this in the future and see that is is currently unsolved.

@Sandeep Rowat & Tom Blodget said in the comments, codepage 65001 (UTF-8) is required. However, the replace function was still not working for two of the characters in the list, Ó and ó.

I thought the import was confusing the characters and making them unrecognisable, this theory was tested by importing the characters and converting them to HEX ensuring they are the same baseline characters.

Ó appeared as HEX D3 on both import and code.

ó appeared as HEX F3 on both import and code.

Since they were the same on import and within the code, this led me to believe the String.Replace() function does not work as intended on these characters.

I have circumnavigated this issue by doing a remove and subsequently an insert to replace the characters;

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        var = Row.unsanitised1.Replace("Ą", "A")
        var.Replace("ą", "a")

        var.Replace("Ć", "C")
        var.Replace("ć", "c")

        var.Replace("Ę", "E")
        var.Replace("ę", "e")

        var.Replace("Ł", "L")
        var.Replace("ł", "l")

        var.Replace("Ń", "N")
        var.Replace("ń", "n")

        var.Replace("Ó", "O") 'does not work?!
        var.Replace("ó", "o") 'does not work?!

        i = var.LastIndexOf("Ó")
        Do While i >= 0
            var = var.Remove(i, 1) 'this works!
            var = var.Insert(i, "O") 'this works!
            i = var.LastIndexOf("Ó")

        i = var.LastIndexOf("ó")
        Do While i >= 0
            var = var.Remove(i, 1) 'this works!
            var = var.Insert(i, "o") 'this works!
            i = var.LastIndexOf("ó")

        var.Replace("Ś", "S")
        var.Replace("ś", "s")

        var.Replace("Ź", "Z")
        var.Replace("ź", "z")

        var.Replace("Ż", "Z")
        var.Replace("ż", "z")

        Row.sanitised = Row.unsanitised
        Row.sanitised1 = var

        i = Nothing

    End Sub

I hope nobody else gets stuck with this problem.