Wednesday, 15 July 2015

Parse CSV/TSV file in Haskell - Unicode Characters -




I'm trying to parse a tab-delimited file using cassava/Data.Csv in Haskell. However, I run into problems if there are "strange" (Unicode) characters in my CSV file: I get a parse error (endOfInput).

According to the command-line tool "file", my file has "UTF-8 Unicode text" encoding. My Haskell code looks like this:

    {-# LANGUAGE ScopedTypeVariables #-}
    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString as C
    import qualified System.IO.UTF8 as U
    import qualified Data.ByteString.UTF8 as UB
    import qualified Data.ByteString.Lazy.Char8 as DL
    import qualified Codec.Binary.UTF8.String
    import qualified Data.Text.Lazy.Encoding as EL
    import qualified Data.ByteString.Lazy as L
    import Data.Text.Encoding as E

    -- Handle CSV / TSV files ...
    import Data.Csv
    import qualified Data.Vector as V

    import Data.Char -- ord

    csvFile :: FilePath
    csvFile = "myfile.txt"

    -- Set the delimiter to \t (tabulator)
    myOptions = defaultDecodeOptions
      { decDelimiter = fromIntegral (ord '\t') }

    main :: IO ()
    main = do
      csvData <- L.readFile csvFile
      -- Validate that the input is UTF-8 before handing it to cassava
      case EL.decodeUtf8' csvData of
        Left err -> print err
        Right dat ->
          case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
            Left err -> putStrLn err
            Right v -> V.forM_ v $ \ ( category :: String
                                     , user :: String
                                     , date :: String
                                     , time :: String
                                     , message :: String ) ->
              print message

I tried using decodeUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However, the endOfInput error persists.

My CSV file looks like this:

a - - - rt utilize " kenny" • hahahahahahahahaha. #emmen #brandstapel
a - - - uhm .. wat dan ook ????!!!! 👋

Or, more literally:

a\t-\t-\t-\trt utilize " kenny" • hahahahahahahahaha. #emmen #brandstapel
a\t-\t-\t-\tuhm .. wat dan ook ????!!!! 👋

The problem characters are 👋 and • (and in my complete file, there are many more similar characters). What can I do so that cassava / Data.Csv can read my file properly?

Edit: I've created the following preprocessor for escaping my text before decoding it with cassava (see tibbe's answer). There is probably room for improvement, but so far it works fine!

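    {-# LANGUAGE OverloadedStrings #-}  -- needed for the Text literals below
    import qualified Data.Text as T

    -- Wrap the whole input in double quotes; the escaper closes and
    -- reopens the quotes around every tab and newline, so each field
    -- ends up as an escaped (quoted) field.
    preprocess :: T.Text -> T.Text
    preprocess txt = T.cons '\"' $ T.snoc escaped '\"'
      where escaped = T.concatMap escaper txt

    escaper :: Char -> T.Text
    escaper c
      | c == '\t' = "\"\t\""   -- close quote, delimiter, open quote
      | c == '\n' = "\"\n\""   -- close quote, record separator, open quote
      | c == '\"' = "\"\""     -- double an embedded double quote
      | otherwise = T.singleton c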

Per the cassava documentation:

Non-escaped fields may contain any characters except double-quotes, commas, carriage returns, and newlines.

Escaped fields may contain any characters (but double-quotes need to be escaped).

Since the last field in your first record contains double quotes, the field needs to be escaped (wrapped in double quotes), and the double quotes inside it need to be escaped by doubling them, like so:

a - - - "rt utilize "" kenny"" • hahahahahahahahaha. #emmen #brandstapel"
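The escaping rule applied to that record can be sketched as a small helper. This is only a minimal sketch using base: `quoteField` and `quoteTsvLine` are hypothetical names (they are not part of cassava), and the line helper assumes that the raw fields never contain a literal tab or newline.

```haskell
import Data.List (intercalate)

-- Hypothetical helper (not part of cassava): wrap a field in double
-- quotes and double every embedded double quote.
quoteField :: String -> String
quoteField field = "\"" ++ concatMap escape field ++ "\""
  where
    escape '"' = "\"\""
    escape c   = [c]

-- Hypothetical helper: split one raw line on tabs and quote each field.
-- Assumes the raw fields themselves never contain a tab or newline.
quoteTsvLine :: String -> String
quoteTsvLine = intercalate "\t" . map quoteField . splitOnTab
  where
    splitOnTab s = case break (== '\t') s of
      (field, [])       -> [field]
      (field, _ : rest) -> field : splitOnTab rest
```

Quoting every field, rather than only the ones that contain double quotes, keeps the helper simple; under the rules quoted above, an escaped field may contain any characters, so the extra quotes should be harmless.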

This code works for me:

    {-# LANGUAGE OverloadedStrings #-}  -- lets the literal below be a Text
    import Data.ByteString.Lazy
    import Data.Char
    import Data.Csv
    import Data.Text.Encoding
    import Data.Vector

    test :: Either String (Vector (String, String, String, String, String))
    test = decodeWith
        defaultDecodeOptions { decDelimiter = fromIntegral $ ord '\t' }
        NoHeader
        -- Encode a Text literal as UTF-8, then convert to a lazy ByteString
        (fromStrict $ encodeUtf8 "a\t-\t-\t-\t\"rt utilize \"\" kenny\"\" • hahahahahahahahaha. #emmen #brandstapel\"")

(Note that I had to make sure to use encodeUtf8 on a literal of type Text rather than using a ByteString literal directly. The IsString instance for ByteStrings, which is what's used to convert the literal to a ByteString, truncates each Unicode code point.)
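The truncation can be observed directly. The following is a minimal sketch, assuming the bytestring and text packages are available; it uses `Data.ByteString.Char8.pack`, which as far as I know truncates in the same way as the IsString instance, and compares it with `encodeUtf8`. The names `bullet`, `truncatedBytes` and `utf8Bytes` are made up for the illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- The bullet character U+2022, one of the problem characters above.
bullet :: String
bullet = "\x2022"

-- Char8.pack keeps only the low 8 bits of each code point.
truncatedBytes :: [Word8]
truncatedBytes = B.unpack (BC.pack bullet)

-- encodeUtf8 yields the full three-byte UTF-8 encoding.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (TE.encodeUtf8 (T.pack bullet))
```

Here `truncatedBytes` is `[0x22]` and `utf8Bytes` is `[0xE2, 0x80, 0xA2]`. Since 0x22 is the byte for a double quote, a truncated bullet can even introduce a stray quote into the input.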

csv haskell
