Parse CSV/TSV file in Haskell - Unicode Characters
I'm trying to parse a tab-delimited file using cassava/Data.Csv in Haskell. However, I run into problems if there are "strange" (Unicode) characters in my CSV file. I then get a parse error (endOfInput).
According to the command-line tool "file", my file has a "UTF-8 Unicode text" encoding. My Haskell code looks like this:
    {-# LANGUAGE ScopedTypeVariables #-}
    {-# LANGUAGE OverloadedStrings #-}

    import qualified Data.ByteString as C
    import qualified System.IO.UTF8 as U
    import qualified Data.ByteString.UTF8 as UB
    import qualified Data.ByteString.Lazy.Char8 as DL
    import qualified Codec.Binary.UTF8.String
    import qualified Data.Text.Lazy.Encoding as EL
    import qualified Data.ByteString.Lazy as L
    import Data.Text.Encoding as E

    -- Handle CSV / TSV files ...
    import Data.Csv
    import qualified Data.Vector as V
    import Data.Char -- ord

    csvFile :: FilePath
    csvFile = "myfile.txt"

    -- Set delimiter to \t (tabulator)
    myOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }

    main :: IO ()
    main = do
      csvData <- L.readFile csvFile
      -- Validate the input as UTF-8 before handing it to cassava
      case EL.decodeUtf8' csvData of
        Left err  -> print err
        Right dat ->
          case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
            Left err -> putStrLn err
            Right v  ->
              V.forM_ v $ \ (category :: String, user :: String, date :: String, time :: String, message :: String) ->
                print message
I tried using decodeUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. The endOfInput error persists.
My CSV file looks like this:
a - - - rt utilize " kenny" • hahahahahahahahaha. #emmen #brandstapel
a - - - uhm .. wat dan ook ????!!!! 👋
Or, more literally:
a\t-\t-\t-\trt utilize " kenny" • hahahahahahahahaha. #emmen #brandstapel
a\t-\t-\t-\tuhm .. wat dan ook ????!!!! 👋
The problem characters are 👋 and • (and in my complete file, there are many more similar characters). What can I do so that cassava / Data.Csv can read my file properly?
Edit: I've created the following preprocessor for escaping my text before decoding it with cassava (see tibbe's answer). There is probably room for improvement, but so far it works fine!
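    {-# LANGUAGE OverloadedStrings #-}

    import qualified Data.Text as T

    -- Wrap the whole input in quotes; together with `escaper` this
    -- turns every field into a quoted (escaped) field.
    preprocess :: T.Text -> T.Text
    preprocess txt = T.cons '\"' $ T.snoc escaped '\"'
      where escaped = T.concatMap escaper txt

    -- Close and reopen the quotes around each delimiter and newline,
    -- and double every embedded double quote.
    escaper :: Char -> T.Text
    escaper c
      | c == '\t' = "\"\t\""
      | c == '\n' = "\"\n\""
      | c == '\"' = "\"\""
      | otherwise = T.singleton c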
Per the cassava documentation:
Non-escaped fields may contain any characters except double-quotes, commas, carriage returns, and newlines.
Escaped fields may contain any characters (but double-quotes need to be escaped).
Since the last field in your first record contains double quotes, the field needs to be escaped with double quotes, and the double quotes inside it need to be escaped, so:
a - - - "rt utilize "" kenny"" • hahahahahahahahaha. #emmen #brandstapel"
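The quoting rule above can be sketched as a small helper. This is only an illustration using Data.Text from the text package; `quoteField` and `quoteRecord` are hypothetical names, not part of cassava, and the sketch assumes tab-separated records as in the question.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- | Wrap a field in double quotes and double every embedded double quote.
-- Hypothetical helper for illustration; not part of cassava.
quoteField :: T.Text -> T.Text
quoteField field =
  T.concat [quote, T.replace quote (T.append quote quote) field, quote]
  where
    quote = T.singleton '"'

-- | Quote every field, then join the fields with tabs to form one record.
quoteRecord :: [T.Text] -> T.Text
quoteRecord = T.intercalate (T.singleton '\t') . map quoteField

main :: IO ()
main = TIO.putStrLn (quoteRecord (map T.pack ["a", "-", "say \"hi\""]))
```

Quoting every field, rather than only the fields that contain double quotes, keeps the helper simple and is allowed by the rule quoted above, since escaped fields may contain any characters.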
This code works for me:
    {-# LANGUAGE OverloadedStrings #-}

    import Data.ByteString.Lazy
    import Data.Char
    import Data.Csv
    import Data.Text.Encoding
    import Data.Vector

    test :: Either String (Vector (String, String, String, String, String))
    test =
      decodeWith
        defaultDecodeOptions { decDelimiter = fromIntegral $ ord '\t' }
        NoHeader
        -- The literal is a Text (via OverloadedStrings), encoded to UTF-8 bytes
        (fromStrict $ encodeUtf8 "a\t-\t-\t-\t\"rt utilize \"\" kenny\"\" • hahahahahahahahaha. #emmen #brandstapel\"")
(Note that I had to make sure to use encodeUtf8 on a literal of type Text rather than just using a ByteString literal directly. The IsString instance for ByteStrings, which is what's used to convert the literal to a ByteString, truncates each Unicode code point.)
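The truncation can be seen in a minimal sketch that uses only the bytestring and text packages. It assumes that Data.ByteString.Char8.pack behaves like the IsString instance here, keeping only the low 8 bits of each Char; the names `bullet`, `truncated`, and `encoded` are illustrative.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- U+2022 is the bullet character from the sample data.
bullet :: String
bullet = "\x2022"

-- Char8.pack keeps only the low 8 bits of each Char (0x2022 becomes 0x22).
truncated :: B.ByteString
truncated = BC.pack bullet

-- encodeUtf8 produces the full three-byte UTF-8 sequence for U+2022.
encoded :: B.ByteString
encoded = TE.encodeUtf8 (T.pack bullet)

main :: IO ()
main = do
  print (B.unpack truncated) -- prints [34]
  print (B.unpack encoded)   -- prints [226,128,162]
```

Byte 34 (0x22) is the ASCII double quote, so a truncated bullet would not only lose information but also inject a stray quote character into the field.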