Most efficient way to parse a large .csv in python?
I have looked at other answers but I'm still not sure of the right way to do this. I have a number of big .csv files (could be a gigabyte each). I want to first get the column labels, since they are not the same in every file, and then, according to user preference, extract some of those columns based on criteria. Before I start the extraction part I did a simple test to see the fastest way to parse these files, and here is my code:
```python
import csv
import mmap
import time

def mmapusage():
    start = time.time()
    with open("csvsample.csv", "r+b") as f:
        # memory-map the input file; size 0 means the whole file
        mapinput = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        l = list()
        for s in iter(mapinput.readline, ""):
            l.append(s)
        print "list length:", len(l)
        #print "sample element:", l[1]
        mapinput.close()
    end = time.time()
    print "time for completion", end - start

def fileopenusage():
    start = time.time()
    fileinput = open("csvsample.csv")
    m = list()
    for s in fileinput:
        m.append(s)
    print "list length:", len(m)
    #print "sample element:", m[1]
    fileinput.close()
    end = time.time()
    print "time for completion", end - start

def readascsv():
    x = list()
    start = time.time()
    spamreader = csv.reader(open('csvsample.csv', 'rb'))
    for row in spamreader:
        x.append(row)
    print "list length:", len(x)
    #print "sample element:", x[1]
    end = time.time()
    print "time for completion", end - start
```
And the results:
```
=======================
populating a list with mmap
list length: 1181220
time for completion 0.592000007629
=======================
populating a list with fileopen
list length: 1181220
time for completion 0.833999872208
=======================
populating a list with the csv library
list length: 1181220
time for completion 5.06700015068
```
So it seems the csv library that people use a lot is slower than the others. Maybe it will prove faster later, when I start extracting data from the csv file, but I cannot be sure of that yet. Any suggestions and tips before I start implementing? Thanks a lot!
As pointed out several times, the first two methods do no actual string parsing; they just read a line at a time without extracting fields. I imagine the bulk of the speed difference seen with csv is due to that.
The csv module is invaluable if you include any textual data that may contain more of the 'standard' csv syntax than just commas, especially if you're reading an Excel format. If you've got lines like "1,2,3,4" you're fine with a simple split, but if you have lines like "1,2,'Hello, my name\'s fred'" you're going to go crazy trying to parse that without errors. csv will also transparently handle things like newlines in the middle of a quoted string. A simple for..in loop without csv is going to have trouble with that.
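To illustrate the quoting problem (a small sketch in Python 3, with a made-up sample line), `csv.reader` keeps an embedded comma inside its quoted field, while a naive split cuts the field in two:

```python
import csv
import io

# A made-up row whose third field contains a comma inside quotes
line = '1,2,"hello, my name\'s fred",4'

# Naive split breaks the quoted field into two pieces
naive = line.split(',')
print(len(naive))  # 5 fields -- wrong

# csv.reader respects the quoting
row = next(csv.reader(io.StringIO(line)))
print(len(row))    # 4 fields -- correct
print(row[2])      # hello, my name's fred
```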
The csv module has always worked fine for me reading Unicode strings if I use it like so:

```python
f = csv.reader(codecs.open(filename, 'rU'))
```
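As a side note for Python 3 (the snippet above is Python 2), `codecs.open` is no longer needed: the csv docs recommend plain `open()` with an explicit encoding and `newline=''`, so the reader can handle line endings inside quoted fields itself. A minimal sketch using a throwaway temporary file:

```python
import csv
import os
import tempfile

# Write a tiny sample file to read back (a stand-in for a real .csv)
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8', newline='') as tmp:
    tmp.write('a,b\n1,2\n')
    path = tmp.name

# Python 3 replacement for codecs.open(filename, 'rU'):
# newline='' lets csv.reader handle the line endings itself
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))

print(rows)  # [['a', 'b'], ['1', '2']]
os.remove(path)
```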
It is plenty robust for importing multi-thousand-line files with Unicode, quoted strings, newlines in the middle of quoted strings, lines with fields missing at the end, etc., all with reasonable read times. I'd try using it first and look for optimizations on top of it only if you really need the speed.
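For the column-extraction step the question describes, `csv.DictReader` pairs each row with the header labels, so user-selected columns can be pulled by name. A minimal sketch (Python 3, with hypothetical column names and in-memory data standing in for one of the big files):

```python
import csv
import io

# Hypothetical sample data with a header row of column labels
data = "name,age,city\nalice,30,london\nbob,25,paris\n"

wanted = ["name", "city"]  # columns the user asked for

# DictReader maps each row to the header labels automatically
reader = csv.DictReader(io.StringIO(data))
rows = [{k: r[k] for k in wanted} for r in reader]
print(rows)
# [{'name': 'alice', 'city': 'london'}, {'name': 'bob', 'city': 'paris'}]
```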
python csv