this post was submitted on 04 Apr 2026
605 points (99.7% liked)

Science Memes


I like to raw dog my data exports

[–] Saganaki@lemmy.zip 46 points 1 week ago (1 children)

“Wait, why is this not a CSV file?”

[–] Jason2357@lemmy.ca 31 points 1 week ago (5 children)

God I hate csv with the fire of a thousand suns.

Contractors never seem to know how to write them correctly. Last year, one even provided “csv”s that were just Oracle error messages. lol. Another told me their system could not quote string columns, escape commas, or use anything other than commas as the separator, so rows had unpredictable numbers of commas whenever the actual data contained commas. Total nightmare. And so much of my data has special character issues because somewhere in the pipeline a text encoding was wrong, and there is exactly one mangled character in 5 million lines for me to find.

Give me the data as close to the source as you can. If it is a database, then a database dump or access to a clone of your database is the best option by far. I don’t care how obscure your shit is, I'll do the conversion myself.

For intermediate data, something like parquet or language specific formats like Rdata or pickle files. Maaaaybe very carefully created csv files for archival purposes, but even then, I think parquet is safe for the long haul nowadays.
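For example, a pickle round-trip for intermediate data (a minimal sketch with invented values) carries commas, quotes, and newlines with no escaping rules to get wrong:

```python
import pickle

# Intermediate data containing characters that routinely break naive CSV handling.
rows = [{"name": 'He said "hi, there"', "note": "line one\nline two", "value": 2500}]

# Round-trip through pickle: structure and types survive exactly.
blob = pickle.dumps(rows)
restored = pickle.loads(blob)
assert restored == rows
```

Parquet gives the same fidelity plus column types and compression, via a library like pyarrow.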

[–] kernelle@lemmy.dbzer0.com 15 points 1 week ago (1 children)

I can't tell you how many scripts I've written to format poorly made CSV files

[–] harmbugler@piefed.social 5 points 1 week ago (1 children)

The essence of data science

[–] vithigar@lemmy.ca 7 points 1 week ago (1 children)

After many years of being a developer I've come to the conclusion that the single strongest indicator of a person's competence is how they handle CSV when asked to produce or consume it.

[–] Jason2357@lemmy.ca 1 points 6 days ago

I usually handle them by using an extremely well-established library where someone else has spent the requisite years crying over every stupid edge case of csv reading. Rolling your own csv reader is a bit like rolling your own encryption. Until someone hands you a file that rejects all sanity and you start fking with regex. Lol.

[–] GTG3000@programming.dev 4 points 1 week ago

Reminds me of writing my own csv parser that implemented escapes properly. The one everyone else went with was, of course, written with regex, so it was faster... but it broke if there were escaped newlines.
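A quoted field with an embedded newline is legal CSV, which is exactly where line-based regex parsers fall over. A quick illustration with Python's stdlib csv module (invented data):

```python
import csv
import io

# One header row plus one record whose quoted field contains a newline.
data = 'id,comment\r\n1,"first line\nsecond line"\r\n'

# Naive line-splitting sees three "rows"; a real parser sees two records.
naive_rows = data.strip().split("\n")
rows = list(csv.reader(io.StringIO(data)))

assert len(naive_rows) == 3
assert rows == [["id", "comment"], ["1", "first line\nsecond line"]]
```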

[–] espurr@sopuli.xyz 3 points 1 week ago (5 children)

What delimiter should I be using instead of commas?

[–] Jason2357@lemmy.ca 1 points 6 days ago (1 children)

The delimiter isn't really the issue. It's that there are lots and lots of weird edge cases that break reading csvs. If you use commas, at minimum, you need to escape commas in the data, or quote strings that might contain commas... but now you have to deal with the possibility of a quote character or your escape character in the data.

Then you have the fact that csvs can be written with so many different character encodings, mangling special characters where they occur.

Aaand then you have all the issues that come with the lack of metadata - good formats will at least tell you the type of data in each column so you don't have to guess.

Let's see, it's also really annoying to include any binary data in a csv, there's no redundancy or parity check to catch corrupted data, and they aren't compressed, so you need to tack on compression if you want efficient storage, but that means you always have to read the whole csv file for any task.

Oh, that brings me to the joys of modern columnar formats, where you can read selected columns super fast without reading the whole file.

Oh god, I really kept going there. Sorry. It's been a year.
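Two of those traps in a few lines, using Python's stdlib csv module (values invented):

```python
import csv
import io

# A field containing the quote character and a comma: the writer must
# quote the whole field and double the embedded quotes.
buf = io.StringIO()
csv.writer(buf).writerow(['say "hi", ok?', "plain"])
assert buf.getvalue() == '"say ""hi"", ok?",plain\r\n'

# Encoding mismatch: UTF-8 bytes decoded as Latin-1 mangle one character.
text = "naïve"
mangled = text.encode("utf-8").decode("latin-1")
assert mangled == "naÃ¯ve"
```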

[–] espurr@sopuli.xyz 1 points 4 days ago

No, that was really cool. What's an example of a modern format? CSV sounds kinda not based.

[–] rustydrd@sh.itjust.works 12 points 1 week ago

🤪 as a delimiter

🥦 for end of line

[–] emergencyfood@sh.itjust.works 5 points 1 week ago

Use comma for delimiter, and escape any comma in the data by enclosing that entry in quotes.

Data: 225 | 2,500 | 450

CSV: 225,"2,500",450
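Python's stdlib csv writer applies exactly this rule by default (QUOTE_MINIMAL):

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["225", "2,500", "450"])
# Only the field that contains a comma gets quoted.
assert buf.getvalue() == '225,"2,500",450\r\n'
```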

[–] death_to_carrots@feddit.org 4 points 1 week ago

Semi-colons. Tabulators. Something not in the actual strings. Or however the Python csv module formats it.
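With Python's csv module, switching the separator is one argument (a minimal sketch with invented values):

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf, delimiter=";").writerow(["225", "2,500", "450"])
# With a semicolon separator, the embedded comma needs no quoting at all.
assert buf.getvalue() == "225;2,500;450\r\n"
```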

[–] Jason2357@lemmy.ca 3 points 1 week ago (1 children)

P.s. in the above quagmire, the only solution is to keep only the most important un-clean column per csv and make it the last column in the file, so you have a predictable number of columns. If you need more, write separate csvs. Computers are stupid.

[–] deegeese@sopuli.xyz 4 points 1 week ago (1 children)

If you could choose the column order, you could choose a better format, or at least escape correctly.

[–] Jason2357@lemmy.ca 2 points 1 week ago

It was some sort of weird database frontend the contractor used. It was very limited.

[–] harmbugler@piefed.social 28 points 1 week ago

But, please, not wriggling.

[–] Berengaria_of_Navarre@lemmy.world 9 points 1 week ago (1 children)

Oh baby I like it raw🎵

Oh baby I like it raaAAWWWW 🎵

[–] fossilesque@mander.xyz 8 points 1 week ago

I'm glad I wasn't the only one lol.

[–] Chakravanti@monero.town -1 points 1 week ago

I'm not sure you understand what Golem is...