In the world of big data, data transfer is a largely ignored topic. So much of the big data ecosystem assumes that your data is just "there". In reality, your data is coming from somewhere and ending up somewhere else.
I gave a talk last night at an event sponsored by TrueCar and addressed three points around this topic and ETL in general.
First, ETL problems are pervasive, and many people are tackling them without even knowing that they're tackling them. If you're building a business dashboard, you're doing ETL. If you're adding customer records programmatically into Salesforce, you're doing ETL. If you're correlating customer lifetime values to marketing attribution sources, you're doing ETL.
Second, ETL is inherently brittle and hard to test. I discussed some common failure cases, including source breakage (database connection timeouts), transform limitations (out-of-memory errors during computation), and loading issues (API connection failures). You have to expect that things are going to break, and you need strong expectations about exactly how they're going to break.
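To make the "expect breakage" point concrete, here's a minimal sketch (my illustration, not code from the talk) of wrapping a flaky extract step in retries with exponential backoff, so a transient source failure like a connection timeout doesn't kill the whole pipeline. The `flaky_extract` function is a stand-in for a real database read:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=0.1):
    """Call fn, retrying on failure with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: fail loudly rather than silently
            # back off exponentially, with jitter to avoid hammering the source
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Simulate a source that times out twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database connection timed out")
    return [{"id": 1, "value": 42}]

rows = with_retries(flaky_extract)
print(rows)  # the third attempt succeeds
```

The key design point is the final `raise`: after exhausting retries, the failure surfaces instead of being swallowed, which matches the expectation that pipelines will break in known, observable ways.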
Finally, I addressed some basic programming principles that maximize reliability of these processes. These are basic principles that should be applied to any software problem in general; I'd argue that they're even more important in the realm of data transfer.
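One such principle (my example here, not necessarily one from the slides) is idempotency: design the load step so that rerunning it after a partial failure leaves the destination in the same state, rather than duplicating rows. A minimal sketch using a keyed upsert:

```python
def load(records, store):
    """Idempotent load: upsert by primary key, so reruns don't duplicate rows."""
    for rec in records:
        store[rec["id"]] = rec  # keyed write; applying it twice is a no-op
    return store

batch = [{"id": 1, "total": 10}, {"id": 2, "total": 7}]
store = {}
load(batch, store)
load(batch, store)  # a retried or replayed run leaves the store unchanged
print(len(store))  # 2 records, not 4
```

Because the write is keyed rather than appended, retries (which the brittleness of ETL makes inevitable) are safe by construction.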
See the slides below for more detail - enjoy!