25 April 2019

An Elephant In The Room


 

We all have certain biases, even when we're talking about the same thing, like Data Science. I know, the definition of this term is usually vague, and although it has become a lot more stable in recent years, everyone still has a unique view on what matters most and which part of the pipeline to focus on. This goes for me as well: I'm totally biased towards the engineering side of things, so I may well see a problem differently than mathematicians or business people do. That's not a bad thing at all; I would say it's a great thing, because this way we complement each other and work as a team.

 

So, what's wrong with Hadoop? You know, MapReduce revolutionized the world of analytics, and the surrounding infrastructure became a de facto standard for storing corporate data, letting you run SQL over it, or even use Python and other convenient tools.

 

It seems to me that even nowadays it usually takes a lot of time to just go and solve some task on Hadoop, even a simple one. Why? Because you will inevitably end up writing MapReduce instead of your own business logic, trying to fit your algorithm into the paradigm instead of relying on it to do the right thing. Besides, every intermediate result is written to disk and sent across the network, which drastically decreases performance.
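
To make it concrete, here's roughly what even a trivial word count looks like once you squeeze it into the mapper/reducer shape, written Hadoop Streaming style in Python (the file names and the task itself are just illustrative); note how little of it is actual business logic:

# mapper.py - Hadoop Streaming mapper: read raw lines from stdin, emit "word<TAB>1"
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py - separate script; receives lines grouped and sorted by key, sums the counts per word
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{total}")
        total = 0
    current_word = word
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")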

 

Using Python instead of Java? Great, now you may get even worse performance, and you'll probably need to bug your operations team every time you want to try, say, a new Pandas version on the cluster. It's not really a problem with Python; it's a problem with several layers of abstraction stacked on top of each other, trying really hard to be both flexible and fast, and in the end kind of failing at both.

 

By the way, have you seen this article: https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html ? Just wanted to share it; it kind of blew my mind when I saw it for the first time, and I've since switched some of our workflows, even in production, to this approach, and it works great. Also, no, I'm not the only person who is able to support it.
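
The original piece uses plain shell pipelines, but the same spirit carries over to a few lines of Python: one streaming pass over a log file on a single machine, no cluster involved. A rough sketch, assuming a common-log-format access.log where the status code is the ninth whitespace-separated field:

# count HTTP status codes from a webserver access log in one streaming pass
from collections import Counter

counts = Counter()
with open("access.log") as log:          # hypothetical file name
    for line in log:
        fields = line.split()
        if len(fields) > 8:
            counts[fields[8]] += 1       # status code position assumed from common log format
for status, n in counts.most_common():
    print(status, n)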

 

Okay, Hadoop is kind of slow. That's something you accept when you start dealing with heavy batch processing, but it also widens the gap between the moment the data arrives and the actual result, like an ML model trained on that data. Are you processing your webserver's logs daily? Hourly?

 

I personally really like the idea of a data bus. You may have a queue, like Apache Kafka or RabbitMQ, which stores all your data and serves as the main source for everything in your pipelines. Different topics hold different data streams, you get some real-time analytics on top, neat! Would you really still cut it into pieces and process them with Hadoop instead of feeding real-time models?
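
A minimal sketch of what that looks like with the kafka-python package, assuming a broker on localhost:9092; the topic name and payload are purely illustrative:

from kafka import KafkaProducer, KafkaConsumer
import json

# produce an event onto the bus
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clicks", {"user_id": 42, "url": "/pricing"})   # illustrative topic and payload
producer.flush()

# consume the same stream elsewhere, e.g. to feed a real-time model
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)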

 

What did you say, SQL? But do you need full-text search, and does the whole storage need to really scale? It can get tricky. That's why I would say you can at least try NoSQL, especially if you don't need a lot of relations. Don't get me wrong, PostgreSQL is an amazing database, but Elasticsearch is far better for search, and Neo4j is a lot better for storing graphs. Also, you'll probably want to integrate it with your cool fancy pipeline at some point; good idea. Hazelcast, Apache Ignite, Aerospike, Cassandra - just pick one of the many options for your needs.
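
For instance, a full-text query against Elasticsearch is just a few lines with the official Python client (a hedged sketch: the index name and document are made up, and exact argument names vary between client versions):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])    # assumed local node

# index a document, then run a full-text match query over it
es.index(index="articles", id=1, body={"title": "An Elephant In The Room", "text": "..."})
es.indices.refresh(index="articles")
result = es.search(index="articles", body={"query": {"match": {"text": "elephant"}}})
print(result["hits"]["hits"])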

 

Ahh, you may still want SQL. Fair point. If you're REALLY sure you need it, there are not only Hive and Impala, but also Spark SQL and Presto, for example. Presto integrates with a ton of possible data sources through plugins, and AWS already has a service for running it on top of S3, called Amazon Athena. Moreover, Apache Zeppelin has integrations with both of them, allowing your data team to run flexible SQL queries over the cloud.
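
For a feel of Spark SQL, here is a minimal PySpark sketch; the S3 path and the schema of the events are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-sql").getOrCreate()

events = spark.read.json("s3a://my-bucket/events/")      # illustrative path
events.createOrReplaceTempView("events")                 # expose it to plain SQL

daily = spark.sql("""
    SELECT date(ts) AS day, count(*) AS hits
    FROM events
    GROUP BY date(ts)
    ORDER BY day
""")
daily.show()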

 

Let's also take a look at the CI/CD workflow. Developers submit code to a repository, the whole thing is tested automatically, then packaged into OS packages to be deployed on some machines. There's a good trend of people moving to containers instead, though. Need to spawn several different environments for testing? Containers. Want to deploy to both Ubuntu and CentOS? Containers. Want HA, close-to-zero-downtime upgrades and horizontal scaling? Containers.

 

But containers are just wrappers. They allow you to "isolate" the required environment and create a somewhat independent sandbox for an application, but you will probably want more. What about flexible networking for these containers? Or automatic management of the number of replicas? Or integration with distributed file systems like GlusterFS or Ceph, which you may already have?

 

Not so long ago every company was solving these things for itself, or just using VMs instead of containers, for example with OpenStack. Now, thanks to Google and a lot of other cool people, we have Kubernetes, which covers most of it. And it's a really valuable tool for flexible workflows like analytics pipelines.
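
As a small illustration, submitting a one-off batch task as a Kubernetes Job takes only a few lines with the official Python client (the image, command and job name here are made up):

from kubernetes import client, config

config.load_kube_config()            # or load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="train-model",
    image="python:3.7",              # illustrative image and command
    command=["python", "-c", "print('training step goes here')"],
)
template = client.V1PodTemplateSpec(
    spec=client.V1PodSpec(restart_policy="Never", containers=[container])
)
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-model"),
    spec=client.V1JobSpec(template=template, backoff_limit=2),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)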

 

As we found out while researching, there is one data-related thing which Kubernetes is not yet able to handle properly, and that's DAGs. By "DAG" I mean a directed acyclic graph of tasks with topological sorting and execution control. There are several open-source tools that let you build such things, like Airflow, Luigi or Mistral. But I would say that for typical purposes the first one is overloaded with features while lacking a declarative API for tasks, the second, on the contrary, is too simplistic, and the third is mostly focused on OpenStack-related things. This doesn't mean they won't work for you, of course, but in the end we went with a much simpler solution.
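
To show what I mean by a DAG of tasks, here is roughly how a tiny pipeline is declared in Airflow 1.x (the DAG name, task names and commands are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="nightly_pipeline",             # illustrative name
    start_date=datetime(2019, 4, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
train = BashOperator(task_id="train", bash_command="echo train", dag=dag)
report = BashOperator(task_id="report", bash_command="echo report", dag=dag)

extract >> train >> report                 # edges of the DAG: run in topological order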

 

You probably have Jenkins as part of your CI, right? Or some other similar tool. The thing is, the majority of such tools have recently implemented a declarative API for jobs (with some DSL, Groovy in Jenkins' case). And this DSL allows you to specify whether you want a certain stage to run in parallel with others or strictly in sequence. So Jenkins is currently executing Kubernetes jobs for us, and there is a new thing called Jenkins X that is integrated with Kubernetes even more tightly.

 

In the end, you know best what fits your needs and requirements. The only thing is, a data analytics workflow doesn't really fit the "stable Agile with milestones" approach that works wonders for software development. R&D is an astonishingly interesting process, so best of luck, and don't hesitate to notice the elephant in the room.