+2 votes
The task is to process raw statistics data.
The queries are simple aggregates: SELECT SUM/AVG ... GROUP BY (AGE, SEX, DAY, SOURCE), usually with 10-20 columns in the GROUP BY. The aggregated data is written to a separate table, which is then queried with a WHERE clause over those same 10-20 parameters.
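To make the shape of the workload concrete, here is a minimal sketch of that aggregation in plain Python. The field names and sample data are invented for illustration only:

```python
from collections import defaultdict

# Hypothetical raw event rows; field names are assumptions.
events = [
    {"age": 25, "sex": "m", "day": "2015-01-01", "source": "web", "amount": 10},
    {"age": 25, "sex": "m", "day": "2015-01-01", "source": "web", "amount": 30},
    {"age": 40, "sex": "f", "day": "2015-01-01", "source": "app", "amount": 5},
]

# Group by the (AGE, SEX, DAY, SOURCE) key, as in the GROUP BY above.
groups = defaultdict(list)
for e in events:
    key = (e["age"], e["sex"], e["day"], e["source"])
    groups[key].append(e["amount"])

# One aggregated row per group, with SUM and AVG of the metric.
aggregated = {k: {"sum": sum(v), "avg": sum(v) / len(v)}
              for k, v in groups.items()}
```

The real system would persist `aggregated` to a separate table and filter it with a WHERE over the same key columns.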

Right now all of this runs on MongoDB (the aggregation framework), and I am not happy with the performance. (The indexes already fit in memory, so there is not much left to optimize in Mongo.)
Is there a database better tailored to tasks like this?
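For reference, the aggregate described above maps to a MongoDB pipeline like the following sketch. Field and collection names are assumptions; in practice the list would be passed to `collection.aggregate(pipeline)` via pymongo:

```python
# Sketch of the SUM/AVG GROUP BY as a MongoDB aggregation pipeline.
# Field names ($age, $sex, $day, $source, $amount) are assumed.
pipeline = [
    {"$group": {
        "_id": {"age": "$age", "sex": "$sex",
                "day": "$day", "source": "$source"},
        "total": {"$sum": "$amount"},
        "average": {"$avg": "$amount"},
    }},
    # $out writes the aggregated result into a separate collection,
    # mirroring the "separate table" in the question.
    {"$out": "aggregated_stats"},
]
```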

2 Answers

0 votes
Alternatively, you can use Impala, or Hive on Tez, on a Hadoop cluster. It scales as far as you need, and distributions like CDH or HDP are fairly easy to deploy.
If money and CPU are no problem, you can use Spark SQL on top of the same Hive.
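On any of those engines the aggregate becomes plain SQL. A hypothetical sketch of the query (table and column names are assumptions, shown here as a string for illustration):

```python
# Hypothetical Hive/Impala SQL for the aggregation from the question;
# raw_events, aggregated_stats and the column names are invented.
query = """
INSERT OVERWRITE TABLE aggregated_stats
SELECT age, sex, day, source,
       SUM(amount) AS total,
       AVG(amount) AS average
FROM raw_events
GROUP BY age, sex, day, source
"""
```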
+3 votes

From my own experience I can say that Elasticsearch handles this job very well. We currently run a small cluster holding 300+ GB of event statistics, and everything works very fast.

Here are a few links to help you avoid common mistakes when configuring the cluster.

I just hit the pitfall described in this article:
when configuring the index mapping, specify this option for fields that are not analyzed:
"doc_values" : true
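As a sketch, a mapping in the 1.x-era Elasticsearch style with `doc_values` enabled on a not-analyzed field might look like this (the type and field names are assumptions, written as a Python dict that would be sent with a put-mapping request):

```python
# Hypothetical Elasticsearch mapping fragment: "source" is stored as a
# not-analyzed string with on-disk doc_values, so sorting/aggregating
# on it does not load fielddata into heap memory.
mapping = {
    "event": {
        "properties": {
            "source": {
                "type": "string",
                "index": "not_analyzed",
                "doc_values": True,
            }
        }
    }
}
```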