What is wrong with "Big Data"?

Started by February 04, 2014 08:52 PM
18 comments, last by Glass_Knife 10 years, 8 months ago

I am looking at running some code on multiple machines so it will scale. Although I have never done this kind of thing before, I think "Surely I can Google it..."

Here's one article I found: https://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

The title is (click the link if you don't believe me):

Open Source Big Data for the Impatient, Part 1: Hadoop tutorial: Hello World with Java, Pig, Hive, Flume, Fuse, Oozie, and Sqoop with Informix, DB2, and MySQL.

I thought this was a joke. But it's not.

One: Feel free to make fun of the title of the article or explain to me why this isn't a laughing matter.

Two: Has anyone used a distributed file system and job scheduler of some kind that actually worked?

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532

As far as 'it just works' is concerned (not something cloud computing is known for), the best thing out there that I know of is picloud. But I'm not an expert; I just dabble.

"Big data" is a shitty term.

It's popular as a fad name for a nebulous concept that nobody agrees on. No two people seem to share a definition for what it means. The common elements seem to be overengineering, hot new platforms, and hero worship.

It's basically just "enterprise software" all over again.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

I did a terrible job of explaining what we're trying to do. We do not want to use cloud resources, but to set up our own internal network with our own servers. I had hoped that http://hadoop.apache.org/ would work, but like ApochPiQ suggested, it seems like an over-engineered nightmare.


The company I previously worked at used Hadoop to handle large quantities of application statistics, ranging from response times to user activity events. If your application is used by 10M+ users, you can expect at least 100 incoming requests per second from your app. Where do you want to store all this data so you can analyze it further? SQL?

If I had 10M users at the same time ... I'd hire some C programmers to design the most efficient system possible to handle my unique situation - and make it scalable.

I personally have a hard time trusting databases I do not design myself ... I had SQL corruption a few years ago that wiped everything off the servers.

I cannot remember the books I've read any more than the meals I have eaten; even so, they have made me.

~ Ralph Waldo Emerson

This might not be relevant to the question asked by the OP, but to reply to Shippou: what do you mean by SQL corruption? If you have a large database, surely you would have scheduled backups?


If I had 10M users at the same time ... I'd hire some C programmers to design the most efficient system possible to handle my unique situation - and make it scalable.

I personally have a hard time trusting databases I do not design myself ... I had SQL corruption a few years ago that wiped everything off the servers.

Unfortunately, for finance and banking this is pretty much out of the question. The reason is that the code has to be gone over with a fine-tooth comb by all kinds of governmental bodies to ensure that it is secure. They seem to think that standard C and C++ library features are insecure and force you to rewrite layers and layers of security wrapping around arrays and STL containers until they perform so slowly that they are virtually unusable. They also follow language specifications from 15 years ago.
If you use Java or a Java-based language then they just nod it through.

This might not be relevant to the question asked by the OP, but to reply to Shippou: what do you mean by SQL corruption? If you have a large database, surely you would have scheduled backups?

I'm not an expert, but if I had a server serving a lot of requests, even seconds to minutes of downtime isn't what you want. Corruption also means you can get unexpected errors, depending on the situation. And after corruption you still have to restore the state from the backup, which costs time and money. Another problem is that you can hardly make backups every second. ;)

The company I previously worked at used Hadoop to handle large quantities of application statistics, ranging from response times to user activity events. If your application is used by 10M+ users, you can expect at least 100 incoming requests per second from your app. Where do you want to store all this data so you can analyze it further? SQL?

I understand what you're saying. For a situation where you've got servers all over the country, and they're aggregating tons of data, and then you need to run searches on the data in a fast, parallel, scalable way, then the "Big Data" concept is the way to go. I just don't understand why everything I find seems broken, an alpha version, or so complicated that no one at work can get it to work. I felt the same way about the Java EE stuff years ago. A simple "Hello World" service takes dozens of files and hours to set up.
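
For what it's worth, the core idea behind the "aggregate, then search in parallel" pattern is small. Here is a rough sketch in Python, not how Hadoop actually does it: the shard contents, `count_events`, `merge_counts`, and `run_job` are all made up for illustration, and a local thread pool stands in for the cluster of machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical event logs, split into shards as if each lived on a different server.
SHARDS = [
    ["login", "click", "click", "logout"],
    ["login", "click", "purchase"],
    ["click", "click", "logout"],
]

def count_events(shard):
    """Map step: count event types within a single shard."""
    return Counter(shard)

def merge_counts(partials):
    """Reduce step: combine the per-shard counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def run_job(shards, workers=3):
    # Assumption: a thread pool stands in for the cluster. A real system
    # would run count_events on the machine that already holds each shard.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_events, shards))
    return merge_counts(partials)

totals = run_job(SHARDS)
print(totals.most_common())
```

The hard parts that the frameworks try to solve (moving the work to the data, retrying failed machines, scheduling jobs) are exactly what this sketch leaves out.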


This topic is closed to new replies.