
What is wrong with "Big Data"?

Started by February 04, 2014 08:52 PM
18 comments, last by Glass_Knife 10 years, 8 months ago

This might not be relevant to the question asked by the OP, but to reply to Shippou: what do you mean by SQL corruption? If you have a large database, surely you would have scheduled backups?

I'm not an expert, but if I had a server handling a lot of requests, even seconds to minutes of downtime wouldn't be acceptable. Corruption can also cause unexpected errors depending on the situation. And after corruption you still have to restore from the backup, which costs time and money. Another problem is that you can hardly make backups every second. ;)

And please don't worry about my OP. I asked this in the lounge to get the conversation going. I'm already lost, so more info is good!

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532

I think the hardest part of trying to use a new technology is figuring out what to call it. "Big Data" may not be the right term for what I'm trying to do. I think "Grid Computing" may be more of what I'm looking for.

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532


You may also want to read up on the differences between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). It mostly boils down to "store historical state in a separate database, only keep the current state in the high-traffic database." You may have already looked at this, but the PostgreSQL documentation has a few things to say on scalability: http://www.postgresql.org/docs/9.2/interactive/high-availability.html
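To make the OLTP/OLAP split concrete, here's a rough sketch of the archival idea: periodically move finished rows out of the hot table into a history table that reporting queries hit instead. This is only an illustration; the table and column names (orders, orders_history, status, closed_at) and the connection details are made up.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of an OLTP -> history archival job run on a schedule.
// Keeps the high-traffic table small; reporting/OLAP queries hit orders_history.
public class ArchiveJob {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/appdb", "app", "secret")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                // Copy closed rows older than 30 days into the history table...
                st.executeUpdate(
                    "INSERT INTO orders_history SELECT * FROM orders " +
                    "WHERE status = 'CLOSED' AND closed_at < now() - interval '30 days'");
                // ...then remove them from the high-traffic table.
                st.executeUpdate(
                    "DELETE FROM orders WHERE status = 'CLOSED' " +
                    "AND closed_at < now() - interval '30 days'");
            }
            conn.commit();
        }
    }
}
```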

Also, regarding your experience, unfortunately that is the case with most of the latest tech. It all literally *is* alpha-level technology. My personal opinion is that it's mostly a fad. We've hit an era where people are more apt to write their own programming language or database than take the time to figure out if what they want has been made already. There has been a geometric explosion of projects. Everyone and their mother has their own HTML templating engine. You can't throw a stone without hitting a data persistence layer (I'm guilty of that one, too, but when I started mine, there was no good alternative for fast queries that didn't require a full ORM). Giving up data integrity just to make one particular shard of a database handle more writes (which may or may not succeed, so how many more are you actually handling?) is never a trade-off I can justify. Also, I think a lot of people think "what's good for Google/Facebook is good for me," and never think two steps further than that. Everybody wants to think their work is important, so they are quick to call 2-3 GB data sets "Big Data".

Take your time, do your research, and if anything seems too good to be true, it probably is.

[Formerly "capn_midnight". See some of my projects. Find me on twitter tumblr G+ Github.]


Also, I think a lot of people think "what's good for Google/Facebook is good for me," and never think two steps further than that. Everybody wants to think their work is important, so they are quick to call 2-3 GB data sets "Big Data".

That is exactly what we are dealing with. We have found something that may work: http://www.jppf.org/ Hopefully it works as advertised.
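I can't speak for JPPF's actual API yet, but the pattern a grid framework automates is basically "split the job into independent tasks, run them on whatever nodes are available, gather the results." Here's a rough local sketch of that pattern using a plain java.util.concurrent thread pool as a stand-in for remote nodes (this is not JPPF code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the "split into independent tasks, scatter, gather" pattern.
// A grid framework would dispatch each chunk to a different machine.
public class GridSketch {
    static long processChunk(int chunkId) {
        // Placeholder for real work on one slice of the data set.
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += (long) chunkId * i;
        return sum;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // stand-in "nodes"
        List<Future<Long>> results = new ArrayList<>();
        for (int chunk = 0; chunk < 16; chunk++) {
            final int id = chunk;
            results.add(pool.submit((Callable<Long>) () -> processChunk(id)));
        }
        long total = 0;
        for (Future<Long> f : results) total += f.get(); // gather partial results
        System.out.println("total = " + total);
        pool.shutdown();
    }
}
```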

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532

Big data is relative. How much data, and what types of data, are required for it to become "big"?

Wikipedia uses a definition of basically "too big for a normal RDBMS system to handle efficiently".

I can imagine some types of data that reach those database limits with a few gigabytes or even less. If you need to process statistical data, streams of data, or other types of data that a "select ... from ... where ..." style query doesn't naturally fit, you are going to reach "big data" quickly.

Video processing, for example, is a horrible fit for a relational database. You might store a few index values in a database along with meta-information about the raw data stream, but that raw data is best processed with specialized tools. The same goes for statistical processing, where you must actually load and evaluate a large amount of data. These would very quickly reach Wikipedia's definition.

I can imagine datasets reaching into the terabytes and potentially even into exabytes that are still a great fit for a relational database. As long as it remains a system of direct lookup and retrieval the various off-the-shelf systems remain viable.

For example, you could build a Google Maps style database, but increase the resolution up to something akin to a spy satellite. Not just today's commodity spy satellites, but imagine some of the most powerful deep space telescopes pointed toward the Earth instead of away. After all the slicing and dicing of data, the system just becomes a direct lookup of individual blocks of data. Even though the data set is extremely large, the data can be organized in a way that simple lookup remains efficient.
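As a rough illustration of why that stays cheap, here's a sketch of the standard Web Mercator ("slippy map") tile addressing: once the imagery has been pre-sliced, serving it is nothing more than a key-to-blob lookup, no matter how big the full data set is.

```java
// Sketch: turn lat/lon/zoom into a tile key using the standard
// Web Mercator tiling math. The resulting key indexes a pre-sliced
// image block, so retrieval is a simple direct lookup.
public class TileKey {
    public static String key(double lat, double lon, int zoom) {
        int n = 1 << zoom; // number of tiles per axis at this zoom level
        int x = (int) Math.floor((lon + 180.0) / 360.0 * n);
        double latRad = Math.toRadians(lat);
        int y = (int) Math.floor(
            (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * n);
        return zoom + "/" + x + "/" + y;
    }

    public static void main(String[] args) {
        // e.g. the tile covering lower Manhattan at zoom 15
        System.out.println(key(40.7128, -74.0060, 15));
    }
}
```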


Video processing, for example, is a horrible fit for a relational database.

The problem that I keep running into is that just because data doesn't work well with a relational database doesn't mean I need a distributed file system and a "map-reduce" algorithm to process it. Most of that data is captured and then stored in the file system, and other algorithms can process it using the map-reduce job idea. That can be done in parallel, scales well, and handles failures of machines in the processing cluster. But when every step of the processing needs access to all of the data, the work can't be split up to run in parallel, and the intermediate results aren't stored for later examination, this type of algorithm doesn't seem like a good fit. But I could just be misunderstanding something.
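To show the shape I mean, here's a rough sketch (plain Java streams, purely for illustration) of the case that does fit: each record is mapped independently and the partial counts merge trivially. The algorithms giving me trouble have no such independent map phase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the map-reduce shape: per-record work is independent (map),
// and partial results combine with a simple merge (reduce).
public class MapReduceShape {
    public static void main(String[] args) {
        List<String> records = List.of("a b", "b c", "a a");
        Map<String, Long> counts = records.parallelStream()              // map phase
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // reduce
        System.out.println(counts); // a=3, b=2, c=1 (map order may vary)
    }
}
```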

Scientific Workflow - http://en.wikipedia.org/wiki/Scientific_workflow_system - seemed like it might work but hasn't yet.

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532


Hadoop is one of the oldest and most well-established distributed cloud systems available. A lot of the newer projects borrow its algorithms.

The main advantage is that it's been used a lot, and you can find many implementations of different algorithms that people have already written.

The main disadvantages are:

1. You are forced to use MapReduce (see the mapper/reducer sketch after this list)

2. The platform is already "old"

3. The data processing is done offline (one big job at a time).
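To make point 1 concrete, this is essentially the canonical word-count mapper/reducer pair from the Hadoop MapReduce tutorial, trimmed down (driver/job setup omitted). Everything you run on Hadoop has to be squeezed into this map/reduce shape:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the 1s for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }
}
```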

If you want to take a look at some newer stuff:

Cloud Computing:

1. Storm

2. Akka

Cloud databases:

1. Cassandra

2. MongoDB

All of these projects have "sister" services on AWS (even Hadoop). I suggest you try them out on AWS to save yourself the setup time. Then, if you like them, install your own.

My Oculus Rift Game: RaiderV

My Android VR games: Time-Rider& Dozer Driver

My browser game: Vitrage - A game of stained glass

My android games : Enemies of the Crown & Killer Bees


If you want to take a look at some newer stuff:
Cloud Computing:
1. Storm
2. Akka

Cloud databases:
1. Cassandra
2. MongoDB

Thanks, I will check them out.

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532

How big of "big data" are you speaking of?

I've always worked more on the enterprise level of development/analytics. Most companies I've worked for have handled their big data (financial transactions, healthcare data, even some BI) with mainframes, since an old IBM iSeries and the like can still process more transactions per second than any newer machine out there. Then again, we're talking several thousand to hundreds of thousands of transactions per second, which probably isn't what you're looking for.


How big of "big data" are you speaking of?

It isn't really "Big", just enough that we could really use a few machines instead of one.

I think, therefore I am. I think? - "George Carlin"
My Website: Indie Game Programming

My Twitter: https://twitter.com/indieprogram

My Book: http://amzn.com/1305076532

This topic is closed to new replies.
