With the phenomenal rise of big data in the last decade or so, a number of tools have come to be synonymous with the Big Data space. Of these, few are as well-known as Hadoop.
However, with that fame, a number of misconceptions and myths about the software have arisen. So much so that it’s common to hear newbies and professionals alike make the same mistakes. To improve your understanding of Hadoop, here are six of the most common misconceptions about it.
1. Hadoop is a database
It’s easy to understand why someone would think that Hadoop counts as a database, but that couldn’t be further from the truth. Hadoop is used in the industry today to store and analyze large sets of data, typically across distributed servers. However, it lacks some pretty crucial features that a full database would otherwise have.
Hadoop itself has an underlying storage layer, the Hadoop Distributed File System (HDFS), but it doesn’t store data the way a relational database does. Most importantly, HDFS itself has no built-in query language for pulling data out. This makes Hadoop more of a warehousing system than a database.
Additionally, an SQL database needs the schema of the data to be defined before you’re able to save anything to it. This is referred to as ‘schema-on-write.’ Hadoop, on the other hand, lets you store large volumes of unstructured data as-is and apply a schema only when the data is read, an approach referred to as ‘schema-on-read.’
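The contrast can be sketched in plain Python: a minimal, illustrative comparison using the standard-library `sqlite3` module as a stand-in for a relational database (the table and field names here are made up for the example).

```python
import sqlite3

# Schema-on-write: a relational database rejects rows that don't match
# the schema defined up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.execute("INSERT INTO events VALUES (?, ?)", ("alice", 9.99))
try:
    # Wrong number of columns -> rejected at write time.
    conn.execute("INSERT INTO events VALUES (?, ?, ?)", ("bob", 5.0, "x"))
    write_error = None
except sqlite3.OperationalError as e:
    write_error = str(e)

# Schema-on-read: store the raw lines untouched, apply structure only
# at the moment the data is read.
raw_log = ["alice,9.99", "bob,5.00,extra-field"]  # messy data stored as-is

def read_with_schema(line):
    parts = line.split(",")
    return {"user": parts[0], "amount": float(parts[1])}  # schema applied here

records = [read_with_schema(line) for line in raw_log]
```

In the first half the malformed write fails immediately; in the second half the malformed line is stored anyway, and structure is imposed only when it is read back.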
2. Hadoop is cheap
First things first, not every problem you encounter is a Big Data problem. This is extremely relevant because while Hadoop can be on the order of ten times cheaper than traditional solutions, it is still an expensive piece of software to run.
Granted, it’s open source and anyone who knows what they’re doing can set it up, but you’ll still end up paying a pretty penny to keep the servers running.
Big Data by itself is impossible to define precisely, as it depends on the volume of data, its velocity, and its variety. A reasonable benchmark, however, would be a business that deals with a terabyte or more of data per day. By some estimates, the cost of handling this much data comes to around $1,000.
With newer features like in-memory computing and network storage, this number might shoot all the way to $5,000. Considering traditional solutions cost about $30,000 or more to maintain, it’s not surprising that Hadoop has become so popular.
3. Hive is pretty much the same as SQL
Hive is software built on top of Hadoop that’s used to read, write, and manage large sets of data. HiveQL is its SQL-like query language; under the hood, Hive translates HiveQL queries into MapReduce jobs.
It’s perfectly understandable why one would confuse SQL and HiveQL – they are both declarative, support structured data, and support schemas. People who know SQL can catch up on HiveQL pretty fast, but there are a few compatibility issues.
Each tool is meant to read and retrieve data, but the difference, once again, lies in the how. As mentioned before, Hive relies on ‘schema-on-read,’ which essentially allows the user to redefine the schema to match the data on the fly.
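To make “redefine the schema on the fly” concrete, here is a hedged Python sketch of the idea: the same raw tab-separated file is read under two different column definitions, the way Hive external tables project a schema onto files already sitting in HDFS. The column names and data are invented for illustration.

```python
# Same raw file, two different "table definitions" applied at read time.
raw_rows = ["2024-01-05\talice\t9.99", "2024-01-06\tbob\t5.00"]

# First schema: only dates and users are declared.
schema_a = ["event_date", "user"]
# Later, redefine the schema to also expose the third column --
# no rewrite of the underlying data is needed.
schema_b = ["event_date", "user", "amount"]

def apply_schema(row, schema):
    # Pair each declared column name with a field; undeclared fields
    # are simply ignored, as with a narrower table definition.
    return dict(zip(schema, row.split("\t")))

as_a = [apply_schema(r, schema_a) for r in raw_rows]
as_b = [apply_schema(r, schema_b) for r in raw_rows]
```

A schema-on-write database would have to migrate or reload the table to make the same change; here the data never moves.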
4. Hadoop fits every use case
Setting aside whether you are dealing with actual Big Data or not, the fact that SQL and HiveQL are so different, and that HDFS differs significantly from a traditional database, has tremendous implications.
For example, take the case of Hadoop and Spark. Hadoop reads and writes files to HDFS on disk, while newer projects like Spark process and hold data in RAM, which makes them far better suited to iterative workloads.
Similarly, Hive comes with serializer/deserializer (SerDe) adapters that allow it to interpret data in different formats and adjust schemas on the fly. However, this is a pretty costly operation and isn’t suited to tasks with an enormous read/write workload. That fits Hive’s purpose: it was intended as an interface for querying HDFS data, not as a replacement for a DBMS.
While schema-on-write may seem limited in its usability, it makes SQL solutions better suited for operations that involve many reads and writes.
5. You’ll need a programmer to set up Hadoop for you
Considering the amount of jargon thrown around when discussing Hadoop – ‘filesystem,’ ‘query’ and ‘warehousing’ – it’s easy for the layman to get discouraged. Whether or not you will need a programmer to help you with Hadoop depends on your use case.
For instance, if you want to run Hadoop 2+ on 24 nodes and create a fancy Big Data analytics firm, you’re definitely going to need a lot of help. If you want to run a relatively simple analytics program, there are tons of GUIs that will make MapReduce programming a whole lot easier.
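For a sense of what MapReduce programming actually involves, here is a minimal local sketch of the classic word-count job in Python. In a real Hadoop Streaming job the mapper and reducer would be separate scripts reading stdin, and Hadoop itself would handle the shuffle between them; here that shuffle is simulated with a sort-and-group.

```python
from itertools import groupby

# Mapper: emit a (word, 1) pair for every word in a line of input.
def mapper(line):
    for word in line.lower().split():
        yield (word, 1)

# Reducer: sum the counts for one key; Hadoop guarantees all values
# for a key arrive at the same reducer.
def reducer(key, values):
    return (key, sum(values))

# Simulate the shuffle phase locally: sort mapper output, group by key.
lines = ["big data big tools", "hadoop tools"]
mapped = sorted(kv for line in lines for kv in mapper(line))
counts = dict(
    reducer(key, (v for _, v in group))
    for key, group in groupby(mapped, key=lambda kv: kv[0])
)
```

Even this toy version shows why GUIs and higher-level tools like Hive exist: a one-line SQL `GROUP BY` becomes two functions plus orchestration.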
You can either turn to companies like Cloudera for a complete solution if you don’t have the necessary skills or put together a highly customized solution if you do. That’s the beauty of Hadoop.
6. Hadoop is overkill for small businesses
Considering the number of times ‘Big Data’ has been thrown around in this article and elsewhere all over the internet, most SMBs may opt for simpler solutions instead. While Big Data tools were created with Big Data problems in mind, a lot of smaller businesses could benefit greatly from them, depending on the kind of problem at hand.
For instance, the Hadoop ecosystem has some pretty convenient features, like Excel reporting integrations, that allow even users without a computer science background to harness its power. The main Hadoop distributors in the market, like IBM and Oracle, charge top dollar to run a single cluster, but plenty of solutions exist for smaller businesses, too.
And finally, to add more confusion to the definition of what Big Data actually is, you could be dealing with mere gigabytes of data a day and still have a genuine Big Data problem. If maintaining a Hadoop cluster or two fits in your company’s budget, the performance benefits can be enormous.