PrathamOS Server Edition "TreeBeard V2", based on 64-bit hardened CentOS, has been released.
Consequently, it is worth combining that effort with support for the Hadoop stack as well.
We need to identify where Hadoop fits into the project requirement cycle: adopt it too early and it is unproductive... in fact, it is counter-productive and expensive, to say the least. So it is very important to understand that Hadoop is not some new coding language or framework which can just be applied to any project... just because we want to try the "new tech on the block".
Let us proceed with capacity definition. The image above showcases some of the vendors providing hardware to host Big Data. Of course, there are many others.
Cloud or bare metal, the choice is yours... I, as the author of this blog, will exercise my right to a little "bias" and choose LeaseWeb, primarily because they are a Platinum Sponsor of the Apache Software Foundation, the parent organization which keeps people like us in business. And besides, I am not a "cloud" person.
Cloud CPUs are not comparable to data-center CPUs, and cloud clusters may have different scaling properties and behave quite differently for different kinds of workload. While the cloud is a cheap way to get started fast, it tends to be very expensive per hour. Possibly the biggest opportunity in the cloud is that Amazon's S3 storage is dirt-cheap even compared to HDFS, and modern Hadoop supports it as a storage backend. For certain applications, particularly those with a large proportion of cool storage, this can be compelling, but it is usually not advantageous for typical Hadoop workloads because S3 is less nimble for access than DAS. It is very difficult to ascertain the actual machine cost of running large clusters with conventional storage in the cloud, but one can assume that any cloud cluster, dollar for dollar, will be severely I/O bound, with relatively poor SLAs and high hourly cost. Another hidden cost of the cloud is lock-in. Writing data into EC2 is free, but the cost of backing out includes roughly $50,000 per petabyte in download charges. The real cost, however, is likely to be in the sheer time it takes. Network bandwidth from S3 will probably be tens of MB/second, but say you can sustain 100 MB/s. At that rate, it would take 10 million seconds, i.e., about 115 days per PB. Whatever the download rate, one must also include the cost of operating two storage systems for that period of time, etc.
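To put those figures in perspective, here is a minimal back-of-the-envelope sketch in Python; the ~$50,000/PB egress charge and the 100 MB/s sustained rate are the numbers quoted above, everything else is simple arithmetic.

```python
# Back-of-the-envelope cost/time of pulling data back out of S3,
# using the figures quoted above (~$50k egress per PB, ~100 MB/s sustained).

PB_IN_MB = 1_000_000_000               # 1 PB expressed in MB (decimal units)

def egress_days(petabytes, sustained_mb_per_s=100):
    """Days needed to download the given volume at a sustained rate."""
    seconds = petabytes * PB_IN_MB / sustained_mb_per_s
    return seconds / 86_400            # seconds in a day

def egress_cost_usd(petabytes, usd_per_pb=50_000):
    """Download charges at the quoted per-PB rate."""
    return petabytes * usd_per_pb

for pb in (0.1, 1, 5):
    print(f"{pb:>4} PB -> {egress_days(pb):6.1f} days, "
          f"${egress_cost_usd(pb):>9,.0f} in transfer fees")
```

At 1 PB this works out to roughly 116 days and $50,000, which is why the cost of running two storage systems during the migration matters so much.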
The preferred location for a global "deployment" would be their Netherlands data center in Amsterdam.
The server would be the Intel workhorse Xeon E5-2420. Read about the AMD alternative here. Another interesting choice is the AMD Opteron 6272.
At the end of the day, the idea is to get the best return in terms of data versus storage, so that we remain true to the concept of "running commodity servers".
Click on the links below to check out the mentioned configuration.
Let us try to understand what "big" data can really be through the example provided. For that, we will have to understand the basics and hardware limits first.
Production Cluster Setup Guidelines
- Compute-intensive workloads will usually be balanced with fewer disks per server.
- This gives more CPU for a given amount of disk access.
- One can use six or eight disks and get more nodes or more capable CPUs.
- Data-heavy workloads will take advantage of more disks per server.
- Data-heavy means more data access per query, as opposed to more total data stored.
- More disks give more I/O bandwidth, regardless of disk size.
- Network capacity needs tend to go up with higher disk density.
- Jobs with a lot of output use higher network bandwidth along with more disk.
- Output creates three copies of each block; two are across the network.
- ETL/ELT and sorting are examples of jobs that:
- Move the entire dataset across the network
- Have output about the size of the input
- Storage-heavy workloads
- Have relatively little CPU per stored GB
- Favor large disks for high capacity and low power consumption per stored GB
- The benefit is abundant space.
- The drawback is the time and network bandwidth needed to recover from a hardware failure.
- Heterogeneous storage allows a mix of drive types on each machine.
- Just two archival drives on each node can double the storage capacity of a cluster.
- The prolonged impact of a disk or node failure on performance has to be addressed, though; the sketch after this list gives a feel for the recovery times involved.
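To make that recovery point concrete, here is a small, hypothetical sketch of how long re-replication might take after losing a node; the capacities, utilization and per-node rates below are illustrative assumptions, not measurements from any particular cluster.

```python
# Rough estimate of HDFS re-replication time after losing one node.
# Assumption: the surviving nodes re-create the lost replicas, so the
# aggregate rate is roughly (surviving nodes) x (per-node network/disk rate).

def rereplication_hours(node_capacity_tb, used_fraction,
                        surviving_nodes, per_node_rate_mb_s):
    """Hours to re-create the blocks that lived on the failed node."""
    lost_mb = node_capacity_tb * used_fraction * 1_000_000   # TB -> MB
    aggregate_rate = surviving_nodes * per_node_rate_mb_s    # MB/s
    return lost_mb / aggregate_rate / 3_600

# A small cluster with dense nodes hurts far more than a wide one:
print(rereplication_hours(24, 0.7, 2, 100))    # 3-node cluster: ~23 hours
print(rereplication_hours(96, 0.7, 19, 100))   # 20-node cluster, denser node: ~10 hours
```

This is exactly why storage-heavy nodes favour larger clusters: the same failure is amortized over many more peers.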
As the guidelines above suggest, the hardware configuration will depend on, and therefore differ with, the case study. However, let us discuss generic deployments for "typical" workflows.
- Normal real-world balanced-compute configurations usually have dual hex-core CPUs, 128-192 GB RAM and 12 x 2 TB disks directly attached via the motherboard controller, with bonded 10GbE + 10GbE to the ToR switch. These are often available as twins, with two motherboards and 24 drives in a single 2U chassis.
- Light Processing Configuration would have 24-64GB memory, and 8 disk drives (1TB or 2TB).
- Storage Heavy Configuration has 48-96GB memory, and 16-24 disk
drives (2TB – 4TB). This configuration will cause high network traffic
in case of multiple node/rack failures.
- Compute Intensive Configuration normally has 64-512GB memory, and 4-8 disk drives (1TB or 2TB).
- Other points worth mentioning: each node has to be optimally utilized, i.e. its full disk and RAM should be dedicated to the "great churn"; and because Hadoop relies on replication, the recommended base minimum is three nodes, and no implementation with fewer than that should be undertaken.
Recommendations (Balanced Computation) (Others Adjusted Accordingly)
- When to start
- Disk space available (3 nodes × 24 TB) = 72 TB.
- Non-HDFS space: 2%. Cluster nodes should not be used for shared services by default, which means that for OS operations, logs, etc., 2% of the total space available should be enough. In this case, that amounts to 480 GB per node. One can also set a fixed quota instead, say 1 TB.
- Processing space: 23%. In a Big Data environment, the logic goes to the data and not vice versa. All processing is done on the DataNode itself, and space is required to hold the intermediate result sets. This could vary with the specifics of the case study; using something like Spark or Ignite would mean that more memory is required rather than disk.
- Total available for HDFS, considering the above (75% of 72 TB) = 54 TB.
- Assuming that 10% of the data in HDFS could be derived, inferred or aggregated from raw data and saved in HBase or other formats, the actual space for core HDFS (90% of 54 TB) ≈ 48 TB.
- Taking the replication factor as 3, the actual data input (48 TB / 3) = 16 TB.
- In case a codec like Snappy is used, which provides up to 40% compression, data input could be accommodated up to an upper limit of (16 TB × 140%) ≈ 22 TB. Processing time would scale up accordingly, though, and must be ascertained beforehand through sample benchmarking.
- Taking the midway point between raw and compressed makes the starting point 19 TB (the sketch below reproduces this arithmetic).
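The walkthrough above can be captured in a few lines of Python; this is just the same arithmetic (2% non-HDFS, 23% processing, 10% derived data, replication factor 3, up to 40% Snappy compression) expressed as a sketch.

```python
# Capacity-planning sketch reproducing the walkthrough above.

NODES            = 3
DISK_PER_NODE_TB = 24        # e.g. 12 x 2 TB drives
NON_HDFS         = 0.02      # OS, logs, etc.
PROCESSING       = 0.23      # space for intermediate result sets
DERIVED          = 0.10      # data derived/aggregated into HBase, etc.
REPLICATION      = 3
SNAPPY_GAIN      = 0.40      # up to 40% compression

raw_tb        = NODES * DISK_PER_NODE_TB                    # 72 TB
hdfs_tb       = raw_tb * (1 - NON_HDFS - PROCESSING)        # 54 TB
core_hdfs_tb  = hdfs_tb * (1 - DERIVED)                     # ~48.6 TB
input_tb      = core_hdfs_tb / REPLICATION                  # ~16 TB
compressed_tb = input_tb * (1 + SNAPPY_GAIN)                # ~22.7 TB
start_tb      = (input_tb + compressed_tb) / 2              # ~19 TB

print(f"raw={raw_tb} TB, HDFS={hdfs_tb:.1f} TB, core={core_hdfs_tb:.1f} TB, "
      f"input={input_tb:.1f} TB, with Snappy={compressed_tb:.1f} TB, "
      f"start point={start_tb:.1f} TB")
```

Change the constants at the top to re-run the same sizing for a different node count or disk layout.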
More than the data itself, the critical part is the data pattern. The inflow has to be consistently high and grow at a healthy rate, preferably a compounding one.
As for our example, we are currently at 3.5 GB/$1, which doesn't sound very attractive, but a majority of those expenses are capital in nature, and in the long term, if you think in terms of the business growth we would achieve by managing 100 GB per day, each node addition would start bringing the cost down to a profitable proposition (a rough sketch of this arithmetic follows below). Of course, the complete exercise has just been a generic walk-through, and there are still several things that can be done to ensure optimum efficiency. For example, we have taken into account annual contracts only, whereas there are further discounts if we go for, say, 3 years; and then there are bulk purchase discounts as well...
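As a rough, hypothetical illustration of that argument: back-solving from the 3.5 GB/$ and 19 TB figures above gives an implied total spend, and if a majority of that spend is one-time/shared capital cost rather than per-node cost, each added node improves the GB-per-dollar ratio. The 50% capital share, and everything derived from it, is an assumption made for this sketch only, not a LeaseWeb quote.

```python
# Hypothetical economics sketch, back-solved from the figures in this post:
# ~19 TB of usable input on 3 nodes at ~3.5 GB per dollar.
START_TB      = 19
GB_PER_USD    = 3.5
CAPITAL_SHARE = 0.5        # assumed share of spend that does not scale with nodes

implied_spend = START_TB * 1_000 / GB_PER_USD            # ~$5,400 implied total
fixed_cost    = implied_spend * CAPITAL_SHARE             # shared, one-time portion
node_cost     = implied_spend * (1 - CAPITAL_SHARE) / 3   # per-node portion
tb_per_node   = START_TB / 3

for nodes in (3, 4, 6, 10):
    spend       = fixed_cost + nodes * node_cost
    capacity_gb = nodes * tb_per_node * 1_000
    print(f"{nodes:>2} nodes -> {capacity_gb / spend:.2f} GB/$")
```

Under those assumptions the ratio climbs from 3.5 GB/$ at three nodes towards roughly 5.4 GB/$ at ten, which is the "per-node addition brings the cost down" effect described above.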
All in all, it should be quite clear by now that a Big Data implementation is more about proper configuration and a tight relationship between data and storage.