PrathamOS Server Edition "TreeBeard V2", based on 64-bit hardened CentOS, has been released.
Consequently, it is worth combining that effort with support for the Hadoop stack as well.
We need to identify where Hadoop fits into the project requirement cycle: adopt it too early and it is unproductive... in fact, it is counter-productive and expensive, to say the least. So it is very important to understand that Hadoop is not some new coding language or framework which can just be applied to any project... just because we want to try the "new tech on the block".
Let us proceed with capacity definition. The image above showcases some of the vendors providing hardware to host Big Data. Of course, there are many others.
Cloud or bare metal, the choice is yours... I, as the author of this blog, will exercise my right to a little "bias" and choose LeaseWeb, primarily because they are a Platinum Sponsor of the Apache Software Foundation, the parent organization which keeps people like us in business. And besides, I am not a "cloud" person.
Cloud CPUs are not comparable to data-center CPUs, and cloud clusters may have different scaling properties and behave quite differently for different kinds of workload. While the cloud is a cheap way to get started fast, it tends to be very expensive per hour. Possibly the biggest opportunity in the cloud is that Amazon's S3 storage is dirt-cheap even compared to HDFS, and modern Hadoop supports it as a storage backend. For certain applications, particularly those with a large proportion of cool storage, this can be compelling, but it is usually not advantageous for typical Hadoop workloads because S3 is less nimble for access than DAS. It is very difficult to ascertain the actual machine cost of running large clusters with conventional storage in the cloud, but one can assume that any cloud cluster, dollar for dollar, will be severely I/O bound, with relatively poor SLAs and high hourly cost. Another hidden cost of the cloud is lock-in. Writing data into EC2 is free, but the cost of backing out includes roughly $50,000 per petabyte in download charges. The real cost, however, is likely to be in the sheer time it takes. Network bandwidth from S3 will probably be tens of MB/second, but say you can sustain 100 MB/s. At that rate, it would take 10 million seconds, i.e., about 115 days per PB. Whatever the download rate, one must also include the cost of operating two storage systems for that period of time, etc.
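To put those figures in perspective, here is a minimal back-of-the-envelope sketch in Python; the ~$50,000/PB egress charge and the 100 MB/s sustained rate are the numbers quoted above, everything else is simple arithmetic.

```python
# Back-of-the-envelope cost/time of pulling data back out of S3,
# using the figures quoted above (~$50k egress per PB, ~100 MB/s sustained).

PB_IN_MB = 1_000_000_000               # 1 PB expressed in MB (decimal units)

def egress_days(petabytes, sustained_mb_per_s=100):
    """Days needed to download the given volume at a sustained rate."""
    seconds = petabytes * PB_IN_MB / sustained_mb_per_s
    return seconds / 86_400            # seconds in a day

def egress_cost_usd(petabytes, usd_per_pb=50_000):
    """Download charges at the quoted per-PB rate."""
    return petabytes * usd_per_pb

for pb in (0.1, 1, 5):
    print(f"{pb:>4} PB -> {egress_days(pb):6.1f} days, "
          f"${egress_cost_usd(pb):>9,.0f} in transfer fees")
```

At 1 PB this works out to roughly 116 days and $50,000, which is why the cost of running two storage systems during the migration matters so much.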
The preferred location for a global "deployment" would be their Netherlands data center in Amsterdam.
The server would be the Intel workhorse Xeon E5-2420. Read about the AMD alternative here. Another interesting choice is the AMD Opteron 6272.
At the end of the day, the idea is to get the best return in terms of data versus storage, so that we remain true to the concept of "running commodity servers".
Click on the links below to check out the mentioned configuration.
Let us try to understand what "big" data can really be through the example provided. For that, we will have to understand the basics and hardware limits first.
Production Cluster Setup Guidelines
- Compute-intensive workloads will usually be balanced with fewer disks per server.
- This gives more CPU for a given amount of disk access.
- One can use six or eight disks and get more nodes or more capable CPUs.
- Data-heavy workloads will take advantage of more disks per server.
- Data-heavy means more data access per query, as opposed to more total data stored.
- More disks give more I/O bandwidth, regardless of disk size.
- Network capacity needs tend to go up with higher disk density.
- Jobs with a lot of output use higher network bandwidth along with more disk.
- Output creates three copies of each block; two are across the network.
- ETL/ELT and sorting are examples of jobs that:
- Move the entire dataset across the network
- Have output about the size of the input
- Storage-heavy workloads
- Have relatively little CPU per stored GB
- Favor large disks for high capacity and low power consumption per stored GB
- The benefit is abundant space.
- The drawback is the time and network bandwidth needed to recover from a hardware failure.
- Heterogeneous storage allows a mix of drive types on each machine.
- Just two archival drives on each node can double the storage capacity of a cluster.
- The prolonged impact of a disk or node failure on performance has to be addressed, though; the sketch after this list gives a feel for the recovery times involved.
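To make that recovery point concrete, here is a small, hypothetical sketch of how long re-replication might take after losing a node; the capacities, utilization and per-node rates below are illustrative assumptions, not measurements from any particular cluster.

```python
# Rough estimate of HDFS re-replication time after losing one node.
# Assumption: the surviving nodes re-create the lost replicas, so the
# aggregate rate is roughly (surviving nodes) x (per-node network/disk rate).

def rereplication_hours(node_capacity_tb, used_fraction,
                        surviving_nodes, per_node_rate_mb_s):
    """Hours to re-create the blocks that lived on the failed node."""
    lost_mb = node_capacity_tb * used_fraction * 1_000_000   # TB -> MB
    aggregate_rate = surviving_nodes * per_node_rate_mb_s    # MB/s
    return lost_mb / aggregate_rate / 3_600

# A small cluster with dense nodes hurts far more than a wide one:
print(rereplication_hours(24, 0.7, 2, 100))    # 3-node cluster: ~23 hours
print(rereplication_hours(96, 0.7, 19, 100))   # 20-node cluster, denser node: ~10 hours
```

This is exactly why storage-heavy nodes favour larger clusters: the same failure is amortized over many more peers.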
As the guidelines above suggest, the hardware configuration will depend on, and therefore differ with, the case study. However, let us discuss generic deployments for "typical" workflows.
- Normal real-world balanced-compute configurations usually have dual hex-core CPUs, 128-192 GB RAM and 12 x 2 TB disks directly attached via the motherboard controller, with bonded 10GbE + 10GbE to the ToR switch. These are often available as twins, with two motherboards and 24 drives in a single 2U chassis.
- Light Processing Configuration would have 24-64GB memory, and 8 disk drives (1TB or 2TB).
- Storage Heavy Configuration has 48-96GB memory, and 16-24 disk
drives (2TB – 4TB). This configuration will cause high network traffic
in case of multiple node/rack failures.
- Compute Intensive Configuration normally has 64-512GB memory, and 4-8 disk drives (1TB or 2TB).
- Other points worth mentioning: each node has to be optimally utilized, i.e. its full disk and RAM should be dedicated to the "great churn"; and because Hadoop relies on replication, the recommended base minimum is three nodes, and no implementation with fewer than that should be undertaken.
Recommendations (Balanced Computation) (Others Adjusted Accordingly)
- When to start
- Disk space available (3 nodes × 24 TB) = 72 TB.
- Non-HDFS space: 2%. Cluster nodes should not be used for shared services by default, which means that for OS operations, logs, etc., 2% of the total space available should be enough. In this case, that amounts to 480 GB per node. One can also set a fixed quota instead, say 1 TB.
- Processing space: 23%. In a Big Data environment, the logic goes to the data and not vice versa. All processing is done on the DataNode itself, and space is required to hold the intermediate result sets. This could vary with the specifics of the case study; using something like Spark or Ignite would mean that more memory is required rather than disk.
- Total available for HDFS, considering the above (75% of 72 TB) = 54 TB.
- Assuming that 10% of the data in HDFS could be derived, inferred or aggregated from raw data and saved in HBase or other formats, the actual space for core HDFS (90% of 54 TB) ≈ 48 TB.
- Taking the replication factor as 3, the actual data input (48 TB / 3) = 16 TB.
- In case a codec like Snappy is used, which provides up to 40% compression, data input could be accommodated up to an upper limit of (16 TB × 140%) ≈ 22 TB. Processing time would scale up accordingly, though, and must be ascertained beforehand through sample benchmarking.
- Taking the midway point between raw and compressed makes the starting point 19 TB (the sketch below reproduces this arithmetic).
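The walkthrough above can be captured in a few lines of Python; this is just the same arithmetic (2% non-HDFS, 23% processing, 10% derived data, replication factor 3, up to 40% Snappy compression) expressed as a sketch.

```python
# Capacity-planning sketch reproducing the walkthrough above.

NODES            = 3
DISK_PER_NODE_TB = 24        # e.g. 12 x 2 TB drives
NON_HDFS         = 0.02      # OS, logs, etc.
PROCESSING       = 0.23      # space for intermediate result sets
DERIVED          = 0.10      # data derived/aggregated into HBase, etc.
REPLICATION      = 3
SNAPPY_GAIN      = 0.40      # up to 40% compression

raw_tb        = NODES * DISK_PER_NODE_TB                    # 72 TB
hdfs_tb       = raw_tb * (1 - NON_HDFS - PROCESSING)        # 54 TB
core_hdfs_tb  = hdfs_tb * (1 - DERIVED)                     # ~48.6 TB
input_tb      = core_hdfs_tb / REPLICATION                  # ~16 TB
compressed_tb = input_tb * (1 + SNAPPY_GAIN)                # ~22.7 TB
start_tb      = (input_tb + compressed_tb) / 2              # ~19 TB

print(f"raw={raw_tb} TB, HDFS={hdfs_tb:.1f} TB, core={core_hdfs_tb:.1f} TB, "
      f"input={input_tb:.1f} TB, with Snappy={compressed_tb:.1f} TB, "
      f"start point={start_tb:.1f} TB")
```

Change the constants at the top to re-run the same sizing for a different node count or disk layout.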
More than the data itself, the critical part is the data pattern. The inflow has to be consistently high and grow at a healthy rate, preferably a compounding one.
As for our example, we are currently at 3.5 GB/$1, which doesn't sound very attractive, but a majority of those expenses are capital in nature, and in the long term, if you think in terms of the business growth we would achieve by managing 100 GB per day, each node addition would start bringing the cost down to a profitable proposition (a rough sketch of this arithmetic follows below). Of course, the complete exercise has just been a generic walk-through, and there are still several things that can be done to ensure optimum efficiency. For example, we have taken into account annual contracts only, whereas there are further discounts if we go for, say, 3 years; and then there are bulk purchase discounts as well...
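As a rough, hypothetical illustration of that argument: back-solving from the 3.5 GB/$ and 19 TB figures above gives an implied total spend, and if a majority of that spend is one-time/shared capital cost rather than per-node cost, each added node improves the GB-per-dollar ratio. The 50% capital share, and everything derived from it, is an assumption made for this sketch only, not a LeaseWeb quote.

```python
# Hypothetical economics sketch, back-solved from the figures in this post:
# ~19 TB of usable input on 3 nodes at ~3.5 GB per dollar.
START_TB      = 19
GB_PER_USD    = 3.5
CAPITAL_SHARE = 0.5        # assumed share of spend that does not scale with nodes

implied_spend = START_TB * 1_000 / GB_PER_USD            # ~$5,400 implied total
fixed_cost    = implied_spend * CAPITAL_SHARE             # shared, one-time portion
node_cost     = implied_spend * (1 - CAPITAL_SHARE) / 3   # per-node portion
tb_per_node   = START_TB / 3

for nodes in (3, 4, 6, 10):
    spend       = fixed_cost + nodes * node_cost
    capacity_gb = nodes * tb_per_node * 1_000
    print(f"{nodes:>2} nodes -> {capacity_gb / spend:.2f} GB/$")
```

Under those assumptions the ratio climbs from 3.5 GB/$ at three nodes towards roughly 5.4 GB/$ at ten, which is the "per-node addition brings the cost down" effect described above.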
All in all, it should be quite clear by now that a Big Data implementation is more about proper configuration and a tight relationship between data and storage.