Cerebral Workout: Pentaho Data Integration

Q&A notes from the following webinar for the benefit of PDI users.

"High Performance ETL using Cloud- and Cluster-based Deployment Web" which is based upon the following whitepaper"

Pentaho Data Integration : Scaling Out Large Data Volume Processing in the Cloud or on Premise

Session number: 713773880

Ranadeep Bhattacharya - 11:48 pm

Q: What do you mean by read in parallel? Does that mean only a part of the file is available in each slave?

Matt Casters - 11:49 pm

A: That's exactly what the algorithm does. It splits the file by size and divides data ranges over the available nodes.

_________________________________________________________________

Ranadeep Bhattacharya - 11:50 pm

Q: But is the file physically located on a single server or split between the 10 or 20?

Matt Casters - 11:50 pm

A: Located on a single shared filesystem. So the same file is read by N nodes.

_________________________________________________________________

abhishek manocha - 11:50 pm

Q: So as i understand is their a limitation of clustering only possible if we choose CSV as our input step? I doesnt work on Table Input step?

Matt Casters - 11:52 pm

A: You can do the same thing with a Table Input node, but you need to tweak the SQL statement that is executed since you only want a part of the rows. Usually it involves using a MOD (%) operator and internal variables representing the node # and # of nodes.

_________________________________________________________________

Robert Folkerts - 11:51 pm

Q: Were there experiments with dimension lookups when populating a fact table? That is my 'bread and butter' case.

Matt Casters - 11:53 pm

A: Not yet Robert. With the new cache pre-load option it would make an interesting experiment for sure.

_________________________________________________________________

Dan Jolly - 11:54 pm

Q: When is 3.2 scheduled for GA?

Matt Casters - 11:55 pm

A: Dan, 3.2.0-stable was released last week.

_________________________________________________________________

Vijayaraghavan Amirisetty - 11:56 pm

Q: Does PDI plan to support other partitioning methods like key-range partitioning or hash partitioning in the future ? - Vijay

Matt Casters - 11:58 pm

A: It's not on our roadmap right away. That being said, it's possible to do now both though Partitioning plugins as well as though a calculation. (simply calculate a partition # and do a mod part on that)

_________________________________________________________________

Peter Schmidt - 11:55 pm

Q: Can you please re-explain the diff between 50/Sort and 100/Sort and 300/Sort.

Matt Casters - 11:56 pm

A: The only difference is the size of the lineitem.tbl file size. 300=1.8B rows, 100=600M rows, etc

_________________________________________________________________

abhishek manocha - 11:57 pm

Q: So considering a scenrio where I have 80 small db inputs and I need to collate them in one central target db, with scehduling of once in a hour (24 times a day) for all sources, clustering make sense?

Matt Casters - 12:00 am

A: It can make sense if the CPU consumption on your one server is a bottleneck. If that's not the case, you don't really need to do it.

_________________________________________________________________

sanjeev sagar - 11:59 pm

Q: i joined late but which benchmark it is?

Lance Walter - 12:01 am

A: The whitepaper on bayon-technologies has more details. It uses TPC-H data, but is not a "benchmark" by design.

_________________________________________________________________

Dan Jolly - 12:01 am

Q: Is this cost model based on the EC2 costs?

Lance Walter - 12:01 am

A: yes, computing as well as storage costs on EC2

_________________________________________________________________

prem brahmandam - 11:54 pm

Q: Can we get a sample transform using "table input" step with tweaked query..

Matt Casters - 12:03 am

A: SELECT * FROM foo WHERE mod(id, ${Internal.Step.Unique.Count}) = ${Internal.Step.Unique.Number}

_________________________________________________________________

Peter Schmidt - 12:02 am

Q: If most of your ETL uses table input/table output steps, what changes need to be made to one's transformations, it sounds like if you were reading from flat files, you wouldn't have to do much to configure this to work?

Matt Casters - 12:05 am

A: Peter, it highly depends on the question if your source database can make use of multiple CPUs, etc. The best strategy is to partition/shard the source and target databases as well. (see also prem's question above)

_________________________________________________________________

Peter Schmidt - 12:08 am

Q: Quick question on the sample transform query, so I am assuming you'd have to put a wrapper around this that increments the count and number)?

Matt Casters - 12:09 am

A: It goes without saying that those internal variables are set automatically in a clustered run.

_________________________________________________________________

Bret Landon - 12:08 am

Q: Is any of this based on the hadoop methodology?

Matt Casters - 12:10 am

A: No Hadoop cluster is needed although we have plans to make use of Hadoop clusters in the near future.

_________________________________________________________________

Laura Moche - 12:10 am

Q: Were the EC2 servers from this test case dedicated to this test? Or were the servers shared with other processing?

Nicholas Goodman - 12:11 am

A: We dedicated the use for the transformations. But EC2 instances aren't dedicated - they are shared with other EC2 users...

_________________________________________________________________

Dan Jolly - 11:57 pm

Q: What is that top level number

Nicholas Goodman - 12:12 am

A: sorted 450k / rec / sec for 40 nodes

_________________________________________________________________

abhishek manocha - 12:08 am

Q: No Matt, building on the Peter's question, what if the source database are really on different machines and partitioning/sharding is not an option as in case of mysql 4?

Nicholas Goodman - 12:13 am

A: There are things that can be done to partition the connection on the PDI side. ie - if you know that host xyz keeps partition 1, and abc keeps partition 2 you can set that up and we'll use just plain 'ole JDBC

_________________________________________________________________

Venu Ambekar - 12:12 am

Q: Is there a capability to handle only incremental changes from a datasource, instead of depending upon the time-stamps of the tables in the datasource.

Nicholas Goodman - 12:14 am

A: Yes. PDI has capabilities for detecting changes from data sources and you can help you only process those changes.

_________________________________________________________________

Lakshman Bulusu - 12:14 am

Q: What about EL-T in the CLoud?

Matt Casters - 12:16 am

A: It highly depends on the situation. It still depends on the capabilities of the database(s) (parallelism etc) used. Suffice it to say that we always recommend you to make that call yourself in PDI.

_________________________________________________________________

abhishek manocha - 12:16 am

Q: Oh I see Partitioning on the PDI side itself, evn if not supported by underlying db ?

Matt Casters - 12:17 am

A: Yes, we refered to data partitioning in the PDI streams earlier, not just database table partitioning.

_________________________________________________________________

Tony Sidhu - 12:13 am

Q: is it practical to run processing in the cloud, that is inputing and outputing to a database that runs in the office?

Matt Casters - 12:21 am

A: Tony, barring any extreme case, I don't think so. We have another partner that did something similar but keeping the data on the cloud because of cost savings (30%). It has to be noted that the machine needed to be hosting reports, analyses, etc.

_________________________________________________________________

Scott Sorensen - 12:21 am

Q: Could you elaborate on 'capabilities to detect changes in a data source' - or provide a reference on how this is done.

Nicholas Goodman - 12:23 am

A: sure... couple of quesitons on this. There are a few different ways to approach this - none of which is any PDI silver bullet. There's a step to compare rows (one stream is reference, other is changed) and output diffs. You can simply parameterize your

Nicholas Goodman - 12:23 am

A: queries so that they only take "update_dt > {last_time_I_checked}"

_________________________________________________________________

Kamal Trivedi - 12:16 am

Q: what if in future the cloud location is moved overseas

Matt Casters - 12:22 am

A: If your infrastructure cloud is moved then it all depends on data volumes, internet speed, etc whether or not you would get in trouble. With the internet getting faster all the time, I doubt this will become an issue soon.

_________________________________________________________________

Steve McAtee - 12:18 am

Q: Will a recording of this presentation be available offline?

Matthew Papertsian - 12:23 am

A: Yes - the recording will be avilable within 48 hours and sent to you via email

_________________________________________________________________

abhishek manocha - 12:22 am

Q: What's the role of Carte in all this clustering? I was under the impression thats its internal to PDI, if I going to EC2, where does it fit?

Matt Casters - 12:24 am

A: Carte is simply a small webserver that listens to the outstide world. It can be given transformations, jobs etc to execute. It's controlled remotely. It's launched during startup of an EC2 host.

_________________________________________________________________

Dan Jolly - 12:20 am

Q: are there any similar case studies?

Nicholas Goodman - 12:24 am

A: Hi Dan - It'd be great if we could get customers to publish some of their own results and case studies for "big data." I'm not aware of any case studies that look at CLUSTERING/PARTITIONING explicitly.

Lance Walter - 12:24 am

A: here's the best public example - here's the announcement http://www.pentaho.com/news/releases/20090210_nutricia_uses_pentaho_on_amazon_cloud.php and here is the technical case study http://tinyurl.com/q9b9hs

Nicholas Goodman - 12:25 am

A: PS - How's the weather in Colorado today?

_________________________________________________________________

abhishek manocha - 12:16 am

Q: Can you give more info on that Nick please... on the incremental data?

Nicholas Goodman - 12:26 am

A: See below... there's nothing really "Big Data" special about change data capture.

Nicholas Goodman - 12:27 am

A: there's a variety of techniques.. check the wiki/pentaho training/dev lists for info on how to do this.

_________________________________________________________________

abhishek manocha - 11:58 pm

Q: It may be obvious question, but in last 10 days I have touched the surface of PDI only and done sample test on single workstation of mine

Matthew Papertsian - 12:27 am

A: Abhishek - can you please complete your question as I am not certain what you are asking?

_________________________________________________________________

sanjeev sagar - 11:59 pm

Q: or which tools were used for these fig.?

Nicholas Goodman - 12:28 am

A: This was PDI 3.2 (pre GA it was inbetween release candidates). The exact build # is in the whitepaper.

_________________________________________________________________

abhishek manocha - 12:28 am

Q: hi Matt, so you dont really recoment it for office use as you mentioned to tony?

Matt Casters - 12:30 am

A: Well, given the fact that you now have "local" clouds in large corporation and that virtualization keeps growing, I might be completely wrong. Then again, it was a very specific question that Tony had.

_________________________________________________________________

abhishek manocha - 12:30 am

Q: Ok, fair enough Matt

Matt Casters - 12:31 am

A: Sure thing!

_________________________________________________________________

Ulrich Riedel - 12:30 am

Q: I have seen in the webcast that sorting 1 billion lines a month costs app. 32,000$. Why is this price said to be cheap? Are there comparable prices?

Matt Casters - 12:32 am

A: It's only 4$ to sort a billion rows. I think the situation was that if you needed to do it every hour or so, you would spend that money. Cloud works best economically if it takes care of peak loads.

Matt Casters - 12:32 am

A: Sorry 6$ :-)

Cerebral Workout

Thursday, June 04, 2009

Pentaho Data Integration - Scalable ETL deployments

No comments:

Blog Archive

Cerebral Workout

Thursday, June 04, 2009

Pentaho Data Integration - Scalable ETL deployments

No comments:

Subscribe

Blog Archive