Ep#98 Focus on the Data and Not the Process

October 20, 2022

Episode Summary

#ptpcloud #hpc #awspartners
High-Performance Computing (Clusters), or HPC, isn't new in the bioinformatics field. Processing large amounts of data takes time, and the data goes through numerous processes before it can be analyzed. The key here is "it takes time": hours, days, or even weeks of waiting for the output. What if you could cut the processing pipeline from days to hours, and what if you didn't have to worry about the infrastructure involved, just the output?

Sponsored by PTP Cloud

Who are we? Cloud experts. Industry Veterans. Champions of game-changing innovation.

We work with innovative companies to help them access, optimize, govern and secure their mission-critical cloud environments, so they can go about their business of bringing breakthrough products to market more rapidly. We actively monitor your environment, provide you with critical alerting, and provide actionable performance reports. And we do this 24x7x365.

About the Guest

Micah Frederick

Been working in technology for about 24 years now.  Started out in networking and moved to automation and DevOps.  When not building random things for customers I am an avid runner and have completed 10 marathons.  My wife and I live on a small farm outside of Cincinnati and I am currently figuring out how to manage two horses.  We like to travel a lot and her favorite place is Maine and mine is Disney World…a little bit different.

Episode Show Notes & Transcript

Host: Jon

Welcome to the Jon Myer podcast. Today we're talking with Micah Frederick, cloud ops lead architect at PTP. Our topic today is high-performance clustering. Please join me in welcoming Micah to the show. Micah, thank you so much for joining me.

Guest: Micah

Thank you. It's been a lot of fun so far in this kind of industry, so,

Host: Jon

Well, you know what, Micah, let's jump into that because you said it's been fun, uh, so far in this industry. And I wanna dive into your background because I feel you have a very diverse background between DevOps, automation, compliance, security, and now high-performance computing. How the heck did you get all this diverse background, and why HPC now?

Guest: Micah

I started out doing mostly networking, you know, network operations center, then network engineering, routing, switching, all that fun stuff. But I started to find that there was a big need at the companies I was working for, for network automation. So I started building up all of my automation skills and started getting into network automation. That kind of blended over into compliance, starting to automate compliance and security stuff. So from there, I was at a point where I was pretty versed in the whole DevOps, you know, architectures, doing web development, doing back-end programming, all this kind of stuff. I started filling a niche of just, we have a problem, we don't know how to fix it, and I just come up with some weird solution, um, which is not too dissimilar from some of the HPC stuff. But that transition came when I joined PTP and we started getting into cloud stuff.

Guest: Micah

So, I adopted learning all the cloud, put all my DevOps information and knowledge into the cloud, just kind of a natural fit, and that dovetailed pretty nicely. You know, doing everything as code, infrastructure as code, starting to create that kind of automation. It worked well in the AWS cloud world. So from there, you know, we started getting some HPC customers and we were like, Oh, what do we do with them? I'm like, I don't know, figure it out. This is what we've got. And so I started delving into the world of HPC, learning all the pipelines, the technology, the lingo. I've now learned way more about data scientists and research than I would like to know. And I still don't know almost anything about it.

Host: Jon

<laugh>, Oh my, I can see the correlation between your networking, right? All right. Now you did automation and then DevOps has automation built into it, your security, you're building automation into it, and now HPC, you're building automation into the pipelines. Before we dive into it a little bit more, can you explain to everybody really what is HPC and how it relates to PTP?

Guest: Micah

So HPC is just high-performance clustering. So it's the ability to spin up large compute nodes. And usually, they're not singular nodes. They tend to be clusters of hundreds, potentially thousands of CPUs or computers that are all running, you know, kind of in orchestration and coordination with each other to, you know, produce some kind of data or analyze data, to do whatever job you've scripted it to do. Um, so how PTP kind of works into this is helping companies. They know they want to leverage that kind of scale and that size, but they don't know how to get into it. Um, so we kind of help them bridge that gap from, I've got these couple of things that I run on my computer, to, how do we turn that into a large-scale production workload?

Host: Jon

Okay. You just said high-performance clustering. What's the difference between high-performance clustering and computing?

Guest: Micah

Well, mostly it just deals with several simultaneous computers. So you can do one job on one very, very big computer, but you will just run into a limitation on the number of CPUs or whatever that can be put into a singular physical server. There are just limits. When you get into clustering, you're saying, I want to take potentially hundreds of servers, coordinate them all together, and run a job that I can split over all of these different compute nodes and then collect that information back together to get one finished product. So it's, it's mostly just a scalability thing. Um, and the ability to, you know, really precisely use the amount of computing power that you need.
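
To make the split-and-collect idea above concrete, here is a minimal Python sketch. It is only an illustration, not how PTP builds clusters: threads on one machine stand in for separate compute nodes, and the function names and the GC-counting job are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(chunk: str) -> int:
    """Work done on one 'node': count G and C bases in a chunk of sequence."""
    return sum(1 for base in chunk if base in "GC")


def split(sequence: str, parts: int) -> list[str]:
    """Split the input into roughly equal chunks, one per worker."""
    size = max(1, -(-len(sequence) // parts))  # ceiling division, at least 1
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def scatter_gather(sequence: str, workers: int = 4) -> int:
    """Scatter chunks to workers, then gather partial results into one answer."""
    chunks = split(sequence, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_gc, chunks))
    return sum(partial_counts)


if __name__ == "__main__":
    print(scatter_gather("ATGCGGCCTA" * 1000))
```

On a real cluster the scatter and gather steps would go through a scheduler and shared storage rather than an in-process thread pool, but the shape of the job is the same.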

Host: Jon

So Micah, how does a customer define and know that they need PTP to help? Do they wake up one day and say, You know what? My current HPC, my processing, my batch for my bioinformatics customer, I need to improve performance, right? It's taking me six days to process this. I wanna get it down. How do they even know they can get it down? Or how do they engage with PTP?

Guest: Micah

So it usually does end up being something like that. They've worked on a pipeline or they've worked on a series of steps. Like, I run these six tools on this sampling data, and I get an output that I can then analyze. They've kind of built that, they've done it on their little laptop or they've done it on a server there, like on-premises. And they're going, Well, this takes, you know, hours, it takes days. I have to switch each step, you know, and they're kind of going, I, I need to be able to do this better, faster, more efficiently. So it comes from that side. A lot of 'em, these are all, you know, PhDs. So these are all data analysts, like really smart scientists, people coming from research, you know, universities and stuff. Those research universities have high-performance clustering that is available to them with all these tools installed.

Guest: Micah

And they've been used to, in their kind of research academia world, just being able to run things. Now they go into the private sector, and a company doesn't have that. Like, they don't have, you know, they haven't set aside tens of millions of dollars to build, you know, big server infrastructures and make that presentable to everyone. So now they're trying to replicate the things that they could do in academia in the private sector. So they reach out to us to kind of help with that, because they know their end goal, they know where they want to go, but, you know, they're not infrastructure people and they don't necessarily know all the cloud stuff. So they just know how it was done for them and they, you know, want to be able to utilize some of those functions.

Host: Jon

So Micah, now I'm seeing the correlation where they were once doing it or trying to achieve it, or even it's a net-new customer saying, I want to do X, Y, and Z, I wanna process the data, I'm not gonna worry about the infrastructure. And by the way, the cost is a huge concern for me. So, Micah, they come to PTP and say, Hey listen, I used to be able to do this in one day, now I'm in a private company and it's taking me seven days. Walk me through these steps, and do you have a specific customer example you can share with us on how you engage with them? And then what was the ultimate end goal?

Guest: Micah

So yeah, I mean a lot of them, that's just kind of how it comes in. So, they would start to do that, and they'll reach out because they may have an idea of what they want to build. So sometimes we get very specific, like, I would like to build this kind of Slurm cluster. They have a very specific, you know, technology base they want to build. Others come with much more of a, we know we need to build this, we don't have a preset idea about how that needs to get done. And we can work with them to build something potentially very different from what they used in the academic world. With the academic world, they have big physical clusters, they're on-site, and they're big clusters they can use. We're now using cloud infrastructure. So we're leveraging potentially clusters that could be hundreds or thousands of times bigger than the ones they used, but they're different. Their access is different, and the technologies are different. So like AWS gives us much more ability to use ephemeral resources so that we aren't paying for large clusters to stay up when they're not being used. We can just run them, you know, as we need them.

Guest: Micah

So for each of those clients, we tend to get their requirements, we get what they're looking for and then we start helping them break that down kind of into the pieces that would be required. What's the software that needs to run, how do we run that software, and what's the best fit for the resources? And a lot of times we take them sometimes from their original thought into a completely different, you know, kind of vein to say we can do what you want, we can just do it cheaper and we can do it faster over here. We can kind of futureproof you a little bit, um, from your original idea.

Host: Jon

So, Micah, I wanna talk about cost in a couple of minutes, but I want to go farther back through the process. This was net-new, they had an idea of what they wanted to achieve, and they had an idea of what the result potentially could be.

Guest: Micah

Yeah, so what they're mostly doing is they'll know that they have X amount of sampling data. They have their lab, they're producing, you know, this raw data from samples and they need to, they need to evaluate it, they need to analyze it, they need to splice it, slice it, dice it, whatever they wanna do. It's kinda like a kitchen utensil, I guess. Um, and I don't know all of those tools they use. I've started to get more familiar, but they'll start just whipping out terms and I'm like, yeah, I don't, I don't know what you're trying to do to that DNA sequence. That doesn't make any sense to me. But we will help them go, you have raw data, these are the tool sets, like tools that you want to be able to use to evaluate the data, and you're trying to get it into some kind of end state where they can then do more data analytics and visualizations, all that kind of stuff.

Guest: Micah

So we'll take that and we'll start kind of breaking that process apart. So like we'll take all those tools, we'll start building those tool sets, like we'll get them installed into containers, um, whether it's Docker, Conda, um, you know, any kind of containerized methodology. So we can kind of self-contain those tools, kind of control their, their prerequisites, control their dependencies, um, and we start creating them a library of these tools that they can use. We start building a landing zone for all that incoming data. So a big bucket where they can stick raw data and then big buckets where, you know, finished data can go. And then the bigger thing we've helped a lot of 'em with is then adopting an actual pipeline technology. So like the big one in the industry at the moment is Nextflow. So that's the one we're working with.

Guest: Micah

And that just helps automate those steps, the process of doing, um, all of those tasks. So instead of them going, Oh, I have data, I'm gonna run this one job, and then it gets output, and then, oh, I wanna run these three jobs on that output and then get whatever that is, and then they're gonna run two other jobs, instead of them kind of doing all that, we start building these, like, pipelines so that we can just say, we put input into one side of this pipeline. It's gonna run this task and then it's gonna run these three tasks, then it's gonna run these two tasks, and it's gonna combine everything and run these other five tasks, and then we'll combine all the data, we'll reevaluate it, and stick it in an output spot somewhere.
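
The "one task, then three tasks, then combine" flow described above can be sketched in a few lines of Python. This is not Nextflow and not PTP's pipeline code; it is a toy runner, and the task names (qc, align, count, filter, merge) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_task(name):
    """Build a hypothetical task that just records what it was run on."""
    def task(inputs):
        return f"{name}({'+'.join(inputs)})"
    return task


def run_stage(tasks, inputs):
    """Run every task in a stage on the previous outputs, side by side."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task, inputs) for task in tasks]
        return [future.result() for future in futures]


def run_pipeline(stages, raw_input):
    """Feed input in one end; each stage consumes the outputs of the one before."""
    data = [raw_input]
    for tasks in stages:
        data = run_stage(tasks, data)
    return data


if __name__ == "__main__":
    stages = [
        [make_task("qc")],                                              # one task
        [make_task("align"), make_task("count"), make_task("filter")],  # three tasks
        [make_task("merge")],                                           # combine
    ]
    print(run_pipeline(stages, "sample.fastq"))
```

A workflow engine adds what this sketch leaves out: retries, containers per task, and dispatching each task to its own compute node.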

Host: Jon

Nice. So it's an automated workflow, similar to step functions, because there are also AWS Step Functions. But Nextflow is what you're utilizing to automate the process, because the person that's sitting there at the company is not necessarily the one that you want managing the infrastructure. They should be worried about the output and analyzing the samples. Your job is to get this data processed as quickly as possible into a result for them to utilize and evaluate.

Guest: Micah

Yeah, they went through a lot of schooling too, you know, about analyzing DNA and RNA sequencing data. They don't need to be focused on why am I getting this compiling error when I install this software. Why is this library dependency not working? Why is, you know, oh I got the data, let me, let me go type in to click the next job. We want them focused on what they're good at. We're good at doing this part, the infrastructure, building that automation, building that tool set, getting that to flow nice and seamlessly so that way they can focus on what they do well, what drives their business, makes them money, you know, and, and leads to whatever next discovery they're looking for.

Host: Jon

What I imagine that you do is you take the output, whatever format it's in, I'm just gonna use CSV because it's the most common one. And you take this output and you, you parse it into exactly what it needs to be for the next input that you're gonna send it to. And that's going to evaluate it, thus providing an additional output, and it's gonna rinse and repeat throughout the cycle. And the final step is compiling the data so that the data scientist can analyze it. Does that pretty much sum up how you're doing the process, or is there more involved?

Guest: Micah

Yeah, I mean that's a lot of it. What they usually start with is more like DNA or RNA-sequenced pairs. So there are these giant files of, like, sample data, which look like nothing to me, but yes, it is slicing that, stripping it down. They're looking for very specific things. They're trying to, like, isolate parts of that strand or whatever they're looking for, getting that into an output at the end that they can visualize, that can be graphed, that can be looked at so that they can compare results across hundreds of samples. Looking for, like, similarities, things that match up, things that don't. It varies widely, um, just depending on what company it is and what kind of research they're doing and what they're trying to look at.

Host: Jon

Micah, is PTP able to analyze and recognize the individual steps and functions that can be performed in parallel or in sequence to improve the processing cost and efficiency?

Guest: Micah

So yeah, a lot of times we do help with that. Um, so a lot of times we've had customers that have come with their pipelines. Like, I've, I've got this all figured out. I get one sample in here and this is my end sample. I've written this Nextflow pipeline and it's 42 steps long and it does all this data and it collects all of these outputs that we can analyze. And I go, that's great, looks good. And then I take it and we will start helping them put it into the cloud, get it into AWS, but as part of that we'll look at that pipeline and go, you know, steps like 16 through 25 can all happen at the same time. They don't need to be in series, so let's move that around. And then, oh well, these steps can parallelize this way and this way. And then we start shrinking that pipeline down. They go, Hey, we, we got an automated pipeline, it's automated. They don't have to do anything, but it's taking 24 hours or two days to finish it all. I can go, we can, we can do that in like four hours. Like if we rework the system, we can do that in four hours. You'll use the same amount of computing. Would you rather do it in four hours, or would you rather do it in three days?
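
The point about steps that "don't need to be in series" comes down to comparing the sum of all step times with the longest dependency chain. Here is a small sketch of that comparison, assuming made-up step names and durations and using the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def serial_hours(durations):
    """Wall-clock time when every step waits for the previous one."""
    return sum(durations.values())


def parallel_hours(durations, depends_on):
    """Wall-clock time when each step starts as soon as its dependencies finish.

    This is the critical-path length, assuming enough nodes to run
    every ready step at once.
    """
    graph = {step: depends_on.get(step, ()) for step in durations}
    finish = {}
    for step in TopologicalSorter(graph).static_order():
        start = max((finish[dep] for dep in depends_on.get(step, ())), default=0.0)
        finish[step] = start + durations[step]
    return max(finish.values(), default=0.0)


if __name__ == "__main__":
    # Hypothetical pipeline: three independent analyses between prep and report.
    durations = {"prep": 2, "a": 3, "b": 3, "c": 3, "report": 1}
    depends_on = {"a": ["prep"], "b": ["prep"], "c": ["prep"], "report": ["a", "b", "c"]}
    print(serial_hours(durations), parallel_hours(durations, depends_on))
```

Total compute is the same in both cases (12 step-hours in the example); only the elapsed time changes, which matches the "same amount of computing" remark above.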

Host: Jon

So the improved efficiency is something like 50 to 75%, and it comes from you sitting down and analyzing the existing pipeline that they set up. And it's about educating your customers on the efficiency of doing the automation. Great, you have a pipeline in place. You're not trying to tell them that, uh, it's wrong. You're just saying we can improve this efficiency so you can get the data and you can do your job instead of waiting for this to complete. Because, you know, I imagine with the three days it takes to process, they're like,

Guest: Micah

Yeah,

Host: Jon

I'm gonna take a vacation now it's like four hours. Oh man, I can process more stuff.

Guest: Micah

It's just a long lunch at that point <laugh>.

Guest: Micah

So yeah, I mean it, it works in many different ways. Sometimes they have a, um, pretty good pipeline. They have a pretty efficient pipeline, but they may be using an older clustering technology, and we can introduce them to a newer clustering technology to make better use of cloud resources. So we make it more efficient on that side. Um, so it just depends on the customer. They're all a little bit different. So we usually try to sit down in that same way of, you know, what are you trying to do? You know, what kind of data are you starting with, where are you trying to get it? You know, how can we build these things and then help them get into, like, the most efficient model for, you know, whatever their particular business is?

Host: Jon

Micah, I'd like to talk about cost a little bit because cost plays a huge role in doing high-performance clustering, right? I can run this huge cluster and it could cost me a lot of money, or I can run this small cluster for a longer time and it costs me a lot more money. How are you aware of the resources that need to be allocated for this processing, or the processing that needs to happen? How are you deciding what's more efficient?

Guest: Micah

So usually we try to do it, you know, a lot of times we'll do some unit testing pipelines. So we'll build some different pipelines, um, utilizing different resource types like even different CPU or GPU architectures behind the scenes, different amounts of memory and CPU. Try to get an idea of how much each step potentially can cost. Um, sometimes that involves running those jobs on an isolated server, just running it. Being able to track the metrics on that server, look at kind of your memory and CPU usage, getting an idea on how, how much those things take, you know, and then usually trying to get them into a technology that leaves those servers running for the least amount of time. So in general, almost always it is more efficient to spin up some bigger servers for shorter periods that you can cram multiple jobs onto and get them all done and just be finished.

Guest: Micah

And, and then, I mean, there are times it might be smaller servers, but you have a push and pull between the cost efficiency of the compute running versus the cost efficiency of people waiting to do things. So you usually try to push so that things get done quickly, because if you set up the right kind of cluster using the right kind of technology, those clusters spin up, the pipeline spins up, it runs all this compute, and everything shuts down. There's nothing left running. So when it's not actively running a job, it costs you zero, you know, compute, and you're only ever paying when you're processing. So in that kind of model, you might as well process as fast as you can and get the data sitting there. And then, you know, it may not run for days, it may not run for weeks, it just depends on how they get data in. Sometimes it comes in big batches, you've got hundreds and hundreds of samples that you're running through, and it, you know, may go for a week or several days where it's just computing all kinds of data, and then nothing for a month.

Host: Jon

Okay. You said something interesting where it might be more efficient to utilize a huge high-performance cluster, big CPUs, and massive amounts for a short amount of time rather than the person who is waiting for this to process sitting around and wasting their time. And it's a unique perspective because if you can process this in maybe three to four hours or several jobs for several people now they're not sitting and waiting and you gave 'em their data, you shut down this cluster. That's a good way to look at it.

Guest: Micah

Yeah, I mean you can spend, I mean arbitrarily, let's say you spent a thousand dollars to, to run a cluster for a, you know, for three days, whatever, it's, and you're gonna run this job and it's gonna take three days and take a thousand dollars. Well, there's all the technology out there in the cloud especially. I can take that same thousand dollars, I can spend it all in the span of like six hours and just be done. So there's usually just no reason to let them run on small nodes for very long periods when you could run on bigger nodes faster, you know, just a bigger cluster with more parallelism and get things done. And then the data is now available. So now the scientists and the people of the company, they're not waiting for data to do their job. There's always like new stuff ready for them to process and they may easily start processing and we've had this happen, they'll process some stuff and they go, Hmm, we are missing a piece of information that we need.
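
The thousand-dollar example above is simple arithmetic, sketched here in Python. The price of $0.50 per node-hour is an assumed placeholder, and the calculation assumes the job splits evenly across nodes, which real workloads only approximate.

```python
def nodes_needed(node_hours: float, wall_hours: float) -> float:
    """How many nodes it takes to finish a fixed amount of work in a given time."""
    return node_hours / wall_hours


def compute_cost(nodes: float, wall_hours: float, price_per_node_hour: float) -> float:
    """Pay-per-use cost: you pay only while nodes are running."""
    return nodes * wall_hours * price_per_node_hour


if __name__ == "__main__":
    price = 0.50              # assumed price per node-hour, for illustration only
    work = 1000 / price       # node-hours that a $1,000 budget buys
    for wall_hours in (72, 6):  # three days versus six hours
        nodes = nodes_needed(work, wall_hours)
        cost = compute_cost(nodes, wall_hours, price)
        print(f"{wall_hours:>3} h wall clock: {nodes:7.1f} nodes, ${cost:,.2f}")
```

Under these assumptions the bill is identical; the only thing that changes is how long people wait for results.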

Guest: Micah

Can we add this task to the pipeline? And we can go, Okay, yes, and then we can create a second pipeline and reevaluate all that processed data for this additional piece. So they're not waiting there. It's not like, oh well, you know, we, we waited weeks to finally find out that we forgot something or this sample's just bad. Like sometimes data is just bad, it just doesn't work for what they need. So you might as well get it all processed so they can quickly go, yeah, this is all bad. Like those can just be ruled out. Let's get more data in, let's get more samples, and let's keep going with this processing. There are some places where using smaller nodes can be advantageous. You can kind of make use of some of AWS's Spot pricing, um, to really pick and choose what instances you want, um, to really maximize cost efficiency.

Guest: Micah

Um, and in those cases, you're running very specific types of jobs that do a lot of highly checkpointed jobs, um, that can withstand interruption and restart constantly, you know, but in those worlds you can potentially play a game that if you're like, Well I don't need the data for a week or two that you're like, I can just let this run and I'll just pick and choose when like spot instances are cheap and run and, and once the price goes up to a certain point I can be like, eh, I'm out. I'll just wait and then I'll keep going and then I'll wait and I'll keep going. So those are kind of more niche cases. So you have to evaluate kind of, um, what their needs are and, and what kind of jobs they're running.
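
The "run while spot is cheap, step out when the price climbs" game above can be simulated in a few lines. This is a toy model with invented hourly prices, not a call to any AWS API; it assumes the job checkpoints so that completed hours are never lost when it pauses.

```python
def run_on_spot(hourly_prices, work_hours, max_price):
    """Simulate a checkpointed job that only runs when the spot price is acceptable.

    Returns (hours of work completed, total cost, wall-clock hours elapsed).
    """
    done, cost, elapsed = 0, 0.0, 0
    for price in hourly_prices:
        if done == work_hours:
            break                 # job finished, stop watching prices
        elapsed += 1
        if price <= max_price:
            done += 1             # progress is checkpointed, so pauses lose nothing
            cost += price
        # otherwise: sit this hour out and wait for the price to drop
    return done, cost, elapsed


if __name__ == "__main__":
    prices = [0.10, 0.30, 0.12, 0.40, 0.11, 0.09]  # hypothetical $/hour
    print(run_on_spot(prices, work_hours=3, max_price=0.15))
```

The trade-off shows up directly in the return value: the job needs three hours of compute but five hours of wall-clock time, which is only acceptable when nobody is waiting on the data.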

Host: Jon

I like how PTP is taking a, you know, customer approach, right? A detailed approach to these jobs. Not only the cost of running things in parallel or how they can minimize, you know, some of the tasks that are involved and taking the infrastructure management away from the customer so they can focus on the actual job of analyzing the data. They're considering all customer aspects. Who's building some of this automation around it? Uh, are you jumping in and building like, okay, I need to shut down this instance when not in use and we're gonna save some cost, or do I need to build a script to parse this amount of data? How, how is that defined?

Guest: Micah

So normally we use a lot of, like, tools that are kind of built in, um, either to AWS or with Nextflow. Uh, we like to use a lot of AWS Batch in conjunction with Nextflow. So jobs are run as needed, and we're not even really desperately worried about the infrastructure. We've got a job running, Nextflow kicks this job off, and once that job finishes, those compute nodes just go away. We just don't even ever worry about keeping anything running. Things only exist for the length of time that we need for that job to process. Um, so we do that. I mean, that's kind of an easy way for the pipeline to work. Um, we've got some other, um, issues with some analyst boxes and stuff where they want to analyze data afterward, um, where we can play around with some CloudWatch, you know, logs and, like, alarms to look for inactivity on boxes and automatically terminate those, you know, kind of stop that. Like those are a little more, um, in this case, that's kind of more after the fact of the HPC, but most of the stuff we do with the HPC is all very ephemeral.
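
One way the "alarm on inactivity, then terminate" idea for analyst boxes might look is a CloudWatch alarm on low CPU with an EC2 terminate action. The sketch below only builds the alarm parameters, so it runs without AWS access. Using average CPU as the inactivity signal, the 5% threshold, the one-hour window, and the alarm name are all assumptions, not details from the conversation.

```python
def idle_alarm_params(instance_id, region, cpu_threshold=5.0, idle_minutes=60):
    """Build parameters for a CloudWatch alarm that terminates an idle instance.

    Assumption: 'idle' means average CPU below `cpu_threshold` percent
    for `idle_minutes` in a row.
    """
    period_seconds = 300  # evaluate in five-minute buckets
    return {
        "AlarmName": f"idle-terminate-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_seconds,
        "EvaluationPeriods": idle_minutes * 60 // period_seconds,
        "Threshold": cpu_threshold,
        "ComparisonOperator": "LessThanThreshold",
        # Built-in EC2 alarm action that terminates the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:terminate"],
    }


if __name__ == "__main__":
    params = idle_alarm_params("i-0123456789abcdef0", "us-east-1")
    print(params)
    # With boto3 installed and credentials configured, this could be applied as:
    #   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
```

Keeping the parameters in a plain function makes the thresholds easy to review before anything is allowed to terminate a machine.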

Host: Jon

So, Micah, you mentioned AWS, you mentioned internally. Are you seeing a lot of customers doing private public or a hybrid type of method?

Guest: Micah

There are some. Um, so I would say a number of 'em do a form of hybrid. Um, just maybe not exactly what you're thinking of as hybrid. So we do have some customers that want to have some on-prem, uh, HPC. They have some on-prem servers. Some of it's with, um, having big, like, GPU clusters. The availability of GPUs in the cloud, the availability of GPUs anywhere, has been a little questionable recently. I think, you know, maybe the current Bitcoin plummeting might help that. But, so they'll build these big GPU clusters that they can use on-prem, but then they want to be able to, like, push some jobs into the cloud and run some locally. Um, so we've got some of that. Um, and Nextflow has some ability even inside of its pipeline to control where the process gets sent. So it can send it locally, or it could send it across the network and, you know, have that run in the cloud. The bigger thing we get into, that's kind of more hybrid, is companies that have, like, labs.

Guest: Micah

So they have wet labs, they're doing sampling data, they have big samplers. That data all gets run and gets stuck onto network storage on their local on-prem. That needs to get moved into the cloud for processing. Um, so we help build some of that for them, to kind of give them an automated, like, Storage Gateway that AWS puts out there, so that they can run data from their sampler. It gets stuck onto the Storage Gateway and automatically shows up in S3 in the cloud. And we can build automation so that as soon as that data arrives in the cloud, we automatically kick it through these pipelines. So they don't even have to initiate them. It is just a, Oh, we ran some, we ran some labs, we ran some, uh, experiments in the lab. We got the sampling data. We, you know, they finished up at five in the afternoon. They come back the next morning, Oh, that sample already went up into the cloud. It's already been processed, it's come back. Um, that's just all done, nice and tidy.
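
The "data lands in S3, the pipeline starts on its own" automation above is commonly done with an event notification that calls a small function. Here is a hedged sketch of such a handler in Python. It parses the standard S3 event notification layout; the `submit` hook, the bucket name, and the `.fastq` filter are hypothetical stand-ins for whatever actually launches the pipeline.

```python
from urllib.parse import unquote_plus


def new_objects(event):
    """Extract (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 events, so decode them.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def handler(event, context=None, submit=print):
    """Lambda-style entry point.

    `submit` is a hypothetical hook that would start a pipeline run for one
    input file; it defaults to print so the sketch runs anywhere.
    """
    submitted = []
    for bucket, key in new_objects(event):
        if key.endswith((".fastq", ".fastq.gz")):  # assumed raw-data file types
            uri = f"s3://{bucket}/{key}"
            submit(uri)
            submitted.append(uri)
    return submitted


if __name__ == "__main__":
    sample_event = {"Records": [
        {"s3": {"bucket": {"name": "raw-bucket"}, "object": {"key": "run1/sample+A.fastq.gz"}}},
        {"s3": {"bucket": {"name": "raw-bucket"}, "object": {"key": "run1/notes.txt"}}},
    ]}
    handler(sample_event)
```

Because nothing waits on a person to press a button, a sample uploaded at five in the afternoon can be fully processed by the next morning.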

Host: Jon

I enjoy the automation part because you just indicated it's five o'clock and a lot of people are knocking off for the day, right? And they're like, Oh, I'll start processing this tomorrow. Now they've just lost, like, 12-plus hours of moving and managing the data, whereas if the data is sitting on storage, like a Storage Gateway, then it's up in AWS, and once it detects that there's, uh, you know, new content that's been, uh, uploaded, it's gonna start processing that utilizing Batch. And when they walk in the next morning, they can start analyzing the samples, doing more samples off of that, and reprocessing the data.

Guest: Micah

Yep. Um, it, the cloud never sleeps basically. So we automate this, we put in all these pipelines, these things can run at any time of the day, any hour. So sometimes it's an issue if they're all doing stuff and it gets to the end of the day they finished their experiments. When we can run overnight, that can all be done. There's another element where some of these companies are global. They have scientists all across the globe. You'll have some guy in Germany working, he can finish up, put data in there, you know, it can be processing. So by the time, you know, people in the US get to it several hours later, it's already gone through certain steps. By the time it gets to like people in China, it's had time to process. So you're not stuck on waiting for some people's schedules, you know, based on your, your time zones either.

Host: Jon

I imagine you're considering that when building the pipeline of the different time zones thinking okay, over in the UK this needs to process in three hours to be available for those in the US. Now you tried to improve your processing and pipeline to be more efficient and available for those who come into work. So they're not sitting around waiting for this data.

Guest: Micah

Yeah. So we can optimize that, and try to get that shrunk down. You can start figuring out like, oh well the person in, you know, in England or the United States, whatever, they only really need this subset of this information so they can continue. So we can front load that into a pipeline so when it starts that gets processed first, it gets automatically as these pipelines run outputs from different steps get made available to the end users as they go. So you can make sure that that data gets available while the rest of the pipeline kicks off. You can have branching pipelines or simultaneously run that. We have a giant pipeline that runs data end to end. Once it gets to step three, it outputs this data in this directory, at which point you can have automation pick it up and say, well now I take that output and I run a whole different pipeline that may be completely independent of this, that does some different kind of analysis that transforms that data in different ways for a different part of the company or different end user.

Guest: Micah

So you can kind of line those up so that you can best utilize the time that people are away from the office. So when they're not there, you know, it just keeps running. So when they are in the office, they're not waiting, you know, for data. And a lot of this gets into that point from earlier of, you know, well, why would you necessarily wanna have a pipeline that runs for three or seven days when you can make it run for hours? Like just run hundreds of things right off the bat, just keep 'em constantly running, and then you just don't even have to worry about this as much. Because it's like everyone has such a backlog of stuff to start working on, and there's not a hyper concern of, oh, I gotta wait for this to finish. 'Cause you're like, well, we already finished it for 500 samples. You've got some stuff to crunch through for a while.

Host: Jon

Oh, nice. I like how PTP is analyzing all aspects and handling all variations of the pipeline from beginning to end, and the automation. I have about two more questions for you before we wrap things up. Okay. You're handling everything for this pipeline, but what if I, as a customer, wanna learn to handle this pipeline and be able to do some of the tasks myself? Do you help and educate the customer, saying, hey listen, if you want this data, you wanna move it out, you wanna manage this type of a pipeline, here's how you do it? Or do you guys just wanna, you know, own the whole process?

Guest: Micah

No, with most of our customers, um, a lot of them get involved in writing parts of the pipeline. Um, so we usually get those pipelines up using either native tools like AWS CodeCommit, where we can put it there, or some of 'em have it in, like, Bitbucket or GitHub, the more obvious one for a lot of people. It's kind of a common code repository so we can all work on it. Um, and it lets us work with them to do things. So we'll get questions: hey, I would like to add this. Can you show me how I would need to do this? And sometimes it's, well, I'll show you the example and I'll put it in there, like, this is kind of how you would need to phrase it. Here are your inputs, and here are your outputs.

Guest: Micah

And then they can go in and go, oh well, these are the actual commands, these are all the flags, you know. And then sometimes I'm like, I don't know what the outputs are. So you put that in there, and I'll help you with the framework of how it goes. And then you decide what the outputs were from that task, and what inputs you need. And then we can do some testing. Some of it I can just test locally and find out whether it worked or didn't work, syntax errors, whatever. And then we'll work with them. Um, so if they want to get into it, we're perfectly fine with letting them, um, get in there and start managing their pipelines. It makes it, you know, easier at the end of the day sometimes.

Guest: Micah

Because they don't always have to wait for us if they need to make a change. Like, oh, we need to change the flag on this one command because it's not the right option, or we need to add an option. They can just go in there, change it, push it out. All right, the pipeline keeps running. Um, and sometimes, as I said, they'll need to run subsets of pipelines, so it's easy to say, oh, we'll just take a section of this, create a new pipeline, and rerun parts of old data. Um, so yeah, we have plenty of 'em that'll help build their pipelines and are interested in how that works. Um, because it, you know, kind of helps them as they go forward, because they get some ideas, like, oh well, could we do this also? Yeah, start sticking that in there. We'll put that in and, you know, they'll ask us for help, or they might have a pipeline finished and they're like, we need to automate this now. Like, can you help us auto-hook this into the data so that it automatically runs and processes?

Host: Jon

Nice. So Micah, uh, what about the security of all this data? How are you handling the data, whether it's in transit or at rest? Because there's always that concern that even though these are samples, you know, people with PHI need to understand and use and analyze them, or can read them. So how are you handling the security of this data?

Guest: Micah

So we monitor all that, and it's all built according to best practices as we set these up. So we like to stick data in S3 buckets. That's where we start and stop. Those things are all encrypted. We keep them all, you know, encrypted at rest. Um, when the Nextflow pipelines and these Docker jobs, you know, run tasks, all that data is moved to and from S3 across, like, secure tunnels, HTTPS, or across S3's kind of internal gateways, um, as it's moved into those buckets. Most of these pipelines do not need to reach the internet, so they're all run inside of private subnets. None of these boxes have any access to the external world. They can't get out, nothing can get in, and there are no public IPs that even exist on them. They run, and we pretty much tend to isolate all of the HPC cluster environments anyway.

Guest: Micah

So the only thing running inside of those VPCs is this HPC stuff. There aren't other computers or boxes or servers in that company that even exist in that same space, so data cannot leak sideways either. And the biggest thing for it is they're all ephemeral. We just don't leave servers up. So the server launches, it runs, it shuts down, everything gets wiped off of it, and it doesn't exist anymore. And quite frankly, you know, servers like that don't get hacked in a matter of hours. That's just not how that works. And there are no user names or anything on this, so social engineering doesn't get you into this environment either. So we keep it pretty well locked down. Um, and then we'll work with the end customer to help them understand how to protect data coming into the environment and data leaving the environment. Because there's a point where it passes outside of our control. Um, so if they take data outta S3 and start sending it wherever, I can't do anything about that except, you know, explain to them what the risks are of what they're trying to do.

Host: Jon

Security at every turn. Plus educating your customers on security awareness, what they're doing, and how they're handling and analyzing the data. Real quick, Micah, uh, I know I said two questions, but I have one more for you. I'm sure you've run into resource limits on a lot of these, uh, HPC environments. Not only in AWS, where there are hard and soft limits within the environment, but also even in the physical or private, uh, data centers. How are you handling resource limits? Are you aware of them? Is monitoring in place?

Guest: Micah

Yeah, I mean, we look at 'em. Um, and you certainly can get some of these pipelines that'll run into, like, the thresholds of what they're allowed to do, um, at a company. Um, some of that gets into prioritization: helping them prioritize data, prioritize pipelines that need to run over other ones. Batch has built-in ways to prioritize different pipelines, so you can prioritize what gets access to resources over other things. It kind of lets you, you know, scope that out to say some of these big long-running ones might get shifted down so that we can always run the quick things. But you can always run into a point where you've just hit, you know, the limit of what AWS resources are even available. Um, so in those cases you just have to start working with them to help understand, you know, how many tasks need to be done, what tasks need to be done, and what you are looking at as an endpoint.
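
The prioritization described above can be illustrated with a small, self-contained Python sketch. It is an assumption-laden toy, not how any particular scheduler works internally: the `Job` class, the job names, and the `vcpu_limit` figure are hypothetical, and the convention that a higher number means higher priority is chosen here for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    name: str
    priority: int  # assumed convention: higher number is served first
    vcpus: int     # resources the job needs to start


def schedule(jobs: List[Job], vcpu_limit: int) -> Tuple[List[str], List[str]]:
    """Run one scheduling pass and return (started, waiting) job names."""
    started: List[str] = []
    waiting: List[str] = []
    free = vcpu_limit
    # sorted() is stable, so jobs with equal priority keep submission order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        if job.vcpus <= free:
            started.append(job.name)
            free -= job.vcpus
        else:
            # Does not fit under the limit right now; it waits for capacity.
            waiting.append(job.name)
    return started, waiting


jobs = [
    Job("long_reanalysis", priority=1, vcpus=64),
    Job("quick_qc", priority=10, vcpus=8),
    Job("daily_report", priority=5, vcpus=16),
]
started, waiting = schedule(jobs, vcpu_limit=32)
```

With a 32 vCPU ceiling, the two quick jobs start and the long-running reanalysis is shifted down until capacity frees up, which mirrors the trade-off described in the answer.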

Guest: Micah

Is there data in here that we don't need to save, that we don't need to process, that we don't need to copy? These things can shorten the length of time that it takes to do this. Some of it is, well, if we run out of resources in this region, this is all basically code. We can take it, and I can spin up a pipeline in this region, I can spin up a pipeline in Europe, I can spin it up over here, I can run them in multiple regions all at the same time, all over the place. So, you know, sometimes it is just kind of moving that data around, dispersing it so that you're getting access to a wider pool of resources. Um, but a lot of times there's just a better way to do some stuff. So sometimes it's helping them understand the software they're running and the options that are available to that software.
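
As a rough illustration of dispersing work when one region runs out of room, the following Python sketch fills each region up to a capacity and spills the remainder into the next. The capacities and sample names are made up, and the function `spread_across_regions` is hypothetical; real limits would come from account quotas and actual availability, and moving data between regions has costs this sketch ignores.

```python
from typing import Dict, List


def spread_across_regions(
    samples: List[str], capacity: Dict[str, int]
) -> Dict[str, List[str]]:
    """Assign samples to regions in order, never exceeding a region's capacity."""
    plan: Dict[str, List[str]] = {}
    remaining = list(samples)
    for region, slots in capacity.items():
        plan[region] = remaining[:slots]
        remaining = remaining[slots:]
    if remaining:
        # Anything left over exceeds the combined capacity of all regions.
        plan["unplaced"] = remaining
    return plan


samples = [f"sample_{i}" for i in range(7)]
# Hypothetical per-region limits on concurrently running samples.
plan = spread_across_regions(
    samples, {"us-east-1": 3, "eu-west-1": 2, "us-west-2": 4}
)
```

Because the pipeline definition is code, the same definition can be launched once per region against its share of the samples.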

Guest: Micah

Like, oh, you know, how many threads can we use? Can we control those threads to make them more efficient? You know, maybe we put different options on to limit the scope of what we're looking at. Can we put tools in a different order to make one go faster? Instead of doing this huge processing on a full sample, can we first run that full sample through a couple of smaller tools that'll, like, splice out the important bits that they're looking for, and then take that data and process it, you know, as opposed to looking at larger full sample sets? So that's kind of a game you're always gonna play. There's a little bit of a whack-a-mole aspect in that there's always something that can happen. Um, so we just kinda work with them to monitor those resources, to monitor what's being used, to make sure we're using stuff efficiently, and to, you know, look at trends. Are we using more CPU than we used to? Are we ramping up? Are we ramping down? How can we tighten that down? And we just kind of have to take that as it comes and help work with them on it.

Host: Jon

Nice. I like how you're analyzing all aspects of the environment, and I like that you mentioned, uh, utilizing different regions. You're like, well, I may have a limit in this region, but now I can spin it up in all these other regions and handle the data at the same time. Micah, thank you so much for joining me. This has been a very interesting conversation.

Guest: Micah

Thank you. It's been fun.

Host: Jon

I wanna thank you, Micah, for joining the show. I appreciate it. Everybody, Micah Frederick, CloudOps lead architect at PTP. My name's Jon Myer. Don't forget to hit that Like, Subscribe, and Notify, because guess what, we're outta here.

Guest: Micah

See you.

 
