
HistoQC, an open-source way to control the quality of pathology images

This episode’s guest, Andrew Janowczyk, is a computer scientist who has been active in the field of digital pathology since 2008. Before turning to digital pathology, he worked across the globe and across industries.

He was a salmon fisherman in Alaska.
In Austria, at the United Nations (UN) International Atomic Energy Agency (IAEA), he contributed to the work that won the IAEA the Nobel Peace Prize.
He taught English in China.
He helped build an oil facility in the Nigerian jungle and lived in Nigeria for a while.
Then he lived in Germany…

A close family member’s cancer diagnosis made him aware of the field of pathology, and he decided to switch gears and put his energy and brainpower into advancing this discipline.

He moved to Mumbai, India, to get his Ph.D. and started his digital pathology research. Currently, he works at Case Western Reserve University (OH, USA) and Lausanne University Hospital (Switzerland).

Fast forward to 2018: after 10 years in the field and after overcoming many challenges, he encountered another one, the suboptimal quality of the whole slide images from the TCGA data set. To solve this, he wrote software that excludes all the non-usable regions of the slides and made it open source.

Why? Why not commercialize such a useful tool?

Andrew’s answer: “I wanted to release it open source just to fundamentally change the world. I wanted to change the way that we enacted digital pathology as a science, and one of the problems with digital pathology science versus other sciences is that we don’t take measurements. And as soon as we start taking measurements, we have the ability to do better.”

In addition to his main work, Andrew also runs a blog with resources for computer scientists working in the field of digital pathology.


Transcript

Aleksandra Zuraw: So today my guest is Andrew Janowczyk, and I’m pronouncing this with my Polish pronunciation because it’s a Polish name, but you can later say how you pronounce it. Andrew is a computer scientist and a researcher at Case Western Reserve and Lausanne University Hospital. He has been active in the field of digital pathology since 2008 and is the author of the open-source software HistoQC. Hi, Andrew. It’s great to have you on the podcast. How are you?

Andrew Janowczyk: Hello. Thank you for having me. I’m doing quite good today.

Aleksandra: Great. So that was a bit about you. I gave a little introduction, but tell us: what’s your background and how did you end up in the field of digital pathology?

Andrew: So, I think I’ve been quite fortunate. I’ve had a very interesting path in order to get here. So initially I studied computer science and [00:01:00] applied mathematics at Rensselaer Polytechnic Institute in upstate New York. I also received my master’s degree there in computer science. Afterwards, I actually moved to Alaska for about three months, where I was a salmon fisherman.

After that I moved to Vienna, Austria, where I worked at the United Nations International Atomic Energy Agency, and I wrote the software that inspectors use to verify material at the different nuclear sites around the world. So, I was one of the developers of that. I was quite junior at that time, so I don’t want to take too much credit there.

After that contract, I moved to China and taught English for about eight or nine months. Then I started working for a large oil company, where we built a billion-dollar oil facility in the Nigerian jungle. So, I lived in Nigeria for a while. I lived in Germany, in Hamburg, for a while. Then my life kind of took a little bit of a twist.

A very close family member of mine was unfortunately diagnosed with cancer. And she’s fine now. But I realized that there were [00:02:00] probably other things I could and should be doing with my life. So, I quit my job and moved to India, to Mumbai. And I received my PhD there, in computer science and electrical engineering I believe.

And that’s actually where I started my digital pathology research. So since then, I’ve basically been on that train. Yeah. Interestingly, no fantastic weather.

Aleksandra: That sure is a crazy story of how to get to pathology. I don’t think I have heard a story like this before.

Andrew: Yeah, it was fun.

Aleksandra: And you’re originally from where?

Andrew: I’m originally from Long Island, New York, in the US.

Aleksandra: Okay. So how do you pronounce your last name?

Andrew: We pronounce it “Jenna-wick,” which I think is a very Americanized version of it.

Aleksandra: It’s the Americanized version. Correct. Okay. Well, that is indeed a crazy story. So, we said where you’re working, what are you doing there now?

Andrew: So I think our work basically [00:03:00] focuses on two components, really. One of them, which I think we’ll discuss later and which you mentioned, is HistoQC. So, some component of our work is to build tools that make digital pathology research easier. That’s basically the idea: we have a lot of experience.

As you mentioned, I’ve been doing this since 2008 and we have a lot of things hanging around. We have a lot of experience. We have a lot of small scripts and little tips and tricks and basically experience that you build up over the years. So, some component of that is to try and take that and turn it into tools that encode that experience.

So people that are less experienced can still obtain great results without having to, for example, spend 10 years going in and studying the field. So, one of my main goals really is what I call this democratization of digital pathology, such that people anywhere can get involved with it, to kind of lower that bar for entry, such that they don’t need to be,

we’ll say, especially skilled, or have access to, let’s say, a digital pathology lab, or have access to high-performance computing clusters. The question is what we can do to make it such that everyone can work on this topic and to make it more approachable regardless of what their surroundings are. So that’s kind of the first component of our research.

The second component of our research is more in the biomarker discovery field, where we actually want to go and look at patients and their digital pathology slides and try, for example in the context of cancer, to predict the progression of their disease, how aggressive that disease might be.

We’re very interested in trying to predict therapy response. So, for example, you might know immunotherapy is a very hot topic now, and it has great results for some patients, but not all patients. It turns out a percentage of those patients, in fact, experience near life-threatening complications as a result of taking that therapy, because their body essentially just starts to attack itself very aggressively, to the point where the person is close to death and they have to intervene in order to stop that from happening.

We don’t know ahead of time which of those patients [00:05:00] are which: the ones that are going to respond well, the ones that are not going to respond well, and the ones that are going to have what we would consider a very poor response. So, our hypothesis is that by looking at digital pathology slides, there may be certain characteristics, features, presentations of disease, tissue microenvironment components that we can very precisely quantify that allow us to predict which patients are in each of these categories ahead of time.

So I see both of these facets, the tool building and the biomarker discovery, as kind of going hand in hand. As we learn more in our, let’s say, biomarker work, where we’re really using the most cutting-edge, state-of-the-art technology, we fail a lot. Right? I mean, research is a lot about trying and failing.

So we fail a lot and say, Oh, this didn’t work. This didn’t work. This didn’t work. This didn’t work. This didn’t work. And that’s okay because we have the experience to do that. And our failures tend to happen very quickly on the order of, let’s say a few days or a few weeks. But then as soon as we identify something that does work, that we think is generalizable, the goal is to then translate that into a tool where we say, Hey, this is going to work 95% of the time.

Now, inexperienced people that would require six months or a year to have that failure, first of all, don’t have that failure, and second of all, can use that tool immediately to get where they need to be in order to start doing the types of research that they’re interested in.

Aleksandra: [00:06:19] So democratization of pathology. This is interesting because I have heard this concept from computer scientists. They kind of recognize the value of their contributions to this field and want to make it accessible to other people. I think we pathologists look at it from the other end, so it’s like, we want the tools.

[00:06:43] We are the people who want this. So yeah, that’s very interesting. And obviously biomarker discovery, like you said, is a hot topic with a high rate of failure. So, if we have tools to accelerate that, like in the whole drug development process, if you fail quickly, [00:07:00] like you said, you are ahead of the curve.

Andrew: [00:07:04] Yeah, and I think that there are some concepts there. And this is one of the nice things about being involved in what I would consider a highly translational but also a highly collaborative science: every field brings with it its own best practices and its best experiences. So, one thing that I’ve seen in my own lifetime, during my career,

[00:07:24] really, if I can concretize that: in that period of time, when we first started, very few people released their source code for the tools that they were developing. So as a result, during my PhD, for example, I would read a paper and then I would spend three months trying to implement that paper, because ultimately the written form of an algorithm is quite different from the implementation itself.

[00:07:46] The written form is more of a general idea, while the exact nuances and the details that are needed are trial and error. And of course, the author of that paper doesn’t have an appendix where they say, here are the 55 small variables: if you use 0.001, it [00:08:00] doesn’t work, but if you use 0.00012, it does work. You have to go and discover those things yourself.

[00:08:05] And that was expensive in terms of time, in terms of cost, in terms of effort. So, during my lifetime, we’ve seen what I think is the benefit of, for example, working with computer science folks: they say, well, we do open source. We know what open source looks like. We know how to share code. We know how to package code and write readme files and how to put things in a nice structure so that someone else with a similar background is more easily able to digest it.

[00:08:31] And within my lifetime now, any super impactful paper that you see, almost all of them come with code, where they say: you don’t believe my results? Go download it and try it yourself. Because there was always this discrepancy where you would go and try and program someone’s work from a paper, then you’d meet them at a conference and say, oh, I tried it, but it didn’t really work.

[00:08:49] And they said, well, how did you do it? Oh, I did it like this. Oh, actually I meant it like this. So, there’s so much variability that I don’t know if it just doesn’t work because I implemented it differently than they intended, or [00:09:00] does their algorithm just not work on my specific data set, or where’s this disconnect?

[00:09:04] Now, once we’ve released the code, we say: this works on this data. Here’s the code. Here’s the data. Try and run it yourself. And you click run, you wait, sometimes a few hours, sometimes a few minutes, sometimes a few days, and you come back: wow, this did work on this. Okay, I now know that something is working. That is extremely useful because, especially in the context of digital pathology, I can go and say, how difficult is it for me to replace their data with my data?

[00:09:29] Oh, you’re using whole slide images. I’m using whole slide images. Let me put in my data. Now I’m in a position where I can simply just quickly replace their data with my data and click run on something that I know already works on their data, right? So, one of those variables is now completely removed. Does it work?

[00:09:43] Yes. I know it works. That’s removed. Now the question is, does it work on my data? The new variable there is my data. I click run; if it works, fantastic. I’ve potentially just saved on the order of years of effort trying to get to that point. And biomedical engineers, and I think medical doctors before, were [00:10:00] aware of the concept, but maybe they didn’t know how necessarily to do that in the best ways possible.

[00:10:05] And weren’t exactly sure of where the boundaries were. So, I imagine if you’re a medical doctor, you are sworn to some level of confidentiality. You’re not going to go and say, well, here’s my data set, I’m just going to put it out on the internet with all your patient names and their addresses and their phone numbers.

[00:10:19] Like, that’s obviously completely illegal and ethically wrong as well. So, a question then is, well, what is the minimum amount of data that needs to be released that is still useful for the community but also respects the privacy that’s needed for the patients? And I think a lot of the medical doctors really never asked themselves that question because

[00:10:39] they didn’t understand what data was actually needed to do this type of research. So, through these collaborations, they say, oh wait, all you need is the image data, and I can strip away the rest? Yes. Oh, then they’re like, oh yeah, I can easily get approval for that. Give me a few days. And then they’ll go

[00:10:53] and go through the process in their hospital. And voila, now you have these extremely large, beautiful datasets that are very useful [00:11:00] and rich, richer than we could develop on our own. But they didn’t know that they should do that, and at the same time, they didn’t know how to do it properly.

[00:11:08] So only through these, let’s say large collaborations are we able to combine all of these insights and all of this knowledge to actually make things viable for people to use at scale.

Aleksandra: [00:11:19] So regarding the collaborations, have you been at those institutions from the beginning of your digital pathology path or did you switch?

Andrew: [00:11:27] So initially I started, as I mentioned, in India, in Bombay. My advisor there, Professor Sharat Chandran, actually met Anant Madabhushi at a conference a few weeks before I joined the PhD program. So, I went and spoke with Sharat and I was like, oh, I’m interested in doing cancer image detection type stuff. He’s like, to be honest, Andrew, I don’t really have a lot of experience in that; my experience is in image analysis, but not the digital pathology component. But I just met Anant.

[00:11:53] You should contact him and maybe he can be a co-advisor. So, I contacted Anant, and we have been working together ever since; I think we [00:12:00] actually just celebrated our 10-year anniversary last year. So, I’ve been working with Anant for a long time because he’s fantastic to work with. I really have nothing negative to say in the slightest there.

[00:12:10] So he was at Rutgers at the time, so I essentially worked remotely on projects with him there. And then in the last, I don’t know, maybe five or six years, it’s difficult for me to remember, he moved over to Case Western, where I also transitioned with him. So, we continue to work there. And then Lausanne, in Switzerland: Switzerland is starting, I would say, on this digital pathology revolution as well.

[00:12:32] So I’ve started working there with more people and we’re starting some projects. It’s more, I would say, in its infancy, as the infrastructure that’s needed, for example the digital slide scanners, and figuring out how to correctly submit ethics reports for approval and those types of things are still being figured out.

Aleksandra: [00:12:48] I asked this because I wanted to know if this knowledge that you have developed is within the same institution or is it fragmented? It looks like it’s within the same team. You have had [00:13:00] Anant as your collaborator for so many years, so I guess it also travels with you. The other thing to comment on is the open source.

[00:13:09] So in parallel, though, with what’s happening, like you said, with the source code in publications: currently, I have noticed that you can now see, as an appendix, whole slide images with the markups of the image analysis, which was not the case before. Before it was, you know, you pick the best picture you have, or the best zoomed-in fragment of the picture you have,

[00:13:31] you put it in the paper, and this is your proof of concept that it worked.

Andrew: [00:13:36] Slightly biased, potentially.

Aleksandra: [00:13:38] Biased. I learned very quickly, working in a digital pathology company, not to believe screenshots. But now you don’t have to believe screenshots. I mean, not everyone is doing this yet, but this is a trend also on our pathology side: to include the whole slide images with all the markups, so people can see, without having the computer [00:14:00] background, whether it matches the morphology in the tissue.

Andrew: [00:14:04] So, if you think about the reason why that’s possible now, a lot of it has to do simply with technological development. So, for example, when we first started research, you would try and go and buy a hard drive to hold the whole slide images. The size of a whole slide image has not fundamentally changed in the last 10 years.

[00:14:20] They’ve always been about one gigabyte or two gigabytes, whatever it is. But the cost of storing that has changed. So previously you would buy a 10-gigabyte hard drive, it would be $500 or whatever it is, and you’d be able to store 40 slides on it. And that would be it, right? You would have to process and save it on this comparatively very small hard drive.

[00:14:41] And that was it. And as well, at the same time, the internet was much slower then, right? People didn’t have fiber and things like that. So, you were in a situation where now you have multiple gigabytes of data that you need to transfer to someone over, let’s say, a dial-up connection. You know, you’re looking at 55 years of transfer time and $2,000 [00:15:00] just to store this data on the other side, which makes it not feasible.

[00:15:04] Now we have things like Amazon cloud and cloud storage, which is super cheap. You can buy hard drives and set up your own server, which is super cheap. We go and experience upload speeds to Amazon from Case Western, because they’re both on that fiber backbone, of 60, 70, 80 megabytes a second. So, you’re able to upload slides in minutes instead of years.

[00:15:23] Really, it’s that order of difference. So, I don’t think people intentionally didn’t want to share their slides or share their information or those sorts of things before. But the cost associated with doing it, both in time and money, was just insurmountable. It was just not feasible to run a digital slide server at scale 10 years ago.

[00:15:44] And now a high school student could realistically set it up on a very modest budget and it’ll work and do what it’s supposed to do. So I’m not even sure how much of it has been people’s desire to change. I think maybe people have always wanted to do it. I certainly have; [00:16:00] maybe I don’t speak for everyone, but I certainly have. But now it’s feasible.

[00:16:04] They’ll say, how much is it going to cost us to release the data? Oh, it’s going to cost $10 a year on Amazon. Okay, fine, that’s a rounding error. It’s such an insignificant amount. And I think you’re pointing out that the fact that you release the data makes the paper more impactful, as opposed to now, if you don’t release that data, it actually limits the impact.

[00:16:24] So there’s even further benefit there, I think.

Aleksandra: [00:16:27] Yeah, I think the same, it was not intentional. And even, you know, if it was little screenshots and just tiny peeks into what was happening, at that point it helped drive it forward. And now we can see everything and we can verify everything. Let’s go back to the concept of open source.

[00:16:49] Can you tell the listeners what open source is, or what open-source software is, and a little bit about this initiative and the idea behind it?

Andrew: [00:16:58] So I think open [00:17:00] source kind of makes sense if you think about it in the context of working within a lab. So, you’ll go and work on a project, and maybe you work on that project with three or four different people, and everyone is contributing different components to that project. The idea of open-source software is really that, but, I would say, at a global scale. So a lot of the similar ideas and the best practices and how to go about doing it follow basically that paradigm, where you have an idea and you want to share it with other people.

[00:17:29] And you also want them to be able to contribute. You want them to be able to improve upon your idea, or even in some cases to maybe radically change your idea, or take a component that you’ve developed and say, I fundamentally disagree with how you’ve used that, I’m going to go and make something else that I perceive to be much better, without negatively impacting your work.

[00:17:47] And that’s possible through basically sharing this code, and it’s slightly different from, we’ll say, I email an author and say, hey, can I have the source code for your program? And we’ve done that over the course of many years, and [00:18:00] people do it to us as well. And I would say 99% of the time people say yes, right?

[00:18:04] One of the concerns has always been, you don’t want to put random stuff on the internet without any type of control over it. So, if people ask you, and you see, oh, they’re from a university, this is obviously a university student, you’re very likely to give them the source code because you’ve been in a similar position as that.

[00:18:19] Now the idea is, well, what does that control actually get you? What is the worst thing that can happen if someone goes and has access to this sorting algorithm? Like, the worst thing that could happen is people use it. Right? That’s the worst. In fact, maybe the worst thing that could happen is people don’t use it,

[00:18:37] where you spend hundreds of hours on this and then you find out there’s no interest in it. That’s probably the worst thing that can happen. So, a lot of the idea, especially in academic circles where we’re not really intending to, let’s say, commoditize these, or where it would be difficult to commoditize and productize

[00:18:54] some of these things, is that it’s easier to open source it so people can actually start to use it and provide feedback in hopes [00:19:00] of making it better. And there’s still room for, I would say, intellectual property. You can have a patent on something and still open source it and still retain some control over the licensing that you have.

[00:19:11] You can decide whether or not other people can take your code entirely, or whether they need to cite your code, or what the circumstances are under which they can use that particular code. Maybe the most forgiving license is one where you can literally release a project and someone else can productize it and sell it

[00:19:26] without writing a single line of code. But the question then is, well, if someone has a free version of that and someone has a paid version of that, why would a person go and take the paid version? So clearly, if that person is going to go in and make some money, they would have to add some value to it.

[00:19:39] And the price that people are willing to pay is going to be based off of what their perceived value of that added value is. And that sounds like a fair market to me. Right? People can decide, oh, I don’t actually need that service, or I need these other components as well. The open-source paradigm is quite interesting because what we’re seeing at the same time is people building consulting companies [00:20:00] around open-source software. So we can discuss HistoQC: we have a very defined scope, I think, of what we want HistoQC to do and the features that we are interested in implementing.

[00:20:11] We of course receive requests from people in the community. And yes, we’re happy to answer them and see if they’re interesting to us, and if we have the power to do it, yeah, we’ll do it. Right? So there’s obviously a lot of potential synergy there, where if you want something and I want to do that something and I just didn’t think about it, I’ll do it for you right now. I’ll pull it into my scope and say, this is now my job, I’ll take care of it. There are other things, though, which are perhaps radically different, that HistoQC could be used for, that are not within our scope and would basically spread us too thin if we tried to implement every single thing.

[00:20:46] Let’s say you really, really want that, though. As a pathologist, let’s say you really, really want a specific feature, but it’s unique to your institution, so there’s no one else in the world that would gain value from us implementing that. I’m [00:21:00] sorry, but we’ll have to look at that and say, listen, we have other things that we would like to implement that would benefit more people more greatly than a single niche thing. If we had a closed-source product, you would be stuck; you would say, well, this product just doesn’t do what I want it to do, I’ll have to suffer quietly.

[00:21:16] Another option now is that you can hire a consultant and say, here’s an open-source tool, here’s all of the source code, here’s the documentation of how the tool works; I want you to go and add this feature and I will pay you X amount to do that. Now the consultant can go and say, oh yeah, absolutely, I’m a computer programmer, we’ll say; I have read the source code, I understand the source code, it’s well documented, it makes sense to me, you’ve given me some data to test it on, I will go and implement this for you at this rate per hour. And that’s something that they can work out themselves. Now there’s an opportunity where that pathologist can get what they want.

[00:21:47] Of course they’ll have to pay for it; I think that seems fair. And at the same time, that pathologist now has that code, and that pathologist can also opt to make that code open source. So, in fact, they could pay for a feature [00:22:00] and say, well, I know no one else needs it in its exact form, but maybe other people could benefit from it with 10 minutes of modification. They could also opt to open-source it, either through their own,

[00:22:09] we’ll say, GitHub repository, or they can send it to us and we can include it in a third-party contribution folder, where we say, we don’t maintain this, but this is here as an example, in case you want to do this. So open source goes and creates this opportunity for lots of different people of different backgrounds to work together to improve something.

[00:22:26] Whereas if it was completely closed source, it’s basically stuck, it’s frozen in time, and only the person that owns it can benefit from it in that context.

Aleksandra: [00:22:38] So, what does HistoQC do?

Andrew: [00:22:42] I guess, as I probably...

Aleksandra: [00:22:44] and who is it for?

Andrew: [00:22:45] Yeah, so maybe we should tackle that in reverse order. So maybe I’ll start with why HistoQC came around. As I mentioned, one of our major lines of work is to try and do biomarker discovery, right? That’s what we’re interested in doing, and to do great [00:23:00] biomarker discovery, you need to have a large number of slides, right?

[00:23:03] You need to have a sufficiently powered dataset in order to be able to detect the particular effect that you’re interested in detecting. So, I had this idea for a study, and I said, you know, I’m going to use the TCGA data, The Cancer Genome Atlas, which you might know of, that has, I think, about a hundred thousand or 300,000 slides now available, of different cancers at different stages, fairly well annotated; a beautiful, rich data source.

[00:23:26] So I said, I’m going to use the TCGA data. So, I downloaded about a thousand breast cancer whole slide images, and I started to look through them and realized that the quality was more variable than realistically anything I had seen at that point.

Aleksandra: [00:23:41] And I don’t think people know that about the TCGA dataset.

Andrew: [00:23:46] And one of the challenges, of course, is that the TCGA, really the component that we’re talking about, is what’s called a DPR, or digital pathology repository. But that DPR is [00:24:00] multi-institutional. And that’s why there’s variability in these slides: someone that makes slides in New York doesn’t necessarily make them the same way as someone in California,

[00:24:08] or doesn’t necessarily make them the same way as someone in Michigan, just because of, let’s say, small differences in staining times, the particular stain that they use, the stainer or the machine itself that they use. And everything affects it slightly: for example, the temperature in the room, the humidity in the room. All of these have some type of chemical effect on the underlying stain, which affects the ultimate presentation.

[00:24:30] Thankfully, the TCGA all uses the same scanner, but if you go to larger slide repositories, every institution may have a different scanner. This is also going to impart some differences in brightness, differences in contrast, white balance. So, when you get all these samples together, the variability is ginormous.

[00:24:48] It’s huge. As well, at the same time, the TCGA was originally developed to facilitate molecular discoveries, not digital pathology discovery, right? So that’s an important distinction, because when they were [00:25:00] making this, they focused very heavily on, let’s say, the mutation calling, the methylation data, the copy number variation, all these omics data, and they did a great job in the quality control there.

[00:25:10] The digital pathology slides were just an extra, basically, where they say, well, we have this slide anyway, because they’ll scan the slide before they do the sequencing; because sequencing is expensive, making a slide is not expensive, and they want to make sure there’s actually tumor in what they’re about to sequence. So they go in:

[00:25:26] they’ll make a slide, they’ll look at it, they’ll estimate what the tumor purity is and say, oh, the purity is zero, we’re not going to sequence it. So they’ve saved a few thousand dollars there, and they’ll keep going until they find a sample that they like. And then they provided those digital pathology slides essentially as a service, to say, this is what that RNA data looks like

[00:25:48] in its disease presentation; just in case you’re interested, this is what it looks like. So then if you have some questions about, oh, this RNA, I wonder how many lymphocytes were there, you can now look at the H&E slide and come up with a rough estimate. [00:26:00] So that was the original purpose of that. Then folks like myself came along and said, well, we can do computational analysis on these.

[00:26:06] Of course, everyone was very open and loved this idea, because it makes sense: the data’s there, why not use it? But the challenge is that that data wasn’t very well quality controlled before it was put onto that repository, because no one had initially intended for it to be computationally analyzed.

[00:26:21] So ultimately, if you know you’re going to go and do something, you’re going to build to a standard for that thing that you want to do. If you don’t know you’re going to do it, then you’re just like, well, I’m just going to, you know what, bake a cake. It doesn’t matter if it’s good or bad, it’s going to be a cake.

[00:26:34] Right. But if you say, I need a cake that has exactly these properties, now you think about how you get those particular properties. And at that time, they didn’t do that. So, we ended up with very varying quality of data. The way that we typically went about it, and we did this when, for example, I did my PhD, was that our data sets were much smaller.

[00:26:52] We would have maybe 30 patients, and 30 patients is small enough that I would manually open each and every slide. I would find [00:27:00] any artifacts in there, for example a crack on the coverslip, tissue folding, blurriness, overstaining, regions that were too thickly cut, whatever it was. I would manually go and look at each and every single slide.

[00:27:11] And I would sit there with a digital pen and circle the regions that were good for computation, right? So that means that they were, we’ll say, artifact free and of high quality. And that would take about an afternoon, we’ll say, with 38 slides. Now, when we start going and downloading TCGA data, it’s more noisy than

slides that were made specifically for computational study, while at the same time you have thousands of slides. So, it’s no longer possible for me to go and say, I’m going to open up 5,000 slides and quality control all of them and manually go and identify where the blurry regions are or where the cracks are.

[00:27:46] It’s just not feasible anymore. I realized that as I started to do that process: I went in, I annotated five of these images, and I was like, this is

Aleksandra: [00:27:54] Thank you

Andrew: [00:27:55] just not scalable. I’m like, forget, this is this, this is never going to finish. I’m going to die. Before I finish analyzing these, [00:28:00] these slides, then I realized at the same time that we had tons of code that did various parts of this, right?

[00:28:05] So we have, like, very basic blur detectors just to identify blurry regions. We have some of these little scripts floating around. So, the idea was, why don’t we codify this and solidify it into a tool that’s actually usable, instead of little pieces of MATLAB, some of it in Python, you know, some of it in C++, all just kind of floating around.

[00:28:28] Why don’t we take the time, do this properly, organize all of our experience, even beyond the code, try and imagine what the different types of quality control failures are, and build a tool specifically for that. And that’s why HistoQC came about. So, in the end, what HistoQC does is you can provide it with whole slide images, and it will identify where on the slide there are artifacts, like cracks, like pen markings, like tissue folding, regions that are too dark, regions that are too light.

[00:28:55] It’ll identify those and mask all of those out for you and give you a binary mask that [00:29:00] says these are the good regions for computation. At the same time, it will go and produce metrics that measure the properties of that slide, for example, how bright the image is, sorry, how bright the tissue is.

[00:29:12] So we’ll go and extract all of the background and all of that stuff, and when we have finished doing all of the artifact detection, the mask that’s left is just tissue. From that tissue, now we’ll go and extract color metrics. How bright is it? Is it overstained? Is it understained? What is the hue?

[00:29:29] What is the saturation? Very kind of first-order-statistic-type stuff. And it turns out once you start doing that, you are able to go and group images by quality, and you’re able to even identify batch effects inside of them using just very, very simple statistical properties. And that’s something that’s extremely important, especially in the context of what I would consider these new deep learning paradigms, because deep learning, while on one hand it has the power to learn that a specific disease presentation is associated with a more [00:30:00] aggressive type of disease, at the same time has the ability to learn that, in this slide set, all of the patients that have a good prognosis have a very light slide and all of them that have a bad prognosis have a very dark slide. So then during test time, if you give it a dark slide, it’s like, oh, it must be bad prognosis,

[00:30:15] having not studied any bit about the actual disease presentation itself. So, in order to develop better classifiers, we need to have an understanding of what potential batch effects may be present in there. And HistoQC now goes and provides metrics, a set of them; I mean, it’s not, let’s say, all-inclusive, maybe about 30 or 40 of them, where you can go and try and predict the likelihood of a batch effect from a specific lab, we’ll say, and then that’s going to help inform you how to make better experiments later on. At the same time, we worked with Michael Feldman on that, who works at the University of Pennsylvania and is the head of the pathology department there.

[00:30:54] And he’s a co-investigator on the NCI grant that funds the development of HistoQC. And [00:31:00] his idea, coming at it from a pure pathology perspective, or, we’ll say, even a lab manager perspective, is that he wants to HistoQC every single slide as it comes out of the scanner, simply because he wants to know two things immediately.

[00:31:13] One, if that slide is of poor quality for some reason, because if it is, they can immediately address that problem. So instead of waiting for that tissue block to get put back into the repository, and the slide eventually makes it to his desk, and he looks at it and says, I can’t use this slide because it’s of such poor quality,

[00:31:27] and then you have to retrieve the tissue block from wherever and cut it again, I mean, you’re talking about days of delay. Now it’ll come right out of the scanner, and immediately you’ll have a red flag where HistoQC says, hey, this is probably a bad slide. And you say, okay, let me take a look at that slide. Oh, this is bad, so let me remake it right now.

[00:31:44] All of the material is physically in my hand. So, you immediately experience a boost of efficiency there. At the same time, he would like to know what the pattern over time of his lab is. So, for example, in the beginning of the month, when you have new stains, maybe those stains stain [00:32:00] more darkly; as you get towards the end of the month, maybe those stains have started to oxidize

[00:32:03] and they don’t stain as well. Well, if you have quantitative metrics that measure the output of those slides each and every single day, hundreds of times per day, you now essentially can have a quality control process where he can go in and say, I’m willing to accept one standard deviation away from this metric.

[00:32:20] If five slides in a row are produced that are too dark by this metric, I want to receive a page immediately and I want to shut the lab down to address that problem. And I want to address it after five slides instead of at the end of the month after 50,000 slides. So now you have this quality control, which is very common,

[00:32:40] for example, in building cars or building buildings or building little small parts or building iPhones. Everyone else has a quality control process that’s quantitative, where you say, I’m going to try this phone. Did it work? Yes. Okay, go to the next one. Did it work? Yes. This one didn’t work. Okay, one bad phone out of every thousand is okay.

[00:32:56] As soon as you get to 10 bad phones, they shut down the factory and they figure out why this [00:33:00] is way too high. Now that we have digital pathology, we’re transitioning from an analog science to a digital science. It enables us to use the same quality control and standard practices that other digital fields have been able to use over the years.

[00:33:14] Now we can finally start to apply them, and I think HistoQC also fits that niche, to help address the challenges there.
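To make the metric-and-alert monitoring Andrew describes a bit more concrete, here is a minimal sketch in Python. It is not HistoQC itself: the function names, the HSV statistics computed over a tissue mask, and the one-standard-deviation rule are illustrative assumptions, meant only to show the flavor of per-slide measurements and drift checks.

```python
# Minimal sketch (not HistoQC itself): simple first-order color metrics over a
# tissue mask, plus a one-standard-deviation drift check similar to the
# lab-monitoring scenario described above. Names and thresholds are assumptions.
import numpy as np
from skimage import color  # pip install scikit-image

def slide_metrics(rgb_tile, tissue_mask):
    """Compute first-order statistics on tissue pixels only."""
    hsv = color.rgb2hsv(rgb_tile)        # hue, saturation, value in [0, 1]
    tissue = tissue_mask.astype(bool)    # True where usable tissue was found
    return {
        "brightness": float(hsv[..., 2][tissue].mean()),
        "saturation": float(hsv[..., 1][tissue].mean()),
        "tissue_fraction": float(tissue.mean()),
    }

def drifted(history, new_value, n_sigma=1.0):
    """True if the new value falls outside n_sigma of the running history."""
    mu, sigma = np.mean(history), np.std(history)
    return abs(new_value - mu) > n_sigma * sigma

# Usage idea: log each slide's brightness as it comes off the scanner and flag
# the lab if several consecutive slides come back drifted.
```

The point of the sketch is only that these measurements are simple, quantitative, and cheap to compute, which is what makes a standing rule like “page me after five dark slides in a row” possible at all.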

Aleksandra: [00:33:22] So you said the worst thing that can happen to an open-source thing is that nobody uses it.

Andrew: [00:33:27] Yeah.

Aleksandra: [00:33:29] How many people use it? How popular is it?

Andrew: [00:33:31] So that’s a great question. It’s actually the challenge, I think, with open source, and in particular with HistoQC, that it’s very difficult for us to know how many people use it. Other products, well, I don’t want to call them products, we’ll say projects, other projects have different abilities to keep track of their users.

[00:33:49] So a good way of keeping track of users is you have, let’s say, a software tool, and the software tool, when it starts up, goes and pings a server somewhere and says, hey, I’m a new installation. And a [00:34:00] lot of open-source tools will do that, so they have at least an estimate of how many people have installed this particular file, because, like, you only need to download it once.

[00:34:08] But if I go to a hospital and install it 50 times, that shows up as a very large difference in impact. Unfortunately, HistoQC can’t use that type of mechanism, because it’s designed specifically to work with confidential data, and as a result, we don’t have any components that connect with the internet at all. So, it’s designed to work entirely on a standalone machine without any interface with anyone else, basically. So, we lose a lot of options as a result of that, while a lot of things, like, for example, Microsoft Word, they know how many people use Microsoft Word.

[00:34:44] Right? So specific projects, due to specific constraints, have different ways of measuring this. I think there are a couple of good ways that we are able to measure it; I guess it’s two components. One is that we see citations for the paper. So as people put out [00:35:00] publications in academic settings, hopefully, if they’ve used our tool to quality control their data, they’ll make reference to our tool.

[00:35:06] And of course we would appreciate that. That doesn’t necessarily work for labs that don’t publish papers publicly. Another way that we can keep track of it is we see how many people ask about it. So, for example, I get probably five emails a month from different people all over the world that are saying, hey, I’m trying to use this tool,

[00:35:25] this isn’t really working for me, this is working for me, how do I do this, those sorts of things. So, we kind of try to gauge it from there. GitHub has the ability to keep track of some statistics, so if you make a release, you can see how many people have downloaded that release. So, there are different metrics, but,

[00:35:40] no, I have no idea. I think the ultimate answer is: I have no idea; more than 10, less than 10,000.

Aleksandra: [00:35:47] Okay. So why did you do it open source? I mean, building code for people who don’t build code, that’s the software business principle, [00:36:00] and now the digital pathology industry is booming with different image analysis companies, slide management companies, anything that you have to do around the slides without knowing how to code.

[00:36:14] I don’t think anything is really addressed in the quality control space. I think it’s a little bit like, okay, every company develops for themselves; they want to have high-quality data, but it’s not something that goes out to the users. This is the only thing that is solely dedicated to this that I have heard of, and correct me if I’m wrong.

[00:36:36] Why did you not want to commercialize it? I mean, you had the idea. I know now it’s part of a grant and, you know, a bigger infrastructure, but, I mean, you still could take it, do some other things, add to it and commercialize it. What was the reason why you didn’t want to do it that way?

Andrew: [00:36:53] So there was a very similar question when we released this tool at the ECDP [00:37:00] in Helsinki a number of years ago, I think it was maybe 2018, and I don’t think my response to that has changed. I wanted to release it open source just to fundamentally change the world. I wanted to change the way that we enacted digital pathology as a science, and one of the problems,

[00:37:19] I guess one of the problems with digital pathology science versus other sciences, is that we don’t take measurements. And as soon as we start taking measurements, we have the ability to do better. And that’s this idea, right? There’s this old saying, and I’ll paraphrase, something like: that which you don’t measure doesn’t get better. Right? So, if I can go and say, this is what I’m expecting, now I have a quantitative metric for that. And the importance of it being open source is because everyone needs to get on board with it. So, if I go and, let’s say, commercialize it, and five people buy that product, that’s fine, but you’ve now isolated yourself from the other 95%.

[00:37:56] So you have the potential of creating lots and lots of [00:38:00] different standards. The challenge here is to get as many people on board as quickly as possible so that you can develop better algorithms. So, one of the advantages of HistoQC: what’s happened to us is, I will go and train a deep learning model, let’s say to do segmentation or identify where a tumor is. I’ll train a model. The model will work fantastically on my personal dataset. Someone else will go and say, can I use your model? Yes, I give them my model; their results are very bad. Why are those results bad? I don’t know.

[00:38:32] So then I say, can I see some of your slides? They send me some of their slides, and their slides look fundamentally different from my slides, right? Maybe they’re much brighter, or much darker. And a lot of people have put out numerous publications showing that deep learning is not robust,

[00:38:45] we’ll say, to slides prepared at different sites. So, a fair question would be to say, how can we address this disparity? One way is to make better deep learning algorithms, and I think everyone’s trying to do that. The challenge, I think, [00:39:00] becomes: pathologists want to know if they can trust certain results.

[00:39:03] How can we go and start to instill trust in medical doctors that this algorithm is going to work or that it’s been tested appropriately? HistoQC can help with that because now we can say I have trained my deep learning model using slides that have exactly these metrics. Right. It is exactly this dark.

[00:39:19] It has exactly this much hematoxylin stain. Exactly; not an approximation, but down to five significant digits. This is exactly what my dataset looks like. Now, before you use my algorithm or use my model on your slides, I want you to use HistoQC on your slides. If your metrics are fundamentally different from mine, do not use my model.

[00:39:40] Don’t use it, because maybe it’s going to work, maybe it’s not going to work, but I don’t know if it’s going to work. That’s an indication to me that you’re now using it on data that’s outside of what I have tested it with. That should immediately give people better confidence to say, oh, this person now is saying, let’s take a step back and proceed with caution.

[00:39:59] Maybe [00:40:00] it will work; that would be great. But you should test it, and be very, very careful with it until it’s fully validated. Previously that wasn’t a possibility, because you would just say, here’s my model; oh, it didn’t work; and then go on with your life. Simply having metrics allows us to now start going and doing these things.

[00:40:17] And I think the reason why it has to be open source is because everyone has to do this. In order for that to be successful, everyone has to go and start taking these measurements and releasing them with their models. So, if you release a deep learning model that, let’s say, segments glomeruli in the kidney, you should, along with it, release a summary of what that data set looked like in terms of quantitative presentation characteristics.

[00:40:39] So that, you know, if your samples are within that distribution, this is probably going to work for you; if it’s outside of that distribution, maybe don’t use it at all, or maybe be very cautious about it. That only works if everyone has the ability to do that. If it’s going to cost you, I don’t know how much a quality control tool would cost,

[00:40:56] let’s say $500 or a thousand or ten, it doesn’t [00:41:00] matter; as soon as there’s a barrier to entry, then there’s a disconnect in the number of people that are going to use it, because they’ll say, well, why don’t I just use the model and see what the results look like? Why should I bother to go through this quality control process

[00:41:12] if it’s going to also cost me money as well as time? By making it open source, now everyone can just simply approach it and try it. And I think that once you start walking on that path, you see that the benefits greatly outweigh the very minimal amount of time that it takes to use that tool.
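As an illustration of the protocol Andrew outlines, comparing your own slides’ metrics against the metrics published with a model before trusting that model, here is a minimal sketch. The two metrics, the array layout, and the two-standard-deviation rule are assumptions made for the example, not anything HistoQC or a specific model release prescribes.

```python
# Minimal sketch of "check the metrics before applying the model".
# train_metrics: per-slide metrics released alongside a trained model
# (rows = slides, columns = metrics); new_metrics: the same metrics computed
# on your own slides. The 2-sigma rule is an illustrative assumption.
import numpy as np

def within_training_distribution(train_metrics, new_metrics, n_sigma=2.0):
    """Per new slide: True if every metric lies within n_sigma of the
    training cohort's mean for that metric."""
    mu = train_metrics.mean(axis=0)
    sigma = train_metrics.std(axis=0) + 1e-12   # avoid division by zero
    z = np.abs((new_metrics - mu) / sigma)
    return (z <= n_sigma).all(axis=1)

# Hypothetical example with two metrics (brightness, saturation):
train = np.array([[0.82, 0.31], [0.80, 0.33], [0.78, 0.30]])
new = np.array([[0.81, 0.32],   # similar to the training cohort: try the model
                [0.55, 0.10]])  # much darker, weaker stain: proceed with caution
print(within_training_distribution(train, new))  # [ True False]
```

A slide that fails the check is not necessarily unusable; as Andrew says, it is simply outside what the model was tested on, so the result should be validated rather than trusted.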

[00:41:30] So I really don’t see another way that this could have worked. And it’s basically the same exact problem that we see, if I can continue this rant, with, for example, these whole slide image formats: every company has their own whole slide image format. Guess what? As a researcher, that’s the hugest pain that you can imagine.

[00:41:48] People have had to go and write specific wrappers and try and maintain wrappers. OpenSlide is no longer being maintained. So now we go and get slides from someone and there’s no standardization. We’ll spend a [00:42:00] day and a half, maybe: oh, what format are these annotations in? Oh, this person uses XML.

[00:42:04] Oh, this company uses JSON. Oh, these people are using this type of XML schema. Oh, I need to rewrite my parser to go and do this. Oh, look, I can’t go and load their files into my perfectly validated and established pipeline because it’s a slightly different format. So, all of that has created so much inefficiency and so many bottlenecks to collaboration and to research

[00:42:25] that it almost seems like the right solution is to show people what should be done. And that solution is to make an open-source answer. The open-source answer is: we can use HistoQC, it’s free, and if you’d like, you can contribute ideas, other metrics and things like that, to it. Now there’s at least a rallying point.

[00:42:39] Now everyone can say: I used HistoQC, these are exactly my metrics, and it cost me nothing to do that. On the other hand, say, well, I’m going to buy, let’s say, a Philips scanner; oh, but am I familiar with their SDK? Right? The conversation ends up becoming something that’s completely unrelated to the problem that you’re actually solving.

[00:42:56] I want to do computational research. I don’t want to spend days [00:43:00] writing parsers for file formats. That’s not interesting, that’s not exciting, and quite frankly, I think it’s a waste of time. It’s a necessary evil until we have some organization. And we’ve seen things like this throughout, even, human history.

[00:43:14] There used to be, for example, two different types of DVD formats. You might remember you had DVD+R and DVD-R, and some of them were the Philips ones and some of them were the Sony ones, and you couldn’t read one in a drive made for the other; all of this crazy stuff. USB was another one: everyone came out with their own little different adapters, and

[00:43:32] you may remember, in the eighties and nineties, everything had its own type of adapter. So, oh, my mouse has its own power supply, this has its own power supply, this has its own; it’s just a crazy amount of added inefficiency. There’s no other word for it, it’s just inefficiency. And now you have one USB connector, and you can charge your headphones, you can charge your phone, you can charge this, and this, and this, because it was a viable standard that people were able to rally around. So as far as [00:44:00] quality control is concerned, I think it has to be that. Other things, maybe there’s more leeway and it’ll take more time to figure out, but if you really want to have a fundamental impact in the quality control space, it has to be something that’s so incredibly easy

[00:44:14] to use and so available that people realistically do not have a choice not to use it, because it’s really that easy.

Aleksandra: [00:44:23] Okay. So how can I use it? Let’s say I have a project with not so many slides, some slides for a project, and I want to have the HistoQC metrics for it. Do I go through GitHub? Where do I go? I’m going to put all the resources, the paper and everything you’re now telling the listeners about, in the show notes so that they can click and get it.

Andrew: [00:44:46] Great. So, there is a GitHub repository for that. First, I would recommend at least reading the paper; it’s a fairly short paper in JCO CCI, 2019, so a quick read there will certainly help. I think there is a [00:45:00] little bit of a disconnect, though. The people that would use it would have to have at least some technical experience, because it is more of a, we’ll say, lower-level tool as opposed to a higher-level tool. So, if you’re interested in using it, for example, I think you could probably install it yourself, and it will work on, for example, a laptop; you don’t need any type of very sophisticated computer.

[00:45:22] You should be able to run it just by going to the command line and typing the command that is shown in the documentation. I mean, it really should be that easy. I think we’ve seen that for, we’ll say, pure pathologists that are not very technically heavy, it might be a little bit too much, but a hundred percent of the time there are people within their organization that are capable of doing it.

[00:45:45] The technical skills needed are really, we’ll say, just slightly above introductory. And as well, we have people on our side that can help. So, when people don’t know how to use it, or it’s not working properly, part of that NCI funding is for us to provide support, [00:46:00] so people can reach out to us and we’ll have, let’s say, a screen share and say, okay, now click here, do this, do this, do this. So there is even an opportunity, I would say, for people that have no idea how the technology works to be walked through this, to the point where they are able to at least process their slides using the standard pipeline.

[00:46:19] Of course, if they’re interested in modifying it and using more, let’s say, sophisticated features and that sort of thing, there is a little bit more of an investment of the user’s time to kind of fine-tune the parameters and things like that. But for, I would say, routine H&E slides, the defaults that we provide seem to work really well.

[00:46:37] They seem to work really well; we haven’t had a lot of complaints in regard to the defaults. We’ve had questions about extending the functionality to other stains and things like that. We have a paper currently in review where we’ve employed HistoQC on a kidney slide repository that looked at silver stain, trichrome stain, PAS, and H&E.

[00:46:58] HistoQC performed admirably [00:47:00] across all of them, and we released configuration files for each of those stain profiles as well. That was done by one of our PhD students. And this is kind of this idea of open source, because again, he went and did this work, and he didn’t have to change any of the source code.

[00:47:13] He only changed the configuration file, and instead of keeping it private, we open-sourced it. Now, if you want to use PAS stain, you can simply download that configuration file, click run, and potentially that becomes something of a standard that we can start the conversation around.
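
To make the configuration-only idea concrete, here is an illustrative sketch of a config-driven pipeline. It is not HistoQC's actual configuration format or module list, just a toy example of defining QC steps per stain as data rather than code:

```python
# Toy illustration of a config-driven QC pipeline: the steps live in a
# configuration, not in the source code, so supporting a new stain only
# means adding a new entry. These step names are invented, not HistoQC's.
STAIN_PIPELINES = {
    "h&e":       ["basic_stats", "tissue_detection", "pen_marking", "blur_detection"],
    "pas":       ["basic_stats", "tissue_detection", "blur_detection"],
    "trichrome": ["basic_stats", "tissue_detection", "fold_detection", "blur_detection"],
}

def run_pipeline(slide_path: str, stain: str) -> None:
    """Apply every configured QC step for the given stain to one slide."""
    for step in STAIN_PIPELINES[stain]:
        print(f"{slide_path}: running {step}")  # a real tool would dispatch to the module here

run_pipeline("slides/kidney_001.svs", "pas")
```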

[00:47:33] If we’re still using HistoQC, we’ve probably failed somehow in our quality control endeavor, but it’s, I think it fills a very needed niche in the short term and has sparked a lot of great questions and great conversations where people say maybe, maybe we should be, why aren’t we, why aren’t we wait if it’s digital, shouldn’t we be doing digital quality control, like every other science that we have, and the answer is yes.

[00:47:56] And finally we have an opportunity to do that.

Aleksandra: [00:48:00] So let’s say in the lab, you mentioned the one scenario with. Where it’s plugged in directly after scanning and they already developed in this lab metrics that alert them. And the, what I mentioned would be okay before I started the project, I run it on my dataset. Is there any other place where you would recommend plugging it in the lab or digital image analysis pipeline?

Andrew: [00:48:24] What’s interesting is I think that any, any time that you scan a slide, you should, you should use it. Right? So, it’s easy for me to identify when to use it, because I just want to, I imagine where the scanner is. And then I imagine the next box, if you imagine a workflow that next box is some type of quality control and a lot of scanners have some type of, of quality control, but I don’t think it’s extensive. And the problem is that no one, those, those metrics are not comparable, let’s say across facilities, across scanners, across that. So, you really do need a, let’s say a unifying way that people know what it is, right. Because there is [00:49:00] this kind of, I guess, goes back to our, one of our initial discussions.

[00:49:03] Even if I tell you what HistoQC does and you program it yourself, you're likely to have a different implementation. So you may say, oh, he computed the mean squared error using this equation, but in fact I used that equation. So then you would still not have comparable statistics, even though you think we've implemented it the same way.

[00:49:20] So by releasing it and connecting it immediately after the scanner, at least there is that consistency across sites and things like that. I think one thing worth mentioning as well, which I find super interesting, is that you said you wanted to use it for a research cohort. And that's really one of the main places that we've looked at.

[00:49:39] So keep in mind that when you receive a thousand slides for a study, you don't necessarily use all 1,000 slides. You'll look at some of them and say, oh, these 5% are so bad that I have to throw them away. Right? They're just so bad, they have to be removed. Now, the interesting point is that in our study that's being reviewed now, it seems that if [00:50:00] you do that process with three different people and you give them the same set of slides, the 5% that they remove is not always the same. Right? And it turns out from our estimations that concordance can be as low as 0.4 on which slides are considered low quality, based on three different readers. I was one of the readers, along with a postdoc that we worked with and a PhD student. So we all went through this manually, looked at, I think it was about 250 slides, looked at all of them, and said, I think these are the bad ones and these are the good ones. Right? And then we analyzed it. Wow, 0.4. That's very poor. And this is pre-analytic. This is even before you've done any type of experiment whatsoever. This is just in creating the data to go and do the experiment, and already you have huge variability. Imagine how much that impacts downstream experiments. People wouldn't even end up with the same data. And this is startling when you think about it in the context of the TCGA, where now you have 10,000 slides and everyone downloads the same slides, right?

[00:50:57] All the slides are freely available to everyone. Everyone downloads [00:51:00] the same slides. I hope everyone does some type of quality control, although reading some of these manuscripts, there's no guarantee that they do. But now, of course, when I review manuscripts, I go and look and say, did you do quality control?

[00:51:12] I know for a fact some of these slides are not suitable for computational analysis. So people are now starting to address that naturally in their papers, which I'm very thankful for. But then you have the next question: well, if you do quality control and I do quality control, and you say that this is the result of your experiment,

[00:51:28] can I reproduce your experiment if I don't even start with the same data? And the answer is probably not, because now you've imparted some bias in your selection of the good quality and the bad quality. So what we did then was use HistoQC across the same cohort, and we had the same three readers do it again, looking at the HistoQC metrics to identify which are the good quality slides and which ones should be removed.

[00:51:50] Concordance jumped to 0.95, 0.96. So, almost perfect concordance between these three readers, simply by using a quantitative tool that allows you to go and [00:52:00] say, oh, these are different than the others, these are not suitable, let's remove them. So now we're at least starting with a much more similar dataset, and you're likely to have better reproducibility of your experiments downstream, simply by using a free open-source tool.
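
The exact concordance statistic used in that study isn't spelled out here. Purely as an illustration, one common way to quantify agreement between readers on keep/discard decisions is the average pairwise Cohen's kappa, sketched below with toy labels:

```python
# Illustrative only: quantifying inter-reader concordance on keep/discard
# decisions (the study's exact statistic may differ).
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# 1 = "remove as low quality", 0 = "keep"; toy labels for 10 slides, 3 readers
reader_labels = {
    "reader_a": [0, 0, 1, 0, 1, 0, 0, 0, 1, 0],
    "reader_b": [0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    "reader_c": [0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
}

kappas = [
    cohen_kappa_score(reader_labels[a], reader_labels[b])
    for a, b in combinations(reader_labels, 2)
]
print(f"mean pairwise kappa: {sum(kappas) / len(kappas):.2f}")
```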

Aleksandra: [00:52:17] And another thing that you're doing that I want our listeners to know about is your blog. Tell us a little bit about the blog. What's there, what's its profile, and who is it for?

Andrew: [00:52:31] We published a paper in the Journal of Pathology Informatics in 2016, maybe, and it was called deep learning, seven use cases, something like that. This was when deep learning was starting to become popular, and I realized that all of the different challenges, all of the use cases we were seeing in our lab, you could solve using deep learning, which I think has been well proven since then.

Aleksandra: [00:52:56] Do you incorporate deep learning in HistoQC? Or is it?

Andrew: [00:53:00] No, no, no. The challenge with incorporating deep learning is that you have to train the model, and the problem is that as soon as you train a model, you need associated input data for it. And if we go and use HistoQC across 10 different sites, it's not obvious to me what that input data should be.

I can’t go and say, this is the definitive ground truth of H&E simply because other hospitals will reject that notion. So instead HistoQC is, is using what I would consider first order image analysis and image processing an image statistic technique that. Are purely data-driven. So, there’s no site-specific component, which allows us also to remove some variability, because now if you train a model and someone else trains a model, how do I guarantee that those two models are performing the same?

[00:53:48] You can’t? So, then I have to perform the model. Unfortunately, if I have to train that model, I would have to train a model for breast tissue, brain tissue, prostate tissue, colorectal tissue. And how do I get all of this data? How do I ensure there’s [00:54:00] enough variability in the data? And then you kind of have this, this problem where now I need every single slide ever created in the world in order to train a model, that’s robust enough to address it.

[00:54:08] So you simply avoid that altogether.
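
To give a flavor of what "first-order, data-driven" measurements look like, here is a small sketch of the kind of model-free statistics one could compute per tile. These are illustrative metrics, not HistoQC's actual modules:

```python
# Not HistoQC's actual modules; just a sketch of model-free, first-order
# measurements that require no training data.
import numpy as np
from scipy.ndimage import laplace

def first_order_stats(gray_tile: np.ndarray) -> dict:
    """Simple quality statistics for a grayscale tile with values in [0, 1]."""
    return {
        "mean_intensity": float(gray_tile.mean()),            # flags over/under-exposure
        "contrast": float(gray_tile.std()),                    # washed-out tiles score low
        "blur_laplacian_var": float(laplace(gray_tile).var()), # low values suggest out-of-focus regions
    }

tile = np.random.rand(256, 256)  # stand-in for a tile read from a whole-slide image
print(first_order_stats(tile))
```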

Aleksandra: [00:54:11] And this is a super important thing to mention because now I think the perception of image analysis is that the traditional image analysis is dying. And now everything’s going to be deep learning. Everything’s going to be training models and then you don’t have to annotate. You can do unsupervised.

[00:54:28] I think, like you just said, there are applications and use cases for different types of image analysis, and they're going to continue in parallel depending on the use case.

Andrew: [00:54:43] Keep in mind that if you look at the history of image analysis in general, and of deep learning and machine learning, there has kind of been a wave, right? So there was this time where everyone was like, oh, neural networks are going to solve all the problems, and they didn't. Shocking, they didn't, and we're still [00:55:00] here.

[00:55:00] And then we kind of reverted back to more handcrafted features and those types of things. And now the flip is to the other side, machine learning and artificial intelligence, which is great because it does really solve a lot of the problems that we experienced. I think we're starting to see some of the limitations there, and I wouldn't be surprised if we see a transition once everyone is familiar with deep learning and machine learning and we've solved a lot of the basic problems. Those basic problems are things like cell segmentation and identification of where the tumor is on the slide, and these are all great use cases for deep learning where I don't think a case can be made for using traditional image analysis techniques anymore, since we have the data and the use cases are fairly straightforward. I think once everyone has really caught up, once the companies have built the tools, once people are using the tools, and once the students are being trained with these concepts, that will have its place, and then we'll be able to return to a lot of the [00:56:00] handcrafted and image analysis components, which are still, I think, especially interesting in the context of biomarker discovery.

Aleksandra: [00:56:08] Yeah, this is important. And what about the blog? Tell us about the blog.

Andrew: [00:56:11] Ah, yeah, so we published this JPI paper. My notion for that paper was that it's possible to use a single repository and solve seven digital pathology problems. So the question was, can we, without tuning anything, without changing a single line of code, use the same deep learning architecture and the same training,

[00:56:36] not tweaking any parameters, no "oh, I'm going to increase the learning rate for this use case", exactly everything the same across all of these seven use cases. And the seven use cases were exactly the things we were struggling with in our lab: nuclei segmentation, lymphocyte identification, mitosis identification, segmenting tubules and glands, identifying invasive ductal carcinoma, [00:57:00] separating different forms of lymphoma from each other, and identifying epithelial and stromal regions. So those were the seven use cases. And I wanted to know, is it possible to use deep learning where only the annotations change, right? So the data coming in is different: on one, someone's annotated nuclei; on another, someone's annotated epithelium; on another, someone's annotated a cancer region, and so on. So the input is different, but everything in the middle is the same, and everything at the end is the same. We're going to use the same output generation scripts. We're not changing anything. We're only changing the input data.
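
As a sketch of that "only the input changes" setup (the names, paths, and hyperparameters below are illustrative, not the paper's actual scripts), the whole experiment boils down to one fixed training routine applied to seven different annotation folders:

```python
# Illustrative only: one fixed network configuration and fixed hyperparameters,
# with the annotated patches as the single thing that varies between tasks.
FIXED_HYPERPARAMS = {"patch_size": 32, "learning_rate": 0.001, "epochs": 30}

USE_CASES = {
    "nuclei":      "data/nuclei",
    "epithelium":  "data/epithelium",
    "tubules":     "data/tubules",
    "lymphocytes": "data/lymphocytes",
    "mitosis":     "data/mitosis",
    "idc":         "data/invasive_ductal_carcinoma",
    "lymphoma":    "data/lymphoma",
}

def train(data_dir: str, hyperparams: dict) -> None:
    """Identical architecture and training loop for every task; only data_dir changes."""
    print(f"training on {data_dir} with {hyperparams}")  # the real training loop would go here

for task, data_dir in USE_CASES.items():
    train(data_dir, FIXED_HYPERPARAMS)
```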

[00:57:31] And it turns out it works, right? Deep learning was really powerful even at that time, and those networks we used were very unsophisticated compared to what's available now. The patch sizes were small, 32 by 32; batch normalization hadn't been invented yet. I think maybe dropout had been incorporated.

[00:57:47] I think we used that. Yeah, we did use that in that paper. So it was still a very early stage of deep learning, and we were able to show comparable or state-of-the-art results across all of those use cases [00:58:00] without changing any code. So that created a situation where, well, I knew as soon as that worked that I wanted to go in and release all of this, right?

[00:58:10] As soon as these experiments started to actually be successful, I said, wait a second, I have a general solution to all of these problems that is good enough, right? I'm not saying that we had F-scores of one; we didn't have a hundred percent accuracy. But we had great results, and we had them in three hours, for things that would typically take us a year and a half to develop.

[00:58:31] So we saw this magnitude of efficiency improvement. So, what do we do with this information? It made sense to me to open source it, right? I'm going to go and release all of the code open source so that other people can use it. Fantastic. How do I go and tell people how to use the code? Well, code is pretty dense, and it's not obvious how to use it.

[00:58:50] So usually you have readme files, something like that. I realized that wasn't going to be sufficient in this use case, because deep learning was so new that people didn't [00:59:00] understand even the fundamentals of it, or the concept of it, so there would need to be further explanations. And it's very difficult to, let's say, embed images into a readme file on a GitHub page.

[00:59:10] So then I said, well, we'll start a blog and put all of this stuff in the blog. So the first blog post itself was me writing seven tutorials, one for each of those use cases, explaining all of the design decisions made in collecting the annotations, how we validated it, what the steps were to rerun it, and exactly the lines of code that you would need to copy and paste into the terminal to reproduce those results. At the same time, we released all of the data itself. So every single piece of data that went into that manuscript was released. That was the largest release at that time of any type of open-source digital pathology data.

[00:59:44] We released the annotations as well, and people flocked to it. I mean, that blog receives probably a thousand hits a month at this rate, just people going and using it as a reference and coming back. Then at the [01:00:00] same time, I realized that people were interested, and the positive feedback was overwhelming.

[01:00:05] It was just an amazing amount of positive feedback. And I realized it didn't actually cost me that much to do it. I also realized at the same time that as we're training students and teaching them, I'm teaching them the same things over and over again, because we'll have a new student who doesn't know how to do these sorts of things.

[01:00:22] So now my general policy tries to be: if I have to explain something twice, I'll try and write it into a blog post, because then the third time I no longer have to explain it. That third time I can say, here, go read this blog post, and let me know if you have any questions. There's usually some source code or things like that there as well.

[01:00:40] It also enables me to, and perhaps this is very selfish, but it enables me to go and create standards for what I expect from my students. So, for example, one of the things that I wrote is some code, quite simple, that will go and take your results, your segmentation results, [01:01:00] and put them programmatically into a PowerPoint slide.

[01:01:03] So it’ll take your,

Aleksandra: [01:01:04] I saw the post.

Andrew: [01:01:05] Original image, ground truth, output from your algorithm, and some comparison, some statistics, the name of the file it came from, et cetera. Now you can go and just change two or three lines of code, my files were over here, and this is how big they are, or whatever the required changes are.

[01:01:22] Click run, and now you have a PowerPoint presentation that is exactly the way that I prefer to look at results. So instead of a student going in and, let's say, ad hoc giving me weird file formats, or, oh, here it's this, I've gotten all kinds, and it's not their fault, it's just what's easiest for them in that particular moment.

[01:01:41] And they don’t have the experience yet to realize that someone else. Does it know what they’ve done or how to interpret what they’ve done. So, you’ll see, I’ve gotten situations where the original images have one type of file name. The output has a different type of file name. And it took me one or two minutes looking at the file names to try and match up like, [01:02:00] Oh, this is the output from this image.

[01:02:01] And then it takes me another minute to look through all of the masks to say, oh, this is the output for that one. I mean, it's just not efficient. So I say, well, go and present your results like this. Now the students can go and read that blog post, think, oh yeah, this makes sense, and go and present their results

[01:02:14] like that. It's much more efficient for me, because now all of the results that I see are in the same format, so I don't have to expend any additional mental effort trying to parse each individual person's preference for presenting results. Now, this is how you present your results, and I would like to see it like this.

[01:02:30] Now it’s easy for me. It’s easy for them because once they’re familiar with that script, they use it for every single project that they have. So, it becomes much easier for them to go and look at the same time. I think it addresses one of your other points where you said, well, people will only go and use the best, the best results, for example, in their paper or they’ll cherry pick something like that.

[01:02:48] Well, if you have to manually go and click on each image and its output, get them up on your screen, say, oh, this kind of looks good, and then do that for the next image, I would argue that no [01:03:00] one is going to do that for a hundred images. It's just annoying, and it's time-consuming.

[01:03:04] On the other hand, if you can click one button to have a PowerPoint presentation made, now you just quickly scroll through with your mouse wheel. All of the information is perfectly paired. Now you're able to actually look at your entire dataset and get a better feeling for how the algorithm is performing, instead of randomly looking at it or cherry-picking.

[01:03:25] And you’re able to go and convey that information to other people as well. So, it’s really these types of best practices that I think ended up making it into that, that blog simply because it’s becomes more efficient, I think for me, but it becomes more efficient for the students as well to be able to understand what they need to do.

[01:03:40] And there are some longer-form explanations of why those particular decisions were made.
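
A results-to-PowerPoint script like the one described above can be put together with the python-pptx library. The sketch below is not the blog's actual code; the folder layout and slide geometry are assumptions, purely to show the shape of the approach:

```python
# Not the blog's actual script; a minimal sketch of the approach using python-pptx.
# Assumes matching file names across folders: orig/<name>.png, gt/<name>.png, pred/<name>.png.
import os
from glob import glob
from pptx import Presentation
from pptx.util import Inches

prs = Presentation()
blank_layout = prs.slide_layouts[6]  # blank layout in the default template

for orig_path in sorted(glob("orig/*.png")):
    name = os.path.basename(orig_path)
    slide = prs.slides.add_slide(blank_layout)

    # keep the file name with its results so outputs are never mismatched
    title = slide.shapes.add_textbox(Inches(0.3), Inches(0.1), Inches(9.0), Inches(0.5))
    title.text_frame.text = name

    # original | ground truth | model output, side by side on one slide
    for i, folder in enumerate(["orig", "gt", "pred"]):
        slide.shapes.add_picture(os.path.join(folder, name),
                                 Inches(0.3 + i * 3.2), Inches(1.0), height=Inches(3.0))

prs.save("segmentation_review.pptx")
```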

Aleksandra: [01:03:48] So it is more for computer scientists who are doing digital pathology than the other way around, for pathologists. I have one favorite post, which is how to download the [01:04:00] images from the TCGA dataset, because everyone's like, oh, TCGA, TCGA. I want a couple of images just for illustrative purposes. How do I get them?

[01:04:08] And this is the post I'm going to link in the show notes as well. But from what I have looked at, I don't have too much use for it as a pathologist. Correct me if I'm wrong.

Andrew: [01:04:20] Yeah, no, I think you're right. So it's more geared towards what I would consider computational pathologists, the people who are interested in loading the data and analyzing it and who are familiar with Python, because there is my GitHub account as well, which has the parallel version of the code that's available in those blog posts.

[01:04:41] But there certainly are components that I think pathologists would be interested in. For example, we announced the HistoQC paper being available there, and we have a larger explanation, like what we've discussed in this podcast, that isn't necessarily appropriate for [01:05:00] a manuscript itself and that gives a larger, broader vision of what may be possible.

[01:05:05] So there is some commentary on that, I think. But in general, I use it, I would say, as a tool to really codify my experience, such that I can more easily and efficiently pass it on to that third person. And once it's written, that third person is not one person; the third person is a few hundred people, simply because it is freely and publicly available.

[01:05:25] So I get emails from places I didn't even know did digital pathology. I had a bachelor's degree student from Laos email me and say, hey, I'm super interested in this, thanks, I downloaded the data, but this isn't really working for me, how do I get it to work?

[01:05:43] And I was like, this is fantastic. This is exactly what I would hope for. At the same time, I don't think every place in the world has sufficient access to, let's say, the infrastructure for digital pathology, [01:06:00] or to the people who have experience with digital pathology, the mentors with the experience to help them save some time.

[01:06:04] Right? So, it’s, again, this idea of helping other people fail as quickly as possible so that they can succeed sooner. If you have no access to a pathologist you’re probably not going to advance as quickly, but if we can go and take some best practices, Hey, you might want to try this. Or these have worked really well for us.

[01:06:21] and here is some code showing how to do it, and formalize that, then you're making it more available, right? So this, again, is the democratization of digital pathology for other folks.
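
For readers curious about the TCGA download post Aleksandra mentioned, the snippet below sketches one way to pull a couple of whole-slide images programmatically via the public GDC API. The blog post's own walkthrough may use a different route (for example the GDC Data Transfer Tool), so treat the project and field names here as illustrative:

```python
# One way to grab a couple of TCGA whole-slide images via the public GDC API
# (project, fields, and file count are illustrative choices).
import json
import requests

filters = {
    "op": "and",
    "content": [
        {"op": "=", "content": {"field": "cases.project.project_id", "value": "TCGA-BRCA"}},
        {"op": "=", "content": {"field": "data_format", "value": "SVS"}},
    ],
}
params = {"filters": json.dumps(filters), "fields": "file_id,file_name", "size": "2"}
hits = requests.get("https://api.gdc.cancer.gov/files", params=params).json()["data"]["hits"]

for h in hits:
    print("downloading", h["file_name"])
    with requests.get(f"https://api.gdc.cancer.gov/data/{h['file_id']}", stream=True) as r:
        with open(h["file_name"], "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
```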

Aleksandra: [01:06:34] Okay, thank you so much, Andrew, for this great conversation. And as I said, I'm going to link every resource that we talked about in the show notes. If you have anything else, please feel free to send it to me, and it will be down there, easy to click on. Have a great day.

Andrew: [01:06:52] Thank you so much. Have a great day, ciao!

Aleksandra: [01:06:55] Bye bye.
