Community Blog

OSR Presentation: Insights Out of Chaoss, Cali Dolfi Red Hat

June 10, 2024

Learn about 8Knot, a new tool designed to enhance open source community analysis, with a video of the recent FINOS Open Source Readiness presentation by Cali Dolfi of Red Hat on June 8th, 2024, or check out the transcript. In her presentation, Cali provided an in-depth look at 8Knot’s features, explaining how it utilizes Augur to collect and structure data from Git repositories. 8Knot offers comprehensive visualizations and dashboards to provide insights into community health, contributor activity, and project sustainability. The session also included a dynamic Q&A, where participants discussed potential future enhancements and integrations based on the audience's feedback.

 

8Knot GitHub

Recap

Cali Dolfi, a Senior Data Scientist at Red Hat’s Open Source Program Office, presented on the 8Knot project, a tool designed to provide comprehensive data analysis for open source communities. 8Knot uses Augur to collect and structure data from Git repositories. It offers a range of visualizations and dashboards to assess various aspects of community health, such as pull request staleness, issue activity, and contributor growth. The project aims to fill gaps in existing tools by leveraging robust data science methodologies, making it easier to analyze and understand the dynamics of open source projects.

During the presentation, Cali highlighted the importance of understanding community behaviors and trends to make informed decisions about project involvement and support. She demonstrated how 8Knot can be used to get insights into licensing, security, and overall project activity. The session also included a lively Q&A, where participants discussed licensing data quality and expressed interest in integrating 8Knot with other tools like Backstage. While this integration is not a current feature or future plan of 8Knot, it emerged as a potential area of exploration based on audience feedback. The presentation concluded with a call for community contributions to improve the tool and address any data accuracy issues.

TRANSCRIPT

Cali Dolfi: [00:00:00] Hi, my name is Cali Dolfi. I'm a Senior Data Scientist in Red Hat's Open Source Program Office. And yeah, I'm here to talk about all things 8Knot and all of the related projects around it.

And then we'll do a demo and we can talk all things community metrics. Just doing Insights Out of CHAOSS. We'll go through everything, but a lot of the tools that we use, and where we get our data, come from the CHAOSS Project. And we work closely with them as well. So this is just the coined phrase that I like to use for all of my presentations around 8Knot.

So first I wanted to open up this question. If people don't know the answers to this, think about it, put it in chat. We can come back to it later. But what are the type of questions that you have about the open source communities that you're involved in or the ones that you depend on? I don't know.

Let me see if I can get back to chat. I'm like, there we go. Anybody can ask them out loud. I'll give it a second. Anything about communities? Anything you're concerned about? Pie in the sky? Anything?

Brittany Istenes: Essentially just the viability of the community itself, right? Is the community based around my project active? Are they doing the right things? Who are these folks that are actually contributing to my projects? Those sorts of things I'm very curious about.

Because sometimes it's so vague and you don't, [00:01:30] you can't see everything, especially right at a glance. 

Cali Dolfi: Yeah.

Who's supporting the project. 

Jamie Slome: Also, one of the things that seems difficult, but maybe this will be educational for me, is where you've got open source communities who, rightly or wrongly, I guess, self-implode their projects for whatever reason. Being able to track that before the case, if it's possible, if it's not.

Yeah, that would be one of those things. 

Cali Dolfi: Nice. If you'll say them verbally, can you all put them in the chat as well? Cause I'll try to, the ones that I do have visualizations that answer, I want to make sure I can show which visualizations those are and then how I read the graphs to answer that.

License policy, yeah, that one's huge. Especially these days, license changes. How to make quieter voices for sure.

Wait, let me go to my second question to start us. Oh, wait. What do you want to know about a project before you get invested or dependent? Like in that early phase, a lot of times there's a couple of different projects that have a similar-ish technical solution.

What do you want to know about those projects before you make that big decision?[00:03:00] 

Security issues? Huge these days. 

Jamie Slome: Is it active? You know was the last commit eight years ten years ago versus, was it last week? So they have like good release cadence. Do they you know, do they release minor patches on a good basis rather than releasing once every four five eight nine ten weeks months, whatever it is? 

Cali Dolfi: Yeah, definitely Viability, supportability.

Yeah, awesome. These are all great questions. And continue putting them in the chat, because my planned demo is pretty short, because I like to just go off of and look at things that people are actually interested in and not just my spiel. Sweet. Okay, so I'm first going to go into the motivation behind 8Knot, why we made a project, when probably some of the people in the group may be more familiar than others.

There are different tools that have been available for a while to try to do data analysis around open source communities. I came into Red Hat around 2020, mid 2020, as a Data Science Intern, and I was doing a lot of one-off requests, and I did an audit of all the projects in the space, and [00:04:30] it became very obvious to me, now even more so than at the time, that it was all created by, usually, community managers trying to create a data analysis platform, rather than data scientists or people taking the data science workflow and applying that to open source communities.

There are 25-plus years of research and things in Python and R. And coming from the academic background that I had, I wanted to capitalize on all of the research and the tooling that had been applied to other spaces, and especially the methodology, and then take it to open source. And so that's really what came to be 8Knot.

And now I just want to go into some of the general projects that feed into 8Knot and just the general community metric space. The first one to know about is CHAOSS. There's two major sides of the CHAOSS project under the Linux Foundation. You have the metric side. There's a lot of different working groups that talk about the definitions around metrics.

And I can say, as the person who's done a lot of the technical work around 8Knot, honestly, 90 percent of the work for metrics is the definitions around them. There's a lot of thought that has to go into that. And so it is great to have the resources that CHAOSS does, with a very wide range of defined metrics.

And if you go onto the CHAOSS page and if people are interested in [00:06:00] this, I'm sure we'll probably have time. We can go and I can show y'all. I honestly, whenever people come to me with questions, especially within Red Hat, the first place that I look is the CHAOSS defined metrics. And when people say sustainability, you could go onto CHAOSS's website, type in sustainability.

And they would have a decent portfolio of metrics and how those metrics come together to answer the overall question of sustainability. And on the other side, you have the software, which usually is broken down into. There's a project. And then the one that we work directly with is the Augur project.

And so Augur is a project that you give repositories and it populates a relational database with pretty much any information that you can get around it, GitHub or the like. You're talking issues, PRs. It handles identity and handles the large amount of data engineering that comes with trying to do this type of analysis. It is not realistic for the data scientists to also be responsible for the data engineering. And so we'll get into that a little bit more, but it takes the absolute mess that is the GitHub API and creates structure around it to provide us the ability to do the in-depth analysis that we do. Project 8Knot fits under a larger branch called Project Aspen.

8Knot, this is the community data analysis dashboard. It's [00:07:30] based in a cloud native container. We use Augur, and it is a fully Python native data science toolchain using Dash, Plotly, Pandas, NumPy, any of your common data science packages in Python. There is a publicly hosted version of it, actually two of them; we'll be showing those.

And there's also the ability in their setup for you to host them yourselves. And this is the group for financial institutions. A lot of people in this group are going to be wanting to do this, and they're on their own intranet. And Augur and 8Knot are set up specifically so that you can host your own instances, collect on the repositories that you want, and that's something that you can do pretty easily on your own.

The second portion of it, which we won't talk about too much today, is the open research side of it. And right now we're working on developer network graph analysis, so trying to understand how people move between open source projects so we can understand the churn cycle. See where people are going.

What are those emerging projects, and understanding the behaviors around them? So this is a little bit of a visual representation, at a very high level, of what the technical architecture of 8Knot is. You give your repo and organization URLs to Augur. It's a relational database with GitHub data and an enforced relational structure.

That data populates 8Knot, which is the dashboard, and we have all the different [00:09:00] visualizations. And how do these projects come together? You can think of one side as the data engineering. One side is the data visualization, data science. And on the Augur side, the main maintainer is Sean Goggins, for anybody who is familiar with him under CHAOSS.

And then 8Knot is maintained directly under Red Hat's Open Source Program Office, with me being there. So now let's go into a demo. This is our publicly hosted version that's managed by Red Hat. Anybody can use it and play around with it, see what they think of the analysis. CHAOSS also runs an instance of it. And if people want to use that one, I can provide the link to that as well.

And so this is our, yeah, this is the opening page. I like to point out that most of the things that I'll talk about on how to use it are on the welcome page. It tells you about some of the built-in functionality that comes with Plotly that is maybe not as intuitive if you've never used Plotly graphs before, explains how 8Knot works at a more in-depth level, and then how to log into Augur, because you can log into the Augur front end, our database. And you can create your project groups, which is great because if you have a similar set of projects that you always want to see, then you don't have to select them. But the use case that a lot of people use it for, especially me, is that it triggers [00:10:30] collection in the Augur database.

So if our public instance doesn't have a repository that you want to look at, go create an account. Add that to a user group and that will trigger collection. So then in the next day or two, you can come back and you can see all the visualizations populated around the ones that you care about. So I'll do kind of a vague walkthrough of the app and then we can start answering questions and going into things.

And I'll go through the chat to try to point out which visualizations I would use to answer some of the questions that had been asked. This is just our repo overview page. Some of the things that I find really interesting, and this is something we've started to implement, are knowing what type of languages and files go into this project, by files and by lines of code, and package versioning updates, which can be really important if you're considering a project for the first time: understanding how up to date they are on their dependencies, which can tell you a lot about the overall maintenance of the project.

And then we have the OpenSSF Scorecard, which I know people have been asking a lot for as of late. It hits some of those overarching points that y'all said in the chat. It's a good first stop. It's not where you should end, but it's a good place to start, and they have it very specifically defined what these scores are based off and what they mean, and you can look at them here. Any of the graphs have an info tab that will tell you exactly what data goes into the graph and how [00:12:00] it's synthesized. And then the general info: some of the people had mentioned that they want to know what type of licenses are being used, and so we have that, and then the last release time and the average time between releases. And so this page is really great to be able to get a first look at the repositories that you're looking at. And I'll point out, we only have one repository in the search bar right now. I just put it in. It's a project that I know has a pretty substantial amount of data, but if you have multiple repositories, this will have a drop-down and you can select a couple. Once I go through all of these, it probably will... I did end up collecting on the FINOS one, but some of the secondary collection hadn't completed yet.

And so I wanted to show it with fully populated results. And then I can go back and show some of the stuff that's specific to the FINOS org. I know that's something that you are more familiar with. And so yeah, we have the different pages. This is looking at different forms of contributions. One of the things that I personally look at whenever I'm looking at a project for the first time is the pull request staleness.

It's one thing to know how many pull requests are open over time, but I find it to be a lot more rich to understand the actual behaviors of the pull requests. How long are they open? Is there just a bunch of pull requests that have been open for a year or two years [00:13:30] and they're just stale and sitting there?

Is there a churn of pull requests? A lot of this stuff, and people ask me all the time to make scores for different things. As OpenSSF does that for specific things, that's their thing. For me personally, I don't like to put like a good or bad rating on visualizations and metrics like this because they're just different.

It's context. Sometimes like one, if you have a project, for example, with a pull request activity that has a set of pull requests that have been open for two years and they just don't close them. Yeah, it would be great if they would close those because they're obviously stale and not going to be used.

But if they still have a pretty significant churn of pull requests that are getting opened and closed every month, that's more of a niche detail rather than something that you should be really concerned about. But that's one of the things I look at pretty significantly. For the issues activity, the staleness works the same way.

And in almost all of the visualizations, you can choose the parameters. What is considered stale, or any of the other ones, is very dependent on the size of the repository and the behavior of it. And so we allow users to be able to select the parameters that make the most sense for the communities that they're analyzing.
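As a rough illustration of staleness with user-chosen parameters, here is a minimal Python sketch. It is not 8Knot's code: the function name, the bucket labels, and the default thresholds of 30 and 90 days are assumptions made for the example.

```python
from datetime import date


def classify_open_prs(opened_dates, today, staling_days=30, stale_days=90):
    """Bucket open pull requests by age, using caller-chosen thresholds.

    The thresholds are hypothetical defaults; a small, slow-moving repository
    would likely want larger values than a busy one.
    """
    buckets = {"fresh": 0, "staling": 0, "stale": 0}
    for opened in opened_dates:
        age = (today - opened).days
        if age >= stale_days:
            buckets["stale"] += 1
        elif age >= staling_days:
            buckets["staling"] += 1
        else:
            buckets["fresh"] += 1
    return buckets


# Three open pull requests, evaluated on a fixed "today" so the result is stable.
open_prs = [date(2024, 5, 30), date(2024, 4, 1), date(2022, 6, 1)]
today = date(2024, 6, 8)

default_view = classify_open_prs(open_prs, today)
relaxed_view = classify_open_prs(open_prs, today, staling_days=70, stale_days=1000)
print(default_view)
print(relaxed_view)
```

The same three pull requests land in different buckets under the two parameter sets, which is why a single fixed definition of "stale" does not suit every community.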

Another thing is we have the assignment count. So it's understanding how many of the issues or the pull request actually end up getting assigned. That's another one of those things. It's good, bad [00:15:00] or otherwise. It's just the behavior of the project. One thing that I also really like to look at is the time to first response around pull requests.

I think it gives a pretty interesting view of how active the community is around when pull requests are being made. And one thing that's good to note, that I forgot, is that all of these visualizations automatically filter out bot data, or GitHub-identified bots. If you want the bot data, you can click to turn off that filtering.
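Time to first response with bots filtered out can be sketched as follows. This is a simplified illustration under stated assumptions, not how 8Knot or Augur compute it: the record layout, the `is_bot` flag, and the function name are all hypothetical.

```python
from datetime import datetime
from statistics import median


def hours_to_first_human_response(opened_at, comments):
    """Return hours from PR open to the first non-bot comment, or None.

    `comments` is a list of (timestamp, author, is_bot) tuples; the is_bot
    flag is assumed to come from the platform's own bot identification.
    """
    human_times = [ts for ts, _author, is_bot in comments
                   if not is_bot and ts >= opened_at]
    if not human_times:
        return None
    return (min(human_times) - opened_at).total_seconds() / 3600


# Hypothetical pull requests: one answered after a bot, one answered a day
# later, and one that only a bot has touched.
prs = [
    {"opened_at": datetime(2024, 6, 1, 9, 0),
     "comments": [(datetime(2024, 6, 1, 9, 1), "ci-bot", True),
                  (datetime(2024, 6, 1, 15, 0), "maintainer-1", False)]},
    {"opened_at": datetime(2024, 6, 2, 10, 0),
     "comments": [(datetime(2024, 6, 3, 10, 0), "maintainer-2", False)]},
    {"opened_at": datetime(2024, 6, 3, 8, 0),
     "comments": [(datetime(2024, 6, 3, 8, 2), "ci-bot", True)]},
]

response_hours = [hours_to_first_human_response(pr["opened_at"], pr["comments"])
                  for pr in prs]
answered = [h for h in response_hours if h is not None]
median_hours = median(answered)
print(response_hours, median_hours)
```

Without the bot filter, the first pull request would appear to get a response within a minute, which says nothing about how quickly people engage.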

I think most people just want that out of the case. So these are first responses by people. And you can see how the conversation rate is happening around the pull requests over time. And I see the question, is GitLab supported? Yes and no. That's something that we're actively working on right now.

Augur is set up to be compatible with any Git-based repository. Right now we're actively working on GitLab, going and doing an audit, because the API doesn't exactly line up. And while you can choose to collect on GitLab, you're not getting as in-depth data at this point, and GitLab doesn't provide as much data as the GitHub API.

So we're trying to figure out how to turn off different visualizations so people don't get any misconceptions about those projects just because that data doesn't exist on that platform. Yeah, I'll keep on [00:16:30] trucking through and going through some of the visualizations that I look at pretty regularly.

This one, I would say, is probably the biggest, the most important visualization that I look at in every single use case: contributor growth by engagement. Your contributor base, over time, is always going to go up and to the right. You can't have negative total contributors over time.

And so that's always going to be the case. But seeing the consistency of the drifting and active contributors, I think, tells you a lot about the community. What are the trends? Is it staying really stable? And the size of it, is it going down? Is it going up? This right here, I think, tells you a whole lot about the active state of the community and where it's at in its life cycle.

And so I'll keep on going through another one of the big ones. I don't know how familiar people are with different common data analysis around community metrics. One of the big ones that people will talk about is bus factor. The analogy is that, okay, if one or two people get... which is pretty... which is why, where bus factor is the common term, we use lottery factor, which is: if one or two of your contributors win the lottery and are never seen on the internet again tomorrow, what is the impact on your community?

And one thing that I like to look at, and I might pull up another visualization so we can see this a little more specifically, but [00:18:00] looking at lottery factor over time. And this is by each individual contribution type. I'm just gonna see. Over time, you can see here that the concept of lottery factor, bus factor, at least in this case, is that it's over a six-month window, and it's: what is the least number of contributors that make up 50 percent, in this case, of that style of contribution?

For example, in this month, it means that, at the least, there are 51 people who contributed 50 percent of the opened issues, which is actually an incredibly high lottery factor, which is really great for this community. That means that there's a whole lot of people. This is also kind of showing some of the functionality of Plotly, seeing that.
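The lottery factor described here, the smallest number of contributors who together account for a given share of one contribution type, can be sketched in a few lines of Python. This is an illustrative sketch, not 8Knot's implementation: the function name is hypothetical, and the input is assumed to already be limited to the time window of interest (six months in the talk).

```python
from collections import Counter


def lottery_factor(contributions, threshold=0.5):
    """Smallest number of contributors covering `threshold` of contributions.

    `contributions` is an iterable with one contributor id per contribution
    (for example, one entry per commit or per opened issue).
    """
    counts = Counter(contributions)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    # Walk contributors from most to least active until the share is reached.
    for rank, (_contributor, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= threshold * total:
            return rank
    return len(counts)


# Hypothetical commit authors: one person wrote 6 of the 10 commits.
concentrated = ["ana"] * 6 + ["bo"] * 3 + ["cy"] * 1
# Four people with one commit each.
spread_out = ["a", "b", "c", "d"]

print(lottery_factor(concentrated))   # one person covers half the commits
print(lottery_factor(spread_out))     # two people are needed
```

A low number means the contribution type depends on very few people; computing it separately per contribution type and per window gives the over-time view shown in the demo.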

Okay, that's great to know, but I want to get a better view of the rest of the data. I now know what the issues opened one is. Let me click that and start to look at some of the other activities. Overall, actually, this project has a pretty solid lottery factor. A lot of times you end up seeing your lottery factor at 2, 3.

Once you're seeing a lottery factor... But you can see, over time, now it's becoming more and more dependent on a small number of people. And so when people are considering, okay, what is the viability of this project, these are the type of things that you can consider once you get more in-depth data. You're not going to be able to look at a repository and know that in Q3 of last year there were only three people who made up 50 percent of [00:19:30] the commits of that community. And so that's a pretty substantial piece of information to know. I like looking at this over the span of time. This one looks at it in a pie chart, and you can choose what time period, how many contributors you want to see.

So these are two different views of what I view as a really rich data set that wouldn't have been possible in the tool base that existed before 8Knot. I think this is a really good "why did we create this?" And I point to visualizations like this. People are asking who is involved in the project.

This is where I'll point to the affiliation page. Whenever you're looking at company or organizational affiliation of a project, it is a very hard question to answer. And so the strategy that we go about doing it is looking at that question from a couple of different angles. This is never something where you should be like, oh, this is the exact number of contributors for X or Y company.

Unless you have very specific data that's much more specific than this is. This is more to get a general idea. Okay. After looking at it from a couple of different perspectives, I know that X company is significantly involved and then Y company is a little bit less involved. And here are some of the other major players.

And so this is about six different... we have five different views on organizational diversity, or the affiliations. And this is one of those times where it's good to look at the about graph, and it'll [00:21:00] tell you the exact definition for each of these visualizations, but I'll point out each one of these and how they're different.

So this first pie chart is just the unique contributor count by email domain. So that means if contributor A has, I don't know, a thousand commits and contributor B has one commit, those two contributors are counted the same, versus in the second graph contributor A would be counted a thousand times and contributor B would be counted once. Another thing that we find to be pretty rich is the ability to look at associated activity. GitHub has a storage of every email that's ever been associated with your account. So if you open one pull request, you have one email for your notifications, and you use three different emails at different points for commit logs, all of those are associated with a singular account, and you honestly don't know which of those identities or domains that singular person is contributing on behalf of, so it's just the associated activity. A lot of times I like to exclude Gmail, exclude GitHub. Actually, yeah. There it is over there. And so you can see Silo, Apache, IBM, Microsoft, Red Hat for Istio.
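The difference between the first two charts, counting each contributor once per domain versus counting every commit, can be illustrated with a short Python sketch. The data layout, the function names, and the `corp-a.example` and `corp-b.example` domains are hypothetical; this is not 8Knot's code.

```python
from collections import Counter


def domain_of(email):
    """Return the lower-cased domain part of an email address."""
    return email.rsplit("@", 1)[-1].lower()


def unique_contributors_by_domain(commits):
    """Count each (contributor, domain) pair once, however many commits."""
    seen = {(contributor, domain_of(email)) for contributor, email in commits}
    return Counter(domain for _contributor, domain in seen)


def commits_by_domain(commits):
    """Count every commit, so prolific contributors weigh more."""
    return Counter(domain_of(email) for _contributor, email in commits)


# Contributor A has a thousand commits, contributor B has one.
commits = [("A", "a@corp-a.example")] * 1000 + [("B", "b@corp-b.example")]

unique_view = unique_contributors_by_domain(commits)
weighted_view = commits_by_domain(commits)
print(unique_view)
print(weighted_view)
```

In the unique view the two domains look equally involved; in the commit-weighted view one dominates. Neither is the right answer on its own, which is the reason for looking at affiliation from several angles.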

That makes a lot of sense. But you can get a pretty good idea of the overall large players in this. And what's interesting is a lot of times in these, it actually does make sense, because Google, [00:22:30] the Google at Google is different than Gmail. And Istio was owned by Google for a very long time.

And so you have Google as the big player. And then you have a couple of different companies that are heavily involved. And then this one's just the one where people choose to put their companies in their profiles. So I'll actually skip over this one because I want to show these visualizations with a little bit better data for those.

But I know somebody mentioned something about, okay, there's a big exodus, or there's some type of implosion that's happening in a project. How can I predict that or understand where there's risks? This is another one of the more complex visualizations. I'll try to keep it relatively light.

But a lot of times, especially if you're managing a community, you want to know where the problem areas are before it becomes a larger issue. So, for example, if you have a large portion of your code base that's unmaintained, you can see what the contribution activity is versus how long it has been since somebody who was active in that portion of the code base has been seen in any part of the community. Say, for this example, we can go into the common folder: even if somebody has not committed in that folder in three years, but they are involved in other parts of the project, opening issues, they're still actively involved in the community, they still have that [00:24:00] knowledge base.

But if you have sections of your code base where that knowledge retention isn't there, you want to know that to be able to manage it before a huge backlog of PRs is open, for example. And you also know if you want to try to convert people from a contributor to a maintainer. So this is just a good example of some of the more in-depth visualizations, once you get into some more mature community management or trying to understand a project from the outside, being like, okay, there are portions of this code base that seem like they're pretty unmaintained. How do I feel about that?

Does that impact my want to get involved, or does that impact the way that we get involved? So I just talked a lot. I'm going to put in the FINOS repo, even though I know it's not completely full, but actually I'm going to do the CHAOSS one to show the last visualization that I want to really go into.

So I think it's a different visualization than a lot of the other ones. And this is project velocity. All of these other visualizations group together the contents of the search bar, so all the different repos. Project velocity compares the different repositories, and you can choose what you care about.

And being like, okay, issues being opened or PRs being merged, what date range, and you can see different repositories' activity compared to one [00:25:30] another in a pretty systematic way. And so if you're trying to choose between a couple of different repositories, or understanding the activity and size in comparison, this is a great visualization to go to, to understand how to compare them all.
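The idea of comparing repositories on one chosen activity type over a chosen date range can be sketched like this. It is a simplified illustration, not the project velocity calculation used in 8Knot: the event tuples, the event type names, and the `org/repo-a` and `org/repo-b` repositories are assumptions.

```python
from collections import Counter
from datetime import date


def activity_by_repo(events, event_type, start, end):
    """Count events of one type per repository within [start, end]."""
    return Counter(
        repo for repo, kind, day in events
        if kind == event_type and start <= day <= end
    )


# Hypothetical activity log: (repository, event type, date).
events = [
    ("org/repo-a", "pr_merged", date(2024, 5, 3)),
    ("org/repo-a", "pr_merged", date(2024, 5, 20)),
    ("org/repo-a", "issue_opened", date(2024, 5, 21)),
    ("org/repo-b", "pr_merged", date(2024, 5, 10)),
    ("org/repo-b", "pr_merged", date(2024, 3, 1)),
]

merged_in_may = activity_by_repo(events, "pr_merged",
                                 date(2024, 5, 1), date(2024, 5, 31))
print(merged_in_may)
```

Unlike the earlier views, which pool everything in the search bar, the result here keeps one number per repository so the repositories can be set side by side.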

So that's my main spiel and let's hope everyone has lots of questions and we can talk more about it.

Brittany Istenes: This is one of the coolest metrics tools that I've seen, right? You know what I mean? When you showed me this at the Member Summit, I was just gobsmacked. And to see how it's grown especially that community management piece that you just said, as well as the dependencies, like that's new. 

Cali Dolfi: We also have a complete redesign coming soon. It's always the joke about when engineers are left to design; that's what happened with this. And now we actually have some amazing designers in CHAOSS Africa who have done a complete redesign.

They're wonderful UX/UI designers. And so we have that in the works. Okay. Is my firm. Yes, and so you can create a view of select. If I wanted to put Istio, and then, I don't know, the FINOS Perspective, any set of repositories, I can put in the search bar. And that's also one of the main reasons why we have the user groups, because if there's the same set of 20 repositories you're always wanting to go back and look at, it gets annoying to select each one each time. So you can just make your account, select your [00:27:00] user group, and look at them all together. A lot of times I do that if I'm doing CNCF analysis for Red Hat. I just have the different groupings so that whenever I log in, I can just say I want to see CNCF incubating projects and select that, instead of having to put a hundred-and-something repositories in a search bar.

Rob Moffat: Awesome. Thanks, Cali.

Kay XiongPachay: Cali, I had a question, but thank you for joining the call and demoing this. This is really great. I joined a little late. Can you explain the difference between Augur and 8Knot?

Cali Dolfi: Yeah, definitely. Yeah, I have a good visualization for this, actually, that I like to use. So how you can think about it at a very high level is you give Augur your repo or organization URLs.

And that stores it, populates the data in a relational database, that Augur database is its own thing, has all the data, and the data goes into 8Knot. So if you've ever used a tool like Superset, maybe, or any of the dashboarding tools, you put in the database credentials into the visualization platform, and that's how it accesses the data, and that's how it works for 8Knot.

When you launch 8Knot, if you run it locally, you just give it the database credentials, and then that's how it populates all the visualizations.

Rob Moffat: Yeah so it reminds me a bit of some of the stuff you get out of Backstage. I was wondering if this plugs into Backstage or it can be used with [00:28:30] it, or? 

Cali Dolfi: I've heard the name before, but I'm not super familiar with how it is.

Rob Moffat: Yeah. James, can you describe what Backstage is? I think I'm probably going to mess it up. 

James McLeod: Yeah, I can. So Backstage is a developer portal which was contributed into CNCF by Spotify. And it's very versatile. It comes with a whole host of different plugins and extensions. But really what it does is enhance developer experience around open source, maybe even InnerSource projects, by giving you an entry point where you can search for something that may be of interest to you and your engineering team.

And then there's an easy onboarding and easy contribution model. What Rob was saying about metrics coming into Backstage is actually really interesting. I didn't think about that. And in the background, I've been pulling together a bit of a roundtable surrounding this subject.

Yeah, that would be really interesting, to see whether Backstage as a dev portal could support this type of view.

Especially against, project repositories. 

Brittany Istenes: I definitely think that it could like Backstage essentially what James was saying, plug and play tool, I want to, I got to spin up this project, I need to find all my reusable components.

Backstage, it's like a curated library based on all of your reusable components there. What would be really nice is to have that metrics backing point, like 8Knot, to bring into a Backstage integration, because essentially it's just another tool connected with all of the repositories that [00:30:00] you're currently leveraging.

Ideally, so I'm saying with InnerSource, what James was saying too, it'd be perfect, because all of a sudden you have your 50 tools that you're leveraging in Backstage. Now you can have all the metrics behind it, like quantifiable quality metrics. That's actually a really good idea, to potentially look at a Backstage integration, and they have a large maintaining pool.

Cali Dolfi: Yeah. I'm going to definitely look into that. I'm going to write that down because it reminds me of some of the conversations that I'm having around analyzing SBOMs, because it's now just a huge thing and there's some tools that I'm hoping to get open sourced.

Where they're analyzing SBOMs and then trying to use 8Knot to provide more in depth analysis on the communities that are involved in the dependencies identified in an SBOM. 

Brittany Istenes: Yeah, and isn't there, there's a DevOps, John Mark, I think he runs the DevOps SIG. John Mark and a couple other folks run the DevOps SIG for FINOS.

I think that happens on Thursdays. Cali, I could forward you the invite. They talk a lot about Backstage and Backstage integrations if you wanted to join in and jump in and meet some of those folks too. 

Cali Dolfi: Yeah, send me over that information. I'll be curious. Like I said, I wrote some notes down. I'm gonna look into it.

Rob Moffat: We actually have a Backstage Working Group as well, every couple of weeks. But yeah, I think it's a good fit, because Backstage has got a data model, but it's federated, so you can plug in different sources of data. So I think it would work quite well.

Cali Dolfi: I'm also just going to put this in the [00:31:30] background while we're talking, in case people want any pointers on getting involved with something that you're interested in. If you have any specific community questions or visualization ideas, open an issue. If you hit a bug, please open an issue. We're a young project, and even just going and playing with the hosted application and breaking it is helpful, so that we know where those pain points are and we can fix them.

And if you have a company that's looking to leverage 8Knot, email me and let's talk about it. I can help talk through how that would work, how having your own instance would work, and Augur. And then, if you're looking to get more deeply involved, these are the different areas of support, and some of our needs, from a technical contribution standpoint.

Rob Moffat: It's worth us just going back through the questions that got raised at the beginning of the meeting to see what we can try and answer now.

Cali Dolfi: Yeah, let's go back to Drew. 

Rob Moffat: Amol raised a whole bunch of things. I think you've answered a couple, like finding out who's supporting a project, but he asked about licenses and security issues.

Do you do any reports around that kind of thing? I suppose you have to, yeah.

Cali Dolfi: so the, I don't want to take CHAOSS out or yeah. For the security side of it, that's when I was leaning into the OpenSSF scorecard. The hope is in the future that we'd be [00:33:00] doing some security analysis from an SBOM perspective.

GitHub now provides like SBOM outputs of all the repositories, which is super useful. 

Thomas Steenbergen: Please don't use those. Please don't use them. 

Cali Dolfi: No, I haven't gone down this alley at all yet. So I'm glad to know that they're bad. 

Thomas Steenbergen: They're terrible. And that's coming from somebody who works on SBOM standards. Basically, I'm the co-author of the SPDX one.

So we love that they're in SPDX, but the quality is an issue. In the same way, just as a note, I know the licensing data and the dependency data that you have is not accurate. It's a good indicator.

Cali Dolfi: for which one? 

Thomas Steenbergen: From Augur. I know Sean, so I've been talking to him about how it works.

Both the dependency data and the licensing data are not accurate. Okay. And that's just the way it works. For me, that's also not how I'm using 8Knot.

Cali Dolfi: Oh. Have you been using 8Knot like regularly? That's off. Like 

Thomas Steenbergen: For me it's an indicator. It's an indicator tool that something needs further research.

Cali Dolfi: Yeah, 

Thomas Steenbergen: It's not. And maybe that's to be clear: there are different types of tools, and they have a different kind of color for me. For me, this is a community assessment tool. I look at the health of projects, and it just gives me some insights. I don't use it for licensing.

I don't use it for security because the data is simply not accurate enough. 

Cali Dolfi: Yeah, if there's any data that you're seeing that's [00:34:30] inaccurate, I really ask that you please open an issue.

Thomas Steenbergen: No. So I discussed this with Sean. So the problem, and again, this is maybe too much detail: the way Sean does dependency analysis, yeah, it gives some, but it's inaccurate.

Similarly, the license reported is only the main license or the declared license. It's not the actual license.

Cali Dolfi: Okay. 

Thomas Steenbergen: And look, I've worked in this field for eight years now. I was running OSPOs and basically looking at actually having the concrete data, because working in automotive, we actually get fined for this stuff if we get it wrong.

Cali Dolfi: Oh yeah. A hundred percent. 

Thomas Steenbergen: So again, this whole field... I know a lot of people want to have a tool for everything and connect everything to everything, but getting dependencies and licensing right is way more complex than most people think it is. Yeah.

Cali Dolfi: Oh, yeah, 

Thomas Steenbergen: We're working on open sourcing this, like making a data pool. A lot of the German automotive manufacturers are now working on open sourcing way more data on there.

But yeah, we're not yet there. 

Cali Dolfi: Yeah, completely understandable. But again, please open issues for anything that you see is inaccurate, because then we can at least change the representation of it, making sure that people who don't have as much knowledge as you do understand what they're looking at.

Rob Moffat: Yeah, I was wondering, can you give us a bit of an insight about this, because you're using it, right?

Brittany Istenes: I am using it externally, right? Just working in, just doing my own stuff. But what we're trying to do is we are trying [00:36:00] to bring it in house, right?

That's my goal. There's only so much I can say, but I do want to bring it in. It's just a matter of being able to support it. And I am also of the mindset that if I'm going to be consuming something from the open, and I would want to file issues, I want to be able to give back to it. So I'm doing a two-pronged attack here: I want to actually have some open source maintainers be able to give back to this project. That's my goal. So I'm hoping that maybe by the end of the summer, Cali and I could come back together and talk about how we were able to leverage this tool. And to Thomas's point, too: yep, not all data is going to be completely accurate with any particular tool. But I'd say, honestly, if you could find a more comprehensive metrics tool from a community standpoint, right?

I'd love to see it. One that's not like a FOSSA or something like that. I'd love to see it.

Thomas Steenbergen: So what we have been working on, now that we've got funding pools, is to basically pull the things that Sean is doing into our complete set. Because again, we already have a really good open source SCA tool.

We already know all of the dependencies. We already have good licensing data. We already have some security data. Now the next thing is adding community stuff there, but we do it slightly differently. I use tools like 8Knot to visualize things, but since as an OSPO I wanted to do things at scale and at speed, we actually write policy rules on top of it. [00:37:30]

So our engine has all kinds of policy rules for licensing, security, and InnerSource, and we want to write similar policy rules on community data. Tons of people have asked me for one rule that, personally, I think is a stupid rule: "We don't want to have dependencies that have a single maintainer."

I think, good luck, because 40 percent of the Node ecosystem is a single maintainer, but hey, if they want to write those rules, so be it. For me, it's more interesting to see what you showed, the shift in data. So if you see, hey, new contributors coming in are trending down, that would be, for me, an interesting signal to write a rule to tell the development team: "Hey, your core dependency, the open source project that we just invested millions in to build something on top of, seems to be losing contributors at a rapid rate. You might want to look into that."

Look, we're not going to block their builds, but as I said, we want to write rules that give a signal to the team: "Hey, critical dependency." And then maybe, as an OSPO, we might want to look at it: if the team is not responding, then maybe as an OSPO I now need to step in, because maybe in my aggregate data I can see, hey, we have nine teams using that thing, and if that dependency goes down, I have a problem in my organization.
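A rule of the kind Thomas describes here could be sketched roughly as follows. This is only an illustration in Python, not code from 8Knot, Augur, or Thomas's toolkit: the function names, the three-month window, the 50 percent threshold, and the input shapes are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    dependency: str
    message: str
    blocks_build: bool  # always False here: the rule informs, it does not block


def new_contributor_drop(monthly_new_contributors: list[int], window: int = 3) -> float:
    """Fractional drop in new contributors between the previous and the latest window."""
    if len(monthly_new_contributors) < 2 * window:
        return 0.0  # not enough history to compare two windows
    previous = sum(monthly_new_contributors[-2 * window:-window])
    recent = sum(monthly_new_contributors[-window:])
    if previous == 0:
        return 0.0
    return max(0.0, (previous - recent) / previous)


def contributor_decline_rule(
    dependency: str,
    monthly_new_contributors: list[int],
    teams_using: list[str],
    threshold: float = 0.5,  # hypothetical cut-off, to be tuned per organization
) -> Optional[Signal]:
    """Return a non-blocking signal when new-contributor inflow falls sharply."""
    drop = new_contributor_drop(monthly_new_contributors)
    if drop < threshold:
        return None
    message = (
        f"{dependency} lost {drop:.0%} of its new-contributor inflow; "
        f"{len(teams_using)} team(s) depend on it."
    )
    return Signal(dependency, message, blocks_build=False)
```

The `teams_using` list stands in for the aggregate view Thomas mentions, where an OSPO can see how many teams would be affected if the dependency declined.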

Cali Dolfi: Yeah, I really like that. A lot of times people, especially in early phases, and I get it, just want me to say, "Okay, if my number is here, or it looks like this, then this is good, bad, or otherwise." It's more that you can't just investigate everything endlessly. [00:39:00] You don't have enough time. My hope with a lot of these visualizations is that they point you in a direction of investigation to understand more about your community, and how you want to invest, or how you want to protect yourself, or any of those things.

Thomas Steenbergen: Yeah. So the workflow that I've been looking at, before I got distracted with other stuff, was to basically have ingested data with rules on it. So rear it. And then in the rules, we have a so-called "how to fix me" text, with a link to 8Knot to basically show, "Hey, this is where you see the actual visualization of what is happening."

Because that's what we don't do. Our tool is ingesting data, writing rules on it, and potentially blocking a build. And then the "how to fix me" text is where we link to another tool to basically say, "Here you see the actual visualization of why it is." Again, the whole idea for us is to basically consume multiple data streams, fire them against our open source policy as code, and then make digital decisions.

And then basically refer to the other tools, instead of what I hated: we have all of those great tools, but for my developers it's, "Oh, I have this tool for this, this tool for that, this for that." It's just too much.
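The workflow described above (ingest data, evaluate rules, possibly block a build, and attach a "how to fix me" text that links out to a visualization) might look something like this minimal Python sketch. Everything in it is an assumption made for illustration: the `Rule` and `RuleResult` types, the example rules, and the dashboard URL scheme are hypothetical and do not reflect the real interface of 8Knot or of Thomas's tooling.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical base URL of a hosted visualization dashboard.
DASHBOARD_BASE_URL = "https://dashboard.example.org"


@dataclass
class Rule:
    name: str
    blocks_build: bool
    is_violated: Callable[[dict], bool]  # receives the ingested project data
    advice: str


@dataclass
class RuleResult:
    rule: str
    blocks_build: bool
    how_to_fix: str


def run_rules(project: dict, rules: list[Rule]) -> list[RuleResult]:
    """Evaluate every rule and attach a how-to-fix text that links to the visualization."""
    link = f"{DASHBOARD_BASE_URL}/repos/{project['name']}"  # hypothetical URL scheme
    results = []
    for rule in rules:
        if rule.is_violated(project):
            how_to_fix = f"{rule.advice} See the visualization: {link}"
            results.append(RuleResult(rule.name, rule.blocks_build, how_to_fix))
    return results


# Example policy: one blocking license rule and one non-blocking community rule.
EXAMPLE_RULES = [
    Rule(
        name="license-allowed",
        blocks_build=True,
        is_violated=lambda p: p["license"] not in {"MIT", "Apache-2.0"},
        advice="Ask the OSPO to review this license.",
    ),
    Rule(
        name="single-maintainer",
        blocks_build=False,
        is_violated=lambda p: p["maintainers"] < 2,
        advice="Consider the bus-factor risk of this dependency.",
    ),
]
```

Developers then see one rule result with one link, rather than having to know which separate tool covers which concern.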

Cali Dolfi: And also, I'm sorry, who is "we"? What organization or company are you talking about?

I'm just curious.

Thomas Steenbergen: So the project that I work for is a toolkit where we're a collection of, for the most part, German automotive manufacturers. [00:40:30]

Cali Dolfi: Oh, nice. 

Thomas Steenbergen: And all kinds of suppliers. And currently I'm basically in between jobs. So I've been one of the founders of the project, where we're trying to really get all of the data in, and there are other commercial tools that claim to do the same and have this community data.

The problem that I have with them is that they define the metrics. 

Cali Dolfi: Yeah. 

Thomas Steenbergen: And I don't agree with those metrics. So I had conversations with both Dawn and Sean from the CHAOSS community. At first we were like, "Oh, we are also going to define our metrics." And now, as I'm talking to Dawn and Sean, it's more, "No, we should just basically figure out a way that our users can define by themselves how they want to do the metrics."

Maybe we have reference implementations where they could just pick and choose. And I think that's also where the CHAOSS community has a nice overview, where they're now saying, "Oh, if you want to do this, here is the menu of metrics you can pick from."

So we might want to mimic what they're doing and make it like an a la carte menu where you can pick from.

Cali Dolfi: Yes, I'm just bringing up the metrics for people who aren't familiar with them, because I'm a big fan of them, especially as a starting point. Once you go down the rabbit hole you start to learn a lot more, and I still honestly refer to a lot of this stuff. They have so many defined metrics, and they're not the ending point, but they are a really good starting point.

I'm just clicking around, like development response to this, and [00:42:00] they have metric models and user stories. So, especially if you're starting out on your open source metrics journey, CHAOSS has a ton of awesome resources. And the people, Dawn Foster, the head of data science for them, and Sean Goggins, are just fantastic people to work with.

Rob Moffat: Good tip. Thank you for that.

Question: So first of all, great project. I had one observation. Are there any thoughts around what type of contributors a good, healthy open source project needs? For example, as you mentioned, now we have designers, right? Similarly, if you don't have technical writer contributors, the project suffers.

Some of them might, some of them might not, or maybe the engineers are taking up those roles. I was wondering, how do you assess, and maybe then later on visualize, profiles of the contributors, like engineering background, product background, design background, technical writers, etc.?

Cali Dolfi: I'll be honest. I'm not quite sure.

We've never gone down that alley. On the spot, if somebody was like, "Figure out from this project what type of activity we have around different types of contributions," and if there were [00:43:30] different portions of the code base where those things normally happened, I'd probably look at the activity around those portions of the code base to get an idea of how many people are doing that style of work.

But that's just off the top of my head. It's something that I've never explored, and I'm not really sure how to do that directly without people self-identifying as designers or different...
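The off-the-cuff approach Cali outlines, looking at activity in the portions of the code base where a given kind of work normally happens, could be approximated with a sketch like the one below. It is not an 8Knot feature; the directory-to-area mapping and the `(author, file_path)` input format are hypothetical assumptions, and a real repository would need its own mapping.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from top-level directory to the kind of work done there.
AREA_BY_TOP_LEVEL_DIR = {
    "docs": "documentation",
    "design": "design",
    "src": "engineering",
}


def contributors_by_area(changes):
    """Count distinct authors per area, given (author, file_path) pairs."""
    authors = defaultdict(set)
    for author, file_path in changes:
        parts = PurePosixPath(file_path).parts
        # Files at the repository root have no top-level directory.
        top_level = parts[0] if len(parts) > 1 else ""
        area = AREA_BY_TOP_LEVEL_DIR.get(top_level, "other")
        authors[area].add(author)
    return {area: len(people) for area, people in authors.items()}
```

This only infers the style of work from where changes land; it says nothing about a contributor's actual profession, which is the limitation Cali points out.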

Question: I was thinking the same. GitHub doesn't have a way to specify what profession you're in.

Sorry, go ahead. 

Thomas Steenbergen: I can help you out, since, wait for it: Grimoire, the other tool that CHAOSS has. If you go to the software tab, there are two tools listed from CHAOSS. Grimoire is the other tool, and in Grimoire you can make, now, don't quote me on the exact feature name, but think of it as a people map.

So you can augment data for particular users. For instance, I have multiple GitHub and GitLab accounts, and you can basically tie those all together to my identity. And then you can add additional data attributes to me, for instance saying, "Oh, I am maybe a JavaScript developer." Grimoire has the capability for this, but you have to do that yourself.

There are a couple of people who wrote some scripts, but you have to do it yourself. So the data models are there to do that, but you have to do the data collection and the mapping and [00:45:00] all the stuff. And then in Grimoire, you can do the drill-down and see: these are developers, these are designers. It's all there, but you need to do a lot more legwork before you can.

So technically, it's designed for when you're in your own organization and, for instance, have access to your internal Active Directory. One of the ToDo Group members that I know basically wrote a little tool to take the Active Directory, where they know the person's department, and then they get that data and generate the map from it.

And then they can see in their metrics: oh, these people are from the design department, these people are from engineering, et cetera. But yeah, the problem is you need to have the data source.
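The "people map" idea Thomas describes, tying several accounts to one identity and then enriching that identity with an attribute such as department, can be illustrated with a small Python sketch. This is not Grimoire's actual data model or API; the two dictionaries stand in for a hand-built identity map and a directory export, and all names in them are made up.

```python
from collections import defaultdict

# Hypothetical identity map: (platform, account) -> person.
ACCOUNT_TO_PERSON = {
    ("github", "alice-dev"): "alice",
    ("gitlab", "alice.w"): "alice",
    ("github", "bobdesigns"): "bob",
}

# Hypothetical enrichment data, e.g. exported from an internal directory.
PERSON_TO_DEPARTMENT = {
    "alice": "engineering",
    "bob": "design",
}


def contributions_by_department(events):
    """Sum contribution counts per department from (platform, account, count) tuples."""
    totals = defaultdict(int)
    for platform, account, count in events:
        person = ACCOUNT_TO_PERSON.get((platform, account))
        # Unmapped accounts and unmapped people both fall back to "unknown".
        department = PERSON_TO_DEPARTMENT.get(person, "unknown")
        totals[department] += count
    return dict(totals)
```

The sketch reports totals per department rather than per person, and it makes the dependency Thomas names explicit: without a maintained mapping and a data source for the attributes, every account lands in "unknown".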

Brittany Istenes: Yeah, but also, Cali and I, we talked about this before, too. Yeah, Grimoire is, it's a great tool, but it's insanely complicated, right?

It is. It's incredibly complicated. You don't want to necessarily go into your LDAP system. 8Knot simplifies a lot of different facets of this. Also, you need to be careful when you're talking about getting your Active Directory data and pulling that in alongside the contributions. That's a whole different level of privacy.

And I don't know how many companies right now allow that sort of situation: "This person made X number of contributions." You really shouldn't be doing that. For any company that's doing that, that's a psychological safety issue, and it brings up some ethical issues in my mind.

But what you can do within 8Knot, which Cali showed me, is break it up a little bit more easily than in Grimoire. Grimoire is genius: super smart, really hard to support. But you can actually break it up by organizations, like your enterprise architecture or [00:46:30] your principals or things along those lines.

There's a way that you can do that, but like I said, I gotta file some issues, Cali. I'm sorry, I do, I gotta file a couple of things.

Cali Dolfi: Please.

Brittany Istenes: Because I want to develop this project too. I'm gonna open some issues and I'm gonna learn how to do a couple of things because of this.

Thomas Steenbergen: Yeah, no, but I agree.

Grimoire is like looking at a massive switchboard with too many dials. It's very powerful, but it's really an overload.

Brittany Istenes: Yeah, it's definitely complicated.

Rob Moffat: All right. Look, I'm going to jump in at this point and just thank Cali for coming along to present today. That was brilliant, and obviously it provoked a lot of passionate debate as a result. Well done. That was great. I'll get this up on the OSR website in a week or so, so we can share it with the rest of the internet, but yeah, this has certainly prompted me to think about a lot of things.

Yeah, thanks for coming along. Does anyone have any last things to say or should we leave it there? No, we're all good. Yeah. All right, brilliant. Thanks. Thanks for joining everybody and we'll see you next time.