Need direction on weighting/normalizing "keywords" in a dataset

My apologies in advance: I am not a statistician.

That said, I’m trying to determine how to weight keywords from a data set of biographies so as to determine what the people do for work.

The data set comprises the biographies of attorneys. In those biographies, certain keywords are indicative of particular legal practice areas. For instance, the terms “private equity”, “M&A” and “acquisitions” are indicative of an attorney who practices in the private equity sector. As another example, the terms “litigation”, “court” and “jury” are indicative of an attorney who practices litigation.

Now, very few attorneys practice in only one area. In other words, an attorney might practice finance, structured finance, capital markets and private equity all at once. So, what I’m doing is running some formulas in Excel against the cells that hold each biography’s text, in order to count the number of instances of keywords that are specific to a particular practice area.
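In case it helps to see the counting step outside of Excel, here is a minimal Python sketch of roughly what my formulas do. The keyword lists are shortened to the examples above, the function names are made up for illustration, and the whole-word matching rule is an assumption (my Excel formulas may count slightly differently, for example on plurals).

```python
import re

# Illustrative keyword lists only; the real lists are longer and live in Excel.
PRACTICE_KEYWORDS = {
    "private equity": ["private equity", "M&A", "acquisitions"],
    "litigation": ["litigation", "court", "jury"],
}


def count_keyword_hits(biography, keywords):
    """Count case-insensitive occurrences of every keyword in one biography."""
    text = biography.lower()
    total = 0
    for keyword in keywords:
        # Lookarounds stop "court" from matching inside "courtesy" while still
        # handling keywords such as "M&A" that contain non-word characters.
        # Assumption: whole-word matches only, so "courts" does not count.
        pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
        total += len(re.findall(pattern, text))
    return total


def count_by_practice_area(biography, practice_keywords=PRACTICE_KEYWORDS):
    """Return {practice area: number of keyword hits} for one biography."""
    return {
        area: count_keyword_hits(biography, keywords)
        for area, keywords in practice_keywords.items()
    }
```

Calling `count_by_practice_area(text)` on one biography gives one count per practice area, which is the raw input to everything described below.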

So, when I run this formula it might return (32 private equity keywords), (15 finance keywords), (22 regulatory keywords). Now, in a vacuum, what I would do with that information is turn those counts into percentages of the whole. This means that their practice is 46% private equity, 22% finance and 32% regulatory.
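The arithmetic behind those percentages is just each count divided by the total. A small Python sketch of it, using the example counts above (the function name is only for illustration):

```python
def to_percentages(counts):
    """Convert raw keyword counts into each area's percentage of the total."""
    total = sum(counts.values())
    if total == 0:
        # No keyword hits at all: report 0% everywhere instead of dividing by zero.
        return {area: 0.0 for area in counts}
    return {area: 100.0 * n / total for area, n in counts.items()}


counts = {"private equity": 32, "finance": 15, "regulatory": 22}
shares = to_percentages(counts)
for area, pct in shares.items():
    print(f"{area}: {pct:.0f}%")
# private equity: 46%
# finance: 22%
# regulatory: 32%
```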

The problem I’m running into is that the practice areas do not have an equal number of keywords. For instance, there might be a total of 22 potential “private equity” keywords, 16 potential “finance” keywords and only 8 potential “regulatory” keywords. On top of that, a biography might contain only 8 of the distinct “private equity” keywords, each appearing some number of times, for a total of 43 hits (as an example), while “finance” might match 14 distinct keywords, each appearing some number of times, for a total of 31 hits (as an example). So when I simply add up the keyword totals after running the formula, I don’t think the raw numbers give an accurate picture. I assume that I need to weight or normalize the numbers in some way.

What I had resigned myself to doing was to take the average number of keywords per practice area and weight each area’s total against it. Below are the actual numbers that I have used. These are all of the practice areas that I am dealing with right now, and the number immediately to the right of each is the total number of potential keywords that would be indicative of that respective practice area. The average that I got was 12.125. I then use simple algebra to weight the numbers: each weight is the average divided by that area’s number of keywords. So, with an average of 12.125 and 21 “private equity” keywords, I multiply the “private equity” total by 0.58 (12.125 / 21) to “normalize” it to the average. As you can see from the numbers below, there are also a couple of huge outliers (tax in particular, which has only one keyword) that I think are probably throwing the average way off.

I am not, at least at this moment, seeking to weight the keywords WITHIN their own category, for example by treating the term “private equity” as 2X as indicative of the practice “private equity” as the word “acquisitions.”

Practice area            | Number of keywords | Weighted against the average
Private equity           | 21                 | 0.58
Project development      | 20                 | 0.6
Capital markets          | 24                 | 0.5
Finance                  | 17                 | 0.7
Fund formation           | 6                  | 2
Intellectual property    | 8                  | 1.5
Labor and employment     | 8                  | 1.5
Structured finance       | 4                  | 3
Sports and entertainment | 14                 | 0.87
Venture capital          | 9                  | 1.35
Real estate              | 14                 | 0.87
Insurance                | 6                  | 2
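For anyone who wants to check the weights in the table, this is how they are derived, sketched in Python. The 12.125 average is taken as given from my spreadsheet rather than recomputed from the rows shown, and the names are illustrative only.

```python
# Number of potential keywords per practice area, copied from the table above.
KEYWORD_LIST_SIZES = {
    "Private equity": 21,
    "Project development": 20,
    "Capital markets": 24,
    "Finance": 17,
    "Fund formation": 6,
    "Intellectual property": 8,
    "Labor and employment": 8,
    "Structured finance": 4,
    "Sports and entertainment": 14,
    "Venture capital": 9,
    "Real estate": 14,
    "Insurance": 6,
}

# The average quoted in the post; treated as a given input, not recomputed here.
AVERAGE_LIST_SIZE = 12.125


def weights_from_average(list_sizes, average):
    """Weight for each area = average list size / that area's list size."""
    return {area: average / size for area, size in list_sizes.items()}


weights = weights_from_average(KEYWORD_LIST_SIZES, AVERAGE_LIST_SIZE)
for area, weight in weights.items():
    # Columns: practice area, number of keywords, weight to two decimals.
    print(f"{area:<26}{KEYWORD_LIST_SIZES[area]:>4}{weight:>7.2f}")
```

The table rounds some of these more coarsely (for example 0.6 rather than 0.61), but the values agree.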

Looking at these numbers, just to give an example, someone who has a very heavy tax practice might only rack up eight or nine instances of keywords in their biography. At the same time, someone who has a very heavy capital markets practice might rack up 50 total keywords in their biography. In this case, the eight “tax” keywords should be just as indicative of a tax practice as the 50 “capital markets” keywords are of a capital markets practice.

So, what I’m looking for is some sort of methodology to normalize these numbers so that they accurately reflect what each attorney’s practice comprises.
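For concreteness, here is my current approach end to end as a short Python sketch, using the example counts (32, 15, 22) and example keyword-list sizes (22, 16, 8) from earlier in this post. These are the illustrative figures, not my real data, and the function name is made up.

```python
def weighted_shares(counts, list_sizes):
    """Scale each area's count by (average list size / its list size),
    then express the scaled counts as percentages of their total."""
    average = sum(list_sizes.values()) / len(list_sizes)
    weighted = {
        area: counts[area] * average / list_sizes[area] for area in counts
    }
    total = sum(weighted.values())
    if total == 0:
        # No keyword hits at all: avoid dividing by zero.
        return {area: 0.0 for area in counts}
    return {area: 100.0 * value / total for area, value in weighted.items()}


counts = {"private equity": 32, "finance": 15, "regulatory": 22}
list_sizes = {"private equity": 22, "finance": 16, "regulatory": 8}

for area, pct in weighted_shares(counts, list_sizes).items():
    print(f"{area}: {pct:.1f}%")
# private equity: 28.3%
# finance: 18.2%
# regulatory: 53.5%
```

One thing I noticed while writing this out: once the weighted counts are turned into percentages, the average itself cancels, so the result is the same as dividing each count by the size of its keyword list. Whether that is the right normalization is exactly what I am unsure about.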

I realize that this is a bit abstract and while I've done my best to communicate what I’m doing and what I’m looking for, I may have missed the mark. If you have any questions at all, please let me know and I’ll answer them ASAP.

Thanks so much.