
I have the following question. Suppose I start with $N$ academic papers (to keep this simple, say I start with one paper). Say it has $C$ citations, and each of these $C$ cited papers has its own $D$ citations, and so on. Is there a way I could estimate when the number of citations would converge, i.e. when I would start running into citations I had already read? (So, basically, when one of the $D$ citations would be one of the $N$ papers I had already read.)

(The inspiration for this question is that I have to do a literature review in bioengineering, and I was wondering what depth I should go to when reading relevant papers cited by a relevant paper, in order to have done an in-depth analysis of my subfield.)

  • Thanks; I forgot this is really a graph theory problem, but I thought it was an interesting question to consider nonetheless. (2012-12-14)

1 Answer


I'll sketch an example of how we can mathematically model this citation network to estimate the number of nodes. It's going to be a fairly crude model, but it should give an idea as to how more sophisticated models could work. (The idea presented here is similar to the mark and recapture method used to estimate animal population sizes.)

We model the citation network as a directed network (no loops, no parallel edges).

Empirical data. We first collect the following statistics:

  • The average out-degree $d$, which we will assume is an integer (to simplify later calculations).
  • The probability $p$ that two distinct nodes $X$ and $Y$ are (a) not neighbours and (b) do not have a common out-neighbour. (A sketch of how both statistics could be estimated from sampled data follows this list.)
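As a concrete illustration, here is a minimal sketch (in Python) of how these two statistics might be computed from whatever portion of the network has been sampled. The adjacency-dictionary representation `out_neighbours` is a hypothetical choice for this sketch, not something the model prescribes:

```python
import itertools

def estimate_statistics(out_neighbours):
    """Estimate the average out-degree d and the probability p that two
    distinct nodes are (a) not neighbours and (b) have no common
    out-neighbour, from a sampled citation network.

    `out_neighbours` maps each paper to the set of papers it cites
    (a hypothetical representation of the sampled data).
    """
    nodes = list(out_neighbours)

    # Average number of citations per paper.
    d = sum(len(out_neighbours[v]) for v in nodes) / len(nodes)

    # Fraction of distinct pairs satisfying conditions (a) and (b).
    good = 0
    pairs = list(itertools.combinations(nodes, 2))
    for x, y in pairs:
        not_neighbours = (y not in out_neighbours[x]
                          and x not in out_neighbours[y])
        no_common = out_neighbours[x].isdisjoint(out_neighbours[y])
        good += not_neighbours and no_common
    p = good / len(pairs)

    return d, p
```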

Null model. The choice of null model will be critical for the accuracy of the result, but I'll assume a fairly basic model to illustrate the concept. We have a null model with the following parameters:

  • There is some fixed number of nodes in the network, $n$.
  • Every node in the network has exactly $d$ out-neighbours.
  • The $d$ out-neighbours of a vertex are distributed uniformly at random among the $n-1$ other nodes.

Then, $p$ is approximated by $\hat{p}_n:=\frac{\displaystyle\binom{n-2}{d}\binom{n-2-d}{d}}{\displaystyle\binom{n-1}{d}^2}.$ This is the probability that, in the null model, two distinct vertices $X'$ and $Y'$ are (a) not neighbours and (b) do not have a common out-neighbour: the first factor in the numerator counts the ways $X'$ can choose its $d$ out-neighbours while avoiding $Y'$, and the second counts the ways $Y'$ can then choose its $d$ out-neighbours while avoiding both $X'$ and the $d$ out-neighbours of $X'$.

Estimate for the number of nodes. Since we have estimates for $d$ and $p$, we can estimate $n$ from the above formula (pick the value of $n$ for which $\hat{p}_n$ is closest to $p$). This is the maximum likelihood estimate; a sketch of the search is below.
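To make the search concrete, here is a minimal sketch (in Python) that evaluates $\hat{p}_n$ from the closed form and scans for the best-fitting $n$. The observed values at the bottom are made-up placeholders, and $d$ is assumed rounded to an integer as above:

```python
from math import comb

def p_hat(n, d):
    """Closed-form null-model probability that two distinct nodes are
    (a) not neighbours and (b) without a common out-neighbour."""
    return comb(n - 2, d) * comb(n - 2 - d, d) / comb(n - 1, d) ** 2

def estimate_n(d, p, n_max=10**7):
    """Return the n minimising |p_hat(n, d) - p|.  Since p_hat(n, d)
    is increasing in n, a binary search locates the crossover."""
    lo, hi = 2 * d + 2, n_max  # smallest n for which p_hat is positive
    while lo < hi:
        mid = (lo + hi) // 2
        if p_hat(mid, d) < p:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first n with p_hat >= p; its predecessor may be closer.
    return min((lo - 1, lo), key=lambda n: abs(p_hat(n, d) - p))

# Hypothetical observed statistics (placeholders, not real data):
d_obs, p_obs = 20, 0.95
print(estimate_n(d_obs, p_obs))
```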

More sophisticated null models. More sophisticated null models would give better estimates, but would be mathematically harder to analyse.

  • Inspection of the data should suggest which properties of the underlying network are important to incorporate into the null model, and which can be ignored.
  • If the null model could feasibly be sampled on a computer, then $\hat{p}_n$ (and hence $n$) could be estimated computationally under that model, as in the sketch below.
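For instance, here is a minimal sketch (in Python) of a Monte Carlo estimate of $\hat{p}_n$ under the basic null model above; only `sample_null_model` would need to change for a more sophisticated null model:

```python
import random

def sample_null_model(n, d):
    """Draw one network from the basic null model: each of n nodes gets
    exactly d out-neighbours, chosen uniformly at random from the
    other n - 1 nodes."""
    return {v: set(random.sample([u for u in range(n) if u != v], d))
            for v in range(n)}

def monte_carlo_p_hat(n, d, trials=100):
    """Estimate p_hat(n, d): per trial, sample a network and a pair of
    distinct nodes, and record whether the pair is (a) not neighbours
    and (b) without a common out-neighbour."""
    hits = 0
    for _ in range(trials):
        g = sample_null_model(n, d)
        x, y = random.sample(range(n), 2)
        not_neighbours = y not in g[x] and x not in g[y]
        hits += not_neighbours and g[x].isdisjoint(g[y])
    return hits / trials

print(monte_carlo_p_hat(n=300, d=10))
```

Resampling the whole network on every trial keeps the trials independent; in practice one would reuse each sampled network for many node pairs to save time.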