The "Total accounts" figure in the top line is essentially the number of records returned by the query:
SELECT DATE_FORMAT(timecreate, '%Y-%m-%d') AS 'datereg',
DATE_FORMAT(NOW(), '%Y-%m-%d') AS 'nowdate',
UNIX_TIMESTAMP(timeupdate) AS 'timeupdate'
(where the results are also used to compute the "new accounts by day" and "users updated in last n
days" statistics — note that a record is returned for each user even if they never updated their journal). Unfortunately this doesn't tell me whether deleted and purged users are held in the "userusage" database. However, they must be in at least one database because if you try to view their userinfo then LiveJournal tells you they have been deleted and purged. Clearly, users who have been suspended or deleted but not purged must still be in the database, though they have to be filtered out of the results of any directory search. Incidentally, it must certainly include communities, and I suppose it includes syndicated feeds too.
Renames are a slightly different matter. LiveJournal has to keep the old username around because it either pretends the old name has been deleted or forwards you to the new username (depending on what the user chose when they renamed). I can't currently find any "befores and afters", but it looks like you keep your old userid number when you rename, so I guess that your old name has to be assigned a new number. Unless, that is, you've renamed to a name that was deleted and purged. I'm not sure what happens in that case, but it would make sense for the purged entry to be removed entirely from the database (to be replaced by the user who renamed) when that happens. So, the total accounts statistic probably doesn't count the accounts which were deleted, purged, and then replaced by someone else — but this is pure speculation on my part.
The maximum value of userid is stored in the stats database and can be read from the text dump as "size accounts". The above total accounts number is stored as "userinfo total". In the current text dump, we have:
size accounts 7433711
userinfo total 7421711
which means that there are 12,000 completely vanished accounts (it surely must be a coincidence that this works out to such a round figure). It is left as an exercise for the reader to speculate whether this could be accounted for by purged-and-renamed journals.
Now the gender information is retrieved on a cluster-by-cluster basis from the "userproplite2" database. I've no idea how the clustering actually works or what this database is (or indeed the exact meaning of the SQL query). However, what happens is that the data for each cluster is saved in the partialstats database, and when this is complete the records from partialstats are summed and placed in stats. The code claims to count every possible value of gender except for ''
, and according to the text dump it comes up with four possible values: blank (with only one matching account), 'F', 'M' and 'U'.
I don't know whether it's relevant, but the clustered code asks for
c.clusterid IS NULL OR c.clusterid=?
(where "?" is the cluster under consideration). If there are any records with "c.clusterid IS NULL" then it looks like they'll be counted several times — once for each cluster.
However, let me speculate on where the extra 1.5 million users (not counting those with blank genders) are coming from. The stats database is never cleared out (I'm assuming this because there are several statistics in the text dump which haven't changed for years and aren't mentioned in the program). Suppose, when they occasionally rename some or all of the clusters, they accidentally leave the stats for the old cluster names in the partialstats database. When the code comes to compute the sum of any clustered statistic, it will include all the out-of-date info for the clusters which no longer exist and thus produce an inflated figure. Of course, I have no idea whether the clusters are referred to by name in the database or by some other identifier which would render my theory invalid, and since I don't have access to the database there is no way to check whether I'm on the right track.