There's a common theme running through these, but I can't tell what it is - Many a mickle maks a muckle — LiveJournal


June 8th, 2005


12:33 pm - There's a common theme running through these, but I can't tell what it is
1. I'm not going to shut up about this for a while: if you enjoy logic puzzles, whether or not you think you're any good at them, register (register! register! register!) for the online qualifying test for the World Puzzle Championships, taking place at 1pm EDT on Saturday 18th June. If you're positively predisposed towards puzzles - if you're enjoying the Su Doku craze - then don't worry about the competition and you'll still enjoy the test.

Practice is going well. Yesterday I resat the 2000 qualifying test which I hadn't looked at for five years. I scored 120 and it would've been 145 but for a copying error. In 2000, I scored 55 on it, with the top British score being only 90 and the top US score being 220. Perhaps having seen the puzzles before helps a lot, but this is pretty good progress. British folk, 55 was enough to get me onto the British team in 2000; if you can score anything like 55 on the 2000 test - that's just four or five puzzles in 2½ hours - then you really are UK team calibre.

2. At least four people on my Friends list make, or have made, at least part of their living by teaching people how to do well on the SAT, GRE or similar college-entry tests. Now this shouldn't be surprising because y'all are damn smart, but would you like to get to know each other? Is there a community where you can hang out, share tips, find employment in the field and so on? (And why don't you use the WPC online qualifier as a test for logical thinking skills?)

3. Google are providing patronage to students who write software for the good of the world through their Summer of Code promotion - and, if you're so inclined, you can get paid to work on LiveJournal. I can understand people's concerns about Google's privacy policy, they've buggered up the interface for Google Groups and Froogle was a bit naff, but in my book Google do so much good for the world that I have lots of love and time for them. Plus they're sponsoring the online qualifying test and the US team for the World Puzzle Championships.

4. 284,376 LJ accounts, according to the stats, list that the poster is based in the state of Massachusetts (a state with lots of smart people and puzzle fans, who might enjoy the WPC qualifying test). The number of people in Massachusetts with LJ accounts will be lower than that, because people may well have more than one account, but it will be higher than you would expect based on that, because there will be people in MA with accounts who have not listed them as being in MA. Accordingly, let's guess at 200,000 and regard that guess as conservative.

The population of MA is something like 6.2 million. Accordingly at least 3% of people in MA have a LiveJournal, and it seems likely that at least, ooh, 6%-10% of people in MA know what LiveJournal is. These are tremendously high proportions - LJ is approaching being mainstream! (Based on this post to lj_research.)
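The back-of-envelope estimate in item 4 can be written out as a quick check. This is only a sketch: the 200,000 and 6.2 million figures are the rough guesses from the text, and the `penetration` helper is a made-up name for illustration.

```python
def penetration(accounts: int, population: int) -> float:
    """Return accounts as a percentage of population."""
    return 100.0 * accounts / population

listed_in_ma = 284_376        # accounts listing Massachusetts, per the stats page
conservative_guess = 200_000  # the deliberately low guess at real people
ma_population = 6_200_000     # "something like 6.2 million"

low = penetration(conservative_guess, ma_population)
high = penetration(listed_in_ma, ma_population)
print(f"conservative guess: {low:.1f}%")
print(f"if every listed account were one person: {high:.1f}%")
```

Even the conservative guess clears the 3% mark, which is all the argument needs.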

5. Talking of the stats, you might observe that there are 2.2 million LJ accounts registered male, 4.5 million registered female and 2.1 million registered unspecified, for a total of 8.8 million. However, there are a total of 7.35 million LJ accounts! What's the discrepancy due to? A vexing puzzle of the sort that you won't find on any online qualifying tests for the World Puzzle Championships.

I asked support and got an answer back quite quickly. It's probably impolite to quote, so you'll have to trust that I'm not misrepresenting the position when I say I was told that the million-and-a-half account discrepancy can be attributed to accounts that have been deleted and possibly purged in the past. Perhaps we can use this figure to estimate some sort of LiveJournal churn percentage. (It is not clear whether those deleted accounts are included in the 284,376 figure quoted above or not.)
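For anyone who wants to see the arithmetic behind the "million-and-a-half" figure, here is a rough sketch using the rounded numbers quoted in item 5. The ratio at the end is only a crude stand-in for churn, and it assumes support's explanation (deleted and purged accounts) is actually the right one.

```python
# Figures as quoted in the post, rounded as they appear there.
by_gender = {"male": 2_200_000, "female": 4_500_000, "unspecified": 2_100_000}
total_accounts = 7_350_000

gender_sum = sum(by_gender.values())
discrepancy = gender_sum - total_accounts

# Share of the gender-counted accounts that are unaccounted for in the total.
share = 100.0 * discrepancy / gender_sum
print(f"sum by gender: {gender_sum:,}")
print(f"discrepancy:   {discrepancy:,} ({share:.1f}% of the gender sum)")
```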

6. The BBC report research suggesting that the difficulty some women have in reaching orgasm may be genetic and hint that it's possible that there might be drug therapy some day which could help those who have found that even the most desired partner (if any) and the best technique are not sufficient. This is entirely cheering news and I hope that some appropriate drugs without hazardous side-effects can be discovered. One would expect that such a drug would be bigger news than Viagra.

However, it does illustrate a double standard in me and I'm worried about this. I am not embarrassed by adverts for Viagra, but should such a drug treatment eventually exist, I can't imagine that adverts for it wouldn't be horribly embarrassing. I don't think this is a double standard of mine along gender lines, it's more that the concept of "do something you used to be able to do" is less embarrassing than the concept of "do something you've never been able to do and you feel you're less of a person because of it" - a similar product for men who've never experienced orgasm would be just as embarrassing. I don't know why I feel this way; perhaps it's because it's closer to a purely hedonistic drug than we have legally yet reached. (ETA: I think I've worked this out. See comment.) Cough cough World Puzzle Championship qualifying test.
Current Mood: impressedpuzzling


Comments:


From:imc
Date:June 9th, 2005 07:19 am (UTC)

Stats

Yes, it's probably impolite to quote, but I don't think it's impolite to link (especially when LiveJournal provides a special syntax for doing so).

Although the general ethos of LiveJournal support is "don't guess — only reply if you know the answer", I think you should be aware that the answer you got is a best-guess (on the basis that this is usually the right answer for such numeric discrepancies) rather than an unequivocal answer. Stats are somewhat of a black art, and I suspect no one fully understands how it works except the person who wrote it and a couple of people who have read the source code. I notice, for example, that the raw data contains a "postsbyday" figure for certain selected dates between 1997-11-12 and 2003-03-11 and then stops. There's also a big section of "supportrank" results which I believe is equally out of date.

Anyway, I doubt that the 8.8 million figure represents the number of accounts ever created. I noticed a new user being created the other day (4th June) and his user number is 7326971. This agrees very well with the 7361446 being quoted in stats.bml for the total number of accounts today, especially given that the raw data tells us that nine to ten thousand accounts are being created daily. So the 7.36 million figure almost certainly tells us the total number of accounts ever created, not the total number of accounts still in existence. Unfortunately this still leaves us with the mystery of where the extra million and a half come from, which I think we will only solve by reading the source code.

Regarding the figures listed for each location, again I believe the numbers include deleted journals (or some other random error term). To pick a country at random, the raw data says that 953 users are from Barbados but the directory search results contain only 824 matches.
From:jiggery_pokery
Date:June 9th, 2005 02:00 pm (UTC)

Re: Stats

Aha! Didn't know about the links.

Are you sufficiently interested in the question to investigate the source code to unravel the mystery? :-)

I also wonder about non-standard accounts like communities and syndicated journals. They all get user numbers as well. Presumably they're counted as unspecified-gender accounts?

Very interesting answer, thanks!
From:imc
Date:June 15th, 2005 06:40 am (UTC)

Computing the statistics (1/2)

Unfortunately the stats-related source code isn't that helpful, as it doesn't tell us what's in the database. Probably reading the source code for all the database-related stuff would help (though I know practically nothing about databases in general), but when it comes down to it only the on-site staff can tell us what's actually in there.

Anyway, stats.bml is the code which fishes the stats out of the database and formats them into a web page. Although the stats page claims "certain parts are live", those happen to be the sections (latest updates and newest users) which are currently disabled, so in fact all the data on that page is pulled from a database called "stats" which is generated nightly.

stats.pl is a program (using utility functions from statslib.pl) containing the statistics-related maintenance jobs to be run nightly (or, in at least one case, weekly — though that one hasn't been working since late 2003 and I don't know if it's that it doesn't work any more or just that someone forgot to install the weekly cron job). These fill the stats database with computations from the real database, and then print out the current contents of the stats database into stats.txt. I note that a configuration variable was added on 25 April 2005 to allow some statistics to be considered private and not dumped to the text file, which might explain how it is that stats on account types are no longer available (though it seems they were never on the main stats page). The precise value of this configuration variable doesn't seem to be visible from outside.
From:imc
Date:June 15th, 2005 08:00 am (UTC)

Computing the statistics (2/2)

The "Total accounts" figure in the top line is essentially the number of records returned by the query:
SELECT DATE_FORMAT(timecreate, '%Y-%m-%d') AS 'datereg',   
       DATE_FORMAT(NOW(), '%Y-%m-%d') AS 'nowdate',
       UNIX_TIMESTAMP(timeupdate) AS 'timeupdate'
FROM userusage
(where the results are also used to compute the "new accounts by day" and "users updated in last n days" statistics — note that a record is returned for each user even if they never updated their journal). Unfortunately this doesn't tell me whether deleted and purged users are held in the "userusage" database. However, they must be in at least one database because if you try to view their userinfo then LiveJournal tells you they have been deleted and purged. Clearly, users who have been suspended or deleted but not purged must still be in the database, though they have to be filtered out of the results of any directory search. Incidentally, it must certainly include communities, and I suppose it includes syndicated feeds too.

Renames are a slightly different matter. LiveJournal has to keep the old username around because it either pretends the old name has been deleted or forwards you to the new username (depending on what the user chose when they renamed). I can't currently find any "befores and afters", but it looks like you keep your old userid number when you rename, so I guess that your old name has to be assigned a new number. Unless, that is, you've renamed to a name that was deleted and purged. I'm not sure what happens in that case, but it would make sense for the purged entry to be removed entirely from the database (to be replaced by the user who renamed) when that happens. So, the total accounts statistic probably doesn't count the accounts which were deleted, purged, and then replaced by someone else — but this is pure speculation on my part.

The maximum value of userid is stored in the stats database and can be read from the text dump as "size accounts". The above total accounts number is stored as "userinfo total". In the current text dump, we have:
size    accounts        7433711
userinfo        total   7421711
which means that there are 12,000 completely vanished accounts (it surely must be a coincidence that this works out to such a round figure). It is left as an exercise for the reader to speculate whether this could be accounted for by purged-and-renamed journals.
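The subtraction is trivial, but for completeness here it is with the two values from the text dump. The variable names are mine, not anything from the LiveJournal code.

```python
max_userid = 7_433_711      # "size accounts" in the text dump
userusage_rows = 7_421_711  # "userinfo total" in the text dump

vanished = max_userid - userusage_rows
fraction = 100.0 * vanished / max_userid
print(f"{vanished:,} userids have no matching record ({fraction:.2f}% of all userids)")
```

At well under one percent of all userids, this gap is far too small to explain the million-and-a-half discrepancy in the gender totals.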

Now the gender information is retrieved on a cluster-by-cluster basis from the "userproplite2" database. I've no idea how the clustering actually works or what this database is (or indeed the exact meaning of the SQL query). However, what happens is that the data for each cluster is saved in the partialstats database, and when this is complete the records from partialstats are summed and placed in stats. The code claims to count every possible value of gender except for '', and according to the text dump it comes up with four possible values: blank (with only one matching account), 'F', 'M' and 'U'.

I don't know whether it's relevant, but the clustered code asks for
  c.clusterid IS NULL OR c.clusterid=?
(where "?" is the cluster under consideration). If there are any records with "c.clusterid IS NULL" then it looks like they'll be counted several times — once for each cluster.
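To illustrate the worry, here is a small self-contained sketch using Python's built-in `sqlite3` module. The `users` table, its columns and the four sample rows are entirely hypothetical (I have no idea what the real schema looks like); the only thing borrowed from the LiveJournal code is the shape of the `WHERE` clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table standing in for whatever the real query reads.
conn.execute("CREATE TABLE users (userid INTEGER, clusterid INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, 1, "F"), (2, 1, "M"), (3, 2, "F"), (4, None, "F")],  # user 4 has no cluster
)

# Count users cluster by cluster, using the same condition as the stats code.
per_cluster_total = 0
for cluster in (1, 2):
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM users c WHERE c.clusterid IS NULL OR c.clusterid = ?",
        (cluster,),
    ).fetchone()
    per_cluster_total += count

(true_total,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
print(per_cluster_total, true_total)  # 5 4
```

The user with a NULL cluster matches the condition for both clusters, so the summed per-cluster count comes out one higher than the real number of rows. Whether any such records exist in the real data is another matter.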

However, let me speculate on where the extra 1.5 million users (not counting those with blank genders) are coming from. The stats database is never cleared out (I'm assuming this because there are several statistics in the text dump which haven't changed for years and aren't mentioned in the program). Suppose, when they occasionally rename some or all of the clusters, they accidentally leave the stats for the old cluster names in the partialstats database. When the code comes to compute the sum of any clustered statistic, it will include all the out-of-date info for the clusters which no longer exist and thus produce an inflated figure. Of course, I have no idea whether the clusters are referred to by name in the database or by some other identifier which would render my theory invalid, and since I don't have access to the database there is no way to check whether I'm on the right track.
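The stale-rows theory can be sketched in a few lines. Everything here is invented for illustration: the cluster names, the counts, and the idea that partialstats is keyed by cluster name are all assumptions, not facts about the real database.

```python
from collections import Counter

# Hypothetical partialstats rows: (cluster name, gender) -> count.
partialstats = {
    ("cluster_a", "F"): 300, ("cluster_a", "M"): 150,
    ("cluster_b", "F"): 200, ("cluster_b", "M"): 100,
    # Stale rows left behind for a cluster that has since been renamed:
    ("old_cluster", "F"): 90, ("old_cluster", "M"): 40,
}
live_clusters = {"cluster_a", "cluster_b"}

def summed(rows, only=None):
    """Sum counts per gender, optionally restricted to the given clusters."""
    totals = Counter()
    for (cluster, gender), count in rows.items():
        if only is None or cluster in only:
            totals[gender] += count
    return dict(totals)

naive = summed(partialstats)                         # what a blind sum would give
filtered = summed(partialstats, only=live_clusters)  # what it should give
print(naive)     # {'F': 590, 'M': 290}
print(filtered)  # {'F': 500, 'M': 250}
```

If something like this is going on, the inflated totals would persist until someone cleared the stale rows out by hand.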
From:imc
Date:June 15th, 2005 08:01 am (UTC)
…and that's the first time I've ever been told to go back and shorten my comment. :-)
