[RAS] mistake: data corruption in db

Ivan Kurmanov duraley at gmail.com
Sun Aug 21 04:12:40 CDT 2011


Here is a status update on the issue.

The update process in the update daemon ran until midnight server
time but did not finish. By then it had processed 750 of 1356
archives. It was then interrupted by the nightly script job, which I
had not disabled in crontab. The nightly script restarted the update
daemon, which caused the update process to stop. When I got up this
morning, I requested an update again, with parameters to (hopefully)
run quickly -- without thorough processing -- through the parts that
were processed yesterday.

My estimate is that the processing will take 20-28 hours from now to
finish. I'll make sure the nightly script does not interfere this
time.
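
One way to keep a nightly job from restarting the daemon in the middle
of an update is a lock file that the update holds while it runs. The
following is only a sketch of that idea, not the actual RAS scripts:
the lock path and the names try_lock, nightly_restart and
restart_update_daemon are hypothetical, and it assumes a Unix system
where fcntl.flock is available.

```python
import fcntl
import os
import tempfile

# Hypothetical lock path; the real scripts may use a different location.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "ras-update.lock")


def try_lock(path=LOCK_PATH):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def nightly_restart(restart_update_daemon, path=LOCK_PATH):
    """Restart the daemon only when no update currently holds the lock."""
    handle = try_lock(path)
    if handle is None:
        return False  # an update is running, leave the daemon alone
    try:
        restart_update_daemon()
        return True
    finally:
        handle.close()  # closing the file releases the lock
```

The long-running update would call try_lock() at startup and keep the
returned handle open until it finishes, and the nightly job would go
through nightly_restart() instead of restarting unconditionally.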

Then I'll do some checks and probably some more selective updates via
the update daemon, and then we will be ready to put the service back
online.

-ivan

On Sat, Aug 20, 2011 at 8:43 PM, Ivan Kurmanov <duraley at gmail.com> wrote:
> I started working on preparing (fixing) the Storable-serialized
> data of RAS for a proper (full) migration from nebka, and I was
> working with the live code and the live database. And I made a
> mistake. The mistake caused an important part of the data in the
> database -- the data column in the objects table -- to be
> overwritten with a value that was relevant to only one of these
> records. In other words, I put something that looks like proper
> document details into the description of a large number of other
> documents. I don't know how many records were affected, but I
> estimate at least several thousand.
>
> When I realized what was going on, I aborted the operation and
> killed the mysql thread that was doing the job.
>
> And before that I had also (via the same mistake) overwritten all
> institution details in the DB.
>
> This corruption means that wrong data would be shown to the users,
> specifically in research profile suggestions and in institutions
> search.
>
> With Thomas' help, I've taken RAS down and put the Service
> Temporarily Unavailable page online instead. At the same time I've
> disabled most of the RAS-related cronjobs in the aras account.
>
> And I've started a full update of RePEc in the update daemon, which
> should overwrite the corrupted data with correct data taken from the
> files. But this update may take days to complete. That's why I've
> disabled the cronjobs, to have as few concurrent jobs as possible.
> I don't have a better estimate now. I'm watching the update daemon
> log, but I don't expect it to finish soon anyway.
>
>
>
> -ivan
>
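
The quoted message does not say how the overwrite happened. One common
guard against an update touching more rows than intended is to run it
in a transaction and check the affected row count before committing.
The sketch below illustrates that guard only; it uses Python's sqlite3
in place of MySQL so that it is self-contained. The objects table and
its data column are named after the message, but the schema, the id
column and the helper update_one are assumptions.

```python
import sqlite3


def update_one(conn, object_id, new_data):
    """Update exactly one row of objects.data, or roll back and raise."""
    cur = conn.execute(
        "UPDATE objects SET data = ? WHERE id = ?", (new_data, object_id)
    )
    if cur.rowcount != 1:
        # Wrong number of rows touched: undo the change before anything is saved.
        conn.rollback()
        raise RuntimeError(f"expected 1 row, touched {cur.rowcount}")
    conn.commit()


# Tiny in-memory stand-in for the real database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id TEXT PRIMARY KEY, data TEXT)")
conn.executemany("INSERT INTO objects VALUES (?, ?)", [("a", "A"), ("b", "B")])
conn.commit()

update_one(conn, "a", "A2")  # changes only the row with id "a"
```

Working on a copy of the database rather than the live one would of
course be the stronger protection; the row-count check only limits the
damage of a single statement.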


