I recently created an importer that converts the flat-file web data to a format suitable for a modern relational database. The process is to import the web data into an empty table, clean up the blank and header rows, then compare it to an archive of all previously imported records to ensure it only gets processed once. After being imported into the main table the new load is imported into the archive table. The table did not have a real unique serial number primary key field, but it had what I thought was the next best thing, a timestamp field that resolved to seconds. I figured that this would serve the purpose nicely. This timestamp field was validated to be unique to ensure that the web data could only be imported into the archive table once.
It broke down in less than a month.
After a just couple of weeks of use the importer threw a 729 import error and stopped. Investigation revealed that the error was being thrown in the first import, into the empty table. How could that be?Error 729 is "Errors occurred during import; records could not be imported" and it usually occurs when field-level validation prevents records from being created, usually because of duplicates. So how could the unique timestamp field be preventing the errors from being created when the table is initially empty?
An examination of the raw csv file revealed that the timestamp field in that particular batch was NOT unique: somebody managed to hit the submit button twice in the same second!
Moral: if you need a unique field, use a field that is KNOWN to be unique and not one that that is only "probably" unique, even if that probability appears to be high.
The solution: well, there were really two problems and only one has a solution. The problem of the importer "not working" was solved by keeping the unique validation but accepting the error 729 and allowing the process to continue (after checking that the resulting found set was not empty). This will result in the occasional "duplicate" record in the web data not being imported into the database, but it's not a really problem because it is almost certain the duplicate record is the result of a double-submit and will be duplicate data anyway.
Or will it? Have I solved the problem or just replaced it with a sightly-less-likely one?
As it is, the timestamp is used as the unique identifier. Consider if (when) two different web users hit submit in the same second: the timestamp will be duplicated but the rest of the record's data will be different. The import validation will prevent the second timestamped record from being imported and consequently real data will be lost. To mitigate this risk, the unique identifier has been changed to a concatenation of the timestamp and the enquirer's name. This will reduce the likelihood of data loss to only those cases where where two people with exactly the same name submit data at exactly the same second.
Or will it? Have I solved the problem or just replaced it with a sightly-less-likely one?
As it is, the timestamp is used as the unique identifier. Consider if (when) two different web users hit submit in the same second: the timestamp will be duplicated but the rest of the record's data will be different. The import validation will prevent the second timestamped record from being imported and consequently real data will be lost. To mitigate this risk, the unique identifier has been changed to a concatenation of the timestamp and the enquirer's name. This will reduce the likelihood of data loss to only those cases where where two people with exactly the same name submit data at exactly the same second.
Which leads to the second problem, that of uniquely identifying the web records to prevent multiple imports. Alas, without a real unique primary key in the web database there is no bullet-proof solution. The concatenated timestamp field is an improvement but it's not infallible.
No comments:
Post a Comment