I hit an issue whereby I was trying / failing to upload data that contained duplicates: -
index,name
1,Dave
2,Bob
3,Barney
4,Homer
5,Bart
1,Dave
3,Barney
1,Dave
2,Bob
3,Barney
4,Homer
5,Bart
1,Dave
3,Barney
Note that this is an example file; the real data had 000s of rows :-(
cat duplo.csv | couchimport --transform transform_duplo.js
couchimport ****************** +0ms
couchimport configuration +2ms
couchimport {
couchimport "COUCH_URL": "https://****:****@7fb7794a-fd6f-47a5-a770-a3521ef51df0-bluemix.cloudant.com",
couchimport "COUCH_DATABASE": "duplo",
couchimport "COUCH_DELIMITER": ",",
couchimport "COUCH_FILETYPE": "text",
couchimport "COUCH_BUFFER_SIZE": 500,
couchimport "COUCH_JSON_PATH": null,
couchimport "COUCH_META": null,
couchimport "COUCH_PARALLELISM": 1,
couchimport "COUCH_PREVIEW": false,
couchimport "COUCH_IGNORE_FIELDS": []
couchimport } +2ms
couchimport ****************** +0ms
couchimport { id: '1',
couchimport error: 'conflict',
couchimport reason: 'Document update conflict.' } +561ms
couchimport { id: '3',
couchimport error: 'conflict',
couchimport reason: 'Document update conflict.' } +2ms
couchimport Written ok:5 - failed: 2 - (5) +0ms
couchimport { documents: 5, failed: 2, total: 5, totalfailed: 2 } +0ms
couchimport writecomplete { total: 5, totalfailed: 2 } +81ms
couchimport Import complete +0ms
I wanted to strip out the duplicates ( of which there were MANY )
Thankfully the internet showed me how: -
So I did this: -
awk -F, '!seen[$1]++' duplo.csv > duplo_DEDUP.csv
which gave me this: -
index,name
1,Dave
2,Bob
3,Barney
4,Homer
5,Bart
cat duplo_DEDUP.csv | couchimport --transform transform_duplo.js
couchimport ****************** +0ms
couchimport configuration +2ms
couchimport {
couchimport "COUCH_URL": "https://****:****@7fb7794a-fd6f-47a5-a770-a3521ef51df0-bluemix.cloudant.com",
couchimport "COUCH_DATABASE": "duplo",
couchimport "COUCH_DELIMITER": ",",
couchimport "COUCH_FILETYPE": "text",
couchimport "COUCH_BUFFER_SIZE": 500,
couchimport "COUCH_JSON_PATH": null,
couchimport "COUCH_META": null,
couchimport "COUCH_PARALLELISM": 1,
couchimport "COUCH_PREVIEW": false,
couchimport "COUCH_IGNORE_FIELDS": []
couchimport } +1ms
couchimport ****************** +1ms
couchimport Written ok:5 - failed: 0 - (5) +489ms
couchimport { documents: 5, failed: 0, total: 5, totalfailed: 0 } +1ms
couchimport writecomplete { total: 5, totalfailed: 0 } +40ms
couchimport Import complete +1ms
couchimport configuration +2ms
couchimport {
couchimport "COUCH_URL": "https://****:****@7fb7794a-fd6f-47a5-a770-a3521ef51df0-bluemix.cloudant.com",
couchimport "COUCH_DATABASE": "duplo",
couchimport "COUCH_DELIMITER": ",",
couchimport "COUCH_FILETYPE": "text",
couchimport "COUCH_BUFFER_SIZE": 500,
couchimport "COUCH_JSON_PATH": null,
couchimport "COUCH_META": null,
couchimport "COUCH_PARALLELISM": 1,
couchimport "COUCH_PREVIEW": false,
couchimport "COUCH_IGNORE_FIELDS": []
couchimport } +1ms
couchimport ****************** +1ms
couchimport Written ok:5 - failed: 0 - (5) +489ms
couchimport { documents: 5, failed: 0, total: 5, totalfailed: 0 } +1ms
couchimport writecomplete { total: 5, totalfailed: 0 } +40ms
couchimport Import complete +1ms
which is nice :-)
Yay for awk !
No comments:
Post a Comment