Friday, 15 June 2018

Munging data - removing duplicates from CSV files

Whilst fiddling with Cloudant yesterday: -


I hit an issue whereby I was trying / failing to upload data that contained duplicates: -

index,name
1,Dave
2,Bob
3,Barney
4,Homer
5,Bart
1,Dave
3,Barney


Note that this is an example file; the real data had 000s of rows :-(

cat duplo.csv | couchimport --transform transform_duplo.js 

  couchimport ****************** +0ms
  couchimport configuration +2ms
  couchimport {
  couchimport  "COUCH_URL": "https://****:****@7fb7794a-fd6f-47a5-a770-a3521ef51df0-bluemix.cloudant.com",
  couchimport  "COUCH_DATABASE": "duplo",
  couchimport  "COUCH_DELIMITER": ",",
  couchimport  "COUCH_FILETYPE": "text",
  couchimport  "COUCH_BUFFER_SIZE": 500,
  couchimport  "COUCH_JSON_PATH": null,
  couchimport  "COUCH_META": null,
  couchimport  "COUCH_PARALLELISM": 1,
  couchimport  "COUCH_PREVIEW": false,
  couchimport  "COUCH_IGNORE_FIELDS": []
  couchimport } +2ms
  couchimport ****************** +0ms
  couchimport { id: '1',
  couchimport   error: 'conflict',
  couchimport   reason: 'Document update conflict.' } +561ms
  couchimport { id: '3',
  couchimport   error: 'conflict',
  couchimport   reason: 'Document update conflict.' } +2ms

  couchimport Written ok:5 - failed: 2 -  (5) +0ms
  couchimport { documents: 5, failed: 2, total: 5, totalfailed: 2 } +0ms
  couchimport writecomplete { total: 5, totalfailed: 2 } +81ms
  couchimport Import complete +0ms


I wanted to strip out the duplicates ( of which there were MANY )

Thankfully the internet showed me how: -



So I did this: -

awk -F, '!seen[$1]++' duplo.csv > duplo_DEDUP.csv

which gave me this: -

index,name
1,Dave
2,Bob
3,Barney
4,Homer
5,Bart


cat duplo_DEDUP.csv | couchimport --transform transform_duplo.js 

  couchimport ****************** +0ms
  couchimport configuration +2ms
  couchimport {
  couchimport  "COUCH_URL": "https://****:****@7fb7794a-fd6f-47a5-a770-a3521ef51df0-bluemix.cloudant.com",
  couchimport  "COUCH_DATABASE": "duplo",
  couchimport  "COUCH_DELIMITER": ",",
  couchimport  "COUCH_FILETYPE": "text",
  couchimport  "COUCH_BUFFER_SIZE": 500,
  couchimport  "COUCH_JSON_PATH": null,
  couchimport  "COUCH_META": null,
  couchimport  "COUCH_PARALLELISM": 1,
  couchimport  "COUCH_PREVIEW": false,
  couchimport  "COUCH_IGNORE_FIELDS": []
  couchimport } +1ms
  couchimport ****************** +1ms
  couchimport Written ok:5 - failed: 0 -  (5) +489ms
  couchimport { documents: 5, failed: 0, total: 5, totalfailed: 0 } +1ms
  couchimport writecomplete { total: 5, totalfailed: 0 } +40ms
  couchimport Import complete +1ms

which is nice :-)

Yay for awk !

No comments:

Note to self - use kubectl to query images in a pod or deployment

In both cases, we use JSON ... For a deployment, we can do this: - kubectl get deployment foobar --namespace snafu --output jsonpath="{...