Mdr extractor #65

tpeng · 2014-08-27T12:57:48Z

Add MdrExtractor to parse listing data. The output is a separate field named after the group name set in the annotation (via listingDataGroupName); its value is a list of dicts, one extracted from each matched record.
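To make the output shape described above concrete, here is a minimal sketch in plain Python. It does not call MdrExtractor or the mdr package; the helper `build_listing_output`, the group name `products`, and the field names `name` and `price` are all hypothetical and chosen only for illustration.

```python
def build_listing_output(group_name, records):
    """Wrap the per-record dicts under the group name taken from the annotation.

    group_name: hypothetical stand-in for the value of listingDataGroupName.
    records: one dict of extracted fields per matched record.
    """
    # Copy each record so the caller's dicts are not shared with the output.
    return {group_name: [dict(record) for record in records]}


# Hypothetical data: two matched records with two annotated fields each.
records = [
    {"name": "Widget A", "price": "9.99"},
    {"name": "Widget B", "price": "19.99"},
]
output = build_listing_output("products", records)
```

The result is a single key (the group name) whose value is a list with one dict per matched record, which is the structure the description refers to as "a list of dict".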

The MDR extractor is based on https://pypi.python.org/pypi/mdr/, which can detect listing data automatically and extract it under the supervision of scrapely annotations.

Sometimes the data extracted from a record is empty, which makes validation fail. We still want to add such records to the extracted listing data, to indicate that some data is missing on the page. Also fix a problem that occurred when the annotation was added to a record other than the seed record; the fix is to propagate the annotations to the aligned elements.
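The two behaviours above (propagating annotations across aligned elements, and keeping empty records) can be sketched roughly as follows. This is not the code from this pull request and it does not use the mdr or scrapely APIs: the alignment is modelled as equal-length lists of hypothetical element ids (with `None` where a record has no aligned element), and `propagate_annotations` and `extract_records` are illustrative names.

```python
def propagate_annotations(aligned_records, annotations):
    """Map each aligned column to a field name.

    aligned_records: list of records; each record is a list of element ids
        (or None), and position i in every record refers to aligned elements.
    annotations: dict of element id -> field name, possibly set on a record
        other than the seed (first) record.
    """
    column_fields = {}
    for record in aligned_records:
        for col, element in enumerate(record):
            if element is not None and element in annotations:
                # The first annotation found for a column applies to the
                # whole column, whichever record it was placed on.
                column_fields.setdefault(col, annotations[element])
    return column_fields


def extract_records(aligned_records, column_fields, texts):
    """Build one dict per record; empty dicts are kept on purpose."""
    items = []
    for record in aligned_records:
        item = {}
        for col, field in column_fields.items():
            element = record[col]
            if element is not None and texts.get(element):
                item[field] = texts[element]
        items.append(item)  # an empty dict signals missing data on the page
    return items


# Hypothetical alignment: four records, two aligned columns.
aligned = [["a1", "a2"], ["b1", "b2"], ["c1", None], [None, None]]
# "name" was annotated on the second record rather than on the seed record.
annotations = {"b1": "name", "a2": "price"}
texts = {"a1": "Widget A", "a2": "9.99", "b1": "Widget B",
         "b2": "19.99", "c1": "Widget C"}

column_fields = propagate_annotations(aligned, annotations)
items = extract_records(aligned, column_fields, texts)
```

In this sketch the last record yields an empty dict rather than being dropped, so the length of the result still matches the number of records detected on the page.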

… list of dict

also fixed a typo for the group name saved in annotation

tpeng added 4 commits August 20, 2014 15:32

add MDR extractor

d0dcba3

The MDR extractor is based on https://pypi.python.org/pypi/mdr/, which can detect listing data automatically and extract it under the supervision of scrapely annotations.

fix MdrExtractor extraction when the annotated elements have listing data

87c1677

change the MDR output to a dict with key as group name and value as a…

1c80d2b

… list of dict

tpeng force-pushed the mdr-extractor branch from 46b96ee to 1c80d2b Compare September 11, 2014 09:49

fix MDRExtractor when some elements are not aligned in the seed record

11f126b

also fixed a typo for the group name saved in annotation
