Inventors
Daniel Kifer, Srujana Merugu, Ankur Jain, Sathiya Keerthi Selvaraj, Alok S Kirpal, Philip L Bohannon, Raghu Ramakrishnan
Publication date
2010/9/23
Patent office
US
Application number
12408450
Description
Disclosed are methods and apparatus for extracting (or annotating) structured information from web content. Web content of interest from a particular domain is represented as one or more tree instances having a plurality of branching nodes that each correspond to a web object such that the tree instances correspond to one or more structured data instances. The particular domain is associated with domain knowledge that includes one or more presentation rulesets that each specifies a particular structure for a set of data instances, a domain specific concept labeler, one or more specified properties of the web objects in the tree instances, and a concept schema that specifies a representation of the data to be extracted from the web content. A structured data instance that conforms to the concept schema is extracted from the one or more tree instances based on the domain knowledge for the particular domain …
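The pipeline described above (tree instances of web objects, a domain-specific concept labeler, and a concept schema) can be illustrated with a minimal sketch. This is not the patented method: the `Node` class, the `label_restaurant_node` labeler, the `extract` function, and the restaurant example domain are all hypothetical names invented for illustration, and presentation rulesets and web-object properties are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One web object in a tree instance (hypothetical representation)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# Hypothetical type for a domain-specific concept labeler:
# it maps a web object to a concept name, or None if no concept applies.
Labeler = Callable[[Node], Optional[str]]


def label_restaurant_node(node: Node) -> Optional[str]:
    """Toy concept labeler for a made-up restaurant domain."""
    text = node.text.strip()
    if not text:
        return None
    if node.tag == "h1":
        return "name"
    if text.replace("-", "").replace(" ", "").isdigit():
        return "phone"
    return None


def extract(root: Node, labeler: Labeler, schema: List[str]) -> Dict[str, Optional[str]]:
    """Walk the tree and fill each schema field with the first matching node text."""
    record: Dict[str, Optional[str]] = {concept: None for concept in schema}
    stack = [root]
    while stack:
        node = stack.pop()
        concept = labeler(node)
        if concept in record and record[concept] is None:
            record[concept] = node.text.strip()
        # Push children reversed so they are visited in document order.
        stack.extend(reversed(node.children))
    return record


# Assumed example page, represented as a tree instance.
page = Node("body", children=[
    Node("h1", "Luigi's Trattoria"),
    Node("div", children=[Node("span", "555-0100"), Node("span", "Open daily")]),
])
record = extract(page, label_restaurant_node, ["name", "phone"])
```

Here the concept schema is reduced to a list of field names, so the extracted record conforms to it by construction; a fuller treatment would also apply the presentation rulesets to decide which nodes belong to the same data instance.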
Total citations
[Chart: citations per year, 2009 to 2024]
Scholar articles
D Kifer, S Merugu, A Jain, SK Selvaraj, AS Kirpal… - US Patent App. 12/408,450, 2010